:: Re: [DNG] defective RAID
Author: Hendrik Boom
Date:  
To: dng@lists.dyne.org
Subject: Re: [DNG] defective RAID
On Sun, Mar 26, 2017 at 07:03:53PM +0200, Didier Kryn wrote:
> On 25/03/2017 20:17, Hendrik Boom wrote:
> >I have two twinned RAIDs which are working just fine although the
> >second drive for both RAIDs is missing. After all, that's what it is
> >supposed to do -- work when things are broken.
> >
> >The RAIDs are mdadm-style Linux software RAIDs. One contains a /boot
> >partition; the other an LVM partition that contains all the other
> >partitions in the system, including the root partition, /usr, /home,
> >and the like.
> >
> >Both drives should contain the mdadm signature information, and the
> >same consistent file systems.
> >
> >Each RAID is spread, in duplicate, across the same two hard drives.
> >
> >
> >EXCEPT, of course, that one of the drives is now missing. It was
> >physically disconnected by accident while the machine was off, and
> >owing to circumstances, has remained disconnected for a significant
> >amount of time.
> >
> >This means that the missing drive has everything needed to boot the
> >system, with valid mdadm signatures, and valid file systems, except,
> >of course, that its file system is obsolete.
> >
> >If I were to manage to reconnect the absent drive, how would the
> >boot-time RAID assembly work? (/boot is on the RAID). Would it be
> >able to figure out which of the two drives is up-to-date, and
> >therefore which one to consider defective and not use?
> >
> >Do I need to wipe the missing drive completely before I connect it?
> >(I have another machine to do this on, using a USB-to-SATA interface).
> >
> >
>     Hi Hendrik.

>
>     I have had to replace hard drives in arrays of various RAID levels
> several times, and I'm very confident in mdadm. mdadm knows the partitions
> by their UUIDs, not by their device names.

>
>     As far as I can guess, it records the status of the array on every disk,
> and I also imagine the broken partitions are marked as broken, which allows
> it to look for the raid status on a good partition; this would explain why
> its knowledge of the status is safe.

>
>     This assumes the partitions are of type Linux-raid-autodetect, because
> if you have built a RAID1 with "normal" partitions (with some filesystem on
> them), then I don't know how it can recover.
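
If I'm reading the mdadm documentation right, each member carries an
event counter and an update time in its superblock, so before letting
anything auto-assemble I can compare the two copies by hand -- something
like this (device names here are just placeholders for my two drives):

    # Show the RAID superblock on each member partition; the one with the
    # higher "Events" count and the later "Update Time" is the current copy.
    mdadm --examine /dev/sda2
    mdadm --examine /dev/sdb2

    # And see what the kernel has actually assembled so far.
    cat /proc/mdstat

That should tell me in advance which drive mdadm will treat as stale.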


I can boot either with grub or with lilo. (Lilo boot is from a floppy disk).

lilo doesn't do RAID assembly, as far as I know. It just starts up
with a bunch of blocks at a fixed offset from an identified partition,
identified by UUID, and there are two of these. Now I guess that at
boot time it may always pick the right one, always pick the wrong one,
or pick one at random.
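
If the floppy's boot sector turns out to point at stale blocks, my
understanding is that simply rerunning lilo regenerates the map, so
once the arrays are healthy again I can refresh it (assuming my
lilo.conf still points at the floppy):

    # Show which kernel images the current boot map points at.
    lilo -q

    # Rewrite the boot sector and map, verbosely, after things settle down.
    lilo -v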

Now if it picks the wrong one, it's likely not to be able to boot at
all, since it's highly likely that the kernel there is different from
the one on the right one. Or, if the list of disk addresses is also on
the RAID, it will likely succeed in finding an obsolete kernel.

An obsolete kernel will still know how to do RAID assembly, though,
which, as you say, will likely work flawlessly and identify the
partitions on the obsolete drive as the defective ones. But working
with a kernel that's only a few months old is not likely to cause
trouble.
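
Once it has booted from the good drive, my understanding of the usual
routine is: check the array state, then put the stale member back in
and let it resync. Roughly (device names are guesses again):

    # Confirm which members are active and which are missing or faulty.
    mdadm --detail /dev/md0
    mdadm --detail /dev/md1

    # Re-add the reconnected partition so the array rebuilds onto it
    # (or use --add if --re-add refuses because the metadata is too old).
    mdadm /dev/md0 --re-add /dev/sdb1

    # Watch the rebuild progress.
    cat /proc/mdstat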

If it always boots from the wrong disk and fails, I can try swapping
the drives. If that fails, I can pull the obsolete drive and erase it
on another machine.
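
For the erase-it-elsewhere route, I believe it's enough to blow away
the RAID superblocks and other signatures rather than zero the whole
disk -- say the drive shows up as /dev/sdX on the USB-to-SATA adapter:

    # Remove the mdadm superblock from each former member partition...
    mdadm --zero-superblock /dev/sdX1
    mdadm --zero-superblock /dev/sdX2

    # ...or wipe all known RAID/filesystem signatures from the whole drive.
    wipefs -a /dev/sdX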

Thanks. There's light at the end of the tunnel and now I know it's
not the headlamp of an oncoming train.

-- hendrik

P.S. Of course I will take a full backup before I do any of this
stuff.