:: Re: [DNG] defective RAID
Author: Simon Hobson
Date:  
To: dng@lists.dyne.org
Subject: Re: [DNG] defective RAID
Hendrik Boom <hendrik@???> wrote:

> I have two twinned RAIDs which are working just fine although the
> second drive for both RAIDs is missing. After all, that's what it is
> supposed to do -- work when things are broken.
>
> The RAIDs are mdadm-style Linux software RAIDs. One contains a /boot
> partition; the other an LVM partition that contains all the other
> partitions in the system, including the root partition, /usr, /home,
> and the like.
>
> Both drives should contain the mdadm signature information, and the
> same consistent file systems.
>
> Each RAID is spread, in duplicate, across the same two hard drives.
>
>
> EXCEPT, of course, that one of the drives is now missing. It was
> physically disconnected by accident while the machine was off, and
> owing to circumstances, has remained disconnected for a significant
> amount of time.
>
> This means that the missing drive has everything needed to boot the
> system, with valid mdadm signatures, and valid file systems, except,
> of course, that its file system is obsolete.
>
> If I were to manage to reconnect the absent drive, how would the
> boot-time RAID assembly work? (/boot is on the RAID). Would it be
> able to figure out which of the two drives is up-to-date, and
> therefore which one to consider defective and not use?


OK, been there, got the tee shirt :-)

If you boot the system with the second drive connected then (I think) you'll find yourself with two sets of RAID volumes. The risk is that, depending on how the system is set up, it's "a bit arbitrary" which one gets mounted. Ideally you want to boot the system and then connect the second drive.
At this point, my memory gets a bit vague - lots of googling while "slightly stressed" (production system down).

IIRC you can't just add the partitions back into the arrays - it'll complain that the update counters are different. There's a counter which gets updated when the array is written to, so when an array member is absent the counters get out of sync. This is used to detect the problem and to avoid assembling an array from inconsistent members.
So I think you need to use the "delete metadata" option to mdadm (--zero-superblock) to "clear" the partitions on the stale drive. Then you add them back in, and they'll be rebuilt.
You may have to explicitly remove a device before you can re-add it.
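
To make that sequence concrete, here is a minimal sketch as a shell function that only prints the commands rather than running them. The function name readd_plan and the device names are made up for illustration. --remove, --zero-superblock and --add are real mdadm options, but whether the remove step is needed at all depends on the state of your array, so treat this as a checklist rather than a recipe.

```shell
#!/bin/sh
# readd_plan ARRAY PARTITION
# Hypothetical helper: prints (does NOT execute) the commands for putting a
# stale partition back into a running, degraded array.
readd_plan() {
    array=$1   # placeholder, e.g. /dev/md1
    part=$2    # placeholder, e.g. /dev/sdb2 - the STALE partition
    # Only needed if the array still lists the partition as a member.
    echo "mdadm $array --remove $part"
    # Wipe the old md metadata, on the stale partition only.
    echo "mdadm --zero-superblock $part"
    # Add it back; it then gets rebuilt from the good drive.
    echo "mdadm $array --add $part"
}

readd_plan /dev/md1 /dev/sdb2
```

Printing first and running by hand afterwards gives you a chance to check that the partition named really is the one on the stale drive before anything gets wiped.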

Your /boot may be OK. It's typically not written to so it can just be assembled - the others will need to be rebuilt.

Just checking man mdadm, and adding a bit of vague memory recall ...
mdadm --detail /dev/mdnnn will tell you ... well details ... about an array, specifically which devices it has and hasn't got
mdadm --examine /dev/sdxn will tell you details about an array member, including this update counter, which is labelled "Events"
mdadm /dev/mdnnn --add /dev/sdxn will add a drive. It will automatically go into rebuild mode which will be shown in /proc/mdstat
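
To compare the "Events" counters of two members without reading through two screens of output, something like the following might do. It's only a sketch: events_of is a made-up helper name, and it assumes the counter appears on a line of the form "Events : <number>", which as far as I know is how mdadm --examine prints it. Check against your own output first.

```shell
#!/bin/sh
# events_of: read `mdadm --examine` output on stdin and print the value of
# the "Events" line. Hypothetical helper, not part of mdadm.
events_of() {
    awk -F' : ' '/^ *Events : / { print $2; exit }'
}

# Demo on canned text, so nothing here touches a real disk.
sample='          State : clean
         Events : 4711'
printf '%s\n' "$sample" | events_of   # prints 4711
```

On the real system that would be something like mdadm --examine /dev/sdxn | events_of for each member. The member with the lower number should be the stale one.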