
Long story short, for my first thread here: I have a software RAID5 array set up as follows: four disks, each with a Linux RAID partition. Those partitions are /dev/sda1, /dev/sdb1, /dev/sdd1 and /dev/sde1.

/dev/md0 is the RAID5 device, with an encrypted LVM on top of it. I use cryptsetup to open the device, then vgscan and lvscan -a to map my volumes.
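
For context, opening the stack looks roughly like this (the mapper name md0_crypt is illustrative, not the exact name I use):

# cryptsetup open /dev/md0 md0_crypt
# vgscan
# lvscan -a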

Yesterday, I found out that /dev/sdd1 was failing. Here are the steps I followed:

0. Remove the failing disk

# mdadm --remove /dev/md0 /dev/sdd1
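
For reference, mdadm normally refuses to --remove a member that is still active; the usual sequence, with the same device names, is something like:

# mdadm /dev/md0 --fail /dev/sdd1
# mdadm /dev/md0 --remove /dev/sdd1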

1. Perform a check on the faulty drive

# mdadm --examine /dev/sdd1

I got the "could not read metadata" error.

2. Try to read the partition table

I used parted and discovered that my Linux RAID partition was gone. When I tried to re-create it (hoping to be able to re-add the drive), I got a "your device is not writable" error.
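
What I ran with parted was roughly the following (reconstructed from memory; the exact start and end values hardly matter, since the device refused writes anyway):

# parted /dev/sdd print
# parted /dev/sdd mkpart primary 0% 100%
# parted /dev/sdd set 1 raid on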

So it was clear: that hard drive was dead.

3. Extract the hard drive from my case (bad things follow)

I tried to extract /dev/sdd1 from my case, not knowing which of the four drives it was. I unplugged one SATA cable, only to find out that I had just unplugged /dev/sde1; I plugged it back in and unplugged the next one. Nice catch! It was /dev/sdd1.

4. What have I done?! (sad face)

Using:

# mdadm --detail /dev/md0

I realized that /dev/sde1 had left the array and was marked as "removed". I tried to re-add it, not with --re-add, but with:

# mdadm --add /dev/md0 /dev/sde1

/proc/mdstat showed the rebuild in progress, and mdadm --detail /dev/md0 displayed /dev/sde1 as "spare"; I know I might have done something terrible here.

I tried to remove /dev/sde1 from the array and use --re-add, but mdadm told me it couldn't do that and advised me to stop and reassemble the array.
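
For completeness, the attempted sequence was roughly:

# mdadm /dev/md0 --remove /dev/sde1
# mdadm /dev/md0 --re-add /dev/sde1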

5. Where to go from here?

First things first: I am waiting for a new hard drive to replace the faulty one. Once I have it and have set it up with a new Linux RAID partition (it will be known as /dev/sdd1), I will have to stop the array (the LVM volumes are no longer mounted and cryptsetup has closed the encrypted device, yet mdadm has not been able to stop the array so far). I was thinking about rebooting the entire system and working from a clean start. Here is what I figured I should do:

# mdadm --stop /dev/md0
# mdadm --examine /dev/sd*1
# mdadm --assemble --scan --run --verbose

I read that without the --run option, mdadm will refuse to start the degraded array.
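
Before assembling, I also plan to compare the event counters of the members, since a member that has fallen too far behind will not be accepted without --force; a sketch of what I would look at (the exact wording of the --examine output may vary with the mdadm version):

# mdadm --examine /dev/sd[abde]1 | grep -E '/dev/sd|Events'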

Best-case scenario: /dev/sde1 is recognized during re-assembly and the new /dev/sdd1 is used to replace the previously faulty one. I will not have lost any data and will be happy.

Worst-case (and most likely) scenario: re-assembling the array fails to recover /dev/sde1 and I have to start over with a blank new array.

Am I missing something here? What should I review from this procedure?

Best Regards from France

  • 1. Stick to read-only / copy-on-write overlays: raid.wiki.kernel.org/index.php/… 2. Re-create using said overlays: unix.stackexchange.com/a/131927/30851 (a sketch of such an overlay is shown after these comments). If it wasn't marked spare, --stop followed by --assemble --force would have sorted it, insofar as that is possible after yanking the wrong drive. Commented May 14, 2018 at 15:44
  • Greetings @frostschutz; sadly, after the --add on /dev/sde1, the drive was marked as spare by mdadm --detail /dev/md0, so I am guessing from your reply that I screwed up my data. Commented May 14, 2018 at 16:33
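
For anyone following the overlay suggestion from the first comment, here is a minimal sketch of a copy-on-write overlay over one member, so experiments never write to the real disk (the file size, path and overlay name are illustrative, and /dev/loop0 stands for whatever device losetup prints):

# truncate -s 50G /tmp/sde1-overlay.img
# losetup --find --show /tmp/sde1-overlay.img
# dmsetup create sde1_cow --table "0 $(blockdev --getsz /dev/sde1) snapshot /dev/sde1 /dev/loop0 P 8"

The array is then assembled or re-created from /dev/mapper/sde1_cow (and matching overlays for the other members) instead of the raw partitions, as described on the linked wiki page.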

1 Answer


So, I managed to get a full recovery, thanks to this link.

What I did was as follows:

  1. I replaced the faulty disk and restarted the server.
  2. Then, I partitioned the new disk with a Linux RAID partition type.

    # mdadm --examine /dev/sda1 /dev/sdb1 /dev/sdd1 /dev/sde1
    

Then, based on the link above, I re-created the array using the information given by the --examine command.

# mdadm --create /dev/md0 --level=5 --raid-devices=4 --chunk=512 --name=server:0 /dev/sda1 /dev/sdb1 missing /dev/sde1 --assume-clean

As stated in that link, --assume-clean did the trick! It avoided the "spare" state for /dev/sde1 and used it as an active part of the new array.

The key thing when re-creating the array from "existing" devices is not to mess up the chunk parameter, otherwise you will lose the data.
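
The chunk size, level, layout and device order can be read back from any surviving member before running --create, for instance (an illustrative grep; the field names may differ slightly between metadata versions):

# mdadm --examine /dev/sda1 | grep -E 'Raid Level|Raid Devices|Chunk Size|Layout|Device Role'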

  3. I then added the new device to this new array:

    # mdadm --add /dev/md0 /dev/sdd1
    

The server started rebuilding (it took 6 hours for 10 TB), and after that I forced an integrity check on the whole array (which took 6 hours as well).
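
For reference, such a check can be triggered through sysfs (md0 as above; progress shows up in /proc/mdstat and any inconsistencies are counted in mismatch_cnt):

# echo check > /sys/block/md0/md/sync_action
# cat /sys/block/md0/md/mismatch_cnt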

I recovered everything and I am quite relieved!

