I have a fairly large Linux software RAID6 array with 16 devices. Recently I noticed that a drive appears to have failed out of the array:

$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10] 
md0 : active raid6 sdk[17] sdn[19] sdp[16] sdg[13] sdi[10] sdl[8] sdj[11] sdh[14] sde[0] sdf[12] sdo[15] sda[2] sdc[6] sdb[7] sdd[1]
      41021890560 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/15] [UUUUUUUUUUUUUUU_]
      bitmap: 22/22 pages [88KB], 65536KB chunk

unused devices: <none>

Checking the details further:

# mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : 
        Raid Level : raid6
        Array Size : 41021890560 (39121.52 GiB 42006.42 GB)
     Used Dev Size : 2930135040 (2794.39 GiB 3000.46 GB)
      Raid Devices : 16
     Total Devices : 15
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : 
             State : clean, degraded 
    Active Devices : 15
   Working Devices : 15
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : 
              UUID :
            Events : 1781105

    Number   Major   Minor   RaidDevice State
       0       8       64        0      active sync   /dev/sde
       1       8       48        1      active sync   /dev/sdd
       2       8        0        2      active sync   /dev/sda
      12       8       80        3      active sync   /dev/sdf
       6       8       32        4      active sync   /dev/sdc
       7       8       16        5      active sync   /dev/sdb
       8       8      176        6      active sync   /dev/sdl
      17       8      160        7      active sync   /dev/sdk
      10       8      128        8      active sync   /dev/sdi
      11       8      144        9      active sync   /dev/sdj
      13       8       96       10      active sync   /dev/sdg
      14       8      112       11      active sync   /dev/sdh
      16       8      240       12      active sync   /dev/sdp
      15       8      224       13      active sync   /dev/sdo
      19       8      208       14      active sync   /dev/sdn
       -       0        0       15      removed

Yes, I am using whole disks rather than partitions. I know now that this is not best practice, but I did not know this back in 2017 when I built the raid. So far it has not bitten me. I have replacement drives of the exact same make and model as the existing member drives.

So it looks like /dev/sdm was the drive that was removed. I haven't been able to determine why: there don't seem to be any messages in dmesg or /var/log that point to why this drive was kicked out, and the drive passes both quick and extended SMART tests.
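
For reference, the sort of commands involved here would be something like the following; this is just a sketch, assuming the kernel still calls the drive /dev/sdm and that smartctl from smartmontools is available:

# dmesg -T | grep -iE 'sdm|md0'            # kernel messages mentioning the disk or the array
# journalctl -k | grep -iE 'sdm|md/raid'   # same search against the journal, if persistent logging is enabled
# smartctl -t short /dev/sdm               # quick self-test (a few minutes)
# smartctl -t long /dev/sdm                # extended self-test (hours on a 3 TB drive)
# smartctl -l selftest /dev/sdm            # self-test log, once the tests have finished
# smartctl -H /dev/sdm                     # overall health verdict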

Examining the drive with mdadm, I am a bit confused by the results:

# mdadm --examine /dev/sdm
/dev/sdm:
          Magic : 
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 
           Name : 
  Creation Time : 
     Raid Level : raid6
   Raid Devices : 16

 Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
     Array Size : 41021890560 (39121.52 GiB 42006.42 GB)
  Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262064 sectors, after=944 sectors
          State : clean
    Device UUID : 

Internal Bitmap : 8 sectors from superblock
    Update Time : 
  Bad Block Log : 512 entries available at offset 24 sectors
       Checksum : 8cfef706 - correct
         Events : 328936

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 15
   Array State : AAAAAAAAAAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)

Despite its status as a "removed" device, it is showing as "Active device 15" here, and the Array State is showing 16 active devices. Is this something to be concerned about?
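For what it's worth, the Events counter here (328936) is far behind the 1781105 that mdadm --detail reports for the array above. A quick way to compare every member's event counter and its recorded view of the array is something like this (a rough sketch, assuming the members are /dev/sda through /dev/sdp, as they appear to be here):

# mdadm --examine /dev/sd[a-p] | grep -E '^/dev/|Events|Array State'
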

For comparison, here is the output of examining a different, working drive in the array:

# mdadm --examine /dev/sdl
/dev/sdl:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 
           Name :
  Creation Time : 
     Raid Level : raid6
   Raid Devices : 16

 Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
     Array Size : 41021890560 (39121.52 GiB 42006.42 GB)
  Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=944 sectors
          State : clean
    Device UUID :

Internal Bitmap : 8 sectors from superblock
    Update Time : 
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : 9bf69eb5 - correct
         Events : 1782505

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 6
   Array State : AAAAAAAAAAAAAAA. ('A' == active, '.' == missing, 'R' == replacing)

I find it strange that this output, unlike the one from /dev/sdm itself, does show the missing drive (the trailing '.' in the Array State).

Can someone tell me whether it would be safe to recover the array by replacing the failed drive like this:

  1. mdadm --manage /dev/md0 --remove /dev/sdm (this may do nothing as the drive is already "Removed")
  2. Comment out the array from /etc/fstab so that it is not auto-mounted on boot
  3. Shut down the machine
  4. Replace the failed drive
  5. Start back up and run mdadm --manage /dev/md0 --add /dev/sdX, where X is the letter of the new, clean drive
  6. Check that the array is recovering and wait for it to complete (see the monitoring sketch just below this list)
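
For step 6, something along these lines works for monitoring (a minimal sketch; none of these commands modify the array):

# watch -n 30 cat /proc/mdstat             # live view of the recovery progress
# mdadm --wait /dev/md0                    # blocks until the recovery/resync finishes
# mdadm --detail /dev/md0 | grep 'State :' # should report "clean" again when done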

I have backups, but some of this output makes me nervous and losing the entire array would still be a big pain. Appreciate any help, thanks.

  • I'm not familiar with mdadm, but 16 drives in RAID6? That's no good. Let's say you've lost one drive. To rebuild, you'll have to do full reads on 15 drives simultaneously for hours. That will be quite stressful for them, and the chances of another failure will be increased. Hopefully you've mixed different models or at least batches, because failures within a single batch may be correlated. With disk capacities measured in TBs, some say that anything less redundant than RAID1 is playing with fire.
    – gronostaj
    Commented Sep 12, 2022 at 20:22
  • They are of different batches over several years, yes. They were not all purchased at the same time or from the same retailer. However, they are all the exact same make and model, so they can successfully be used as whole-drive array members. Not ideal, I know. I have successfully replaced 2 failed drives in this array before, as well as reshaped it several times to add new drives. The disks have seen a lot of I/O.
    – stiltzkin
    Commented Sep 12, 2022 at 20:24

1 Answer

I was able to successfully rebuild my array, although I'm still not exactly sure what error caused the drive in question to fail out. There were some generic unrecoverable read errors (UREs) logged on the disk, visible with smartctl -x, despite the overall SMART status of PASSED, which seems to be something of a false friend.
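
If you want to look for the same thing on your own drives, the relevant sections are in the SMART error and self-test logs; roughly like this (substitute your own device name for /dev/sdm):

# smartctl -x /dev/sdm          # full extended report, including the error logs
# smartctl -l error /dev/sdm    # just the ATA error log
# smartctl -H /dev/sdm          # the overall PASSED/FAILED verdict, which was not very telling in my case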

If it helps anyone, I followed these steps:

  1. Ran mdadm --manage /dev/md0 --remove /dev/sdm, but as I expected this had no effect, since the disk had already been removed automatically. Similarly, mdadm --manage /dev/md0 --fail /dev/sdm had no effect, as the disk was already in the removed state.
  2. Commented out my array from /etc/fstab to prevent it from being mounted on boot.
  3. Shut down the system.
  4. Removed the failed drive from its drive bay and replaced it with a new drive.
  5. Started back up and validated that the HBA sees all 16 devices on boot.
  6. Checked for the presence of a partition table on the new drive (which was also assigned /dev/sdm) with parted /dev/sdm print. The WD drives I am using do in fact ship with a GPT partition table from the factory, and it showed up here; it needs to be removed.
  7. Destroyed the partition table on the new disk with sgdisk --zap /dev/sdm. This is best practice when using whole-disk members. If you are using partitions as array members, you would instead copy the partition table from a working drive over to the new drive in this step - details on that can be found here, and there is a rough sketch after this list. Either way, make sure to select the correct disk(s), and in the correct order!
  8. Added the new disk with mdadm --manage /dev/md0 --add /dev/sdm.
  9. Waited (in my case) approximately 6 stressful hours for the array to rebuild, monitoring status with cat /proc/mdstat.
  10. Rejoiced in a successful rebuild.
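
In case it helps, here is roughly what the partition-table copy mentioned in step 7 looks like when the array members are partitions rather than whole disks. Treat it as a sketch only: the device names are placeholders, and it is easy to get the source and destination backwards, so double-check against the sgdisk man page before running anything.

# sgdisk --replicate=/dev/sdNEW /dev/sdWORKING   # copy the GPT from the working member onto the new disk
# sgdisk -G /dev/sdNEW                           # give the new disk fresh, non-conflicting GUIDs
# mdadm --manage /dev/md0 --add /dev/sdNEW1      # then add the matching partition rather than the whole disk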
