I have a fairly large Linux software RAID6 array with 16 devices. Recently I noticed that a drive appears to have failed out of the array:
$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md0 : active raid6 sdk[17] sdn[19] sdp[16] sdg[13] sdi[10] sdl[8] sdj[11] sdh[14] sde[0] sdf[12] sdo[15] sda[2] sdc[6] sdb[7] sdd[1]
41021890560 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/15] [UUUUUUUUUUUUUUU_]
bitmap: 22/22 pages [88KB], 65536KB chunk
unused devices: <none>
Checking the details further:
# mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time :
Raid Level : raid6
Array Size : 41021890560 (39121.52 GiB 42006.42 GB)
Used Dev Size : 2930135040 (2794.39 GiB 3000.46 GB)
Raid Devices : 16
Total Devices : 15
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time :
State : clean, degraded
Active Devices : 15
Working Devices : 15
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Consistency Policy : bitmap
Name :
UUID :
Events : 1781105
Number Major Minor RaidDevice State
0 8 64 0 active sync /dev/sde
1 8 48 1 active sync /dev/sdd
2 8 0 2 active sync /dev/sda
12 8 80 3 active sync /dev/sdf
6 8 32 4 active sync /dev/sdc
7 8 16 5 active sync /dev/sdb
8 8 176 6 active sync /dev/sdl
17 8 160 7 active sync /dev/sdk
10 8 128 8 active sync /dev/sdi
11 8 144 9 active sync /dev/sdj
13 8 96 10 active sync /dev/sdg
14 8 112 11 active sync /dev/sdh
16 8 240 12 active sync /dev/sdp
15 8 224 13 active sync /dev/sdo
19 8 208 14 active sync /dev/sdn
- 0 0 15 removed
Yes, I am using whole disks rather than partitions. I know now that this is not best practice, but I did not know this back in 2017 when I built the raid. So far it has not bitten me. I have replacement drives of the exact same make and model as the existing member drives.
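Since these are whole-disk members, my understanding is that the replacement just needs to be at least as large as the old member; I plan to double-check that before the swap with something like the following (sdX being a placeholder for whatever letter the new drive comes up as):
# lsblk -b -o NAME,SIZE,MODEL,SERIAL /dev/sdm /dev/sdX
# blockdev --getsize64 /dev/sdm
# blockdev --getsize64 /dev/sdX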
So it looks like /dev/sdm was the drive that was removed. I haven't been able to determine why: there don't seem to be any messages in dmesg or /var/log that point to why this drive was kicked out, and the drive passes both the quick and extended SMART self-tests.
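For reference, the checks I ran were along these lines (smartmontools for the self-tests; the exact invocations may have differed slightly):
# dmesg | grep -iE 'sdm|md0'
# journalctl -k | grep -iE 'sdm|md0'
# smartctl -t short /dev/sdm
# smartctl -t long /dev/sdm
# smartctl -a /dev/sdm
with the last command run after each self-test finished, to review the results and the SMART attributes.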
Examining the drive with mdadm, I am a bit confused by the results:
# mdadm --examine /dev/sdm
/dev/sdm:
Magic :
Version : 1.2
Feature Map : 0x1
Array UUID :
Name :
Creation Time :
Raid Level : raid6
Raid Devices : 16
Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
Array Size : 41021890560 (39121.52 GiB 42006.42 GB)
Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262064 sectors, after=944 sectors
State : clean
Device UUID :
Internal Bitmap : 8 sectors from superblock
Update Time :
Bad Block Log : 512 entries available at offset 24 sectors
Checksum : 8cfef706 - correct
Events : 328936
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 15
Array State : AAAAAAAAAAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)
Despite its status as a "removed" device, it is showing as "Active device 15" here, and the Array State is showing 16 active devices. Is this something to be concerned about?
For comparison, here is the output of examining a different, working drive in the array:
# mdadm --examine /dev/sdl
/dev/sdl:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID :
Name :
Creation Time :
Raid Level : raid6
Raid Devices : 16
Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
Array Size : 41021890560 (39121.52 GiB 42006.42 GB)
Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262056 sectors, after=944 sectors
State : clean
Device UUID :
Internal Bitmap : 8 sectors from superblock
Update Time :
Bad Block Log : 512 entries available at offset 72 sectors
Checksum : 9bf69eb5 - correct
Events : 1782505
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 6
Array State : AAAAAAAAAAAAAAA. ('A' == active, '.' == missing, 'R' == replacing)
I find it strange that the working drive's output shows device 15 as missing ('.'), while the removed drive's own output still reports all 16 devices as active.
Can someone tell me whether it would be safe to recover the array with the following plan (full command sequence sketched after the list)?
- Remove the failed drive with mdadm --manage /dev/md0 --remove /dev/sdm (this may do nothing, as the drive is already "Removed")
- Comment out the array in /etc/fstab so that it is not auto-mounted on boot
- Shut down the machine
- Replace the failed drive
- Start back up and run mdadm --manage /dev/md0 --add /dev/sdX, where X is the letter of the new, clean drive
- Check that the array is recovering and wait for it to complete
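Put into commands, the plan above would look roughly like this (again, sdX is a placeholder for whatever the new drive enumerates as):
# mdadm --manage /dev/md0 --remove /dev/sdm
then, after commenting out the mount in /etc/fstab, shutting down, swapping the drive, and booting back up:
# mdadm --manage /dev/md0 --add /dev/sdX
# cat /proc/mdstat
# mdadm --detail /dev/md0
watching /proc/mdstat until the rebuild completes.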
I have backups, but some of this output makes me nervous and losing the entire array would still be a big pain. Appreciate any help, thanks.