On a Debian system my home directory is on a RAID1 md array. After creation it worked fine for some time, but at some point the second drive disappeared from the array.

# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda3[1]
      1843414335 blocks super 1.2 [2/1] [U_]

When I manually add the missing disk, it is detected as a spare and the array is reconstructed. Until the next reboot...
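
Roughly what I do, assuming /dev/sdb3 is the member that dropped out (adjust the device name if yours differs):

mdadm /dev/md0 --add /dev/sdb3
cat /proc/mdstat

The first command brings the partition back (it comes in as a spare and a resync starts); the second shows the rebuild progress.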

I managed to fix it by recreating the array (with the same partitions). That lasted for several reboots, and now the same issue is back.

Both disks are new and SMART checks are OK.

I have checked dmesg; here is the full "failing" sequence. Good, synced RAID:

    Number   Major   Minor   RaidDevice   State
       0       8       3         0        active sync   /dev/sda3
       1       8      19         1        active sync   /dev/sdb3

On reboot (more dmesg here: http://pastebin.com/q1Du95Tv ):

[ 8.175247] sda: sda1 sda2 sda3 sda4
...
[ 8.644777] md: md0 stopped.
[ 8.645248] md: bind<sda3>
[ 8.646198] md: raid1 personality registered for level 1
[ 8.646377] md/raid1:md0: active with 1 out of 2 mirrors
[ 8.646391] md0: detected capacity change from 0 to 42916118528
[ 8.646407] RAID1 conf printout:
[ 8.646409] --- wd:1 rd:2
[ 8.646411] disk 0, wo:0, o:1, dev:sda3
[ 8.648749] md0: unknown partition table
[ 8.753331] usb 4-3: new full-speed USB device number 7 using ohci-pci
[ 8.840857] sdb: sdb1 sdb2 sdb3 sdb4
[ 8.841175] sd 1:0:0:0: [sdb] Attached SCSI disk

After this:

    Number   Major   Minor   RaidDevice   State
       0       8       3         0        active sync   /dev/sda3
       2       0       0         2        removed

mdadm -E result: http://pastebin.com/cp65mNQh
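
(The interesting fields to compare in that output are the Events counter and the Device Role of each member; something like the command below pulls them out, using the device names from above. A member whose Events counter has fallen behind the other's will normally be left out when the array is assembled.)

mdadm --examine /dev/sda3 /dev/sdb3 | grep -E 'Events|Device Role|Array State'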

2 Answers

I had the same problem. What I found strange was the numbering of the devices, which was 0 and 2. I ended up recreating the whole RAID setup.

mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

Now the numbers are 0 and 1 again (according to mdadm --detail /dev/md0).

After that I modified my /etc/mdadm/mdadm.conf file a bit.

ARRAY /dev/md0 level=raid1 num-devices=2 devices=/dev/sda1,/dev/sdb1 UUID=<uuid of your raid setup>
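
If you are unsure of the UUID, mdadm can print a ready-made ARRAY line for you. Since Debian keeps a copy of mdadm.conf inside the initramfs, it is also worth regenerating that afterwards so early-boot assembly sees the new configuration (roughly, as root):

mdadm --detail --scan
update-initramfs -u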

I also added a small delay in /boot/cmdline.txt with rootdelay=5.

Altogether, this seems to have fixed my problem.

On Linux, /dev/sdXY names are not guaranteed to be persistent. That is, for a given set of physical disks, the device nodes are not guaranteed to appear in the same order from one boot to the next.

If you have a single disk, you can be near 100% certain that it'll always show up as the same device node (for example, /dev/sda).

If you have two disks, you can be quite certain that they will always show up in the same order (such that "/dev/sda on boot 1" is the same physical device as "/dev/sda on boot 2" for any two consecutive boots).

If you have a hundred disks, the probability that two of them will swap names becomes quite high. That can be the result of any number of irregularities, but in the end it boils down to the fact that /dev/sdX device nodes are created in the order the disks are detected, and that detection order is not guaranteed.

If you want to guarantee that the same disk will always be referred to by a given name, you should use a persistent identifier. The daemon that everyone has a love-hate relationship with, udev, makes this easy: it is normally configured to create symbolic links under the /dev/disk/by-* directories that map various persistent attributes of a disk (the bus/manufacturer/model/serial quadruplet, the bus topology location, the WWN, ...) to the device node as detected by the kernel. You can also add custom rules to create any naming scheme you prefer. I hit something similar (but not exactly the same) myself, and resolved it by migrating to WWN names.
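
A custom rule could look something like this (the WWN value and the link name are made-up placeholders; ID_WWN is one of the properties udev's built-in disk identification provides):

# e.g. /etc/udev/rules.d/60-local-disk-names.rules (hypothetical file name)
SUBSYSTEM=="block", KERNEL=="sd?", ENV{ID_WWN}=="0x5000c500aaaaaaaa", SYMLINK+="disk/home-raid-a"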

In large storage arrays, this is actually a very real problem, and is the reason for some rather strong advocacy for always using persistent device names when referring to storage.

Your problem should disappear (or at the very least be greatly reduced, barring actual disk problems) if you re-add the disks to your array as for example /dev/disk/by-id/wwn-*-part* instead of /dev/sdXY. Doing so should not have any negative impact on anything else.
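
As a concrete sketch (the wwn-0x... value below is a placeholder; use whatever the listing shows for your second disk):

ls -l /dev/disk/by-id/ | grep wwn
mdadm /dev/md0 --add /dev/disk/by-id/wwn-0x5000c500bbbbbbbb-part3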

  • Good point about device names, but in my case, when I recover the RAID manually (I have done it many times), I always find the second disk under the name sdb. I even tried to capture the exact moment when the array is not assembled, and the disk still has the same name. If there is no better answer I will try recreating the RAID using the devices' UUIDs. However, I think I saw somewhere that md uses UUIDs to detect devices.
    – Artiom
    Commented Aug 28, 2014 at 8:38
  • So after different tests it appears to be more of a RAID disassembly issue. The last thing I tried was the raid=noautodetect kernel option. Now the RAID is assembled by the "final" OS (systemd, I suppose), but it changed nothing. It can still "survive" several reboots, but not more; I was checking it on every boot. Disk naming didn't change. I think it would only change if I changed the disk configuration (e.g. added or removed a physical disk).
    – Artiom
    Commented Sep 22, 2014 at 10:40
  • I updated the initial question with new details. It is definitely not a disk-naming issue, but a boot-sequence issue. I still have raid=noautodetect in the kernel options, but it seems it didn't help.
    – Artiom
    Commented Sep 29, 2014 at 8:06
