Answering my own question a good two years later...
I ended up just exporting everything away from this server and formatting it, but faced the problem again recently and decided to tackle it head-on.
Everything I'm mentioning here is done with:
- CentOS 7 with the ml kernel from ELRepo.
- Both devices were added to a raid1 during OS installation and all partitions were mounted on top of the raid1 (/boot, / and swap). LVM or not doesn't seem to make a difference regarding this topic.
- As with any default CentOS 7 installation, it uses GRUB2 rather than the legacy GRUB. This is a big distinction, as most of the documentation you find online pertains to v1.
- My setup DOES NOT contain an EFI partition. From what I found, EFI doesn't work on top of RAID, so pay close attention here: the procedure below won't work correctly if you do have EFI.
It became clear to me that the issue was that CentOS only installed the bootloader (GRUB2) on the first physical device. On a disk with an MSDOS (MBR) partition table, GRUB2's boot code lives in the first sector of the disk and the small gap right after it, not inside a partition, so it sits outside the RAID and is not mirrored. Because of this, even simply swapping the disks leaves the system unable to boot: the secondary disk has a full copy of the system, but no bootloader.
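To see which disks actually carry GRUB boot code, one rough check is to look for the literal string "GRUB" in the first 512-byte sector, which GRUB's boot sector embeds. This is only a heuristic sketch: the function name `has_grub_boot_sector` is made up for this example, and /dev/sda and /dev/sdb are assumed device names.

```shell
#!/bin/sh
# Heuristic: succeed if the first 512-byte sector of a disk (or image file)
# contains the literal string "GRUB", which GRUB's boot sector embeds.
# has_grub_boot_sector is a hypothetical helper name, not a GRUB tool.
has_grub_boot_sector() {
    dd if="$1" bs=512 count=1 2>/dev/null | grep -a -q 'GRUB'
}

# Example (run as root; /dev/sda and /dev/sdb are assumed names):
# for disk in /dev/sda /dev/sdb; do
#     if has_grub_boot_sector "$disk"; then
#         echo "$disk: GRUB boot code found"
#     else
#         echo "$disk: no GRUB boot code"
#     fi
# done
```

On a setup like the one described, you would expect the check to succeed for the first disk and fail for the second until the bootloader is installed there as well.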
Now to explore the solution, which hopefully is straightforward enough: copy the bootloader from the main disk to the secondary disk.
I was unsure how to proceed, as I didn't know exactly how the bootloader refers to the devices. Maybe simply mirroring the bootloader from the first disk would cause conflicts, and I would have to invert the entries so it knows which device is actually which.
Looking into this, I found a suggestion to use dd to copy the bootloader onto the second disk. One user posted that they always did that and it always worked, but for whatever reason it didn't work for me.
Trying to better understand how GRUB2 works, I found that it includes an install tool (grub2-install) that writes the bootloader to whichever device you specify, and voilà, it was exactly what I needed!
All I had to do was:
grub2-install /dev/sdb
Then after swapping the disks, the system still booted as expected! Half of the mission accomplished.
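Since every disk in the mirror needs its own copy of the boot code, the same command can be repeated for each member disk. Here is a minimal sketch, assuming the disks are /dev/sda and /dev/sdb (check yours with lsblk); the wrapper function and the DRY_RUN switch are my own additions and only print the commands unless you set DRY_RUN=0 as root.

```shell
#!/bin/sh
# Install GRUB2 boot code on every disk passed as an argument.
# With DRY_RUN=1 (the default here) the commands are only printed.
# install_grub_on_disks is a hypothetical wrapper around grub2-install.
install_grub_on_disks() {
    for disk in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "grub2-install $disk"
        else
            grub2-install "$disk" || return 1
        fi
    done
}

# /dev/sda and /dev/sdb are assumed device names.
install_grub_on_disks /dev/sda /dev/sdb
```

Run it once as-is to review what would be executed, then again with DRY_RUN=0 to actually write the bootloader.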
After confirming that both devices can boot, I removed the failed disk (now 'sdb') from the array: you mark the device as failed and then remove it from the array (this is well documented online). I then swapped it for the new disk and added that back to the mdadm array, at which point it should start syncing automatically (you can watch the progress with watch cat /proc/mdstat).
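The fail/remove/add sequence described above can be sketched roughly as follows. The array name /dev/md0 and the member partition /dev/sdb1 are assumptions (a setup with separate arrays for /boot, / and swap needs the same steps for each one), and the helper functions are hypothetical wrappers that only print the mdadm commands unless DRY_RUN=0 is set. Partitioning the new disk to match the surviving one is not covered here.

```shell
#!/bin/sh
# Print (DRY_RUN=1, the default) or execute a command.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Mark a member as failed, then remove it from the array.
# usage: md_remove_member ARRAY MEMBER
md_remove_member() {
    run mdadm --manage "$1" --fail "$2" &&
    run mdadm --manage "$1" --remove "$2"
}

# Add a partition from the replacement disk; the resync starts on its own.
# usage: md_add_member ARRAY MEMBER
md_add_member() {
    run mdadm --manage "$1" --add "$2"
}

# Assumed names: array /dev/md0 with member /dev/sdb1.
md_remove_member /dev/md0 /dev/sdb1
# ...power off, swap the disk, partition it like the surviving one...
md_add_member /dev/md0 /dev/sdb1
```

Once the member has been added, /proc/mdstat shows the recovery progress until the array is back in sync.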
Also, don't forget to run grub2-install again on the new device after you're done.
Hopefully this can help someone else facing the same issue.
If you have an EFI partition, you'd probably have to find a way to copy it to the beginning of the secondary device (dd should do the job, though I've no idea what parameters you'd have to give it).