
I want to build a NAS using mdadm for the RAID and Btrfs for bitrot detection. I have a fairly basic setup: three 1 TB disks combined with mdadm into a RAID 5, then Btrfs on top of that.

I know that mdadm cannot repair bitrot. It can only tell me when there are mismatches but it doesn't know which data is correct and which is faulty. When I tell mdadm to repair my md0 after I simulate bitrot, it always rebuilds the parity. Btrfs uses checksums so it knows which data is faulty, but it cannot repair the data since it cannot see the parity.

I can, however, run a btrfs scrub and read the syslog to get the offset of the data that did not match its checksum. I can then translate this offset to a disk and an offset on that disk, because I know the data start offset of md0 (2048 * 512), the chunk size (512K) and the layout (left-symmetric). The layout means that in the first stripe the parity is on the third disk, in the second stripe on the second disk, and in the third stripe on the first disk.
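A rough sketch of that translation, assuming the geometry above (the byte offset into md0 still has to be derived from the Btrfs logical address in the scrub error first):

#!/bin/bash
# Sketch only: map a byte offset inside /dev/md0 to (member disk, byte
# offset on that disk) for 3 disks, 512K chunks, left-symmetric layout
# and a 2048-sector data offset. Adjust the constants for other arrays.
MD_OFFSET=$1                  # byte offset inside /dev/md0
NDISKS=3
CHUNK=$((512 * 1024))         # 512 KiB chunk size
DATA_OFFSET=$((2048 * 512))   # per-member data offset (mdadm -E "Data Offset")

DATA_DISKS=$((NDISKS - 1))
STRIPE=$((MD_OFFSET / (CHUNK * DATA_DISKS)))    # which stripe (row)
REST=$((MD_OFFSET % (CHUNK * DATA_DISKS)))
DATA_IDX=$((REST / CHUNK))                      # which data chunk within the stripe
IN_CHUNK=$((REST % CHUNK))

PARITY_DISK=$((NDISKS - 1 - STRIPE % NDISKS))           # parity walks from the last disk to the first
DATA_DISK=$(( (PARITY_DISK + 1 + DATA_IDX) % NDISKS ))  # left-symmetric: data follows the parity disk

DISK_OFFSET=$((DATA_OFFSET + STRIPE * CHUNK + IN_CHUNK))
echo "stripe $STRIPE: parity on member $PARITY_DISK, data on member $DATA_DISK at byte $DISK_OFFSET"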

Combining all this data with some more knowledge of the Btrfs on-disk format, I can calculate exactly which chunk on which disk is the faulty one. However, I cannot find a way to tell mdadm to repair this specific chunk.

I already wrote a script that swaps the parity and the faulty chunk using dd, then starts a repair with mdadm and then swaps them back, but this is not a good solution, and I would really prefer mdadm to mark this sector as bad and not use it again. Since it has started to rot, chances are high it will do so again.
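In outline, that workaround looks something like this (a sketch, not the actual script; device names and offsets are placeholders for the values computed above, GNU dd is assumed for skip_bytes/seek_bytes, and the filesystem should be unmounted while the chunks are shuffled):

#!/bin/bash
# Sketch of the swap / repair / swap-back trick. $1/$2: member disk and
# byte offset of the corrupt data chunk; $3/$4: member disk and byte
# offset of that stripe's parity chunk.
DATA_DEV=$1; DATA_OFF=$2
PAR_DEV=$3;  PAR_OFF=$4
CHUNK=$((512 * 1024))
MD=md0

swap_chunks() {
    dd if="$DATA_DEV" of=/tmp/a.chunk bs=$CHUNK count=1 iflag=skip_bytes skip="$DATA_OFF"
    dd if="$PAR_DEV"  of=/tmp/b.chunk bs=$CHUNK count=1 iflag=skip_bytes skip="$PAR_OFF"
    dd if=/tmp/b.chunk of="$DATA_DEV" bs=$CHUNK oflag=seek_bytes seek="$DATA_OFF" conv=notrunc,fsync
    dd if=/tmp/a.chunk of="$PAR_DEV"  bs=$CHUNK oflag=seek_bytes seek="$PAR_OFF"  conv=notrunc,fsync
}

swap_chunks                                    # bad data now sits in the parity slot
echo repair > /sys/block/$MD/md/sync_action    # md recomputes "parity" from the good chunks
while [ "$(cat /sys/block/$MD/md/sync_action)" != idle ]; do sleep 5; done
swap_chunks                                    # the reconstructed chunk goes back into the data slot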

My question is: is there any way to tell mdadm to repair a single chunk (which is not the parity) and possibly even mark a disk sector as bad? Maybe by creating a read I/O error?

(And I know ZFS can do all this by itself, but I don't want to use ECC memory.)

Edit: the question/answer linked in the comments is about how Btrfs RAID 6 is unstable and how ZFS is much more stable/usable. That does not address my question about how to repair a single known faulty chunk with mdadm.

  • You do not need ECC memory to use ZFS.... I'd recommend you use ZFS.
    – Attie
    Commented Aug 6, 2018 at 13:27
  • "... I then can translate this offset to a disk and an offset on that disk ..." - you are planning on being far too hands-on with the storage... It will probably go wrong.
    – Attie
    Commented Aug 6, 2018 at 13:30
  • If I use ZFS without ECC, I could just as well not worry about bitrot protection. Both prevent very rare errors, but I want to do this right. As for being too hands-on, you're right, but I don't see any better way. I know it's possible: Netgear's ReadyNAS and Synology combine mdadm and Btrfs and still keep bitrot protection.
    Commented Aug 6, 2018 at 13:36
  • You're not just protecting against bit rot, but also other things like read/write errors (e.g.: high write). The ZFS/ECC issue has been hugely exaggerated and misunderstood - yes, a running machine might benefit from ECC, but for data to suffer, a number of rare issues will have to occur in just the right way. You would be better off using ZFS for the situation you've outlined... How would using Btrfs + mdadm + scripts without ECC be less of an issue than ZFS without ECC?
    – Attie
    Commented Aug 6, 2018 at 13:53
  • Even though the question asked is not the same as Btrfs over mdadm raid6?, this is an XY problem that is fully addressed by the other question and answer.
    – Deltik
    Commented Aug 6, 2018 at 13:58

2 Answers


I cannot find a way to tell mdadm to repair this specific chunk.

That's because when there is silent data corruption, md does not have enough information to know which block is silently corrupted.

I invite you to read my answer to question #4 ("Why does md continue to use a device with invalid data?") here, which explains this in further detail.

To make matters worse for your proposed layout, if a parity block suffers from silent data corruption, the Btrfs layer above can't even see it. When the disk holding the corresponding data block fails and you try to replace it, md will reconstruct from the corrupted parity and irreversibly corrupt your data. Only then, when Btrfs reads back the bad reconstruction and its checksum fails, does the corruption become visible, but by that point the data is already gone.

This is because md does not read from parity blocks unless the array is degraded.
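The only scrub interface md exposes works on the whole array, and repair simply regenerates parity from whatever the data blocks happen to contain:

echo check  > /sys/block/md0/md/sync_action    # read all stripes, count parity mismatches
cat /sys/block/md0/md/mismatch_cnt             # number of mismatched sectors found
echo repair > /sys/block/md0/md/sync_action    # rewrite parity to match the (possibly bad) data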


So is there any way to tell mdadm to repair a single chunk (which is not the parity) and possibly even mark a disk sector as bad? Maybe by creating a read I/O error?

md copes easily with bad sectors that the hard drive has detected itself, because the drive reports a read error and md therefore knows exactly which disk and sector are affected.

You can technically make a bad sector with hdparm --make-bad-sector, but how do you know which disk has the block affected by silent data corruption?
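(For illustration only, and assuming you somehow did know the right disk and sector, it would look something like this; the sector number is made up, and --make-bad-sector is destructive:)

# DANGEROUS and purely illustrative: flag one sector as unreadable
hdparm --yes-i-know-what-i-am-doing --make-bad-sector 123456789 /dev/sdb

# reading it now fails; when md hits such a read error, it reconstructs
# the block from the other disks plus parity and rewrites the sector
dd if=/dev/sdb of=/dev/null bs=512 skip=123456789 count=1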

Consider this simplified example:

Parity formula: PARITY = DATA_1 + DATA_2

+--------+--------+--------+
| DATA_1 | DATA_2 | PARITY |
+--------+--------+--------+
|      1 |      1 |      2 | # OK
+--------+--------+--------+

Now let's corrupt each of the blocks silently with a value of 3:

+--------+--------+--------+
| DATA_1 | DATA_2 | PARITY |
+--------+--------+--------+
|      3 |      1 |      2 | # Integrity failed – Expected: PARITY = 4
|      1 |      3 |      2 | # Integrity failed – Expected: PARITY = 4
|      1 |      1 |      3 | # Integrity failed – Expected: PARITY = 2
+--------+--------+--------+

If you didn't have the first table to look at, how would you know which block was corrupted?
You can't know for sure.

This is why Btrfs and ZFS both checksum blocks. It takes a little more disk space, but this extra information lets the storage system figure out which block is lying.

From Jeff Bonwick's blog article "RAID-Z":

Whenever you read a RAID-Z block, ZFS compares it against its checksum. If the data disks didn't return the right answer, ZFS reads the parity and then does combinatorial reconstruction to figure out which disk returned bad data.

To do this with Btrfs on md, you would have to try recalculating each block until the checksum matches in Btrfs, a time-consuming process with no easy interface exposed to the user/script.


I know ZFS can do all this by itself, but I don't want to use ECC memory

Neither ZFS nor Btrfs over md depends on or is even aware of ECC memory. ECC memory only catches silent data corruption in memory, so it's storage system-agnostic.

I've recommended ZFS over Btrfs for RAID-5 and RAID-6 (analogous to ZFS RAID-Z and RAID-Z2, respectively) before in Btrfs over mdadm raid6? and Fail device in md RAID when ATA stops responding, but I would like to take this opportunity to outline a few more advantages of ZFS:

  • When ZFS detects silent data corruption, it is automatically and immediately corrected on the spot without any human intervention.
  • If you need to rebuild an entire disk, ZFS will only "resilver" the actual data instead of needlessly running across the whole block device.
  • ZFS is an all-in-one solution to logical volumes and file systems, which makes it less complex to manage than Btrfs on top of md.
  • RAID-Z and RAID-Z2 are reliable and stable, unlike
    • Btrfs on md RAID-5/RAID-6, which only offers error detection on silently corrupted data blocks (plus silently corrupted parity blocks may go undetected until it's too late) and no easy way to do error correction, and
    • Btrfs RAID-5/RAID-6, which "has multiple serious data-loss bugs in it".
  • If I silently corrupted an entire disk with ZFS RAID-Z2, I would lose no data at all whereas on md RAID-6, I actually lost 455,681 inodes.
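For comparison, the entire stack the question is trying to assemble by hand boils down to a couple of commands with ZFS (device names are illustrative):

# illustrative device names; three-disk single-parity pool (RAID-Z1),
# roughly what the question builds out of md RAID-5 + Btrfs
zpool create tank raidz1 /dev/sdb /dev/sdc /dev/sdd

# a scrub verifies every block's checksum and, on a mismatch, rebuilds
# the bad block from parity and rewrites it automatically
zpool scrub tank
zpool status -v tank    # shows what was repaired and any unrecoverable files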

I found a way to create a read error for mdadm.

With dmsetup you can create logical devices from tables.

"Devices are created by loading a table that specifies a target for each sector (512 bytes)." (from the dmsetup man page)

In these tables, you can specify offsets that should return an I/O error, for example:

0 4096 linear /dev/sdb 0
4096 1 error
4097 2093055 linear /dev/sdb 4097

This creates a 1 GiB device that returns an I/O error at byte offset 4096 * 512.
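Creating and removing such a mapping could look like this (device and mapping names are illustrative):

# illustrative: wrap /dev/sdb in a mapping that errors on sector 4096
# and passes everything else through unchanged
cat > bad-sector.table << 'EOF'
0 4096 linear /dev/sdb 0
4096 1 error
4097 2093055 linear /dev/sdb 4097
EOF
dmsetup create sdb-witherror < bad-sector.table

# /dev/mapper/sdb-witherror can now stand in for /dev/sdb as an md member;
# reading the bad sector returns an I/O error:
dd if=/dev/mapper/sdb-witherror of=/dev/null bs=512 skip=4096 count=1

dmsetup remove sdb-witherror     # tear the mapping down again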

  • Why on Earth would you do this? Creating an md device with intentional unfixable errors precludes successful array rebuilds, and in the bigger picture, fixating on hacks like this goes against the flow of data integrity, and tunnel vision leads to accidents. You mentioned, "I want to do this right", so you really should be using the decade-long established solution, ZFS.
    – Deltik
    Commented Aug 9, 2018 at 15:16
