I cannot find a way to tell mdadm to repair this specific chunk.

That's because when there is silent data corruption, md does not have enough information to know which block is silently corrupted.

I invite you to read my answer to question #4 ("Why does md continue to use a device with invalid data?") here, which explains this in further detail.

To make matters worse for your proposed layout, if a parity block suffers from silent data corruption, the Btrfs layer above can't see it! When the disk holding the corresponding data block fails and you try to replace it, md will rebuild that block from the corrupted parity and irreversibly corrupt your data. Only then will Btrfs recognize the corruption, but by that point you have already lost the data.

This is because md does not read from parity blocks unless the array is degraded.
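
To see why, here is a minimal sketch in Python of that failure mode, assuming a toy stripe of two data blocks with XOR parity (as single-parity md uses) and made-up 4-byte block contents; it illustrates only the arithmetic, not md's actual code:

    # Toy single-parity stripe: PARITY = DATA_1 XOR DATA_2 (as in md RAID-5).
    def xor_blocks(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    data_1 = b"\x01\x01\x01\x01"
    data_2 = b"\x02\x02\x02\x02"
    parity = xor_blocks(data_1, data_2)   # healthy parity: b"\x03\x03\x03\x03"

    parity = b"\x07\x03\x03\x03"          # one byte flips silently; md never notices

    # The disk holding data_1 fails and is replaced. md rebuilds data_1 from
    # the surviving data block and the (corrupted) parity:
    rebuilt_data_1 = xor_blocks(data_2, parity)

    print(rebuilt_data_1)                 # b'\x05\x01\x01\x01', not b'\x01\x01\x01\x01'
    # Btrfs only notices later, when its checksum for this block fails,
    # and by then the original data is gone.

Nothing in this path can tell that the parity, rather than the data, held the wrong value, so the rebuild silently bakes the corruption into the data.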


So is there any way to tell mdadm to repair a single chunk (which is not the parity) and possibly even mark a disk sector as bad? Maybe creating a read io error?

For bad sectors that the hard drive has detected itself, md can cope easily because the bad sector is reported to md.

You can technically make a bad sector with hdparm --make-bad-sector, but how do you know which disk has the block affected by silent data corruption?

Consider this simplified example:

Parity formula: PARITY = DATA_1 + DATA_2

+--------+--------+--------+
| DATA_1 | DATA_2 | PARITY |
+--------+--------+--------+
|      1 |      1 |      2 | # OK
+--------+--------+--------+

Now let's corrupt each of the blocks silently with a value of 3:

+--------+--------+--------+
| DATA_1 | DATA_2 | PARITY |
+--------+--------+--------+
|      3 |      1 |      2 | # Integrity failed – Expected: PARITY = 4
|      1 |      3 |      2 | # Integrity failed – Expected: PARITY = 4
|      1 |      1 |      3 | # Integrity failed – Expected: PARITY = 2
+--------+--------+--------+

If you didn't have the first table to look at, how would you know which block was corrupted? You can't know for sure.

This is why Btrfs and ZFS both checksum blocks. It takes a little more disk space, but this extra information lets the storage system figure out which block is lying.

From Jeff Bonwick's blog article "RAID-Z":

Whenever you read a RAID-Z block, ZFS compares it against its checksum. If the data disks didn't return the right answer, ZFS reads the parity and then does combinatorial reconstruction to figure out which disk returned bad data.

To do this with Btrfs on md, you would have to try recalculating each block until the checksum matches in Btrfs, a time-consuming process with no easy interface exposed to the user/script.
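
As a rough illustration of that combinatorial reconstruction, here is a sketch in Python that uses the additive parity formula from the table above and SHA-256 as a stand-in for the real per-block checksums; it shows the technique only and is not actual ZFS or Btrfs code:

    import hashlib

    def checksum(blocks):
        # Stand-in for the checksum stored when the block was written.
        return hashlib.sha256(repr(blocks).encode()).hexdigest()

    # Recorded while the stripe was healthy: DATA_1 = 1, DATA_2 = 1, PARITY = 2.
    good_checksum = checksum([1, 1])

    # What the disks return after DATA_1 is silently corrupted to 3.
    data = [3, 1]
    parity = 2

    if checksum(data) != good_checksum:          # detection: some block is lying
        # Assume each data block in turn is the bad one, rebuild it from the
        # parity (PARITY = DATA_1 + DATA_2), and keep the first combination
        # whose checksum matches the recorded one.
        for i in range(len(data)):
            candidate = list(data)
            candidate[i] = parity - sum(d for j, d in enumerate(data) if j != i)
            if checksum(candidate) == good_checksum:
                print(f"DATA_{i + 1} was lying; corrected stripe: {candidate}")
                break

The checksum is what turns "something in this stripe is wrong" into "this specific block is wrong", which is exactly the information md lacks.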


I know ZFS can do all this all by itself, but I don't want to use ECC memory

Neither ZFS nor Btrfs over md depends on, or is even aware of, ECC memory. ECC memory only catches silent data corruption in memory, so it is agnostic to the storage system.

I've recommended ZFS over Btrfs for RAID-5 and RAID-6 (analogous to ZFS RAID-Z and RAID-Z2, respectively) before in "Btrfs over mdadm raid6?" and "Fail device in md RAID when ATA stops responding", but I would like to take this opportunity to outline a few more advantages of ZFS:

  • When ZFS detects silent data corruption, it is automatically and immediately corrected on the spot without any human intervention.
  • If you need to rebuild an entire disk, ZFS will only "resilver" the actual data instead of needlessly running across the whole block device.
  • ZFS is an all-in-one solution to logical volumes and file systems, which makes it less complex to manage than Btrfs on top of md.
  • RAID-Z and RAID-Z2 are reliable and stable, unlike
    • Btrfs on md RAID-5/RAID-6, which only offers error detection on silently corrupted data blocks (plus silently corrupted parity blocks may go undetected until it's too late) and no easy way to do error correction, and
    • Btrfs RAID-56, which "has multiple serious data-loss bugs in it".
  • If I silently corrupted an entire disk with ZFS RAID-Z2, I would lose no data at all, whereas on md RAID-6 I actually lost 455,681 inodes.