
Is there a file checksum designed specifically for recovering a single file (archive) from data corruption? Something simple, like a hash, that can be used to recover the file.

I am trying to archive some backups of home and business files (not media files) by compressing and dating them. The largest archive currently runs about 250 GB. After an archive is created, I compute an MD5 checksum on it, transfer the archive to another drive, use the MD5 to verify that the file transferred correctly, and store the MD5 hash with the archive for future verification. I plan on archiving these backups 1-2 times a year and storing them on HDD and tape as budget allows.
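For reference, that verification step is nothing more than a streamed MD5 compare. A rough sketch of the workflow in Python (the file names are invented for illustration):

    import hashlib

    def md5_of_file(path, chunk_size=8 * 1024 * 1024):
        """Stream the file in chunks so a 250 GB archive never has to fit in RAM."""
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # After creating the archive, record its hash...
    original = md5_of_file("backup-2016-11.zipx")
    with open("backup-2016-11.zipx.md5", "w") as f:
        f.write(original + "\n")

    # ...then, after copying it to another drive, verify the copy.
    copied = md5_of_file("/mnt/other_drive/backup-2016-11.zipx")
    print("OK" if copied == original else "mismatch: copy is corrupted or was altered")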

Current archive format is "Zipx" with highest settings.

Given the volume of information, currently about 1-2 TB a year, I foresee having some data corruption to deal with, especially since these files sit on consumer drives. Add in that backups get transferred from drive to drive, to tape, and back again, so an initial 250 GB archive can actually represent many terabytes of written and read data, which increases the risk of corruption. Verifying MD5s after each transfer also adds a lot of time because the MD5 check is I/O-limited; an MD5 check on a 250 GB archive takes a long time, multiplied by all the archives, so the MD5s are bound not to get checked as often as they need to be.

So the assumptions are:

  1. Data will get corrupted.
  2. We will not know about it until after the fact.
  3. Due to budget restrictions and the data not being "mission critical", we do not keep multiple copies of the exact same backup archive, only different iterations of backups.
  4. We want to minimize the copies of our backups while still protecting against data corruption.
  5. If a file or two in an archive does get corrupted and we lose that data when we try to restore, life will go on. This is not mission critical.
  6. The archives are a secondary backup and will hopefully be needed no more than a couple of times a decade. A live, uncompressed backup exists.

With these assumptions, how do we protect against data corruption?

Storing the MD5 hash only lets someone know whether the current data matches the original data. It does not allow anyone, or help in any way, to repair the data. That is, if I need to restore from backup and there is data corruption in the file or files I need, an MD5 is effectively useless.

So is there a checksum that is specifically designed to not only verify data but repair it as well? Kind of like ECC for memory but for files?
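To illustrate the behaviour I am after, here is a rough sketch using the third-party reedsolo Python package, a standalone pure-Python Reed-Solomon codec (the values are arbitrary; this only shows the idea, not a production solution):

    from reedsolo import RSCodec

    rsc = RSCodec(10)                 # 10 ecc bytes per block -> corrects up to 5 corrupted bytes

    data = b"important archive bytes"
    protected = rsc.encode(data)      # original data with 10 ecc bytes appended

    # Simulate corruption of two bytes somewhere in the stored data.
    corrupted = bytearray(protected)
    corrupted[0] ^= 0xFF
    corrupted[5] ^= 0xFF

    result = rsc.decode(corrupted)
    # Recent reedsolo versions return (message, message + ecc, error positions);
    # older versions return just the message.
    recovered = result[0] if isinstance(result, tuple) else result
    assert bytes(recovered) == data   # the corrupted bytes were repaired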

Note: I did find parchive, but it does not seem to be current or reliably usable. While I may not like how they implemented things, in general parchive is exactly what I am looking for but cannot find. Does something parchive-like exist that is "production" ready?

Update: It looks as though some archive formats do support recovery records, although the only mainstream one seems to be WinRAR. It would be preferable not to get locked into a format simply for this one option, as most archiving formats (roughly 75% in the linked list) do not seem to support recovery.

  • ECC adds redundancy, while compressors tend to reduce it to a minimum. A 1-bit error in a compressed file is likely to alter several files. When the MD5s differ, which copy is the faulty one? :)
    – levif
    Commented Nov 11, 2016 at 23:43
  • Kind of my point, which is why I am looking for something that could rebuild the data outside of the archive, to avoid the archive-vs-raw-files issue. It would have to work at the bit level rather than the file level. It seems whatever it is would use Reed-Solomon error correction. But nothing I find seems to be user-friendly, simple, long-standing, and/or ready for use. Everything seems old or unsupported, complicated, etc.
    – Damon
    Commented Nov 12, 2016 at 0:58

1 Answer


I made a set of tools for this purpose, using Reed-Solomon, called pyFileFixity (see here for the list of tools included). It works mostly from the command line, but an experimental GUI is provided if you really want to try it (just pass --gui on the command line).

I made this tool to be open source and reliable; it is unit tested at 83% (branch coverage). The whole library is extensively commented, and I developed the Reed-Solomon codec myself, all in pure Python (so the whole project is standalone, with no external library), thus it is future-proof (as long as you have a Python 2 interpreter, though a Python 3 version is in the works). It should be production ready; I use it regularly myself, I have had several pieces of positive feedback, and any additional feedback is very welcome!

The ecc format I devised should be VERY stable and resilient against corruption, as it is even possible to correct the ecc files themselves (see repair_ecc.py and the index files). The project gives you everything both to curate your data AND to test your curation scheme (see filetamper.py and resiliency_tester.py): you can test the whole curation scheme using a makefile-like file describing it, so you can include your file conversions, zip compression, pyFileFixity ecc calculation or another ecc scheme, etc., and test whether your curation pipeline can withstand some amount of random data corruption.

However, the limitation is that the calculations take quite some time: the rate is currently ~1 MB/s, although I have plans to use parallelism to quadruple the speed. Still, you can see this as a limitation, and unfortunately I do not think there is any faster mature error-correcting code (Reed-Solomon is pretty much the only mature one; LDPC is coming, but it is not there yet).

An alternative, if you don't need to ensure WHOLE data integrity but rather most of it, is to use a non-solid archiving algorithm such as ZIP DEFLATE, and then compute the ecc hash only on the header using header_ecc.py (provided in pyFileFixity). This ensures that your archive will always be openable and that most of the data inside will remain decompressible, but it won't be able to correct all data corruption.
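A rough sketch of that header-only idea (this is NOT header_ecc.py's actual format or API, just the general principle, again using the standalone reedsolo codec; the sizes are arbitrary and kept to a single 255-byte Reed-Solomon block so the slicing stays simple):

    from reedsolo import RSCodec

    NSYM = 64                  # ecc bytes; corrects up to 32 corrupted header bytes
    HEADER_SIZE = 255 - NSYM   # 191 bytes: fits in one Reed-Solomon block

    rsc = RSCodec(NSYM)

    def protect_header(archive_path, ecc_path):
        """Store the ecc bytes of the archive's first HEADER_SIZE bytes in a sidecar file."""
        with open(archive_path, "rb") as f:
            header = f.read(HEADER_SIZE)
        encoded = rsc.encode(header)           # header followed by NSYM ecc bytes
        with open(ecc_path, "wb") as f:
            f.write(bytes(encoded[len(header):]))

    def repair_header(archive_path, ecc_path):
        """Correct a corrupted header in place, using the sidecar ecc bytes."""
        with open(ecc_path, "rb") as f:
            ecc = f.read()
        with open(archive_path, "r+b") as f:
            header = f.read(HEADER_SIZE)
            result = rsc.decode(bytearray(header) + bytearray(ecc))
            repaired = result[0] if isinstance(result, tuple) else result
            f.seek(0)
            f.write(repaired)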

There is also the DAR archive format, an alternative to TAR, which can compress in a non-solid fashion (so partial extraction of corrupted archives is possible), offers recovery hash calculation based on PAR2, and also offers catalogue isolation (i.e., metadata such as the directory tree saved separately as a backup). But honestly, I don't think you will gain much in terms of speed with PAR2, and you will lose a lot in terms of redundancy (the PAR2 format is also based on Reed-Solomon, but it has lots of limitations that my library does not have, and PAR2 is also kind of dead...).

So you have to weigh whether it costs you more to duplicate data (storage space) or to calculate an ecc hash (CPU time and electricity consumption). In terms of storage, the ecc hash can be any size you want, but usually 20%-30% is a LOT of protection (optical discs only have ~5%, hard drives have less, and it already works very well!).

If you reconsider duplication as a viable alternative, you can also correct your data, provided you make at least 3 copies of your archive. You can then use a bitwise majority vote to recover from data corruption (pyFileFixity provides a Python script to do it: replication_repair.py). This is not as resilient as an ecc code, even when the resiliency rate is the same: 3 copies give you a 33% resiliency rate (i.e., 2 redundant copies out of 3, divided by 2, which is the theoretical limit), but the window of protection is only 1 "ecc" (rather "spare") byte for every 3 bytes, whereas with a real ecc code using pyFileFixity or PAR2 the window is up to 255 bytes: you protect 255 bytes by assigning 168 of them as ecc bytes (so you have 255-168 = 87 bytes protected by 168 ecc bytes, at any point in any file). Indeed, the resiliency rate = 0.5 * ratio of ecc bytes, so you need a ratio of 66% ecc bytes to get a 33% resiliency rate. In the end, the duplication scheme takes 2x the size of the original archive to achieve a window of 1 protected byte in 3, whereas the ecc scheme takes only 0.66x additional space to protect 87 bytes in every 255. Intuitively, this means that:

  • for the duplication scheme, if more than 1 byte in a window is corrupted (i.e., the same byte position in more than one copy), that byte is lost.
  • whereas for the ecc scheme, you need more than 87 bytes corrupted in a row to lose data. If the corrupted bytes are spread over the whole archive, it's not a problem, because the 87-byte limit applies per window of 255 consecutive bytes.

So to summarize, ecc schemes are almost always more reliable because they have a bigger window size, but duplication is the fastest and, usually, cheapest way to have file correction (because storage is cheap nowadays).
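To make the duplication side of this concrete, here is a minimal sketch of a majority vote over three copies (just the principle behind replication_repair.py, not its actual code; it works byte by byte, assumes all copies have the same length, and the file names are invented):

    from collections import Counter

    def majority_repair(copy_paths, out_path, chunk_size=1024 * 1024):
        """Rebuild a file by keeping, at each position, the byte value most copies agree on."""
        files = [open(p, "rb") for p in copy_paths]
        try:
            with open(out_path, "wb") as out:
                while True:
                    chunks = [f.read(chunk_size) for f in files]
                    if not chunks[0]:
                        break
                    repaired = bytearray()
                    for column in zip(*chunks):
                        # A 2-out-of-3 majority wins; if all three copies disagree,
                        # the byte from the first copy is kept (arbitrary choice).
                        repaired.append(Counter(column).most_common(1)[0][0])
                    out.write(repaired)
        finally:
            for f in files:
                f.close()

    # e.g. majority_repair(["copy1.zipx", "copy2.zipx", "copy3.zipx"], "repaired.zipx")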

  • That is amazing. You are awesome. I did kind of give up and just resorted to trying out WinRAR, and will likely fork over $$$ to them, as it technically does what we need. Although, I appreciate your final takeaway plug for the ECC scheme vs. duplication. I am big on simple. Admittedly we will start building a tape library soon (need the tape drive) and will do more "duplication" and less archiving. But for some things it is nice to have an archive option that is robust enough to not need arbitrary duplicates.
    – Damon
    Commented Dec 29, 2016 at 11:41
  • @Damon Glad you found my answer useful :) About WinRAR, be careful: the .rar format is solid only, which means that if the recovery isn't successful, you won't be able to extract anything from the archive! You might try DAR, which is the only format I know of that supports both non-solid compression and ecc (so if the recovery fails, you can still extract what you can from the archive; this is generally called "partial extraction"), but the last time I tried DAR it was a bit complicated to make it work :/
    – gaborous
    Commented Dec 29, 2016 at 13:28
  • Thank you for the insight. It sounds like the real solution is to duplicate for simplicity and archive where necessary. I will definitely look into DAR. Thanks!
    – Damon
    Commented Dec 29, 2016 at 17:52
  • The WinRAR version I am using has an option for a solid archive, but it is not checked by default. It also has an option to add a recovery record, theoretically to protect against corruption. I am just using the highest compression setting available, with a recovery record. I just do not like being locked into proprietary software if I do not need to be.
    – Damon
    Commented Dec 29, 2016 at 18:11
  • @Damon Yes, I think this is a good strategy: duplication is fast and provides some repair capability, plus it will force you to store on different mediums, so that's good. If you want an extra layer of protection for virtually no cost, you can use pyFileFixity's header_ecc.py subtool to protect the magic bytes of your files. This should be fast to compute (because you will protect only the header, so ~1KB/file), and it will ensure that your files are always readable (even if corrupted, at least you won't get a "file is not an archive" error when you try to open them after corruption).
    – gaborous
    Commented Dec 29, 2016 at 18:14

