
i want to duplicate the entire contents of wikimedia's dumps, monthly, and upload them to amazon glacier and/or tape backup and/or a third backup medium that can survive corporate collapse/electromagnetic pulse (if such a medium one day exists).

the problem is disk space, time is not so much an issue.

i've found that when i decompress the contents of a monthly database dump of simplewiki and recompress it using 7-zip, i end up with about a 3% compression ratio.

i suspect that recompressing a year's worth of dumps as a single solid archive would give significantly better compression, since it should only need to store the changes (or would it?). but for data integrity's sake this is probably a bad idea without redundancy/recovery data, which 7-zip doesn't seem to support. i'm considering keeping two extra duplicates for this purpose and still using 7-zip.

the problem with decompressing the files for backup is that i lose the hash/checksum published on the original dump site.

i want to decompress a file and then reproduce the same hash/checksum by compressing it again using the same version of gzip/bzip2 that was used to compress it, along with the same options/method.

is this possible? do i need to 'fake' the modification timestamps? how do i determine what options were used?
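as a small test of the timestamp question, here's a python sketch (standard library only, python 3.8+ for the mtime argument; the data is just a placeholder): the same data at the same compression level, with only the gzip header's modification time changed, already hashes differently, while the decompressed data stays identical.

import gzip, hashlib

data = b"<page>the same dump contents</page>\n" * 1000

# same data, same compression level, only the header's modification time differs
a = gzip.compress(data, compresslevel=9, mtime=0)
b = gzip.compress(data, compresslevel=9, mtime=1613800000)

print(hashlib.sha256(a).hexdigest() == hashlib.sha256(b).hexdigest())   # False
print(gzip.decompress(a) == gzip.decompress(b))                         # True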

  • Why not switch to checksumming the uncompressed files? Compressors don't bother to produce the same compressed stream, but they will always unpack it to the same stream as it was before. Commented Feb 20, 2021 at 7:00
  • i want the checksums to match the original server's record of checksums. the server doesn't provide the checksums of the uncompressed files. i've been experimenting with bzip2 and using the option -2 gets close to the filesize of the original, but not exact. ideally, i want to scrape all the .xml and .sql files as .html, and then be able to reproduce the original .xml and .sql files, but i understand this is unlikely, instead i'll settle for reproducing the original .bz2.
    – loud_flash
    Commented Feb 20, 2021 at 7:28
  • You can check the original checksum ONCE, generate per-file checksums of the uncompressed data, store them, and rely on them. Commented Feb 20, 2021 at 12:06
  • @NikitaKipriyanov this will have to be what i do. in terms of the script that i will write (see the rough sketch after these comments), i will have to make sure the extracted contents are kept intact without errors. this isn't something i can do anytime soon, as we're talking about 150tb every 2 months, and i don't have that kind of storage yet
    – loud_flash
    Commented Feb 20, 2021 at 21:30
  • my idea was that i could download all 150tb every month, and because they're wikis, most of the data would be the same in each successive backup, and only the changes would be stored. however, this isn't the case: compressing the last 5 backups of a wiki uses only slightly less storage than compressing each backup on its own, which is disappointing.
    – loud_flash
    Commented Feb 20, 2021 at 21:46
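A rough sketch of that script, for illustration only: verify the compressed file against the dump site's published checksum once, then record a checksum of the uncompressed stream and rely on that from then on. The file name is a placeholder, it assumes a single .bz2 file, and which digest the site publishes is an assumption to adjust.

import bz2, hashlib, sys

def digest_of(stream, algo="sha256", chunk_size=1 << 20):
    # hash in 1 MiB chunks so a multi-GB dump never has to fit in memory
    h = hashlib.new(algo)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        h.update(chunk)
    return h.hexdigest()

dump = sys.argv[1]   # path to whatever .bz2 dump file was downloaded

# 1) digest of the compressed file: compare against the dump site's list ONCE
with open(dump, "rb") as f:
    print("compressed  ", digest_of(f, "sha1"))    # adjust to whichever digest the site lists

# 2) digest of the uncompressed stream: store this and rely on it afterwards
with bz2.open(dump, "rb") as f:
    print("uncompressed", digest_of(f, "sha256"))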

1 Answer


You are somewhat missing the meaning of the word "same", and what exactly you are comparing and hashing. Say you have two different compressed archive files which, after unpacking, both produce files that compare as identical. Let's call this unpacked content "the same data". This is entirely possible, as you already know.

The direct consequence of this "sameness" of the unpacked data is that hashes of that data will also compare as identical, i.e. the hashes themselves will be "the same data".

Could this mean the archives themselves are "the same data"?

NO.

A compressed archive is a representation of the data, one that requires less storage, which is why we use it. A compressed data stream can be viewed as a set of instructions to be processed by a special program, the decompressor, which follows those instructions and (hopefully) reconstructs the original data.

For example, here is the idea behind the LZ77 algorithm: produce a stream which directly contains raw characters and may also contain a magic instruction understood as "look back N bytes and copy M bytes starting from there". Indeed, you can produce several different sets of such instructions which still output the same data; for example, the string "abababab" could be stored literally, or as "ab{-2,2}{-4,2}{-6,2}", "ab{-2,2}{-2,2}{-2,2}" or "ab{-2,6}" (this last one is a very clever trick). Are those "compressed" representations the same? No, they have different lengths and contain different numbers as parameters of the "magic instruction"; overall they are different sets of instructions which nevertheless reproduce the same original data. (Note: to store such compressed streams in a file you would use some efficient binary packing instead of curly brackets and ASCII digits. To keep you intrigued: I have just described half of the good old ZIP algorithm.)
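If it helps, here is a toy decoder for the bracket notation above (an illustration only, nothing like a real LZ77 implementation); it shows that all three different instruction streams unpack to the same string.

import re

def decode(stream):
    # literals are copied as-is; "{-N,M}" means "go back N characters and copy M of them"
    out = []
    for literal, back, count in re.findall(r"([^{]+)|\{(-\d+),(\d+)\}", stream):
        if literal:
            out.extend(literal)
        else:
            start = len(out) + int(back)       # back is negative, so this points backwards
            for i in range(int(count)):
                out.append(out[start + i])     # copying one by one lets the copy overlap itself
    return "".join(out)

# three different "compressed" representations, one and the same result
for s in ("ab{-2,2}{-4,2}{-6,2}", "ab{-2,2}{-2,2}{-2,2}", "ab{-2,6}"):
    print(s, "->", decode(s))                  # every line ends with abababab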

Another example is programming languages. Compare:

#include <stdio.h>
int main(void) {
    printf("Hello World!\n");
    return 0;
}
begin
  WriteLn('Hello World!')
end.

Do these programs produce the same output? Yes, sure. Are they the same? No, they are in different languages!


You have "same data" in archives, but archives themselves are sets of instructions and those instructions are different, despite the fact following them you'll end with the "same data". When you you checksum archives, you checksum those instructions themselves, not the result of following them. So you'll get different hashes.

A compressor is a state-of-the-art program that tries to find a near-optimal set of instructions which will reproduce the given data. It is highly optimized to make the most of the computer it runs on. Because of this, the output depends on the environment: the particular set of instructions (the compressed file) it produces may depend on many factors, including the version of the compressor, the memory available during compression, the number and type of available processor cores, and so on. Some compression algorithms even depend on true randomness! There is a countless number of such environments, so it is extremely hard to find "the" environment in which a given archive file was produced and reproduce it again.

You are solving the wrong task. If you want to checksum/hash something, it must be something you can recreate exactly. The compressed stream is not something you can recreate exactly; the uncompressed data is.
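To make this concrete, here is a small experiment with Python's bz2 module (the data is just a placeholder): the same data compressed with two different block sizes yields two different archives whose hashes differ, yet both decompress to data whose hashes match.

import bz2, hashlib

data = b"pretend this is a multi-gigabyte wiki dump\n" * 10000

small_blocks = bz2.compress(data, compresslevel=1)   # roughly bzip2 -1 (100 kB blocks)
large_blocks = bz2.compress(data, compresslevel=9)   # roughly bzip2 -9 (900 kB blocks)

# different instruction streams, therefore different archive hashes
print(hashlib.sha256(small_blocks).hexdigest())
print(hashlib.sha256(large_blocks).hexdigest())

# but both decode to the same original data, so the content hashes match
print(hashlib.sha256(bz2.decompress(small_blocks)).hexdigest() ==
      hashlib.sha256(bz2.decompress(large_blocks)).hexdigest())   # True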

  • gzip seems to only use a single thread, but i'm unsure about bzip2. so, you're saying that unless i can reproduce the same environment the original files were compressed in (same timestamp, memory available, cpu cores), i can't reproduce the bzip2 or gzip file?
    – loud_flash
    Commented Feb 20, 2021 at 21:03
  • "Some compression algorithms even depend on true randomness!" and "The compressed stream is not what you can recreate exactly." must mean that i can't reproduce them, even with the same compression programs. perhaps it is indeed time to give up.
    – loud_flash
    Commented Feb 20, 2021 at 21:08
  • You can hardly even know the particular version (with exactly the same patches), CPU and OS. Also, if there is TAR before GZIP, it turns a set of filesystem objects into a single file and puts metadata in there too, so access modes, timestamps, directories, etc. get compressed as well. The order in which files are inserted also matters. // Sometimes you can, though. Google "compression competition" to see how unusual compression algorithms can become. Commented Feb 21, 2021 at 5:38
  • how would i take advantage of repeated data? i was hoping i could store only the changes in the incremental monthly dumps. i've experimented with dedupe on a filesystem, but from what i've read, dedupe works at block level, so changes within individual large files don't make much (if any) difference to the storage used. however, if i extract all the .html files from the dumps, would that then use less storage?
    – loud_flash
    Commented Feb 22, 2021 at 0:18
  • Store diffs against the unpacked data, then pack those diffs (roughly as sketched below). Commented Feb 22, 2021 at 6:06
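A minimal sketch of that idea, assuming last month's uncompressed dump is kept as a baseline and using only the Python standard library (real delta tools such as xdelta3 or rdiff would do this far more efficiently): describe the new dump as edits against the old one, store only the packed edits, and rebuild the new dump from the old one plus the edits. The dump data here is a toy placeholder.

import bz2, difflib, json

def make_delta(old_lines, new_lines):
    # describe new_lines as a mix of "copy this span of the old file" and literal new lines
    ops = []
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(["copy", i1, i2])
        else:
            ops.append(["data", new_lines[j1:j2]])
    return ops

def apply_delta(old_lines, ops):
    out = []
    for op in ops:
        if op[0] == "copy":
            out.extend(old_lines[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out

# toy stand-ins for two successive monthly dumps (placeholder data, not real dumps)
old = ["<page>line %d</page>\n" % i for i in range(10000)]
new = old[:5000] + ["<page>edited</page>\n"] + old[5001:]

delta = make_delta(old, new)
packed_delta = bz2.compress(json.dumps(delta).encode())
packed_full = bz2.compress("".join(new).encode())
print(len(packed_delta), "bytes vs", len(packed_full), "bytes")   # the delta is far smaller

assert apply_delta(old, delta) == new   # the old dump plus the delta rebuilds the new dump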
