2

I constantly transfer disk images and virtual machine images (usually 800 GB to nearly 1 TB per file) to a cloud server via rclone over SSH, and I wonder how reliable sha1sum and md5sum are when it comes to verifying the integrity of very large files.

I found this: How can I verify that a 1TB file transferred correctly?

However, it is about performance rather than the reliability of the generated hashes.

Is it possible for another file to end up with the same hash, considering how many distinct files are out there?

So how reliable are MD5 and SHA-1 sums on very large files? Thanks.

I also found this regarding collisions: https://stackoverflow.com/questions/4032209/is-md5-still-good-enough-to-uniquely-identify-files

https://www.theregister.co.uk/2017/02/23/google_first_sha1_collision/

2
  • They are, unless you are very unlucky or someone puts a lot of effort into it (for SHA1). With MD5 the effort is significantly lower. If you are worried, go for the SHA2 or SHA3 variants.
    – Jakuje
    Commented Mar 5, 2017 at 15:58
  • 1
    see also pigeonhole principle and birthday problem. for transfer verification purposes, either algorithm will work as a first step -- pigeonhole tells us a nonmatching sum is definitely not the same file, but does not prove that a matching sum is definitely the same.
    – quixotic
    Commented Mar 5, 2017 at 16:06

1 Answer

5

MD5 and SHA-1 are both fine for detecting accidental damage or changes to files. The probability of an accidentally changed file having the same MD5 digest is one in 2^128 = 340,282,366,920,938,463,463,374,607,431,768,211,456. The probability of an accidental SHA-1 collision is even smaller, one in 2^160. If we're talking about finding accidental matches among a collection of files (known as the birthday problem), you'd need about 2^64 ≈ 18 billion billion files before an MD5 collision becomes likely. Note that the size of the files does not matter; it's the number of files involved that matters.
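To put rough numbers on the birthday problem, here is a small Python sketch. It assumes the digests behave like uniformly random values (only reasonable for accidental collisions, not crafted ones) and uses the standard approximation p ≈ 1 - exp(-n(n-1)/2 / 2^bits). The function names are made up for this illustration.

```python
import math


def birthday_collision_probability(num_files: int, digest_bits: int) -> float:
    """Approximate chance that at least two of num_files share a digest.

    Assumes digests are uniformly random digest_bits-bit values.
    """
    pairs = num_files * (num_files - 1) // 2
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-pairs / 2**digest_bits)


def files_for_collision_chance(digest_bits: int, chance: float = 0.5) -> float:
    """Approximate number of files needed to reach the given collision chance."""
    return math.sqrt(2 * 2**digest_bits * math.log(1 / (1 - chance)))


if __name__ == "__main__":
    for name, bits in (("MD5", 128), ("SHA-1", 160)):
        print(name, "with a billion files:",
              birthday_collision_probability(10**9, bits))
        print(name, "files needed for a 50% chance:",
              files_for_collision_chance(bits))
```

With 2^64 files and a 128-bit digest the approximation gives roughly 0.39, which is where the "about 2^64" figure comes from. File size never appears in the formula, only the number of files and the digest length.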

But neither MD5 nor SHA-1 is sufficient to protect against malicious substitution of files, or to provide a reliable unique ID for files. For example, if you use either one, someone could give you one file, have you calculate the hash digest, then trick you by swapping it for another file with the same hash. Or submit two files with the same hash, which might confuse your system.

BTW, the accidental/malicious distinction is a bit loose. Suppose someone found the two PDFs that Google produced with the same SHA-1 hash, thought "That's cool! I should save these for later", and then tried to use your system to store and distribute them... thus breaking the system sort-of by accident. If something like that is conceivable, you're better off going with SHA-256 instead.
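For the transfer check in the question, a minimal Python sketch of hashing a very large file with SHA-256 could look like the following. It reads the file in fixed-size chunks so memory use stays flat regardless of file size. The function names, the 1 MiB chunk size, and the idea of comparing against a hex digest computed on the other side (for example by sha256sum) are assumptions for illustration, not part of any particular tool's workflow.

```python
import hashlib


def file_digest(path: str, algorithm: str = "sha256",
                chunk_size: int = 1024 * 1024) -> str:
    """Return the hex digest of a file, reading it chunk by chunk."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_expected(path: str, expected_hex: str,
                     algorithm: str = "sha256") -> bool:
    """Compare a local file's digest with a hex digest obtained elsewhere."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

A call such as `matches_expected("disk.img", remote_hex)` (both names hypothetical) would then tell you whether the local copy has the same SHA-256 digest as the one computed on the server. Passing `"md5"` or `"sha1"` as the algorithm works the same way if accidental corruption is the only concern.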

EDIT: BitErrant is similar to what I described in the last paragraph: it's an exploit against BitTorrent, taking advantage of the fact that BitTorrent uses SHA-1 checksums as IDs for chunks of files.
