What is the fastest way to create a checksum for large files in C#

Question

I have to sync large files across some machines. The files can be up to 6 GB in size. The sync will be done manually every few weeks. I can't take the filenames into consideration because they can change at any time.

My plan is to create checksums on the destination PC and on the source PC, and then copy every source file whose checksum is not already present on the destination. My first attempt was something like this:

using System;
using System.IO;
using System.Security.Cryptography;

private static string GetChecksum(string file)
{
    // Dispose both the stream and the hash object when done.
    using (FileStream stream = File.OpenRead(file))
    using (SHA256Managed sha = new SHA256Managed())
    {
        byte[] checksum = sha.ComputeHash(stream);
        // BitConverter separates bytes with "-", so strip the separators.
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}

The problem was the runtime:
- SHA256 on a 1.6 GB file -> 20 minutes
- MD5 on a 1.6 GB file -> 6.15 minutes

Is there a better - faster - way to get the checksum (maybe with a better hash function)?

Do you really need to check the checksum? How are you copying the files? If you're on Windows I would use the latest version of Robocopy ... — Mesh, Commented Apr 29, 2010 at 16:13
Nice tip here to only bother hashing if the file sizes are different between 2 candidate files stackoverflow.com/a/288756/74585 — Matthew Lock, Commented Dec 1, 2014 at 9:05
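
Two files of different length cannot have the same content, so a size check can rule out many files before any hashing happens. A minimal sketch of the sync plan from the question with that pre-filter, assuming MD5 is acceptable for change detection; SyncPlanner, FilesToCopy and Md5Hex are hypothetical names used only for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

public static class SyncPlanner
{
    // Hypothetical helper: MD5 of a file as an upper-case hex string.
    private static string Md5Hex(string path)
    {
        using (FileStream stream = File.OpenRead(path))
        using (MD5 md5 = MD5.Create())
        {
            return BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", string.Empty);
        }
    }

    // Returns the source files whose content is not present anywhere under destDir.
    // File names are ignored; only size and checksum are compared.
    public static List<string> FilesToCopy(string sourceDir, string destDir)
    {
        // Group destination files by length. Only same-length files can match.
        Dictionary<long, List<string>> destBySize = Directory
            .EnumerateFiles(destDir, "*", SearchOption.AllDirectories)
            .GroupBy(p => new FileInfo(p).Length)
            .ToDictionary(g => g.Key, g => g.ToList());

        // Destination checksums, computed lazily per file size.
        var destHashes = new Dictionary<long, HashSet<string>>();
        var result = new List<string>();

        foreach (string src in Directory.EnumerateFiles(sourceDir, "*", SearchOption.AllDirectories))
        {
            long size = new FileInfo(src).Length;
            List<string> candidates;
            if (!destBySize.TryGetValue(size, out candidates))
            {
                result.Add(src); // no destination file of this size, so no hashing is needed
                continue;
            }

            HashSet<string> hashes;
            if (!destHashes.TryGetValue(size, out hashes))
            {
                hashes = new HashSet<string>(candidates.Select(p => Md5Hex(p)));
                destHashes[size] = hashes;
            }

            if (!hashes.Contains(Md5Hex(src)))
            {
                result.Add(src);
            }
        }
        return result;
    }
}
```

With this shape, a destination file is hashed only if some source file shares its size, which helps most when the bulk of the files differ in length.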

Nate Barbettini · Accepted Answer · 2015-10-05 03:46:12Z

133

The problem here is that SHA256Managed reads 4096 bytes at a time (inherit from FileStream and override Read(byte[], int, int) to see how much it reads from the filestream), which is too small a buffer for disk IO.

To speed things up (2 minutes to hash a 2 GB file on my machine with SHA256, 1 minute with MD5), wrap the FileStream in a BufferedStream and set a reasonably sized buffer (I tried a ~1 MB buffer):

// Disposing the BufferedStream also closes the wrapped FileStream, so one using block is enough
using (var stream = new BufferedStream(File.OpenRead(filePath), 1200000))
{
    // The rest remains the same
}
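
Another way to control the read size is to read the file in large chunks yourself and feed each chunk to the hash incrementally. A minimal sketch, assuming a 1 MB chunk is a sensible starting point; ChunkedHasher and HashFile are hypothetical names, and whether this beats the BufferedStream variant depends on the machine and the .NET version, so measure before relying on it:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class ChunkedHasher
{
    // Hypothetical helper: hashes a file by reading bufferSize bytes per call.
    public static string HashFile(string path, HashAlgorithm algorithm, int bufferSize = 1024 * 1024)
    {
        byte[] buffer = new byte[bufferSize];
        algorithm.Initialize(); // start from a clean state in case the instance was used before

        using (FileStream stream = File.OpenRead(path))
        {
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Feed the chunk into the running hash; no output buffer is needed.
                algorithm.TransformBlock(buffer, 0, read, null, 0);
            }
            // Finish the hash with an empty final block.
            algorithm.TransformFinalBlock(buffer, 0, 0);
        }

        return BitConverter.ToString(algorithm.Hash).Replace("-", string.Empty);
    }
}
```

Usage would look like `ChunkedHasher.HashFile(path, md5)` with `md5` created by `MD5.Create()` and disposed by the caller.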

edited Oct 5, 2015 at 3:46

Nate Barbettini


answered Jul 24, 2009 at 13:41

Anton Gogolev


4

OK - this made the difference - hashing the 1.6 GB file with MD5 took 5.2 seconds on my box (quad-core @2.6 GHz, 8 GB RAM) - even faster than the native implementation...
– crono
Commented Jul 24, 2009 at 14:19
4

I don't get it. I just tried this suggestion but the difference is minimal to nothing: a 1024 MB file takes 12-14 secs without buffering and also 12-14 secs with buffering. I understand that reading hundreds of 4k blocks will produce more IO, but I wonder whether the framework or the native APIs below it already handle this.
– Christian Casutt
Commented Feb 20, 2010 at 7:48
19

A little late to the party, but for FileStreams there is no longer any need to wrap the stream in a BufferedStream as it is nowadays already done in the FileStream itself. Source
– Reyhn
Commented Nov 2, 2016 at 10:47
I was just going through this issue with smaller files (<10MB, but taking forever to get an MD5). Even though I use .Net 4.5, switching to this method with the BufferedStream cut the hash time down from about 8.6 seconds to <300 ms for an 8.6MB file
– Taegost
Commented Jul 6, 2017 at 13:39
I used a BufferedStream /w 512 kB instead of 1024 kB. The 1.8 GB file was solved in 30 seconds.
– Hugo Woesthuis
Commented Nov 19, 2017 at 15:46


Binary Worrier · Accepted Answer · 2009-07-24 13:26:40Z

77

Don't checksum the entire file; create a checksum every 100 MB or so, so that each file has a collection of checksums.

Then when comparing checksums, you can stop comparing after the first different checksum, getting out early, and saving you from processing the entire file.

It'll still take the full time for identical files.
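
One way to sketch this idea: produce the per-chunk checksums lazily and compare two files chunk by chunk, stopping at the first mismatch. ChunkChecksums, Compute and SameContent are hypothetical names, MD5 is an assumption, and the default 100 MB chunk means one buffer of that size is allocated per file being read:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

public static class ChunkChecksums
{
    // Lazily yields one MD5 hex string per chunkSize bytes of the file.
    public static IEnumerable<string> Compute(string path, int chunkSize = 100 * 1024 * 1024)
    {
        byte[] buffer = new byte[chunkSize];
        using (FileStream stream = File.OpenRead(path))
        using (MD5 md5 = MD5.Create())
        {
            while (true)
            {
                // Read may return fewer bytes than requested, so fill the chunk in a loop.
                int filled = 0;
                while (filled < chunkSize)
                {
                    int read = stream.Read(buffer, filled, chunkSize - filled);
                    if (read == 0) break;
                    filled += read;
                }
                if (filled == 0) yield break; // end of file

                yield return BitConverter.ToString(md5.ComputeHash(buffer, 0, filled))
                    .Replace("-", string.Empty);

                if (filled < chunkSize) yield break; // last, partial chunk
            }
        }
    }

    // Compares two files chunk by chunk and stops at the first differing checksum.
    public static bool SameContent(string pathA, string pathB, int chunkSize = 100 * 1024 * 1024)
    {
        using (IEnumerator<string> a = Compute(pathA, chunkSize).GetEnumerator())
        using (IEnumerator<string> b = Compute(pathB, chunkSize).GetEnumerator())
        {
            while (true)
            {
                bool hasA = a.MoveNext();
                bool hasB = b.MoveNext();
                if (hasA != hasB) return false; // different number of chunks
                if (!hasA) return true;         // both ended without a mismatch
                if (a.Current != b.Current) return false;
            }
        }
    }
}
```

Because the sequence is lazy, files that differ in the first chunk are rejected after reading only that chunk from each; identical files are still read in full.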

answered Jul 24, 2009 at 13:26

Binary Worrier


2

I like the idea, but it will not work in my scenario because I will end up with a lot of unchanged files over the time.
– crono
Commented Jul 24, 2009 at 14:06
2

how do you checksum every 100mb of a file?
– Smith
Commented Aug 10, 2016 at 16:46
1

Not a good idea when using the checksum for security reasons, because an attacker can just change the bytes you have excluded.
– b.kiener
Commented Aug 8, 2018 at 10:49
3

+1 This is an excellent idea when you are performing a one-to-one comparison. Unfortunately, I'm using the MD5 hash as an index to look for unique files among many duplicates (many-to-many checks).
– Nathan Goings
Commented Aug 23, 2018 at 20:15
2

@b.kiener No byte is excluded. You misunderstood him.
– Soroush Falahati
Commented Jan 23, 2019 at 17:08


StayOnTarget · Accepted Answer · 2019-09-03 13:24:15Z

58

As Anton Gogolev noted, FileStream reads 4096 bytes at a time by default, but you can specify any other value using the FileStream constructor:

new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 16 * 1024 * 1024)

Note that Brad Abrams from Microsoft wrote in 2004:

there is zero benefit from wrapping a BufferedStream around a FileStream. We copied BufferedStream’s buffering logic into FileStream about 4 years ago to encourage better default performance

source
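
Put together, the constructor call above might be used like this. A minimal sketch: SequentialHasher and Md5OfFile are hypothetical names, the 1 MB default is only a starting point, and FileOptions.SequentialScan is an addition that is not part of this answer (it hints to the operating system that the file will be read front to back):

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class SequentialHasher
{
    // Hypothetical helper: MD5 of a file, read through a FileStream with a custom buffer size.
    public static string Md5OfFile(string path, int bufferSize = 1024 * 1024)
    {
        using (var stream = new FileStream(
            path,
            FileMode.Open,
            FileAccess.Read,
            FileShare.ReadWrite,
            bufferSize,
            FileOptions.SequentialScan)) // assumption: sequential-read hint, not from the answer
        using (MD5 md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", string.Empty);
        }
    }
}
```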

edited Sep 3, 2019 at 13:24

StayOnTarget


answered Jan 17, 2015 at 13:42

Tal Aloni



Christian Birkl · Accepted Answer · 2009-07-24 13:37:00Z

24

Invoke the Windows port of md5sum.exe. It's about twice as fast as the .NET implementation (at least on my machine, using a 1.2 GB file).

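// Requires: using System.Diagnostics;
public static string Md5SumByProcess(string file) {
    using (var p = new Process()) {
        p.StartInfo.FileName = "md5sum.exe";
        // Quote the path so file names containing spaces arrive as one argument
        p.StartInfo.Arguments = "\"" + file + "\"";
        p.StartInfo.UseShellExecute = false;
        p.StartInfo.RedirectStandardOutput = true;
        p.Start();
        // Read before waiting: waiting first can deadlock if the child fills the redirected output buffer
        string output = p.StandardOutput.ReadToEnd();
        p.WaitForExit();
        // The output line can start with '\' when the file name contains backslashes; strip it if present
        return output.Split(' ')[0].TrimStart('\\').ToUpper();
    }
}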

answered Jul 24, 2009 at 13:37

Christian Birkl


3

WOW - using md5sums.exe from pc-tools.net/win32/md5sums makes it really fast. 1681457152 bytes, 8672 ms = 184.91 MB/sec -> 1,6GB ~ 9 seconds This will be fast enough for my purpose.
– crono
Commented Jul 24, 2009 at 13:59


Community · Accepted Answer · 2017-05-23 10:31:20Z

18

OK - thanks to all of you - let me wrap this up:

- Using a "native" exe to do the hashing cut the time from 6 minutes to 10 seconds, which is huge.
- Increasing the buffer was even faster: the 1.6 GB file took 5.2 seconds using MD5 in .NET, so I will go with this solution. Thanks again.

edited May 23, 2017 at 10:31

CommunityBot


answered Jul 24, 2009 at 14:26

crono



Bobrovsky · Accepted Answer · 2012-10-07 19:45:02Z

10

I ran tests with different buffer sizes, using this code:

using (var stream = new BufferedStream(File.OpenRead(file), bufferSize))
{
    SHA256Managed sha = new SHA256Managed();
    byte[] checksum = sha.ComputeHash(stream);
    return BitConverter.ToString(checksum).Replace("-", String.Empty).ToLower();
}

I tested with a 29.5 GB file. Buffer sizes are in bytes and times are in seconds. The results were:

10,000: 369.24 s
100,000: 362.55 s
1,000,000: 361.53 s
10,000,000: 434.15 s
100,000,000: 435.15 s
1,000,000,000: 434.31 s
The original, unbuffered code took 376.22 s.

I am running an i5 2500K CPU, 12 GB RAM and an OCZ Vertex 4 256 GB SSD.

So I thought: what about a standard 2 TB hard drive? The results were:

10,000: 368.52 s
100,000: 364.15 s
1,000,000: 363.06 s
10,000,000: 678.96 s
100,000,000: 617.89 s
1,000,000,000: 626.86 s
Unbuffered: 368.24 s

So I would recommend either no buffer or a buffer of at most 1 million bytes.
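
A small harness along these lines could be used to repeat the measurement on other hardware. This is a sketch, not the code behind the numbers above: BufferBenchmark, TimeHash and Run are hypothetical names, and the buffer sizes listed are just examples. Note that after the first pass the operating system may serve the file from its cache, which can make later passes look faster than the disk really is.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Security.Cryptography;

public static class BufferBenchmark
{
    // Times one SHA-256 pass over the file with the given BufferedStream size in bytes.
    public static TimeSpan TimeHash(string path, int bufferSize)
    {
        Stopwatch watch = Stopwatch.StartNew();
        using (var stream = new BufferedStream(File.OpenRead(path), bufferSize))
        using (SHA256 sha = SHA256.Create())
        {
            sha.ComputeHash(stream);
        }
        watch.Stop();
        return watch.Elapsed;
    }

    // Prints one line per buffer size: size in bytes, then elapsed seconds.
    public static void Run(string path)
    {
        foreach (int size in new[] { 10000, 100000, 1000000, 10000000 })
        {
            Console.WriteLine("{0,12:N0} bytes: {1:F2} s", size, TimeHash(path, size).TotalSeconds);
        }
    }
}
```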

edited Oct 7, 2012 at 19:45

Bobrovsky


answered Oct 7, 2012 at 19:38

Anders


1

I don't get it. How can this test contradict the accepted answer from Anton Gogolev?
– buddybubble
Commented Jun 18, 2014 at 11:41
Can you add description of each field in your data?
– videoguy
Commented Sep 28, 2015 at 17:43


Romil Kumar Jain · Accepted Answer · 2019-03-16 13:46:33Z

I know I am late to the party, but I ran a test before actually implementing the solution.

I tested the built-in MD5 class against md5sum.exe. In my case the built-in class took 13 seconds, whereas md5sum.exe took around 16-18 seconds on every run.

    // Stopwatch is more precise for timing than subtracting DateTime.Now values
    var watch = System.Diagnostics.Stopwatch.StartNew();
    string file = @"C:\text.iso"; // a 2.5 GB file
    string output;
    using (var md5 = MD5.Create())
    using (var stream = File.OpenRead(file))
    {
        byte[] checksum = md5.ComputeHash(stream);
        output = BitConverter.ToString(checksum).Replace("-", String.Empty).ToLower();
        Console.WriteLine("Total seconds : " + watch.Elapsed.TotalSeconds + " " + output);
    }

Fabske · Accepted Answer · 2020-01-22 00:59:24Z

4

You can have a look at XxHash.Net ( https://github.com/wilhelmliao/xxHash.NET )
The xxHash algorithm seems to be faster than all the others.
There are some benchmarks on the xxHash site: https://github.com/Cyan4973/xxHash

PS: I have not used it yet.

answered Jan 22, 2020 at 0:59

Fabske



Pasi Savolainen · Accepted Answer · 2009-07-24 13:56:37Z

2

You're doing something wrong (probably a read buffer that is too small). On a machine of indecent age (Athlon 2x1800MP from 2002) whose disk DMA is probably out of whack (6.6 MB/s is damn slow for sequential reads):

Create a 1G file with "random" data:

# dd if=/dev/sdb of=temp.dat bs=1M count=1024    
1073741824 bytes (1.1 GB) copied, 161.698 s, 6.6 MB/s

# time sha1sum -b temp.dat
abb88a0081f5db999d0701de2117d2cb21d192a2 *temp.dat

1m5.299s

# time md5sum -b temp.dat
9995e1c1a704f9c1eb6ca11e7ecb7276 *temp.dat

1m58.832s

This is also weird, md5 is consistently slower than sha1 for me (reran several times).

answered Jul 24, 2009 at 13:56

Pasi Savolainen


Yes - I will try to increase the buffer, as Anton Gogolev suggested. I ran it through a "native" MD5.exe, which took 9 seconds with a 1.6 GB file.
– crono
Commented Jul 24, 2009 at 14:04

