39

Can we say that a truncated md5 hash is still uniformly distributed?

To avoid misinterpretations: I'm aware the chance of collisions is much greater the moment you start to hack off parts from the md5 result; my use-case is actually interested in deliberate collisions. I'm also aware there are other hash methods that may be better suited to use-cases of a shorter hash (including, in fact, my own), and I'm definitely looking into those.

But I'd also really like to know whether md5's uniform distribution also applies to chunks of it. (Consider it a burning curiosity.)

Since mediawiki uses it (specifically, the left-most two hex-digits as characters of the result) to generate filepaths for images (e.g. /4/42/The-image-name-here.png) and they're probably also interested in an at least near-uniform distribution, I imagine the answer is 'yes', but I don't actually know.

4
  • While we're here, anyone have good link to a proof of the uniformity of non-truncated md5 sums?
    – naught101
    Commented Nov 24, 2013 at 9:41
  • @naught101: Since this question is rather old (by internet measure) and has an accepted answer, it's unlikely to get much more exposure from people who could answer your question - maybe make your own question? :)
    – pinkgothic
    Commented Nov 25, 2013 at 13:50
  • 1
    Nice follow-up: ECDF plot from a truncated MD5
    – BenMorel
    Commented Feb 19, 2019 at 13:41
  • @Benjamin Awesome, thank you for the link!
    – pinkgothic
    Commented Feb 19, 2019 at 16:08

2 Answers 2

34

Yes, not exhibiting any bias is a design requirement for a cryptographic hash. MD5 is broken from a cryptographic point of view however the distribution of the results was never in question.

If you still need to be convinced, it's not a huge undertaking to hash a bunch of files, truncate the output and use ent ( http://www.fourmilab.ch/random/ ) to analyze the result.

1
  • Much appreciated - this is exactly the sort of answer I was looking for.
    – pinkgothic
    Commented Nov 20, 2011 at 19:23
13

I wrote a little php-program to answer this question. It's not very scientific, but it shows the distribution for the first and the last 8 bits of the hashvalues using the natural numbers as hashtext. After about 40.000.000 hashes the difference between the highest and the lowest counts goes down to 1%, so I'd say the distribution is ok. I hope the code is more precise in explaining what was computed :-) Btw, with a similar program I found that the last 8 bits seem to be distributed slightly better than the first.

<?php
// Setup count-array:
for ($y=0; $y<16; $y++) {
  for ($x=0; $x<16; $x++) {
    $count[dechex($x).dechex($y)] = 0;
  }
}

$text = 1; // The text we will hash.
$hashCount = 0;
$steps = 10000;

while (1) {
  // Calculate & count a bunch of hashes:
  for ($i=0; $i<$steps; $i++) {   
    $hash = md5($text);
    $count[substr($hash, 0, 2)]++;
    $count[substr($hash, -2)]++;
    $text++;
  }
  $hashCount += $steps;

  // Output result so far:
  system("clear");
  $min = PHP_INT_MAX; $max = 0;
  for ($y=0; $y<16; $y++) {
    for ($x=0; $x<16; $x++) {  
      $n = $count[dechex($x).dechex($y)];
      if ($n < $min) $min = $n;
      if ($n > $max) $max = $n;
      print $n."\t";
    }
    print "\n";
  }
  print "Hashes: $hashCount, Min: $min, Max: $max, Delta: ".((($max-$min)*100)/$max)."%\n";
} 
?>
1
  • 1
    This is fantastic. Thank you! (I suppose I could/should have done this myself, really!)
    – pinkgothic
    Commented Feb 19, 2012 at 12:54

Not the answer you're looking for? Browse other questions tagged or ask your own question.