144

As I understand it, when Git assigns a SHA1 hash to a file, that hash is determined by the file's contents.

As a result, if a file moves from one repository to another, its SHA1 remains the same because its contents have not changed.

How does Git calculate the SHA1 digest? Does it do it on the full uncompressed file contents?

I would like to emulate Git's SHA1 assignment outside of Git.

3

13 Answers

265

This is how Git calculates the SHA1 for a file (or, in Git terms, a "blob"):

sha1("blob " + filesize + "\0" + data)

So you can easily compute it yourself without having Git installed. Here filesize is the length of the data in bytes, written as a decimal number, and "\0" is a single NUL byte, not a two-character string.

For example, the hash of an empty file:

sha1("blob 0\0") = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"

$ touch empty
$ git hash-object empty
e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

Another example:

sha1("blob 7\0foobar\n") = "323fae03f4606ea9991df8befbb2fca795e648fa"

$ echo "foobar" > foo.txt
$ git hash-object foo.txt 
323fae03f4606ea9991df8befbb2fca795e648fa

Here is a Python implementation:

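from hashlib import sha1

def githash(data):
    # data must be bytes, e.g. read from a file opened in 'rb' mode
    s = sha1()
    # bytes header: works on Python 2 and on Python 3.5+
    s.update(b"blob %d\0" % len(data))
    s.update(data)
    return s.hexdigest()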
9
  • Is this answer assuming Python 2? When I try this on Python 3 I get a TypeError: Unicode-objects must be encoded before hashing exception on the first s.update() line.
    – Mark Booth
    Commented Jun 10, 2013 at 0:05
  • 3
    With python 3 you need to encode the data: s.update(("blob %u\0" % filesize).encode('utf-8')) to avoid the TypeError.
    – Mark Booth
    Commented Jun 10, 2013 at 20:35
  • Encoding as utf-8 will work, but probably better to just build it from a byte string in the first place (the utf-8 encoding works because none of the unicode characters are non-ASCII).
    – torek
    Commented Apr 15, 2016 at 22:05
  • One additional thing worth mentioning is that git hash-object also seems to replace "\r\n" with "\n" in the contents of data. It might very well strip the "\r" characters entirely; I didn't check that.
    – user420667
    Commented May 24, 2016 at 17:03
  • 1
    I put a Python 2 + 3 (both in one) implementation of a file and tree hash generator up here: github.com/chris3torek/scripts/blob/master/githash.py (the tree hasher reads a directory tree).
    – torek
    Commented Nov 14, 2016 at 9:37
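
The comments above come down to two points: build the header from bytes, and hash exactly the bytes you are given. Here is a minimal Python 3 sketch, assuming the input is already a bytes object (the name git_blob_sha1 is made up for illustration). It does no line-ending conversion, so CRLF and LF input hash differently. Any conversion you see from git hash-object presumably comes from Git's own configuration (for example core.autocrlf), not from the hash formula itself.

```python
import hashlib


def git_blob_sha1(data: bytes) -> str:
    """Return the Git blob ID for raw bytes (illustrative helper, not part of Git)."""
    # The header is pure ASCII, so it can be assembled from bytes directly.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


# The two inputs quoted in the answer above.
print(git_blob_sha1(b""))          # empty file
print(git_blob_sha1(b"foobar\n"))  # what `echo "foobar"` writes

# No newline translation happens here, so these two IDs differ.
print(git_blob_sha1(b"foobar\r\n") == git_blob_sha1(b"foobar\n"))  # False
```

Passing a str instead of bytes raises a TypeError on Python 3, which is the intended behaviour: the caller has to choose an encoding explicitly.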
19

A little goodie: in shell

echo -en "blob ${#CONTENTS}\0$CONTENTS" | sha1sum
4
  • 1
    I'm comparing echo -en "blob ${#CONTENTS}\0$CONTENTS" | sha1sum to the output of git hash-object path-to-file and they produce different results. However, echo -e ... produces the correct results, except there is a trailing - ( git hash-object produces no trailing characters). Is this something I should worry about? Commented Feb 10, 2015 at 21:09
  • 2
    @FrustratedWithFormsDesigner: The trailing - is used by sha1sum if it computed the hash from stdin and not from a file. Nothing to worry about. Weird thing though about the -n, that should suppress the newline normally appended by echo. Does your file by any chance have an empty last line, which you forgot to add in your CONTENTS variable?
    – knittl
    Commented Feb 11, 2015 at 6:38
  • Yes, you're correct. And I'd thought that the output of sha1sum should only be the hash, but it's not hard to remove it with sed or something. Commented Feb 11, 2015 at 14:33
  • @FrustratedWithFormsDesigner: You'll get the same output if you use cat file | sha1sum instead of sha1sum file (more processes and piping though)
    – knittl
    Commented Feb 11, 2015 at 14:37
10

You can make a bash shell function to calculate it quite easily if you don't have git installed.

git_id () { printf 'blob %s\0' "$(ls -l "$1" | awk '{print $5;}')" | cat - "$1" | sha1sum | awk '{print $1}'; }
1
  • 1
    A bit shorter: (stat --printf="blob %s\0" "$1"; cat "$1") | sha1sum -b | cut -d" " -f1.
    – sschuberth
    Commented Apr 19, 2016 at 15:40
4

Take a look at the man page for git-hash-object. You can use it to compute the git hash of any particular file. I think that git feeds more than just the contents of the file into the hash algorithm, but I don't know for sure, and if it does feed in extra data, I don't know what it is.
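
To make the "more than just the contents" point concrete, here is a small Python sketch that compares a plain SHA1 of an empty file with the value git hash-object reports for it. The "blob <size>\0" header used below is the one described in the accepted answer of this thread; the variable names are only illustrative.

```python
import hashlib

data = b""  # contents of an empty file

# Plain SHA-1 over the contents only.
plain = hashlib.sha1(data).hexdigest()

# Git-style blob ID: SHA-1 over "blob <size>\0" followed by the contents.
git_id = hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()

print(plain)   # da39a3ee5e6b4b0d3255bfef95601890afd80709
print(git_id)  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

The two digests differ even though the file contents are identical, so the extra header is what separates a Git blob ID from a plain sha1sum.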

2
/// Calculates the SHA1 for a given string
let calcSHA1 (text:string) =
    text 
      |> System.Text.Encoding.ASCII.GetBytes
      |> (new System.Security.Cryptography.SHA1CryptoServiceProvider()).ComputeHash
      |> Array.fold (fun acc e -> 
           let t = System.Convert.ToString(e, 16)
           if t.Length = 1 then acc + "0" + t else acc + t) 
           ""
/// Calculates the SHA1 like git
let calcGitSHA1 (text:string) =
    let s = text.Replace("\r\n","\n")
    sprintf "blob %d%c%s" (s.Length) (char 0) s
      |> calcSHA1

This is a solution in F#.

6
  • I still have problems with umlauts: calcGitSHA1("ü").ShouldBeEqualTo("0f0f3e3b1ff2bc6722afc3e3812e6b782683896f") But my function gives 0d758c9c7bc06c1e307f05d92d896aaf0a8a6d2c. Any ideas how git hash-object handles umlauts?
    – forki23
    Commented Feb 24, 2010 at 11:38
  • it should handle the blob as a bytestream, that means ü has probably length 2 (unicode), F♯’s Length property will return length 1 (because it's only one visible character)
    – knittl
    Commented Feb 24, 2010 at 11:47
  • But System.Text.Encoding.ASCII.GetBytes("ü") returns a byte array with 1 element.
    – forki23
    Commented Feb 24, 2010 at 11:52
  • Using UTF8 and 2 as string length gives a byte array: [98; 108; 111; 98; 32; 50; 0; 195; 188] and therefore a SHA1 of 99fe40df261f7d4afd1391fe2739b2c7466fe968. Which is also not the git SHA1.
    – forki23
    Commented Feb 24, 2010 at 12:24
  • 1
    You must never apply digests to character strings. Instead you must apply them to byte strings (byte arrays) which you may obtain by converting a character string to bytes using an explicit encoding.
    – dolmen
    Commented Aug 6, 2011 at 9:33
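
The character-versus-byte distinction discussed in these comments can be illustrated with a short Python sketch. It assumes the file is stored as UTF-8 with no trailing newline, which may not match the file that was actually hashed in the question; a different encoding or a final newline would change the result.

```python
import hashlib

text = "ü"
raw = text.encode("utf-8")  # assumption: the file on disk is UTF-8

print(len(text))  # 1 (characters)
print(len(raw))   # 2 (bytes): this is the number that goes into the header

# Build the blob from bytes only; the size is the byte count, not the character count.
blob = b"blob " + str(len(raw)).encode("ascii") + b"\0" + raw
print(list(blob))  # [98, 108, 111, 98, 32, 50, 0, 195, 188]
print(hashlib.sha1(blob).hexdigest())
```

The byte list matches the array quoted in the comment above, so if the result still disagrees with git hash-object, the file on disk probably contains different bytes than the string in the test.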
2

Full Python3 implementation:

import os
from hashlib import sha1

def hashfile(filepath):
    filesize_bytes = os.path.getsize(filepath)

    s = sha1()
    s.update(b"blob %u\0" % filesize_bytes)

    with open(filepath, 'rb') as f:
        s.update(f.read())

    return s.hexdigest() 
1
  • 2
    What you really want is ASCII encoding. UTF8 only works here because it is compatible with ASCII and "blob x\0" only contains characters with code <= 127. Commented Oct 31, 2014 at 6:08
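
For large files, reading everything with f.read() may be undesirable. The following is a sketch of a chunked variant of the same idea, assuming Python 3.5 or later for bytes formatting; the function name git_hash_file and the chunk size are arbitrary choices for illustration.

```python
import hashlib
import os


def git_hash_file(path, chunk_size=65536):
    """Git blob ID of a file, read in chunks (hypothetical helper name)."""
    h = hashlib.sha1()
    # Header first: "blob <size in bytes>\0", all ASCII.
    h.update(b"blob %d\0" % os.path.getsize(path))
    with open(path, "rb") as f:  # binary mode: hash the bytes as stored
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Because the size is taken before the file is read, the result is only meaningful if the file does not change in between.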
1

In Perl:

#!/usr/bin/env perl
use Digest::SHA1;

my $content = do { local $/ = undef; <> };
print Digest::SHA1->new->add('blob '.length($content)."\0".$content)->hexdigest(), "\n";

As a shell command:

perl -MDigest::SHA1 -E '$/=undef;$_=<>;say Digest::SHA1->new->add("blob ".length()."\0".$_)->hexdigest' < file
1

And in Perl (see also Git::PurePerl at http://search.cpan.org/dist/Git-PurePerl/ )

use strict;
use warnings;
use Digest::SHA1;

my @input = <>;

my $content = join("", @input);

my $git_blob = 'blob' . ' ' . length($content) . "\0" . $content;

my $sha1 = Digest::SHA1->new();

$sha1->add($git_blob);

print $sha1->hexdigest();
1

Using Ruby, you could do something like this:

require 'digest/sha1'

def git_hash(file)
  data = File.read(file)
  size = data.bytesize.to_s
  Digest::SHA1.hexdigest('blob ' + size + "\0" + data)
end
1

A little Bash script that should produce identical output to git hash-object:

#!/bin/bash
( 
    echo -en 'blob '"$(stat -c%s "$1")"'\0';
    cat "$1" 
) | sha1sum | cut -d\  -f 1
1

You can apply the same on files as well

$ echo "foobar" > foo.txt
$ echo "$(cat foo.txt)"|(read f; echo -en "blob "$((${#f}+1))"\0$f\n" )|openssl sha1
323fae03f4606ea9991df8befbb2fca795e648fa
0

In JavaScript

const crypto = require('crypto')
const bytes = require('utf8-bytes')

function sha1(data) {
    const shasum = crypto.createHash('sha1')
    shasum.update(data)
    return shasum.digest('hex')
}

function shaGit(data) {
    const total_bytes = bytes(data).length
    return sha1(`blob ${total_bytes}\0${data}`)
}
-4

It is interesting to note that Git apparently adds a newline character to the end of the data before hashing it. A file containing nothing but "Hello World!" gets a blob hash of 980a0d5..., which is the same as this one:

$ php -r 'echo sha1("blob 13" . chr(0) . "Hello World!\n") , PHP_EOL;'
1
  • 4
    That newline is being added by your text editor, not by git hash-object. Note that doing echo "Hello World!" | git hash-object --stdin gives 980a0d5..., while using echo -n gives a hash of c57eff5... instead.
    – bdesham
    Commented Oct 28, 2013 at 21:38
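
The effect described in this comment can be reproduced without Git. This is a Python sketch using the blob formula from the accepted answer in this thread; blob_id is an illustrative name, not a Git API.

```python
import hashlib


def blob_id(data: bytes) -> str:
    # Illustrative helper: SHA-1 over "blob <size>\0" plus the raw bytes.
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()


with_newline = blob_id(b"Hello World!\n")   # like: echo "Hello World!"
without_newline = blob_id(b"Hello World!")  # like: echo -n "Hello World!"

print(with_newline[:7])     # 980a0d5
print(without_newline[:7])  # c57eff5
```

The only difference between the two inputs is the final newline byte, which also changes the size in the header from 12 to 13.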
