
I am confused as to how SHA-1 hashes are calculated for commits, trees, and blobs. As per this article, commit hashes are calculated based on the following factors:

  1. The source tree of the commit (which unravels to all the subtrees and blobs)
  2. The parent commit sha1
  3. The author info
  4. The committer info (right, those are different!)
  5. The commit message

Are the same factors involved for tree and blob hashes as well?

  • No. Blobs don’t have trees, parent commits, authors, committers, or messages. Trees don’t have parent commits, authors, committers, or messages.
    – Ry-
    Commented Jul 2, 2017 at 14:17
  • Playing the devil's advocate, why does it matter to you how SHA-1 hashes are calculated, assuming that an algorithm is used which makes it reasonably unlikely that a collision would ever occur? Commented Jul 2, 2017 at 14:20
  • @Ryan Trees do have trees (directories) and blobs but not parent, authors, committers, or messages.
    – appu
    Commented Jul 2, 2017 at 14:58
  • @TimBiegeleisen It matters to me to really understand how a git command actually affects the object storage. IMHO, if one doesn't know how things are formed then one really can't be at ease with it; at least I can't.
    – appu
    Commented Jul 2, 2017 at 15:05
  • @appu: Copy and paste error, sorry.
    – Ry-
    Commented Jul 2, 2017 at 15:12

2 Answers


Git is sometimes called a "content-addressable filesystem". The hashes are the addresses, and they are based on the contents of the various objects. So, in order to know what the hash is based on, we only need to know the contents of the various objects.

Blobs

A blob is simply a stream of octets. Nothing more. It is akin to the concept of file content in a Unix filesystem.

So, the hash of a blob is based solely on its contents; a blob has no metadata.

Trees

A tree associates names and permissions with other objects (blobs or trees). A tree is simply a list of quadruples (permission, type, hash, name). For example, a tree may look like this:

100644 blob a906cb2a4a904a152e80877d4088654daad0c859 README
100644 blob 8f94139338f9404f26296befa88755fc2598c289 Rakefile
040000 tree 99f1a6d12cb4b6f19c8655fca46c3ecf317074e0 lib

Note the third entry which is itself a tree.

A tree is analogous to a directory special file in a Unix filesystem.

Again, the hash is based on the contents of the tree, which means on the names, permissions, types, and hashes of its entries.

Commits

A commit records a snapshot of a tree in time together with some metadata and how the snapshot came to be. A commit consists of:

  • a list of hashes of (any number of) parent commits (including zero)
  • a hash of a tree
  • a commit message
  • commit metadata (commit date, committer name and email)
  • authoring metadata (authoring date, author name and email)

The hash of a commit is based on those.

Tags

Tags aren't objects in the sense above. They are not part of the object store and don't have a hash. They are references to objects. (Note: any object can be tagged, not just commits, although tagging a commit is the normal use case.)

Annotated Tags

An annotated tag is different: it is part of the object store.

An annotated tag stores:

  • a hash of the tagged object (usually a commit) and that object's type
  • the tag name
  • a tag message
  • tagging metadata (tagger name and tagging date)

As with all other objects, the hash is calculated based on all of them and nothing more.

Signed tags

A signed tag is like an annotated tag, but adds a cryptographic signature. The signature is appended to the tag message, so it is part of the content that gets hashed.

Notes

Notes allow you to attach arbitrary content to an arbitrary Git object (most commonly a commit) without changing that object.

The storage of notes is a little more complicated. Actually, a note is just a commit (containing a tree containing blobs containing the contents of the note). Git creates a special branch for notes and the association between the note commit and its "annotee object" happens there. I am not familiar with exactly how.

However, since a note is just a commit, and the association happens externally, the hash of a note is just the same as any other commit.


Storage Format

The storage format contains a simple header. The content that is actually stored (and hashed) is the header followed by a NUL octet followed by the object contents.

The header contains the type and the length of the object contents, encoded in ASCII. So, the blob which contains the string Hello, World encoded in ASCII would look like this:

blob 12\0Hello, World

And that is what is hashed and stored.
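As a rough illustration of the header-plus-contents rule, here is a minimal Python sketch. The function name `git_object_id` is made up for this example and is not part of Git; the sketch assumes SHA-1 object names, as discussed in the question.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Return the SHA-1 hex digest for a Git object (hypothetical helper)."""
    # Header: "<type> <length in decimal ASCII>", then a single NUL octet.
    header = f"{obj_type} {len(content)}".encode("ascii")
    store = header + b"\0" + content
    return hashlib.sha1(store).hexdigest()


if __name__ == "__main__":
    # The example from the text: "blob 12", NUL, then the 12 octets.
    print(git_object_id("blob", b"Hello, World"))
```

If the sketch is right, the printed value should agree with what `printf 'Hello, World' | git hash-object --stdin` reports in any repository.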

Other types of objects have a more structured format, so a tree object would start off with a header tree <length of content in octets>\0 followed by a strictly defined, structured, serialized representation of a tree.
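To make the tree serialization a bit more concrete, here is a hedged Python sketch. It assumes the commonly documented layout: each entry is the mode in ASCII octal (without a leading zero), a space, the name, a NUL octet, and then the 20 raw octets of the SHA-1 hash. The type shown in listings is not stored separately; it follows from the mode. The helper names `serialize_tree` and `tree_id` are invented for this example.

```python
import hashlib


def serialize_tree(entries):
    """Serialize (mode, name, hex_hash) triples into tree contents (sketch)."""
    # Listings show "040000", but the stored mode has no leading zero.
    normalized = [(mode.lstrip("0"), name, h) for mode, name, h in entries]

    def sort_key(entry):
        mode, name, _ = entry
        # Entries are ordered by name; directories compare as if the
        # name had a trailing "/".
        return name.encode() + (b"/" if mode == "40000" else b"")

    body = b""
    for mode, name, hex_hash in sorted(normalized, key=sort_key):
        # mode, space, name, NUL, then the 20 raw hash octets.
        body += mode.encode() + b" " + name.encode() + b"\0"
        body += bytes.fromhex(hex_hash)
    return body


def tree_id(entries):
    """Hash the serialized tree together with its "tree <length>" header."""
    body = serialize_tree(entries)
    header = b"tree " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


if __name__ == "__main__":
    example = [
        ("100644", "README", "a906cb2a4a904a152e80877d4088654daad0c859"),
        ("100644", "Rakefile", "8f94139338f9404f26296befa88755fc2598c289"),
        ("040000", "lib", "99f1a6d12cb4b6f19c8655fca46c3ecf317074e0"),
    ]
    print(tree_id(example))
```

Because the hashes of the entries are part of the serialized tree, changing any blob or subtree changes the hash of every tree above it.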

The same for commits, and so on.
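For commits, a similar sketch might look like the following. It assumes the usual textual layout (a `tree` line, zero or more `parent` lines, `author` and `committer` lines, a blank line, then the message). The helper names and the identity strings are hypothetical; the tree hash used is the well-known hash of the empty tree.

```python
import hashlib


def commit_content(tree, parents, author, committer, message):
    """Build the textual contents of a commit object (sketch)."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # zero or more parents
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    # A blank line separates the metadata from the commit message.
    return ("\n".join(lines) + "\n\n" + message).encode("utf-8")


def commit_id(content: bytes) -> str:
    """Hash the contents together with the "commit <length>" header."""
    header = b"commit " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # Hypothetical identity: "Name <email> <unix timestamp> <timezone>".
    who = "A U Thor <author@example.com> 1499000000 +0000"
    content = commit_content(
        "4b825dc642cb6eb9a060e54bf8d69288fbee4904", [], who, who, "Initial\n"
    )
    print(commit_id(content))
```

Since the timestamps are part of the contents, two otherwise identical commits made at different times end up with different hashes.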

Most formats are textual formats, based on simple ASCII. For example, the size is not encoded as a binary integer, but as a decimal integer with each digit represented as the corresponding ASCII character.

Compression

After the hash is computed, the octet stream corresponding to the object, including the header, is compressed using zlib-deflate, and the resulting octet stream is stored in a file named after the hash; by default in the directory

.git/objects/<first two characters of the hash>/<remaining hash>
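The path and compression steps can be sketched in Python as follows. The function `loose_object` is a hypothetical helper that only computes the path and the compressed bytes and deliberately writes nothing to disk; the default `git_dir` of `.git` is an assumption.

```python
import hashlib
import zlib
from pathlib import Path


def loose_object(obj_type: str, content: bytes, git_dir: str = ".git"):
    """Return (path, compressed bytes) for a loose object (sketch)."""
    store = f"{obj_type} {len(content)}".encode("ascii") + b"\0" + content
    # The hash is taken over the uncompressed header + contents.
    digest = hashlib.sha1(store).hexdigest()
    # First two hex characters name the directory, the rest the file.
    path = Path(git_dir) / "objects" / digest[:2] / digest[2:]
    return path, zlib.compress(store)


if __name__ == "__main__":
    path, data = loose_object("blob", b"Hello, World")
    print(path, len(data))
```

Note that the compressed bytes may differ between zlib versions or compression levels, which is harmless because the hash is computed before compression.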

Packs

The above storage format is called the loose object format, because every object is stored individually. There is a more efficient storage format (which is also used as the network transmission format), called a packfile.

Packfiles are an important speed and storage optimization, but they are rather complex, so I am not going to describe them in detail.

As a first approximation, a packfile consists of many objects concatenated into a single file, plus a second file containing an index that records where in the packfile each object resides. Objects in a packfile can be stored as deltas against other, similar objects, which allows a better compression ratio, since redundancies between objects and not just within a single object can be exploited. (E.g. if you have two revisions of a blob which are almost identical … which is kind of the norm in an SCM.)

Packing doesn't rely on zlib-deflate alone; it also uses a binary delta compression algorithm (each entry, whether a full object or a delta, is still zlib-deflated individually). It also uses certain heuristics for how to order the objects so that objects which are likely to have large similarity are placed closely together. (The delta algorithm cannot actually compare every object with every other object, as that would consume too much time and memory; rather, it operates on a sliding window over the sorted list of objects, and the heuristics try to ensure that similar objects land within the same window.) Some of those heuristics are: look at the names a tree associates with blobs and try to keep the ones with the same names close together, try to keep the ones with the same file extension close together, try to keep subsequent revisions close together, and so on.

Poking around

Loose (i.e. not packed) objects are just zlib-deflated. Un-deflate them and look at them to see how they are structured. Note that the uncompressed octet stream is exactly what is being hashed; the objects are stored compressed but hashed before they are compressed.

Here's a simple Perl one-liner to un-deflate (is that inflate?) a stream:

perl -MCompress::Zlib -e 'undef $/; print uncompress(<>)'
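For readers who prefer Python, a rough equivalent that also splits off the header and recomputes the hash might look like this. `parse_loose_object` is a made-up helper name, and the sketch assumes it is handed the raw bytes of one loose object file.

```python
import hashlib
import zlib


def parse_loose_object(compressed: bytes):
    """Inflate a loose object and return (type, contents, sha1) (sketch)."""
    raw = zlib.decompress(compressed)
    # Everything before the first NUL octet is the "<type> <length>" header.
    header, _, content = raw.partition(b"\0")
    obj_type, _, size = header.partition(b" ")
    if int(size) != len(content):
        raise ValueError("length in header does not match contents")
    # The hash covers the whole uncompressed stream, header included.
    return obj_type.decode("ascii"), content, hashlib.sha1(raw).hexdigest()


if __name__ == "__main__":
    # Round-trip the example from the text instead of reading a real file.
    sample = zlib.compress(b"blob 12\0Hello, World")
    print(parse_loose_object(sample))
```

To try it on a real repository, pass in the bytes of a file under `.git/objects/`; the returned hash should equal the directory name followed by the file name.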
  • One minor correction: notes are stored as commits, containing trees containing files. refs/notes/commits points to the tip-most of a chain of such commits. The tree associated with this tip-most commit has one file per commit-that-has-a-note: if there is a note for commit badf00..., for instance, there is a file named ba/df00... or ba/df/00... or similar in the "tip note". That file (a blob) contains the notes for commit badf00.... The degree of fan-out here depends on the number of noted commits.
    – torek
    Commented Jul 2, 2017 at 16:27
  • A tag doesn't need to point to a commit. Any Git object can be tagged.
    – ElpieKay
    Commented Jul 3, 2017 at 1:13

I think that the best way to understand the content of each type of Git object is to explore them yourself.

You can do it easily using the command:

git cat-file -p <a_sha1>

Start with the sha1 of a commit. You will get the sha1 of a tree; take it and apply the same command again, working your way down until you end up at a blob.

Each time, you will see the content that is stored for that Git object in the database.

The only other thing you should know is that the content is prefixed by the type of the object and the length of the content, and is then compressed.
