4

Each Git commit object points to a tree object. Does every commit's tree object store all of its entries, or does it only add new entries and contain deltas from the commit's parent?

For example, the Linux source code has 1M commits and thousands of objects (master has 70,000). If every commit's tree contained entries for all of those objects, it would take up a huge amount of space in the long run. It would also mean a lot of processing and transfer even when a single-line change is committed or pushed.

I understand the Git philosophy is to store snapshots and not deltas for files, but in that case, only the changed file is getting stored.

In the example below, tree 70951b429e0e1191a8c1d9e34248cd76453ef544 contains (or shows up as containing) all 5 files, even though only a single file was added.

[test]$ls
a.txt  b.txt  c.txt  d.txt
[test]$echo r5 > e.txt
[test]$git add -A && git commit -m "r5"
[master 51f6941] r5
[test]$git cat-file -p 51f6941
tree 70951b429e0e1191a8c1d9e34248cd76453ef544
[test]$git cat-file -p 70951b429e0e1191a8c1d9e34248cd76453ef544
100644 blob 9a6c8d12dea8859b821b2ba705f7efd6cc914aa5    a.txt
100644 blob 9a6c8d12dea8859b821b2ba705f7efd6cc914aa5    b.txt
100644 blob b6693b64f528de38cde5533acd781fde743bc3df    c.txt
100644 blob 91174caefafdc81d34e302874c86c6e4d5212075    d.txt
100644 blob 29f4cfc46ba3a0bde55bce8f44ac3590e2108da4    e.txt

3 Answers

7

Every commit holds—logically, anyway—a complete snapshot of every file (well, every file that is in the commit).

If you pick a commit, e.g., by its hash ID, and run git checkout on that commit, your work-tree is filled from the files that are in that commit. That is, your work-tree takes on that snapshot. Switch away from that commit to some other commit that has, say, three fewer files, and Git removes those three files (and updates the remaining ones if/as needed).

If every commit object contains all the objects' entries, it would take huge space in the long run.

Except ... it doesn't. There are two amazing (or not really that amazing) feats of cleverness involved.

The first one shows up right here:

[test]$git cat-file -p 70951b429e0e1191a8c1d9e34248cd76453ef544
100644 blob 9a6c8d12dea8859b821b2ba705f7efd6cc914aa5    a.txt
100644 blob 9a6c8d12dea8859b821b2ba705f7efd6cc914aa5    b.txt
100644 blob b6693b64f528de38cde5533acd781fde743bc3df    c.txt
100644 blob 91174caefafdc81d34e302874c86c6e4d5212075    d.txt
100644 blob 29f4cfc46ba3a0bde55bce8f44ac3590e2108da4    e.txt

Note that blob hash ID 9a6c8d12dea8859b821b2ba705f7efd6cc914aa5 shows up twice: once for a.txt and once for b.txt.

There is only one copy of the contents of both a.txt and b.txt. From this we can conclude that whatever is in a.txt, and in b.txt, the contents are the same.

So if you commit 100 files, then make a new commit in which 99 files are the same as 99 of the previous commit's files, you've just re-used 99 blob objects. They did not have to be stored again.

Git automatically de-duplicates file contents this way.
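
As a rough illustration of why identical contents end up as a single blob, here is a minimal Python sketch. It assumes a SHA-1 repository (as the 40-character IDs above suggest), where a blob's ID is the SHA-1 of the header "blob <size>\0" followed by the file's bytes. The file contents and the dictionary-based store are hypothetical stand-ins, not Git's own code.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID of a blob holding `content`."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Hypothetical file contents: a.txt and b.txt are identical, c.txt differs.
files = {"a.txt": b"r1\n", "b.txt": b"r1\n", "c.txt": b"r3\n"}
ids = {name: git_blob_id(data) for name, data in files.items()}

# Toy object store keyed by ID: identical content collapses into one entry.
store = {git_blob_id(data): data for data in files.values()}

for name, oid in ids.items():
    print(oid, name)
print(len(store), "blobs stored for", len(files), "files")
```

Because the ID depends only on the content, the file name plays no part in it; the name lives in the tree entry that points at the blob.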

The second bit of cleverness happens later. Initially, every object is stored as its own zlib-compressed file (under .git/objects/, though you should not count on this). If you change a few bytes in a file and run git add, and the new blob is not a 100% exact match for some already-existing blob object, you get a new one of these objects. Internally, these are called loose objects.
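
A minimal Python sketch of what a loose blob looks like, assuming a SHA-1 repository and the usual layout in which the first two hex digits of the ID name a subdirectory. The helper name loose_object is made up for this illustration, and (as noted above) the on-disk location is an implementation detail you should not rely on.

```python
import hashlib
import zlib
from typing import Tuple


def loose_object(content: bytes) -> Tuple[str, bytes]:
    """Return (relative path, compressed payload) for a loose blob."""
    raw = b"blob " + str(len(content)).encode("ascii") + b"\0" + content
    oid = hashlib.sha1(raw).hexdigest()
    # The ID is computed over the uncompressed header + content;
    # the stored file is the zlib-compressed form of the same bytes.
    return oid[:2] + "/" + oid[2:], zlib.compress(raw)


path, payload = loose_object(b"hello world\n")
print(".git/objects/" + path)

# Reading it back: decompress, then split the header from the body.
header, _, body = zlib.decompress(payload).partition(b"\0")
print(header, body)
```

Note that nothing here is a delta: a loose object is a complete, individually compressed copy of the content.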

When there are enough loose objects lying around, or sooner if needed, Git packs the loose objects into a pack file. At this time, objects that can be profitably delta-compressed usually are. This compression is the really clever code.

When you use git fetch or git push, Git will figure out which objects need to be transferred over the network and build a so-called thin pack. This is where you see the counting and compressing objects messages. Git then sends the thin pack over the wire; the Git at the other end fixes up the thin pack to make it a regular (fat) pack. When there are too many pack files, Git will repack the pack files, taking you from many *.pack and *.idx files down to just a few (or one) again.

(There have been some occasional bugs here. There was a recent fix to deal with large numbers of pack files. There are several older bugs where too many loose objects get left around. An occasional manual git gc is sometimes helpful to work around these bugs, but using git gc too often can be counterproductive.)

4

A tree object itself is always complete. It represents one level of a directory hierarchy. So if you have a directory src and directories inside that called foo and bar, each with contents, you'll have tree objects for the top level, for src, for src/foo, and for src/bar.

However, the actual data in the files is stored as blobs. If a file doesn't change, Git doesn't store a new copy of it: it just references the existing blob object. This is true also for trees, so if you just change a file in src/foo, you get new tree objects for the top level, src, and src/foo, but not src/bar.
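
To make the src/foo versus src/bar case concrete, here is a small Python sketch that hashes tree objects by hand. It assumes a SHA-1 repository and the usual tree encoding (each entry is the mode, a space, the name, a NUL byte, and the 20-byte binary ID, with entries sorted by name). The file names x.txt and y.txt and their contents are hypothetical.

```python
import hashlib


def object_id(kind: bytes, body: bytes) -> bytes:
    """Raw 20-byte SHA-1 ID of an object of the given kind."""
    header = kind + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).digest()


def tree_id(entries) -> bytes:
    """entries: list of (mode, name, raw_id) tuples for one directory level."""
    def sort_key(entry):
        mode, name, _ = entry
        # Directories sort as if their name ended in "/".
        return name + b"/" if mode == b"40000" else name

    body = b"".join(
        mode + b" " + name + b"\0" + oid
        for mode, name, oid in sorted(entries, key=sort_key)
    )
    return object_id(b"tree", body)


def snapshot(foo_content: bytes) -> dict:
    """Hash a tiny src/foo + src/bar layout; only src/foo/x.txt varies."""
    foo = tree_id([(b"100644", b"x.txt", object_id(b"blob", foo_content))])
    bar = tree_id([(b"100644", b"y.txt", object_id(b"blob", b"unchanged\n"))])
    src = tree_id([(b"40000", b"foo", foo), (b"40000", b"bar", bar)])
    top = tree_id([(b"40000", b"src", src)])
    return {"top": top.hex(), "src": src.hex(),
            "src/foo": foo.hex(), "src/bar": bar.hex()}


before = snapshot(b"v1\n")
after = snapshot(b"v2\n")
for path in before:
    print(path, "reused" if before[path] == after[path] else "new tree")
```

Only the trees on the path from the root down to the changed file get new IDs; the tree for src/bar hashes to the same ID as before, so the existing object is simply referenced again.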

Now, when Git packs objects, it takes each object and deltifies it against other objects of similar size and type. So if you've modified only one entry in a tree, then the tree will likely be packed such that it mostly refers to another tree and only includes literal data for the new entry. Similarly, small changes to a file are also packed in a deltified way, so a small change to a file will result in a reference to another copy of that file plus a small amount of literal content.
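
The idea of "a reference to another copy plus a small amount of literal content" can be sketched with a toy copy/insert delta in Python. This is not Git's actual pack delta encoding or its base-selection heuristics; it uses difflib from the standard library purely to illustrate the principle, and make_delta and apply_delta are names invented for this example.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes) -> list:
    """Describe `target` as copies from `base` plus literal inserts."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))       # offset and length in base
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))   # literal bytes
        # "delete" needs no instruction: those base bytes are simply skipped.
    return ops


def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the full object from the base and the delta instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


# Hypothetical file: 50 lines, one of which changes between versions.
base = b"".join(b"line %d\n" % i for i in range(50))
target = base.replace(b"line 25\n", b"line twenty-five\n")

ops = make_delta(base, target)
literal = sum(len(op[1]) for op in ops if op[0] == "insert")
print(len(target), "bytes in the new version,", literal, "stored literally")
```

Applying the delta to the base always yields the complete new version, which is why a deltified object still behaves as a full snapshot once it has been read.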

This is just the packed form; if Git needs to read an actual object, it resolves each delta and pulls it into memory so that it can read the data. Loose objects are stored compressed, but not deltified. Packing is done periodically by using git gc.

2

Does every commit-tree object store all its entries with it or does it only add new entries and only contain deltas from commit's parent?

Git separates storage deltas from revision deltas. However an object is compressed in storage, once it is reconstituted it is a full snapshot.

Git will pack the object database when it looks like there are big wins available; after that, trees (like everything else) are almost entirely delta-compressed, just... not necessarily against their parents. The goal is storage compression. Git looks much farther afield than just parents.
