Every commit holds—logically, anyway—a complete snapshot of every file (well, every file that is in the commit).
If you pick a commit, e.g., by its hash ID, and run git checkout on that commit, your work-tree is filled from the files that are in that commit. That is, your work-tree takes on that snapshot. Switch away from that commit to some other commit that has, say, three fewer files, and Git removes those three files (and updates the remaining ones if/as needed).
If every commit really stored its own full copy of every file, the repository would take up a huge amount of space in the long run.
Except ... it doesn't. There are two amazing (or not really that amazing) feats of cleverness involved.
The first one shows up right here:
[test]$ git cat-file -p 70951b429e0e1191a8c1d9e34248cd76453ef544
100644 blob 9a6c8d12dea8859b821b2ba705f7efd6cc914aa5 a.txt
100644 blob 9a6c8d12dea8859b821b2ba705f7efd6cc914aa5 b.txt
100644 blob b6693b64f528de38cde5533acd781fde743bc3df c.txt
100644 blob 91174caefafdc81d34e302874c86c6e4d5212075 d.txt
100644 blob 29f4cfc46ba3a0bde55bce8f44ac3590e2108da4 e.txt
Note that blob hash ID 9a6c8d12dea8859b821b2ba705f7efd6cc914aa5 shows up twice: once for a.txt and once for b.txt.
The hash ID is computed from the file's contents, so from this we can conclude that whatever is in a.txt and in b.txt, the contents are the same. Git stores only one copy of those contents, shared by both files.
So if you commit 100 files, then make a new commit in which 99 files are the same as 99 of the previous commit's files, you've just re-used 99 blob objects. They did not have to be stored again.
Git automatically de-duplicates file contents this way.
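To make the de-duplication concrete, here is a minimal sketch in Python. The blob_id function follows the way Git names blobs in a SHA-1 repository (hash of a "blob <size>" header, a NUL byte, then the contents). ToyObjectStore and the sample file names are hypothetical, made up for illustration; this is not how Git itself is implemented.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Return the ID a SHA-1 Git repository gives a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class ToyObjectStore:
    """Hypothetical in-memory store keyed by blob ID (not Git's real storage)."""

    def __init__(self):
        self.objects = {}

    def add(self, content: bytes) -> str:
        oid = blob_id(content)
        # Identical content maps to the same ID, so it is stored only once.
        self.objects.setdefault(oid, content)
        return oid


store = ToyObjectStore()
files = [("a.txt", b"same\n"), ("b.txt", b"same\n"), ("c.txt", b"different\n")]
# A "tree" here is just a mapping from file name to blob ID.
tree = {name: store.add(data) for name, data in files}

print(tree["a.txt"] == tree["b.txt"])  # both names point at one blob
print(len(store.objects))              # number of distinct blobs stored
```

Three file names go in, but only two blobs are stored, because a.txt and b.txt hash to the same ID.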
The second bit of cleverness happens later. Initially, all objects are stored as zlib-compressed files (files in .git/objects/, though you should not count on this). If you change a few bytes in a file and run git add, and the new blob object is not a 100% exact match for some already-existing blob object, you get a new one of these objects. These are called loose objects, internally.
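As a rough sketch of what a loose object looks like on disk, the Python below writes zlib-compressed blobs into a scratch directory, using the usual layout of a two-character subdirectory plus the remaining 38 hash characters. The helper names write_loose_blob and read_loose_blob are hypothetical, and the code writes to a temporary directory rather than a real .git/objects/, which you should not modify by hand.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_loose_blob(objects_dir: Path, content: bytes) -> Path:
    """Store content as a loose-object-style file and return its path."""
    raw = b"blob " + str(len(content)).encode("ascii") + b"\0" + content
    oid = hashlib.sha1(raw).hexdigest()
    path = objects_dir / oid[:2] / oid[2:]
    if not path.exists():  # an exact match already exists: nothing to write
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(zlib.compress(raw))
    return path


def read_loose_blob(path: Path) -> bytes:
    """Decompress a loose-object-style file and strip the header."""
    raw = zlib.decompress(path.read_bytes())
    _header, _nul, content = raw.partition(b"\0")
    return content


# Scratch directory standing in for .git/objects/ (an assumption for the demo).
demo_dir = Path(tempfile.mkdtemp()) / "objects"
v1 = write_loose_blob(demo_dir, b"hello\n")
v2 = write_loose_blob(demo_dir, b"hello!\n")  # one byte added: a whole new object
print(v1.parent.name, v1.name)
print(v2.parent.name, v2.name)
```

Note that a one-byte change produces a completely separate file. At this stage nothing is shared between the two versions, which is what packing fixes later.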
When there are enough loose objects lying around, or sooner if needed, Git packs the loose objects into a pack file. At this time, objects that can be profitably delta-compressed usually are: each such object is stored as a set of differences against some similar object rather than in full. This compression is the really clever code.
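The idea behind delta compression can be sketched as a list of "copy this range from the base" and "insert these literal bytes" instructions. The Python below is a toy illustration only: it uses difflib from the standard library to find matching ranges, which is not Git's algorithm, and the instruction tuples are a made-up representation, not Git's on-disk delta encoding. The names make_delta and apply_delta are hypothetical.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes) -> list:
    """Describe target as copy-from-base and insert-literal instructions."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))      # offset and length in base
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))  # bytes not found in base
        # "delete": bytes dropped from base need no instruction at all
    return ops


def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the target from the base plus the instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


base = b"line one\nline two\nline three\n" * 50
target = base.replace(b"two", b"2", 1)  # a small edit to a large file

delta = make_delta(base, target)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "insert")
print(len(target), literal_bytes)
```

Only the literal bytes (plus a few small copy instructions) need to be stored for the second version, as long as the base object is available to copy from.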
When you use git fetch or git push, Git will figure out which objects need to be transferred over the network and build a so-called thin pack. This is where you see the counting and compressing objects messages. Git then sends the thin pack over the wire; the Git at the other end fixes up the thin pack, to make it a regular (fat) pack. When there are too many pack files, Git will repack the pack files, taking you from many *.pack and *.idx files down to just a few (or one) again.
(There have been some occasional bugs here. There was a recent fix to deal with large numbers of pack files. There are several older bugs where too many loose objects get left around. An occasional manual git gc is sometimes helpful to work around these bugs, but using git gc too often can be counterproductive.)