How does Git store files?

Question

I just started learning Git and to do so, I started reading the Git Community Book, and in this book they say that SVN and CVS store the difference between files and that Git stores a snapshot of all the files.

But I didn't really get what they mean by snapshot. Does Git really make a copy of all the files in each commit? Because that's what I understood from their explanation.

Here's a brilliant post that explains in detail how git works. What you're looking for is probably the § about the object database. — greg0ire, Commented Nov 20, 2011 at 0:04
Excellent article that contains links to other great resources. I've had fun with these for a couple of hours. — mihai, Commented Mar 11, 2015 at 14:07
I found this really nice article describing git from inside out: maryrosecook.com/blog/post/git-from-the-inside-out — Anubis, Commented Jan 15, 2016 at 2:38
@Sumudu The post 'git from the inside out' by Mary rose cook is indeed brilliant. – drlolly 2 hours ago Edit — drlolly, Commented Dec 7, 2021 at 16:31

VonC · Accepted Answer · 2020-12-20 02:15:51Z

Git does include for each commit a full copy of all the files, except that, for the content already present in the Git repo, the snapshot will simply point to said content rather than duplicate it.
That also means that several files with the same content are stored only once.

So a snapshot is basically a commit, referring to the content of a directory structure.

Some good references are:

git.github.io/git-reference

You tell Git you want to save a snapshot of your project with the git commit command and it basically records a manifest of what all of the files in your project look like at that point

Lab 12 illustrates how to get previous snapshots

The progit book has the more comprehensive description of a snapshot:

The major difference between Git and any other VCS (Subversion and friends included) is the way Git thinks about its data.
Conceptually, most other systems store information as a list of file-based changes. These systems (CVS, Subversion, Perforce, Bazaar, and so on) think of the information they keep as a set of files and the changes made to each file over time

Git doesn’t think of or store its data this way. Instead, Git thinks of its data more like a set of snapshots of a mini filesystem.
Every time you commit, or save the state of your project in Git, it basically takes a picture of what all your files look like at that moment and stores a reference to that snapshot.
To be efficient, if files have not changed, Git doesn’t store the file again—just a link to the previous identical file it has already stored.
Git thinks about its data more like as below:

This is an important distinction between Git and nearly all other VCSs. It makes Git reconsider almost every aspect of version control that most other systems copied from the previous generation. This makes Git more like a mini filesystem with some incredibly powerful tools built on top of it, rather than simply a VCS.

See also:

Jan Hudec adds this important comment:

While that's true and important on the conceptual level, it is NOT true at the storage level.
Git does use deltas for storage.
Not only that, but it's more efficient in it than any other system. Because it does not keep per-file history, when it wants to do delta compression, it takes each blob, selects some blobs that are likely to be similar (using heuristics that includes the closest approximation of previous version and some others), tries to generate the deltas and picks the smallest one. This way it can (often, depends on the heuristics) take advantage of other similar files or older versions that are more similar than the previous. The "pack window" parameter allows trading performance for delta compression quality. The default (10) generally gives decent results, but when space is limited or to speed up network transfers, git gc --aggressive uses value 250, which makes it run very slow, but provide extra compression for history data.

Does anyone know the computer science term for the Git-like storage pattern, aka hash-based value store? (or something similar) — Joannes Vermorel, Commented Nov 24, 2014 at 12:00
These two sections of the documentation seem somewhat contradicting: git-scm.com/book/en/v2/Getting-Started-Git-Basics, git-scm.com/book/en/v2/Git-Internals-Packfiles, it would be nice that they were clearer about how they make git efficient on a higher/layman level explanation — dukeofgaming, Commented Jan 3, 2015 at 23:31
In the context of the OP's actual question, the first paragraph seems really misleading. It's not until you get to the final paragraph that we learn that, oh yes, fact Git does "store [...] differences between files. Really wish that info was flagged up top and not buried so deep. That said, thanks at least including the real story somewhere in your answer ;) — Josh O'Brien, Commented Jan 27, 2015 at 20:50
"Git does store for each commit a full copy of all the files". Given the last paragraph, that is simply wrong! — Rastapopoulos, Commented Jun 11, 2018 at 14:15
@VonC You say "does have a full copy of all files per commit", but the next sentence contradicts that... I think it is obvious to anyone who uses Git that Git can reconstitute a copy of a file in a given commit. — Rastapopoulos, Commented Jun 29, 2018 at 16:27

Peter Mortensen · Accepted Answer · 2023-12-04 23:16:13Z

Git logically stores each file under its SHA-1 hash value. This means if you have two files with exactly the same content in a repository (or if you rename a file), only one copy is stored.

But this also means that when you modify a small part of a file and commit, another copy of the file is stored. The way Git solves this is using pack files. Once in a while, all the “loose” files (actually, not just files, but objects containing commit and directory information too) from a repository are gathered and compressed into a pack file. The pack file is compressed using zlib. And similar files are also delta-compressed.

The same format is also used when pulling or pushing (at least with some protocols), so those files don't have to be recompressed again.

The result of this is that a Git repository, containing the whole uncompressed working copy, uncompressed recent files and compressed older files is usually relatively small, two times smaller than the size of the working copy. And this means it's smaller than an SVN repository with the same files, even though SVN doesn't store the history locally.

Alexis Määttä Vinkler · Accepted Answer · 2023-12-05 12:33:54Z

OP: What does snapshot in Git mean? Is is true that Git makes a copy of all the files in each commit?

What does snapshot in Git mean?

In Git, all commits are immutable snapshots of your project (ignored files excluded) at a specific point in time. This means that each and every commit contains a unique representation of your entire project, not just the modified or added files (deltas), at the time of commit. Apart from references to the actual files, each commit is also infused with relevant metadata such as commit message, author (incl. time stamp), committer (incl. timestamp), and references to parent commit(s); all of which are immutable!

Since the commit (or commit object as it is formally called) is immutable in its entirety, trying to modify any of its content isn't possible. Commits can never be tampered with or modified once they are created!

How Git store files internally

From the Pro Git book we learn that:

Git is a content-addressable filesystem. Great. What does that mean? It means that at the core of Git is a simple key-value data store. What this means is that you can insert any kind of content into a Git repository, for which Git will hand you back a unique key you can use later to retrieve that content.

So let's look at below illustration to figure out what above statement really means, and how Git store data (and particularly files) internally.

A simple commit history containing three commits, including an overview of how the actual data (files and directories) are stored inside Git. On the left hand side the actual snapshot is displayed, with the "delta change" compared to previous commit highlighted in green. On the far right are the internal objects used for storage.

Git makes use of three main objects in its internal storage:

Commit object (High-level snapshot container)
Tree object (Low-level filename/directory container)
Blob object (Low-level file content container)

To store a file inside Git in a general sense (e.g. content + filename/directory) one blob and a tree is needed; the blob to store just the file content, and the tree to store the filename/directory referencing the blob. To construct nested directories, multiple trees are used; a tree can hence reference both blobs and trees. From a high-level perspective you don't have to worry about blobs and trees as Git creates them automatically as part of the commit process.

Note: Git computes all hashes (keys) bottom up, starting with the blobs, moving passed any sub trees, ultimately arriving at the root tree – feeding the keys as input to its direct parents. This process produces the structure visualized above which is known in mathematics and computer science as a directed ascyclical graph (DAG), e.g. all references moves in one direction only without any cyclical dependencies.

Analyzing the visualized example a bit further

By scrutinizing above history we see that for the initial C0 commit two empty files were added, src/index.js and .gitignore – but only one blob got created! That's because Git only stores unique content, and since the content of the two empty files obviously resulted in the same hash: e69de – only one entry was needed. However, as their filenames and paths differed two trees got created to keep track of this. Each tree returning a unique hash (key) computed based on the paths and blobs it's referencing.

Continuing upwards to the second commit C1, we see that only the .gitignore file got updated producing a new blob (e51ac) containing that data. As far as the root tree goes it still makes use of the same sub tree reference for the src/index.js file. However, the root tree is also a brand new object with a new hash (key) simply because the underlying .gitignore reference changed.

In the final C2 commit only the src/index.js file got updated and a new blob (257cc) emerged – forcing the creation of a new sub tree (5de32), and ultimately a new root tree (07eff).

In summary

Everytime a new commit is created, a snapshot of your entire project is recorded and stored to the internal database following a DAG data structure. Whenever a commit is checked out, your Working Tree is reconstructed to reflect the same state as the underlying snapshot is referencing through the root tree.

Source: The Above excerpt is taken from this full-length post on the subject (published on my own blog): Immutable Snapshots - One of Git's Core Concepts

A post of which you are presumably the author? This needs to be revealed. See e.g. Limits for self-promotion in answers. — Peter Mortensen, Commented Dec 4, 2023 at 23:21
No worries, I've made it explicit now. It was never my intention to mislead anyone. — Alexis Määttä Vinkler, Commented Dec 5, 2023 at 12:35

Collectives™ on Stack Overflow

How does Git store files?

3 Answers 3

What does snapshot in Git mean?

How Git store files internally

Analyzing the visualized example a bit further

In summary

Not the answer you're looking for? Browse other questions tagged
git
or ask your own question.

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

What does snapshot in Git mean?

How Git store files internally

Analyzing the visualized example a bit further

In summary

Not the answer you're looking for? Browse other questions tagged git or ask your own question.

Linked

Related

Not the answer you're looking for? Browse other questions tagged
git
or ask your own question.