
In a relational database system, a table is often implemented as a file, or less often a database is implemented as a file.

Git can be viewed as a database system. In Git, what is implemented as a file: a blob, a tree, a commit, a repository?

  • Please read the Git Plumbing and Porcelain chapter of the Git manual. Your current question is too broad IMO. Commented Jan 7, 2019 at 5:25
  • Also, it isn't true that database tables are often implemented as files. In SQL Server, the unit of storage is an 8KB page. Commented Jan 7, 2019 at 5:27

2 Answers

3

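A Git repository is a collection of files. Its "objects" are stored either in packfiles (many objects per file, delta-compressed) or as loose objects (one zlib-compressed file per object).
See "Git repository layout" (the gitrepository-layout manual page).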

As explained in Git Basics:

You will see these hash values all over the place in Git because it uses them so much. In fact, Git stores everything in its database not by file name but by the hash value of its contents.

The Git directory is where Git stores the metadata and object database for your project. This is the most important part of Git, and it is what is copied when you clone a repository from another computer.

The working tree is a single checkout of one version of the project. These files are pulled out of the compressed database in the Git directory and placed on disk for you to use or modify.

Note: the very first commit for Git itself (commit e83c516) mentioned:

There are two object abstractions: the "object database", and the "current directory cache".

The object database is literally just a content-addressable collection of objects.
All objects are named by their content, which is approximated by the SHA1 hash of the object itself.
Objects may refer to other objects (by referencing their SHA1 hash), and so you can build up a hierarchy of objects.

There are several kinds of objects in the content-addressable collection database. They are all in deflated with zlib, and start off with a tag of their type, and size information about the data.
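The SHA1 hash is always the hash of the compressed object, not the original one.

(That last sentence describes the 2005 implementation. Current Git computes the hash over the uncompressed object, that is, the type-and-size header followed by the content, and only then compresses the result with zlib for storage.)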

2

Git can be viewed as a database system ...

That's actually a reasonable high-level view. However, when using this approach, Git has at least two key-value stores (and some optional additional ones that we'll ignore here). One takes names—what Git calls references, which have specialized forms such as branch and tag names—and turns them into hash ID values. The other database takes hash ID keys and turns them into objects.
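
To make the first key-value store concrete, here is a minimal Python sketch of a name-to-hash lookup. It assumes the traditional "files" reference backend, where a reference is either a small file under the Git directory or a line in the `packed-refs` file. The function name `resolve_ref` is made up for illustration, and the sketch ignores worktrees, alternative backends, and name abbreviation rules.

```python
from pathlib import Path
from typing import Optional


def resolve_ref(git_dir: Path, ref: str) -> Optional[str]:
    """Map a full reference name (e.g. 'refs/heads/main') to a hash ID.

    Hypothetical helper: assumes the "files" ref backend only.
    """
    loose = git_dir / ref
    if loose.is_file():
        value = loose.read_text().strip()
        # A symbolic ref such as HEAD holds "ref: <other name>", not a hash.
        if value.startswith("ref: "):
            return resolve_ref(git_dir, value[len("ref: "):])
        return value

    packed = git_dir / "packed-refs"
    if packed.is_file():
        for line in packed.read_text().splitlines():
            # Skip blank lines, the "#" header and "^" peeled-tag lines.
            if not line.strip() or line.startswith(("#", "^")):
                continue
            hash_id, name = line.split(" ", 1)
            if name == ref:
                return hash_id
    return None
```

Calling `resolve_ref(Path(".git"), "HEAD")` in a repository would typically follow the symbolic reference to the current branch and return that branch's commit hash, or `None` if the branch has no commits yet.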

A commit is literally an object of type "commit". Every commit refers to one (single) object of type "tree", which represents the saved snapshot. The tree refers in turn to additional sub-trees and/or to "blob" objects, which represent file content, or, for symbolic links, the target of the link.
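
As an illustration of how a commit points at its tree, the following Python sketch parses the uncompressed content of a commit object, which is a block of `key value` header lines, a blank line, and then the message. The function name `parse_commit` is hypothetical, the sample author is invented, and multi-line headers (such as signatures) are simply skipped.

```python
from typing import Dict, List, Tuple


def parse_commit(content: bytes) -> Tuple[Dict[str, List[str]], str]:
    """Split a commit object's content into header fields and message."""
    text = content.decode("utf-8", errors="replace")
    header_text, _, message = text.partition("\n\n")
    headers: Dict[str, List[str]] = {}
    for line in header_text.split("\n"):
        # Continuation lines of multi-line headers start with a space.
        if not line or line.startswith(" "):
            continue
        key, _, value = line.partition(" ")
        # A merge commit has several "parent" lines, so keep a list.
        headers.setdefault(key, []).append(value)
    return headers, message


if __name__ == "__main__":
    sample = (
        b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
        b"author A U Thor <author@example.com> 1546838700 +0000\n"
        b"committer A U Thor <author@example.com> 1546838700 +0000\n"
        b"\n"
        b"Initial commit\n"
    )
    headers, message = parse_commit(sample)
    print(headers["tree"][0])  # the single tree this commit refers to
```

A commit always has exactly one `tree` line; a root commit has no `parent` line, an ordinary commit has one, and a merge has two or more.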

In Git, what is implemented as a file: a blob, a tree, a commit ...

The answer here is both yes and no. As VonC said, these are all just objects. There are four types, those being the three we've named so far, plus "annotated tag". Each object is either stored as a loose object, in which case it is in a file under the .git/objects/ directory, or it is stored as a packed object. Pack files are stored in .git/objects/pack/ (as, at a minimum, pairs of files: a "pack index", and content corresponding to that index). One pack file stores many objects, with delta encoding so that objects can be extracted by, in part, extracting parts of other objects.

(The format of pack files is complex and there have been several revisions.)

The file name of a loose object is its hash ID key, represented in hexadecimal, with the first two characters split from the remaining 38 (for a 40-character SHA-1 hash) by a / separator, so that given a hash ID of 1234567... the object is stored in .git/objects/12/34567....
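
The naming rule can be sketched in a few lines of Python. This assumes a SHA-1 repository and the object encoding used by current Git, where the hash covers a `<type> <size>` header, a NUL byte, and then the raw content. The function names are made up for illustration.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Compute the SHA-1 hash ID Git would assign to an object."""
    # The hashed data is "<type> <size>\0" followed by the content.
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(hash_id: str) -> str:
    """Return where a loose object with this hash ID would be stored."""
    # First two hex characters name the directory, the rest name the file.
    return f".git/objects/{hash_id[:2]}/{hash_id[2:]}"


if __name__ == "__main__":
    oid = git_object_id("blob", b"hello world\n")
    print(oid)                     # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
    print(loose_object_path(oid))  # .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
```

The first printed value should match what `git hash-object` reports for a file containing `hello world` followed by a newline.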

In relatively rare situations, you can have the same object stored as a loose object and in one or more pack files. However, since the object's name is a hash of its content, all copies should match. When Git de-compresses an object, it re-computes the hash as it goes, and declares the data valid if and only if the resulting hash matches the key by which Git looked up the object in the first place. Otherwise, the contents have been corrupted, presumably via storage-media failure.
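
The check described above can be approximated for loose objects as follows. This is a simplified sketch, not Git's actual code: it assumes SHA-1, reads the whole file at once rather than hashing while streaming, and the function name `read_loose_object` is hypothetical.

```python
import hashlib
import zlib
from pathlib import Path
from typing import Tuple


def read_loose_object(objects_dir: Path, hash_id: str) -> Tuple[str, bytes]:
    """Read a loose object and verify it against the key used to find it."""
    path = objects_dir / hash_id[:2] / hash_id[2:]
    raw = zlib.decompress(path.read_bytes())

    # The data is valid only if it hashes back to the lookup key.
    if hashlib.sha1(raw).hexdigest() != hash_id:
        raise ValueError(f"object {hash_id} is corrupt: hash mismatch")

    # Uncompressed layout: b"<type> <size>\0<content>".
    header, _, content = raw.partition(b"\0")
    obj_type, _, size = header.partition(b" ")
    if int(size) != len(content):
        raise ValueError(f"object {hash_id} is corrupt: size mismatch")
    return obj_type.decode("ascii"), content
```

For example, `read_loose_object(Path(".git/objects"), some_hash)` would return a pair such as `("blob", b"...")` for an intact loose object, and raise `ValueError` if the stored bytes no longer match the key.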

(If any one copy is corrupted, you can try all the other copies, but normally it's easier just to go to a separate clone. Git will verify the hash integrity on every clone operation as well. Git does not come with fancy repair tools to attempt to locate a secondary copy, but it does have tools to explode individual pack files, so you can do it manually.)
