74

I'm trying to understand how Git calculates the hash of refs.

$ git ls-remote https://github.com/git/git  

....
29932f3915935d773dc8d52c292cadd81c81071d    refs/tags/v2.4.2
9eabf5b536662000f79978c4d1b6e4eff5c8d785    refs/tags/v2.4.2^{}
....

Clone the repo locally, then inspect the refs/tags/v2.4.2^{} ref by its SHA:

$ git cat-file -p 9eabf5b536662000f79978c4d1b6e4eff5c8d785 

tree 655a20f99af32926cbf6d8fab092506ddd70e49c
parent df08eb357dd7f432c3dcbe0ef4b3212a38b4aeff
author Junio C Hamano <[email protected]> 1432673399 -0700
committer Junio C Hamano <[email protected]> 1432673399 -0700

Git 2.4.2

Signed-off-by: Junio C Hamano <[email protected]>

Copy the decompressed content so that we can hash it. (AFAIK Git uses the uncompressed version when it's hashing)

$ git cat-file -p 9eabf5b536662000f79978c4d1b6e4eff5c8d785 > fi

Let's SHA-1 the content using Git's own hash command

$ git hash-object fi
3cf741bbdbcdeed65e5371912742e854a035e665

Why is the output not [9e]abf5b536662000f79978c4d1b6e4eff5c8d785? I understand the first two characters (9e) are the length in hex. How should I hash the content of fi so that I can get the Git ref abf5b536662000f79978c4d1b6e4eff5c8d785?

2
  • 1
    (1) git hash-object hashes its input as a file (blob) by default, not as other object types. Evidently the type is prepended to the hashed content somehow. And I bet 9e is not a length: the whole line is the hash exactly as the SHA-1 algorithm returns it.
    – max630
    Commented Feb 16, 2016 at 12:08
  • 3
    SHA=9eabf5b536662000f79978c4d1b6e4eff5c8d785; git cat-file -p $SHA | git hash-object -t $(git cat-file -t $SHA) --stdin. Read: You need git hash-object -t commit fi
    – Tino
    Commented Apr 10, 2020 at 23:19

3 Answers

40

As described in "How is git commit sha1 formed", the formula is:

(printf "<type> %s\0" $(git cat-file <type> <ref> | wc -c); git cat-file <type> <ref>)|sha1sum

In the case of the commit 9eabf5b536662000f79978c4d1b6e4eff5c8d785 (which is v2.4.2^{} and references a tree):

(printf "commit %s\0" $(git cat-file commit 9eabf5b536662000f79978c4d1b6e4eff5c8d785 | wc -c); git cat-file commit 9eabf5b536662000f79978c4d1b6e4eff5c8d785 )|sha1sum

That will give 9eabf5b536662000f79978c4d1b6e4eff5c8d785.

As would:

(printf "commit %s\0" $(git cat-file commit 'v2.4.2^{}' | wc -c); git cat-file commit 'v2.4.2^{}')|sha1sum

(still 9eabf5b536662000f79978c4d1b6e4eff5c8d785)

Similarly, computing the SHA1 of the tag v2.4.2 would be:

(printf "tag %s\0" $(git cat-file tag v2.4.2 | wc -c); git cat-file tag v2.4.2)|sha1sum

That would give 29932f3915935d773dc8d52c292cadd81c81071d.

5
  • I'm not sure why but I get different data $ (printf "tree %s\0" $(git cat-file tree 9eabf5b536662000f79978c4d1b6e4eff5c8d785 | wc -c); git cat-file tree 9eabf5b536662000f79978c4d1b6e4eff5c8d785 )|sha1sum 655a20f99af32926cbf6d8fab092506ddd70e49c Commented Feb 16, 2016 at 14:04
  • You mixed commit and tree: use the same type
    – VonC
    Commented Feb 16, 2016 at 14:06
  • Then you should get the same SHA-1. Does it work when using the dereferenced tag, v2.4.2^{}?
    – VonC
    Commented Feb 16, 2016 at 14:07
  • I think the issue is due using <type> and tree . This works (using -pretty and commit). Any idea why it works using commit if it's a 'tree' ? (printf "commit %s\0" $(git cat-file -p 9eabf5b536662000f79978c4d1b6e4eff5c8d785 | wc -c); git cat-file -p 9eabf5b536662000f79978c4d1b6e4eff5c8d785 )|sha1sum 9eabf5b536662000f79978c4d1b6e4eff5c8d785 Commented Feb 16, 2016 at 14:09
  • 1
    @Theuserwithnohat My fault: I tested it in my repo and it worked indeed with commit (which, in your case, referenced the tree 655a20f99af32926cbf6d8fab092506ddd70e49c). I have updated the answer accordingly.
    – VonC
    Commented Feb 16, 2016 at 14:15
38

This video by John Williams gives an overview of what data goes into the calculation of a Git commit hash. Here's a screenshot from the video:

[Screenshot: Git tree]

Reimplementing the commit hash without Git

To get a deeper understanding of this aspect of Git, I reimplemented the steps that produce a Git commit hash in Rust, without using Git. It currently works for getting the hash when committing a single file. The answers here were helpful in achieving this, thanks.

The source code of this answer is available here. Execute it with cargo run.

These are the individual pieces of data we need to compute to arrive at a Git commit hash:

  1. The object ID of the file, which involves hashing the file contents with SHA-1. In Git, hash-object provides this ID.
  2. The object entries that go into the tree object. In Git, you can get an idea of those entries with ls-tree, but their format in the tree object is slightly different: [mode] [file name]\0[object ID]
  3. The hash of the tree object which has the form: tree [size of object entries]\0[object entries]. In Git, get the tree hash with: git cat-file commit HEAD | head -n1
  4. The commit hash by hashing the data you see with cat-file. This includes the tree object hash and commit information like author, time, commit message, and the parent commit hash if it's not the first commit.

Each step depends on the previous one. Let's start with the first.

Get the object ID of the file

The first step is to reimplement Git's hash-object, as in git hash-object your_file.

We create the object hash from our file by concatenating and hashing these data:

  • The string "blob " at the beginning (mind the trailing space), followed by
  • the number of bytes in the file, followed by
  • a null byte, expressed with \0 in printf and Rust, followed by
  • the file content.

In Bash:

file_name="your_file";
# Stream the file with cat: $(cat ...) would drop trailing newlines that wc -c counts
{ printf 'blob %d\0' "$(wc -c < "$file_name")"; cat "$file_name"; } | sha1sum

In Rust:

// Get the object ID
fn git_hash_object(file_content: &[u8]) -> Vec<u8> {
    let file_size = file_content.len().to_string();
    let hash_input = [
        "blob ".as_bytes(),
        file_size.as_bytes(),
        b"\0",
        file_content,
    ]
    .concat();
    to_sha1(&hash_input)
}

I'm using crate sha1 version 0.10.5 in to_sha1:

fn to_sha1(hash_me: &[u8]) -> Vec<u8> {
    use sha1::{Digest, Sha1};

    let mut hasher = Sha1::new();
    hasher.update(hash_me);
    hasher.finalize().to_vec()
}

Get the object entry of the file

Object entries are part of Git's tree object. Tree objects represent files and directories.

Object entries for files have this form: [mode] [file name]\0[object ID]

We assume the file is a regular, non-executable file, which translates to mode 100644 in Git. See this for more on modes.

This Rust function takes the result of the previous function git_hash_object as the parameter object_id:

fn object_entry(file_name: &str, object_id: &[u8]) -> Vec<u8> {
    // It's a regular, non-executable file
    let mode = "100644";

    // [mode] [file name]\0[object ID]
    let object_entry = [
        mode.as_bytes(),
        b" ",
        file_name.as_bytes(),
        b"\0",
        object_id,
    ]
    .concat();

    object_entry
}

I tried to write the equivalent of object_entry in Bash, but Bash variables cannot contain null bytes. There are probably ways around that limitation, but I decided for now that if I can't have variables in Bash, the code would get quite difficult to understand. Edits providing a readable Bash equivalent are welcome.

Get the tree object hash

As mentioned above, tree objects represent files and directories in Git. You can see the hash of your tree object by running, for example, git cat-file commit HEAD | head -n1.

The tree object has this form: tree [size of object entries]\0[object entries]

In our case we only have a single object_entry, calculated in the previous step:

fn tree_object_hash(object_entry: &[u8]) -> String {
    let object_entry_size = object_entry.len().to_string();

    let tree_object = [
        "tree ".as_bytes(),
        object_entry_size.as_bytes(),
        b"\0",
        object_entry,
    ]
    .concat();

    to_hex_str(&to_sha1(&tree_object))
}

Where to_hex_str is defined as:

// Converts bytes to their hexadecimal representation.
fn to_hex_str(bytes: &[u8]) -> String {
    bytes.iter().map(|byte| format!("{byte:02x}")).collect()
}

In a Git repo, you can look at the contents of the tree object with ls-tree. For example, running git ls-tree HEAD will produce lines like these:

100644 blob b8c0d74ef5ccd3dab583add7b3f5367efe4bf823    your_file

While those lines contain the data of an object entry (the mode, the object ID, and the file name), they are in a different order and include a tab character as well as the string "blob" which is input to the object ID, not the object entry. Object entries have this form: [mode] [file name]\0[object ID]
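For a tree with more than one file, the same entry format simply repeats back to back. Below is a minimal sketch of that, not part of the reimplementation above. The names TreeEntry, hex_to_bytes, and tree_object_body are hypothetical, and it assumes a flat tree of regular files, for which Git orders entries by comparing the bytes of their names. Subdirectories sort slightly differently and are not handled here.

```rust
/// One line of `git ls-tree` output, reduced to what the tree object stores.
/// (Hypothetical helper type, for illustration only.)
struct TreeEntry<'a> {
    mode: &'a str,          // e.g. "100644"
    name: &'a str,          // file name
    object_id_hex: &'a str, // 40 hex characters, as printed by ls-tree
}

/// Decodes a hex string into raw bytes.
/// Returns None for odd lengths or non-hex characters.
fn hex_to_bytes(hex: &str) -> Option<Vec<u8>> {
    if hex.len() % 2 != 0 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    (0..hex.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&hex[i..i + 2], 16).ok())
        .collect()
}

/// Concatenates `[mode] [file name]\0[object ID]` for every entry.
/// Assumption: a flat tree of files, sorted by the bytes of their names.
fn tree_object_body(entries: &[TreeEntry<'_>]) -> Option<Vec<u8>> {
    let mut sorted: Vec<&TreeEntry> = entries.iter().collect();
    sorted.sort_by(|a, b| a.name.as_bytes().cmp(b.name.as_bytes()));

    let mut body = Vec::new();
    for entry in sorted {
        body.extend_from_slice(entry.mode.as_bytes());
        body.push(b' ');
        body.extend_from_slice(entry.name.as_bytes());
        body.push(0);
        // The object ID is stored as 20 raw bytes, not as 40 hex characters
        body.extend_from_slice(&hex_to_bytes(entry.object_id_hex)?);
    }
    Some(body)
}
```

The size that goes into the tree header is the byte length of this body, so two entries named a.txt and b.txt with mode 100644 take 2 × (6 + 1 + 5 + 1 + 20) = 66 bytes, regardless of how large the files themselves are.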

Get the commit hash

The last step creates the commit hash.

The data we hash using SHA-1 includes:

  • Tree object hash from the previous step.
  • Hash of the parent commit if the commit is not the very first one in the repo.
  • Author name and authoring date.
  • Committer name and committing date.
  • Commit message.

You can see all of that data with git cat-file commit HEAD, for example:

tree a76b2df314b47956268b0c39c88a3b2365fb87eb
parent 9881a96ab93a3493c4f5002f17b4a1ba3308b58b
author Matthias Braun <[email protected]> 1625338354 +0200
committer Matthias Braun <[email protected]> 1625338354 +0200

Second commit (that's the commit message)

You might have guessed that 1625338354 is a timestamp. In this case it's the number of seconds since the Unix epoch. You can convert from the date and time format of git log, such as "Sat Jul 3 20:52:34 2021", to Unix epoch seconds with date:

date --date='Sat Jul 3 20:52:34 2021' +"%s"

The time zone is denoted as +0200 in this example.
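The timestamp part of the author and committer lines can also be assembled from epoch seconds and a UTC offset. This is a minimal sketch with a hypothetical helper name, timestamp_and_timezone, and it assumes the offset is given in minutes east of UTC:

```rust
/// Formats epoch seconds and a UTC offset the way they appear in
/// `git cat-file commit` output, e.g. "1625338354 +0200".
/// (Hypothetical helper; the offset is assumed to be in minutes.)
fn timestamp_and_timezone(unix_seconds: i64, utc_offset_minutes: i32) -> String {
    let sign = if utc_offset_minutes < 0 { '-' } else { '+' };
    let abs = utc_offset_minutes.abs();
    // Hours and minutes are zero-padded to two digits each
    format!("{} {}{:02}{:02}", unix_seconds, sign, abs / 60, abs % 60)
}
```

For example, timestamp_and_timezone(1625338354, 120) yields "1625338354 +0200", the same string that follows the name and email in the author line above.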

Based on the output of cat-file, you can create the Git commit hash using this Bash command (which uses git cat-file, so it's no reimplementation):

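cat_file_output=$(git cat-file commit HEAD);
# Pass the data as printf arguments so that % or \ in the commit message is not interpreted
printf 'commit %d\0%s\n' "$(wc -c <<< "$cat_file_output")" "$cat_file_output" | sha1sum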

The Bash command illustrates that—similar to the steps before—what we hash is:

  • A leading string, "commit " in this step, followed by
  • the size of a bunch of data. Here it's the output of cat-file which is detailed above. Followed by
  • a null byte, followed by
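  • the data itself (output of cat-file) with a line break at the end. The line break restores the trailing newline that Bash's $(...) strips from the output of cat-file.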
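Since blobs, trees, and commits all share this framing, it can be captured in one helper. This is only a sketch and the name framed_object is hypothetical. It builds the bytes that get passed to SHA-1 and leaves the hashing itself to to_sha1:

```rust
/// Builds the bytes Git hashes for an object: "<type> <size>\0<body>".
/// `object_type` is "blob", "tree", "commit", or "tag".
/// (Hypothetical helper, for illustration only.)
fn framed_object(object_type: &str, body: &[u8]) -> Vec<u8> {
    // The size is the body length in bytes, written as a decimal string
    let header = format!("{} {}\0", object_type, body.len());
    [header.as_bytes(), body].concat()
}
```

The same body framed as "blob" and as "commit" gives different input bytes and therefore different hashes, which is why hashing the output of cat-file as a plain file does not reproduce the commit hash.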

In case you kept score: Creating a Git commit hash involves using SHA-1 at least three times.

Below is the Rust function for creating the Git commit hash. It uses the tree_object_hash produced in the previous step and a struct CommitMetaData which contains the rest of the data you see when calling git cat-file commit HEAD. The function also takes care of whether the commit has a parent commit or not.

fn commit_hash(commit: &CommitMetaData, tree_object_hash: &str) -> Vec<u8> {
    let author_line = format!(
        "{} {}",
        commit.author_name_and_email, commit.author_timestamp_and_timezone
    );
    let committer_line = format!(
        "{} {}",
        commit.committer_name_and_email, commit.committer_timestamp_and_timezone
    );

    // If it's the first commit, which has no parent,
    // the line starting with "parent" is omitted
    let parent_commit_line = match commit.parent_commit_hash {
        Some(parent_commit_hash) => format!("\nparent {parent_commit_hash}"),
        None => "".to_string(),
    };
    let git_cat_file_str = format!(
        "tree {}{}\nauthor {}\ncommitter {}\n\n{}\n",
        tree_object_hash, parent_commit_line, author_line, committer_line, commit.commit_message
    );

    let git_cat_file_len = git_cat_file_str.len().to_string();

    let commit_object = [
        "commit ".as_bytes(),
        git_cat_file_len.as_bytes(),
        b"\0",
        git_cat_file_str.as_bytes(),
    ]
    .concat();

    // Return the Git commit hash
    to_sha1(&commit_object)
}

Here's CommitMetaData:

#[derive(Debug, Copy, Clone)]
pub struct CommitMetaData<'a> {
    pub(crate) author_name_and_email: &'a str,
    pub(crate) author_timestamp_and_timezone: &'a str,
    pub(crate) committer_name_and_email: &'a str,
    pub(crate) committer_timestamp_and_timezone: &'a str,
    pub(crate) commit_message: &'a str,
    // All commits after the first one have a parent commit
    pub(crate) parent_commit_hash: Option<&'a str>,
}

This function creates CommitMetaData where author and committer info are identical, which will be convenient when we run the program later:

pub fn simple_commit<'a>(
    author_name_and_email: &'a str,
    author_timestamp_and_timezone: &'a str,
    commit_message: &'a str,
    parent_commit_hash: Option<&'a str>,
) -> CommitMetaData<'a> {
    CommitMetaData {
        author_name_and_email,
        author_timestamp_and_timezone,
        committer_name_and_email: author_name_and_email,
        committer_timestamp_and_timezone: author_timestamp_and_timezone,
        commit_message,
        parent_commit_hash,
    }
}

Putting it all together

As a summary and reminder, creating a Git commit hash consists of getting:

  1. The object ID of the file, which involves hashing the file contents with SHA-1. In Git, hash-object provides this ID.
  2. The object entries that go into the tree object. In Git, you can get an idea of those entries with ls-tree, but their format in the tree object is slightly different: [mode] [file name]\0[object ID]
  3. The hash of the tree object which has the form: tree [size of object entries]\0[object entries]. In Git, get the tree hash with: git cat-file commit HEAD | head -n1
  4. The commit hash by hashing the data you see with cat-file. This includes the tree object hash and commit information like author, time, commit message, and the parent commit hash if it's not the first commit.

In Rust:

pub fn get_commit_hash(
    file_name: &str,
    file_content: &[u8],
    commit: &CommitMetaData
) -> String {
    let file_object_id = git_hash_object(file_content);
    let object_entry = object_entry(file_name, &file_object_id);
    let tree_object_hash = tree_object_hash(&object_entry);

    let commit_hash = commit_hash(commit, &tree_object_hash);
    to_hex_str(&commit_hash)
}

With the functions above, you can create a file's Git commit hash in Rust, without Git:

use std::{fs, io};

fn main() -> io::Result<()> {
    let file_name = "your_file";
    let file_content = fs::read(file_name)?;

    let first_commit = simple_commit(
        "Firstname Lastname <[email protected]>",
        // Timestamp calculated using: date --date='Wed Jun 23 18:02:18 2021' +"%s"
        "1624464138 +0200",
        "Message of first commit",
        // No parent commit hash since this is the first commit
        None,
    );

    let first_commit_hash = get_commit_hash(file_name, &file_content, &first_commit);
    println!("Git commit hash: {first_commit_hash}");
    Ok(())
}

To create the hash of the second commit, you take the hash of the first commit and put it into the CommitMetaData of the second commit:

let second_commit = simple_commit(
    "Firstname Lastname <[email protected]>",
    "1625388354 +0200",
    "Message of second commit",
    // The first commit is the parent of the second commit
    Some(first_commit_hash.as_str()),
);

Apart from the other answers here and their links, these were some useful resources in creating my limited reimplementation:

  • Reimplementation of git hash-object in JavaScript.
  • Format of a Git tree object. This is the next place I'd look if I wanted to make my reimplementation more complete, so that it works with commits involving more than one file.
2
  • You mentioned that the tree size is calculated from the size of the object entries. By that, do you literally mean the sum of all the objects' characters? For example, if a tree has two objects whose header is blob 10\0, then the tree size would be 20? Commented Oct 19, 2023 at 18:16
  • You'd sum the bytes of each object entry not the file sizes. Note that I haven't tested that, I only implemented it for a single object entry. "if a tree has two objects which header is blob 10\0, then the tree size would be 20?" I don't think so, since each object entry has the format [mode] [file name]\0[object ID] and the number after "blob" refers to the file size which is part of the data that gets hashed to become the object ID. Commented Oct 31, 2023 at 8:47
12

There's a bit of confusion here. Git uses different types of objects: blobs, trees, commits, and (annotated) tags. The following command:

git cat-file -t <hash>

Tells you the type of object for a given hash. So in your example, the hash 9eabf5b536662000f79978c4d1b6e4eff5c8d785 corresponds to a commit object.

Now, as you figured out yourself, running this:

git cat-file -p 9eabf5b536662000f79978c4d1b6e4eff5c8d785

Gives you the content of the object according to its type (in this instance, a commit).

But, this:

git hash-object fi

...computes the hash for a blob whose content is the output of the previous command (in your example), but it could be anything else (like "hello world!"). Try this, where the number after "blob" is the size of fi in bytes:

{ printf 'blob %d\0' "$(wc -c < fi)"; cat fi; } | shasum

The output is the same as that of git hash-object fi. This is basically how Git hashes a blob. So by hashing fi, you are generating a blob object. But as we have seen, 9eabf5b536662000f79978c4d1b6e4eff5c8d785 is a commit, not a blob. So you cannot hash fi as a blob and expect to get the same hash; it has to be hashed as a commit, for example with git hash-object -t commit fi.

A commit's hash is based on several other pieces of information that make it unique (such as the committer, the author, the date, etc.). The following article tells you exactly what a commit hash is made of:

The anatomy of a git commit

So you could get the same hash by providing all the data specified in the article with the exact same values as those used in the original commit.

This might be helpful as well:

Git from the bottom up

2
  • 2
    echo "blob 277\0$(cat fi)" | shasum produced different results than git hash-object fi for me for two reasons: First, I didn't know that 277 refers to the size of fi and the size of my particular fi is not equal to 277. Second, the GNU coreutils version of echo adds a newline and doesn't escape \0 to mean the NUL byte (echo -en fixes this). The following command produces the same result as git hash-object fi: printf "blob $(wc -c < fi)\0$(cat fi)" | sha1sum. Commented Jun 23, 2021 at 18:44
  • So it is "blob ${filesizeinbytes}\0${filecontent}". This command worked fine in Git Bash under Windows: stat --printf="blob %s\0" $FILENAME | cat - $FILENAME | sha1sum. Commented Sep 21, 2022 at 11:48
