Git index and commit is very slow

Question

I have a local git repository and use git add file1 file2 file3... to add my modifications to the git index. Afterwards I use a plain git commit. Each command takes around 3 to 6 seconds. My repository has around 150,000 commits.

I already ran git gc, since I assumed it would perform some garbage collection. The SSD is quite fast. I am wondering which knobs I can turn in git to speed up these two commands. Any advice?

git version 2.21.0 (Apple Git-122.2)

My system is a Mac Pro running macOS 10.14.6. I use an SSD with APFS. No antivirus software (or any other interfering scanning software) is installed.

The number of commits shouldn't make a difference, but the number of files being changed might. How many files are you adding? Do both git add and git commit take 3 to 6 seconds, or is it just git commit? Are other commands like git log or git diff also very slow? — Schwern, Commented Jan 19, 2020 at 20:01
You can run GIT_TRACE_PERFORMANCE=1 git <cmd> to see where Git spends its time. — Henri Menke, Commented Jan 19, 2020 at 22:05
And also update git because each version introduced perf improvements especially for big repositories... — Philippe, Commented Jan 19, 2020 at 22:36
What sort of files are you adding? Git will compress objects using zlib whenever you make a commit, and if your files are binary objects then this compression is going to burn some CPU cycles. For this reason git really should only be used with text files. If you're adding binary objects like images, media, application binaries/archives etc, you should use git-lfs so that git can store these objects uncompressed in a separate location. — daveruinseverything, Commented Jan 20, 2020 at 0:42

VonC · Accepted Answer · 2023-01-20 20:53:04Z

First, update to the latest Git 2.25: performance issues are resolved with each new version.

To investigate performance issues, set the GIT_TRACE2_PERF environment variable to 1 and run the git command. See this SO answer for details about the trace2 feature and how to interpret the output table.

(In Bash you can set a variable and run a command in the same line)

GIT_TRACE2_PERF=1 git commit -m "test"

In Windows command-prompt, you'll need to use SET:

SET GIT_TRACE2_PERF=1
git commit -m "test"

Or, in a CMD in one line:

cmd /V /C "SET "GIT_TRACE2_PERF=1" && git commit -m "test""

For example, on Windows, you'll see output that looks like this:

C:\git\me\foobar>SET GIT_TRACE2_PERF=1

C:\git\me\foobar>git status
17:23:13.056175 common-main.c:48             | d0 | main                     | version      |     |           |           |              | 2.31.1.windows.1
17:23:13.056175 common-main.c:49             | d0 | main                     | start        |     |  0.003356 |           |              | git.exe status
17:23:13.065174 ..._win32_process_info.c:118 | d0 | main                     | data_json    | r0  |  0.012053 |  0.012053 | process      | windows/ancestry:["git.exe","cmd.exe","explorer.exe"]
17:23:13.066174 repository.c:130             | d0 | main                     | def_repo     | r1  |           |           |              | worktree:C:/git/me/foobar
17:23:13.067174 git.c:448                    | d0 | main                     | cmd_name     |     |           |           |              | status (status)
17:23:13.068174 read-cache.c:2324            | d0 | main                     | region_enter | r1  |  0.015462 |           | index        | label:do_read_index .git/index
17:23:13.069175 cache-tree.c:598             | d0 | main                     | region_enter | r1  |  0.015809 |           | cache_tree   | ..label:read
17:23:13.069175 cache-tree.c:600             | d0 | main                     | region_leave | r1  |  0.016021 |  0.000212 | cache_tree   | ..label:read
17:23:13.069175 read-cache.c:2284            | d0 | main                     | data         | r1  |  0.016056 |  0.000594 | index        | ..read/version:2
17:23:13.069175 read-cache.c:2286            | d0 | main                     | data         | r1  |  0.016065 |  0.000603 | index        | ..read/cache_nr:3808
17:23:13.069175 read-cache.c:2329            | d0 | main                     | region_leave | r1  |  0.016072 |  0.000610 | index        | label:do_read_index .git/index

Note the wall-clock times in the leftmost column, total time since start in the 7th column, and total time for each sub-operation in the 8th column.
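
To find the expensive steps quickly, the trace can be written to a file and sorted by the per-region time. The following is only a sketch: it assumes the pipe-separated layout shown above (region_leave events, with the per-region time in the seventh pipe-separated field), and slowest_regions is a hypothetical helper name, not part of Git.

```shell
# Print the N slowest regions from a trace2 "perf" log, slowest first.
# Usage: slowest_regions <trace-file> [N]     (N defaults to 10)
#
# Assumed layout, split on "|":
#   $4 = event name, $7 = time spent in the region, $8 = category, $9 = label
slowest_regions() {
    awk -F'|' '
        $4 ~ /region_leave/ {
            # Strip the padding around the fields we print.
            gsub(/^ +| +$/, "", $7)
            gsub(/^ +| +$/, "", $8)
            gsub(/^ +| +$/, "", $9)
            print $7, $8, $9
        }' "$1" | LC_ALL=C sort -rn | head -n "${2:-10}"
}

# Example (inside a repository; GIT_TRACE2_PERF needs an absolute path
# to write to a file instead of stderr):
#   GIT_TRACE2_PERF="$PWD/trace.txt" git status
#   slowest_regions trace.txt 5
```

Only region_leave lines are used because they carry the elapsed time of a whole region; the data lines in between report values, not durations of their own.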

Note that with Git 2.40 (Q1 2023), you can accelerate writing to the index.

Git 2.40 introduces an optional configuration to allow skipping the trailing hash that protects the index file from bit flipping.

See commit 17194b1, commit da9acde, commit ee1f0c2, commit 1687150 (06 Jan 2023) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit ffd9238, 16 Jan 2023)

hashfile: allow skipping the hash function

Co-authored-by: Kevin Willford
Signed-off-by: Kevin Willford
Signed-off-by: Derrick Stolee

The hashfile API is useful for generating files that include a trailing hash of the file's contents up to that point.
Using such a hash is helpful for verifying the file for corruption-at-rest, such as a faulty drive causing flipped bits.

Git's index file includes this trailing hash, so it uses a 'struct hashfile' to handle the I/O to the file.
This was very convenient to allow using the hashfile methods during these operations.

However, hashing the file contents during write comes at a performance penalty.
It's slower to hash the bytes on their way to the disk than without that step.
This problem is made worse by the replacement of hardware-accelerated SHA1 computations with the software-based sha1dc computation.

This write cost is significant, and the checksum capability is likely not worth that cost for such a short-lived file.
The index is rewritten frequently and the only time the checksum is checked is during 'git fsck'.
Thus, it would be helpful to allow a user to opt-out of the hash computation.

We first need to allow Git to opt-out of the hash computation in the hashfile API.
The buffered writes of the API are still helpful, so it makes sense to make the change here.

Introduce a new 'skip_hash' option to 'struct hashfile'.
When set, the update_fn and final_fn members of the_hash_algo are skipped.
When finalizing the hashfile, the trailing hash is replaced with the null hash.

This use of a trailing null hash would be desirable in either case, since we do not want to special case a file format to have a different length depending on whether it was hashed or not.
When the final bytes of a file are all zero, we can infer that it was written without hashing, and thus that verification is not available as a check for file consistency.
This also means that we could easily toggle hashing for any file format we desire.

A version of this patch has existed in the microsoft/git fork since 2017 (the linked commit was rebased in 2018, but the original dates back to January 2017).
Here, the change to make the index use this fast path is delayed until a later change.
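
The commit message above says that a file whose final bytes are all zero was written without hashing. A minimal sketch of that check for an index file follows. It assumes a SHA-1 repository, where the trailing hash is 20 bytes long, and index_hash_skipped is a hypothetical helper name.

```shell
# Succeeds (exit status 0) when the last 20 bytes of the given file are all zero.
# Assumption: SHA-1 repository, so the trailing hash occupies 20 bytes.
index_hash_skipped() {
    # Dump the last 20 bytes as hex and drop all whitespace: 40 hex digits.
    trailer=$(tail -c 20 "$1" | od -An -v -tx1 | tr -d '[:space:]')
    # All-zero trailer: exactly 40 characters, and nothing left once "0" is removed.
    [ "${#trailer}" -eq 40 ] && [ -z "$(printf '%s' "$trailer" | tr -d '0')" ]
}

# Example (inside a repository):
#   index_hash_skipped .git/index && echo "index was written without a hash"
```

The length test keeps files shorter than 20 bytes from being reported as skipped.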

And:

read-cache: add index.skipHash config option

Signed-off-by: Derrick Stolee

The previous change allowed skipping the hashing portion of the hashwrite API, using it instead as a buffered write API.
Disabling the hashwrite can be particularly helpful when the write operation is in a critical path.

One such critical path is the writing of the index.
This operation is so critical that the sparse index was created specifically to reduce the size of the index to make these writes (and reads) faster.

This trade-off between file stability at rest and write-time performance is not easy to balance.
The index is an interesting case for a couple reasons:

Writes block users.
Writing the index takes place in many user-blocking foreground operations. The speed improvement directly impacts their use. Other file formats are typically written in the background (commit-graph, multi-pack-index) or are super-critical to correctness (pack-files).

Index files are short lived.
It is rare that a user leaves an index for a long time with many staged changes. Outside of staged changes, the index can be completely destroyed and rewritten with minimal impact to the user.

Following a similar approach to one used in the microsoft/git fork, add a new config option (index.skipHash) that allows disabling this hashing during the index write.
The cost is that we can no longer validate the contents for corruption-at-rest using the trailing hash.

We load this config from the repository config given by istate->repo, with a fallback to the_repository if it is not set.

While older Git versions will not recognize the null hash as a special case, the file format itself is still being met in terms of its structure.
Using this null hash will still allow Git operations to function across older versions.

The one exception is 'git fsck' which checks the hash of the index file.
This used to be a check on every index read, but was split out to just the index in a33fc72 (read-cache: force_verify_index_checksum, 2017-04-14, Git v2.13.0-rc1 -- merge) and released first in Git 2.13.0.
Document the versions that relaxed these restrictions, with the optimistic expectation that this change will be included in Git 2.40.0.

Here, we disable this check if the trailing hash is all zeroes.
We add a warning to the config option that this may cause undesirable behavior with older Git versions.

As a quick comparison, I tested 'git update-index --force-write' with and without index.skipHash=true on a copy of the Linux kernel repository.

Benchmark 1: with hash
    Time (mean ± σ):      46.3 ms ±  13.8 ms    [User: 34.3 ms, System: 11.9 ms]
    Range (min … max):    34.3 ms …  79.1 ms    82 runs

Benchmark 2: without hash
    Time (mean ± σ):      26.0 ms ±   7.9 ms    [User: 11.8 ms, System: 14.2 ms]
    Range (min … max):    16.3 ms …  42.0 ms    69 runs

Summary
    'without hash' ran
      1.78 ± 0.76 times faster than 'with hash'

These performance benefits are substantial enough to allow users the ability to opt-in to this feature, even with the potential confusion with older 'git fsck' versions.

git config now includes in its man page:

index.skipHash

When enabled, do not compute the trailing hash for the index file. This accelerates Git commands that manipulate the index, such as git add, git commit, or git status.

Instead of storing the checksum, write a trailing set of bytes with value zero, indicating that the computation was skipped.

If you enable index.skipHash, then Git clients older than 2.13.0 will refuse to parse the index and Git clients older than 2.40.0 will report an error during git fsck.
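
Given that compatibility caveat, it may be worth checking the Git version before turning the option on. The following is a sketch, not an official recipe: git_version_at_least and enable_skip_hash are hypothetical helper names, and the parsing assumes the "git version X.Y..." format that git --version prints in the question.

```shell
# Succeeds when the "git --version" output in $1 is at least version $2.$3.
# Usage: git_version_at_least "git version 2.21.0 (Apple Git-122.2)" 2 40
git_version_at_least() {
    ver=${1#git version }      # e.g. "2.21.0 (Apple Git-122.2)"
    major=${ver%%.*}           # text before the first dot, e.g. "2"
    rest=${ver#*.}             # text after the first dot, e.g. "21.0 (Apple ...)"
    minor=${rest%%[!0-9]*}     # leading digits of the rest, e.g. "21"
    [ "$major" -gt "$2" ] || { [ "$major" -eq "$2" ] && [ "$minor" -ge "$3" ]; }
}

# Enable index.skipHash for the current repository, but only on Git 2.40+.
enable_skip_hash() {
    if git_version_at_least "$(git --version)" 2 40; then
        git config index.skipHash true
    else
        echo "Git 2.40 or newer is needed for index.skipHash" >&2
        return 1
    fi
}

# Example (inside the slow repository):
#   enable_skip_hash
```

The setting is stored per repository here; the next command that rewrites the index (git add, git commit, git status) should pick it up.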
