262

How often should you use git-gc?

The manual page simply says:

Users are encouraged to run this task on a regular basis within each repository to maintain good disk space utilization and good operating performance.

Are there some commands to get some object counts to find out whether it's time to gc?

2

11 Answers

231

It depends mostly on how much the repository is used. With one user checking in once a day and a branch/merge/etc operation once a week you probably don't need to run it more than once a year.

With several dozen developers working on several dozen projects, each checking in 2-3 times a day, you might want to run it nightly.

It won't hurt to run it more frequently than needed, though.

What I'd do is run it now, then a week from now take a measurement of disk utilization, run it again, and measure disk utilization again. If it drops 5% in size, then run it once a week. If it drops more, then run it more frequently. If it drops less, then run it less frequently.
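As a rough sketch of that measurement: the helpers below (the names repo_size_kib and gc_savings are made up for illustration) add up the size and size-pack fields that git count-objects -v reports in KiB, run git gc, and print the percentage saved. This assumes a git recent enough to support the -C option.

```shell
#!/bin/sh
# Print the size of a repository's object store in KiB: loose objects
# ("size") plus packfiles ("size-pack"), as reported by count-objects -v.
repo_size_kib() {
    git -C "${1:-.}" count-objects -v |
        awk '/^size(-pack)?:/ { total += $2 } END { print total + 0 }'
}

# Run gc in the given repository (default: current directory) and
# report how much the object store shrank.
gc_savings() {
    repo=${1:-.}
    before=$(repo_size_kib "$repo")
    git -C "$repo" gc --quiet
    after=$(repo_size_kib "$repo")
    if [ "$before" -gt 0 ]; then
        echo "before=${before}KiB after=${after}KiB saved=$(( (before - after) * 100 / before ))%"
    else
        echo "before=0KiB after=${after}KiB"
    fi
}
```

Calling gc_savings /path/to/repo once a week and noting the saved figure gives you the number to compare against the 5% rule of thumb.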

3
  • 23
    Manual says "Some git commands run git gc --auto after performing operations that could create many loose objects." Anyone know which commands actually run it? Commented Jul 8, 2014 at 15:36
  • 2
    A large git rebase is an obvious example, since many commits are rewritten into a new history, leaving lots of old commits in your repo which are no longer part of the current branch
    – mafrosis
    Commented Nov 5, 2014 at 6:11
  • 30
    "It won't hurt to run it more frequently than needed"... I don't entirely agree. As Aristotle points out, dangling commits can make a good backup mechanism. Commented Nov 22, 2014 at 16:10
125

Note that the downside of garbage-collecting your repository is that, well, the garbage gets collected. As we all know as computer users, files we consider garbage right now might turn out to be very valuable three days in the future. The fact that git keeps most of its debris around has saved my bacon several times – by browsing all the dangling commits, I have recovered much work that I had accidentally canned.
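For anyone who wants to try that kind of recovery, here is a minimal sketch. The function names and the default branch name "rescued" are hypothetical; git fsck --no-reflogs is used so that commits remembered only by the reflog are listed too.

```shell
#!/bin/sh
# List commits that no branch or tag reaches. --no-reflogs also reports
# commits that only the reflog still remembers. LC_ALL=C keeps the
# "dangling commit" wording untranslated so awk can match it.
list_dangling_commits() {
    LC_ALL=C git -C "${1:-.}" fsck --no-reflogs 2>/dev/null |
        awk '$1 == "dangling" && $2 == "commit" { print $3 }'
}

# Pin a dangling commit to a branch so that gc can no longer prune it.
# Arguments: commit id, repository (default "."), branch name
# (default "rescued", a made-up name).
rescue_commit() {
    git -C "${2:-.}" branch "${3:-rescued}" "$1"
}
```

Inspect each candidate with git show before rescuing it; once a branch points at the commit, it is ordinary history again.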

So don’t be too much of a neat freak in your private clones. There’s little need for it.

OTOH, the value of data recoverability is questionable for repos used mainly as remotes, e.g. the place all the devs push to and/or pull from. There, it might be sensible to kick off a GC run and a repacking frequently.

1
  • 46
    FWIW not all loose objects are garbage collected, only those older than 2 weeks by default (cf. git gc --help, specifically the --prune option). There is also mention of gc.reflogExpire, which leads me to believe that any commitish you've visited in the last 90 days will not be collected. (My git version: v1.7.6)
    – RobM
    Commented Dec 6, 2011 at 18:28
34

Recent versions of git run gc automatically when required, so you shouldn't have to do anything. See the Options section of man git-gc(1): "Some git commands run git gc --auto after performing operations that could create many loose objects."
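To see how close a repository is to triggering that automatic run, something like the sketch below may help. It compares the counts from git count-objects -v with the gc.auto and gc.autoPackLimit settings, falling back to the documented defaults of 6700 and 50 when they are unset. The function name is made up, and git's own check only estimates the loose object count, so treat the numbers as approximate.

```shell
#!/bin/sh
# Show the loose-object and pack counts next to the limits that
# gc --auto compares against. Unset limits fall back to git's
# documented defaults (6700 loose objects, 50 packs).
gc_auto_status() {
    repo=${1:-.}
    limit=$(git -C "$repo" config --get gc.auto || echo 6700)
    pack_limit=$(git -C "$repo" config --get gc.autoPackLimit || echo 50)
    loose=$(git -C "$repo" count-objects -v | awk '/^count:/ { print $2 }')
    packs=$(git -C "$repo" count-objects -v | awk '/^packs:/ { print $2 }')
    echo "loose=$loose/$limit packs=$packs/$pack_limit"
}
```

A repository that stays well under both limits will not be touched by gc --auto, which would explain large savings from a manual run.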

3
  • 21
    I just ran it for the first time on a several-year-old repository, and my .git went from 16M to 2.9M, an 82% reduction in size. It therefore still seems useful to manually run the command. Commented Apr 16, 2015 at 22:00
  • @DarshanRivkaWhittle had you updated git in those several years? Commented Apr 3, 2019 at 19:57
  • 4
    @std''OrgnlDave Yeah, I was always running whatever version was current on Arch. I just ran it again, maybe for the first time since my last comment (thanks to your comment reminding me), and my .git went from 81M to 13M. I must not run any of the commands that run gc --auto, I guess. Commented Apr 3, 2019 at 20:46
23

If you're using Git-Gui, it tells you when you should worry:

This repository currently has approximately 1500 loose objects.

The following command will report a similar number:

$ git count-objects

However, judging from its source, git-gui does the math itself: it counts entries under the .git/objects folder and probably arrives at an approximation (I don't know Tcl well enough to read it properly!).

In any case, it seems to give the warning based on an arbitrary number around 300 loose objects.

3
  • Indeed it does warn, but when you let it run gc, most of the time gc won't do a thing. So relying on git gui means waiting for more than 6000-something loose objects while always having to click either run gc (and wait a minute) or cancel :/ Probably someone should fix git gui so that it checks the max loose object count and doesn't show the dialog until the count reaches the limit.
    – mlatu
    Commented Feb 13, 2014 at 10:30
  • Yes @mlatu I agree. When I wrote this I just wanted to bring attention to it. Both Git-Gui and count-objects are not exactly good answers to the question here... But they should be!
    – cregox
    Commented Feb 13, 2014 at 11:05
  • I didn't mean that this is a bad answer, just wanted to point out that most of the time git gui does nothing. Though I suppose git gc doesn't do much either, except when there is enough to do or you used the aggressive switch.
    – mlatu
    Commented Feb 13, 2014 at 13:47
8

Drop it in a cron job that runs every night (afternoon?) when you're sleeping.
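A minimal sketch of what such a job could run, assuming all repositories sit directly under one parent directory. The function name, the script path and /srv/git are made up for illustration.

```shell
#!/bin/sh
# Run gc in every repository found directly under a parent directory.
# Bare (foo.git) and non-bare (foo/.git) layouts are both handled.
# Caveat: if the parent directory itself lies inside a repository,
# plain subdirectories resolve to that outer repository.
gc_all_repos() {
    for dir in "$1"/*/; do
        if git -C "$dir" rev-parse --git-dir >/dev/null 2>&1; then
            git -C "$dir" gc --quiet || echo "gc failed in $dir" >&2
        fi
    done
}

# Hypothetical crontab entry, for a script that defines the function
# above and ends with: gc_all_repos "$1"
# 0 3 * * * /usr/local/bin/nightly-gc.sh /srv/git
```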

1
  • Should I do it for all GitHub repositories?
    – alper
    Commented Jan 17, 2021 at 13:18
8

You can do it without any interruption, with the new (Git 2.0 Q2 2014) setting gc.autodetach.

See commit 4c4ac4d and commit 9f673f9 (Nguyễn Thái Ngọc Duy, aka pclouds):

gc --auto takes time and can block the user temporarily (but not any less annoyingly).
Make it run in background on systems that support it.
The only thing lost with running in background is printouts. But gc output is not really interesting.
You can keep it in foreground by changing gc.autodetach.


Since that 2.0 release, there was a bug though: git 2.7 (Q4 2015) will make sure to not lose the error message.
See commit 329e6e8 (19 Sep 2015) by Nguyễn Thái Ngọc Duy (pclouds).
(Merged by Junio C Hamano -- gitster -- in commit 076c827, 15 Oct 2015)

gc: save log from daemonized gc --auto and print it next time

While commit 9f673f9 (gc: config option for running --auto in background - 2014-02-08) helps reduce some complaints about 'gc --auto' hogging the terminal, it creates another set of problems.

The latest in this set is, as the result of daemonizing, stderr is closed and all warnings are lost. This warning at the end of cmd_gc() is particularly important because it tells the user how to avoid "gc --auto" running repeatedly.
Because stderr is closed, the user does not know, naturally they complain about 'gc --auto' wasting CPU.

Daemonized gc now saves stderr to $GIT_DIR/gc.log.
Following gc --auto will not run and gc.log printed out until the user removes gc.log.
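In practice that comes down to one config setting and one file. A small sketch, with made-up function names and assuming a git that supports rev-parse --absolute-git-dir:

```shell
#!/bin/sh
# Keep gc --auto in the foreground for one repository.
foreground_auto_gc() {
    git -C "${1:-.}" config gc.autoDetach false
}

# Print the stderr saved by an earlier background gc --auto, if any.
show_gc_log() {
    log=$(git -C "${1:-.}" rev-parse --absolute-git-dir)/gc.log
    if [ -s "$log" ]; then
        cat "$log"
    else
        echo "no gc.log"
    fi
}
```

If show_gc_log prints a warning, deal with what it describes and then remove gc.log so that gc --auto can run again.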

8

This quote is taken from Version Control with Git:

Git runs garbage collection automatically:

• If there are too many loose objects in the repository

• When a push to a remote repository happens

• After some commands that might introduce many loose objects

• When some commands such as git reflog expire explicitly request it

And finally, garbage collection occurs when you explicitly request it using the git gc command. But when should that be? There’s no solid answer to this question, but there is some good advice and best practice.

You should consider running git gc manually in a few situations:

• If you have just completed a git filter-branch. Recall that filter-branch rewrites many commits, introduces new ones, and leaves the old ones on a ref that should be removed when you are satisfied with the results. All those dead objects (that are no longer referenced since you just removed the one ref pointing to them) should be removed via garbage collection.

• After some commands that might introduce many loose objects. This might be a large rebase effort, for example.

And on the flip side, when should you be wary of garbage collection?

• If there are orphaned refs that you might want to recover

• In the context of git rerere and you do not need to save the resolutions forever

• In the context of only tags and branches being sufficient to cause Git to retain a commit permanently

• In the context of FETCH_HEAD retrievals (URL-direct retrievals via git fetch) because they are immediately subject to garbage collection

2
  • 2
    I have unreachable commits in my tree (as a result of git commit --amend). This can be verified with git log --reflog. I pushed a branch to the remote repository and checked my tree again; the unreachable commits were still there. Apparently git gc was not run when this push happened. … ?
    – chharvey
    Commented Feb 27, 2016 at 4:33
  • From atlassian.com/git/tutorials/git-gc: gc.reflogExpire: An optional variable that defaults to 90 days. It is used to set how long records in a branches reflog should be preserved. Commented Nov 29, 2021 at 23:04
7

I use git gc after I do a big checkout and have a lot of new objects. It can save space. E.g. if you check out a big SVN project using git-svn and then run git gc, you typically save a lot of space.

1
  • Is this still true? Even in '08 HDD space was cheap, using that as a justification to run it seems pointless
    – Thymine
    Commented Jul 20, 2018 at 21:43
6

You don't have to use git gc very often, because garbage collection (git gc --auto) is run automatically by several frequently used commands:

git pull
git merge
git rebase
git commit

Source: git gc best practices and FAQS

1
  • Do these commands run git gc locally on the client side only (e.g. your laptop), or do they somehow also trigger a GC on the server? (I suppose it's only locally, on the system where you run these commands)
    – Henk Poley
    Commented Jun 20 at 8:34
4

I use it after a big commit, above all when I remove many files from the repository. Afterwards, commits are faster.

0

Just for another point of view, note that you can have repos where you DO NOT WANT to do garbage collection, automatic or otherwise (repos used as reference repositories, possibly local clones, etc.), because some other repository uses this repository's objects and may become invalid if objects disappear or the files they are in get different names.

This may be a fairly typical situation on a space-conscious CI farm with some single repository used as a baseline (maybe even over NFS or similar) to spawn build workspaces for many different test/build scenarios. There you can git config gc.auto 0 in the repository to avoid mishaps (gc.auto takes a number, and 0 disables automatic gc), and use domain-specific scripting to only GC when you know it is safe to (e.g. no builds running => no agents to corrupt mid-flight) or even never.

Conversely, you may want to use a common reference repository and then detach workspace repos after instantiating the particular commit they would build (this copies just the needed objects, possibly sped up by shallowness/depth settings for that workspace) to make them independent and so reducing the time-window when it is critical to not-GC the main repository.
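Both ideas can be sketched with stock git commands. The function names are made up, and this assumes a git that supports clone --dissociate. Setting gc.auto to 0 is the documented way to switch gc --auto off, and --dissociate copies the borrowed objects into the new clone and stops using the reference repository.

```shell
#!/bin/sh
# Switch off automatic gc in a shared reference repository.
freeze_reference_repo() {
    git -C "$1" config gc.auto 0
}

# Clone a workspace that borrows objects from the reference repository
# during the clone, then becomes independent of it.
# Arguments: reference repo, source URL or path, workspace directory.
clone_detached_workspace() {
    git clone --quiet --reference "$1" --dissociate "$2" "$3"
}
```

After clone_detached_workspace returns, the workspace has no objects/info/alternates file, so a later gc in the reference repository cannot break it.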

Some reasons to do this trickery include:

  • Using a CI farm with slow link to the SCM platform (e.g. reaching out to GitHub, etc. from a corporate LAN) so that you only suffer the long-ish git clone or similar operations (and eat the uplink traffic which may be costly in corporate setups) once per build and not for each scenario;
  • Making sure the commit you want to build is available to all agents during this build (if someone force-pushes to the original repo/branch on the SCM platform, as often happens during PR preparations from private forks, a direct checkout from it may be impossible by the time the build agent is ready to do the work because the SCM platform claims the commit hash does not exist), or for named branch builds, ensuring that the same tip commit is used in all scenarios of the same build (and yes, some teams do not shy away from redefining a git tag over time, too);
  • As a continuation of the above, your build scenario might in fact prepare and archive a tarball of the git repository (garbage-collected and all), and distribute it to build agents as a temporary artifact for faster workspace instantiation. Such an approach is more useful when the agents are not on the same build host or even the same LAN.

Source/Disclaimer: lessons learned while making https://github.com/networkupstools/jenkins-dynamatrix/blob/master/src/org/nut/dynamatrix/DynamatrixStash.groovy and similar projects
