
We have a git repo containing both source code and binaries. The bare repo has now reached ~9GB, and cloning it takes ages. Most of the time is spent in "remote: Compressing objects". After a commit with a new version of one of the bigger binaries, a fetch takes a long time, also spent compressing objects on the server.

After reading "git pull without remotely compressing objects" I suspect delta compression of binary files is what hurts us as well, but I'm not 100% sure how to go about fixing this.

What are the exact steps to fix the bare repo on the server? My guess (sketched below the list):

  • Add entries like '*.zip -delta' to info/attributes for all the extensions I want to exclude from delta compression (the repo is bare, so there is no .git directory)
  • Run 'git repack', but with what options? Would -adF repack everything, and leave me with a repo where no delta compression has ever been done on the specified file types?
  • Run 'git prune'. I thought this was done automatically, but running it when I played around with a bare clone of said repo decreased the size by ~2GB
  • Clone the repo, add and commit a .gitattributes with the same entries as I added in .git/info/attributes on the bare repo
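
Put together, this is roughly what I picture running on the server, inside the bare repo (untested; the path and the extension are just examples):

    cd /srv/git/project.git                 # made-up path to the bare repo
    echo '*.zip -delta' >> info/attributes  # bare repo, so info/attributes rather than .git/info/attributes
    git repack -a -d -F                     # rewrite all packs from scratch so the new attribute takes effect
    git prune                               # drop loose objects that are no longer reachable

And then on a workstation, so that clones carry the same attributes:

    echo '*.zip -delta' >> .gitattributes
    git add .gitattributes
    git commit -m "Disable delta compression for zip files"
    git push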

Am I on to something?

Update:

Some interesting test results on this. Today I started a bare clone of the problematic repo. Our not-so-powerful server with 4 GB of RAM ran out of memory and started swapping. After 3 hours I gave up...

Then I instead cloned a bare repo from my up-to-date working copy. Cloning that one between workstations took ~5 minutes. I then pushed it up to the server as a new repo. Cloning that repo took only 7 minutes.

If I interpret this correctly, a better-packed repo performs much better, even without disabling delta compression for binary files. I guess this means the steps above are indeed what I want to do in the short term, but in addition I need to find out how to limit the amount of memory git is allowed to use for packing/compression on the server, so I can avoid the swapping.
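
From skimming the git-config man page, these look like the relevant knobs; the values below are guesses on my part, not something I have verified:

    git config pack.threads 1           # pack window memory is spent per thread, so one thread keeps the total down
    git config pack.windowMemory 100m   # cap the memory used for the delta search window
    git config pack.deltaCacheSize 50m  # cap the cache of computed deltas held before the pack is written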

In case it matters: The server runs git 1.7.0.4 and the workstations run 1.7.9.5.

Update 2:

I did the following steps on my test repo, and think I will take the chance and do them on the server (after a backup); a quick check of the result follows after the list.

  • Limit memory usage when packing objects

    git config pack.windowMemory 100m
    git config pack.packSizeLimit 200m

  • Disable delta compression for some extensions

    echo '*.tar.gz -delta' >> info/attributes
    echo '*.tar.bz2 -delta' >> info/attributes
    echo '*.bin -delta' >> info/attributes
    echo '*.png -delta' >> info/attributes

  • Repack repository and collect garbage

    git repack -a -d -F --window-memory 100m --max-pack-size 200m
    git gc
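
To get a feel for whether this actually helps, I compared the object database on my test repo before and after (nothing fancy; run inside the bare repo):

    git count-objects -v   # "size-pack" is the total size of all packs, in KiB
    ls -lh objects/pack/   # sizes of the individual pack files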

Update 3:

Some unexpected side effects after this operation: "Issues after trying to repack a git repo for improved performance"

  • Would storing the binaries elsewhere be an option? Git really sucks with big binaries, which has been acknowledged. That is why there are separate products for that...
    – eis
    Commented Sep 18, 2012 at 19:49
  • When we started with git we added uC-binaries, our rootfs and toolchain, to be able to get a complete snapshot of the past just by checking out a git revision. We didn't know enough about git to foresee the sluggishness. I plan to fix this properly (been looking at git-annex, but didn't know about git-bigfiles), but as a short term solution I would like to improve the performance of the current repo as best I can.
    – anr78
    Commented Sep 18, 2012 at 19:59
  • I feel it's better practice to store your dev environment/toolchain in a virtual machine (if you absolutely must store different versions of your dev environment, just store a new disk image outside your repo).
    – Amir Rubin
    Commented Sep 18, 2012 at 20:28
  • git annex (git-annex.branchable.com) is a possible solution.
    – Frank
    Commented May 28, 2013 at 21:52
  • "echo '*.tar.gz -delta' >> info/gitattributes" should probably be "echo '*.tar.gz -delta' >> info/attributes" Commented Apr 29, 2015 at 15:26

2 Answers


While your question asks how to make your current repo more efficient, I don't think that's feasible.

Follow the advice of the crowd:

  1. Move your big binaries out of your repo
  2. Move your dev environment to a virtual machine image: https://www.virtualbox.org/
  3. Use this Python script to clean your repo of those large binary blobs (I used it on my repo and it worked great): https://gist.github.com/1433794 (a rough filter-branch alternative is sketched below)
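
If you prefer plain git over the script, roughly the same can be done with filter-branch (the path below is just a placeholder for one of your big binaries, and this rewrites history, so do it on a fresh clone):

    git filter-branch --index-filter \
      'git rm --cached --ignore-unmatch path/to/big-binary.tar.bz2' \
      --prune-empty --tag-name-filter cat -- --all

    # afterwards, remove the backup refs and expire the reflog so gc can actually free the space
    rm -rf .git/refs/original/
    git reflog expire --expire=now --all
    git gc --prune=now --aggressive
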
  • I absolutely agree on that strategy for the more permanent fix. Rather than using a VM for the dev environment, I'm considering storing versions on a server and just letting a file in the repo point to the current one. But are you sure the current repo can't be made more efficient? If I understand the post I linked to, it should be possible to make it a bit better. If I can get rid of the "remote: Compressing objects" only for future fetches (not the initial clone), that in itself would help.
    – anr78
    Commented Sep 18, 2012 at 20:56

You should use a different mechanism for storing the big binaries. If they are generated from something, you could skip storing them and keep only the code that generates them. Otherwise I suggest moving all of them to a single directory and managing that with rsync or svn, depending on your needs.
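
As a rough illustration of the rsync variant (the server name, paths, and revision label here are made up):

    # keep the binaries in a plain directory outside git and mirror it to a file server
    rsync -av --delete build/binaries/ fileserver:/srv/artifacts/myproject/r1234/

    # in the repo, track only a small pointer to the matching artifact set
    echo "r1234" > BINARIES_REVISION
    git add BINARIES_REVISION
    git commit -m "Binaries for this tree live in /srv/artifacts/myproject/r1234"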

  • Sound advice, but it doesn't apply to our case. The biggest (and most problematic) binary is a tar.bz2'ed rootfs that takes hours to build.
    – anr78
    Commented Sep 18, 2012 at 20:52
  • I suppose very few of the files on that rootfs actually change with each build, so in that case it might be smarter not to compress them but to add them to the repo directly (just in case this was not clear: add the whole directory you're tarring instead of the resulting tar.bz2 file). This way your deltas should be smaller, because git does not handle diffing binaries well.
    – xception
    Commented Sep 19, 2012 at 4:07
