
We have a git repo containing both source code and binaries. The bare repo has now reached ~9GB, and cloning it takes ages. Most of the time is spent in "remote: Compressing objects". After a commit with a new version of one of the bigger binaries, a fetch takes a long time, also spent compressing objects on the server.

After reading "git pull without remotely compressing objects" I suspect delta compression of binary files is what hurts us as well, but I'm not 100% sure how to go about fixing this.

What are the exact steps to fix the bare repo on the server? My guess (sketched below the list):

  • Add entries like '*.zip -delta' to info/attributes for all the extensions I want to exclude from delta compression (the repo is bare, so there is no .git directory)
  • Run 'git repack', but with what options? Would -adF repack everything, and leave me with a repo where no delta compression has ever been done on the specified file types?
  • Run 'git prune'. I thought this was done automatically, but running it when I played around with a bare clone of said repo decreased the size by ~2GB
  • Clone the repo, add and commit a .gitattributes with the same entries as I added in .git/info/attributes on the bare repo
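
Put together, this is roughly what I picture running on the server, inside the bare repo (untested; the path and the extension are just examples):

    cd /srv/git/project.git                 # made-up path to the bare repo
    echo '*.zip -delta' >> info/attributes  # bare repo, so info/attributes rather than .git/info/attributes
    git repack -a -d -F                     # rewrite all packs from scratch so the new attribute takes effect
    git prune                               # drop loose objects that are no longer reachable

And then on a workstation, so that clones carry the same attributes:

    echo '*.zip -delta' >> .gitattributes
    git add .gitattributes
    git commit -m "Disable delta compression for zip files"
    git push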

Am I on to something?

Update:

Some interesting test results on this. Today I started a bare clone of the problematic repo. Our not-so-powerful server with 4 GB of RAM ran out of memory and started swapping. After 3 hours I gave up...

Then I instead cloned a bare repo from my up-to-date working copy. Cloning that one between workstations took ~5 minutes. I then pushed it up to the server as a new repo. Cloning that repo took only 7 minutes.

If I interpret this correctly, a better-packed repo performs much better, even without disabling delta compression for binary files. I guess this means the steps above are indeed what I want to do in the short term, but in addition I need to find out how to limit the amount of memory git is allowed to use for packing/compression on the server, so I can avoid the swapping.
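
From skimming the git-config man page, these look like the relevant knobs; the values below are guesses on my part, not something I have verified:

    git config pack.threads 1           # pack window memory is spent per thread, so one thread keeps the total down
    git config pack.windowMemory 100m   # cap the memory used for the delta search window
    git config pack.deltaCacheSize 50m  # cap the cache of computed deltas held before the pack is written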

In case it matters: The server runs git 1.7.0.4 and the workstations run 1.7.9.5.

Update 2:

I did the following steps on my test repo, and think I will take the chance and do them on the server (after a backup); a quick check of the result follows after the list.

  • Limit memory usage when packing objects

    git config pack.windowMemory 100m
    git config pack.packSizeLimit 200m

  • Disable delta compression for some extensions

    echo '*.tar.gz -delta' >> info/attributes
    echo '*.tar.bz2 -delta' >> info/attributes
    echo '*.bin -delta' >> info/attributes
    echo '*.png -delta' >> info/attributes

  • Repack repository and collect garbage

    git repack -a -d -F --window-memory 100m --max-pack-size 200m
    git gc
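
To get a feel for whether this actually helps, I compared the object database on my test repo before and after (nothing fancy; run inside the bare repo):

    git count-objects -v   # "size-pack" is the total size of all packs, in KiB
    ls -lh objects/pack/   # sizes of the individual pack files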

Update 3:

Some unexpected side effects after this operation: "Issues after trying to repack a git repo for improved performance"

  • Would storing the binaries elsewhere be an option? Git really sucks with big binaries, which has been acknowledged. That is why there are separate products for that...
    – eis
    Commented Sep 18, 2012 at 19:49
  • When we started with git we added uC-binaries, our rootfs and toolchain, to be able to get a complete snapshot of the past just by checking out a git revision. We didn't know enough about git to foresee the sluggishness. I plan to fix this properly (been looking at git-annex, but didn't know about git-bigfiles), but as a short term solution I would like to improve the performance of the current repo as best I can.
    – anr78
    Commented Sep 18, 2012 at 19:59
  • I feel it's better practice to store your dev environment/toolchain in a virtual machine (if you absolutely must store different versions of your dev environment, just store a new disk image outside your repo).
    – Amir Rubin
    Commented Sep 18, 2012 at 20:28
  • git annex (git-annex.branchable.com) is a possible solution.
    – Frank
    Commented May 28, 2013 at 21:52
  • "echo '*.tar.gz -delta' >> info/gitattributes" should probably be "echo '*.tar.gz -delta' >> info/attributes" Commented Apr 29, 2015 at 15:26

2 Answers


While your question asks how to make your current repo more efficient, I don't think that's feasible.

Follow the advice of the crowd:

  1. Move your big binaries out of your repo
  2. Move your dev environment to a virtual machine image: https://www.virtualbox.org/
  3. Use this Python script to clean your repo of those large binary blobs (I used it on my repo and it worked great): https://gist.github.com/1433794 (a rough filter-branch alternative is sketched below)
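
If you prefer plain git over the script, roughly the same can be done with filter-branch (the path below is just a placeholder for one of your big binaries, and this rewrites history, so do it on a fresh clone):

    git filter-branch --index-filter \
      'git rm --cached --ignore-unmatch path/to/big-binary.tar.bz2' \
      --prune-empty --tag-name-filter cat -- --all

    # afterwards, remove the backup refs and expire the reflog so gc can actually free the space
    rm -rf .git/refs/original/
    git reflog expire --expire=now --all
    git gc --prune=now --aggressive
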
  • I absolutely agree on that strategy for the more permanent fix. Rather than using a VM for the dev environment, I'm considering storing versions on a server and just letting a file in the repo point to the current one. But are you sure the current repo can't be made more efficient? If I understand the post I linked to, it should be possible to make it a bit better. If I can get rid of the "remote: Compressing objects" only for future fetches (not the initial clone), that in itself would help.
    – anr78
    Commented Sep 18, 2012 at 20:56

You should use a different mechanism for storing the big binaries. If they are generated from something, you could skip storing them and keep only the code that generates them. Otherwise I suggest moving all of them to a single directory and managing that with rsync or svn, depending on your needs.
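
As a rough illustration of the rsync variant (the server name, paths, and revision label here are made up):

    # keep the binaries in a plain directory outside git and mirror it to a file server
    rsync -av --delete build/binaries/ fileserver:/srv/artifacts/myproject/r1234/

    # in the repo, track only a small pointer to the matching artifact set
    echo "r1234" > BINARIES_REVISION
    git add BINARIES_REVISION
    git commit -m "Binaries for this tree live in /srv/artifacts/myproject/r1234"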

  • Sound advice, but it doesn't apply to our case. The biggest (and most problematic) binary is a tar.bz2'ed rootfs that takes hours to build.
    – anr78
    Commented Sep 18, 2012 at 20:52
  • I suppose very few of the files on that rootfs actually change with each build, so in that case it might be smarter not to compress them but to add them to the repo directly (just in case this was not clear: add the whole directory you're tarring instead of the resulting tar.bz2 file). This way your deltas should be smaller, because git does not handle diffing binaries well.
    – xception
    Commented Sep 19, 2012 at 4:07
