
One of my projects has a folder with a hundred thousand tiny MP3 files (each just a few spoken words).

In the repo, so far I've only committed the code section of the project, and I'm scratching my head about what to do about the assets.

Of course git is meant to track text-based files, but we all track a few other assets when they are part of a project.

I am hesitating between two options and would love some input.

  1. Track all the things.

    Under this option, git will track the 100K asset files. The benefit is that once they're up on GitHub there will be very few changes (a renamed file here and there).

    My worry is how git will handle the assets for each subsequent commit. Each time I commit, will git recompute the hashes of all the asset files to compare them with the last commit? If so, each commit will take ages.

  2. Don't track assets, but add a large archive file to Releases

    Under this option, git won't track the mp3s, but I'll create GitHub releases where I'll upload a 7z file of the assets in the binaries section of the release. The downside I see is that if I rename a single MP3 file, the next release will be a wasteful duplication of material. (The rough workflow I have in mind is sketched below.)
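
For concreteness, a sketch of that release workflow, assuming the GitHub CLI (gh) is installed and authenticated; the tag name and paths are made up:

    # Pack the assets into a single archive (paths are illustrative)
    7z a assets.7z assets/
    # Create a release and attach the archive as a binary asset
    gh release create v1.0 assets.7z --notes "Spoken-word MP3 assets"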

  • Maybe Git LFS would solve this problem?
    – gronostaj
    Commented Jul 28, 2020 at 15:44
  • Reading about it now. This could be it; I'm not sure I fully understand it yet. Giving it a try right now; the real test will be the second commit, to see whether git spins for half an hour looking for hash differences.
    – zx81
    Commented Jul 28, 2020 at 16:24

1 Answer


Currently chromium.git has 315k files, gentoo.git has 90k files, and linux.git has 70k files. So these file counts are not exactly unusual, and Git already has optimizations for them.

Mainly, the ".git/index" of a Git checkout (working directory) stores a bit more information about each file in the working tree than just its hash – it keeps track of the file's inode number, inode change time, and file modification time. If all these parameters are identical, then 'git add' will just assume the actual file contents are identical, too.
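
You can see what the index has cached with 'git ls-files --debug'. A sketch with made-up numbers – the exact field layout varies a bit between Git versions:

    $ git ls-files --debug assets/hello.mp3
    assets/hello.mp3
      ctime: 1595950000:0
      mtime: 1595950000:0
      dev: 2049     ino: 1234567
      uid: 1000     gid: 1000
      size: 4096    flags: 0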

(Yes, the slow part would actually be 'git add', as that's where file scanning and importing is done – the subsequent 'git commit' just dumps the already-collected index information.)

If your files are organized in deeply nested directories, you might find these Git config options useful (though they only affect the local checkout and not the actual history database): index.version=4 or feature.manyFiles=true.
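
A minimal sketch of enabling them in the affected repository (the last command rewrites the already-existing index into the newer format):

    git config index.version 4          # prefix-compresses path names in the index
    git config feature.manyFiles true   # implies index version 4 plus an untracked cache
    git update-index --index-version 4  # convert the current index file in place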

Git doesn't care much whether it's storing binary files or not. The actual problem with MP3 files is that they're already compressed, meaning that one file's bytes look nothing like another's, so Git's usual delta-compression optimizations won't achieve much. However, this shouldn't cause any issues if the files are very rarely changed.
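
If repacking ever does get slow, one optional thing to try – an assumption on my part, not something your use case necessarily needs – is telling Git up front not to attempt delta compression for MP3s, via .gitattributes:

    # .gitattributes – treat MP3s as binary and skip delta attempts when packing
    *.mp3 binary -delta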


If the files are small but numerous, I don't think you will gain much from Git LFS or git-annex – it might actually make things slower, as each file would need to be downloaded in a separate request to the server.

  • Quick clarification before accepting your answer. First off, thank you very much; I found your knowledge of the internals enlightening. If I understand correctly, I shouldn't worry: the first git add will be a big hit time-wise, but subsequent git status and git commit runs won't need to re-hash the files since, as you explained, timestamps are used. Is that essentially it?
    – zx81
    Commented Jul 28, 2020 at 20:16
  • Subsequent add/status will skip hashing files based on timestamps. (The speed of actually checking timestamps will depend on the OS and filesystem.) The commit operation never hashes any files at all; it simply writes the already-staged hashes from the index into the database. Commented Jul 28, 2020 at 20:27
  • Again, big thanks for your answer (accepted). Wishing you a great week.
    – zx81
    Commented Jul 28, 2020 at 22:46
