Tree contains duplicate file entries

Question

After some issues with our hosting, we decided to move our Git repository to GitHub. So I cloned the repository and tried pushing that to GitHub. However, I stumbled upon some errors we have never encountered before:

 C:\repositories\appName [master]> git push -u origin master
 Counting objects: 54483, done.
 Delta compression using up to 2 threads.
 Compressing objects: 100% (18430/18430), done.
 error: object 9eac1e639bbf890f4d1d52e04c32d72d5c29082e:contains duplicate file entries
 fatal: Error in object
 fatal: sha1 file '<stdout>' write error: Invalid arguments
 error: failed to push some refs to 'ssh://git@github.com/User/Project.git'

When I run fsck:

C:\repositories\appName [master]> git fsck --full
Checking object directories: 100% (256/256), done.
error in tree 0db4b3eb0e0b9e3ee41842229cdc058f01cd9c32: contains duplicate file entries
error in tree 9eac1e639bbf890f4d1d52e04c32d72d5c29082e: contains duplicate file entries
error in tree 4ff6e424d9dd2e3a004d62c56f99e798ac27e7bf: contains duplicate file entries
Checking objects: 100% (54581/54581), done.

When I run ls-tree with the bad SHA1:

C:\repositories\appName [master]> git ls-tree 9eac1e639bbf890f4d1d52e04c32d72d5c29082e
160000 commit 5de114491070a2ccc58ae8c8ac4bef61522e0667  MenuBundle
040000 tree 9965718812098a5680e74d3abbfa26f527d4e1fb    MenuBundle

I tried all of the answers already given on this StackOverflow question, but haven't had any success. Is there any way I can prevent this repository and its history from being doomed?

You might already have tried those, but the suggestions in stackoverflow.com/q/10931954/6309 look promising. — VonC, Commented Nov 1, 2012 at 14:33
this looks like you screwed your submodule setup. what didn't work with the linked topic? creating new tree objects to replace the broken ones should be the solution. — Stefan, Commented Jan 10, 2013 at 9:57
I've seen something similar when dealing with repositories in windows. In windows File.txt and file.txt are the same file. Do you have anything like that in your history? — Zeki, Commented Feb 13, 2014 at 21:09
It is impossible to help further without more information. Specifically answers to questions in the comments above. — onionjake, Commented Apr 3, 2014 at 17:13
Could this be a duplicate of this question stackoverflow.com/questions/10931954/… — Pavel Nikolov, Commented Apr 7, 2014 at 15:09

Community · Accepted Answer · 2017-05-23 12:23:56Z

Method 1.

Do the git fsck first.

$ git fsck --full
error in tree bb81a5af7e9203f36c3201f2736fca77ab7c8f29: contains duplicate file entries
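
When fsck reports several broken trees, it helps to collect their hashes first. The following is only a sketch: the function name `bad_trees_from_fsck` is made up for illustration, and it assumes the message format shown above and 40-character SHA-1 hashes.

```shell
# Print the hash of every tree that fsck reports as having duplicate entries.
# Reads `git fsck --full` output on stdin (hypothetical helper, not a git command).
bad_trees_from_fsck() {
    sed -n 's/^error in tree \([0-9a-f]\{40\}\): contains duplicate file entries$/\1/p'
}

# Typical use (the errors may be printed on stderr, hence 2>&1):
#   git fsck --full 2>&1 | bad_trees_from_fsck
```

Each printed hash can then be passed to git ls-tree as shown in the methods below.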

Note that git fsck only reports the damage; it does not repair it. If it reports errors like the one above, you can either ignore the problem, restore the repository from a backup, or move the files into a new repository. If you are having trouble pushing the repo to GitHub, try pushing to a different repository, or check: Can't push to GitHub error: pack-objects died of signal 13 and Can't push new git repository to github.

The methods below are only for advanced git users. Please make a backup before starting. The following steps are not guaranteed to fix the problem and can make it even worse, so follow them at your own risk or for educational purposes.

Method 2.

Use git ls-tree to identify duplicate files.

$ git read-tree bb81a5af7e9203f36c3201f2736fca77ab7c8f29 # Just a hint.
$ git ls-tree bb81a5af7e9203f36c3201f2736fca77ab7c8f29 # Try also with: --full-tree -rt -l
160000 commit def08273a99cc8d965a20a8946f02f8b247eaa66  commerce_coupon_per_user
100644 blob 89a5293b512e28ffbaac1d66dfa1428d5ae65ce0    commerce_coupon_per_user
100644 blob 2f527480ce0009dda7766647e36f5e71dc48213b    commerce_coupon_per_user
100644 blob dfdd2a0b740f8cd681a6e7aa0a65a0691d7e6059    commerce_coupon_per_user
100644 blob 45886c0eda2ef57f92f962670fad331e80658b16    commerce_coupon_per_user
100644 blob 9f81b5ca62ed86c1a2363a46e1e68da1c7b452ee    commerce_coupon_per_user

As you can see, it contains the duplicated file entries (commerce_coupon_per_user)!
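
In a large tree the duplicates are harder to spot by eye. As a rough sketch, the listing can be filtered for names that occur more than once. The function name `find_duplicate_entries` is an assumption for illustration; it relies on git ls-tree separating the name from the other fields with a tab.

```shell
# Print every entry name that occurs more than once in `git ls-tree` output.
# ls-tree lines look like "<mode> <type> <hash><TAB><name>", so the name is
# the second tab-separated field. (Hypothetical helper, not a git command.)
find_duplicate_entries() {
    awk -F'\t' '{ count[$2]++ } END { for (n in count) if (count[n] > 1) print n }' | sort
}

# Typical use:
#   git ls-tree bb81a5af7e9203f36c3201f2736fca77ab7c8f29 | find_duplicate_entries
```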

$ git show bb81a5af7e9203f36c3201f2736fca77ab7c8f29
tree bb81a5af7e9203f36c3201f2736fca77ab7c8f29

commerce_coupon_per_user
commerce_coupon_per_user
commerce_coupon_per_user
commerce_coupon_per_user
commerce_coupon_per_user
commerce_coupon_per_user

Again, you can see the duplicated file entries (commerce_coupon_per_user)!

You may try to use git show for each listed blob and check the content of each file.
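
To go through the blobs one by one, their hashes can be pulled out of the listing. This is a sketch under the same assumption about the ls-tree line format; `blob_ids_for_name` is a made-up helper name.

```shell
# Print the hash of every blob entry with the given name.
# Reads `git ls-tree` output on stdin; $1 is the entry name to look for.
blob_ids_for_name() {
    awk -F'\t' -v name="$1" '
        $2 == name {
            split($1, f, " ")            # f[1]=mode, f[2]=type, f[3]=hash
            if (f[2] == "blob") print f[3]
        }'
}

# Typical use, showing the first lines of each candidate blob:
#   git ls-tree bb81a5af7e9203f36c3201f2736fca77ab7c8f29 |
#       blob_ids_for_name commerce_coupon_per_user |
#       while read -r id; do echo "== $id"; git show "$id" | head; done
```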

Then keep running ls-tree for that invalid tree object across your different git clones to see if you can track down a valid copy, or if all are broken.

git ls-tree bb81a5af7e9203f36c3201f2736fca77ab7c8f29

If you find a valid object with no duplicated file entries, save its listing to a file, re-create the tree with `git mktree`, and substitute it for the broken one with `git replace`, e.g.

remote$ git ls-tree bb81a5af7e9203f36c3201f2736fca77ab7c8f29 > working_tree.txt
$ cat working_tree.txt | git mktree
NEWTREE # Placeholder for the hash that mktree prints.
$ git replace bb81a5af7e9203f36c3201f2736fca77ab7c8f29 NEWTREE

If this doesn't help, you can undo the change by deleting the replacement ref, which is named after the original (replaced) object:

$ git replace -d bb81a5af7e9203f36c3201f2736fca77ab7c8f29
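
The mktree and replace steps can be wrapped together. This is a minimal sketch, not a vetted tool: `replace_tree_from_listing` is a hypothetical name, and it assumes the listing file is in git ls-tree format and that you run it inside the damaged repository.

```shell
# Build a tree from a cleaned ls-tree listing and put it in place of a broken tree.
# Usage: replace_tree_from_listing <broken-tree-hash> <listing-file>
# Prints the hash of the new tree. (Hypothetical helper.)
replace_tree_from_listing() {
    broken=$1
    listing=$2
    new_tree=$(git mktree < "$listing") || return 1
    git replace "$broken" "$new_tree" || return 1
    printf '%s\n' "$new_tree"
}
```

The replacement is only a ref under refs/replace/, so it can be removed again with git replace -d followed by the hash of the broken tree.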

Method 3.

When you know which file/dir entry is duplicated, you may try to remove that file and re-create it later on. For example:

$ find . -name commerce_coupon_per_user # Find the duplicate entry.
$ git rm --cached `find . -name commerce_coupon_per_user` # Add -r for the dir.
$ git commit -m'Removing invalid git entry for now.' -a
$ git gc --aggressive --prune # Deletes loose objects! Please do the backup before just in case.

Read more:

git gc: cleaning up after yourself

Method 4.

Check your commit for invalid entries.

Let's check our tree again.

$ git ls-tree bb81a5af7e9203f36c3201f2736fca77ab7c8f29 --full-tree -rt -l
160000 commit def08273a99cc8d965a20a8946f02f8b247eaa66  commerce_coupon_per_user
100644 blob 89a5293b512e28ffbaac1d66dfa1428d5ae65ce0     270    commerce_coupon_per_user
....
$ git show def08273a99cc8d965a20a8946f02f8b247eaa66
fatal: bad object def08273a99cc8d965a20a8946f02f8b247eaa66
$ git cat-file commit def08273a99cc8d965a20a8946f02f8b247eaa66
fatal: git cat-file def08273a99cc8d965a20a8946f02f8b247eaa66: bad file

It seems the above commit is invalid, so let's scan our git log for this commit using one of the following commands to check what's going on:

$ git log -C3 --patch | less +/def08273a99cc8d965a20a8946f02f8b247eaa66
$ git log -C3 --patch | grep -C10 def08273a99cc8d965a20a8946f02f8b247eaa66

commit 505446e02c68fe306aec5b0dc2ccb75b274c75a9
Date:   Thu Jul 3 16:06:25 2014 +0100

    Added dir.

new file mode 160000
index 0000000..def0827
--- /dev/null
+++ b/sandbox/commerce_coupon_per_user
@@ -0,0 +1 @@
+Subproject commit def08273a99cc8d965a20a8946f02f8b247eaa66

In this particular case, our commit points to the bad object because it was committed as part of a git submodule which doesn't exist anymore (check git submodule status).

You may exclude that invalid object from the ls-tree and re-create tree without this bad object by e.g.:

$ git ls-tree bb81a5af7e9203f36c3201f2736fca77ab7c8f29 | grep -v def08273a99cc8d965a20a8946f02f8b247eaa66 | git mktree
b964946faf34468cb2ee8e2f24794ae1da1ebe20

$ git replace bb81a5af7e9203f36c3201f2736fca77ab7c8f29 b964946faf34468cb2ee8e2f24794ae1da1ebe20
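
Note that grep -v drops any line containing the hash anywhere, including in a file name. A slightly stricter filter compares only the hash column. This is a sketch with a made-up helper name, `drop_entries_for_object`:

```shell
# Drop every ls-tree entry whose object hash (third field) equals $1.
# Reads `git ls-tree` output on stdin and prints the remaining lines unchanged.
drop_entries_for_object() {
    awk -v bad="$1" '$3 != bad'
}

# Typical use:
#   git ls-tree bb81a5af7e9203f36c3201f2736fca77ab7c8f29 |
#       drop_entries_for_object def08273a99cc8d965a20a8946f02f8b247eaa66 |
#       git mktree
```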

$ git ls-tree bb81a5af7e9203f36c3201f2736fca77ab7c8f29 # Re-test.
$ git fsck --full

Note: The old object should still throw the duplicate file entries error, but if you now have duplicates in the new tree, then you need to remove more stuff from that tree. So:

$ git replace # List replace objects.
bb81a5af7e9203f36c3201f2736fca77ab7c8f29
$ git replace -d bb81a5af7e9203f36c3201f2736fca77ab7c8f29 # Remove previously replaced object.

Now let's try to remove all commits and blobs from that tree, and replace it again:

$ git ls-tree bb81a5af7e9203f36c3201f2736fca77ab7c8f29 | grep -ve commit -e blob | git mktree
4b825dc642cb6eb9a060e54bf8d69288fbee4904
$ git replace bb81a5af7e9203f36c3201f2736fca77ab7c8f29 4b825dc642cb6eb9a060e54bf8d69288fbee4904

Now you have an empty tree for that invalid entry.
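
An empty tree throws away every entry. A less drastic variant, assuming one of the duplicates is the one you actually want, is to keep a single entry per name. The sketch below keeps whichever entry comes first in the listing, which is an arbitrary choice; inspect the entries and filter out the unwanted ones beforehand. `keep_first_entry_per_name` is a hypothetical helper name.

```shell
# Keep only the first ls-tree entry for each name and drop later duplicates.
# The name is the second tab-separated field of each ls-tree line.
keep_first_entry_per_name() {
    awk -F'\t' '!seen[$2]++'
}

# Typical use:
#   git ls-tree bb81a5af7e9203f36c3201f2736fca77ab7c8f29 |
#       keep_first_entry_per_name | git mktree
```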

$ git status # Check if everything is fine.
$ git show 4b825dc642cb6eb9a060e54bf8d69288fbee4904 # Re-check
$ git ls-tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904 --full-tree # Re-check

If you have some weird staged changes, reset your repository:

$ git reset HEAD --hard

If you get the following message:

HEAD is now at 5a4ed8e Some message at bb81a5af7e9203f36c3201f2736fca77ab7c8f29

Do the rebase and fix that commit (by changing pick to edit):

$ git rebase -i
$ git commit -m'Fixed invalid commit.' -a
rebase in progress; onto 691f725
You are currently editing a commit while rebasing branch 'dev' on '691f725'.
$ git rebase --continue
$ git reset --hard
$ git reset HEAD --hard
$ git reset origin/master --hard

Method 5.

Try removing and squashing invalid commits containing invalid objects.

$ git rebase -i HEAD~100 # 100 commits behind HEAD, increase if required.

Read more: Git Tools - Rewriting History and How do I rebase while skipping a particular commit?

Method 6.

Identify the invalid git objects for manual removal using the following methods:

For uncompressed (loose) objects (*drop the first two characters of the hash, as git uses them for the directory name):
```
$ find . -name 81a5af7e9203f36c3201f2736fca77ab7c8f29
```
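
The directory split can also be computed directly instead of searching. A small sketch, assuming the default .git directory location; `loose_object_path` is a made-up name:

```shell
# Print where a loose object with the given hash would live: the first two
# characters form the directory name, the rest is the file name.
loose_object_path() {
    sha=$1
    dir=$(printf '%s' "$sha" | cut -c1-2)
    file=$(printf '%s' "$sha" | cut -c3-)
    printf '.git/objects/%s/%s\n' "$dir" "$file"
}

# Typical use:
#   ls -l "$(loose_object_path bb81a5af7e9203f36c3201f2736fca77ab7c8f29)"
```

If the file does not exist, the object is either packed or missing.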

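For compressed (packed) objects:

$ find . -name \*.idx -exec cat {} \; | git show-index | grep bb81a5af7e9203f36c3201f2736fca77ab7c8f29
# Then you need to find the file manually.
# Note: unpack-objects reads the pack on stdin and skips objects that already
# exist in the repository, so you may need to move the pack elsewhere first.
$ git unpack-objects < $FILE # Expand the particular file.
$ git unpack-objects < .git/objects/pack/pack-*.pack # Expand all.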

See: How to unpack all objects of a git repository?

VonC · Accepted Answer · 2022-07-12 08:27:58Z

Note: Git 2.1 added two options to git replace which can be useful when modifying a corrupted entry in a git repo:

commit 4e4b125 by Christian Couder (chriscool)
```
--edit <object>
```

Edit an object's content interactively. The existing content for <object> is pretty-printed into a temporary file, an editor is launched on the file, and the result is parsed to create a new object of the same type as <object>.
A replacement ref is then created to replace <object> with the newly created object.
See git-var for details about how the editor will be chosen.

And commit 2deda62 by Jeff King (peff):

replace: add a --raw mode for --edit

One of the purposes of "git replace --edit" is to help a user repair objects which are malformed or corrupted.
Usually we pretty-print trees with "ls-tree", which is much easier to work with than the raw binary data.

However, some forms of corruption break the tree-walker, in which case our pretty-printing fails, rendering "--edit" useless for the user.

This patch introduces a "--raw" option, which lets you edit the binary data in these instances.

Knowing how Jeff usually debugs Git (like in this case), I am not too surprised to see this option.

Note that before Git 2.27 (Q2 2020), "git fsck" ensured that the paths recorded in tree objects were sorted and without duplicates, but it failed to notice a case where a blob is followed by entries that sort before a tree with the same name.

This has been corrected.

See commit 9068cfb (10 May 2020) by René Scharfe (rscharfe).
^{(Merged by Junio C Hamano -- gitster -- in commit 0498840, 14 May 2020)}

fsck: report non-consecutive duplicate names in trees

^{Suggested-by: Brandon Williams}
^{Original-test-by: Brandon Williams}
^{Signed-off-by: René Scharfe}
^{Reviewed-by: Luke Diamand}

Tree entries are sorted in path order, meaning that directory names get a slash ('/') appended implicitly.

Git fsck checks if trees contains consecutive duplicates, but due to that ordering there can be non-consecutive duplicates as well if one of them is a directory and the other one isn't.

Such a tree cannot be fully checked out.

Find these duplicates by recording candidate file names on a stack and check candidate directory names against that stack to find matches.
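
To see why such duplicates are not adjacent, here is a sketch that mimics the ordering described above: directory names are compared as if a '/' were appended, and everything is sorted bytewise. The helper name `tree_sort_keys` and the "type name" input format are assumptions for illustration only; this is not how fsck itself is implemented.

```shell
# Read "<type> <name>" pairs on stdin and print the comparison keys in the
# order tree entries are sorted: trees compare as "<name>/", and the sort is
# bytewise (LC_ALL=C).
tree_sort_keys() {
    awk '{ key = $2; if ($1 == "tree") key = key "/"; print key }' | LC_ALL=C sort
}

# Typical use:
#   printf 'blob a\ntree a\nblob a.c\n' | tree_sort_keys
```

With a blob a, a tree a, and a blob a.c, the key a.c sorts between a and a/ because '.' precedes '/' in ASCII, so the two entries named a end up non-consecutive.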

With Git 2.30 (Q1 2021), the logic to deal with a repack operation that ended up creating the same packfile has been simplified.

See commit 2fcb03b (17 Nov 2020), and commit 704c4a5 (16 Nov 2020) by Taylor Blau (ttaylorr).
See commit 63f4d5c (16 Nov 2020) by Jeff King (peff).
^{(Merged by Junio C Hamano -- gitster -- in commit 39d38a5, 03 Dec 2020)}

builtin/repack.c: don't move existing packs out of the way

^{Helped-by: Jeff King}
^{Signed-off-by: Taylor Blau}

When 'git repack' creates a pack with the same name as any existing pack, it moves the existing one to 'old-pack-xxx.{pack,idx,...}' and then renames the new one into place.

Eventually, it would be nice to have 'git repack' allow for writing a multi-pack index at the critical time (after the new packs have been written / moved into place, but before the old ones have been deleted). Guessing that this option might be called '--write-midx', this makes the following situation (where repacks are issued back-to-back without any new objects) impossible:

$ git repack -adb
$ git repack -adb --write-midx

In the second repack, the existing packs are overwritten verbatim with the same rename-to-old sequence. At that point, the current MIDX is invalidated, since it refers to now-missing packs. So that code wants to be run after the MIDX is re-written. But (prior to this patch) the new MIDX can't be written until the new packs are moved into place. So, we have a circular dependency.

This is all hypothetical, since no code currently exists to write a MIDX safely during a 'git repack' (the 'GIT_TEST_MULTI_PACK_INDEX' does so unsafely). Putting hypothetical aside, though: why do we need to rename existing packs to be prefixed with 'old-' anyway?

This behavior dates all the way back to 2ad47d6 ("git-repack: Be careful when updating the same pack as an existing one.", 2006-06-25, Git v1.4.1 -- merge). 2ad47d6 is mainly concerned about a case where a newly written pack would have a different structure than its index. This used to be possible when the pack name was a hash of the set of objects. Under this naming scheme, two packs that store the same set of objects could differ in delta selection, object positioning, or both. If this happened, then any such packs would be unreadable in the instant between copying the new pack and new index (i.e., either the index or pack will be stale depending on the order that they were copied).

But since 1190a1a ("pack-objects: name pack files after trailer hash", 2013-12-05, Git v1.9-rc0 -- merge), this is no longer possible, since pack files are named not after their logical contents (i.e., the set of objects), but by the actual checksum of their contents.
So, this old- behavior can safely go, which allows us to avoid our circular dependency above.

In addition to avoiding the circular dependency, this patch also makes 'git repack' a lot simpler, since we don't have to deal with failures encountered when renaming existing packs to be prefixed with 'old-'.

This patch is mostly limited to removing code paths that deal with the 'old' prefixing, with the exception of files that include the pack's name in their own filename, like .idx, .bitmap, and related files. The exception is that we want to continue to trust what pack-objects wrote. That is, it is not the case that we pretend as if pack-objects didn't write files identical to ones that already exist, but rather that we respect what pack-objects wrote as the source of truth. That cuts two ways:

If pack-objects produced an identical pack to one that already exists with a bitmap, but did not produce a bitmap, we remove the bitmap that already exists. (This behavior is codified in t7700.14).

If pack-objects produced an identical pack to one that already exists, we trust the just-written version of the corresponding .idx, .promisor, and other files over the ones that already exist. This ensures that we use the most up-to-date versions of this files, which is safe even in the face of format changes in, say, the .idx file (which would not be reflected in the .idx file's name).

When rebuilding the multi-pack index file reusing an existing one, we used to blindly trust the existing file and ended up carrying corrupted data into the updated file, which has been corrected with Git 2.33 (Q3 2021).

See commit f89ecf7, commit ec1e28e, commit 15316a4, commit f9221e2 (23 Jun 2021) by Taylor Blau (ttaylorr).
^{(Merged by Junio C Hamano -- gitster -- in commit 3b57e72, 16 Jul 2021)}

midx: report checksum mismatches during 'verify'

^{Suggested-by: Derrick Stolee}
^{Signed-off-by: Taylor Blau}

'git multi-pack-index verify' inspects the data in an existing MIDX for correctness by checking that the recorded object offsets are correct, and so on.

But it does not check that the file's trailing checksum matches the data that it records.
So, if an on-disk corruption happened to occur in the final few bytes (and all other data was recorded correctly), we would:

get a clean result from 'git multi-pack-index verify', but

be unable to reuse the existing MIDX when writing a new one (since we now check for checksum mismatches before reusing a MIDX)

Teach the 'verify' sub-command to recognize corruption in the checksum by calling midx_checksum_valid().

With Git 2.34 (Q4 2021), "git repack" has been taught to generate multi-pack reachability bitmaps.

See commit e861b09 (06 Oct 2021) by Jeff King (peff).
See commit 324efc9 (01 Oct 2021), and commit 6d08b9d, commit 1d89d88, commit 5f18e31, commit a169166, commit 90f838b, commit 08944d1, commit 6fb22ca, commit 56d863e (28 Sep 2021) by Taylor Blau (ttaylorr).
^{(Merged by Junio C Hamano -- gitster -- in commit 0b69bb0, 18 Oct 2021)}

builtin/repack.c: support writing a MIDX while repacking

^{Signed-off-by: Taylor Blau}

Teach git repack a new --write-midx option for callers that wish to persist a multi-pack index in their repository while repacking.

There are two existing alternatives to this new flag, but they don't cover our particular use-case.
These alternatives are:

Call 'git multi-pack-index write' after running 'git repack', or

Set 'GIT_TEST_MULTI_PACK_INDEX=1' in your environment when running 'git repack'.

The former works, but introduces a gap in bitmap coverage between repacking and writing a new MIDX (since the repack may have deleted a pack included in the existing MIDX, invalidating it altogether).

Introduce a new option which eliminates this race by teaching git repack to generate the MIDX at the critical point: after the new packs have been written and moved into place, but before the redundant packs have been removed.

This option is compatible with git repack's '--bitmap' option (it changes the interpretation to be: "write a bitmap corresponding to the MIDX after one has been generated").

The MIDX code does not handle this, so avoid trying to generate a MIDX covering zero packs in the first place.

git repack now includes in its man page:

This option has no effect if multiple packfiles are created, unless writing a MIDX (in which case a multi-pack bitmap is created).

git repack also now includes in its man page:

-m

--write-midx

Write a multi-pack index (see git multi-pack-index) containing the non-redundant packs.

With Git 2.38 (Q3 2022), the collection of what is referenced by objects in promisor packs has been optimized to inspect these objects in the in-pack order.

That will make the git fsck from kenorb's answer much faster.

See commit 18c08ab (16 Jun 2022) by Jeff King (peff).
^{(Merged by Junio C Hamano -- gitster -- in commit 2b970bc, 11 Jul 2022)}

is_promisor_object(): walk promisor packs in pack-order

^{Signed-off-by: Jeff King}

When we generate the list of promisor objects, we walk every pack with a .promisor file and examine its objects for any links to other objects.
By default, for_each_packed_object() will go in pack .idx order.

This is the worst case with respect to our delta base cache.
If we have a delta chain of A->B->C->D, then visiting A may require reconstructing both B and C, unless we also visited B recently, in which case we may have cached its value.

Because .idx order is based on sha1, it's random with respect to the actual object contents and deltas, and thus we're unlikely to get many cache hits.

If we instead traverse in pack order, then we get the optimal case: packs are written to keep delta families together, and to place bases before their children.

Even on a modest repository like git.git, this has a noticeable speedup on p5600.4, which runs "fsck" on a partial clone with blob:none (so lots of trees which need to be walked, and which delta well):

Test       HEAD^               HEAD
-------------------------------------------------------
5600.4:    17.87(17.83+0.04)   15.42(15.35+0.06) -13.7%

On a larger repository like linux.git, the speedup is even more pronounced:

Test       HEAD^                 HEAD
-----------------------------------------------------------
5600.4:    322.47(322.01+0.42)   186.41(185.76+0.63) -42.2%

Any other operations that call is_promisor_object(), like "rev-list --exclude-promisor-objects", would similarly benefit, but the invocations in p5600 don't actually trigger any such cases.

Note that we may pay a small price to build a rev-index in-memory to do the pack-order traversal.
But it's still a big net win, and even that small cost goes away if you are using pack.writeReverseIndex.

Community · Accepted Answer · 2017-05-23 10:28:15Z

The only solution I have run across is to use git-replace and git-mktree. It's not the easiest solution in the world, but it does work.

Look at this link for a reference guide.

git tree contains duplicate file entries
