Note: Git 2.1 added two options to git replace, which can be useful when modifying a corrupted entry in a Git repo:
--edit <object>
Edit an object's content interactively. The existing content for <object> is pretty-printed into a temporary file, an editor is launched on the file, and the result is parsed to create a new object of the same type as <object>.
A replacement ref is then created to replace <object> with the newly created object.
See git-var
for details about how the editor will be chosen.
And commit 2deda62 by Jeff King (peff):
replace: add a --raw mode for --edit
One of the purposes of "git replace --edit" is to help a user repair objects which are malformed or corrupted.
Usually we pretty-print trees with "ls-tree", which is much easier to work with than the raw binary data.
However, some forms of corruption break the tree-walker, in which case our pretty-printing fails, rendering "--edit" useless for the user.
This patch introduces a "--raw" option, which lets you edit the binary data in these instances.
Knowing how Jeff usually debugs Git (as in this case), I am not too surprised to see this option.
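To make the difference between the pretty-printed and raw views concrete, here is a minimal Python sketch of the raw tree layout (each entry is an ASCII octal mode, a space, the name, a NUL byte, then the binary object id). It assumes SHA-1 ids of 20 bytes; `parse_raw_tree` and the sample entries are hypothetical names for illustration, not part of Git.

```python
import hashlib

def parse_raw_tree(data: bytes):
    """Split raw tree bytes into (mode, name, hex object id) tuples.

    Assumes SHA-1 object ids (20 raw bytes per entry).
    Raises ValueError on malformed input, much like a tree walker
    giving up on a corrupted tree.
    """
    entries = []
    pos = 0
    while pos < len(data):
        space = data.index(b" ", pos)        # end of the ASCII octal mode
        nul = data.index(b"\0", space + 1)   # end of the entry name
        mode = data[pos:space].decode("ascii")
        name = data[space + 1:nul].decode("utf-8", errors="replace")
        oid = data[nul + 1:nul + 21]
        if len(oid) != 20:
            raise ValueError(f"truncated object id for entry {name!r}")
        entries.append((mode, name, oid.hex()))
        pos = nul + 21
    return entries

# Hypothetical two-entry tree built by hand; the ids are hashes of dummy strings.
blob_id = hashlib.sha1(b"dummy blob").digest()
tree_id = hashlib.sha1(b"dummy tree").digest()
raw = b"100644 README\0" + blob_id + b"40000 src\0" + tree_id

# Print the entries in a layout similar to "git ls-tree".
kinds = {"40000": "tree", "160000": "commit"}
for mode, name, oid in parse_raw_tree(raw):
    print(f"{mode.rjust(6, '0')} {kinds.get(mode, 'blob')} {oid}\t{name}")
```

If a few bytes of `raw` are lost, the parser raises instead of printing anything, which is the kind of situation where editing the binary data directly is the only option left.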
Note that before Git 2.27 (Q2 2020), "git fsck" ensured that the paths recorded in tree objects were sorted and without duplicates, but it failed to notice a case where a blob is followed by entries that sort before a tree with the same name.
This has been corrected.
See commit 9068cfb (10 May 2020) by René Scharfe (rscharfe).
(Merged by Junio C Hamano -- gitster -- in commit 0498840, 14 May 2020)
fsck: report non-consecutive duplicate names in trees
Suggested-by: Brandon Williams
Original-test-by: Brandon Williams
Signed-off-by: René Scharfe
Reviewed-by: Luke Diamand
Tree entries are sorted in path order, meaning that directory names get a slash ('/') appended implicitly.
Git fsck checks if trees contain consecutive duplicates, but due to that ordering there can be non-consecutive duplicates as well if one of them is a directory and the other one isn't.
Such a tree cannot be fully checked out.
Find these duplicates by recording candidate file names on a stack and check candidate directory names against that stack to find matches.
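The ordering issue can be reproduced with a small Python sketch. It is an illustration only: it uses a plain set rather than the candidate stack described in the commit, and the helper names (`tree_sort_key`, `find_duplicate_names`, `has_consecutive_duplicate`) are hypothetical.

```python
def tree_sort_key(name: str, is_dir: bool) -> bytes:
    """Tree entries compare as if directory names ended with '/'."""
    return name.encode() + (b"/" if is_dir else b"")

def has_consecutive_duplicate(entries):
    """The check that only looks at neighbouring entries."""
    return any(a[0] == b[0] for a, b in zip(entries, entries[1:]))

def find_duplicate_names(entries):
    """Return every name that occurs more than once, neighbours or not.

    Simplified stand-in for the stack-based check in the commit.
    """
    seen = set()
    dups = []
    for name, _is_dir in entries:
        if name in seen and name not in dups:
            dups.append(name)
        seen.add(name)
    return dups

# A blob "a", a blob "a.c", and a directory "a" in the same tree.
entries = sorted(
    [("a", True), ("a.c", False), ("a", False)],
    key=lambda e: tree_sort_key(*e),
)

print(entries)
print(has_consecutive_duplicate(entries))
print(find_duplicate_names(entries))
```

Because '.' sorts before '/', the entry "a.c" lands between the blob "a" and the directory "a", so a neighbour-only comparison misses the duplicate.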
With Git 2.30 (Q1 2021), the logic to deal with a repack operation that ended up creating the same packfile has been simplified.
See commit 2fcb03b (17 Nov 2020), and commit 704c4a5 (16 Nov 2020) by Taylor Blau (ttaylorr).
See commit 63f4d5c (16 Nov 2020) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 39d38a5, 03 Dec 2020)
builtin/repack.c: don't move existing packs out of the way
Helped-by: Jeff King
Signed-off-by: Taylor Blau
When 'git repack' creates a pack with the same name as any existing pack, it moves the existing one to 'old-pack-xxx.{pack,idx,...}' and then renames the new one into place.
Eventually, it would be nice to have 'git repack' allow for writing a multi-pack index at the critical time (after the new packs have been written / moved into place, but before the old ones have been deleted). Guessing that this option might be called '--write-midx', this makes the following situation (where repacks are issued back-to-back without any new objects) impossible:
$ git repack -adb
$ git repack -adb --write-midx
In the second repack, the existing packs are overwritten verbatim with the same rename-to-old sequence. At that point, the current MIDX is invalidated, since it refers to now-missing packs. So that code wants to be run after the MIDX is re-written. But (prior to this patch) the new MIDX can't be written until the new packs are moved into place. So, we have a circular dependency.
This is all hypothetical, since no code currently exists to write a MIDX safely during a 'git repack' (the 'GIT_TEST_MULTI_PACK_INDEX' does so unsafely). Putting hypothetical aside, though: why do we need to rename existing packs to be prefixed with 'old-' anyway?
This behavior dates all the way back to 2ad47d6 ("git-repack: Be careful when updating the same pack as an existing one.", 2006-06-25, Git v1.4.1 -- merge). 2ad47d6 is mainly concerned about a case where a newly written pack would have a different structure than its index. This used to be possible when the pack name was a hash of the set of objects. Under this naming scheme, two packs that store the same set of objects could differ in delta selection, object positioning, or both. If this happened, then any such packs would be unreadable in the instant between copying the new pack and new index (i.e., either the index or pack will be stale depending on the order that they were copied).
But since 1190a1a ("pack-objects: name pack files after trailer hash", 2013-12-05, Git v1.9-rc0 -- merge), this is no longer possible, since pack files are named not after their logical contents (i.e., the set of objects), but by the actual checksum of their contents.
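The difference between the two naming schemes can be sketched in Python. This is a toy model under stated assumptions: the "packs" are plain byte strings, the object ids are made up, and `name_from_object_set` / `name_from_contents` are hypothetical helpers that only mimic the idea of each scheme, not Git's exact hashing.

```python
import hashlib

def name_from_object_set(object_ids):
    """Old idea: the name depends only on which objects are stored."""
    h = hashlib.sha1()
    for oid in sorted(object_ids):
        h.update(oid.encode() + b"\n")
    return "pack-" + h.hexdigest()

def name_from_contents(pack_bytes):
    """Newer idea: the name is a checksum of the bytes actually written."""
    return "pack-" + hashlib.sha1(pack_bytes).hexdigest()

# Two toy packs holding the same objects in a different order.
objects = {"1111": b"first object", "2222": b"second object"}
pack_a = b"".join(objects[o] for o in ["1111", "2222"])
pack_b = b"".join(objects[o] for o in ["2222", "1111"])

# Same name under the old scheme, even though the bytes differ.
print(name_from_object_set(objects) == name_from_object_set(reversed(list(objects))))
# Different names once the name follows the content.
print(name_from_contents(pack_a) == name_from_contents(pack_b))
```

Under the content-based scheme, two files with the same name necessarily hold the same bytes, so an index written for one is valid for the other.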
So, this 'old-' behavior can safely go, which allows us to avoid our circular dependency above.
In addition to avoiding the circular dependency, this patch also makes 'git repack' a lot simpler, since we don't have to deal with failures encountered when renaming existing packs to be prefixed with 'old-'.
This patch is mostly limited to removing code paths that deal with the 'old' prefixing, with the exception of files that include the pack's name in their own filename, like .idx, .bitmap, and related files. The exception is that we want to continue to trust what pack-objects wrote. That is, it is not the case that we pretend as if pack-objects didn't write files identical to ones that already exist, but rather that we respect what pack-objects wrote as the source of truth. That cuts two ways:
- If pack-objects produced an identical pack to one that already exists with a bitmap, but did not produce a bitmap, we remove the bitmap that already exists. (This behavior is codified in t7700.14).
- If pack-objects produced an identical pack to one that already exists, we trust the just-written version of the corresponding .idx, .promisor, and other files over the ones that already exist. This ensures that we use the most up-to-date versions of these files, which is safe even in the face of format changes in, say, the .idx file (which would not be reflected in the .idx file's name).
When rebuilding the multi-pack index file reusing an existing one, we used to blindly trust the existing file and ended up carrying corrupted data into the updated file, which has been corrected with Git 2.33 (Q3 2021).
See commit f89ecf7, commit ec1e28e, commit 15316a4, commit f9221e2 (23 Jun 2021) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit 3b57e72, 16 Jul 2021)
midx: report checksum mismatches during 'verify'
Suggested-by: Derrick Stolee
Signed-off-by: Taylor Blau
'git multi-pack-index verify' inspects the data in an existing MIDX for correctness by checking that the recorded object offsets are correct, and so on.
But it does not check that the file's trailing checksum matches the data that it records.
So, if an on-disk corruption happened to occur in the final few bytes (and all other data was recorded correctly), we would:
- get a clean result from 'git multi-pack-index verify', but
- be unable to reuse the existing MIDX when writing a new one (since we now check for checksum mismatches before reusing a MIDX)
Teach the 'verify' sub-command to recognize corruption in the checksum by calling midx_checksum_valid().
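As a rough illustration of what a trailing-checksum check involves, here is a Python sketch on a toy file format: a payload followed by the hash of that payload. It assumes a 20-byte SHA-1 trailer; `with_trailer` and `checksum_valid` are hypothetical helpers, not the midx_checksum_valid() function from Git's C code.

```python
import hashlib

HASH_LEN = 20  # SHA-1 digest size; an assumption for this sketch

def with_trailer(body: bytes) -> bytes:
    """Append the checksum of the body, as a writer would."""
    return body + hashlib.sha1(body).digest()

def checksum_valid(data: bytes) -> bool:
    """Recompute the hash of everything before the trailer and compare."""
    if len(data) < HASH_LEN:
        return False
    body, trailer = data[:-HASH_LEN], data[-HASH_LEN:]
    return hashlib.sha1(body).digest() == trailer

good = with_trailer(b"toy MIDX payload")
# Flip the bits of the very last byte: the payload is intact, the trailer is not.
corrupted = good[:-1] + bytes([good[-1] ^ 0xFF])

print(checksum_valid(good))
print(checksum_valid(corrupted))
```

A verifier that only inspects the payload would accept `corrupted`, which is the gap the commit closes.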
With Git 2.34 (Q4 2021), "git repack" has been taught to generate multi-pack reachability bitmaps.
See commit e861b09 (06 Oct 2021) by Jeff King (peff).
See commit 324efc9 (01 Oct 2021), and commit 6d08b9d, commit 1d89d88, commit 5f18e31, commit a169166, commit 90f838b, commit 08944d1, commit 6fb22ca, commit 56d863e (28 Sep 2021) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit 0b69bb0, 18 Oct 2021)
builtin/repack.c: support writing a MIDX while repacking
Signed-off-by: Taylor Blau
Teach git repack a new --write-midx option for callers that wish to persist a multi-pack index in their repository while repacking.
There are two existing alternatives to this new flag, but they don't cover our particular use-case.
These alternatives are:
- Call 'git multi-pack-index write' after running 'git repack', or
- Set 'GIT_TEST_MULTI_PACK_INDEX=1' in your environment when running 'git repack'.
The former works, but introduces a gap in bitmap coverage between repacking and writing a new MIDX (since the repack may have deleted a pack included in the existing MIDX, invalidating it altogether).
Introduce a new option which eliminates this race by teaching git repack to generate the MIDX at the critical point: after the new packs have been written and moved into place, but before the redundant packs have been removed.
This option is compatible with git repack's '--bitmap' option (it changes the interpretation to be: "write a bitmap corresponding to the MIDX after one has been generated").
The MIDX code does not handle the case where no packs remain after repacking, so avoid trying to generate a MIDX covering zero packs in the first place.
git repack now includes in its man page, under --write-bitmap-index:
This option has no effect if multiple packfiles are created, unless writing a MIDX (in which case a multi-pack bitmap is created).
git repack also now includes in its man page:
-m
--write-midx
Write a multi-pack index (see git multi-pack-index) containing the non-redundant packs.
With Git 2.38 (Q3 2022), the collection of what is referenced by objects in promisor packs has been optimized to inspect these objects in the in-pack order.
That will make the git fsck from kenorb's answer much faster.
See commit 18c08ab (16 Jun 2022) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 2b970bc, 11 Jul 2022)
Signed-off-by: Jeff King
When we generate the list of promisor objects, we walk every pack with a .promisor file and examine its objects for any links to other objects.
By default, for_each_packed_object() will go in pack .idx order.
This is the worst case with respect to our delta base cache.
If we have a delta chain of A->B->C->D, then visiting A may require reconstructing both B and C, unless we also visited B recently, in which case we may have cached its value.
Because .idx order is based on sha1, it's random with respect to the actual object contents and deltas, and thus we're unlikely to get many cache hits.
If we instead traverse in pack order, then we get the optimal case: packs are written to keep delta families together, and to place bases before their children.
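The cache effect can be approximated with a small Python simulation. It is a toy model, not Git's delta base cache: objects are just names, a tiny LRU dictionary stands in for the cache, and `count_rebuilds`, `FAMILIES`, `DEPTH` and the object names are all made up for illustration.

```python
import hashlib
from collections import OrderedDict

def count_rebuilds(order, base_of, cache_size=4):
    """Count how many objects get reconstructed while visiting `order`.

    base_of maps an object to its delta base (None for a full object).
    """
    cache = OrderedDict()
    rebuilds = 0
    for obj in order:
        # Walk towards the base until we hit a cached or non-delta object.
        chain = []
        cur = obj
        while cur is not None and cur not in cache:
            chain.append(cur)
            cur = base_of[cur]
        if cur is not None:
            cache.move_to_end(cur)  # cache hit: mark as recently used
        # Rebuild from the base outward, caching each intermediate result.
        for item in reversed(chain):
            rebuilds += 1
            cache[item] = True
            while len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used
    return rebuilds

FAMILIES, DEPTH = 50, 8
base_of = {}
pack_order = []  # families kept together, bases before their children
for f in range(FAMILIES):
    for d in range(DEPTH):
        name = f"f{f}-{d}"
        base_of[name] = f"f{f}-{d - 1}" if d > 0 else None
        pack_order.append(name)

# Sorting by a hash of the name scatters the delta families, like .idx order.
idx_order = sorted(pack_order, key=lambda n: hashlib.sha1(n.encode()).hexdigest())

pack_cost = count_rebuilds(pack_order, base_of)
idx_cost = count_rebuilds(idx_order, base_of)
print("pack order:", pack_cost, "rebuilds")
print("idx order: ", idx_cost, "rebuilds")
```

In pack order every object is rebuilt exactly once, because its base is always the most recently cached entry. In hash order the same chains are rebuilt repeatedly.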
Even on a modest repository like git.git, this has a noticeable speedup on p5600.4, which runs "fsck" on a partial clone with blob:none (so lots of trees which need to be walked, and which delta well):
Test      HEAD^                  HEAD
-------------------------------------------------------
5600.4:   17.87(17.83+0.04)      15.42(15.35+0.06) -13.7%
On a larger repository like linux.git, the speedup is even more pronounced:
Test      HEAD^                  HEAD
-----------------------------------------------------------
5600.4:   322.47(322.01+0.42)    186.41(185.76+0.63) -42.2%
Any other operations that call is_promisor_object(), like "rev-list --exclude-promisor-objects", would similarly benefit, but the invocations in p5600 don't actually trigger any such cases.
Note that we may pay a small price to build a rev-index in-memory to do the pack-order traversal.
But it's still a big net win, and even that small cost goes away if you are using pack.writeReverseIndex.