decompressing git objects with zlib

Question

I decompressed three git objects with this Python script:

import zlib

filename = '/path_to_file'

# Git stores each loose object as a zlib-compressed file; read it as raw bytes.
with open(filename, 'rb') as f:
    compressed_contents = f.read()

decompressed_contents = zlib.decompress(compressed_contents)
print(decompressed_contents)

For these three objects, I get these three outputs:

b'tree 32\x00100644 file\x00\xe6\x01\xe5\x92\x8e\xcc.\xc5\xbe\t\x91{\xe9\x92:\x85\xc4\x89\xe9H'

b'commit 196\x00tree 2b32fe41c7f8c21d5010fb59a59bcce42b2b3ab5\nauthor author <author> 1643729123 +0100\ncommitter author_email <author_email> 1643729123 +0100\n\nadd hello\n'

b'blob 6\x00hello\n'

The Git documentation (the Pro Git book) says that Git adds a null byte at the end of the object header, shown there as \u0000. But when I decompress those objects with zlib, I see \x00 where I expected \u0000.

So, what does git really store in those files, \u0000 or \x00?
Does this script output the raw content of git objects?

\u0000 and \x00 are the same thing: print(repr('\u0000')) — Corralien, Commented Feb 1, 2022 at 16:46
They are not the same thing, confusing them is dangerous, as it pretends that "encodings" don't exist. — Joachim Sauer, Commented Feb 1, 2022 at 16:51

Joachim Sauer · Accepted Answer · 2022-02-01 19:58:19Z


\x00 is stored. Or more precisely: a single byte with the value 0 (or 0x00, if you want) is stored.
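As a minimal sketch of what this means for the objects in the question, the following splits a decompressed object at its first zero byte and checks the size declared in the header. The helper name split_git_object is hypothetical, and the byte strings are copied from the question's output rather than read from a repository.

```python
def split_git_object(raw: bytes):
    """Split a decompressed Git object into (type, declared size, body)."""
    # The header is everything before the FIRST zero byte. The body may
    # contain further zero bytes (the tree object below does).
    header, _, body = raw.partition(b"\x00")
    kind, _, size = header.decode("ascii").partition(" ")
    return kind, int(size), body


# Byte strings copied from the question's output.
blob = b"blob 6\x00hello\n"
tree = (b"tree 32\x00100644 file\x00\xe6\x01\xe5\x92\x8e\xcc.\xc5\xbe\t"
        b"\x91{\xe9\x92:\x85\xc4\x89\xe9H")

for raw in (blob, tree):
    kind, size, body = split_git_object(raw)
    # The declared size should match the number of bytes after the header.
    print(kind, size, len(body) == size)

# Indexing a bytes object yields an integer: the separator is the value 0.
print(blob[6])
```

Splitting on b"\x00" works here because the separator is a byte, not a character; no encoding is involved until the header text is decoded.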

\u0000 is the Unicode NUL character, a.k.a. U+0000 NUL. The \u escape mechanism is a common way to represent Unicode characters, even though it's usually limited to 4 hex digits (which means it can't represent Unicode code points outside of the BMP, such as U+1F600 😀).

Why are these two used interchangeably? Because in most character encodings \u0000 is actually encoded as 0x00. Specifically, most 8-bit encodings, as well as UTF-8, follow this practice.
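A short sketch of that claim in Python: it encodes the character U+0000 with a few codecs from the standard library. The particular list of codecs is just an illustrative choice.

```python
nul = "\u0000"  # the character U+0000

# Single-byte encodings and UTF-8 map U+0000 to one zero byte.
for name in ("ascii", "latin-1", "cp1252", "utf-8"):
    print(name, nul.encode(name))

# Wider encodings use more than one byte for the same character,
# which is why "character" and "byte" cannot be treated as synonyms.
print("utf-16-le", nul.encode("utf-16-le"))
print("utf-32-le", nul.encode("utf-32-le"))
```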

Note that it's still important to distinguish the two things, because one is a character (that will often be mapped onto a byte) and the other is a byte value (that can often be interpreted as a character).
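Python makes this distinction visible through its str and bytes types. The following is only an illustration using the variable names char and byte, which are made up for this sketch.

```python
char = "\u0000"   # str: one Unicode character, U+0000
byte = b"\x00"    # bytes: one byte with the value 0

# In a str literal, "\x00" and "\u0000" spell the same character,
# which is why print(repr('\u0000')) shows '\x00'.
print("\x00" == "\u0000")

# The character and the byte live in different types.
print(type(char) is type(byte))

# Moving from one to the other always goes through an encoding.
print(char.encode("utf-8") == byte)
print(byte.decode("utf-8") == char)
```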

edited Feb 1, 2022 at 19:58

answered Feb 1, 2022 at 16:50

Joachim Sauer


So what is actually stored in the file is 0x00, and zlib interprets it as \x00?
– Karichi
Commented Feb 1, 2022 at 16:58

Both 0x00 and \x00 are just notations that describe a single byte with the value 0. Files store bytes, not their representations. 0, \x00, 0x00, and \0 are all more or less widely used ways for us humans to write this same value (8 bits, all of which are 0). Create a hex dump of those files (after decompressing them, of course) using something like hd inputfile, and it will probably help you understand the distinction.
– Joachim Sauer
Commented Feb 1, 2022 at 17:24
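If a hex dump tool is not at hand, a rough equivalent can be sketched in Python. The hexdump function below is a hypothetical helper written for this example, not part of zlib or Git, and its layout only approximates what hd prints.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Return a simple offset / hex / ASCII dump of data."""
    pad = width * 3 - 1  # width of a full row of "xx " groups
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Non-printable bytes (including the zero byte) are shown as ".".
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{pad}}  |{text_part}|")
    return "\n".join(lines)


# The decompressed blob from the question.
dump = hexdump(b"blob 6\x00hello\n")
print(dump)
```

In the hex column, the header separator shows up as 00 between 36 (the digit "6") and 68 (the letter "h").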
zlib only deals with bytes. It does not care what they represent. The input may be encoded in many ways. (As noted, in UTF-8, U+0000 is encoded as a single zero byte.) All zlib guarantees is that what you get from decompression is exactly what was compressed, byte for byte.
– Mark Adler
Commented Feb 1, 2022 at 19:09
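The byte-for-byte guarantee can be checked with a small round trip. This sketch compresses the blob content from the question itself instead of reading a file from a repository, so the compressed bytes are not claimed to match what Git wrote to disk.

```python
import zlib

original = b"blob 6\x00hello\n"

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

# Decompression returns exactly the bytes that were compressed.
print(restored == original)

# The header separator survives as a single byte with the value 0.
print(restored[6])

# The same bytes result from UTF-8 encoding text that contains U+0000.
print("blob 6\u0000hello\n".encode("utf-8") == original)
```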
