12

So I've been trying to understand what a SHA-1 collision is and what it means but there's one thing I'm not quite understanding.

How did they manage to not only find two documents with the same hash - but also one that looks visually almost exactly the same as the original? All the text is the same and the only difference is the background colour.

My understanding is that the really difficult part of this was finding any two document where the hash matched and that it took $100k of computing power to do this.

Wouldn't it be substantially harder if you were restricted by only collisions of documents that looked similar? Did they plan ahead for this to happen or was it just a coincidence?

1
  • Their exact methods won't be published until 90 days after it was announced, following their disclosure policy. Commented Feb 25, 2017 at 17:31

5 Answers 5

10

The main idea behind this attack relies on Differential cryptanalysis. You have two inputs t0 and t'0 with a difference of Δ0. Once these two inputs have gone through the function f (the compression function in the case of SHA-1) you get a difference Δ1 and so on (see following picture)...

enter image description here

The goal here is to find a difference in the input Δ0 such that after some iterations you get Δ2 = 0, in other words no difference.

the SHAttered Attack

Because of the Merkle–Damgård construction of SHA-1, one can choose to alter the differences between the iterations in order to improve the differences to match his needs. In the case of the SHAttered Attack, they chose an initial prefix (P), then later on the next blocks they introduce a difference (M1(1),M1(2)) and remove it (M2(1) and M2(2)). At this point they already have their collision. They just need to continue with the same following blocks (S), leading to a collision on the whole input.

enter image description here

SHA-1(P|| M1(1) || M2(1) || S)= SHA-1(P||M1(2) || M2(2) || S)

The values of M1(1), M1(1), M2(1) and M2(2) are as follows:

enter image description here

The difficulty is thus to find the correct Mi that satisfy the chosen prefix criteria P and that cancel themselves nicely afterwards. The probabilities of such to occur are really small, thus the need of massive amount of calls to SHA-1 compression function (263.1).

In the case of the provided PDFs, they also give a nice infographic which explain where the difference is located in the PDFs:

enter image description here

5

A semi-brute-force (263 SHA-1 operations) chosen prefix attack using the first 192 bytes of a carefully constructed PDF. The computational and cryptographic aspects are impressive, but the trickery required to leverage this attack is just as interesting.

The technical explanation can be found over on cryptography.SE in Biv's answer (which he has partly repeated here). An English simplification of this is that SHA-1 has had a known issue for over a decade now (PDF paper here), this can be used in chosen prefix collision attacks. The header in a PDF and the leading structure of an X.509 certificate are useful targets for this (this is part of the reason SHA-1 RSA signatures on certificates are considered insecure).

The 2005 attack allows, for a given input, better-than-brute-force computation of a pair of extra input blocks which "invert" each other and make no difference to the SHA-1 output if they are appended to the input. That alone is not so exciting, but it also allows computing a pair of those blocks, that is what is used in this attack. Once you can manipulate SHA-1 in this way you can continue to append more data to make real documents with SHA-1 collisions, albeit with the differences constrained to two adjacent 64-byte blocks.

That's what this means: SHA-1

Where P is the prefix (PDF header), M(1) and M(2) are the pairs of blocks which manipulate the SHA-1 state, and S is the suffix (document payload). To be clear: the prefix and the suffix must be identical, only the magic blocks may differ.

This partly explains how the provided detector works and why it only needs one document. It performs an incremental SHA-1 over the document length for each SHA-1 block input size (64 bytes) and checks for duplicate/suspicious SHA-1 internal state (see their paper here for details). You can partly see what's happening with (from bash):

for ss in {0..512..64};do head -c $ss shattered-1.pdf | sha1sum -; done
for ss in {0..512..64};do head -c $ss shattered-2.pdf | sha1sum -; done 

What's happened now is that it has been proven that this is computationally practical (and can be run on a GPU), and also has real world consequences (which to be honest, some attacks don't) by manipulating a document with visible results.

There's one tricky little problem that's been glossed over, those magic blocks are not controlled by the attacker, he must compute them (up to 263 operations) for a defined input prefix, and take whatever comes out as the result (which almost certainly won't be a red or blue JPEG).

What one could do is take a format which allows content ballast (PDF is good, as is anything which ignores extra content at the end, or the typical word processor document) and simply have two versions with the alternate pairs of magic blocks hidden away, but indistinguishable presentation of content. Not so exciting.

So how did they do it? In the JPEG.

Firstly the PDF format diagram is slightly misleading, JPEGs don't really have clearly defined header or data parts, they are a stream of segments each tagged with a type and (mostly) with a 2 byte length field.

The trick then is that the JPEG has two carefully aligned comment segments at the start:

hex dump

The first comment (with the obligatory witticism) aligns the second comment so that the length LSB is the first byte of the magic block (64 byte SHA-1 block alignment). In the PDF 1 document that byte is 0x73 (above), in the second it's 0x7f which makes that comment segment longer. At (relative) offset 0x0173 is another comment of length 0x00fc which consumes the the start of JPEG data at offset 0x017f. The JPEG image data in PDF 2 starts at 0x017f. In the image in document 1 there are a set (12) of carefully placed and sized comments to bounce over data segments that are processed in the image in document 2. These are mostly quantisation tables and Huffman tables that account for the image difference (I think, it's a bit of a mess in there).

So the prefix in the attack is the PDF header + content preamble + the start of the JPEG stream. Then come the magic blocks. The suffix is the remaining JPEG stream + the closing part of the PDF.

X.509 DER form certificates use ASN.1 encoding which supports fairly arbitrary metadata fields and also uses a similar type of length encoded fields as JPEG, so the same type of attack could work there (as long as the ASN.1 nesting doesn't trip you up), but I can't think of a useful way to leverage (like for MD5) that given that this is a known prefix attack and you won't know the prefix until the certificate is signed.

Another interesting observation is that the leading 320 bytes of these PDFs can be used as a prefix for any other arbitrary PDF, subject only to the constraints imposed by those leading 320 bytes, i.e. must be a PDFv1.3 and begin with a JPEG (or more precisely an Xobject type Image that works with a PDF DCTDecode filter, even the image dimensions and size are variable as they appear after the image data in the PDF serialisation). That genie is out of the bottle now: https://github.com/nneonneo/sha1collider

3

The different images displayed by the two SHAttered PDFs is due entirely to a single byte located at offset 192, which is "\x73" in shattered-1.pdf and "\x7f" in shattered-2.pdf. This byte forms the second part of a two-byte big endian integer that defines the length of a COM ("Comment") segment within the JPEG image stream. Bytes contained in a COM segment are not used for anything and are therefore skipped by the JPEG parser.

In shattered-1.pdf this COM segment length is read as "\x01\x73" (371 bytes), whereas in shattered-2.pdf it is read as "\x01\x7f" (383 bytes). The difference in this COM segment length allows the construction of an identical suffix containing two sets of image data arranged in such a way that what is parsed as image data in one file will be parsed as COM segments in the other. The result is the files display different images.

The image below compares the first 832 bytes of shattered-1.pdf and shattered-2.pdf:

comparison The magic blocks (the two different binary blobs that, when appended to the identical prefix, produce the same SHA-1 hash) are outlined in blue. The hex digits in red font are the only differences between the two PDFs. COM segment markers with their length bytes are highlighted in yellow and the remaining bytes contained in the COM segment are shaded in grey (note that all but the first byte of the magic blocks are contained in COM segments and not used for anything). Image data is highlighted in green. The bytes at the start of the file that are not highlighted is boilerplate PDF header material.

Indeed both PDF files fully contain both images. This is simple enough to demonstrate (the code below references shattered-1.pdf, but the results would be identical if shattered-2.pdf were substituted).

Begin by locating the byte offsets of the JPEG APP0 and EOI segment markers ('\xff\xe0' and '\xff\xd9', respectively) in the PDF:

$ grep -oba "$(printf '\xff\xe0')" shattered-1.pdf
574:��
187049:��
$ grep -oba "$(printf '\xff\xd9')" shattered-1.pdf
187043:��
421532:��

Since there are exactly two APP0 segment markers (starting at offsets 574 and 187049) and two EOI markers (starting at offsets 187043 and 421532), we can deduce that one set of image data resides in byte range 574 through 187045 (186,471 bytes) and the other in byte range 187049 through 421534 (234,485 bytes).

Create two empty JPEG files containing only a SOI header (\xff\xd8) then append each with one of the two byte ranges from the PDF:

$ printf '\xff\xd8' | tee a.jpeg >> b.jpeg
$ dd if=shattered-1.pdf of=a.jpeg oflag=append conv=notrunc bs=1 skip=574 count=186471
$ dd if=shattered-1.pdf of=b.jpeg oflag=append conv=notrunc bs=1 skip=187049 count=234485

This produces two JPEGs, a.jpeg and b.jpeg, and these two JPEGs are indeed the two different images displayed by shattered-1.pdf and shattered-2.pdf.

both images

2

They planed it ahead and having similar files have nothing to do with it.

They are controlling both PDFs and the main part that makes it work are hidden inside pdf file format. Take a look: https://shattered.it/static/pdf_format.png

1

SHA-1 is vulnerable to chosen-prefix attack, due to the construction of the SHA-1 algorithm. This means that if an attacker managed to find two different strings with the same prefix, they will be able to append arbitrary data to both data and they'd get the same SHA-1 for the resultant file.

In other words, the chosen prefix collision attack exploits the property that given two strings a and b such that SHA1(a) == SHA1(b), then SHA1(a + c) == SHA1(b + c) for any a and b even when a != b.

As mentioned in the shattered.io, the attack works for PDF because the PDF file format has a vulnerable structure:

enter image description here

To put it simply, a file format that's vulnerable to prefix attack has the structure:

  • prefix, which contains a user-visible data that is referenced later in the body
  • collision block, a space for arbitrary, non-user-visible data
  • body, which is the same between the two collided files

A prefix collision attack works by finding a string in the collision block that will cause the prefix + collision block of two different prefixes produce the same hash. Once you discover a colliding prefix, you can then append arbitrary bodies to the PDF and this will alter what is shown on the PDF while keeping the hashes equal. To put it simply, once you discovered a colliding prefix, you can append anything to the body, and as long as the appendages are the same, your two resultant files will also have equal hashes.

As a result, user-visible data are altered as the interpretation of the body depends on the data in the headers.

If you look at the hexdump diff of the published PDF file, the only difference between the shattered-1.pdf and shattered-2.pdf is just a very small section near the start of the file:

$ diff -u  <(od -A x -t x1z -v ./shattered-1.pdf) <(od -A x -t x1z -v ./shattered-2.pdf) 
--- /dev/fd/63  2017-02-28 00:27:15.844105847 +1100
+++ /dev/fd/62  2017-02-28 00:27:15.840104052 +1100
@@ -10,14 +10,14 @@
 000090 72 65 61 6d 0a ff d8 ff fe 00 24 53 48 41 2d 31  >ream......$SHA-1<
 0000a0 20 69 73 20 64 65 61 64 21 21 21 21 21 85 2f ec  > is dead!!!!!./.<
 0000b0 09 23 39 75 9c 39 b1 a1 c6 3c 4c 97 e1 ff fe 01  >.#9u.9...<L.....<
-0000c0 73 46 dc 91 66 b6 7e 11 8f 02 9a b6 21 b2 56 0f  >sF..f.~.....!.V.<
-0000d0 f9 ca 67 cc a8 c7 f8 5b a8 4c 79 03 0c 2b 3d e2  >..g....[.Ly..+=.<
-0000e0 18 f8 6d b3 a9 09 01 d5 df 45 c1 4f 26 fe df b3  >..m......E.O&...<
-0000f0 dc 38 e9 6a c2 2f e7 bd 72 8f 0e 45 bc e0 46 d2  >.8.j./..r..E..F.<
-000100 3c 57 0f eb 14 13 98 bb 55 2e f5 a0 a8 2b e3 31  ><W......U....+.1<
-000110 fe a4 80 37 b8 b5 d7 1f 0e 33 2e df 93 ac 35 00  >...7.....3....5.<
-000120 eb 4d dc 0d ec c1 a8 64 79 0c 78 2c 76 21 56 60  >.M.....dy.x,v!V`<
-000130 dd 30 97 91 d0 6b d0 af 3f 98 cd a4 bc 46 29 b1  >.0...k..?....F).<
+0000c0 7f 46 dc 93 a6 b6 7e 01 3b 02 9a aa 1d b2 56 0b  >.F....~.;.....V.<
+0000d0 45 ca 67 d6 88 c7 f8 4b 8c 4c 79 1f e0 2b 3d f6  >E.g....K.Ly..+=.<
+0000e0 14 f8 6d b1 69 09 01 c5 6b 45 c1 53 0a fe df b7  >..m.i...kE.S....<
+0000f0 60 38 e9 72 72 2f e7 ad 72 8f 0e 49 04 e0 46 c2  >`8.rr/..r..I..F.<
+000100 30 57 0f e9 d4 13 98 ab e1 2e f5 bc 94 2b e3 35  >0W...........+.5<
+000110 42 a4 80 2d 98 b5 d7 0f 2a 33 2e c3 7f ac 35 14  >B..-....*3....5.<
+000120 e7 4d dc 0f 2c c1 a8 74 cd 0c 78 30 5a 21 56 64  >.M..,..t..x0Z!Vd<
+000130 61 30 97 89 60 6b d0 bf 3f 98 cd a8 04 46 29 a1  >a0..`k..?....F).<
 000140 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  >................<
 000150 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  >................<
 000160 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  >................<

The section that are different here is in the middle of an embedded JPEG image header. If you extract the JPEG image out of the PDF, you'll notice that the image has an unusual comment metadata:

$ file shattered-1-img.jpg shattered-2-img.jpg 
shattered-1-img.jpg: JPEG image data, comment: "F\334\221f\266~\021\217\002\232\266!\262V\017\371\312g\314\250\307\370[\250Ly\003\014+=\342\030\370m\263\251\011\001\325\337E\301O&\376\337\263\3348\351j\302/\347\275r\217\016E\274\340F", comment: "", comment: "\377\332", comment: "\377\332", comment: "\377\332", comment: "\377\332", comment: "\377\332", comment: "\377\332", comment: "\377\332", comment: "\377\332", comment: "\377\332", comment: "\377\332", Exif Standard: [TIFF image data, big-endian, direntries=1], baseline, precision 8, 1024x740, frames 3
shattered-2-img.jpg: JPEG image data, comment: "F\334\223\246\266~\001;\002\232\252\035\262V\013E\312g\326\210\307\370K\214Ly\037\340+=\366\024\370m\261i\011\001\305kE\301S", comment: "\377\376'\364", progressive, precision 8, 1024x740, frames 3

If you pay close attention to the diff of the pdf and the jpg, you'll notice several things:

  • the size of the diff is 64 bytes (= 512 bits), which corresponds to SHA-1 blocksize
  • the start of the diff (0x00c0-0x0140 in the PDF) is aligned to 64 bytes boundary, note that SHA-1 works in 64-bytes chunks
  • the bytes in the collision block are not entirely random, it follows the pattern two bytes same and two bytes different. For example, the first 16 bytes of each collision block:

73 46 dc 91 66 b6 7e 11 8f 02 9a b6 21 b2 56 0f

7f 46 dc 93 a6 b6 7e 01 3b 02 9a aa 1d b2 56 0b

  • just before the collision block are the JPEG comment segment marker (ff fe), which is followed by two bytes that specifies the length of the comment segment
  • the comment segment length marker (01 73) are split between the collision block boundary, 01 is outside the collision block boundary, while 73 and 7f are inside the boundary
  • the length of the comment segment are different (!), for shattered-1.pdf the length of the comment segment is 0x173 (= 371 bytes) while for shattered-2.pdf the length of the comment segment is 0x017f (= 383 bytes)
  • which means that there are 12 bytes difference in where the comment segment ends between the two files

Let's investigate where the comment segment ends:

enter image description here

enter image description here

I've marked where the comment segment ends in the above dump. shattered-1.pdf ends the comment segment in the place marked in blue (address 0x0232), whereas shattered-2.pdf ends the comment segment in the place marked in red (address 0x023e). footnote-1

In the place marked with blue where shattered-1.pdf image's comment segment ends, we find:

1a. a comment section (ff fe) of length 0x00fc (= 252 bytes), which jumps over shattered-2.pdf's metadata section

1b. a comment section (ff fe) of 0x27f4 (= 10228 bytes), which jumps a chunk of shattered-2.pdf's SOS data (see below)

1*. these series of sections coincide with 2*. Within these sections, shattered-1.pdf simply have a number of comment sections to jump over shattered-2.pdf's image data.

1c. at address 0x2daa9 is the real start of shattered-1.pdf's image data

In the place marked with red where shattered-2.pdf image's comment segment ends, we find a fairly regular JPEG metadata section:

2a. a JFIF-APP0 segment marker (ff e0) of length 0x0010 (=16 bytes)

2b. a DQT (Define Quantization Table) segment (ff db) of length 0x0043 (= 67 bytes)

2c. another DQT (Define Quantization Table) segment (ff db) of length 0x0043 (= 67 bytes)

2d. a SOF2 (Start of Frame 2 - Progressive DCT) segment (ff c2) of length 0x0011 (= 17 bytes)

2e. a DHT (Define Huffman Table) segment (ff c4) of length 0x001e (= 30 bytes)

2f. another DHT (Define Huffman Table) segment (ff c4) of length 0x001d (= 29 bytes)

2g. a comment segment (ff fe) of length 0x0006 (= 6 bytes)

2h. an SOS (Start of Scan) header segment (ff da) of length 0x0c (= 12 bytes), which would be immediately followed by image data (a1 fa ...). A scan segment are terminated by a segment start marker (anything that starts with ff, except ff 00 which is an escape sequence when the image data itself need to contain literal ff.

2*. these series of sections coincide with 1*. Within these sections, we find a sequence of DHT and SOS sections interleaved with shattered-1.pdf image's comment segment markers to ensure that these sequences are skipped when the JPEG renderer parses shattered-1.pdf's image. I believe this chunking was needed because a comment segment can only contain up to 0xffff (= 65535 bytes) of comment data, while the image actually need to be larger than that.

2i. at address 0x02daa3, we find an EOI (End of Image) marker (ff d9), which marks the end of shattered-2.pdf's JPEG data. Any data after this is effectively discarded when parsing shattered-2.pdf's image data. Four bytes later, is the real start of shattered-1.pdf's image data (1c).

Final notes:

  1. From what I can see, there is nothing in this attack that actually would have required using PDF at all. The entire attack looks like it could have been done entirely in JPEG.
  2. Given the right know-how, it should be possible to use the collision block with any two images. The two images don't have to be similar at all.

Footnotes:

  1. JPEG block length bytes counts the number of data bytes plus the length marker itself, but excludes the block identifier, which means that the start of the initial comment segment was 0x199, not 0x201.

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .