70
$\begingroup$

This weekend, while totally minding my own business and in no way being suspicious, I just happened to come across the following interesting document:

003-004_thumb.jpg
Left page: download as TIFF (100,203,616 bytes) / JPEG (20,003,558 bytes)
Right page: download as TIFF (100,784,464 bytes) / JPEG (20,221,606 bytes)

The thumbnail doesn't do it justice: here's a small sample of the upper-left corner at full resolution from the scanner:

003_sample.jpg

It looks like some sort of paper data archiving format. I need to find out what's on there (purely out of curiosity, of course!) but I have no idea where to start. I heard that you guys like solving mysterious messages, so do you want to give it a shot?


Oh, by the way, I pulled some more papers out of the incinerator at the same time as that document. I'm pretty sure they're related, maybe they'll be helpful in figuring out how to decode it.

001_scaled.jpg
(click to enlarge) download as TIFF (63,056,232 bytes) / JPEG (8,353,067 bytes)

002_scaled.jpg
(click to enlarge) download as TIFF (74,861,272 bytes) / JPEG (12,290,125 bytes)


Puzzlemaker's note: this puzzle will be 'officially' solved when you successfully extract a computer file from the two-page document at the top of this question. Anything in that file is more of an easter egg than a part of the puzzle.

$\endgroup$
12
  • 4
    $\begingroup$ Holy smokes, I guess this needs a Quick Response. $\endgroup$ Commented Apr 25, 2016 at 14:38
  • 6
    $\begingroup$ Hold on while I download 300mb. :p $\endgroup$ Commented Apr 25, 2016 at 21:39
  • 1
    $\begingroup$ You might want to start a new fortnightly topic challenge with the topic "visual". $\endgroup$
    – Sleafar
    Commented Apr 25, 2016 at 21:40
  • 8
    $\begingroup$ I just happened to come across the following interesting document vs. I pulled some more papers out of the incinerator at the same time as that document. So you just happened to come across them in the incinerator? I believe you have been acting suspiciously, despite your claims. $\endgroup$ Commented Jun 30, 2016 at 17:01
  • 4
    $\begingroup$ There is one thing, which is even more tempting than the 500-point bounty, and it is the feeling of discovery when figuring out a puzzle. I don't know if you are about to reveal the complete answer after the bounty period, but I really hope not. Please let us have to enjoy the discovery towards the solution of this intriguing puzzle. Thanks for the experience, @2012rcampion! $\endgroup$
    – elias
    Commented Jul 31, 2016 at 4:35

6 Answers 6

32
+550
$\begingroup$

Here you can find the files found in the recovered compressed file. The ogg file is distorted, maybe somebody with better audio editing skills can figure out something. The pictures and filenames already tell enough for me though.

A shorter wrap-up of all the findings that lead to the result can be read here, but I also kept the original post with updates for those who are interested in the complete history of the discoveries.

I promised to share all the code I've written to solve this puzzle. I've cleaned them a little bit, and made them available here (zip, ~30.5kB).

The corners for positioning, the alignment bits around them and two extra white pixels in the middle are at the very same positions for the tiles of all three raven tilesets. If we rotate the tiles on the two-page document 180 degrees, they have all these bits in the very same positions too.

The pixels should be converted to a bitstream, using boustrophedon traversing: rows by two, alternating between left-to-right and right-to-left, inside those always column first, top to bottom. See the red arrows on the picture below for clarification. White pixels should be converted to 0, black ones to 1.

The forementioned two white pixels actually work as a separator between the header and body part of the bitstream. The header is 378 bits long, while the body consists of 2040 bits.

enter image description here

The header should be read as 63 6-bit units. It uses a Reed-Solomon code with a message length of 15 units, and 67 as the primitve polynomial - just as it was given as a parameter for gfPow on the printout. If you don't know Reed-Solomon codes, you can think of it as a clever version of parity bits. The high redundancy it introduces allows us to recover the important 15-unit long main part of the header even if any 24 units (out of the 63) were corrupted. This 24 comes from an important parameter of the code, 2x24+1=49 is the so-called distance of the error correction, meaning that any two encoded words differ in at least 49 units. It is also not a coincidence, that 49=63-15+1, that is it is one larger than the difference of the block length and the message length. So the first 15 6-bit units are the important part of the header, the rest was there to make it possible to recover this first 15 if they are read imperfectly.

Even after the solution, most of these 15 6-bit units' meaning remained unknown, but we know that the last 4 is a counter. Each tile has a unique index, starting from zero, with an increment of one. The tiles were ordered according to this index in the raven-none and raven-ordered tilesets, but they were in a randomized order in the raven-random tileset and for the pages in the focus of the puzzle. For this latter, the largest index found was 2408 - suggesting there were at least 2409 tiles in this tileset, but only 2182 remained intact and were given in the tiles.zip provided by the official hint.

The only other 6-bit unit understood was the very first one. Its value serves as a parameter for how the body should be read - this was 16 for all the raven tilesets, but 20 for the last one.

The body uses Reed-Solomon error correcting codes as well, but with different parameters than the header. Here, units are 8 bits = 1 byte, and the block length is 255 bytes. The primitive polynomial is 301 for these (just as it was for rsgentest on the printout), the distance is 33 for the raven tilesets, and 41 for the tileset on the burnt paper. 33=2x16+1 and 41=2x20+1, so the first number in the header is the maximum number of errors that can be corrected in the message.

If a tile is successfully decoded, we're left with 223 correct bytes per tile in the raven tilesets, 215 in the document. But for some tilesets not all the tiles can be successfully decoded: raven-ordered suffered a spilled out coffee, and the two-page document got burnt. It is still possible to recover these, because of a second layer of Reed-Solomon codes. The raven tilesets use 223 bytes block length and 191 bytes message length, the two page document uses 215 bytes block length and 175 bytes message length. All the other parameters are the same they were in the first encoding.

For the raven-none tileset there is no interleaving, so the second layer is just another encoding on a single tile. But for all the other tilesets this second layer uses a diagonal structure: for example on the mysterious document the 617th block starts with the 1st byte of the 617th tile's , then the 2nd byte of the 618th tile's message follows, and so on, the last one in this block is the 215th byte of the 831st tile's message. Wherever it is needed, the indices wrap around, so the 2nd byte of the 2409th block is coming from the 2nd byte of the 1st tile's message (assuming that there are actually 2409 tiles in total, and not more).

A last problem was, that some of these blocks miss more than 20 bytes. However, in this case, the position of these wrong bytes are known (using coding theory terminology, in this case they are called erasures, not errors), and because of this, twice as many can be corrected. It is possible to recover all these blocks, extract the messages inside them, and you just have to concatenate them to get the original file. It is a bz2 file, containing a tar file, containing the three images and the audio file listed on the link above.


Original post (April 28)
Just some ideas which might lead someone somewhere:

First,

@ffao's answer already suggested the use of polynomials over finite fields. Might it be, that the 'rs' in the filenames means Reed-Solomon, which is a type of error-correcting codes, just as QR codes are? I don't see the maths matching, however.

And I think the last page contains

three examples, what the encoding (with different parameters) produces on a sample text, probably Edgar Allan Poe's The Raven. At least that's what that filename suggests for me.

Update #1 (July 6)
I finally decided to digitalize the images. You can find the output as a zip of text files here (~1.9MB).
I used the numbered, cut out and aligned tiles from @2012rcampion's answer with hints. 1s represent whites, and 0s represent blacks. There are some errors in the conversion, as the blacks and whites weren't perfect, but hey, if it's QR, it should be prone to errors :).

An even more cropped version can be found here (~1.6MB).
The difference from the above one is that 4 bits are removed in every direction, so there is no overlap between two tiles, and the aligment bits are cut in quarters in this conversion.

Feel free to use them!

Update #2 (July 7)
Sorry, I just realized that the cropped images were cropped too much (missing 4 important rows on the right and the bottom). I've corrected them, the link above now points to the corrected version.
I also changed the threshold used to distinguish black and white cells. It's still not perfect, but seemed to be better then the previous one.

Update #3 (July 7)
Using again the official hint answer,

this time the averaged pictures (for which the original idea comes from @WesleySitu). It seems that the alignment bits of the raven-* documents look like the bluish parts of this (png, ~2.6kB).
raven_aligment_bits_and_traversing_order

However,

the mysterious document's alignment bits do not match, until you rotate them 180 degrees, upside down. So I'm pretty sure @LeppyR64's observation about rotating them upside down before further processing is correct.

Update #4 (July 11)
Traversing the pixels in the order that @WesleySitu suggested, the raven-* codes turn into these (zip, ~61.2kB) strings of 0s and 1s. If it is needed, I can convert the pages of the mysterious document too, but I think we should understand first what raven.txt went under.
For clarification, the order in which the pixels are read goes to the following pattern:

Neglect the alignment bits.
The rest is processed in rows of two, alternating from left to right, column first, from the top.
raven_aligment_bits_and_traversing_order_2

Update #5 (July 12)
As @2012rcampion pointed out, there was a bug in my code which was responsible for the order in which the pixels were read in. I've corrected it, and replaced the zip above.

Update #6 (July 26)
I tried to run the lines (which I got with the method above) through a Reed-Solomon decoder with the standard parameters (n=255, k=223). The bytestream I got is not human readable. Either this is not the next step that should be done, the parameters are not-standard, or I messed up something. I tried it with white cells as 1s and blacks as 0s and vice versa as well.

Update #7 (July 28)
I'm not yet sure how I did it, and the data seems to be dirty, but

just as I guessed in my original post, the raven-* documents are the encoded version of Edgar Allen Poe's The Raven.

I will be back after I manage to come up with a cleaner output.

Update #8 (July 28)
So here's what I've found earlier today. It is still work in progress, but seems like I've managed to identify some parts of the tiles:

raven_bit_clusters

What makes me to think this?

I was parsing the files I linked above as the digitized versions of the raven-none tiles, and playing with different Reed-Solomon parameters and bit offsets. Some parts of the resulting bytestreams started to make sense for the human eye, and 'BANG!' I immediately recognized parts of the poem I first mentioned in this post exactly three months ago.

Decoding the 'actual message' part as bytes and converting them to the corresponding ASCII characters, the raven-* documents hold these messages.
I did not check, but I think the raven-ordered and raven-random messages are some permutations of the characters in raven-none message.

The numbers (191+32+32 bytes) match too nicely not to be some kind of Reed-Solomon codes (my other first impression in my original post), but I didn't yet manage to build those parts together in a way that it makes sense.

As the raven-none tiles don't have any black pixels in the last 32 bytes, I think those bits cannot be a part of the error correcting bits. My first idea was that they are also part of the actual message and have to be used to reorder the characters to get back the original message, but did not manage to get anywhere with this so far.

So our idea about how to traverse the bits is confirmed.
And if you are using my files, you have to invert 0s and 1s as it seems black pixels represent 1s, white pixels represent 0s.

Update #9 (July 29)
A correction again - there was a mistake in the bit cluster picture above.

Update #10 (July 29)
I also figured out what to do with the ordered interleaving.

Rows represent tiles (in order, so tile0000 is in the first row), and the columns are the bytes extracted with the same method I used for raven-none. Newlines and other special characters are replaced by '+'. raven_ordered_message

Another new observation is, that

the tiles in the raven-random image are the very same tiles as the ones in the raven-ordered image, the only difference being their order (ordered tile0000 is random tile0011, ordered tile0001 is random tile0003, and so on).
The name suggests their order is random, but maybe with the help of the interleaving bits they can be somehow sorted, and once they are ordered, the previous method can be used to get the content of the message.

Preparing for the worst, I suppose

the mysterious document's pages are encoded with random interleaving, which means some characters (of the burnt tiles) will be missing every now and then from the message. Also those don't seem to have alphabetical characters, at least that's what their average's uniform gray color suggests. We will see soon.

Update #11 (July 31)
I was working on turning the mysterious pages into bits when I read @TusanHomichi's results, and his suggestion to redigitize the tiles.
I tried to do my best, played a lot with ImageMagick commands, and although it's still not 100%, I managed to

get a new version of the raven-* tiles, which gives an almost perfect bitstream - even before the error correcting is done. Of course I still converted the mysterious document as well. White pixels are decoded as 0 this time. You can find all these files here (zip, ~1.5MB).

Using these, and @TusanHomichi's observations about

the header, I managed to come up with the correct order of 2055 tiles of the mysterious document (txt, ~25.2kB). 'l' and 'r' stands for left and right page, the 4-digit number is the index of the tile as it was given in the zip provided by @2012rcampion. There are some holes in the sequence, but not that much. I estimate the missing number of tiles around 350-360.

Update #12 (July 31)
After these findings, I returned to work on

the message parts of the now ordered tiles. As @TusanHomichi noted earlier, the error correcting codes do not work in most of the cases, I managed to decode only 507 tiles.
So I tried to brute force some parameters, namely I replaced the modulus of 301 with all the possible irreducible 7-degree polynomials over GF(2). None of those helped to decode, not even a single tile.
Thus it seems, the modulus 301 is hardcoded in the encoding procedure, and this makes sense, as it wasn't given in the command line during the raven encodings.

However,

there is another parameter, which was: '-t 16'. I've also noticed that the first and the seventh 6-bit chunk of the header of the raven tiles has the value of 16. For the mysterious document, the seventh is also 16, but the first one is 20. So maybe it was encoded with '-t 20'. But what that t means is still unclear. As another '--t=4' appears in the arguments of the run of rsgentest, I suspect it is worth looking for in that direction.

As a less serious sidenote,

another command line parameter of the encoding was '-s 6x6'. Probably it was for the shape. And what could be the shape parameter for the mysterious document's command?
It definitely has 30 columns and at least 40 rows, that's 2400 tiles on the two pages altogether. We also know, that the pages were upside down, and the last row is full, suggesting, that the number of tiles printed on those two pages is divisible by 30. Furthermore, in the headers we've seen indices in the range between 0 and 2408. So we have at least 2409 tiles, maybe a little bit more, if the ones with the largest indices are among the burnt ones. But it is very unlikely that there are 21 in a row lost, which would complete a full row. So probably there was another page with the last 9 or slightly more tiles in the office.
Haven't you found anything like that, @2012rcampion? Maybe at that time you didn't happen to realize it is the last part of this document, but you still picked it up with all the other papers. Not that we need some more tiles that much, but maybe there is some other important information (a.k.a. hint) on that piece of paper...

Update #13 (August 1)
Further optimizing my digitizing process, I managed to

get the counter in the header for all the tiles provided. I had to do the last 14 manually, but hopefully there aren't any errors. The file I linked yesterday about the correct order of the tiles is now updated. Of course the burnt tiles are missing, so are the ones which did not fit on the pages.
The message parts can be decoded with my current method for about 1700 out of 2182. I will publish those soon.

Update #14 (August 1)
I started publishing

the message bytestreams of the tiles (now ordered according to the counters inside them). You can find all the encoded messages and the succesfully decoded ones here [EDIT: link removed with update #16]. The set of the latter is continuously updated.

Update #15 (August 1)
Here [EDIT: outdated, see update #18] you can find my current best attempt as the target file's reconstruction. Some assumptions I used:

I assumed that the tile index 2408 what we saw is the actual largest, so there are 2409 tiles in total.
Wherever the decoding was successful, I used its outcome. If the decoding failed, I used the corresponding byte of the encoded version. If the whole tile was missing, I used the byte 0xff, as a simple frequency analysis showed there are more black pixels than white ones.

If anyone can open it with something, let me know. I guess it still has too many corrupted bits to be opened.

Update #16 (August 2)
For some tiles the main problem was

a one pixel translation into x or y direction. Identifying these tiles and realigning them before processing allowed me to encode further 174 of them. Now there are 1910 which can be perfectly decoded. The link above to the reconstructed file is updated.
I removed the link from update #14 to the decoded and encoded versions of the tiles, as they are not that helpful. Here is a zip of undecoded tiles instead [EDIT: no more tiles left unsolved].

Update #17 (August 2)
Further finetuning of

parameters in my ImageMagick scripts allowed me to decode all but 5 tiles. The reconstructed file and the 5 problematic tiles are now available on the links above.
These tiles seem to be tricky, probably they need manual corrections, but it's getting late here, and I just cannot focus with my eyes any more today. Maybe someone else is more fresh to do that.
I still don't know what to do with the restored file. Maybe it needs the last 5 tiles. Maybe even with those a good enough reconstruction is impossible.

Update #18 (August 3)
I'm done with

manually correcting the last 5 tiles. I've updated the reconstructed file. My system automatically detects it as a .bz2 file, but still cannot open it. Maybe it has some error correcting structure, or it can be restored with some data restoring software.
It is available in 2 versions: supposing there are 2409 tiles in total, or 2410.

Update 19 (August 3)
I've found some

short description about what kind of error correction is used for .bz2 files, and managed to use the recovering tool mentioned there on the 2409-tile version above (it fails on the other one). This is the result, but I still cannot open it.

Update #20 (August 6)
I realised, that

there is another layer of Reed-Solomon encoding inside the encoded versions of the tiles, with block length of 215 bytes, and message length of 175 bytes. The blocks go across tiles, so it might be possible to restore the missing tiles and thus the missing parts of the file, but so far I've only managed to recover only about half of the blocks. The file remains corrupted.

Update #21 (August 7)
As a second tought, I'm not sure about

the message length being 175 bytes. Do you remember the number 16 in one of the 6-bit chunks in the header? Maybe it means the maximal distance for the error correction, and then the message should be 183 bytes.
Anyway, it seems the only direction going forwards is to decode some more tiles, which weren't provided in the post with the official hints. There aren't too many, but maybe just enough to get inside the maximal distance, and have a successful decoding for all of the blocks.

Update #22 (August 8)
I'm now working on

extracting some more tiles from the original pictures. There are about 116 partial tiles, here are the rotated versions of some which give some hope to be recovered.

Update #23 (August 10)
It seems like

the message length is 175 bytes, at least I manage to decode more parts with this parameter. This means that the 215 byte block can have 20 bytes wrong. Because of the diagonal structure, a part of the bz2 file can be recovered if we have at least 215-20=195 tiles of every consecutively indexed sequence of tiles - that is, the first part needs at least 195 tiles from tiles indexed between 0 and 214, the second part needs at least 195 tiles from tiles indexed between 1 and 215, and so on.
Well, the original tiles.zip does not contain enough tiles to cover all these sequences sufficiently: for example, we have only 185 tiles with indices in the range [165, 379], so I don't see how the corresponding part of the file can be recovered without knowing some more tiles with indices in that range.
So I'm now hunting more tiles. Some more which I managed to decode successfully are the ones marked with blue here: more tiles of the left page more tiles of the right page
The first few were possible to extract without manual work, but the task is getting more and more tricky. It obviously offers only a limited number of tiles to be recovered. Probably not even enough to recover the missing parts of the file.

The method worked very well with the raven-ordered file where some tiles were too badly corrupted themselves, but the diagonal error coding made it possible to recover them. However, maybe there is another direction which we should go for these pages, as we don't have enough tiles for the diagonal decoding. I'll still keep hunting for tiles for a while though.

Update 24 (August 11)
Finally here is the recovered file.

It is a bz2, extracting it you get a tar, extracting that you get 3 pictures and an audio file. I'm still having troubles playing the audio, but the pictures suggest I shouldn't even tell anybody I've ever seen them or heard the message on the audio.

$\endgroup$
41
  • 2
    $\begingroup$ So you're reading the bits in rows of two, alternating from left to right (I can't resist using the word "boustrophedon" here); but what order are you reading each row in (e.g. from the top or bottom; row first or column first)? $\endgroup$ Commented Jul 11, 2016 at 21:50
  • 3
    $\begingroup$ I'm eager to hear more! $\endgroup$
    – LeppyR64
    Commented Jul 28, 2016 at 15:03
  • 2
    $\begingroup$ Nice job on the decoding! $\endgroup$ Commented Jul 28, 2016 at 19:50
  • 2
    $\begingroup$ I'll still write a TL;DR wrap-up and maybe share the code I've used to get to the result - it needs some cleaning first obviously. But I'll leave the original post and the updates here as well, so if anybody is interested about the long process of how we got there. Once again, thanks for everybody who participated with helping comments and answers, and thanks for the amazing puzzle, @2012rcampion! $\endgroup$
    – elias
    Commented Aug 11, 2016 at 18:57
  • 2
    $\begingroup$ Great work! As a previous victim of @2012rcampion incredible puzzles, welcome to the club! $\endgroup$
    – LeppyR64
    Commented Aug 11, 2016 at 19:48
10
$\begingroup$

At least as a start, here's what gfPow.c is doing:

Let's map every number to a polynomial according to its binary representation, e.g. $13 = 1101_2$ is the polynomial $x^3 + x^2 + 1$.

The numbers represent the remainder of the division of $x^n$ by 67 (aka $x^6 + x + 1$) over the finite field GF(2). It might be relevant to mention that CRC32 (which appears in the compilation of rsgentest.c) is defined very similarly.

As for rsgentest.c, I can reproduce the output, but I don't understand what it means:

The entry T(n,k) of the triangle is the remainder of the division of $T(n-1,k-1) + x^n \cdot T(n-1,k)$ by 301 ($x^8+x^5+x^3+x^2+1$) over GF(2).

$\endgroup$
3
  • $\begingroup$ For rsgentest, can you determine what the parameter t=4 is referring to? It appears that the encode uses a paramter of t=16 instead. $\endgroup$
    – LeppyR64
    Commented Jun 29, 2016 at 2:56
  • $\begingroup$ The first part (polynomials in a finite field modulo another polynomial) reminds me of the standard construction of finite fields of non-prime order. $\endgroup$ Commented Jul 27, 2016 at 23:56
  • $\begingroup$ Also; in the second part, if you think of each number as a digit, then your recurrence is equivalent to a multiplication by $1,\ x^n$. $\endgroup$ Commented Jul 28, 2016 at 0:09
10
$\begingroup$

I managed to make progress decoding the raven files. Here is what all but a few bits at the beginning are doing.

elias has the right order in his digitizations, but the 1's and 0's need to be reversed. After that, you have a 380-bit header, then 255 bytes of data. Both have error correcting sections.

The parameters for the error corrections are

It seems to use Reed-Solomon for everything. The raven data sections use a modulus of 301, and 8-bit chunks. The first 223 bytes are data, and the last 32 are error correction. The headers use a modulus of 67, and 6-bit chunks. The first 90 bits (15 words) are data, the next 288 bits (48 words) are error correction, then there are 2 bits as a delimiter. (That's the white spot in the averaged tiles)

For no interleaving

You can safely ignore the header. Correct the data section, and you get a 223 byte block, which is 191 bytes of data, and 32 bytes of error correction. Just concatenate the 191 byte sections to get the poem.

For ordered interleaving

Again, the header doesn't matter. Now however, the 223 byte blocks are laid out a little differently. The first block is laid along a "diagonal". It it the first byte in the first tile, then the second block in the second tile, etc. The second block is offset by one. It is the first byte of the second tile, then the second byte of the third tile, etc.

To do random interleaving

The last 24 bits of the header are a counter, that indicate which tile we have. Once you put the tiles in the order of that counter, you need to undo the ordered interleaving.

What remains:

The first 66 bits of the headers are the same in every tile, and I don't know what they encode. The error correcting codes only let me correct 934 out of 2168 headers on the pages. Since they all have the same first 66 bits, I tried to correct some errors that way, and got up to 1222 potentially correct headers. Either way we probably need to redo the digitization now that we can measure how bad the error is. None of the page tiles can be corrected with a modulus of 301 and 8-bit chunks. Probably the parameters for Reed-Solomon are encoded in the header, somehow.

Update 7/31

The --t parameter is

The number of errors that can be corrected. The raven.txt encodings used t=16, which gives 32 Reed-Solomon bytes. This is the first 6 bits of the header. The pages use t=20, or 40 Reed-Solomon bytes.

Thanks to @elias's new digitizations, I found:

The pages actually do use the same Reed-Solomon parameters as the raven (but with 40 error-correction symbols instead of 32). I have no idea what the remaining 60 bits in the header are for, but I managed to get the first block out of the pages (with some manual correction). Looks like a bzip archive. Even the new digitizations have too many errors for it to go through all the way. I'm looking at trying to digitize each tile separately, adjusting the threshold until I get a valid decoding, but it's taking a very long time to run.

$\endgroup$
6
  • $\begingroup$ Great results! May I ask what program do you use for RSdecoding? The one I was working with does not allow all these ecc parameters to be changed, and crashes from time to time. Thanks! $\endgroup$
    – elias
    Commented Jul 30, 2016 at 5:45
  • 1
    $\begingroup$ I'm using this python Reed-Solomon implementation. As a note, the way he does things, the first root of the generator polynomial is 1, not 2 like the output from rsgen in the problem. You can fix this by setting fcr=1. I think you might have to do a few more tweaks to do 6-bit decoding, but let me know if you have any trouble, and I can see what I've changed. $\endgroup$ Commented Jul 30, 2016 at 11:28
  • $\begingroup$ I redigitized all the tiles, now with a better precision, but cannot edit my post somewhy since yesterday. New versions of bitstreams can be found here. For the raven-none tiles, they are almost perfect. I hope that they are good enough for you on the page-* tiles as well. $\endgroup$
    – elias
    Commented Jul 31, 2016 at 4:30
  • $\begingroup$ And thanks for pointing out the python library. It works really smoothly! $\endgroup$
    – elias
    Commented Jul 31, 2016 at 6:16
  • $\begingroup$ I still cannot edit my post, but if you use the files I posted, 2022 headers can be corrected with the parameters you suggested. Maybe overriding the first 66 bits can still help in some more cases. $\endgroup$
    – elias
    Commented Jul 31, 2016 at 7:12
7
$\begingroup$

Here are some observations I got from overlaying the raven.txt encodings on top of each other.

Interlaced


The data is the same up until a certain point for each tile, so that might help with determining how the data is interlaced/formatted. I haven't counted it yet.

Random



There's a section where the data matches up, which I've outlined in red. Most likely a lucky match with the interlacing, so further indication of the size of the chunks.
Also, the top portions of each tile is the same for both the interlaced and random versions.

This pattern is also in the pages (flipped 180 degrees)

Also, I was curious as to how QR codes were encoded, so I did some looking around on the web and found this:

http://www.swetake.com/qrcode/qr3_en.html . It has things like GF and remainders and all that fancy stuff, so it might relate to @ffao's answer. All of this is new to me, so I'm still trying to figure things out. The link might be useful for others.

Here's an image where the exact alignment bits are marked. I've also marked where the data is chunked (for interleaving).

enter image description here

From this, I believe that the data is encoded in rows of two. Quaternary maybe.

A guess at how we should read the bits

enter image description here

I used the last tile from the interlacing-none encoding to attempt to figure things out. That seemed like the fishiest tile out there. It has a bunch of white space between two chunks of data and the tail end doesn't even reach the end. The data might be encoded like a sandwich.

Or even crazier

http://www.pclviewer.com/rs2/qrpacking.png

$\endgroup$
2
  • $\begingroup$ I suspect you're right. I think if we can understand how those first two lines say raven then we're off :) $\endgroup$
    – LeppyR64
    Commented Jul 7, 2016 at 0:33
  • $\begingroup$ The basic idea seems to be correct for me too. We are finally showing some progress. $\endgroup$
    – elias
    Commented Jul 7, 2016 at 7:53
7
$\begingroup$

"Official" Hints

Based on Wesley's answer, I tried averaging together all the tiles from each encoding:

raven.txt

interleaving=none

interleaving=none mean tile

interleaving=ordered

interleaving=ordered mean tile

interleaving=random

interleaving=random mean tile

Mysterious document

left page/right page

left page mean tile right page mean tile


Since I went ahead and cut out/aligned all the tiles, I thought I'd share my work:

tiles.tar.gzip (65 460 835 bytes) / tiles.zip (66 239 046 bytes)

Each encoding has its own directory with an index.png file that shows the tile numbering I used, for example:

tiles/raven-none/index.png

...And a corresponding list of tile****.png files, which look like this—scaled so that each 'module' or 'bit' is exactly 3×3 pixels, and there is a border of 1 module around the square alignment marks. The alignment marks are each 6×6 modules, and they are 44 modules between them, so each image is 2 × 6 + 44 + 2 = 58 modules, or 174 pixels on a side.

tiles/raven-none/tile0000.png

I hope this helps!


It's come to my attention that the tiles above are too blurry to decode properly. I strangely seem to get much better results when transforming directly to 1:1 (bit:pixel) in one step, instead of aligning first and then downscaling. Here are the blocks I used when decoding:

blocks.tar.gzip (927 112 bytes) / blocks.zip (1 804 760 bytes)

$\endgroup$
15
  • $\begingroup$ What is this? For interleaving=none, is that an overlay of all of the 34 tiles from raven.txt? $\endgroup$
    – LeppyR64
    Commented Jun 28, 2016 at 19:19
  • $\begingroup$ @LeppyR64 Correct; each image is made from the individual tiles from the corresponding block cut out, overlaid, aligned, and averaged. $\endgroup$ Commented Jun 28, 2016 at 19:23
  • $\begingroup$ I'm guessing the mysterios document is a GIF image some other compressed data that has regularly spaced non-random bits. $\endgroup$
    – Jasen
    Commented Jul 8, 2016 at 4:09
  • $\begingroup$ @Jasen What makes you say that? $\endgroup$ Commented Jul 9, 2016 at 3:57
  • $\begingroup$ the uniform gray field is how compressed data will look, that white spot in in it is a '1' ocurring every 2048(?) bits (a gif would have a 255 every 256 bytes, so not a gif) $\endgroup$
    – Jasen
    Commented Jul 9, 2016 at 8:01
5
+50
$\begingroup$

Barcode Structure

Each corner of each tile has an alignment mark. Each alignment mark has two other alignment points to determine the orientation of the barcode.

Using these alignment marks, the codes can be rotated to know which alignment to read them in.

Illustrating alignment points
For a "interleaving=none" encoding only the information within the red lines is for a specific tile. Interleaving might change that, but I'm focusing on none for now.

If anyone wants to see a binary representation of tile000 from the normal interleaving cropped only to it's values, here it is:
enter image description here

1111111111111111111111111111111111111111111111111111111111
1000000111111111111111111111111111111111111111111110000001
1011110111111111111111111111111111111111111111111110111101
1010010111111111111111111111111111111111111111111110100101
1010010111110110011110011101111111101010011111111010100101
1011110100111010001011100010111111011000011111111110111101
1000000110001011011111110110010111000100011111111110000001
1111111101010000110011000110100000100100110111111111111111
1111101000100011001001010001111101010000001010001110011111
1111001001010010011010110011111001100110000100001101101111
1111010111010101110110000010100111011101111100000100001111
1111010001010111110000100001110101000100101010000110011111
1111001000111111111101111110111110111111101111110011001111
1111100100101100010111010011110011011000010100010111111111
1111011101110100010001110111011101110101010001001100111111
1111110110111110100010110000001111001001101010001011111111
1111111001100110111000100110101001101110111011101010111111
1111110100011001010101011001000111000111110101001101001111
1111011011110111011001100110010101110110011001010111011111
1111110110111100101010011011100000111110110100110001101111
1111111011100010001011101110101011101110011011101010111111
1111110011010001010101010000110100010111011111000001001111
1111011001010111011101010111011101000111011101000111011111
1111110100110001100010000011111010101001101111011001101111
1111011100110010101011101010111001101110001001101110111111
1111011111010000010100001111110100011001010010111101101111
1111011101110100011101110111010001100111011101110111011111
1111101111101010100110111110001010011001100000010011111111
1111101011101010011000101110101011101010001001101110011111
1111100000001101100100000000101111000101000101000001001111
1111011101110100010101010100010101110101010001110111011111
1111101000100000100010110000101010111110100010111100101111
1111111000101110011000101010111001110011101001100110011111
1111000101111101010100001101001101111100000111011001011111
1111011101100111011101110111010001000111011011110111011111
1111101111101110100010101010100010101011110110111100101111
1111001011101110101001100110111000101110111011100110001111
1111010100011000110101001011110101011000110011011001011111
1111011101100110010001110111011101110101011101100101011111
1111001111010010101010001010101010000001001111101100101111
1111111001101110101011101110101011100110111011101110111111
1111010111010000110100111101100110010001001111011011111111
1111110010011011100101000000010001100111011101110111011111
1111001011100111110011000111011010011011001100011010001111
1111111000110010100111011110110111010011110101001011101111
1111010101010001000101111111011100000010110000111111011111
1111111111110100000111111100000100101011100111010111111111
1111101000001010111010101110011000100101100011100111011111
1111001111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111101111
1111111111111111111111111111111111111111111111111111111111
1000000111111111111111111111111111111111111111111110000001
1011110111111111111111111111111111111111111111111110111101
1010010101111111111111111111111111111111111111111010100101
1010010111111111111111111111111111111111111111111110100101
1011110111111111111111111111111111111111111111111110111101
1000000111111111111111111111111111111111111111111110000001
1111111111111111111111111111111111111111111111111111111111

Suspicions

1) The two rows between the top two alignment markings are a header. Suspect they are properties from the input parameters (raven.txt). They don't differ between interleavings though, so that information is somewhere else.

2) The four pixels to the right of the bottom alignment mark are suspicious.

Left and Right Pages

The Left and Right pages need to be rotated 180 degrees because of their alignment markings.

$\endgroup$

Not the answer you're looking for? Browse other questions tagged or ask your own question.