The popular .bz2
compression format is used to store all sorts of things. One of the more interesting features is that it consists of several independently decompressible blocks, allowing recovery of partial segments if the file is damaged. The bzip2recover
program, included with bzip2
, allows recovery of blocks from a bzip2
archive. In this challenge, you will implement a simple clone of bzip2recover
.
The basic idea is to scan a bitstream for a particular magic number, and output blocks delimited by this magic number to separate files. Read on for gory details.
Technically, bzip2 files are a bitstream, with every group of 8 bits being stuffed into a byte. The first bit is placed in the most significant bit of a byte, with the eighth bit going in the least significant bit of a byte. Files begin with the header "BZh", and a digit "N" from 1 to 9 indicating the compression level (9 is the most common). Then the bitstream begins.
The bitstream is split up into blocks. Each block encodes N*100KB of uncompressed text (e.g. 900KB of text for compression level 9). Blocks can begin basically anywhere in the bitstream, and do not necessarily begin on 8-bit boundaries. Blocks begin with a magic number, which is set to the 48-bit sequence 0x314159265359 (in ASCII, "1AY&SY"). For example, the sequence of bytes "73 14 15 92 65 35 90" contains the magic number, offset by 4 bits within the byte.
Your goal will be to take a bzip2 file on stdin, and output each detected BZip2 block to a separate decompressible file. Specifically, your program does the following:
- Read and skip the bzip2 header ("BZh" + N)
- Scan the bitstream for the magic sequence
- Copy the bitstream between two consecutive magic sequences to a new file, with a new BZip2 header and block header. Note that this may require realigning the bitstream to start at a byte boundary, and padding the last byte.
You don't have to care about parsing the blocks, and you can assume that a magic sequence always starts a block (i.e. that it isn't in the middle of a block).
How you name the output files is up to you. The only requirement is that they be clearly sequential somehow (so, e.g. "a b c ... z aa ab" or "1 2 3 4 5 .. 9 10 11 12" are both acceptable).
For a test case, you may use this file from the IPSC 2014 problem-solving contest. The file is a bzip2 file with four blocks. The file as a whole actually decompresses to another bzip2 file with 162 complete blocks. Therefore, you may use these block counts to validate that your solution is getting all of the blocks. (Note that the file actually decompresses even further into more bzip2 files, but you probably don't want to decompress it all the way).
This is code golf, with a twist: your score is the number of bytes in the bz2-compressed version of your source code. (I'm aware this may make submissions larger, but these are the rules!). Since bzip2 compressors may differ in performance, provide a base64-encoded version of the compressed file to prove it's size. (You may therefore be able to obtain a better score by optimizing your program for bzip2, though I don't necessarily expect anyone to do this). Standard loopholes apply.