0

So my obsession with deflate compression is leading me no where but i feel like there is something i can do the process better.

Here is what i understood so far:

def lz77(uncompressed):
    #find repeated characters and replace it with references
    #output will be like this m0,0 a0,0 l0,0 3,1
    return lz77 encoded data
def huffmanencode(lz_data):
    #encode data
    #put block type identifier and the beginning of every block
    #Put huffman encoded data
    #put end of block or new block identifier

Now lets say the original uncompressed input is:

I am considered a loser in this society, entrapped evilness inside of me now you gimmick you are not what i try to be. Drug deals in back alleys like a demented game of hide and seek Toast an average rape on a bad day, i haven't even begun to ride my peek But moving into adult hood my future is bleak.

And then we deflate it with normal or default parameters of 7zip or Winrar.

Then instead of knowing the whole uncompressed input i know:

I am considered a loser in this society, entrapped evilness inside of me

But i need to obtain few bytes of original compressed data using the part i know.

Here is my assumption regardless of the actual facts:

Since the portion of uncompressed data is at the very beginning then i have a shot against the first step of deflate compression which LZ77. So the output LZ77(original uncompressed) and LZ77(known portion) will share most significant bits. The hardest part if not impossible is Huffman encoding output which is always different in test cases but what if i knew the approximate length of the Huffman tree or the Huffman code used reconstruct the tree upon decompression, Then i can brute force it, if it's less than 4 bytes conclusion: Since the output of LZ77 is identical at the beginning then gaining knowledge of Huffman tree would definitely allow me to output the identical most significant bits.

I have been looking at Pyflate trying to understand extraction of information Huffman encoding that was using when compressing but it's unclear or i misunderstood the code.

Is it possible to get identical most significant bits using only first portion of original input data?

My goal is to utilize this:

MIME-Version: 1.0
Date: Sun, 18 Feb 2024 15:02:54 +0200
From:

And get most significant bits of this:

MIME-Version: 1.0
Date: Sun, 18 Feb 2024 15:02:54 +0200
From: Bluebird <[email protected]>
Subject: Your monthly statement is available
Thread-Topic: Your monthly statement is available
Message-ID: <[email protected]>
To: "[email protected]" <[email protected]

If not possible the i will need to know the size of the Huffman code or tree for the above.

1
  • Your repeated use of "most significant" in front of "bits" makes no sense. Either you have the bits or you don't. None of them are more significant than others.
    – Mark Adler
    Commented Feb 22 at 16:01

1 Answer 1

0

No.

The two different headers will represent two completely different codes. Even knowing the lengths of the headers and knowing the literal and length/distance LZ77 representation of the data, the bits for those two after the headers will be different because the codes are different.

3
  • I couldn't agree more, the block header covers less than 3 bytes followed by actual compressed data and the block headers affects the byte representation of actual compressed data right, then given the headers, can't one reconstruct the huffman tree or codes and use that to compress the portion known then return data in form of ([headers][actual compressed portion]) which will most likely be identical number of bytes to the target, in theory anyway Commented Feb 23 at 6:49
  • since headers are less than 3 bytes, one could brute force them and gain the 80% identical bytes Commented Feb 23 at 6:57
  • I can't make any sense of your words, but in any case, a dynamic block header is typically around 80 bytes in length.
    – Mark Adler
    Commented Feb 23 at 7:55

Not the answer you're looking for? Browse other questions tagged or ask your own question.