
I am trying to compress short strings (max 15 characters).

The goal is to implement the "Normalized Compression Distance" [1]. I tried a few compression algorithms in Python (I also looked to see if I could do it in Julia, but the packages all refuse to install), and I always end up with a bit string that is longer than the original string I am trying to compress, which totally defeats the purpose.

An example with zlib:

import zlib

data = b"this is a test"  # 14 bytes

compressed_data = zlib.compress(data, 9)  # level 9: best compression
print(len(data))
print(len(compressed_data))

Which returns:

14
20

Do you know what I am doing wrong, or how I could do this more efficiently?

[1]: R. Cilibrasi and P. M. B. Vitányi, "Clustering by Compression", https://arxiv.org/pdf/cs/0312044.pdf

  • A compressed string needs to store some metadata that allows the string to be decompressed; for a short string, this metadata uses more space than is saved by compressing it in the first place. (A quick demonstration follows these comments.)
    – chepner, May 17, 2019 at 15:19
  • This is true for general compression methods, which don't make any a priori assumptions about the data being compressed. A different method that, for example, assumes the input is restricted to lowercase ASCII and space could skip the metadata, because that's hardcoded in the decompression algorithm itself.
    – chepner, May 17, 2019 at 15:22
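
chepner's point is easy to check directly: compressing an empty input exposes zlib's fixed framing cost. A minimal sketch, assuming CPython's bundled zlib (exact byte counts could differ on exotic builds):

import zlib

# The framing never goes away, however short the input:
# 2-byte zlib header + a minimal DEFLATE block + 4-byte Adler-32 trailer.
print(len(zlib.compress(b"", 9)))   # 8 bytes of pure overhead
print(len(zlib.compress(b"a", 9)))  # 9 bytes: the same overhead plus one literal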

3 Answers


Check out these libraries for compressing short strings:

https://github.com/siara-cc/unishox:

Unishox is a hybrid encoder (entropy, dictionary, and delta coding). It assigns fixed prefix-free codes to each letter of the 95-character printable ASCII set (entropy coding), encodes repeating letter sequences separately (dictionary coding), and uses delta coding for Unicode (UTF-8) characters. It also has special handling for repeated upper-case and num-pad characters.

Unishox was developed to save memory in embedded devices and to compress strings stored in databases. It is used in many projects and has an extension for the SQLite database. Although it is slower than other available libraries, it works well for those applications.

https://github.com/antirez/smaz:

Smaz was developed by Salvatore Sanfilippo. It compresses strings by replacing fragments of them according to a fixed codebook; as far as I know, it was the first library available for compressing short strings.
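
To illustrate the codebook principle (a toy sketch only: the fragment table below is invented for the example and is not smaz's real codebook or API):

# Toy codebook compressor in the spirit of smaz: common fragments map to
# single bytes >= 0x80, everything else passes through as a literal.
CODEBOOK = {b"this": 0x80, b" is ": 0x81, b"the ": 0x82, b"test": 0x83}
DECODE = {v: k for k, v in CODEBOOK.items()}

def toy_compress(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        for frag, code in CODEBOOK.items():
            if data.startswith(frag, i):
                out.append(code)
                i += len(frag)
                break
        else:
            out.append(data[i])  # literal byte (assumes ASCII input < 0x80)
            i += 1
    return bytes(out)

def toy_decompress(blob: bytes) -> bytes:
    return b"".join(DECODE.get(b, bytes([b])) for b in blob)

data = b"this is a test"
assert toy_decompress(toy_compress(data)) == data
print(len(data), len(toy_compress(data)))  # 14 vs 5

Because the codebook is fixed and shared with the decompressor, no metadata travels with the compressed string, which is exactly the cost a general-purpose compressor cannot avoid.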

https://github.com/Ed-von-Schleck/shoco:

shoco was written by Christian Schramm. It is an entropy encoder: the length of a character's representation is determined by the probability of encountering it in a given input string.

It has a default model for the English language and a provision to train new models on given sample text.
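
The principle is easy to demonstrate with ordinary Huffman coding (a sketch of the idea only, not shoco's actual model or output format; the sample text is invented):

import heapq
from collections import Counter

def huffman_codes(sample: str) -> dict:
    # Build fixed prefix-free codes from character frequencies in a sample;
    # frequent characters get short codes, rare ones get long codes.
    heap = [[freq, [ch, ""]] for ch, freq in sorted(Counter(sample).items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        for pair in lo[1:]:
            pair[1] = "0" + pair[1]
        for pair in hi[1:]:
            pair[1] = "1" + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return {ch: code for ch, code in heap[0][1:]}

# Model built once from sample text; both sides hardcode it, so no metadata.
codes = huffman_codes("this is a sample of ordinary english text")
bits = "".join(codes[ch] for ch in "this is a test")  # KeyError for unseen chars
print(len(bits), "bits vs", 8 * len("this is a test"), "bits uncompressed")

Because the model is hardcoded on both sides, as in shoco, no per-string metadata is needed; the trade-off is that characters the model has never seen need an escape mechanism (here they would simply raise a KeyError).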

PS: Unishox was developed by me; its working principle is explained in an accompanying article.

  • Do you think zstd would be worth mentioning in this? – Commented Feb 15, 2020 at 22:13
  • @OscarSmith zstd is still not meant for short strings such as the OP is trying to compress. zstd can pre-build a dictionary from a given training set so that subsequent small pieces of data compress better, but when I tried that, it reported that the number of samples was too low. – Commented Feb 16, 2020 at 16:25

According to your reference, the extra overhead added by zlib may not matter. That article defines the NCD as (C(x*y) − min(C(x),C(y))) / max(C(x),C(y)), where x*y denotes the concatenation of x and y. Using your zlib compression for C:

C(x) = length(zlib.compress(x, 9))

NCD(x,y) = (C(x*y) − min(C(x),C(y))) / max(C(x),C(y))

As long as zlib only adds a constant overhead, the numerator of the NCD does not change at all: the constant appears once in C(x*y) and once in min(C(x),C(y)), so it cancels. The denominator only changes by a small amount. You could add a correction factor like this:

C(x) = length(zlib.compress(x, 9)) - length(zlib.compress("a", 9)) + 1

which might eliminate the remaining issues with the denominator of NCD.
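
In Python, the corrected C and the NCD might look like this (a sketch; the function names are mine, and the correction assumes zlib's overhead really is constant):

import zlib

OVERHEAD = len(zlib.compress(b"a", 9)) - 1  # framing cost of a 1-byte input

def C(x: bytes) -> int:
    # Compressed length with the (roughly) constant zlib framing removed.
    return len(zlib.compress(x, 9)) - OVERHEAD

def ncd(x: bytes, y: bytes) -> float:
    # x*y in the paper is concatenation, so C(x*y) becomes C(x + y) here.
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

print(ncd(b"this is a test", b"this is a toast"))  # smaller: similar strings
print(ncd(b"this is a test", b"zqxwvut jklmnop"))  # larger: unrelated strings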


The DEFLATE algorithm deduplicates data against a 32 KB sliding window, which acts as its compression dictionary. By default that window contains only the data you have already fed in, so with a short string there is almost nothing to refer back to: little or no deduplication happens, and the metadata overhead is what makes the compressed result larger than the input.

One solution is to use a preset dictionary containing samples of the recurring patterns you expect. This question deals with the same issue: Reusing compression dictionary
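
In Python, preset dictionaries are exposed as the zdict argument of zlib.compressobj and zlib.decompressobj (available since Python 3.3). A sketch, with a dictionary invented for the example:

import zlib

# A preset dictionary: concatenated samples of the patterns you expect.
# This sample text is invented for the example.
ZDICT = b"this is a test this is a sample these are tests"

def compress_with_dict(data: bytes) -> bytes:
    # Negative wbits selects a raw DEFLATE stream, which also drops the
    # fixed zlib header/checksum framing.
    co = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS, zdict=ZDICT)
    return co.compress(data) + co.flush()

def decompress_with_dict(blob: bytes) -> bytes:
    do = zlib.decompressobj(-zlib.MAX_WBITS, zdict=ZDICT)
    return do.decompress(blob) + do.flush()

data = b"this is a test"
blob = compress_with_dict(data)
assert decompress_with_dict(blob) == data
print(len(data), len(blob))  # the string matches ZDICT, so only a few bytes remain

The catch is that the decompressor must use exactly the same dictionary, so the dictionary effectively becomes part of your protocol.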

You can use my dicflate utility to experiment with DEFLATE compression on short and long strings, with and without preset dictionaries.
