16
\$\begingroup\$

I think understanding basic data compression and pattern finding / entropy reduction primitives is really important. Mostly the basic algorithms (before the many tweaks and optimizations are made) are really elegant and inspiring. What's even more impressive is how widespread they are, but perhaps also how little they are really understood except outside of people working in compression directly. So I want to understand.

I understand and have implemented the naive implementations of LZ77 (just find the maximal prefix that existed before the current factor) and LZ78 (just see if we can extend a factor that occurred before at least twice, or we make a new record and start a new factor). I also understand the naive version of Burrows-Wheeler Transform (maintain the invariant that the first column is sorted, and keep adding that column and resorting until the matrix is square).

I also understand why all these (parts of) compression algorithms work:

  1. LZ77 asymptotically optimal since in the long term and with an infinite window all factors that exist are composed of prefixes earlier in the input which LZ77 can find.
  2. LZ78 is also pretty good since if a factor occurs at least once before it can be compressed the second time, so long as it is not a suffix of another factor found earlier.
  3. BWT, by grouping same letters on the left hand side (of the matrix), and finding patterns in the preceding letters in the right hand side, can exploit bigram repetitions, recursively.

The BWT code below works, but the inverse is not efficient. What are the optimizations, and how can I understand why they work?

BWT

def rot(v):
        return v[-1] + v[:-1]

def bwt_matrix(s):
        return sorted(reduce(lambda m,s : m + [rot(m[-1])],xrange(len(s)-1),[s]))

def last_column(m):
        return ''.join(zip(*m)[-1])

def bwt(s):
        bwtm = bwt_matrix(s)
        print 'BWT matrix : ', bwtm
        return bwtm.index(s), ''.join(last_column(bwtm))

def transpose(m):
        return [''.join(i) for i in zip(*m)]

def ibwt(s):
        return reduce(lambda m, s: transpose(sorted(transpose([s]+m))),[s]*len(s),[])

s = 'sixtysix'
index, bwts = bwt(s)
print 'BWT, index : ', bwts, index
print 'iBWT : ', ibwt(bwts)

BWT output

BWT matrix :
ixsixtys
ixtysixs
sixsixty
sixtysix
tysixsix
xsixtysi
xtysixsi
ysixsixt
BWT, index :  ssyxxiit 3
iBWT : 
ixsixtys
ixtysixs
sixsixty
sixtysix
tysixsix
xsixtysi
xtysixsi
ysixsixt

Which is correct.

LZ77

def lz77(s):
        lens = len(s)
        factors = []
        i = 0
        while i < lens:
                max_match_len = 0
                current_word = i
                j = 0
                while j < lens-1:
                        if current_word >= lens or s[current_word] != s[j]:
                                j -= (current_word - i)
                                if current_word - i >= max_match_len:
                                        max_match_len = current_word - i
                                if current_word >= lens:
                                        break
                                current_word = i
                        else:
                                current_word += 1
                        j += 1
                if max_match_len > 1:
                        factors.append(s[i:i+max_match_len])
                        i += max_match_len
                else:
                        factors.append(s[i])
                        i += 1
        return ','.join(factors)

LZ78

def lz78(s):
        empty = ''
        factors = []
        word = empty
        for c in s:
                word += c
                if word not in factors:
                        factors.append(word)
                        word = empty
        if word is not empty:
                factors.append(word)
        return ','.join(factors)

Correct outputs against 7th Fibonacci word:

LZ77( abaababaabaab ): a,b,a,aba,baaba,ab
LZ78( abaababaabaab ): a,b,aa,ba,baa,baab
\$\endgroup\$
4
  • \$\begingroup\$ A couple of bugs: (i) in rot v[-1] should be v[-1:] (ii) in lz78 there's a reference to an undefined variable empty. \$\endgroup\$ Commented Jan 30, 2013 at 14:17
  • \$\begingroup\$ Hey @GarethRees. Thanks! But if I make v[-1]+v[:-1] v[-1:]+v[:-1] I will get rot('abcde') = 'e'+'abcd' which is the same as v[-1] ? Please explain this. \$\endgroup\$ Commented Jan 30, 2013 at 14:25
  • \$\begingroup\$ With v[-1] it only works for strings: try rot(range(3)) for example. With v[-1:] it works for all sequences. (Perhaps "bug" was a bit strong.) \$\endgroup\$ Commented Jan 30, 2013 at 14:28
  • \$\begingroup\$ No no you are right. I thought 'maybe he means that, but that would be extremely forward thinking of him considering bwt is mostly applied to strings' but you are right...I even have a use for applying it to sequences! \$\endgroup\$ Commented Jan 30, 2013 at 14:40

1 Answer 1

23
+100
\$\begingroup\$

1. Introduction

The only actual question you asked is about how to make the inverse Burrows–Wheeler transform go faster. But as you'll see below, that's plenty for one question here on Code Review. If you have anything to ask about your LZ77 and LZ78 implementations, I suggest you start a new question.

2. A bug

First, the output of the function ibwt doesn't match the example output in the question:

>>> from pprint import pprint
>>> pprint(ibwt('ssyxxiit'))
['iisstxxy',
 'xxiiysts',
 'stxxsiyi',
 'iystixsx',
 'xsiyxtis',
 'tixssyxi',
 'yxtiissx',
 'ssyxxiit']

You should always prepare your question by copying and pasting input and output from source files and from an interactive session, rather than typing in what you think it should be. It's easy to introduce or cover up mistakes if you do the latter.

So it looks as if there is a missing transpose:

>>> pprint(transpose(ibwt('ssyxxiit')))
['ixsixtys',
 'ixtysixs',
 'sixsixty',
 'sixtysix',
 'tysixsix',
 'xsixtysi',
 'xtysixsi',
 'ysixsixt']

3. Avoiding transpositions

With that bug fixed, let's see how long the function takes on a medium-sized amount of data:

>>> import random, string
>>> from timeit import timeit
>>> s = ''.join(random.choice(string.printable) for _ in range(1024))
>>> t = bwt(s)
>>> timeit(lambda:ibwt(t[1]), number=1)
49.83987808227539

Not so good! What can we change? Well, an obvious idea is to cut out all those calls to transpose. The function ibwt transposes the matrix in order to sort it, and then transposes it back again each time round the loop. By keeping the matrix in transposed form throughout, it would be possible to avoid having to call transpose at all:

def ibwt2(s):
    """Return the inverse Burrows-Wheeler Transform matrix based on the
    string s.

    >>> ibwt2('SSYXXIIT')[3]
    'SIXTYSIX'

    """
    matrix = [''] * len(s)
    for _ in s:
        matrix = sorted(i + j for i, j in zip(s, matrix))
    return matrix

That simple change results in a big improvement:

>>> s = ''.join(random.choice(string.printable) for _ in range(1024))
>>> t = bwt(s)
>>> timeit(lambda:ibwt2(t[1]), number=1)
0.9414288997650146

4. Aside on programming style

Something that's perhaps worth mentioning at this point is that you seem quite keen on transforming code so as to use a functional style (for example in this question, where you wanted to express a loop using reduce). Transforming code into pure functional form is an excellent intellectual exercise, but you have to be aware of two problems that often arise:

  1. Pure functional programs can easily end up doing a lot of allocation and copying. For example, it is often tempting to transform your data into a format suitable for passing in to a function (rather than transforming your function so that it operates on the data you have). Thus in ibwt it is convenient to call sorted on the data in row-major order, but convenient to append a new instance of the string s in column-major order. In order to express both of these operations in the most convenient way, the function has to transpose the matrix back and forth, and since it do so functionally, each transposition has to copy the whole matrix.

  2. Pursuit of pure functional style often results in tacit or point-free programming where variable names are avoided and data is routed from one function to another using combinators. This can seem very elegant, but variable names have two roles in a program: they don't just move data around, they also act as mnemonics to remind programmers of what the data means. Point-free programs can end up being totally mysterious because you lose track of exactly what is being operated on.

Anyway, aside over. Can we make any more progress on the inverse Burrows–Wheeler transform?

5. An efficient implementation

If you look at what the inverse Burrows–Wheeler transform does, you'll see that it repeatedly permutes an array of strings (by sorting them). The permutation is exactly the same at each step (exercise: prove this!). For example, if we are given the string SSYXXIIT, then at each step we apply the permutation SSYXXIITIISSTXXY, which (taking note of where repeated letters go) is the permutation 0123456756017342.

This observation allows us to reconstruct a row of the matrix without having to compute the whole matrix. In this example we want to reconstruct row number 3. Well, we know that the final matrix looks like this:

I.......
I.......
S.......
S*......
T.......
X.......
X.......
Y.......

so the first element of the row is S. What's the next letter of this row (marked with *)? Where did this come from? Well, we know that the last step of the transform must have looked like this:

S0......         I5......
S1......         I6......
Y2......    ->   S0......
X3......    ->   S1......
X4......    ->   T7......
I5......    ->   X3......
I6......         X4......
T7......         Y2......

So the * must have come from row 1 (which is the image of 3 under the permutation), and so it must be I (which is element number 1 of the sorted string IISSTXXY).

We can follow this approach all the way along the row. Here's an implementation in Python:

def ibwt3(k, s):
    """Return the kth row of the inverse Burrows-Wheeler Transform
    matrix based on the string s.

    >>> ibwt3(3, 'SSYXXIIT')
    'SIXTYSIX'

    """
    def row(k):
        permutation = sorted((t, i) for i, t in enumerate(s))
        for _ in s:
            t, k = permutation[k]
            yield t
    return ''.join(row(k))

Let's check that works:

>>> ibwt3(3, 'SSYXXIIT')
'SIXTYSIX'
>>> s = ''.join(random.choice(string.printable) for _ in range(1024))
>>> t = bwt(s)
>>> ibwt3(*t) == s
True

and is acceptably fast:

>>> timeit(lambda:ibwt3(*t), number=1)
0.0013821125030517578

(If you compare my ibwt3 to the sample Python implementation on Wikipedia you'll see that mine is not only much simpler, but also copes with arbitrary characters, not just bytes in the range 0–255. So either I've missed something important, or the Wikipedia article could be improved.)

\$\endgroup\$
0

Not the answer you're looking for? Browse other questions tagged or ask your own question.