I think understanding basic data compression and pattern finding / entropy reduction primitives is really important. Mostly the basic algorithms (before the many tweaks and optimizations are made) are really elegant and inspiring. What's even more impressive is how widespread they are, but perhaps also how little they are really understood except outside of people working in compression directly. So I want to understand.
I understand and have implemented the naive implementations of LZ77 (just find the maximal prefix that existed before the current factor) and LZ78 (just see if we can extend a factor that occurred before at least twice, or we make a new record and start a new factor). I also understand the naive version of Burrows-Wheeler Transform (maintain the invariant that the first column is sorted, and keep adding that column and resorting until the matrix is square).
I also understand why all these (parts of) compression algorithms work:
- LZ77 asymptotically optimal since in the long term and with an infinite window all factors that exist are composed of prefixes earlier in the input which LZ77 can find.
- LZ78 is also pretty good since if a factor occurs at least once before it can be compressed the second time, so long as it is not a suffix of another factor found earlier.
- BWT, by grouping same letters on the left hand side (of the matrix), and finding patterns in the preceding letters in the right hand side, can exploit bigram repetitions, recursively.
The BWT code below works, but the inverse is not efficient. What are the optimizations, and how can I understand why they work?
BWT
def rot(v):
return v[-1] + v[:-1]
def bwt_matrix(s):
return sorted(reduce(lambda m,s : m + [rot(m[-1])],xrange(len(s)-1),[s]))
def last_column(m):
return ''.join(zip(*m)[-1])
def bwt(s):
bwtm = bwt_matrix(s)
print 'BWT matrix : ', bwtm
return bwtm.index(s), ''.join(last_column(bwtm))
def transpose(m):
return [''.join(i) for i in zip(*m)]
def ibwt(s):
return reduce(lambda m, s: transpose(sorted(transpose([s]+m))),[s]*len(s),[])
s = 'sixtysix'
index, bwts = bwt(s)
print 'BWT, index : ', bwts, index
print 'iBWT : ', ibwt(bwts)
BWT output
BWT matrix :
ixsixtys
ixtysixs
sixsixty
sixtysix
tysixsix
xsixtysi
xtysixsi
ysixsixt
BWT, index : ssyxxiit 3
iBWT :
ixsixtys
ixtysixs
sixsixty
sixtysix
tysixsix
xsixtysi
xtysixsi
ysixsixt
Which is correct.
LZ77
def lz77(s):
lens = len(s)
factors = []
i = 0
while i < lens:
max_match_len = 0
current_word = i
j = 0
while j < lens-1:
if current_word >= lens or s[current_word] != s[j]:
j -= (current_word - i)
if current_word - i >= max_match_len:
max_match_len = current_word - i
if current_word >= lens:
break
current_word = i
else:
current_word += 1
j += 1
if max_match_len > 1:
factors.append(s[i:i+max_match_len])
i += max_match_len
else:
factors.append(s[i])
i += 1
return ','.join(factors)
LZ78
def lz78(s):
empty = ''
factors = []
word = empty
for c in s:
word += c
if word not in factors:
factors.append(word)
word = empty
if word is not empty:
factors.append(word)
return ','.join(factors)
Correct outputs against 7th Fibonacci word:
LZ77( abaababaabaab ): a,b,a,aba,baaba,ab
LZ78( abaababaabaab ): a,b,aa,ba,baa,baab
rot
v[-1]
should bev[-1:]
(ii) inlz78
there's a reference to an undefined variableempty
. \$\endgroup\$v[-1]
it only works for strings: tryrot(range(3))
for example. Withv[-1:]
it works for all sequences. (Perhaps "bug" was a bit strong.) \$\endgroup\$