
Thursday, 24 October 2019

Compression: Word bias in LZ78

Previous:
In the previous post, I implemented a simple (but powerful, and often used) compression algorithm in Python. Here are the last 20 matches it makes at the end of a 6.1 MB file (big.txt - link leads to download [1]). I removed 0.4 MB from the end of big.txt, so that it ends at the end of "War and Peace".

nd to r
ecognize a
 motion w
e did n
ot feel;
 in the prese
nt case
 it is sim
ilarly n
ecessary to ren
ounce a
 freedom th
at does n
ot exist
, and to re
cognize a
dependence of
which we a
re not co
nscious.

To me, it seems wasteful that there are so many nonsense words in the phrase tree. The nonsense words appear when we start encoding in the middle of a word. My intuition is that language consists of words with meaning, and if the encoding preserves whole words better, we should get longer matches and therefore better compression.
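
For reference, a plain LZ78 encoder of the kind described in the previous post can be sketched roughly like this (a sketch only, not the exact code used here). Each emitted pair is a phrase-tree index plus one literal character, and reading continues right after the match, which is why so many phrases start mid-word:

def lz78_encode(text):
    """Minimal LZ78 sketch: emit (phrase index, next char) pairs and grow
    the phrase tree (here a flat dict of phrases) by one node per pair."""
    tree = {"": 0}                  # phrase -> index
    output = []
    i = 0
    while i < len(text):
        j = i
        # Extend the match while the prefix is already a known phrase.
        while j < len(text) and text[i:j + 1] in tree:
            j += 1
        match = text[i:j]
        next_char = text[j] if j < len(text) else ""
        output.append((tree[match], next_char))
        tree.setdefault(match + next_char, len(tree))   # new phrase-tree node
        i = j + 1                   # continue right after the match, even mid-word
    return output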

Restarting encoding at the beginning of words

I implemented an encoder and decoder that, after each emitted sequence, restart reading just after the last non-alphabetic character of that sequence, as long as the sequence contains one. The phrase tree is still extended from the longest possible match. The same stretch of text now produces these matches (a rough sketch of the restart rule follows them):

we did
did not fe
feel;
 in the pre
present ca
case it is s
similarly n
necessary to ren
renounce a
a freed
freedom tha
that does
does not exi
exist, an
and to recog
recognize a d
dependence of
of which w
we are not co
conscious.
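
A rough sketch of this restart rule, building on the plain LZ78 sketch above (again, not the exact implementation behind the numbers below):

def lz78_word_encode(text):
    """Word-biased LZ78 sketch: the phrase tree is still grown from the
    full match, but reading restarts just after the last non-alphabetic
    character of the emitted sequence, so the next match starts at a
    word boundary whenever possible."""
    tree = {"": 0}                  # phrase -> index
    output = []
    i = 0
    while i < len(text):
        j = i
        while j < len(text) and text[i:j + 1] in tree:
            j += 1
        match = text[i:j]
        next_char = text[j] if j < len(text) else ""
        emitted = match + next_char
        output.append((tree[match], next_char))
        tree.setdefault(emitted, len(tree))   # tree extended from the full match
        # Restart just after the last non-alphabetic character, if any;
        # otherwise continue right after the emitted sequence as before.
        boundary = max((k + 1 for k, c in enumerate(emitted) if not c.isalpha()),
                       default=len(emitted))
        i += boundary
    return output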

The average match length is clearly longer; however, this by itself does not improve the compression rate. The obvious waste is that most of the text is transmitted twice, because consecutive sequences overlap. If we do not transmit the full match when emitting, but instead let its end be implied by the beginning of the next sequence, we can do much better (a sketch of this trimming follows the results):

method                                    size (bytes)
original                                       6331174
LZ78 (plaintext)                               4209692
LZW (plaintext)                                3357669
LZ78 + word backtrack                          4580235
Word backtrack + implicit initial              3301512
unix compress                                  2321803

So we are able to beat plaintext LZW by a small margin, but we are still far worse than the binary representation (unix compress), despite using a more data-specific model.
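
To make the overlap removal concrete: everything after the last word boundary of an emitted sequence is exactly the text the next sequence restarts with, so that tail can be left out and the decoder recovers the text by plain concatenation. A rough sketch of this trimming, operating on the plaintext sequences shown above (the representation actually measured in the table may differ in detail):

def trim_implied_tails(sequences):
    """Drop the tail of each sequence after its last word boundary;
    the following sequence restarts with that same text, so the tail
    is implied and need not be transmitted."""
    trimmed = []
    for k, seq in enumerate(sequences):
        if k == len(sequences) - 1:
            trimmed.append(seq)     # keep the final sequence whole
            continue
        boundary = max((i + 1 for i, c in enumerate(seq) if not c.isalpha()),
                       default=len(seq))
        trimmed.append(seq[:boundary])
    return trimmed

# For example, ["we did", "did not fe", "feel;"] becomes
# ["we ", "did not ", "feel;"], which joins back to "we did not feel;".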

References
[1] norvig.com
