
Thursday, 24 October 2019

Compression: Word bias in LZ78

Previous:
In the previous post, I implemented a simple (but powerful, and often used) compression algorithm in Python. Here are the last 20 matches it makes at the end of a 6.1 MB file (big.txt - link leads to download [1]). I removed 0.4 MB from the end of big.txt, so that it ends at the end of "War and Peace".

nd to r
ecognize a
 motion w
e did n
ot feel;
 in the prese
nt case
 it is sim
ilarly n
ecessary to ren
ounce a
 freedom th
at does n
ot exist
, and to re
cognize a
dependence of
which we a
re not co
nscious.

To me, it seems wasteful that there are so many nonsense words in the phrase tree. The nonsense words appear when we start encoding in the middle of a word. My intuition is that language consists of words with meaning, and if the encoding preserves whole words better, we should get longer matches and therefore better compression.
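
For reference, a plain LZ78 encoder of the kind described in the previous post can be sketched roughly like this (a sketch only, not the exact code used here). Each emitted pair is a phrase-tree index plus one literal character, and reading continues right after the match, which is why so many phrases start mid-word:

def lz78_encode(text):
    """Minimal LZ78 sketch: emit (phrase index, next char) pairs and grow
    the phrase tree (here a flat dict of phrases) by one node per pair."""
    tree = {"": 0}                  # phrase -> index
    output = []
    i = 0
    while i < len(text):
        j = i
        # Extend the match while the prefix is already a known phrase.
        while j < len(text) and text[i:j + 1] in tree:
            j += 1
        match = text[i:j]
        next_char = text[j] if j < len(text) else ""
        output.append((tree[match], next_char))
        tree.setdefault(match + next_char, len(tree))   # new phrase-tree node
        i = j + 1                   # continue right after the match, even mid-word
    return output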

Restarting encoding at the beginning of words

I implemented an encoder and decoder that, after each emitted sequence, restart reading just after the last non-alphabetic character of that sequence, as long as the sequence contains one. The phrase tree is still extended from the longest possible match. The same stretch of text now produces these matches (a rough sketch of the restart rule follows them):

we did
did not fe
feel;
 in the pre
present ca
case it is s
similarly n
necessary to ren
renounce a
a freed
freedom tha
that does
does not exi
exist, an
and to recog
recognize a d
dependence of
of which w
we are not co
conscious.
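
A rough sketch of this restart rule, building on the plain LZ78 sketch above (again, not the exact implementation behind the numbers below):

def lz78_word_encode(text):
    """Word-biased LZ78 sketch: the phrase tree is still grown from the
    full match, but reading restarts just after the last non-alphabetic
    character of the emitted sequence, so the next match starts at a
    word boundary whenever possible."""
    tree = {"": 0}                  # phrase -> index
    output = []
    i = 0
    while i < len(text):
        j = i
        while j < len(text) and text[i:j + 1] in tree:
            j += 1
        match = text[i:j]
        next_char = text[j] if j < len(text) else ""
        emitted = match + next_char
        output.append((tree[match], next_char))
        tree.setdefault(emitted, len(tree))   # tree extended from the full match
        # Restart just after the last non-alphabetic character, if any;
        # otherwise continue right after the emitted sequence as before.
        boundary = max((k + 1 for k, c in enumerate(emitted) if not c.isalpha()),
                       default=len(emitted))
        i += boundary
    return output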

The average match length is clearly longer; however, this by itself does not improve the compression rate. The obvious waste is that most of the text is transmitted twice, because consecutive sequences overlap. If we do not transmit the full match when emitting, but instead let its end be implied by the beginning of the next sequence, we can do much better (a sketch of this trimming follows the results):

method                                    size (bytes)
original                                       6331174
LZ78 (plaintext)                               4209692
LZW (plaintext)                                3357669
LZ78 + word backtrack                          4580235
Word backtrack + implicit initial              3301512
unix compress                                  2321803

So we are able to beat plaintext LZW by a small margin, but we are still far worse than the binary representation (unix compress), despite using a more data-specific model.
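
To make the overlap removal concrete: everything after the last word boundary of an emitted sequence is exactly the text the next sequence restarts with, so that tail can be left out and the decoder recovers the text by plain concatenation. A rough sketch of this trimming, operating on the plaintext sequences shown above (the representation actually measured in the table may differ in detail):

def trim_implied_tails(sequences):
    """Drop the tail of each sequence after its last word boundary;
    the following sequence restarts with that same text, so the tail
    is implied and need not be transmitted."""
    trimmed = []
    for k, seq in enumerate(sequences):
        if k == len(sequences) - 1:
            trimmed.append(seq)     # keep the final sequence whole
            continue
        boundary = max((i + 1 for i, c in enumerate(seq) if not c.isalpha()),
                       default=len(seq))
        trimmed.append(seq[:boundary])
    return trimmed

# For example, ["we did", "did not fe", "feel;"] becomes
# ["we ", "did not ", "feel;"], which joins back to "we did not feel;".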

References
[1] norvig.com
