
Wednesday, 23 October 2019

Compression: LZ78

Compression fascinates me. The value added is obvious: we use less storage and transfer resources without losing any information. We pay for this with the time it takes to compress and decompress. The algorithms have two obvious performance metrics for a given input: compression rate and run-time. I am most interested in optimizing the compression rate, though one shouldn't forget the importance of speed for practical use.

No Free Lunch

No compression algorithm can compress all possible input files without loss. We can call this the "No Free Lunch" theorem of compression. It is very easy to prove. A compression algorithm maps an input file to an encoded file. For lossless compression, this mapping must be invertible, so two input files can never map to the same encoded file. Now count: there are 2^n distinct files of n bits, but only 2^n - 1 files of fewer than n bits, so no invertible mapping can shrink every file. For any compressor, some input must stay the same size or grow.

For this reason, all compression must rely on identifying a subset of files that we are actually interested in compressing, and then writing an algorithm that maps those files to smaller encodings, at the price of making files outside the subset larger. There is a trade-off between generality and performance at work here. In the extreme, if all we ever want is to store or transmit two particular files, a single bit, 0 or 1, suffices.

Another way to see it is to think about the set of files that we want to target for compression, and ask whether their naive representation, by which I mean plain text, is redundant. Some examples:

  • If we want to compress source code files, we can observe that names of variables and functions are long strings that describe their purpose and identity. This is what we want for human readability, but it is not a necessary representation. The names could just be random, compact strings. Since variable names in source code are used at least twice (if they are both assigned and used), it is probably worth it to store a dictionary with full names and short names.
  • If we want to compress large matrices with a lot of zeroes, we should use a sparse matrix representation. The simplest representation stores a list of (row, column, value) triples, so it only saves space if less than about one third of the matrix entries are non-zero. Otherwise, this representation makes the matrix file bigger! (A sketch follows after this list.)
  • In video compression, we can save a lot of space if only small parts of the scene change between frames. In that case, we only need to send a small diff image. If the camera is moving, this is not possible to the same degree, and the same goes for noise. Noise thus has a twofold negative impact on video: it reduces the quality and makes the file larger! Reflections and blinking lights in the image also make compression harder. In fact, one way to detect problems with a security camera is to watch for a sudden drop in its compression rate.
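
As a minimal sketch of the sparse matrix idea (the function name and layout here are mine, not any particular library's):

def to_sparse(dense):
    # keep only the non-zero entries as (row, column, value) triples
    return [(i, j, v)
            for i, row in enumerate(dense)
            for j, v in enumerate(row)
            if v != 0]

# 3 numbers per stored entry versus 1 per entry in the dense form,
# hence the break-even point at roughly 1/3 non-zero density
to_sparse([[0, 0, 5],
           [0, 0, 0],
           [7, 0, 0]])   # -> [(0, 2, 5), (2, 0, 7)]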

Lempel-Ziv 1978

I found a very good book which deals with compression of data strings. It's the PhD thesis of Ross N. Williams, Adaptive Data Compression (link leads to download) [1]. The first chapter of the book provides a thorough introduction to the field of data compression. Starting on page 52, Williams describes the LZ78 algorithm.

[Figure from page 54 of Williams's Adaptive Data Compression [1]]
The idea is to incrementally build a tree of phrases. Every path from the root node represents a phrase that has been encountered in the data. Every edge holds a symbol. Phrases that have the same initial strings can reuse the same nodes. A node is identified by its ID, which we can assume is a natural number. The tree is transmitted by sending child->parent edges together with the corresponding symbol. The child always uses the lowest available number for its ID, so we only need to send the parent ID and the symbol. As explained on page 54 in [1], we don't even need to send the symbol, since it can be inferred to be the first symbol in the next phrase to be sent.
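
As a small illustrative example (mine, not from the thesis): encoding "abab" starts at the root; 'a' is unknown there, so we emit (0, a) and create node 1; 'b' is also unknown, so we emit (0, b) and create node 2; then "ab" matches down to node 1 before failing on 'b', so we emit (1, b) and create node 3. The transmitted pairs are (0, a), (0, b), (1, b).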

Python Implementation

The encoding is super easy. The phrase tree G can be represented with a dict of dicts where the keys of the top dict are sequence IDs, the keys of the sub dicts are symbols, and the values of the sub dicts are sequence IDs.
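
A minimal sketch of such an encoder, yielding (sequence ID, symbol) pairs and leaving file I/O aside:

def encode(symbols):
    G = {0: {}}        # phrase tree: sequence ID -> {symbol: child ID}
    next_id = 1        # the lowest available sequence ID
    seq = 0            # node reached by the phrase matched so far (0 = root)
    for sym in symbols:
        seq_old = seq
        if sym in G[seq]:
            seq = G[seq][sym]          # known phrase: walk down the tree
        else:
            yield seq_old, sym         # new phrase: emit parent ID + symbol
            G[seq_old][sym] = next_id  # grow the tree with a new leaf
            G[next_id] = {}
            next_id += 1
            seq = 0                    # restart matching at the root
    if seq != 0:
        # the input ended inside a known phrase: re-emit the edge into
        # the node we stopped at, i.e. a non-leaf
        yield seq_old, sym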

Note that we need seq_old for the case where the input ends in the middle of a match: the final pair we emit then points at a node that already exists, a non-leaf. The decoding is almost as easy:
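
A matching sketch of the decoder, assuming it is handed a flat iterator of alternating sequence IDs and symbols:

def decode(tokens):
    phrases = {0: ""}    # sequence ID -> the phrase that node represents
    next_id = 1
    it = iter(tokens)
    out = []
    while True:
        try:
            parent = next(it)
        except StopIteration:
            break                        # clean end of the token stream
        sym = next(it)
        phrase = phrases[parent] + sym   # parent's phrase plus one symbol
        out.append(phrase)
        phrases[next_id] = phrase        # mirror the encoder's tree growth
        next_id += 1
    return "".join(out)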

Note that if we hit the end (StopIteration) there is no work left half-done: the encoder only ever emits complete (ID, symbol) pairs.
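
For example, running the sketches above on an illustrative input (not one of the original test files):

pairs = list(encode("abracadabra"))
# -> [(0, 'a'), (0, 'b'), (0, 'r'), (1, 'c'), (1, 'd'), (1, 'b'), (3, 'a')]

tokens = [t for pair in pairs for t in pair]   # flatten to the token stream
decode(tokens)
# -> 'abracadabra'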

Testing correctness is easy! Just throw in some large text files and check that each file is identical after decompression, e.g. with os.system("diff --brief {} {}".format(fn, fn_reconstructed)); without --brief, diff prints every difference to the screen, which may be a lot for large files. Debugging is straightforward: edit a failing test file by trial and error down to the shortest possible file that still triggers the error. Many bugs can be found with files that are only a handful of symbols long.
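
Before involving files at all, the sketches can be checked in memory on the short cases that tend to expose bugs:

def round_trip(text):
    tokens = [t for pair in encode(text) for t in pair]
    return decode(tokens) == text

assert round_trip("")       # empty input
assert round_trip("abab")   # input ending exactly at a new phrase
assert round_trip("aba")    # input ending mid-match (the non-leaf case)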

The encoded output of this very simple algorithm looks like this:

[screenshot of the encoded output]

The original is:

[screenshot of the original text]
Which is considerably shorter. So in the beginning, while the phrase tree is still very small, the overhead of the compression is definitely larger than the space saved by abbreviating matching strings to their sequence ID.

Testing on a 1.1 MB file (Nils Holgersson, a Swedish novel) gives (sizes in bytes):

Before:  1146989
After:   1176260

Well darn it, the "compressed" file is even bigger than the plain text.

Optimization and Testing

It's really wasteful to use integer IDs written out in plaintext decimal. The IDs can be strings over any characters. To begin with, we can use the digits, all letters, lower and upper case, and some punctuation. Just raiding the keyboard gives me 89 characters that can be used. Note that it is very important to only pick characters that are represented with one byte. The punctuation I used was:

".:,;!%&/()=@${[]}^~'*<>|-`\"

And the full alphabet:

"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.:,;!%&/()=@${[]}^~'*<>|-`\"

With this new encoding, the compressed file looks like:

[screenshot of the re-encoded output]
Which definitely looks more like mumbo-jumbo than before. Which is good! The more random the encoded file looks, the less redundancy is left to squeeze out. With this, the result is:

Before:  1146989
After:    830805

Yes! We compressed the file by about 28%. Is that good? The classic Unix tool compress manages to get the file down to:

original   1146989
this        830805
compress    417624

So the built-in tool's output is roughly half the size of ours! What compression algorithm does compress use? A quick lookup tells us that it uses LZW, which just so happens to be quite a simple optimization of LZ78. The optimization is that the final symbol needn't be transmitted, but can be inferred from the next sequence ID. We can easily calculate how big that would make the file with our implementation:

original   1146989
this        830805
LZW opt     657987
compress    417624
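
The estimate is straightforward, since the optimization saves exactly one symbol per emitted pair (assuming one byte per symbol); the difference, 830805 - 657987 = 172818, then corresponds to the number of phrases our encoder emitted. In sketch form:

n_phrases = sum(1 for _ in encode(text))     # pairs emitted by our encoder
lzw_estimate = compressed_size - n_phrases   # one symbol byte saved per pair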

So even with the LZW optimization, we're not close to the built-in program. It's possible that further optimizations are used, but a lot of the difference is explained by the fact that compress works at the pure binary level. The encoded file from compress looks like:

[screenshot of compress's binary output]
[1] has the following to say about LZW:

[quotation about LZW from [1]]
References
[1] Williams, Ross N. Adaptive Data Compression. Vol. 110. Springer Science & Business Media, 2012.
