No Free Lunch
No compression algorithm can compress all possible input files without loss. We can call this the "No Free Lunch" theorem of compression. It is very easy to prove. A compression algorithm maps an input file to an encoded file. For lossless compression this mapping must be invertible, so two different input files can never map to the same encoded file. The compressor therefore maps the space of input files one-to-one onto itself, and the average size of the encoded files must equal the average size of the input files: if some files get shorter, other files must get longer.
*If the encoding is to be invertible for all binary files, we must map the space of input files to itself, and therefore the average file size remains the same after compression.*
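A tiny sanity check of the counting behind this: there are more files of length exactly n bits than there are shorter files to map them to.

```python
# Pigeonhole check: 2**n files of length n bits, but only 2**n - 1 files
# of length strictly less than n (including the empty file).
n = 16
files_of_length_n = 2 ** n
shorter_files = sum(2 ** k for k in range(n))  # lengths 0 .. n-1
assert shorter_files == files_of_length_n - 1
# So an invertible encoder cannot shrink every length-n file: there is one
# short output too few, and two inputs would have to share the same encoding.
```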
Another way to see it is to think about the set of files we actually want to target for compression, and ask whether their naive representation is redundant. By naive representation I mean plain text. Some examples:
- If we want to compress source code files, we can observe that the names of variables and functions are long strings that describe their purpose and identity. This is what we want for human readability, but it is not a necessary representation; the names could just as well be compact, random strings. Since a variable name in source code is used at least twice (once where it is assigned and once where it is used), it is probably worth storing a dictionary mapping the full names to short ones.
- If we want to compress large matrices with a lot of zeroes, we should use a sparse matrix representation. The simplest representations store a list of (row, column, value) triples, so each non-zero entry costs roughly three numbers instead of one. That gives a positive compression rate only if less than about one third of the entries are non-zero; otherwise, this representation makes the matrix file bigger! (See the sketch after this list.)
- In video compression, we can save a lot of space if only small parts of the scene change between exposures. In that case, we only need to send a small diff image. If the camera is moving, this is not possible to the same degree, and the same goes if there is a lot of noise. Noise then has a twofold negative impact on the video: it reduces the quality and makes the file larger! Reflections or blinking lights in the image also make compression harder. In fact, one way to detect problems with security cameras is to watch for a sudden drop in their compression rate.
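A minimal sketch of the sparse-matrix case above (COO-style triples; the function name and the toy matrix are my own illustration):

```python
def to_coo(dense):
    """Store only the non-zero entries of a dense matrix as (row, column, value) triples."""
    return [(i, j, v)
            for i, row in enumerate(dense)
            for j, v in enumerate(row)
            if v != 0]

dense = [[0, 0, 3],
         [0, 5, 0],
         [0, 0, 0]]
print(to_coo(dense))  # [(0, 2, 3), (1, 1, 5)]
# 9 dense values versus 2 * 3 = 6 stored numbers: a win here, but once more than
# roughly a third of the entries are non-zero, the triples outgrow the dense form.
```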
Lempel Ziv 1978
I found a very good book which deals with compression of data strings. It's the PhD thesis of Ross N. Williams, Adaptive Data Compression (link leads to download) [1]. The first chapter of the book provides a thorough introduction to the field of data compression. Starting on page 52, Williams describes the LZ78 algorithm.
*From page 54 of Williams's Adaptive Data Compression [1].*
Python Implementation
The encoding is super easy. The phrase tree G can be represented with a dict of dicts where the keys of the top dict are sequence IDs, the keys of the sub dicts are symbols, and the values of the sub dicts are sequence IDs.
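A minimal sketch of an encoder along these lines, with the phrase tree G as the dict of dicts just described (the exact structure is a sketch, not a definitive listing):

```python
def lz78_encode(symbols):
    """LZ78 encoder sketch: returns a list of (sequence ID, symbol) tokens.
    The last token may be a bare sequence ID if the input ends mid-phrase."""
    G = {0: {}}      # phrase tree: sequence ID -> {symbol: child sequence ID}
    next_id = 1
    tokens = []
    seq_old = 0      # ID of the phrase matched so far; 0 is the empty phrase (the root)
    for sym in symbols:
        seq = G[seq_old].get(sym)
        if seq is None:
            tokens.append((seq_old, sym))  # longest known phrase + the symbol that broke the match
            G[seq_old][sym] = next_id      # grow the tree: the new phrase becomes a leaf
            G[next_id] = {}
            next_id += 1
            seq_old = 0                    # start matching the next phrase from the root
        else:
            seq_old = seq                  # the extended phrase is already in the tree
    if seq_old != 0:
        tokens.append((seq_old,))          # input ended mid-phrase: emit a bare, non-leaf ID
    return tokens
```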
Note that we need seq_old at the end, in case the input stops before the final phrase completes a new match; we then emit that (non-leaf) node's ID on its own. The decoding is almost as easy:
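A matching decoder sketch, under the same assumptions:

```python
def lz78_decode(tokens):
    """Decoder sketch matching lz78_encode above: rebuilds the phrase table on the fly."""
    phrases = {0: ""}   # sequence ID -> the phrase it stands for
    next_id = 1
    out = []
    it = iter(tokens)
    while True:
        try:
            token = next(it)
        except StopIteration:
            break                          # clean end of stream, nothing half-decoded
        if len(token) == 1:                # trailing bare ID: the non-leaf case above
            out.append(phrases[token[0]])
        else:
            seq, sym = token
            phrase = phrases[seq] + sym
            out.append(phrase)
            phrases[next_id] = phrase      # mirror the encoder's tree growth
            next_id += 1
    return "".join(out)
```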
Note that if we hit the end of the stream (StopIteration), there is no work left half-done: the encoder only emits full matches.
Testing correctness is easy! Just throw in some large text files and check that each file is identical after a compression/decompression round trip, e.g. os.system("diff {} {}".format(fn, fn_reconstructed)). Warning: diff prints the differences to screen if the files aren't identical, which can be a lot of output for large files; add the --brief flag to only report whether they differ. Debugging is straightforward: edit a test file by trial and error down to the shortest possible file that still triggers the error. Many bugs can be found with files that are only a handful of symbols long.
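A sketch of such a round-trip test, reusing the encode/decode sketches above (the file handling here is illustrative):

```python
import os

def roundtrip_ok(fn):
    """Encode and decode a text file with the sketches above, then diff against the original."""
    with open(fn, encoding="utf-8") as f:
        text = f.read()
    fn_reconstructed = fn + ".roundtrip"
    with open(fn_reconstructed, "w", encoding="utf-8") as f:
        f.write(lz78_decode(lz78_encode(text)))
    # --brief: only report *whether* the files differ, not every differing line.
    return os.system("diff --brief {} {}".format(fn, fn_reconstructed)) == 0
```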
The encoded output of this very simple algorithm looks like this:
The original is:
Which is considerably shorter. So in the beginning, while the phrase tree is still very small, the overhead of the compression is definitely larger than the space saved by abbreviating matching strings to their sequence ID.
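To make the overhead concrete, running the sketch encoder above on a short string of my own choosing:

```python
print(lz78_encode("abracadabra"))
# [(0, 'a'), (0, 'b'), (0, 'r'), (1, 'c'), (1, 'd'), (1, 'b'), (3, 'a')]
# Written out in plain text ("0a0b0r1c1d1b3a") that is 14 characters for an
# 11-character input: the tree is still too small to pay for the token overhead.
```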
Testing on a file (Nils Holgersson, a Swedish novel) which is 1.1MB gives (in bytes):
Before: 1146989
After: 1176260
Well darnit, the "compressed" file is actually bigger than the plain-text original.
Optimization and Testing
It's really wasteful to write the integer IDs in plaintext decimal. The IDs can be strings over any alphabet of characters. To begin with, we can use the digits, all letters (lower and upper case), and some punctuation. Just raiding the keyboard gives me 89 characters that can be used. Note that it is very important to only pick characters that are represented with one byte. The punctuation I used was:
".:,;!%&/()=@${[]}^~'*<>|-`\"
And the full alphabet:
"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.:,;!%&/()=@${[]}^~'*<>|-`\"
With this new encoding the compressed file looks like:
Which definitely looks more like mumbo-jumbo than before. That is good! The more random the encoded file looks, the better the available byte values are being used, and the less redundancy is left over. With this, the result is:
Before: 1146989
After: 830805
Yes! We compressed the file by about 25%. Is that good? Linux's built-in zip tool, Unix compress, manages to get the file down to:
original  1146989
this      830805
LZW opt   657987
zip       417624
So even with the LZW optimization (LZW seeds the dictionary with every single symbol up front, so tokens only need to carry a sequence ID and no explicit symbol), we're not close to the built-in program. It's possible that more optimizations are used, but a lot of the difference is explained by the fact that compress works on a pure binary level. The encoded file for compress looks like:
[1] has the following to say about LZW:
[1] Williams, Ross N. Adaptive data compression. Vol. 110. Springer Science & Business Media, 2012.