Gzip Notes

Gzip is an example of an application of Huffman and Lempel-Ziv encodings. Gzip uses first Lempel-Ziv and then Huffman encoding to transform a file, byte by byte, into a much more compact series of 1's and 0's. Gzip treats a file as a series of bytes to be transformed, regardless of the actual composition of the file.

A simplification of what Gzip does can be described as follows (we are not claiming this is exactly what Gzip does, but it captures the flavor). Gzip begins by breaking a file into 32-kilobyte chunks. Why 32k? Lempel-Ziv, as you recall, represents sequences of bytes found in a file as literals, lengths and positions. But a computer simply stores all information as 1's and 0's, so we must have a way to mark each word as we encode it; then, when we eventually decode it, we know which of these three things it is. A solution is to use the first bit to distinguish a literal-word from a length-word, and to note that the Lempel-Ziv method assures us that if we have just read a length-word, the next word we read must be a position-word. This leaves 15 bits in a 2-byte code word to represent literals and lengths. Since 2^15 = 32k, we are limited to 32k distinct positions with this scheme, hence the 32k chunks.

So we separately encode each 32k chunk of the file with its own codebook using Lempel-Ziv. Note that each L-Z word is two bytes, so if Lempel-Ziv alone does not achieve better than 50% compression, the file will actually grow at this stage. However, since this is only an intermediate form, we don't care about the compression yet; at worst we need 64K bytes to store the intermediate form of the current block being encoded. Now we apply the Huffman scheme to our Lempel-Ziv encoding to achieve further savings. After each segment is encoded, it is scanned to determine the frequency of each word. These frequencies are used to build a Huffman encoding using two separate Huffman trees.
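Before looking at the trees themselves, here is a sketch in Python of the fixed-width intermediate form just described. The names, constants, and exact bit layout are our own invention for illustration, not gzip's actual format: the high bit of each 16-bit word flags a literal (0) or a length (1), and a position word always follows a length word, so it needs no flag.

```python
LITERAL, MATCH = 0, 1  # token kinds (our own labels, for illustration)

def pack_tokens(tokens):
    """Pack LZ tokens into 16-bit words.

    tokens: a list of (LITERAL, byte) or (MATCH, length, position).
    High bit 0 => literal word, 1 => length word; the position word
    that follows a length word carries no flag.
    """
    words = []
    for tok in tokens:
        if tok[0] == LITERAL:
            words.append(tok[1])              # 0xxxxxxx xxxxxxxx
        else:
            _, length, pos = tok
            assert length < (1 << 15) and pos < (1 << 15)  # 15 bits each
            words.append(0x8000 | length)     # 1xxxxxxx xxxxxxxx
            words.append(pos)                 # no flag needed here
    return words

def unpack_words(words):
    """Invert pack_tokens: recover the token stream from the words."""
    tokens, i = [], 0
    while i < len(words):
        w = words[i]
        i += 1
        if w & 0x8000:                        # a length word...
            tokens.append((MATCH, w & 0x7FFF, words[i]))
            i += 1                            # ...so consume its position
        else:
            tokens.append((LITERAL, w))
    return tokens
```

Note how the 15-bit assertion is exactly where the 32k limit comes from: neither a length nor a position can exceed 2^15.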
One tree holds the different positions, and one tree holds the lengths and literals together. Note that the more items in a Huffman tree, the longer the code words, since Huffman trees are binary trees. Therefore it would be nice if we could put each Lempel-Ziv word type in its own tree, shortening each tree and reducing the size of the resulting encoded file. But we cannot; we must store the literals together with either the lengths or the positions. To understand why, consider the decoding process: we decode the Huffman code first, yielding a Lempel-Ziv code, which we then decode again. When we read a new word of Huffman code, we won't know until we decode it what kind of thing it is (literal, length or position), so we must keep the literals in a single tree with either the lengths or the positions; otherwise we won't know which tree to look in. Since positions always follow lengths (or vice versa, had we chosen the opposite convention during Lempel-Ziv), we can put the positions (or lengths) in a second tree, saving space by shortening their codewords. The reason for the particular arrangement we have chosen is that there are most likely more distinct positions than distinct lengths to encode, so we put the positions in their own tree to avoid lengthening the code words too much.

We have not yet addressed a serious problem: how do we do the string matching in Lempel-Ziv? Brute force would give a Theta(n^2) algorithm. We can speed this up significantly with a hash table. Each 3-letter sequence (including overlapping sequences) is hashed into the table, and its position is stored there. Thus each cell of the hash table contains a list of the positions of every triple of characters that hashes to that cell. Upon a successful lookup in the table, we still have to scan forward from each location found to determine the extent of the current match.
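The hash-table matching just described can be sketched as follows (a simplified illustration; the function names are ours, and for brevity the table is built over the whole chunk up front rather than incrementally as a real encoder would):

```python
from collections import defaultdict

def index_triples(data):
    """Map every overlapping 3-byte sequence in data to the (sorted)
    list of positions where it starts."""
    table = defaultdict(list)
    for i in range(len(data) - 2):
        table[data[i:i+3]].append(i)
    return table

def longest_match(data, i, table):
    """Find the longest match at index i against an earlier occurrence
    of the triple data[i:i+3]; return (length, position), or (0, 0)."""
    best_len, best_pos = 0, 0
    for p in table.get(data[i:i+3], ()):
        if p >= i:                 # only earlier positions can match
            break
        # The triple matches; scan forward to find the match's extent.
        n = 0
        while i + n < len(data) and data[p + n] == data[i + n]:
            n += 1
        if n > best_len:
            best_len, best_pos = n, p
    return best_len, best_pos
```

A real encoder would fill the table incrementally as the scan advances, and (as noted below) might truncate each candidate list to bound the search, trading away some matches.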
But this is much better than scanning every old character for each new character examined. It can be sped up further by limiting the scope of the search, though then of course you will miss some matches and reduce the compression achieved.

In summary, Gzip transforms the bytes of a file in 32-kilobyte chunks. Each chunk is first encoded into Lempel-Ziv words. The Lempel-Ziv form (which may actually have grown relative to the original 32 kilobytes at this point) is then encoded using the two Huffman trees. The encoded chunk is written out, and the following chunks are processed in turn until the whole file is done. The chunk-by-chunk approach keeps the intermediate steps (which may be large) from taking too much room. Gzip generally compresses files to 20-50% of their original size.
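Decoding runs the summary above in reverse: Huffman first, then Lempel-Ziv. The sketch below assumes a helper read_symbol(tree) that walks the given Huffman tree over the bit stream and returns the next decoded symbol (or None at end of chunk); that helper, the symbol representation, and the treatment of positions as offsets into the decoded output are our assumptions, not gzip's actual interfaces.

```python
def decode_chunk(read_symbol, lit_len_tree, pos_tree):
    """Decode one chunk. Literals and lengths share a tree, so each
    decoded symbol tells us which kind it is; only after seeing a
    length do we know to read the second (position) tree next."""
    out = bytearray()
    while True:
        sym = read_symbol(lit_len_tree)
        if sym is None:                  # end of chunk
            return bytes(out)
        kind, value = sym
        if kind == "literal":
            out.append(value)
        else:                            # a length: a position must follow
            pos = read_symbol(pos_tree)
            for _ in range(value):       # copy match (may overlap itself)
                out.append(out[pos])
                pos += 1
```

This is exactly the ambiguity argument from earlier in operational form: if literals had their own tree, the first read_symbol call would have no way of knowing which tree to walk.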