Gzip Notes

Gzip is an example of an application of Huffman and Lempel-Ziv encodings. Gzip uses first Lempel-Ziv and then Huffman encoding to transform a file, byte by byte, into a much more compact series of 1's and 0's. Gzip treats a file as a series of bytes to be transformed, regardless of the actual composition of the file.

A simplification of what Gzip does can be described as follows (we are not claiming this is exactly what Gzip does, but it captures the flavor). Gzip begins by breaking a file into 32-kilobyte chunks. Why 32k? Lempel-Ziv, as you recall, represents sequences of bytes found in a file as literals, lengths and positions. But a computer simply stores all information as 1's and 0's, so we must have a way to mark each word as we encode it; then, when we eventually decode it, we know which of these three things it is. A solution is to use the first bit to distinguish a literal-word from a length-word, and to note that the Lempel-Ziv method assures us that if we have just read a length-word, the next word we read must be a position-word. This leaves 15 bits in a 2-byte code word to represent literals and lengths. Since 2^15 = 32k, we are limited to 32k distinct positions with this scheme, hence the 32k chunks.

So we separately encode each 32k chunk of the file with its own codebook using Lempel-Ziv. Note that each L-Z word is two bytes, so if Lempel-Ziv alone does not achieve better than 50% compression, the file will actually grow at this stage. However, since this is only an intermediate form, we don't care about the compression yet; at worst we need 64K bytes to store the intermediate form of the current block being encoded. Now we apply the Huffman scheme to our Lempel-Ziv encoding to achieve further savings. After each segment is encoded, it is scanned to determine the frequency of each word. These frequencies are used to build a Huffman encoding using two separate Huffman trees.
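Before looking at the trees themselves, here is a sketch in Python of the fixed-width intermediate form just described. The names, constants, and exact bit layout are our own invention for illustration, not gzip's actual format: the high bit of each 16-bit word flags a literal (0) or a length (1), and a position word always follows a length word, so it needs no flag.

```python
LITERAL, MATCH = 0, 1  # token kinds (our own labels, for illustration)

def pack_tokens(tokens):
    """Pack LZ tokens into 16-bit words.

    tokens: a list of (LITERAL, byte) or (MATCH, length, position).
    High bit 0 => literal word, 1 => length word; the position word
    that follows a length word carries no flag.
    """
    words = []
    for tok in tokens:
        if tok[0] == LITERAL:
            words.append(tok[1])              # 0xxxxxxx xxxxxxxx
        else:
            _, length, pos = tok
            assert length < (1 << 15) and pos < (1 << 15)  # 15 bits each
            words.append(0x8000 | length)     # 1xxxxxxx xxxxxxxx
            words.append(pos)                 # no flag needed here
    return words

def unpack_words(words):
    """Invert pack_tokens: recover the token stream from the words."""
    tokens, i = [], 0
    while i < len(words):
        w = words[i]
        i += 1
        if w & 0x8000:                        # a length word...
            tokens.append((MATCH, w & 0x7FFF, words[i]))
            i += 1                            # ...so consume its position
        else:
            tokens.append((LITERAL, w))
    return tokens
```

Note how the 15-bit assertion is exactly where the 32k limit comes from: neither a length nor a position can exceed 2^15.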
One tree holds the different positions, and one tree holds the lengths and literals together. Note that the more items in a Huffman tree, the longer the code words, since Huffman trees are binary trees. Therefore it would be nice if we could put each Lempel-Ziv word type in its own tree, shortening each tree and reducing the size of the resulting encoded file. But we cannot; we must store the literals together with either the lengths or the positions. To understand why, consider the decoding process: we decode the Huffman code first, yielding a Lempel-Ziv code, which we then decode again. When we read a new word of Huffman code, we won't know until we decode it what kind of thing it is (literal, length or position), so we must keep the literals in a single tree with either the lengths or the positions; otherwise we won't know which tree to look in. Since positions always follow lengths (or vice versa, had we chosen the opposite convention during Lempel-Ziv), we can put the positions (or lengths) in a second tree, saving space by shortening their codewords. The reason for the particular arrangement we have chosen is that there are most likely more distinct positions than distinct lengths to encode, so we put the positions in their own tree to avoid lengthening the code words too much.

We have not yet addressed a serious problem: how do we do the string matching in Lempel-Ziv? Brute force would give a Theta(n^2) algorithm. We can speed this up significantly with a hash table. Each 3-letter sequence (including overlapping sequences) is hashed into the table, and its position is stored there. Thus each cell of the hash table contains a list of the positions of every triple of characters that hashes to that cell. Upon a successful lookup in the table, we still have to scan forward from each location found to determine the extent of the current match.
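The hash-table matching just described can be sketched as follows (a simplified illustration; the function names are ours, and for brevity the table is built over the whole chunk up front rather than incrementally as a real encoder would):

```python
from collections import defaultdict

def index_triples(data):
    """Map every overlapping 3-byte sequence in data to the (sorted)
    list of positions where it starts."""
    table = defaultdict(list)
    for i in range(len(data) - 2):
        table[data[i:i+3]].append(i)
    return table

def longest_match(data, i, table):
    """Find the longest match at index i against an earlier occurrence
    of the triple data[i:i+3]; return (length, position), or (0, 0)."""
    best_len, best_pos = 0, 0
    for p in table.get(data[i:i+3], ()):
        if p >= i:                 # only earlier positions can match
            break
        # The triple matches; scan forward to find the match's extent.
        n = 0
        while i + n < len(data) and data[p + n] == data[i + n]:
            n += 1
        if n > best_len:
            best_len, best_pos = n, p
    return best_len, best_pos
```

A real encoder would fill the table incrementally as the scan advances, and (as noted below) might truncate each candidate list to bound the search, trading away some matches.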
But this is much better than scanning every old character for each new character examined. It can be sped up further by limiting the scope of the search, though then of course you will miss some matches and reduce the compression achieved.

In summary, Gzip transforms the bytes of a file in 32-kilobyte chunks. Each chunk is first encoded into Lempel-Ziv words. The Lempel-Ziv form (which may actually have grown relative to the original 32 kilobytes at this point) is then encoded using the two Huffman trees. The encoded chunk is written out, and the following chunks are processed in turn until the whole file is done. The chunk-by-chunk approach keeps the intermediate steps (which may be large) from taking too much room. Gzip generally compresses files to 20-50% of their original size.
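Decoding runs the summary above in reverse: Huffman first, then Lempel-Ziv. The sketch below assumes a helper read_symbol(tree) that walks the given Huffman tree over the bit stream and returns the next decoded symbol (or None at end of chunk); that helper, the symbol representation, and the treatment of positions as offsets into the decoded output are our assumptions, not gzip's actual interfaces.

```python
def decode_chunk(read_symbol, lit_len_tree, pos_tree):
    """Decode one chunk. Literals and lengths share a tree, so each
    decoded symbol tells us which kind it is; only after seeing a
    length do we know to read the second (position) tree next."""
    out = bytearray()
    while True:
        sym = read_symbol(lit_len_tree)
        if sym is None:                  # end of chunk
            return bytes(out)
        kind, value = sym
        if kind == "literal":
            out.append(value)
        else:                            # a length: a position must follow
            pos = read_symbol(pos_tree)
            for _ in range(value):       # copy match (may overlap itself)
                out.append(out[pos])
                pos += 1
```

This is exactly the ambiguity argument from earlier in operational form: if literals had their own tree, the first read_symbol call would have no way of knowing which tree to walk.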