ECS 60 Homework 4: Operation Miraq Victory
Announcement.
May 19, 00:35. I have fixed a subtle bug in my reference program. Please download the latest version.
Due: Friday, May 29 at 2200 hours.
Hand out: my reference program
Hand in:
Makefile and all the necessary program files.
When I type "make" with no command line argument, your Makefile should create an executable named huffman. Your program should compile on CSIF Linux machines and have the same input/output behavior as my reference program.
Note: This is a group homework. You should work in groups of two. Only one person from each group shall hand in all the files.
On the first two lines of the Makefile, write each team member's email username at ucdavis.edu, last name, and first name (one person per line) as comments.
For example:
# bsimpson; Simpson, Bart
# mburns; Burns, Montgomery
If you fail to follow this specification, you will lose points.
Description
In the last episode of the Miraq trilogy, the Mamerican forces
invaded liberated Miraq in pursuit of oil freedom. However, terrorist attacks spread across Miraq like wildfire in Santa Barbara. You, as the Commanding General of the Mamerican forces and their Miraqi puppets allies, just discovered the book, which describes efficient algorithms for terminating insurgents, that would ensure victory in Miraq. You wish to send this book to all the troops via the Internet ASAP. However, due to the astronomical deficit of the Mamerican government, you wish to compress the book to reduce your ISP charges.
Use Huffman coding for compression/decompression.
- The input to compression is a sequence of 8-bit characters.
- When computing the Huffman tree, do not compute the code for any character that does not exist in the input. Do not insert these characters into the min-heap.
- Create a dummy character whose frequency is 0 and whose "ASCII" value is 256. Its purpose will become clear later. You need to compute the code for this dummy character, even though it does not occur in the input.
- To ensure consistent behavior between your program and mine, during the delete operation on the min-heap, you need to determine the priority of subtrees that have the same weight. Let S and T be two subtrees. Define: S has a higher priority than T if and only if:
- S's weight is smaller than T's weight, or
- S and T have the same weight, and the smallest character (in ASCII value) among S's leaf nodes is smaller than the smallest character among T's leaf nodes. Again, consider the ASCII value of the dummy character to be 256.
Under this definition, the delete operation should remove the subtree with the highest priority from the min-heap. Also when merging two subtrees, set the tree with the lower priority as the left subtree, and the tree with the higher priority as the right subtree.
You may NOT use STL classes except the string class.
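The priority rule and the merge rule above can be sketched as follows. This is only an illustration, not code from my reference program: the Node struct, its field names, and the functions higherPriority and merge are all hypothetical names chosen for this example. It assumes each subtree caches the smallest character value found among its leaves.

```cpp
// Hypothetical node layout; the assignment does not prescribe one.
struct Node {
    unsigned long weight;  // sum of the frequencies of all leaves below this node
    int minChar;           // smallest character value among the leaves (dummy = 256)
    Node* left;
    Node* right;
};

// Returns true if s has a higher priority than t:
// smaller weight wins; on equal weight, the smaller minimum character wins.
bool higherPriority(const Node* s, const Node* t) {
    if (s->weight != t->weight)
        return s->weight < t->weight;
    return s->minChar < t->minChar;
}

// Merges two subtrees that were removed from the min-heap.
// The lower-priority tree becomes the left child and the
// higher-priority tree becomes the right child.
Node* merge(Node* a, Node* b) {
    Node* hi = higherPriority(a, b) ? a : b;
    Node* lo = (hi == a) ? b : a;
    Node* parent = new Node;
    parent->weight = a->weight + b->weight;
    parent->minChar = (a->minChar < b->minChar) ? a->minChar : b->minChar;
    parent->left = lo;
    parent->right = hi;
    return parent;
}
```

Because two distinct subtrees never share a leaf, their minChar values always differ, so the comparison never ends in a tie.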
Command line: Your program accepts an optional command line argument "-d":
- When this argument is absent, compress the input and write the compressed data to the output.
- When this argument is present, decompress the input and write the decompressed data to the output.
Read input from cin and write output to cout.
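One possible way to detect the "-d" argument is sketched below. The function name wantsDecompress is hypothetical, and the sketch ignores any other options (such as "-v").

```cpp
#include <cstring>

// Returns true when "-d" appears among the command line arguments.
// argv[0] is the program name, so the scan starts at index 1.
bool wantsDecompress(int argc, char* argv[]) {
    for (int i = 1; i < argc; ++i) {
        if (std::strcmp(argv[i], "-d") == 0)
            return true;
    }
    return false;
}
```

In main, the result would select between the compression and decompression routines, both of which read from cin and write to cout.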
Uncompressed data: The uncompressed data contains a sequence of 8-bit characters. The input contains at most 2^32 - 1 characters.
Compressed data: The compressed data contains three sections:
- Magic cookie. This section contains 8 characters: the string "HUFFMAN" followed by the ASCII 0 character (\0).
- Frequencies. This section contains the frequencies of all the characters from ASCII 0 to ASCII 255, even if a character is absent from the uncompressed data. The frequency of a character is its count in the uncompressed data. Order the frequencies by the ASCII values of their corresponding characters. Each frequency is represented by a 4-byte unsigned integer in the compressed data. Do NOT print the frequency of the dummy character (since it is always 0).
- Compressed data. This section contains the codes of all the characters in the same order as they appear in the uncompressed data. Additionally, append the code of the dummy character after the code of the last input character. Since this section contains a sequence of bits but the smallest unit of data in a file is a byte, you need to convert bits into bytes by the following rules:
- Starting from the beginning of the bit sequence, convert each 8 consecutive bits into 1 byte. If the number of bits is not a multiple of 8, pad the end of the bit sequence with 0s.
- When converting 8 bits into 1 byte, let the first bit be the least significant bit (LSB) in the byte, the second bit be the second LSB, and so on.
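The two bit-to-byte rules above can be sketched as follows. The function name packBits is hypothetical, and for simplicity the sketch assumes the whole bit sequence is held in a string of '0' and '1' characters; a real program would more likely pack bits as it emits them instead of building such a string.

```cpp
#include <string>

// Packs a sequence of '0'/'1' characters into bytes.
// The first bit of each group of 8 becomes the least significant bit.
// A final partial group is padded with 0 bits.
std::string packBits(const std::string& bits) {
    std::string bytes;
    unsigned char current = 0;  // byte being assembled
    int used = 0;               // number of bits already placed in current
    for (std::string::size_type i = 0; i < bits.size(); ++i) {
        if (bits[i] == '1')
            current |= static_cast<unsigned char>(1u << used);
        if (++used == 8) {
            bytes += static_cast<char>(current);
            current = 0;
            used = 0;
        }
    }
    if (used > 0)
        bytes += static_cast<char>(current);  // unused high bits are already 0
    return bytes;
}
```

For example, the 9-bit sequence 100000001 becomes two bytes, each with the value 1: the first bit fills the LSB of the first byte, and the ninth bit fills the LSB of the second byte, followed by seven padding 0s.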
We will test the decompression function of your program with only valid compressed data, so your program need not handle errors in the compressed data.
Extras
- My reference program has a verbose mode, where during compression it prints all the codes for debugging purposes. Use the "-v" option to invoke this mode. You do not need to implement this mode.
- The compression function in my reference program scans the input twice. During the first pass, it computes the frequency of each ASCII character. During the second pass, it encodes each character in the input. The function calls istream::seekg() between the two passes. Since a pipe is unseekable, you cannot pass a UNIX pipe as the input to my program (e.g., "cat input | huffman" would not work). Instead, you must use input redirection, such as "huffman < input".
- Food for thought (you need not submit answers to these questions):
- What is the purpose of the dummy character? Can you do without it?
- How do you compare your program in compression ratio with other Unix compressors, such as gzip/gunzip?