COMP 2011/2711 Data Organisation, Semester 1, 2006

Project 2

Due: Friday 5 May, 6:00 pm
Marks: 15% of final assessment

This project will involve writing a Java class to implement Adaptive Data Compression and Static Compression.

Data Compression

The purpose of data compression is to take a file A and compress it to another file B in such a way that:

• B is smaller than A, and
• it is possible to exactly reconstruct A by decompressing B.
It is desirable that such an algorithm be efficient, and that B be a lot smaller than A. The size of B divided by that of A is called the compression ratio. While it would be nice to have an algorithm which can achieve a compression ratio less than 1.0 for all possible input files, unfortunately such an algorithm cannot exist. To see this, let us suppose that all possible strings of length less than or equal to k can be compressed to strings of length less than or equal to l, with l<k. Since the number of strings of length less than or equal to l is smaller than the number of strings of length less than or equal to k, there would not be enough compressed strings to represent all the original strings uniquely.

Since it is not possible to successfully compress all files, the goal of data compression must be to achieve a good compression ratio for the types of files that are typically found on computers. Fortunately, a number of powerful methods have been discovered to do this, and you will be implementing one of them.

Prefix Codes

For simplicity, suppose we have a set of five letters: A, B, C, D, E. For each of these letters, a (binary) codeword will be assigned. These codewords will be binary strings with the property that no one string is a prefix of any other. Such a set of codewords is called a prefix code. There is a one-to-one correspondence between binary prefix codes for N letters, and binary trees with N leaves. Here is an example of a prefix code and its corresponding tree:

```      A: 00                                o    <--Root
B: 010                              / \
C: 011                             /   \
D: 10                             /     \
E: 11                            o       o   <-- internalNode
/ \     / \
A   o   D   E   <-- leafNode
/ \
B   C
```
Note:
1. the codeword for a leaf is just a binary string representing the path from the root of the tree to that leaf, where "left" corresponds to "0" and "right" corresponds to "1".
2. the codewords vary in length, and the compression ratio can generally be improved by assigning shorter codes to the more commonly occurring symbols.

Adaptive Splay Trees and Huffman Tree

In this assignment you will be implementing two things based on prefix codes: an adaptive splay tree compression algorithm, and a static Huffman tree. The adaptive splay algorithm continually changes the code tree as it reads the input, while the Huffman tree is only constructed after all the input is read. The differences between these approaches are as discussed in lectures.

Adaptive Splay Compression

In this section we describe the classes for the adaptive splay tree compression that you have to implement.

The classes `BinaryTreeNode`, `InternalNode`, `LeafNode`, `BitInputStream`, `BitOutputStream` are provided for you, and you are not allowed to make any changes to them. Your job is to fill in the details of the file `CodeTree.java` by providing code for the following five methods:

```public CodeTree(int num_leaves)         - build a new coding tree
public BinaryTreeNode getRoot()         - return the root of the tree
public BinaryTreeNode getLeaf(int item) - return leaf storing this item
public splay(InternalNode node)         - splay node to root of tree
public static void main(String[] args)  - compress or decompress
```

The constructor `CodeTree(int num_leaves)` should create an initial tree which has the specified number of leaves, and is "nearly balanced" in the sense that, for every node in the tree, the number of leaves in its left and right subtree must either be equal, or the number of leaves in the right subtree must exceeding those in the left subtree by one. This initialization is not unique and small variations from the "ideal" are immaterial as the subsequent splaying should smooth them out.

The `getRoot`() method should return the root of the tree.

The `getLeaf`() method should return the leaf node storing the specified `item`, where `item` is an integer between 0 and the number of leaves minus one.

The `splay`() method should restructure the tree by splaying the specified node up to the root of the tree, according to the splay algorithm described in lectures and your textbook. The classic paper by D.W. Jones on the use of splay trees for data compression which describes the specialization of splaying to adaptively shorten the prefix codes of high probability messages (symbols) may also help you understand the background of the work. However, you only need to read the first few pages of it, ignoring the remarks on entropy if you are unfamiliar with the concept. Moreover, note that the actual splay methods you have to program are those in the text, viz., zig, zig-zig and zig-zag, and not the ones described in this paper, i.e., do not twist the tree.

The `main` method in class `CodeTree.java` will be used to perform the compression and decompression. The first command-line argument will be the word "compress" or "decompress"; the second and third argument will specify the names of the input and output files, e.g.:

`java CodeTree compress` infile compfile
(compress the contents of infile to produce compfile)

`java CodeTree decompress` compfile decompfile
(decompress the contents of compfile to produce decompfile, which should then be an exact copy of infile).

The compressor should read the bytes of its input one by one and output a sequence of bits representing the codewords generated by the algorithm. The decompressor should read bits one-by-one, decode them into bytes, and output the bytes. We have written the classes `BitOutputStream` and `BitInputStream` to help you with the output and input. The `writeBit`() method collects bits one at a time, and when it has collected eight bits it packs them into a byte and outputs the byte. The method `readBit`() reads a byte, breaks it into eight bits, and returns the bits one at a time before reading the next byte.

What happens if the compressed version of a file has a length in bits that is not a multiple of eight? In this case there will be a problem packing it into bytes, since all files are required to end at a byte boundary. If we fill in the remaining bits with zeros (by calling the `flush`() method in `BitOutputStream`) the decompressor may get confused, since it may try to interpret those remaining zeros as codewords. In order to deal with this problem, the compressor will send a special terminating code to indicate the end of the file. So, as far as the `main` method is concerned the code tree will be assumed to have 257 leaves instead of 256. The leaves numbered 0 to 255 are used for sending the bytes of the input file, and a special leaf numbered 256 will be used as a signal to indicate the end of the input. The compressor writes this special code just before it calls `flush`() and quits. When the decompressor recognises this signal, it realizes the input is finished, and quits.

Huffman Tree (Static) Compressor

Huffman coding is described on pages 565-567 of your textbook, and the algorithm you have to implement is essentially that on page 567.

As in the first part of the assignment, you are to use the classes `BinaryTreeNode`, `InternalNode`, `LeafNode`, `BitInputStream`, `BitOutputStream` which are provided for you, and you are not allowed to make any changes to them. Your job is to fill in the following methods in `HuffmanTree.java`:

```public HuffmanTree(int[] freq)          - build a new Huffman tree
public BinaryTreeNode getRoot()         - return the root of the tree
public BinaryTreeNode getLeaf(int item) - return leaf storing this item
public double weightedPathLength()      - calculate the weighted path length
public static void main(String[] args)  - compress or decompress
```
The constructor `HuffmanTree(int[] freq)` should create a tree according to the Huffman algorithm. The byte frequency counts are to be given in the array `freq`, which should contain exactly 257 elements, such that `freq[c]` is the number of times the byte `c` occurs in the input.

The `getRoot`() method should return the root of the tree.

The `getLeaf`() method should return the leaf node storing the specified `item`, where `item` is an integer between 0 and the number of leaves minus one.

The `weightedPathLength`() method should calculate and return the weighted path length of the Huffman tree. This is the number that is calculated as follows. For each encoded byte B, if its code length is len(B) and its frequency is freq(B), then its (normalised) path weight w(B) is len(B)*freq(B)/N. (N is the total number of bytes in the input file). The weighted path length of the entire (Huffman) code tree is the sum of the path weights w(B) over all B's. The number you get should be between 1 and 8. If you get something wildly different, then there is something wrong with your program.

The `main()` method in the class `HuffmanTree` will be used to perform compression and decompressing using Huffman encoding. The first command-line argument will be the word "compress" or "decompress"; the second and third argument will specify the names of the input and output files, e.g.:

`java HuffmanTree compress` infile compfile
(compress the contents of infile to produce compfile)

`java HuffmanTree decompress` compfile decompfile
(decompress the contents of compfile to produce decompfile, which should then be an exact copy of infile).

In addition to outputting the compressed file, you are also to output the weighted path length of the Huffman tree on standard output in the following format:

```Weighted path length: 4.640663390663392
```
(replace the number with the actual value of the weighted path length)

The difference between static compression and adaptive splay compression is that the Huffman tree is only constructed after the entire file is read in. As you read in the symbols, keep a count of how many times each symbol occurs. At the end you have the total count of the occurrences of each symbol. Then construct the Huffman tree using these numbers, and compress the input using the tree. (Do you need to read in the input again?)

Unlike adaptive splay compression, you cannot reconstruct the tree from the compressed data alone. Therefore you will need to store enough information in the compressed file so that you can reconstruct the tree when you decompress the file. The easiest way to do this is to store the frequency counts at the beginning of the file, using the supplied `writeInt()` and `readInt()` methods. The method `writeInt()` is in `BitOutputStream`, and writes a 32-bit integer to the output file. The method `readInt()` is in `BitInputStream`, and reads a 32-bit integer from the input file.

You will need to use a priority queue to implement the Huffman algorithm. Fortunately, the Java API already provides us with a priority queue implementation in the template class `PriorityQueue`. You may use this class for this assignment. The following code creates a `PriorityQueue` object for holding `BinaryTreeNode` objects:

```PriorityQueue<BinaryTreeNode> queue = new PriorityQueue<BinaryTreeNode>();
```
To use the priority queue, the methods you will need are:
```public boolean add(BinaryTreeNode o)    - adds o to priority queue
public BinaryTreeNode poll()            - removes and returns head of the queue
public int size()                       - returns number of items in the queue
```

Testing your code:

We have made available `hw2test.java` which you can use for testing the tree creation and splaying methods in your program. The following command will use your CodeTree class to generate a tree with the given number of leaves, then perform a series of splay operations on the tree. After each step it will display the tree, so you can see the results of the splay operation.

`java hw2test num_leaves [ item ... ]`

Note that this program will not be used for testing your program in automarking. It is provided only as an aid for debugging your program.

We have put up some sample test data in samples/. The table below lists the performance of our compressors on the sample test data. The first three numbers in each line are the original and compressed sizes of the files (for both compressors), given in bytes. The last number gives the weighted path length of the Huffman tree generated by the Huffman compressor. Notice that for some files, the compressed file is actually bigger than the uncompressed file.

```File          Original  Adaptive   Huffman    WPL
----------------------------------------------------
image1.tiff     124120     35906     23143    1.4253
image2.jpg       33365     44284     34375    7.9954
image3.tiff      60286     15930     60905    7.9455
image4.jpg       66378     87367     67246    7.9805
random           65536     87995     66593    8.0033
text1.txt         8400      6637      5745    4.4914
text2.txt         8547      6794      5853    4.5156
text3.txt         5744      4707      4298    4.5528
text4.txt        12731      9951      8010    4.3867

Notes:
image[13].tiff   - uncompressed TIFF file
image[24].jpg    - compressed JPEG file
random           - 65536 random bytes
text[1234].txt   - plain English text files
```
Your program will be assessed on unseen test data and not just the sample test data provided.

Adaptive compression: if your compressor is within 10% of ours you get full performance marks, prorated accordingly for worse performance.

Huffman compression: if your compressor implements the Huffman algorithm correctly it will achieve the same performance as ours - we will look for this when marking. There are other assessment components such as style and question answering for the total mark.

Question

At the top of your code, in a block of comments, you should provide a brief answer (one or two short paragraphs) to the following question:

(i) How well do these methods of data compression work?
(ii) On which types of files do they work well (i.e. achieve a low compression ratio) and on which do they work poorly?
(iii) What is the main advantage of the adaptive method over the static method?

Submission

Once submissions are open, you should submit by typing

```give cs2011 hw2 CodeTree.java HuffmanTree.java ```

You can submit as many times as you like - later submissions will overwrite earlier ones.

Remember that you should not edit the files ```BinaryTreeNode.java, InternalNode.java, LeafNode.java, BitInputStream.java, BitOutputStream.java``` or `hw2test.java`. If you attempt to submit an altered version of any of these files, your submitted version will be ignored and the standard version will be used instead.

The submission deadline is as above.
15% penalty will be applied to the (maximum) mark for every 24 hours late after the deadline.

Additional information may be found in the FAQ as it is built up, and will be considered as part of the specification for the project.

If you have a question that has not already been answered on the FAQ, you can email it to `cs2011.hw2@cse.unsw.edu.au`

Assessment

The project will be marked on style and comments as well as functionality and performance. Programs which generate compilation errors will receive a very low mark, no matter what other virtues they may have. In general, a program that attempts a substantial part of the job but does that part correctly will receive more marks than one attempting to do the entire job but with many errors. The answer to the question is worth 3 (out of 15) marks.

Plagiarism Policy

Group submissions will not be allowed. Your program must be entirely your own work. Plagiarism detection software will be used to compare all submissions pairwise. These checks may be run after the automarking and release of marks, so the released marks are not necessarily final.

• First detection: negative mark for the assignment
• Second detection: failure of the course
• Third detection: faculty disciplinary action (including possible expulsion from the university)

DO NOT COPY FROM OTHERS; DO NOT ALLOW ANYONE TO SEE YOUR CODE

Please refer to the Yellow Form, or to section on Originality of Assignment Submissions in the Unix Primer, if you require further clarification on this matter.

Good luck!