Due: Friday 5 May, 6:00 pm
Marks: 15% of final assessment
This project will involve writing a Java class to implement Adaptive Data Compression and Static Compression.
Data Compression
The purpose of data compression is to take a file A and compress it to another file B in such a way that B is smaller than A, and A can be exactly reconstructed from B.
Since it is not possible to successfully compress all files, the goal of data compression must be to achieve a good compression ratio for the types of files that are typically found on computers. Fortunately, a number of powerful methods have been discovered to do this, and you will be implementing one of them.
Prefix Codes
For simplicity, suppose we have a set of five letters: A, B, C, D, E. For each of these letters, a (binary) codeword will be assigned. These codewords will be binary strings with the property that no one string is a prefix of any other. Such a set of codewords is called a prefix code. There is a one-to-one correspondence between binary prefix codes for N letters, and binary trees with N leaves. Here is an example of a prefix code and its corresponding tree:
A: 00              o        <-- root
B: 010            / \
C: 011           /   \
D: 10           /     \
E: 11          o       o    <-- internal node
              / \     / \
             A   o   D   E  <-- leaf node
                / \
               B   C
Adaptive Splay Trees and Huffman Tree
In this assignment you will be implementing two things based on prefix codes: an adaptive splay tree compression algorithm, and a static Huffman tree. The adaptive splay algorithm continually changes the code tree as it reads the input, while the Huffman tree is only constructed after all the input is read. The differences between these approaches are as discussed in lectures.
The classes BinaryTreeNode, InternalNode, LeafNode, BitInputStream and
BitOutputStream are provided for you, and you are not allowed to make
any changes to them. Your job is to fill in the details of the file
CodeTree.java by providing code for the following five methods:

public CodeTree(int num_leaves)          - build a new coding tree
public BinaryTreeNode getRoot()          - return the root of the tree
public BinaryTreeNode getLeaf(int item)  - return leaf storing this item
public void splay(InternalNode node)     - splay node to root of tree
public static void main(String[] args)   - compress or decompress
The constructor CodeTree(int num_leaves) should create an initial tree
which has the specified number of leaves, and is "nearly balanced" in
the sense that, for every node in the tree, the number of leaves in its
left and right subtrees must either be equal, or the number of leaves
in the right subtree must exceed the number in the left subtree by one.
This initialization is not unique, and small variations from the
"ideal" are immaterial, as the subsequent splaying should smooth them
out.
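As a sketch of the "nearly balanced" requirement, the following self-contained code builds such a tree by giving the right subtree the extra leaf whenever the leaf count is odd. Note the Node class here is a hypothetical stand-in, not the provided BinaryTreeNode/InternalNode/LeafNode classes:

```java
// Stand-in node type; the assignment's provided classes replace this.
class Node {
    Node left, right;
    int item;                                       // leaf item, or -1 for internal nodes
    Node(int item) { this.item = item; }            // leaf constructor
    Node(Node l, Node r) { item = -1; left = l; right = r; }
}

public class NearlyBalanced {
    // Build a nearly balanced tree over items lo..hi (inclusive):
    // the right subtree holds as many leaves as the left, or one more.
    static Node build(int lo, int hi) {
        if (lo == hi) return new Node(lo);
        int n = hi - lo + 1;
        int leftCount = n / 2;                      // right side gets the extra leaf
        return new Node(build(lo, lo + leftCount - 1), build(lo + leftCount, hi));
    }

    static int countLeaves(Node t) {
        return t.item >= 0 ? 1 : countLeaves(t.left) + countLeaves(t.right);
    }

    public static void main(String[] args) {
        Node root = build(0, 256);                  // 257 leaves, as in the assignment
        System.out.println(countLeaves(root));      // 257
    }
}
```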
The getRoot() method should return the root of the tree. The getLeaf()
method should return the leaf node storing the specified item, where
item is an integer between 0 and the number of leaves minus one.
The splay() method should restructure the tree by splaying the
specified node up to the root of the tree, according to the splay
algorithm described in lectures and your textbook. The classic paper by
D.W. Jones on the use of splay trees for data compression may also help
you understand the background of this work; it describes how splaying
can be specialized to adaptively shorten the prefix codes of
high-probability messages (symbols). However, you only need to read the
first few pages of it, and you may ignore the remarks on entropy if you
are unfamiliar with the concept. Moreover, note that the actual splay
methods you have to program are those in the text, viz. zig, zig-zig
and zig-zag, and not the ones described in this paper, i.e., do not
twist the tree.
The main method in the class CodeTree.java will be used to perform the
compression and decompression. The first command-line argument will be
the word "compress" or "decompress"; the second and third arguments
will specify the names of the input and output files, e.g.:

java CodeTree compress infile compfile
    (compress the contents of infile to produce compfile)
java CodeTree decompress compfile decompfile
    (decompress the contents of compfile to produce decompfile,
    which should then be an exact copy of infile)
The compressor should read the bytes of its input one by one
and output a sequence of bits representing the codewords generated
by the algorithm.
The decompressor should read bits one-by-one, decode them into bytes,
and output the bytes.
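The codeword for a byte is simply the path from the root to that byte's leaf, with 0 for a left branch and 1 for a right branch. Here is a minimal, self-contained sketch of recovering a codeword by recursive search; the Node class is a hypothetical stand-in (the provided classes may instead expose parent pointers, in which case you would walk leaf-to-root and reverse the bits):

```java
// Stand-in node type; the assignment's provided classes replace this.
class Node {
    Node left, right;
    int item;                                       // leaf item, or -1 for internal nodes
    Node(int item) { this.item = item; }
    Node(Node l, Node r) { item = -1; left = l; right = r; }
}

public class Codewords {
    // Append the codeword for `item` to sb; returns true if found.
    static boolean codeFor(Node t, int item, StringBuilder sb) {
        if (t.item >= 0) return t.item == item;     // leaf: match or dead end
        sb.append('0');
        if (codeFor(t.left, item, sb)) return true;
        sb.setCharAt(sb.length() - 1, '1');
        if (codeFor(t.right, item, sb)) return true;
        sb.deleteCharAt(sb.length() - 1);           // undo on failure
        return false;
    }

    public static void main(String[] args) {
        // The example tree above: A(0)=00, B(1)=010, C(2)=011, D(3)=10, E(4)=11
        Node root = new Node(
            new Node(new Node(0), new Node(new Node(1), new Node(2))),
            new Node(new Node(3), new Node(4)));
        StringBuilder sb = new StringBuilder();
        codeFor(root, 1, sb);
        System.out.println(sb);                     // 010, the codeword for B
    }
}
```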
We have written the classes BitOutputStream and BitInputStream to help
you with the output and input. The writeBit() method collects bits one
at a time, and when it has collected eight bits it packs them into a
byte and outputs the byte. The readBit() method reads a byte, breaks it
into eight bits, and returns the bits one at a time before reading the
next byte.
What happens if the compressed version of a file has a length in
bits that is not a multiple of eight? In this case there will be a problem
packing it into bytes, since all files are required to end at a byte boundary.
If we fill in the remaining bits with zeros (by calling the flush()
method in BitOutputStream), the decompressor may get confused, since it
may try to interpret those remaining zeros as codewords.
In order to deal with this problem, the compressor will send a special
terminating code to indicate the end of the file.
So, as far as the main method is concerned, the code tree will be
assumed to have 257 leaves instead of 256. The leaves numbered 0 to 255
are used for sending the bytes of the input file, and a special leaf
numbered 256 will be used as a signal to indicate the end of the input.
The compressor writes this special code just before it calls flush()
and quits.
When the decompressor recognises this signal, it realizes the input is finished,
and quits.
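The decode loop can be sketched as follows: walk from the root, branching on each bit, and emit an item each time a leaf is reached. This is a hypothetical stand-alone version, with a stand-in Node class and a bit string standing in for BitInputStream.readBit(); in the real decompressor you would stop as soon as the leaf numbered 256 appears:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in node type; the assignment's provided classes replace this.
class Node {
    Node left, right;
    int item;                                       // leaf item, or -1 for internal nodes
    Node(int item) { this.item = item; }
    Node(Node l, Node r) { item = -1; left = l; right = r; }
}

public class Decode {
    static List<Integer> decode(Node root, String bits) {
        List<Integer> out = new ArrayList<>();
        Node t = root;
        for (char b : bits.toCharArray()) {
            t = (b == '0') ? t.left : t.right;      // follow one bit down the tree
            if (t.item >= 0) {                      // reached a leaf: emit, restart at root
                out.add(t.item);
                t = root;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The example tree: A(0)=00, B(1)=010, C(2)=011, D(3)=10, E(4)=11
        Node root = new Node(
            new Node(new Node(0), new Node(new Node(1), new Node(2))),
            new Node(new Node(3), new Node(4)));
        System.out.println(decode(root, "0001011")); // [0, 1, 4] i.e. A, B, E
    }
}
```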
As in the first part of the assignment, you are to use the classes
BinaryTreeNode, InternalNode, LeafNode, BitInputStream and
BitOutputStream which are provided for you, and you are not allowed to
make any changes to them.
Your job is to fill in the following methods in HuffmanTree.java:

public HuffmanTree(int[] freq)           - build a new Huffman tree
public BinaryTreeNode getRoot()          - return the root of the tree
public BinaryTreeNode getLeaf(int item)  - return leaf storing this item
public double weightedPathLength()       - calculate the weighted path length
public static void main(String[] args)   - compress or decompress

The constructor HuffmanTree(int[] freq) should create a tree according
to the Huffman algorithm. The byte frequency counts are to be given in
the array freq, which should contain exactly 257 elements, such that
freq[c] is the number of times the byte c occurs in the input.
The getRoot() method should return the root of the tree. The getLeaf()
method should return the leaf node storing the specified item, where
item is an integer between 0 and the number of leaves minus one.
The weightedPathLength() method should calculate and return the
weighted path length of the Huffman tree. This is calculated as
follows: for each encoded byte B, if its code length is len(B) and its
frequency is freq(B), then its (normalised) path weight is
w(B) = len(B)*freq(B)/N, where N is the total number of bytes in the
input file. The weighted path length of the entire (Huffman) code tree
is the sum of the path weights w(B) over all B's. The number you get
should be between 1 and 8; if you get something wildly different, then
there is something wrong with your program.
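The calculation can be checked with a few lines of stand-alone Java. The code lengths and frequencies below are made-up illustrative values for the five-letter example code, not assignment data:

```java
public class WPL {
    // Weighted path length: (sum over B of len(B)*freq(B)) / N,
    // where N is the total number of input bytes. Summing in integers
    // first and dividing once avoids accumulating rounding error.
    static double weightedPathLength(int[] len, int[] freq) {
        long n = 0, total = 0;
        for (int b = 0; b < freq.length; b++) {
            n += freq[b];
            total += (long) len[b] * freq[b];
        }
        return (double) total / n;
    }

    public static void main(String[] args) {
        // Code A=00, B=010, C=011, D=10, E=11 with made-up counts 4,1,1,2,2
        int[] len  = {2, 3, 3, 2, 2};
        int[] freq = {4, 1, 1, 2, 2};
        System.out.println(weightedPathLength(len, freq)); // 22/10 = 2.2
    }
}
```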
The main() method in the class HuffmanTree will be used to perform
compression and decompression using Huffman encoding. The first
command-line argument will be the word "compress" or "decompress"; the
second and third arguments will specify the names of the input and
output files, e.g.:
java HuffmanTree compress infile compfile
    (compress the contents of infile to produce compfile)
java HuffmanTree decompress compfile decompfile
    (decompress the contents of compfile to produce decompfile,
    which should then be an exact copy of infile)
In addition to outputting the compressed file, you are also to output the weighted path length of the Huffman tree on standard output in the following format:
Weighted path length: 4.640663390663392

(replace the number with the actual value of the weighted path length)
The difference between static compression and adaptive splay compression is that the Huffman tree is only constructed after the entire file is read in. As you read in the symbols, keep a count of how many times each symbol occurs. At the end you have the total count of the occurrences of each symbol. Then construct the Huffman tree using these numbers, and compress the input using the tree. (Do you need to read in the input again?)
Unlike adaptive splay compression, you cannot reconstruct the tree from the
compressed data alone. Therefore you will need to store enough information
in the compressed file so that you can reconstruct the tree when you
decompress the file.
The easiest way to do this is to store the frequency counts at the
beginning of the file, using the supplied writeInt() and readInt()
methods.
The method writeInt() is in BitOutputStream, and writes a 32-bit
integer to the output file. The method readInt() is in BitInputStream,
and reads a 32-bit integer from the input file.
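The round trip that the header implements can be sketched with the standard DataOutputStream/DataInputStream, which also write and read 32-bit integers (an assumption: the provided writeInt()/readInt() may pack bytes in a different order, so this is only an illustration of the idea):

```java
import java.io.*;

public class FreqHeader {
    public static void main(String[] args) throws IOException {
        int[] freq = new int[257];
        for (int c = 0; c < 257; c++) freq[c] = c * 3 + 1;  // dummy counts

        // Compressor side: write the 257 counts as 32-bit integers...
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        for (int f : freq) out.writeInt(f);

        // ...decompressor side: read them back before decoding.
        DataInputStream in = new DataInputStream(
            new ByteArrayInputStream(buf.toByteArray()));
        int[] back = new int[257];
        for (int c = 0; c < 257; c++) back[c] = in.readInt();

        System.out.println(back[256]);  // 256*3+1 = 769: counts survive the trip
        System.out.println(buf.size()); // 257 * 4 = 1028 bytes of header
    }
}
```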
You will need to use a priority queue to implement the Huffman algorithm.
Fortunately, the Java API already provides a priority queue
implementation in the generic class PriorityQueue. You may use this
class for this assignment.
The following code creates a PriorityQueue object for holding
BinaryTreeNode objects:

PriorityQueue<BinaryTreeNode> queue = new PriorityQueue<BinaryTreeNode>();

To use the priority queue, the methods you will need are:

public boolean add(BinaryTreeNode o)  - adds o to the priority queue
public BinaryTreeNode poll()          - removes and returns the head of the queue
public int size()                     - returns the number of items in the queue
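With those three methods, the Huffman construction is a short loop: insert a leaf per symbol, then repeatedly merge the two lightest trees. This sketch uses a hypothetical stand-in HNode class with a weight field and Comparable ordering (the provided BinaryTreeNode presumably plays this role); the frequencies are the classic textbook example, not assignment data:

```java
import java.util.PriorityQueue;

// Stand-in node type ordered by weight, so PriorityQueue can rank it.
class HNode implements Comparable<HNode> {
    HNode left, right;
    int item;                                       // leaf item, or -1 for internal nodes
    long weight;
    HNode(int item, long w) { this.item = item; weight = w; }
    HNode(HNode l, HNode r) {
        item = -1; left = l; right = r;
        weight = l.weight + r.weight;               // merged tree carries combined weight
    }
    public int compareTo(HNode o) { return Long.compare(weight, o.weight); }
}

public class BuildHuffman {
    static HNode build(int[] freq) {
        PriorityQueue<HNode> queue = new PriorityQueue<HNode>();
        for (int c = 0; c < freq.length; c++)
            queue.add(new HNode(c, freq[c]));
        // Repeatedly merge the two lightest trees until one remains.
        while (queue.size() > 1)
            queue.add(new HNode(queue.poll(), queue.poll()));
        return queue.poll();
    }

    // Code length of `item` = its depth in the tree (-1 if absent).
    static int depth(HNode t, int item) {
        if (t.item == item) return 0;
        if (t.item >= 0) return -1;
        int d = depth(t.left, item);
        if (d < 0) d = depth(t.right, item);
        return d < 0 ? -1 : d + 1;
    }

    public static void main(String[] args) {
        int[] freq = {45, 13, 12, 16, 9, 5};        // classic textbook frequencies
        HNode root = build(freq);
        System.out.println(depth(root, 0));         // most frequent symbol: code length 1
    }
}
```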
We have provided a test program, hw2test.java, which you can use for
testing the tree creation and splaying methods in your program. The
following command will use your CodeTree class to generate a tree with
the given number of leaves, then perform a series of splay operations
on the tree. After each step it will display the tree, so you can see
the results of the splay operation.
java hw2test num_leaves [ item ... ]
Note that this program will not be used for testing your program in automarking. It is provided only as an aid for debugging your program.
We have put up some sample test data in samples/. The table below lists the performance of our compressors on the sample test data. The first three numbers in each line are the original and compressed sizes of the files (for both compressors), given in bytes. The last number gives the weighted path length of the Huffman tree generated by the Huffman compressor. Notice that for some files, the compressed file is actually bigger than the uncompressed file.
File         Original  Adaptive  Huffman     WPL
------------------------------------------------
image1.tiff    124120     35906    23143  1.4253
image2.jpg      33365     44284    34375  7.9954
image3.tiff     60286     15930    60905  7.9455
image4.jpg      66378     87367    67246  7.9805
random          65536     87995    66593  8.0033
text1.txt        8400      6637     5745  4.4914
text2.txt        8547      6794     5853  4.5156
text3.txt        5744      4707     4298  4.5528
text4.txt       12731      9951     8010  4.3867

Notes:
image[13].tiff - uncompressed TIFF files
image[24].jpg  - compressed JPEG files
random         - 65536 random bytes
text[1234].txt - plain English text files

Your program will be assessed on unseen test data and not just the
sample test data provided.
Adaptive compression: if your compressor is within 10% of ours you get full performance marks, prorated accordingly for worse performance.
Huffman compression: if your compressor implements the Huffman algorithm correctly it will achieve the same performance as ours - we will look for this when marking. There are other assessment components such as style and question answering for the total mark.
Question
At the top of your code, in a block of comments, you should provide a brief answer (one or two short paragraphs) to the following question:
(i) How well do these methods of data compression work?
(ii) On which types of files do they work well (i.e. achieve a low compression ratio) and on which do they work poorly?
(iii) What is the main advantage of the adaptive method over the static method?
Submission
Once submissions are open, you should submit by typing
give cs2011 hw2 CodeTree.java HuffmanTree.java
You can submit as many times as you like - later submissions will overwrite earlier ones.
Remember that you should not edit the files
BinaryTreeNode.java, InternalNode.java, LeafNode.java,
BitInputStream.java, BitOutputStream.java
or
hw2test.java
.
If you attempt to submit an altered version of any of these files,
your submitted version will be ignored and the
standard version will be used instead.
The submission deadline is as above.
A 15% penalty will be applied to the (maximum) mark for every 24 hours
late after the deadline.
Additional information may be found in the FAQ as it is built up, and will be considered as part of the specification for the project.
If you have a question that has not already been answered on the FAQ,
you can email it to cs2011.hw2@cse.unsw.edu.au
Assessment
The project will be marked on style and comments as well as functionality and performance. Programs which generate compilation errors will receive a very low mark, no matter what other virtues they may have. In general, a program that attempts a substantial part of the job but does that part correctly will receive more marks than one attempting to do the entire job but with many errors. The answer to the question is worth 3 (out of 15) marks.
Plagiarism Policy
Group submissions will not be allowed. Your program must be entirely your own work. Plagiarism detection software will be used to compare all submissions pairwise. These checks may be run after the automarking and release of marks, so the released marks are not necessarily final.
DO NOT COPY FROM OTHERS; DO NOT ALLOW ANYONE TO SEE YOUR CODE
Please refer to the Yellow Form, or to section on Originality of Assignment Submissions in the Unix Primer, if you require further clarification on this matter.
Good luck!