Lab Exercises Week 8

Objectives

To learn about balanced trees
To practise tree implementations
To learn about hash tables

Assessment

Deadline: 11:59pm Wednesday 10th February

Total Marks: 4 (1 mark optional)

Related Chapters of textbook

Chapter 12.8 - 13.2 Sedgewick

Chapter 14 Sedgewick

You may do this lab in pairs

Setting Up

Set up a directory for this lab under your cs1927/labs directory, change into that directory, and run the following command:

cp /home/cs1927/public_html/16x1/labs/lab08/files/* .
ln -s /home/cs1927/public_html/16x1/labs/lab08/largeFiles/dict3 dict3
ln -s /home/cs1927/public_html/16x1/labs/lab08/largeFiles/dict4 dict4
ln -s /home/cs1927/public_html/16x1/labs/lab08/largeFiles/dict5 dict5

If you've done the above correctly, you should now find the following files:

Tree.h	A binary search tree interface
Tree.c	A binary search tree implementation that needs to be modified
words.c	a main program which loads words into the tree
Item.h	a simple Item definition. Items are arrays of characters
Makfile	A simple makefile for the lab
HashTable.h, HashTable.c	a Hash Table ADT
HashTableItem.h, HashTableItem.c	Defines keys and values for the HashTable as implements a hashFuntion
words2.c	a main program that loads words into the hash table
dict0	a file containing just 13 words
dict1	a file containing 1777 words
dict2	a file containing 7106 words

Setup

Unix has a large dictionary of English words which available to use for tasks such as spell checking. This dictionary is normally held in a file called /usr/share/dict/words which contains almost 100000 words, including a large number of proper names (e.g. John, Sydney), and also including possessive forms of most nouns (e.g. cat and cat's). For spell-checking, we would scan a file of text and check whether each word in the file also appeared in the dictionary; any word that did not would be treated as a spelling error. This essentially means using the dictionary as a "lookup table".

Before /usr/share/dict/words can reasonably be used as a lookup table, it needs to be put into a more useful form than a list of words (albeit alphabetically ordered).One possibility is to load the dictionary into a binary search tree and do the searches on this data structure.

The difference between dict4 and dict5 is that all of the proper names and all of the possessive forms have been removed. Apart from dict0, the other dictionaries are randomly-chosen subsets of dict4. All of the files are sorted in alphabetical order. Note that we give you links to the large dictionaries rather than copies of them because they occupy a non-trivial amount of disk quota. If you want to work on this exercise at home and use the large dictionaries, you'll need to grab copies of them.

We have implemented a program, words.c that reads the dictionary words into Tree structures for fast lookup.

Task1: Rotations - 1 Marks

rotLeft and rotRight in Tree.c are broken. They do not update the size information in the nodes. This stops other functions that rely on the size field, such as partition, from working properly. You must fix these and write some white box tests to check they work correctly.

Note: Our implementation or rotLeft and rotRight do not check for situtations such as trying to rotate an emptyTree etc, so you may want to add error checking for these cases too.

Task2: Balancing Trees - 1 Marks

Note: You must get task1 working otherwise the balancing functions used in this task will not work

The main program words reads from a user-specified file of words, and, once the words are loaded, prints out the number of nodes and the height of the tree used to store the words. It also prints out what kind of rebalancing strategy was used.

It takes a command-line argument containing the name of the dictionary file to be loaded along with an integer that represents what tree balancing strategy should be used by the tree. The program works as follows:

It inserts all the words from the test file dictionary into a dictionary ADT that is implemented via a tree. The program then prints out properties of the resulting underlying tree.

Make the file and confirm that it runs using the dict0 file and using balance strategy 0 (No rebalancing) by running it as follows

./words dict0 0

You can run other balancing strategies (once you have implemented them) by using different command line arguments. For example, once you have implemented REBALANCE_1 strategy you could also try the approach where the tree has been rebalanced after every insertion by running:

./words dict0 1

We are going to investigate/implement the following approaches for rebalancing a BST.

NO_REBALANCE : No rebalancing - normal BST insertion
REBALANCE_1 : Global rebalancing after every insertion
REBALANCE_100 : Global rebalancing after every 100 items are inserted
REBALANCE_1000 : Global rebalancing after every 1000 items are inserted
RANDOMISED : Using randomised BST insertion
SPLAY: Using splay insertion

The task is to complete a function called treeInsert that is partially implemented in the file tree.c

void treeInsert(Tree tree, Item it);

The function should call the appropriate insertRecursive and balance functions according to the specified rebalancing approach. The function already handles the first balancing strategies (not balancing the tree at all), you need to add code to handle the rest.

Note: All the necessary functions for standard insertion, balancing, random insertion and splay insertion are provided. You just need to call them appropriately.

Once you have implemented the above strategies, run all approaches on the different dictionaries supplied and record the

heights of the resulting trees in a file called lab08.txt.
the amount of user time it takes to insert and search for all keys
the amount of user time it takes to insert all keys (comment out the line in the main function that calls the searchAllWords function)

Note: Make sure you run the randomised BST approach a number of times to get an average (as this is a randomised approach and heights will differ 'randomly' each time).

Discuss the reasons for the differences in height and run times between these different approaches and algorithms. How does each algorithm influence the time to insert vs the time to search vs the height of the resulting tree? What other tests could you run to compare these implementations.

Task3: Search in a splay tree (Harder and optional) - 1 Mark

In lectures, we discussed inserting an item into a splay tree (Balanced tree examples). The amortised cost of insert and search operations on splay trees is only valid if we also perform splay rotations during searching for an item. An partial of implementation of splaySearch has been provided in Tree.c to get you started. Its prototype is as follows:

link searchSplay (link n, Key k, int * found);

It searches for an item and performs splay rotations on the traversed path, much like the function splayInsert() does.

NOTE: Once an item has been found it ismoved to the root of tree. If the key does not exist in the tree, the last node on the search path should be moved to the root.

You can look at insertSplay to help you implement this function.

NOTE2: Because the search function changes the root of the tree, it needs to return a pointer to the new root, thus the actual result of the search should be stored in the variable pointed to by found. This should be 1 if the key was in the tree and 0 otherwise.

The treeSearch function needs to be modified so that it calls the appropriate version of searching ,depending what balancing strategy is used. ie it calls splaySearch function if the balanceStrategy is set to SPLAY or the standard searchRecursive otherwise.

Try running the program on all dictionaries using splay insert and search, inserting all keys, printing out the height and then calling the searchAllWords function and printing out the height again and recording the times again. Record the heights before and after searching. Compare the heights and times and discuss your results.

Task 4 HashTables - 1 Mark

words2.c is a program that is similar to words.c except instead of using trees to store the words it stores the words in a hashtable. The program is run in a similar way, however this time, the second command line argument indicates what size to make the hashTable.

For example

./words2 dict0 100

reads dict0 into a hash table of size 100

The program then prints out some statistics about the hashtable, such as the size and the number of collisions. We also want it to print out the length of the longest chain in the hashtable. At the moment this has not been implemented and it simply prints out 0.

Inside the HashTable.c file, there is an implementation of a hashTable that uses chaining to handle collisions.

Your task is to write the function

static int longestChainSize(HashTable table);

This function should go through the hash table and find the length of the longest chain in the hashtable.

Once you have written this function and are convinced it works, try running the program with different dictionaries and different table sizes. Record the number of collisions and the length of the longest chain for the different dictionaries and different table sizes. How effective would you say that the given hashFunction is in this situation? Write this in your lab08.txt file.

Perhaps even try modifying the hashFunction inside HashTableItem.c. Can you improve on the one we have given you? (Note: you do not need to improve the hashFunction for the lab mark. This is just for interest)

Submission

When you are happy with your work, please show it to your tutor to get it marked. Before you leave your lab, remember to submit your lab via the give command

1927 classrun 16x1 give lab08 Tree.c lab08.txt HashTable.c