The contents of the sample_output directory in the skeleton2.zip file are obsolete. Please use the latest results (with 10 results for the queries 'australia', 'kangaroo', and 'powerpoint').
The cache mechanism is meant to (1) prevent a faulty program from using up your entire daily BOSS API quota, and (2) let you develop and debug your program offline. It does not check whether the number of results in the cached XML file matches the requested size. So, if you want to debug your program, you need to copy the files in the cache-10 folder to the cache folder and pass 10 to ClusterBOSS. When we test your program, we will make sure the command-line parameter passed to your program matches the number of results in the cached XML file. In other words, you do not need to modify or improve the current caching code.
Is it possible that you will request an arbitrary number of results (<= 1000) in the test?
Yes. For example, I can request your program to cluster top-75 results of a query into 10 clusters.
What if there is more than one cell in the matrix with the same similarity value?
We will make sure it does not occur in the test dataset. Unfortunately, it does occur in the 'australia-10-3-complete' case (in the last step, where all entries are 0.0).
Do I need to round the similarity values to a certain precision so that my matrix looks the same as the one in the log file?
No. I use double to record all the similarity values, and you should too. The values are printed with 4 digits after the decimal point only to make the output matrix readable.
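To illustrate the difference between stored and printed values, here is a minimal sketch in Python (the assignment does not prescribe this code; format_row is a hypothetical helper name). The stored value keeps full double precision, and only the printed string is limited to 4 decimal places.

```python
def format_row(row):
    """Format one matrix row for display with 4 digits after the decimal point.

    Hypothetical helper: the values in `row` are left untouched, only the
    returned string is shortened.
    """
    return " ".join(f"{value:.4f}" for value in row)


# Python floats are IEEE 754 doubles, so this is stored at full precision.
similarities = [2.0 / 3.0, 0.0, 1.0]

# Display only; `similarities` itself is never rounded.
printed = format_row(similarities)
print(printed)
```

Rounding the stored values instead would let small errors accumulate across merge steps, which is why the formatting is applied only at print time.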
Why do the similarity values in my average-link algorithm look different from yours?
The single most likely reason is that you always update the matrix using the plain average of two similarity values from the last-round matrix. This is WRONG. Please double-check the formula used in the average-link algorithm.
For those who understand how to do it correctly, yes, it is possible to calculate the new matrix just based on the matrix obtained in the last round. You will need to do a weighted average.
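The weighted average mentioned above can be sketched as follows. This is only an illustration in Python, and it assumes that average-link similarity is defined as the mean similarity over all document pairs that span the two clusters; if the formula given in class differs (for example, if it also counts pairs inside the merged cluster), the weights will differ too. The function names and the toy similarity values are made up for this example.

```python
def merged_similarity(sim_ac, sim_bc, size_a, size_b):
    """Similarity between the merged cluster (A u B) and another cluster C.

    Assumes average-link = mean similarity over all cross-cluster pairs.
    sim_ac and sim_bc come from the last-round matrix; size_a and size_b
    are the numbers of documents in A and B before the merge.
    """
    return (size_a * sim_ac + size_b * sim_bc) / (size_a + size_b)


def brute_force_similarity(pair_sims):
    """Mean over every individual cross-cluster pair, used as a reference."""
    return sum(pair_sims) / len(pair_sims)


# Toy data (made up): A = {d0, d1}, B = {d2}, C = {d3}.
s03, s13, s23 = 0.2, 0.4, 0.9

sim_ac = (s03 + s13) / 2   # last-round entry for (A, C)
sim_bc = s23               # last-round entry for (B, C)

weighted = merged_similarity(sim_ac, sim_bc, size_a=2, size_b=1)
reference = brute_force_similarity([s03, s13, s23])
plain = (sim_ac + sim_bc) / 2   # the common mistake: ignores cluster sizes

print(f"weighted={weighted:.4f} reference={reference:.4f} plain={plain:.4f}")
```

In this toy case the weighted update agrees with the brute-force mean, while the plain average does not, because A contributes two documents and B only one.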