COMP9318 (09s1) Project 1 Specification

1. Update History

2. Aim

3. Part I -- Get Started

Last modified: Thu Apr 23 15:54:36 EST 2009

3.1. Register an Yahoo! BOSS account

Go to http://developer.yahoo.com/search/boss/ and click the yellow “Get an App ID” button on the right hand side.

You will get an API ID and please save it to a plain text file named APIKEY.

3.2. Install JDK 1.6

Since all the CSE machines are using Java 1.6, we require you to use Java 1.6 if you choose to run the program on your own machine.

3.3. Play with the provided source code

The skeleton code for Part I can be found at here. It is adapted from http://frankmccown.blogspot.com/2008/07/yahoos-new-search-boss-api.html.

To compile the provided source code, you need toc either add json.jar to the CLASSPATH or specify it in the command line. E.g.,

  javac -classpath "%CLASSPATH%;./json.jar" *.java

Running the program

  java -classpath "%CLASSPATH%;./json.jar" BOSS australia 50
(under Linux)
  java -classpath "${CLASSPATH}:./json.jar" BOSS australia 50

  java -Dhttp.proxyHost=chopin.cse.unsw.edu.au -Dhttp.proxyPort=3128 -classpath "%CLASSPATH%;./json.jar" BOSS australia 50

If your query contains more than one keyword, wrap them inside the quotation mark. E.g.,

  java -classpath "%CLASSPATH%;./json.jar" BOSS "xml xquery" 50

3.3.1. Use cached results

There is a simple caching mechanism implemented in the source code such that you can run and test your program even without an Internet connection (or in the unlikely situation that you've run out of the BOSS API quota for the day).

The query results of every (live) query you used will be stored in XML format under the ./cache directory. A simple filename conversion rule is used which converts space to '_' char.

You may prefix the query using the keyword CACHE: to instruct the program to use cached results. E.g.,

  java -classpath "%CLASSPATH%;./json.jar" BOSS CACHE:australia 50
will just reuse the cached result (check out the file at ./cache/australia.xml).

Several example cached query files are included.

3.4. What you need to do

3.4.1. Fetch more than 50 query results

You need to implement the query function such that you can fetch up to 1000 results per query. Read the block comments in BOSS.java for more details.

4. Part II -- Clustering the Query Results

Last modified: Sun May 24 13:29:36 EST 2009

4.1. Provided Source Code

THIS SECTION HAS BEEN UPDATED: Cached results and sample output in the original skeleton2.zip file were removed.

The skeleton code for Part II can be found at here.

4.2. What you need to do

Note that

The final project, once implemented, can be run by, e.g.,

  java -classpath "%CLASSPATH%;./json.jar" ClusterBOSS australia 50 10 avg

4.2.1. Note

THIS SECTION HAS BEEN UPDATED.

Sample output using the latest cached queries can be found here. The order of your cluster output may vary but the cluster membership should be the same.

4.3. Debug Information

THIS SECTION HAS BEEN UPDATED.

You can find some debug information and datasets here. Cached query results can be found in the debug-info-new/cache-10 and debug-info-new/cache-50 directory. You can copy them to ./cache to use them.

Detailed logs can be found in the log files. These log files were genearted using the 10 cached query results provided in the debug-info-new.tar.bz2 file (note that the latest query results from the Yahoo! BOSS API *will* produce (slightly) different results).

The format of the log file should be quite self-explanatory. The (cosine) similarity matrix is dumped before and after each cluster merge step. The matrix is printed in a similar way as the lecture notes do, i.e., only the upper triangular part (excluding the diagonal) is printed. For example, in the result for australia and average link, the first cell in the first matrix is 0.0726, meaning that the first QueryResult and the second one has a cosine similarity value of 0.0726.

Note that we replace the query leopard with powerpoint, as the complete-link clustering results for the former might be non-unique (all cells in the matrix is 0).

4.3.1. Cosine Similarity

In this project, we will perform hierarchical clustering based on a similarity matrix. The similarity between two documents is measured by the cosine similarity.

Cosine similarity is a widely-used similarity function to evaluate the similarity between two documents (or to be precise, the vector representations of the documents).

In our project, we use use the title and the snippet of the query result as a document (with different weights). We tokenize the document into tokens that are separated by any one of the ClusterBOSS.SEPARATORS. We also perform stemming and stop-word removal. If we do this for all documents, we can get the set of unique tokens (denoted as T). Then each document can be represented a |T|-dimensional vector, where each dimension correponds to a token t in T, and the value represents how many times the token t appears in the document. Since tokens are not equally important, we adopt a well-known token weighting scheme as follows: for each token t, we calculate how many documents in the collection containing it (denoted as df(t)). The weight of the token will be multiplied by idf(t) = log2 (N/df(t)), where N stands for the total number of document in the collection.

For example, given the following two documents:

 x = "aa bb cc cc"
 y = "bb dd"
 z = "cc"

T will be {aa, bb, cc, dd }. df(aa) = df(dd) = 1 and df(bb) = df(cc) = 2. Therefore, idf(aa) = idf(dd) = log2 (3/1) = 1.58, while idf(bb) = idf(cc) = log2 (3/2) = 0.58. The document vector representation of these documents will be:

x = [ 1.58, 0.58, 1.16, 0.00 ]
y = [ 0.00, 0.58, 0.00, 1.58 ]
z = [ 0.00, 0.00, 0.58, 0.00 ]

Cosine similarity between two |T|-dimensional vectors is defined as:

cosine(x, y) = A / B

A = sum_i (x[i] * y[i])
B = sqrt{ sum_i (x[i]^2) } * sqrt{ sum_i (y[i]^2) }

In our example,

  cosine(x, z) = ( 1.16 * 0.58 ) / ( sqrt{1.58^2 + 0.58^2 + 1.16^2} * sqrt{ 0.58^2 } ) 
               = 0.67 / (2.04 * 0.58 ) 
               = 0.57

For large document collections, there will be a large number of unique tokens, i.e., T is usually in the order of tens of thousands. If the average length of documents is only in the thousands, most of the values in the document vectors will be 0. Therefore, it is better to encode the document vectors as sparse vectors. This is exactly how DocVector class is implemented.

For the same example, assume we assign tokenIDs of _1_ to 4 to the four tokens in T in the alphabetical order. Then the sparse vector representation of the documents will be:

x = [ (1 => 1.58), (2 => 0.58), (3 => 1.16) ]
y = [ (2 => 0.58), (4 => 1.58) ]
z = [ (3 => 0.58) ]

Now, one of your tasks is to figure out how to implement cosine similarity function for two sparse document vectors.

5. Deadline and Submission Details

  tar cvf proj1.tar *.java
Then you can submit the file by
  give cs9318 proj1 proj1.tar

6. Marking

7. FAQ

Yes, you do. That's why you need to get Part I done and copy your implementation to the Part II.

You probably need to edit BOSS.java, ClusterBOSS.java, DocVector.java, and create your own classes to implement the clustering algorithms. In other words, the rest of the java files should *not* be modified.