24 May 09: Updated the spec, cached results, output and debugging info (Secs 4.1, 4.2.1, and 4.3). Also added marking criteria and FAQ (Secs 6 and 7).
21 May 09: Add Sec 4.2.1, 4.3, and first paragraph in Sec 4.3.1. See Sec 4.2.1 to see how to fix a problem with the sample output log files given last time. See Sec 4.3. for a link to more detailed logging files.
Get familiar with text/Web mining
Implement and evaluate several clustering algorithms learned in the course.
Last modified: Thu Apr 23 15:54:36 EST 2009
Go to http://developer.yahoo.com/search/boss/ and click the yellow “Get an App ID” button on the right hand side.
You will get an API ID and please save it to a plain text file named APIKEY.
Since all the CSE machines are using Java 1.6, we require you to use Java 1.6 if you choose to run the program on your own machine.
The skeleton code for Part I can be found at here. It is adapted from http://frankmccown.blogspot.com/2008/07/yahoos-new-search-boss-api.html.
json.jar: the java library to parse and process data in json format. See http://www.json.org/java/ for more information.
BOSS.java: the main class; you need to implement the query function.
QueryResult.java: encapsulates each Web query result
QueryResults.java: an array of QueryResult objects
Utils.java: static helper functions.
To compile the provided source code, you need toc either add json.jar to the CLASSPATH or specify it in the command line. E.g.,
javac -classpath "%CLASSPATH%;./json.jar" *.java
Running the program
Without a proxy: just type (under Windows)
java -classpath "%CLASSPATH%;./json.jar" BOSS australia 50(under Linux)
java -classpath "${CLASSPATH}:./json.jar" BOSS australia 50
Without the CSE proxy:
java -Dhttp.proxyHost=chopin.cse.unsw.edu.au -Dhttp.proxyPort=3128 -classpath "%CLASSPATH%;./json.jar" BOSS australia 50
If your query contains more than one keyword, wrap them inside the quotation mark. E.g.,
java -classpath "%CLASSPATH%;./json.jar" BOSS "xml xquery" 50
There is a simple caching mechanism implemented in the source code such that you can run and test your program even without an Internet connection (or in the unlikely situation that you've run out of the BOSS API quota for the day).
The query results of every (live) query you used will be stored in XML format under the ./cache directory. A simple filename conversion rule is used which converts space to '_' char.
You may prefix the query using the keyword CACHE: to instruct the program to use cached results. E.g.,
java -classpath "%CLASSPATH%;./json.jar" BOSS CACHE:australia 50will just reuse the cached result (check out the file at ./cache/australia.xml).
Several example cached query files are included.
You need to implement the query function such that you can fetch up to 1000 results per query. Read the block comments in BOSS.java for more details.
Last modified: Sun May 24 13:29:36 EST 2009
THIS SECTION HAS BEEN UPDATED: Cached results and sample output in the original skeleton2.zip file were removed.
The skeleton code for Part II can be found at here.
DocVector: encodes the text content of a text document. The central data structure within the class are: docID and vec, the latter being an array of (tokenID, weight) sorted on tokenID.
TokenDict: a data structure to keep the bi-directional mapping between tokens and their tokenIDs.
TokenPreprocessor and Stemmer: helper classes to preprocess the tokens, including stemming (i.e., converts reading and reads to the same token read), stop-word removal (remove high-frequency word in English; see stop_words.txt).
Cluster: represents a cluster and its memebers.
Implement the cosine similarity function (i.e., calcCosineSimilarity function in the DocVector class).
Implement the three hierarchical clustering algorithms (i.e., doHACClustering function in the ClusterBOSS class).
Note that
Note that you are not supposed to modify other exiting functions in the provided source code. You may create new classes in the project though.
If you are using the BOSS.java file from Part I, you need to comment out the first println in BOSS.queryWeb method.
The final project, once implemented, can be run by, e.g.,
java -classpath "%CLASSPATH%;./json.jar" ClusterBOSS australia 50 10 avg
THIS SECTION HAS BEEN UPDATED.
Sample output using the latest cached queries can be found here. The order of your cluster output may vary but the cluster membership should be the same.
THIS SECTION HAS BEEN UPDATED.
You can find some debug information and datasets here. Cached query results can be found in the debug-info-new/cache-10 and debug-info-new/cache-50 directory. You can copy them to ./cache to use them.
Detailed logs can be found in the log files. These log files were genearted using the 10 cached query results provided in the debug-info-new.tar.bz2 file (note that the latest query results from the Yahoo! BOSS API *will* produce (slightly) different results).
The format of the log file should be quite self-explanatory. The (cosine) similarity matrix is dumped before and after each cluster merge step. The matrix is printed in a similar way as the lecture notes do, i.e., only the upper triangular part (excluding the diagonal) is printed. For example, in the result for australia and average link, the first cell in the first matrix is 0.0726, meaning that the first QueryResult and the second one has a cosine similarity value of 0.0726.
Note that we replace the query leopard with powerpoint, as the complete-link clustering results for the former might be non-unique (all cells in the matrix is 0).
In this project, we will perform hierarchical clustering based on a similarity matrix. The similarity between two documents is measured by the cosine similarity.
Cosine similarity is a widely-used similarity function to evaluate the similarity between two documents (or to be precise, the vector representations of the documents).
In our project, we use use the title and the snippet of the query result as a document (with different weights). We tokenize the document into tokens that are separated by any one of the ClusterBOSS.SEPARATORS. We also perform stemming and stop-word removal. If we do this for all documents, we can get the set of unique tokens (denoted as T). Then each document can be represented a |T|-dimensional vector, where each dimension correponds to a token t in T, and the value represents how many times the token t appears in the document. Since tokens are not equally important, we adopt a well-known token weighting scheme as follows: for each token t, we calculate how many documents in the collection containing it (denoted as df(t)). The weight of the token will be multiplied by idf(t) = log2 (N/df(t)), where N stands for the total number of document in the collection.
For example, given the following two documents:
x = "aa bb cc cc" y = "bb dd" z = "cc"
T will be {aa, bb, cc, dd }. df(aa) = df(dd) = 1 and df(bb) = df(cc) = 2. Therefore, idf(aa) = idf(dd) = log2 (3/1) = 1.58, while idf(bb) = idf(cc) = log2 (3/2) = 0.58. The document vector representation of these documents will be:
x = [ 1.58, 0.58, 1.16, 0.00 ] y = [ 0.00, 0.58, 0.00, 1.58 ] z = [ 0.00, 0.00, 0.58, 0.00 ]
Cosine similarity between two |T|-dimensional vectors is defined as:
cosine(x, y) = A / B
A = sum_i (x[i] * y[i])
B = sqrt{ sum_i (x[i]^2) } * sqrt{ sum_i (y[i]^2) }
In our example,
cosine(x, z) = ( 1.16 * 0.58 ) / ( sqrt{1.58^2 + 0.58^2 + 1.16^2} * sqrt{ 0.58^2 } )
= 0.67 / (2.04 * 0.58 )
= 0.57
For large document collections, there will be a large number of unique tokens, i.e., T is usually in the order of tens of thousands. If the average length of documents is only in the thousands, most of the values in the document vectors will be 0. Therefore, it is better to encode the document vectors as sparse vectors. This is exactly how DocVector class is implemented.
For the same example, assume we assign tokenIDs of _1_ to 4 to the four tokens in T in the alphabetical order. Then the sparse vector representation of the documents will be:
x = [ (1 => 1.58), (2 => 0.58), (3 => 1.16) ] y = [ (2 => 0.58), (4 => 1.58) ] z = [ (3 => 0.58) ]
Now, one of your tasks is to figure out how to implement cosine similarity function for two sparse document vectors.
This is an individual assignment.
The deadline is 23:59 5 Jun, 2009 (Fri).
You shouls submit your java files via the give system. You need to archive all your source file in a file named proj1.tar by using the following command under Linux:
tar cvf proj1.tar *.javaThen you can submit the file by
give cs9318 proj1 proj1.tar
To facilitate marking, please include your name and student ID in the source code.
Regarding your submission:
Please make sure all the java .les can be compiled correctly under CSE machines. You will receive 0 mark if your program cannot compile.
Your program should not use an excessive amount of memory.
If your program takes too much time to run to its completion, we will have to kill the process.
Please respect the output format exactly, and do not output other debugging information. Remember that the submissions will be automatically marked!
Late submission policy: we use soft penalty at the rate of -10% if late by one day, and -20% per day onwards.
Copying assignments is unacceptable.
The project will be auto-marked. So please make sure (1) your project can be successfully compiled with JDK1.6 on CSE servers; (2) your output has the same format as specified (You may refer to the sample output for details; the clustering results should be the same modulo the order among clusters or urls in the same cluster).
Common issues in the past are: (1) last minute change to the code (to fix a small bug) and forgot to test if the program compiles or not; (2) forgot to turn off debugging output.
We will test at least two cases: (1) we will test whether your program can correctly fetch a designated number of results (more than 50) from the Yahoo! BOSS service; (2) we will use one of the provided cache results.
We will not test corner cases such as requesting the program to produce more clusters than the number of results. That being said, it is still a good habit to guard against such erroneous input in your program.
We won't test your program using a very large number of results, but your program should be able to cope with ~100 results in reasonable amount of time.
Do I need to handle results of more than 50?
Yes, you do. That's why you need to get Part I done and copy your implementation to the Part II.
What are the files that I need to edit?
You probably need to edit BOSS.java, ClusterBOSS.java, DocVector.java, and create your own classes to implement the clustering algorithms. In other words, the rest of the java files should *not* be modified.