COMP2521 18s2 - Assignment 2

Change log:

[04:35pm 23/Oct] : Instructions on how to submit the assignment are now available, see "Submission".
[01:50pm 23/Oct] : Just a reminder, as mentioned in the lecture and in the hints, you need to use a BST to implement your inverted index in 1B. See hints on "How to Implement Ass2 (Part-1)", as discussed in the lecture.
[01:50pm 23/Oct] : In 1C, as per the specs you need to "find pages with one or more search terms and output (to stdout) the top 30 pages in descending order of number of search terms found and then within each group, descending order of Weighted PageRank". The example for 1C is now updated to match the sample data in 1B, to make it easy to understand.
[02:40pm 11/Oct] : Due date is changed to "11:59:00 pm Thursday Week-13", as discussed in the lecture.
[08:00am 11/Oct] : In 1.B, for the expected output, earlier "url101" was at the end (incorrectly), revised to " mars url101 url25 url31 "
[08:00am 11/Oct] : In the 1.A algorithm, "iteration++;" is moved to the end of the loop body to improve readability

Objectives

Admin

Aim

In this assignment, your task is to implement simple search engines using well-known algorithms like (Weighted) PageRank and tf-idf, simplified for this assignment, of course! You should start by reading the Wikipedia entries on these topics. Later I will also discuss these topics in the lecture.

The main focus of this assignment is to build a graph structure, calculate Weighted PageRank, tf-idf, etc. and rank pages based on these values. You don't need to spend time crawling, collecting and parsing weblinks for this assignment. You will be provided with a collection of "web pages" with the required information for this assignment in an easy-to-use format. For example, each page has two sections,

Hint: If you use fscanf to read the body of a section above, you do not need to impose any restriction on line length. I suggest you should try to use this approach - use fscanf! However, if you want to read line by line using say fgets, you can assume that maximum length of a line would be 1000 characters.

Hint: You need to use a dynamic data structure(s) to handle words in a file and across files, no need to know max words beforehand.
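The two hints above can be combined in a small sketch: read whitespace-separated tokens with fscanf and store them in an array that grows on demand. The names WordList, wordListAdd, readWords and wordListFree are hypothetical helpers, not part of the assignment, and the 1000-character token bound is an assumption borrowed from the line-length limit mentioned above.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_WORD 1000  /* assumption: no token is longer than the stated line limit */

/* Growable array of heap-allocated strings (hypothetical helper type). */
typedef struct {
    char **items;
    size_t count;
    size_t capacity;
} WordList;

/* Append a copy of word, doubling the capacity when full. Returns 0 on failure. */
int wordListAdd(WordList *wl, const char *word) {
    if (wl->count == wl->capacity) {
        size_t newCap = wl->capacity ? wl->capacity * 2 : 16;
        char **tmp = realloc(wl->items, newCap * sizeof *tmp);
        if (tmp == NULL) return 0;
        wl->items = tmp;
        wl->capacity = newCap;
    }
    char *copy = malloc(strlen(word) + 1);
    if (copy == NULL) return 0;
    strcpy(copy, word);
    wl->items[wl->count++] = copy;
    return 1;
}

/* Read every whitespace-separated token from fp into wl. Returns 0 on failure. */
int readWords(FILE *fp, WordList *wl) {
    char buf[MAX_WORD + 1];
    /* %1000s skips leading whitespace (spaces and newlines) and bounds the read */
    while (fscanf(fp, "%1000s", buf) == 1) {
        if (!wordListAdd(wl, buf)) return 0;
    }
    return 1;
}

/* Release all memory held by the list. */
void wordListFree(WordList *wl) {
    for (size_t i = 0; i < wl->count; i++) free(wl->items[i]);
    free(wl->items);
    wl->items = NULL;
    wl->count = wl->capacity = 0;
}
```

Because fscanf with %s treats spaces and newlines alike, the same routine could plausibly be reused for any file whose entries are separated by whitespace.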

In Part-1: Graph structure-based search engine, you need to create a graph structure that represents the hyperlink structure of a given collection of "web pages" and, for each page (node in your graph), calculate Weighted PageRank and other graph properties. You need to create an "inverted index" that provides a list of pages for every word in a given collection of pages. Your graph-structure based search engine will use this inverted index to find pages where query term(s) appear and rank these pages using their Weighted PageRank values.

In Part-2: Content-based search engine, you need to calculate tf-idf values for each query term in a page, and rank pages based on the summation of tf-idf values for all query terms. Use the "inverted index" you created in Part-1 to locate matching pages for query terms.

In Part-3: Hybrid search engine, you need to combine both PageRank and tf-idf values in order to rank pages.

Additional files: You can submit additional supporting files, *.c and *.h, for this assignment. For example, you may implement your graph ADT in files graph.c and graph.h and submit these two files along with other required files as mentioned below.

Sample files

Part-1: Graph structure-based Search Engine

A: Calculate Weighted PageRanks

You need to write a program in the file pagerank.c that reads data from a given collection of pages in the file collection.txt and builds a graph structure using an Adjacency Matrix or Adjacency List representation. Using the algorithm described below, calculate Weighted PageRank for every url in the file collection.txt. In this file, urls are separated by one or more spaces and/or newline characters. Add the suffix .txt to a url to obtain the file name of the corresponding "web page". For example, file url24.txt contains the required information for url24.

Simplified Weighted PageRank Algorithm you need to implement (for this assignment) is shown below. Please note that the formula to calculate PR values is slightly different to the one provided in the corresponding paper (for explanation, read Damping factor).

Your program in pagerank.c will take three arguments (d - damping factor, diffPR - difference in PageRank sum, maxIterations - maximum iterations) and using the algorithm described in this section, calculate Weighted PageRank for every url.
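As a minimal sketch of handling these three arguments, the helper below converts them with strtod and strtol. The PagerankArgs struct and parseArgs function are hypothetical names used only for illustration, and the sketch assumes the arguments arrive in the order d, diffPR, maxIterations as listed above.

```c
#include <stdlib.h>

/* Hypothetical container for the three command-line parameters. */
typedef struct {
    double d;            /* damping factor */
    double diffPR;       /* threshold on the difference in PageRank sum */
    int maxIterations;   /* upper bound on the number of iterations */
} PagerankArgs;

/* Parse argv[1..3] into out. Returns 1 on success, 0 if the argument count is wrong. */
int parseArgs(int argc, char *argv[], PagerankArgs *out) {
    if (argc != 4) return 0;
    out->d = strtod(argv[1], NULL);
    out->diffPR = strtod(argv[2], NULL);
    out->maxIterations = (int)strtol(argv[3], NULL, 10);
    return 1;
}
```

A main function could call parseArgs(argc, argv, &args) first and print a usage message when it returns 0.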

Your program should output a list of urls in descending order of Weighted PageRank values (use format string "%.7f") to a file named pagerankList.txt. The list should also include out degrees (number of outgoing links) for each url, along with its Weighted PageRank value. The values in the list should be comma separated. For example, pagerankList.txt may contain the following:

Sample Files for 1A

You can download the following three sample files with expected pagerankList.txt files. For your reference, I have also included the file "log.txt" which includes values of Win, Wout, etc. Please note that you do NOT need to generate such a log file.

Use format string "%.7f" to output pagerank values. Please note that your pagerank values might be slightly different from those provided in these samples. This might be due to the way you carry out calculations. However, make sure that your pagerank values match the expected values to the first 6 decimal places. For example, if an expected value is 0.1843112, your value could be 0.184311x where x could be any digit.
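The sorting and formatting requirements for pagerankList.txt can be sketched as follows. PageInfo, cmpByPagerankDesc and writePagerankList are hypothetical names, and the field order (url, out degree, PageRank) with ", " as the separator is an assumption; check the provided sample files for the exact layout.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical per-page record collected after the PageRank computation. */
typedef struct {
    const char *url;
    int outDegree;
    double pagerank;
} PageInfo;

/* qsort comparator: larger PageRank first. */
int cmpByPagerankDesc(const void *a, const void *b) {
    const PageInfo *pa = a;
    const PageInfo *pb = b;
    if (pa->pagerank > pb->pagerank) return -1;
    if (pa->pagerank < pb->pagerank) return 1;
    return 0;
}

/* Sort pages in place and write one comma-separated line per url. */
void writePagerankList(FILE *out, PageInfo *pages, size_t n) {
    qsort(pages, n, sizeof pages[0], cmpByPagerankDesc);
    for (size_t i = 0; i < n; i++) {
        /* assumed layout: url, out degree, PageRank printed with "%.7f" */
        fprintf(out, "%s, %d, %.7f\n",
                pages[i].url, pages[i].outDegree, pages[i].pagerank);
    }
}
```

Passing the FILE pointer in, rather than opening pagerankList.txt inside the function, keeps the routine easy to try out against stdout.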

All the sample files were generated using the following command:

% pagerank  0.85  0.00001  1000

B: Inverted Index

You need to write a program in the file named inverted.c that reads data from a given collection of pages in collection.txt and generates an "inverted index" that provides a sorted list (set) of urls for every word in a given collection of pages. Before inserting words in your index, you need to "normalise" words by,

You need to use a BST to implement your inverted index in 1B; see hints on "How to Implement Ass2 (Part-1)", as discussed in the lecture.

In each sorted list (set), duplicate urls are not allowed. Your program should output this "inverted index" to a file named invertedIndex.txt. One line per word, words should be alphabetically ordered, using ascending order. Each list of urls (for a single word) should be alphabetically ordered, using ascending order.
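One possible shape for such an index is a BST keyed by word, where each node owns a sorted, duplicate-free linked list of urls. This is only a sketch: IndexNode, UrlNode, indexInsert and indexPrint are hypothetical names, word normalisation is left out, and the exact spacing of each output line is an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct UrlNode {
    char *url;
    struct UrlNode *next;
} UrlNode;

typedef struct IndexNode {
    char *word;
    UrlNode *urls;                  /* sorted ascending, no duplicates */
    struct IndexNode *left, *right;
} IndexNode;

static char *copyString(const char *s) {
    char *c = malloc(strlen(s) + 1);
    assert(c != NULL);
    strcpy(c, s);
    return c;
}

/* Insert url into a sorted list, ignoring duplicates. Returns the new head. */
UrlNode *urlListInsert(UrlNode *head, const char *url) {
    if (head == NULL || strcmp(url, head->url) < 0) {
        UrlNode *n = malloc(sizeof *n);
        assert(n != NULL);
        n->url = copyString(url);
        n->next = head;
        return n;
    }
    if (strcmp(url, head->url) == 0) return head;   /* already present */
    head->next = urlListInsert(head->next, url);
    return head;
}

/* Record that word occurs in url. Returns the (possibly new) root. */
IndexNode *indexInsert(IndexNode *root, const char *word, const char *url) {
    if (root == NULL) {
        IndexNode *n = malloc(sizeof *n);
        assert(n != NULL);
        n->word = copyString(word);
        n->urls = urlListInsert(NULL, url);
        n->left = n->right = NULL;
        return n;
    }
    int cmp = strcmp(word, root->word);
    if (cmp < 0)      root->left  = indexInsert(root->left, word, url);
    else if (cmp > 0) root->right = indexInsert(root->right, word, url);
    else              root->urls  = urlListInsert(root->urls, url);
    return root;
}

/* In-order traversal prints words in ascending alphabetical order. */
void indexPrint(const IndexNode *root, FILE *out) {
    if (root == NULL) return;
    indexPrint(root->left, out);
    fprintf(out, "%s", root->word);
    for (const UrlNode *u = root->urls; u != NULL; u = u->next)
        fprintf(out, " %s", u->url);
    fprintf(out, "\n");
    indexPrint(root->right, out);
}
```

Since the urls are compared with strcmp, url101 sorts before url25, which agrees with the " mars url101 url25 url31 " ordering mentioned in the change log.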

C: Search Engine

Write a simple search engine in file searchPagerank.c that, given search terms (words) as commandline arguments, finds pages with one or more search terms and outputs (to stdout) the top 30 pages in descending order of number of search terms found and then, within each group, in descending order of Weighted PageRank. If there are fewer than 30 matches, output all of them.

Your program must use data available in two files invertedIndex.txt and pagerankList.txt, and must derive its result from them. We will test this program independently of your solutions for "A" and "B".
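The two-level ordering described above maps naturally onto a qsort comparator. In the sketch below, SearchHit, cmpHits and rankHits are hypothetical names; it assumes the match counts and PageRank values have already been loaded from invertedIndex.txt and pagerankList.txt.

```c
#include <stdlib.h>

#define MAX_RESULTS 30

/* Hypothetical record for one candidate page. */
typedef struct {
    const char *url;
    int matches;        /* number of search terms found in this page */
    double pagerank;    /* Weighted PageRank read from pagerankList.txt */
} SearchHit;

/* More matching terms first; within equal counts, higher PageRank first. */
int cmpHits(const void *a, const void *b) {
    const SearchHit *x = a;
    const SearchHit *y = b;
    if (x->matches != y->matches) return y->matches - x->matches;
    if (x->pagerank > y->pagerank) return -1;
    if (x->pagerank < y->pagerank) return 1;
    return 0;   /* full ties are left in whatever order qsort produces */
}

/* Sort hits in place and return how many of them should be printed. */
size_t rankHits(SearchHit *hits, size_t n) {
    qsort(hits, n, sizeof hits[0], cmpHits);
    return n < MAX_RESULTS ? n : MAX_RESULTS;
}
```

The caller would then print hits[0] through hits[k - 1], one url per line, where k is the value returned by rankHits.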

Part-2: Content-based Search Engine

In this part, you need to implement a content-based search engine that uses tf-idf values of all query terms for ranking. You need to calculate tf-idf values for each query term in a page, and rank pages based on the summation of tf-idf values for all query terms. Use the "inverted index" you created in Part-1 to locate matching pages for query terms.

Read the following wikipedia page that describes how to calculate tf-idf values:

Write a content-based search engine in file searchTfIdf.c that, given search terms (words) as commandline arguments, outputs (to stdout) the top 30 pages in descending order of number of search terms found and then, within each group, in descending order of summation of tf-idf values of all search terms found. Your program must also output the corresponding summation of tf-idf along with each page, separated by a space and using format "%.6f", see example below.

If there are fewer than 30 matches, output all of them. Your program must use data available in two files invertedIndex.txt and collection.txt, and must derive its result from them. We will test this program independently of your solutions for Part-1.

Part-3: Hybrid/Meta Search Engine using Rank Aggregation

In this part, you need to combine search results (ranks) from multiple sources (say from Part-1 and Part-2) using the "Scaled Footrule Rank Aggregation" method, described below. All the required information for this method is provided below. However, if you are interested, you may want to check out this presentation on "Rank aggregation method for the web".

Let T1 and T2 be search results (ranks) obtained using two different criteria (say Part-1 and Part-2). Please note that we could use any suitable criteria, including manually generated rank lists.

A weighted bipartite graph for scaled footrule optimization (C,P,W) is defined as,

The final ranking is derived by finding possible values of position 'P' such that the scaled-footrule distance is minimum. There are many different ways to assign possible values for 'P'. In the above example P = [1, 3, 2, 5, 4]. Some other possible values are P = [1, 2, 4, 3, 5], P = [5, 2, 1, 4, 3], P = [1, 2, 3, 4, 5], etc. For n = 5, there are 5! = 120 possible alternatives. For n = 10, there would be 10! = 3,628,800 alternatives. A very simple and obviously inefficient approach could use brute-force search: generate all possible alternatives, calculate the scaled-footrule distance for each alternative, and find the alternative with the minimum scaled-footrule distance.
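The brute-force approach just described can be sketched as a recursive enumeration of all position assignments. The sketch assumes the usual scaled footrule cost, where placing item c at position p costs the sum, over every list T_i that contains c, of |rank of c in T_i / |T_i| - p / n|, with n the number of items in the union of all lists; compare this against the definition given for this assignment before using it. RankData, placementCost and bruteForceFootrule are hypothetical names, and this is the simple baseline, not a "smart" algorithm.

```c
#include <assert.h>

#define MAX_ITEMS 10   /* brute force is only practical for small n */
#define MAX_LISTS 8

/* Hypothetical in-memory form of the rank files. */
typedef struct {
    int numItems;                    /* n = size of the union C of all lists */
    int numLists;
    int listSize[MAX_LISTS];         /* |T_i| */
    int pos[MAX_LISTS][MAX_ITEMS];   /* 1-based rank of item c in T_i, 0 if absent */
} RankData;

static double absDiff(double a, double b) { return a > b ? a - b : b - a; }

/* Assumed cost W(c, p) of placing item c at final position p (1-based). */
double placementCost(const RankData *rd, int c, int p) {
    double cost = 0.0;
    for (int i = 0; i < rd->numLists; i++) {
        if (rd->pos[i][c] == 0) continue;   /* item c does not appear in T_i */
        cost += absDiff((double)rd->pos[i][c] / rd->listSize[i],
                        (double)p / rd->numItems);
    }
    return cost;
}

/* Try every unused position for item c, then recurse on item c + 1. */
static void search(const RankData *rd, int c, int used[], int cur[],
                   double soFar, int best[], double *bestCost) {
    if (c == rd->numItems) {
        if (soFar < *bestCost) {
            *bestCost = soFar;
            for (int k = 0; k < rd->numItems; k++) best[k] = cur[k];
        }
        return;
    }
    for (int p = 1; p <= rd->numItems; p++) {
        if (used[p]) continue;
        used[p] = 1;
        cur[c] = p;
        search(rd, c + 1, used, cur, soFar + placementCost(rd, c, p), best, bestCost);
        used[p] = 0;
    }
}

/* Returns the minimum total distance; best[c] receives the position of item c. */
double bruteForceFootrule(const RankData *rd, int best[]) {
    assert(rd->numItems <= MAX_ITEMS && rd->numLists <= MAX_LISTS);
    int used[MAX_ITEMS + 1] = {0};
    int cur[MAX_ITEMS] = {0};
    double bestCost = 1e300;   /* larger than any achievable distance */
    search(rd, 0, used, cur, 0.0, best, &bestCost);
    return bestCost;
}
```

The enumeration visits all n! assignments, so its running time grows factorially with n. When several assignments share the minimum distance, this sketch keeps the first one it encounters.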

If you use such a brute-force search, you will receive a maximum of 65% for Part-3. However, you will be awarded 100% for Part-3 if you implement a "smart" algorithm that avoids generating unnecessary alternatives in the process of finding the minimum scaled-footrule distance. Please document your algorithm such that your tutor can easily understand your logic, and clearly outline how you plan to reduce the search space, otherwise you will not be awarded marks for your "smart" algorithm! Yes, it's only 35% of Part-3 marks, but if you try it, you will find it very challenging and rewarding.

Write a program scaledFootrule.c that aggregates ranks from files given as commandline arguments, and outputs an aggregated rank list with minimum scaled footrule distance.

The following command will read ranks from files "rankA.txt" and "rankD.txt" and outputs minimum scaled footrule distance (using format %.6f) on the first line, followed by the corresponding aggregated rank list.

For the above example, there are two possible answers, with minimum distance of 1.400000.

Two possible values of P with minimum distance are:
C = [url1, url2, url3, url4, url5] P = [1, 4, 2, 5, 3] and P = [1, 5, 2, 4, 3]

You may select any one of the possible values of P that has minimum distance, so there could be multiple possible answers. Note that you need to output only one such list.

One possible answer for the above example, for P = [1, 4, 2, 5, 3] :

    1.400000
    url1
    url3
    url5
    url2
    url4

Another possible answer for the above example, P = [1, 5, 2, 4, 3] :

    1.400000
    url1
    url3
    url5
    url4
    url2

Please note that your program should also be able to handle multiple rank files, for example:

Assignment-2 Group Creation

Submission

Additional files: You can submit additional supporting files, *.c and *.h, for this assignment.

IMPORTANT: Make sure that your additional files (*.c) DO NOT have "main" function.

For example, you may implement your graph ADT in files graph.c and graph.h and submit these two files along with other required files as mentioned below. However, make sure that these files do not have a "main" function.

I explain below how we will test your submission, hopefully this will answer all of your questions.

You need to submit the following files, along with your supporting files (*.c and *.h):

Now say we want to mark your pagerank.c program. The auto-marking program will take all your supporting files (other *.h and *.c files), along with pagerank.c, and execute the following command to generate an executable file called, say, pagerank. Note that the other four files from the above list (inverted.c, searchPagerank.c, searchTfIdf.c and scaledFootrule.c) will be removed from the directory:

So we will not use your Makefile (if any). The above command will generate object files from your supporting files and the file to be tested (say pagerank.c), link these object files, and generate an executable file (pagerank in the above example). Again, please make sure that you DO NOT have a main function in your supporting files (other *.c files you submit).

We will use a similar approach to generate the other four executables (from inverted.c, searchPagerank.c, searchTfIdf.c and scaledFootrule.c).

How to Submit

Go to the following page, select the tab "Make Submission", select "Browse" to select all the files you want to submit, and submit using the "Submit" button. The submission system will try to compile each required file, and report the outcome (ok or error). Please check the output, and correct any errors. If you do not submit a file(s) for a task(s), it will report it as an error(s).

You can now submit this assignment, click on "Make Submission" tab, and follow the instructions.

Plagiarism

This is a group assignment. You are not allowed to use code developed by persons other than in your group. In particular, it is not permitted to exchange code or pseudocode between groups. You are allowed to use code from the course material (for example, available as part of the labs, lectures and tutorials). If you use code from the course material, please clearly acknowledge it by including a comment(s) in your file. If you have questions about the assignment, ask your tutor.

Before submitting any work you should read and understand the sub section named Plagiarism in the section "Student Conduct" in the course outline. We regard unacknowledged copying of material, in whole or part, as an extremely serious offence. For further information, see the course outline.

Marks	20 marks (14 marks towards total course mark)
Group	This assignment is completed in groups of two, based on your current lab group.
Due	11:59:00 pm Thursday Week-13
Late Penalty	2 marks per day off the ceiling. The last day to submit this assignment is 5pm Friday Week-13, with the late penalty applied.
Submit	Read instructions in the "Submission" section.

COMP2521: Assignment 2 Simple Search Engines