COMP2521: Assignment 1
Information Retrieval

[The specification may change, Please check the change log on this page.]

Change log:

Objectives

Admin

Marks 10 marks (8 marks towards total course mark)
Individual Assignment This is an individual assignment.
Due 09:00am Thursday of Week-07
Late
Penalty
1 marks per day off the ceiling.
Last day to submit this assignment is 9am Monday of Week-08, of course with late penalty.
Submit Instructions on how to submit will be available later.

Aim

In this assignment, your task is to implement an information retrieval system using well known term-weighting scheme called "tf-idf". You should start by reading the wikipedia entries on these topics. The following wikipedia page describes how to calculate tf-idf values. Later I will also discuss these topics in the lecture.

For this assignment,

For clarifications, see the example below.

 

Download


Part-1: Inverted Index, using BST

You need to implement the required functions in the file invertedIndex.c that reads data from a given collection of files in collection.txt (see simple example files) and generates an "inverted index" that provides a sorted list (set) of filenames for every word in a given collection of files. You need to use binary search ADT to implement your inverted index. For more information on this, please see the following hints:

We will also discuss the above hints in the lecture. Please note that each list of filenames (for a single word) in your inverted index should be alphabetically ordered, using ascending order, and importantly duplicate filenames are not allowed.

Hints: You should use fscanf to read words from a file, you do not need to impose any restriction on a line length. You need to use a dynamic data structure(s) to handle words in a file and across files, no need to know max words beforehand. You can assume a maximum word length of 100.

Normalise words: Before inserting words in your inverted index, you need to "normalise" words by,

Please note that if you have multiple punctuation marks at the end of a word, you need to remove only the last punctuation mark. You also don't need to remove the above punctuation marks if they appear in the middle or at the start of a word.

Importantly, you need to modify a given string, do NOT create another copy. You can use the functions tolower and strlen. You may find the following links useful:

For example,
wordnormalised word
.Net.net
smh.com.ausmh.com.au
Sydney!sydney!
why?why
order.order
text;text
abc.net.au.abc.net.au
sydney???sydney??

You need to implement the following functions in the file invertedIndex.c. The API file invertedIndex.h is provided, you must implement the required functions in your invertedIndex.c. We will individually test these functions for awarding marks.

For example, invertedIndex.txt may look like the following. Please note that the following example is not related to any sample files provided. The example below offers formatting information.

    design file11.txt file21.txt
    mars nasa.txt news1.txt
    weather info31.txt nasa.txt news1.txt  
    


Part-2: Information Retrieval Engine

In this part, you need to implement an information retrieval function that finds files (documents) with one or more query terms, and uses summation of tf-idf values of all matching query terms (words) for ranking such files (documents). You need to calculate tf-idf values for each matching query term in a file (document), and rank files (documents) based on the summation of tf-idf values for all matching query terms present in that file. Use "inverted index" you created in Part-1 to locate files with one or more query terms, and to calculate the required tf-idf values for such files.

Implement the following information retrieval function retrieve in the file invertedIndex.c that given search terms (words), returns an ordered list of type TfIdfList, where each node contains filename and the corresponding summation of tf-idf values for given searchWords. The list must be in descending order of summation of tf-idf values. See invertedIndex.h for the type definition of TfIdfList. You also need to implement another function calculateTfIdf.

You need to implement the following two functions in the file invertedIndex.c. Total number of documents D is provided as an argument in both the functions.


Testing

Download the following zip file, read instructions provided at the top of the file test_Ass1.c to run simple tests for the first four functions. Similarly, tests for the last function and more additional tests for all five functions will be available over the weekend.


Marking Scheme

This assignment will be marked out of 10 marks. There will be 8 marks for the correctness. We will test each function separately for its correctness, and also all of them together.

There will be 2 marks for "Style and Complexity of code". Regarding time complexity, considering the specs is already asking you to use specific ADTs (BST and ordered lists), the corresponding time complexity will apply, O(n) in both the cases. However, even if your time complexity is O(n), if your code is too complex, your tutor may provide you some feedback and deduct marks.

For example, you should consider separate ADTs (*.c and *.h files) for the structures like InvertedIndexBST, FileList, TfIdfList, etc. Each of these ADTs need to support operations required for this assignment (only).

For "Style", we will consider your program layout, meaningful variable names, suitable comments, etc.


Submission

Additional files: You can submit additional supporting files, *.c and *.h, for this assignment.

IMPORTANT: Make sure that your additional files (*.c) DO NOT have "main" function.

For example, you may implement your binary search tree ADT in files myBST.c and myBST.h and submit these two files along with other required file invertedIndex.c. However, make sure that files you submit do not have "main" function.

I explain below how we will test your submission, hopefully this will answer most of your questions.

You need to submit the following file, along with your supporting files (*.c and *.h):

Now say we want to mark your functions from your submitted file invertedIndex.c. The auto marking program will take all your supporting files (other *.h and *.c) files, along with invertedIndex.c and execute the following command to generate executable file (say called invertedIndex).

% gcc -Wall -Werror -lm -std=c11  *.c  -o invertedIndex

So we will not use your Makefile (if any). The above command will generate object files from your supporting files and the file to be tested invertedIndex.c, links these object files and generates executable file (invertedIndex in the above example). Again, please make sure that you DO NOT have main function in your supporting files (other *.c files you submit).

How to Submit

Go to the following page, select the tab "Make Submission", select "Browse" to select all the files you want to submit and submit ising "Submit" button. The submission system will try to compile each required file, and report the outcome (ok or error). Please see the output, and correct any error. If you do not submit a file(s) for a task(s), it will report it as an error(s).

You can now submit this assignment, click on "Make Submission" tab, and follow the instructions.


Plagiarism

This is an individual assignment. You are not allowed to use code developed by persons other than you. In particular, it is not permitted to exchange code or pseudocode between students. You are allowed to use code from the course material (for example, available as part of the labs, lectures and tutorials). If you use code from the course material, please clearly acknowledge it by including a comment(s) in your file. If you have questions about the assignment, ask your tutor.

Before submitting any work you should read and understand the sub section named Plagiarism in the section "Student Conduct" in the course outline. We regard unacknowledged copying of material, in whole or part, as an extremely serious offence. For further information, see the course outline.


-- end --