COMP9814 11s1 NLP Assignment

As for the COMP9414 NLP Assignment, plus the following:

Using the tagged corpus at http://www.cse.unsw.edu.au/~billw/cs9414/notes/corpora/austen_tagged as your test data, write a program (in Perl, C, or Java) to compute lexical category and bigram statistics.

Use your program to produce a report that lists the lexical categories (tags) in alphabetical order, each with the number of times it was found in the corpus. After all the lexical category counts, the report should list the bigram counts, also in alphabetical order. E.g.

...
VERB: 273
...
AUX ADV: 23
...
(The figures above are fictitious.)

It is OK to "hard-wire" the tagset into your program, if you wish to do it that way.
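
As a rough illustration of the counting and reporting step only, here is a minimal sketch in Java. It assumes the tags have already been extracted from the corpus into a list, in corpus order; reading the corpus file is left out because its format is not described here. The class name TagStats, its method names, and the sample tags in main are all made up for the example, and whether bigrams should be counted across sentence boundaries is a decision this sketch does not address.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: TagStats and its methods are illustrative names,
// not part of the assignment specification.
public class TagStats {

    // Add 1 to the count stored under key (starting from 1 if absent).
    static void bump(Map<String, Integer> counts, String key) {
        counts.merge(key, 1, Integer::sum);
    }

    // Count how often each tag occurs. A TreeMap keeps its keys sorted,
    // so iterating over it yields the tags in alphabetical order.
    static Map<String, Integer> tagCounts(List<String> tags) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String tag : tags) {
            bump(counts, tag);
        }
        return counts;
    }

    // Count each pair of adjacent tags, keyed as "FIRST SECOND".
    // Assumption: every adjacent pair in the list counts as a bigram.
    static Map<String, Integer> bigramCounts(List<String> tags) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 1; i < tags.size(); i++) {
            bump(counts, tags.get(i - 1) + " " + tags.get(i));
        }
        return counts;
    }

    // Build the report text: all tag counts first, then all bigram counts.
    static String report(List<String> tags) {
        StringBuilder sb = new StringBuilder();
        tagCounts(tags).forEach((k, v) -> sb.append(k).append(": ").append(v).append('\n'));
        bigramCounts(tags).forEach((k, v) -> sb.append(k).append(": ").append(v).append('\n'));
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up tag sequence, for demonstration only.
        List<String> tags = List.of("ART", "N", "AUX", "ADV", "VERB");
        System.out.print(report(tags));
    }
}
```

Using a sorted map means no separate sorting pass is needed before printing. A hard-wired tagset with fixed-size arrays of counters would work equally well, particularly in C.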

Submission

Submit your program using the UNIX command:

give cs9414 extnlp statscode.c statsreport.txt

where statscode.c (or statscode.java or statscode.pl or ...) is your program and statsreport.txt is the report it produced. You can include a Makefile if you feel it necessary. If you wish to write your solution in another language, please send a request to billw.

Note that COMP9814 students must make two submissions for this final assignment: you also need to submit the COMP9414 part using

give cs9414 nlp nlp-soln.pl

Due date: 11.30pm on Friday of week 13 (Friday 3 June, 2011).

Don't forget to check that your code works on a CSE machine, in exactly the form that you will be handing it in, immediately before submission. Even if you just add an extra line of comments, re-test before submission!

The work you submit must be your own, except where you acknowledge another source.

This specification © UNSW & Bill Wilson, 2011