Assignment 1

DOM Parsing for XML Documents
Due Date: March 25th, 2009


There are two main models for parsing XML documents: DOM and SAX.
In this assignment you are asked to write a program which uses the DOM to traverse the nodes of the input XML file. Your program should provide the following functions. It should take a command-line option, which is one of [s|p1|p2|p3] (with "s" being the default), and as its argument the file name of an XML document.

Note: Do NOT worry about a DTD that could be present in the XML file. Your program will be tested with DTD-less files only!
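
The command-line handling described above could be sketched in Java roughly as follows. This is only an illustration, not a reference solution: the class name DomCatArgs and its helper methods are made up for this example, the dash-prefixed option spelling is taken from the example runs below, and the parser is obtained through the standard JAXP entry point (javax.xml.parsers), which Xerces-J implements.

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper class; the name and methods are not prescribed by the assignment.
class DomCatArgs {
    /** Returns "s", "p1", "p2" or "p3"; "s" is the default when no option is given. */
    static String parseMode(String[] args) {
        for (String a : args) {
            if (a.equals("-s") || a.equals("-p1") || a.equals("-p2") || a.equals("-p3")) {
                return a.substring(1);
            }
        }
        return "s";
    }

    /** Returns the first argument that does not start with '-', or null if there is none. */
    static String parseFileName(String[] args) {
        for (String a : args) {
            if (!a.startsWith("-")) {
                return a;
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        String mode = parseMode(args);
        String fileName = parseFileName(args);
        if (fileName == null) {
            System.err.println("usage: java DomCatArgs [-s|-p1|-p2|-p3] file.xml");
            return;
        }
        // Build the DOM tree for the whole document in memory.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File(fileName));
        System.out.println("mode=" + mode + ", root=" + doc.getDocumentElement().getNodeName());
    }
}
```

The real program would dispatch on the returned mode instead of printing it.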

Write your program in C/C++ or Java, and use the corresponding Xerces DOM parser libraries.
Submit your code using one of the following commands on the CSE system:
% give cs4317 ass1 filename.cpp
% give cs4317 ass1 filename.java
where filename is the (arbitrary) name of your source file.
Java programs MUST compile (on CSE machines) with javac filename.java
(some machines need javac -cp /usr/share/java/xerces.jar filename.java (or xercesImpl.jar))
C++ programs MUST compile (on CSE machines) with g++ -lxerces-c filename.cpp

Example Runs

Assume you have a file "test.xml" which consists of the following XML snippet:

<book isbn="1-2345-6789-0" year="1994">
<title>TCP/IP Illustrated</title>
<author><last>Stevens</last><first>John</first></author>
<publisher>Addison-Wesley</publisher>
<price currency="USD">65.95</price>
</book>

Your program should behave as follows (assuming the executable is named "DOMcat"):

> DOMcat test.xml
Total number of nodes: 15
Number of element nodes: 7
Number of attribute nodes: 3
Number of text nodes: 5
Number of empty text nodes: 5
Maximal height: 4
Maximal length of sibling list: 4
Number of distinct element names: 7
Number of distinct attribute names: 3
author, 1, {first, last}, {first, last}
book, 1, {author, price, publisher, title}, {author, first, last, price, publisher, title}
first, 1, {}, {}
last, 1, {}, {}
price, 1, {}, {}
publisher, 1, {}, {}
title, 1, {}, {}
> DOMcat -p1 test.xml
<book isbn="1-2345-6789-0" year="1994"><title>TCP/IP Illustrated</title><author><last>Stevens</last><first>John</first></author><publisher>Addison-Wesley</publisher><price currency="USD">65.95</price></book>
> DOMcat -p2 test.xml
<book isbn="1-2345-6789-0" year="1994">
<title>
TCP/IP Illustrated
</title>
<author>
<last>
Stevens
</last>
<first>
John
</first>
</author>
<publisher>
Addison-Wesley
</publisher>
<price currency="USD">
65.95
</price>
</book>
> DOMcat -p3 test.xml
<book isbn="1-2345-6789-0" year="1994">
	<title>
		TCP/IP Illustrated
	</title>
	<author>
		<last>
			Stevens
		</last>
		<first>
			John
		</first>
	</author>
	<publisher>
		Addison-Wesley
	</publisher>
	<price currency="USD">
		65.95
	</price>
</book>
As mentioned above, it is perfectly OK if your program always prints "<?xml..." as the first line for the p1, p2, and p3 options.
This might be easier for you to program. Thus, these test runs are also correct.
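
One possible way to produce the tab-indented p3 layout shown above is a recursive walk over the DOM tree. The sketch below is an assumption-laden illustration rather than the expected solution: the class IndentedPrintSketch and its methods are invented for this example, whitespace-only text nodes are skipped because they do not appear in the sample output, and CDATA nodes and character escaping are deliberately left out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

// Hypothetical class name; only elements, attributes and text nodes are handled.
class IndentedPrintSketch {
    /** Parses an XML string; checked exceptions are wrapped to keep callers short. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Appends the subtree rooted at node, indented with one tab per level. */
    static void render(Node node, int depth, StringBuilder out) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            indent.append('\t');
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.append(indent).append('<').append(node.getNodeName());
            NamedNodeMap attrs = node.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node a = attrs.item(i);
                out.append(' ').append(a.getNodeName())
                   .append("=\"").append(a.getNodeValue()).append('"');
            }
            out.append(">\n");
            for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                render(c, depth + 1, out);
            }
            out.append(indent).append("</").append(node.getNodeName()).append(">\n");
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue();
            if (!text.trim().isEmpty()) {   // skip whitespace-only text nodes
                out.append(indent).append(text).append('\n');
            }
        }
    }

    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        render(parse("<a x=\"1\"><b>hi</b>\n</a>").getDocumentElement(), 0, out);
        System.out.print(out);
    }
}
```

Passing a constant depth of 0 to every recursive call would give the unindented p2 layout with the same traversal.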

Node Types

As mentioned above, when counting the total number of nodes you should only consider the following six types of node:

Note that CDATA nodes can appear within text content, as the following two examples t1.xml and t2.xml show. As example t3.xml shows, a CDATA node can also replace a text node: if the text consists solely of a CDATA node, then there is no surrounding text node, but ONLY a CDATA node. This means that when looking for text within the DOM you might have to consider CDATA nodes.

> cat t1.xml
<?xml version="1.0"?><book><![CDATA[Bla<<>>>>>blub]]>b</book>
> java DOMTravel -s t1.xml
Total number of nodes: 3
Number of element nodes: 1
Number of attribute nodes: 0
Number of text nodes: 1
Number of empty text nodes: 0
Maximal height: 2
Maximal length of sibling list: 1
Number of distinct element names: 1
Number of distinct attribute names: 0
book, 1, {}, {}
> cat t2.xml
<?xml version="1.0"?><book>b<![CDATA[Bla<<>>>>>blub]]>b</book>
> java DOMTravel -s t2.xml
Total number of nodes: 4
Number of element nodes: 1
Number of attribute nodes: 0
Number of text nodes: 2
Number of empty text nodes: 0
Maximal height: 2
Maximal length of sibling list: 2
Number of distinct element names: 1
Number of distinct attribute names: 0
book, 1, {}, {}
> cat t3.xml
<?xml version="1.0"?><book><![CDATA[Bla<<>>>>>blub]]></book>
> java DOMTravel -s t3.xml
Total number of nodes: 2
Number of element nodes: 1
Number of attribute nodes: 0
Number of text nodes: 0
Number of empty text nodes: 0
Maximal height: 1
Maximal length of sibling list: 0
Number of distinct element names: 1
Number of distinct attribute names: 0
book, 1, {}, {}
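
The distinction between text and CDATA nodes in the runs above can be illustrated with a small counting sketch. It rests on several assumptions: the class NodeCountSketch and its fields are invented here, "empty" text nodes are taken to mean whitespace-only text nodes (inferred from the sample output), and the parser is the default non-coalescing JAXP DocumentBuilder, which keeps CDATA sections as separate nodes. The sketch checks getNodeType() rather than using instanceof, because in the Java DOM binding CDATASection is a sub-interface of Text.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical class; counts a few node kinds to show how CDATA differs from text.
class NodeCountSketch {
    int elements, attributes, texts, emptyTexts, cdatas;

    /** Parses an XML string; checked exceptions are wrapped to keep callers short. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Counts the node itself, then recurses into its children. */
    void visit(Node node) {
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE:
                elements++;
                // Attributes are not children in the DOM; they hang off the element.
                attributes += node.getAttributes().getLength();
                break;
            case Node.TEXT_NODE:
                if (node.getNodeValue().trim().isEmpty()) {
                    emptyTexts++;   // assumption: "empty" means whitespace-only
                } else {
                    texts++;
                }
                break;
            case Node.CDATA_SECTION_NODE:
                cdatas++;
                break;
            default:
                break;              // document node and other types are not counted here
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            visit(c);
        }
    }

    public static void main(String[] args) {
        NodeCountSketch counts = new NodeCountSketch();
        counts.visit(parse("<book>b<![CDATA[Bla<<>>>>>blub]]>b</book>"));
        System.out.println(counts.elements + " element(s), " + counts.texts
                + " text node(s), " + counts.cdatas + " CDATA node(s)");
    }
}
```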

Runs on Larger Files

The following 2 XML files are taken from the "Testbed" zip file from this page.

For this xml file you should get the following for the "s" option:

Total number of nodes: 291
Number of element nodes: 146
Number of attribute nodes: 29
Number of text nodes: 116
Number of empty text nodes: 146
Maximal height: 4
Maximal length of sibling list: 29
Number of distinct element names: 6
Number of distinct attribute names: 1
Course, 29, {Day, Instructor, Place, Time}, {Day, Instructor, Place, Time}
Day, 29, {}, {}
Instructor, 29, {}, {}
Place, 29, {}, {}
Time, 29, {}, {}
arizona, 1, {Course}, {Course, Day, Instructor, Place, Time}
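
The per-element lines of the "s" output above appear to list, for each element name, its number of occurrences, the set of child element names, and the set of descendant element names, sorted in plain string order (uppercase before lowercase). Under that reading, which is inferred from the examples, such lines could be collected as sketched below. The class ElementSummarySketch and its methods are hypothetical names, and the code uses Java 8 collection methods.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical class; TreeMap and TreeSet keep names in natural String order.
class ElementSummarySketch {
    final Map<String, Integer> counts = new TreeMap<>();
    final Map<String, Set<String>> children = new TreeMap<>();
    final Map<String, Set<String>> descendants = new TreeMap<>();

    /** Parses an XML string; checked exceptions are wrapped to keep callers short. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Records e and its subtree; returns the element names found strictly below e. */
    Set<String> visit(Element e) {
        String name = e.getNodeName();
        counts.merge(name, 1, Integer::sum);
        Set<String> kids = children.computeIfAbsent(name, k -> new TreeSet<>());
        Set<String> below = new TreeSet<>();
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) {
                kids.add(c.getNodeName());
                below.add(c.getNodeName());
                below.addAll(visit((Element) c));
            }
        }
        // Union over all occurrences of the same element name.
        descendants.computeIfAbsent(name, k -> new TreeSet<>()).addAll(below);
        return below;
    }

    /** Formats one line per element name: name, count, {children}, {descendants}. */
    List<String> lines() {
        List<String> out = new ArrayList<>();
        for (String name : counts.keySet()) {
            out.add(name + ", " + counts.get(name)
                    + ", {" + String.join(", ", children.get(name)) + "}"
                    + ", {" + String.join(", ", descendants.get(name)) + "}");
        }
        return out;
    }

    public static void main(String[] args) {
        ElementSummarySketch s = new ElementSummarySketch();
        s.visit(parse("<book><title>T</title><author><last>S</last><first>J</first></author></book>")
                .getDocumentElement());
        for (String line : s.lines()) {
            System.out.println(line);
        }
    }
}
```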

For the p1, p2, and p3 options you should get p1, p2, and p3 (download and view these using a text editor).

For this xml file you should get the following for the "s" option:

Total number of nodes: 433
Number of element nodes: 244
Number of attribute nodes: 27
Number of text nodes: 162
Number of empty text nodes: 244
Maximal height: 6
Maximal length of sibling list: 27
Number of distinct element names: 10
Number of distinct attribute names: 1
bu, 1, {course}, {college, course, courseInfo, days, instructor, room, schedule, time, title}
college, 27, {}, {}
course, 27, {college, courseInfo}, {college, courseInfo, days, instructor, room, schedule, time, title}
courseInfo, 27, {instructor, schedule, title}, {days, instructor, room, schedule, time, title}
days, 27, {}, {}
instructor, 27, {}, {}
room, 27, {}, {}
schedule, 27, {days, room, time}, {days, room, time}
time, 27, {}, {}
title, 27, {}, {}

For the p1, p2, and p3 options you should get p1, p2, and p3.

Warning: download the files linked below and inspect them manually. Otherwise your browser might crash.

For this xml file you should get the following for the "s" option:

Total number of nodes: 514800
Number of element nodes: 277072
Number of attribute nodes: 733
Number of text nodes: 236994
Number of empty text nodes: 263595
Maximal height: 9
Maximal length of sibling list: 733
Number of distinct element names: 24
Number of distinct attribute names: 1
a, 23280, {}, {}
b, 4431, {}, {}
bib, 3487, {}, {}
cr, 9168, {}, {}
def, 4074, {cr}, {cr}
dictionary, 1, {e}, {a, b, bib, cr, def, e, et, hw, hwg, i, loc, pos, pr, q, qd, qp, qt, s, ss, vd, vf, vfl, w}
e, 733, {et, hwg, ss, vfl}, {a, b, bib, cr, def, et, hw, hwg, i, loc, pos, pr, q, qd, qp, qt, s, ss, vd, vf, vfl, w}
et, 512, {cr}, {cr}
hw, 983, {}, {}
hwg, 733, {hw, pos, pr}, {hw, pos, pr}
i, 4326, {}, {}
loc, 42632, {}, {}
pos, 729, {}, {}
pr, 790, {}, {}
q, 42632, {a, bib, loc, qd, qt, w}, {a, b, bib, cr, i, loc, qd, qt, w}
qd, 42632, {}, {}
qp, 4074, {q}, {a, b, bib, cr, i, loc, q, qd, qt, w}
qt, 42632, {b, cr, i}, {b, cr, i}
s, 4074, {def, qp}, {a, b, bib, cr, def, i, loc, q, qd, qp, qt, w}
ss, 733, {s}, {a, b, bib, cr, def, i, loc, q, qd, qp, qt, s, w}
vd, 284, {}, {}
vf, 1308, {}, {}
vfl, 192, {vd, vf}, {vd, vf}
w, 42632, {}, {}

For the p2 and p3 options you should get p2 and p3.