Assignment 1, Bonus Points

DOM Parsing for XML Documents
Due Date: March 15th, 2010


You can earn 2 bonus points by adding the following four calculations to the statistics of the first assignment: average height, average length of sibling lists, maximal breadth, and average breadth. As before, only consider element and (non-empty) text nodes when you look for root-to-leaf paths, sibling (children) lists, or nodes on the same level.
By "maximal breadth" we mean the maximal number of nodes that are on the same level, i.e., have the same distance to the root node.
The "average breadth" is the average of the numbers of nodes on the same level, taken over all levels in the tree.
For the average length of sibling lists, do not count the top-most node (viz. the book node below) as a sibling list of length/breadth 1; simply sum the number of (element or text) children of all element nodes, and divide by the total number of element nodes.
Your program of assignment 1 should print these additional numbers when using the -sb option, precisely as shown below (with correct numbers of course).
Round the average numbers to four digits after the dot, and always print 4 digits after the dot, as shown below.
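
The breadth definitions and the four-digit formatting can be illustrated with a minimal sketch. It is written in Python with xml.dom.minidom purely for illustration (the handout does not prescribe a language), it inlines the example document from this handout, and it assumes that "non-empty" means "not whitespace-only". The helper names counted_children and level_sizes are hypothetical.

```python
from xml.dom import minidom

# The handout's example document, inlined so the sketch is self-contained.
XML = """<book isbn="1-2345-6789-0" year="1994">
<title>TCP/IP Illustrated</title>
<author><last>Stevens</last><first>John</first></author>
<publisher>Addison-Wesley</publisher>
<price currency="USD">65.95</price>
</book>
"""

def counted_children(node):
    """Children that count: elements and text nodes that are not whitespace-only
    (assumption: 'non-empty' means 'not whitespace-only')."""
    return [c for c in node.childNodes
            if c.nodeType == c.ELEMENT_NODE
            or (c.nodeType == c.TEXT_NODE and c.data.strip())]

def level_sizes(root):
    """Number of counted nodes on each level, starting with the root's level."""
    sizes, level = [], [root]
    while level:
        sizes.append(len(level))
        # The next level consists of all counted children of the current level.
        level = [c for n in level for c in counted_children(n)]
    return sizes

sizes = level_sizes(minidom.parseString(XML).documentElement)
max_breadth = max(sizes)
avg_breadth = sum(sizes) / len(sizes)

# "%.4f" rounds to four digits and always prints four digits after the dot.
print("Maximal breadth: %d" % max_breadth)
print("Average breadth: %.4f" % avg_breadth)
```

For the example document the levels hold 1, 4, 5, and 2 nodes, which gives a maximal breadth of 5 and an average breadth of 12 divided by 4, i.e. 3.0000.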

Note: we will only test your program on ASCII files. If your last part (number of bytes for text/attr. values) also works correctly for other character encodings, that would be nice.
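
As a small illustration of why other encodings matter for the byte count: for ASCII text the number of characters equals the number of bytes, but that no longer holds for non-ASCII characters. The sketch below is in Python, uses a hypothetical helper byte_length and a made-up sample value, and assumes that "bytes" means the length of the value when encoded in the document's encoding; the handout itself does not spell this out.

```python
def byte_length(value, encoding):
    """Number of bytes the string occupies in the given encoding (hypothetical helper)."""
    return len(value.encode(encoding))

ascii_value = "Stevens"        # taken from the example document
other_value = "M\u00fcller"    # made-up value containing the non-ASCII character u-umlaut

print(len(ascii_value), byte_length(ascii_value, "ascii"))    # characters vs. bytes
print(len(other_value), byte_length(other_value, "utf-8"))    # u-umlaut takes 2 bytes in UTF-8
print(len(other_value), byte_length(other_value, "latin-1"))  # but 1 byte in Latin-1
```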

In the example run, why is the average height equal to 3.4000? We sum the lengths of all root-to-leaf paths and divide by the number of leaves. Any non-empty text node is a leaf, and every element without child element nodes and without non-empty text children is a leaf. In our example, there are no element nodes that are leaves (because they all have non-empty text below). Thus, the number of leaves equals the number of text nodes: 5. The first leaf is the text node "TCP.."; its root-to-leaf path has length 3. The paths to the text nodes "Addison" and "65.." also have length 3, while those to "Stevens" and "John" have length 4. Thus, the sum is 9+8=17, which divided by 5 equals 3.4000.
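
The calculation above can be reproduced with a minimal sketch. It is written in Python with xml.dom.minidom for illustration only, inlines the example document from this handout, and assumes that "non-empty" means "not whitespace-only". The helper names counted_children and leaf_depths are hypothetical. Path lengths are counted in nodes, so the root alone has length 1.

```python
from xml.dom import minidom

# The handout's example document, inlined so the sketch is self-contained.
XML = """<book isbn="1-2345-6789-0" year="1994">
<title>TCP/IP Illustrated</title>
<author><last>Stevens</last><first>John</first></author>
<publisher>Addison-Wesley</publisher>
<price currency="USD">65.95</price>
</book>
"""

def counted_children(node):
    """Children that count: elements and text nodes that are not whitespace-only
    (assumption: 'non-empty' means 'not whitespace-only')."""
    return [c for c in node.childNodes
            if c.nodeType == c.ELEMENT_NODE
            or (c.nodeType == c.TEXT_NODE and c.data.strip())]

def leaf_depths(node, depth=1):
    """Lengths (in nodes) of the paths from the root to every leaf at or below node."""
    kids = counted_children(node)
    if not kids:
        # A node without counted children is a leaf; its path ends here.
        return [depth]
    return [d for k in kids for d in leaf_depths(k, depth + 1)]

depths = leaf_depths(minidom.parseString(XML).documentElement)
print("Maximal height: %d" % max(depths))
print("Average height: %.4f" % (sum(depths) / len(depths)))
```

The list of path lengths comes out in document order: 3 for "TCP..", 4 for "Stevens" and "John", and 3 for "Addison" and "65..".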
Similarly, for the average length of sibling lists, you need to sum their lengths: 4 for the book node, 1 each for title, last, first, publisher, and price, and 2 for author. Thus, the result is 11 divided by 7, which equals 1.5714.
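
The sibling-list calculation can be sketched in the same way. Again this is only an illustration in Python with xml.dom.minidom, using the example document from this handout and assuming that "non-empty" means "not whitespace-only". The helper name counted_children is hypothetical.

```python
from xml.dom import minidom

# The handout's example document, inlined so the sketch is self-contained.
XML = """<book isbn="1-2345-6789-0" year="1994">
<title>TCP/IP Illustrated</title>
<author><last>Stevens</last><first>John</first></author>
<publisher>Addison-Wesley</publisher>
<price currency="USD">65.95</price>
</book>
"""

def counted_children(node):
    """Children that count: elements and text nodes that are not whitespace-only
    (assumption: 'non-empty' means 'not whitespace-only')."""
    return [c for c in node.childNodes
            if c.nodeType == c.ELEMENT_NODE
            or (c.nodeType == c.TEXT_NODE and c.data.strip())]

root = minidom.parseString(XML).documentElement
# All element nodes in document order: the root plus its element descendants.
elements = [root] + list(root.getElementsByTagName("*"))

# One sibling list per element node: the list of its counted children.
lengths = [len(counted_children(e)) for e in elements]

print("Maximal length of sibling list: %d" % max(lengths))
print("Average length of sibling lists: %.4f" % (sum(lengths) / len(elements)))
```

Note that the divisor is the number of element nodes (7), not the number of non-empty sibling lists; in this example the two happen to coincide.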


Example Runs

Assume the file "test.xml" consists of the following XML snippet:

<book isbn="1-2345-6789-0" year="1994">
<title>TCP/IP Illustrated</title>
<author><last>Stevens</last><first>John</first></author>
<publisher>Addison-Wesley</publisher>
<price currency="USD">65.95</price>
</book>

Your program should behave as follows (assuming the executable is named "DOMcat"):

> DOMcat -sb test.xml
Total number of nodes: 15
Number of element nodes: 7
Number of attribute nodes: 3
Number of text nodes: 5
Number of empty text nodes: 5
Maximal height: 4
Maximal length of sibling list: 4
Number of distinct element names: 7
Number of distinct attribute names: 3
author, 1, {first, last}, {first, last}
book, 1, {author, price, publisher, title}, {author, first, last, price, publisher, title}
first, 1, {}, {}
last, 1, {}, {}
price, 1, {}, {}
publisher, 1, {}, {}
title, 1, {}, {}
Average height: 3.4000
Average length of sibling lists: 1.5714
Maximal breadth: 5
Average breadth: 3.0000
Total bytes in text/attribute values: 73
Proportion text/attribute of document: 34.3%
