Assignment 2

SAX Parse into Memory: Trees vs DAGs
Due Date: April 14th, 2008


We deal with very simple XML files: they only contain element nodes (plus whitespace and return ('n') characters).
For instance:

<a>
  <b><c/><d/></b>
  <b><d/><c/></b>
  <b><d/><c/></b>
</a>

We want to find the minimal unique DAG (directed acyclic graph) and
  1. print statistics about it (4 points)
  2. print the DAG into a (fixed) table (4 points)
  3. implement multiplicity counters and print statistics and table (4 points)
  4. find the minimal DAG of the firstChild-nextSibling binary tree encoding of the XML element tree, with multiplicities. (Bonus: 2 points)
In the minimal unique DAG every distinct subtree is represented only once (and this also holds for every subtree of that subtree, recursively). The minimal DAG of the above tree has one occurrence of the subtree consisting of a c-node, and has one occurrence of the subtree consisting of a d-node. Further, it has one occurrence of the subtree consisting of a b-node with left-child a c-node and right-child a d-node. It has one occurrence of the subtree consisting of a b-node with left-child d-node and right-child c-node; this tree is reference twice in the DAG (it is "shared") , because the subtree occurs two times in the original tree (as second and as third subtree of the root a-node). Last but not least, there is the tree with root a-node; its first child points to the DAG representing the first b-rooted subtree, and second and third child pointers are both to the DAG representation of the second b-rooted subtree.

You should print the following statistics about the DAG.
If tree2dag is your executable and "tiny.xml" is the above XML document, then running

tree2dag -s tiny.xml

should generate exactly this output:

Tree nodes: 10
DAG nodes: 5
DAG edges: 7
Height: 3
Number of labels: 4

Any DAG can conveniently be represented in a table. Simply number subtrees, starting with 1 at the leftmost (leaf) subtree. In our example this means that the subtree consisting of one c-node is at position 1 of the table. The next complete subtree that we meet in a depth-first left-to-right traversal (=document order=order of SAX events) is a subtree consisting of one d-node: it gets number 2 in the table. The next complete subtree is the b-subtree with c- and d-child. It is represented at position 3 of the DAG table; we write b[1,2] to mean a b-node with first child at position 1 and second child at position 2 (which are the c- and d-nodes)

You should print the following, when using "-p" option.
That is, running

tree2dag -p tiny.xml

should generate exactly this output:

1:c
2:d
3:b[1,2]
4:b[2,1]
5:a[3,4,4]

As discussed in the lecture, your program should build the minimal unique DAG during SAX parsing,
using a hash table to store all distinct subtrees.
It should NOT build up the whole tree in memory, and then compute the DAG!
Hence, your program should use only memory proportional to the size of the minimal unique DAG.

Instead of making your own hash table/function, you may use java/C++ internals such as, e.g., java.util.Hashtable.

DAGs with Multiplicity Counters

Consider the following tree

<a>
  <b><c/><c/><d/><d/><d/></b>
  <b><d/><c/></b>
  <b><d/><c/></b>
  <b><d/><c/></b>
  <b><d/><c/></b>
  <c/>
</a>

We want to count the (maximal) numbers of consecutive same subtrees ("multiplicities") at a node. The first b-node has 2 consecutive c-subtrees, followed by 3 consecutive d-node subtrees. Thus, we change in the table the b-subtree from b[1,2] into b[1:2,2:3]. The latter is a shorthand notation for b[1,1,2,2,2]. The 2nd, 3rd, 4th, and 5th subtrees of the root a-node are all identical (namely, the tree at pointer "4"). Hence, we get 4:4 at the second subtree position of the a-node; such a "multi-edge" is counted as one edge only (it is a labeled edge now, by its multiplicity counter).

When using the option "-mp", you should print the following table for the DAG with multiplicities.
Running

tree2dag -mp tiny.xml

should generate exactly this output:

1:c
2:d
3:b[1:2,2:3]
4:b[2,1]
5:a[3,4:4,1]

When using the option "-ms" , you should print the following statistics about the DAG with multiplicities;
Running

tree2dag -ms tiny.xml

should generate exactly this output:

Tree nodes: 20
DAG nodes: 5
DAG edges: 7
Height: 3
Number of labels: 4
Multiplicities: 3
Max Multiplicity: 4

Binary Tree Encodings, and their DAGs with Multiplicity Counters

This is about point 4 of the assignment. You should now SAX parse the XML input into a binary tree (using the first-child/next-sibling encoding) and simultaneously compute the minimal DAG (with multiplicities) of that binary tree.

Consider again our first example:

<a>
  <b><c/><d/></b>
  <b><d/><c/></b>
  <b><d/><c/></b>
</a>

The binary encoding of this tree, written as an expression, is a(b(c(_,d(_,_),b(d(_,c(_,_)),b(d(_,c(_,_))))))).
Written in XML syntax we obtain:

<a><b><c><_/><d><_/><_/></d></c>
<b><d><_/><c><_/><_/></c></d> <b><d><_/><c><_/><_/></c></d></b>
</b><_/></a>

Here is a pictorial representation of this binary tree (where two times a left _-child of a d-node was skipped)

      a
     / \
    b   _
  /   \      
 c     b
/ \   /  \
_ d  d    b
  /\  \   /\
 _  _  c d  _
       /\ \
      _  _ c
           /\
          _  _

As before, you can run your algorithm to obtain the minimal unique DAG of this tree:

1:_
2:d[1,1]
3:c[1,2]
4:c[1,1]
5:d[1,4]
6:b[5,1]
7:b[5,6]
8:b[3,7]
9:a[8,1]

which has the following statistics:

Tree nodes: 10
Binary nodes: 21
DAG nodes: 9
DAG edges: 16
Height: 7
Number of labels: 4

Multiplicities now mean repetition of labels with their left subtree, along a right-branch path of the tree.
Thus, in the above tree we can not use multiplicities and represent d[1,1] as d[1:2].
But, we can use a multiplicity for the two b-nodes which both have pointer 5 as left subtree:

6:b[5,1]
7:b[5,6]
8:b[3,7]

becomes, when using multiplicities:

6:b[5,1]
8:b[3,6:2]

Hence, this is the correct DAG with multiplicities that you should print.
For binary DAGs with multiplicities we use the "-bp" option. Thus, running

tree2dag -bp tiny.xml

should generate exactly this output:

1:_
2:d[1,1]
3:c[1,2]
4:c[1,1]
5:d[1,4]
6:b[5,1]
7:b[3,6:2]
8:a[7,1]

and the "-bs" option should generate the corresponding statistics. Thus, running

tree2dag -bs tiny.xml

should generate exactly this output:

Tree nodes: 10
Binary nodes: 21
DAG nodes: 8
DAG edges: 14
Height: 7
Number of labels: 4
Multiplicities: 1
Max Multiplicity: 2

As another example consider the following XML document (named tiny2.xml)

<a>
  <b><c/></b>
  <b><c/></b>
  <b><c/></b>
  <b><c/></b>
  <b><c/></b>
  <b><c/></b>
  <b><c/></b>
  <b><c/></b>
</a>

The corresponding binary tree encoding is:

     a
    / \
   b   _
  / \
 c   .
/\    .
_ _   b
     / \
    c   _
   /\  
  _ _

with 8 b-nodes on the right-branch which all have the same left subtree c(_,_).

This is the final table that you should get when running tree2dag -bp file.xml

1:_
2:c[1,1]
3:b[2,1]
4:a[3:8,1]

and these are the statistics you should get when running tree2dag -bs file.xml

Tree nodes: 17
Binary nodes: 35
DAG nodes: 4
DAG edges: 6
Height: 11
Number of labels: 3
Multiplicities: 1
Max Multiplicity: 8

Notes

Please make sure that your program accepts exactly the options as mentioned above (-s,-p,-ms,-mp,-bs,-bp)
and that it generates EXACTLY, letter by letter, the same output as shown here.
Otherwise the automarker will subtract points from your grade. Other comments:

Test Runs

Here are some runs on the file 1998statistics.xml

wagner % java tree2dag -s 1998statistics.xml
Tree nodes: 28306
DAG nodes: 82
DAG edges: 1377
Height: 6
Number of labels: 46
wagner % java tree2dag -ms 1998statistics.xml
Tree nodes: 28306
DAG nodes: 82
DAG edges: 726
Height: 6
Number of labels: 46
Multiplicities: 305
Max Multiplicity: 9
wagner % java tree2dag -bs 1998statistics.xml
Tree nodes: 28306
Binary nodes: 56613
DAG nodes: 654
DAG edges: 1306
Height: 83
Number of labels: 46
Multiplicities: 289
Max Multiplicity: 9
wagner % 

Here the run on a small random tree:

wagner % java tree2dag -s s200.xml
Tree nodes: 209
DAG nodes: 24
DAG edges: 208
Height: 7
Number of labels: 3
wagner % java tree2dag -ms s200.xml
Tree nodes: 209
DAG nodes: 24
DAG edges: 146
Height: 7
Number of labels: 3
Multiplicities: 41
Max Multiplicity: 5
wagner % java tree2dag -bs s200.xml
Tree nodes: 209
Binary nodes: 419
DAG nodes: 129
DAG edges: 256
Height: 37
Number of labels: 3
Multiplicities: 41
Max Multiplicity: 5
wagner % java tree2dag -p s200.xml
1:b
2:a
3:c
4:a[1,2,1,3,3,2,3,3,2,2,1,3,1,3,3,3]
5:b[1,1,2,2,2,2,1,2,1,1,3,2,4,1,2,3,3,1,3,3]
6:c[3,2,1,3,2,3,3]
7:c[1,2,1,3,3,1]
8:c[5,6,7,2,3,2,2,2,1,2,2,2,2,2,1,1,3,2,2,2]
9:a[1,1,8]
10:a[2]
11:c[9,1,1,10]
12:b[3,1]
13:c[2,2,1,3,2,12,3,2]
14:b[1,1,1,3,3,2,3,1,1,2]
15:b[2,3,3,2,3,3,2,3,3,3]
16:b[3,3,14,1,2,1,2,1,3,3,1,1,15,2,2,3,2]
17:a[3]
18:a[1,3,2,2,2,1,3,2]
19:b[3,1,2,3,2,3,17,3,3,2,18,3,2,3,2,2,2,1]
20:b[2,3,19,2,1,1,3]
21:a[1,2,3,2,1,3,3,3,1,1,1,3,3,2,2,2,1,1,1]
22:b[2,21,3,1,3,3,3,3]
23:a[16,20,22,3,1,2,3,3,3,3,2,3,3,3,2]
24:b[11,2,3,13,2,3,3,23]
wagner % java tree2dag -mp s200.xml
1:b
2:a
3:c
4:a[1,2,1,3:2,2,3:2,2:2,1,3,1,3:3]
5:b[1:2,2:4,1,2,1:2,3,2,4,1,2,3:2,1,3:2]
6:c[3,2,1,3,2,3:2]
7:c[1,2,1,3:2,1]
8:c[5,6,7,2,3,2:3,1,2:5,1:2,3,2:3]
9:a[1:2,8]
10:a[2]
11:c[9,1:2,10]
12:b[3,1]
13:c[2:2,1,3,2,12,3,2]
14:b[1:3,3:2,2,3,1:2,2]
15:b[2,3:2,2,3:2,2,3:3]
16:b[3:2,14,1,2,1,2,1,3:2,1:2,15,2:2,3,2]
17:a[3]
18:a[1,3,2:3,1,3,2]
19:b[3,1,2,3,2,3,17,3:2,2,18,3,2,3,2:3,1]
20:b[2,3,19,2,1:2,3]
21:a[1,2,3,2,1,3:3,1:3,3:2,2:3,1:3]
22:b[2,21,3,1,3:4]
23:a[16,20,22,3,1,2,3:4,2,3:3,2]
24:b[11,2,3,13,2,3:2,23]
wagner % java tree2dag -bp s200.xml
1:_
2:c[1,1]
3:b[1,2:3]
4:c[1,3]
5:b[1,4]
6:a[1,5]
7:c[1,6:2]
8:a[1,7:2]
9:c[1,8]
10:b[1,9:2]
11:a[1,10]
12:b[1,11]
13:b[1,2:2]
14:c[1,13]
15:a[1,14:2]
16:b[1,15]
17:a[12,16]
18:a[1,17]
19:c[1,18]
20:b[1,19]
21:a[1,20:2]
22:b[1,21]
23:a[1,22]
24:b[1,23:4]
25:a[1,2:2]
26:c[1,25]
27:b[1,26]
28:a[1,27]
29:c[1,28]
30:b[1,1]
31:c[1,30]
32:b[1,31:2]
33:a[1,32]
34:b[1,33]
35:a[1,1]
36:c[1,35:3]
37:b[1,36]
38:a[1,37:2]
39:b[1,38:5]
40:a[1,39]
41:c[1,40:3]
42:a[1,41]
43:c[34,42]
44:c[29,43]
45:b[24:2,44]
46:c[45,1]
47:b[1,46]
48:a[35,1]
49:b[1,48]
50:a[47:2,49:2]
51:c[1,35]
52:b[31,51]
53:a[1,52]
54:c[1,53]
55:b[1,54]
56:a[1,55]
57:b[1,35]
58:c[1,57:2]
59:a[1,58]
60:c[1,59]
61:b[1,60:2]
62:a[1,2:3]
63:c[1,62]
64:a[1,63:2]
65:c[1,64]
66:a[1,65:2]
67:a[1,51]
68:b[66,67:2]
69:b[1,68]
70:c[1,69:2]
71:b[1,70:2]
72:a[1,71]
73:b[1,72]
74:a[1,73]
75:b[1,74]
76:b[61:3,75]
77:c[1,76]
78:b[1,51]
79:a[1,78]
80:c[1,79:3]
81:b[1,80]
82:a[1,30]
83:c[1,82:3]
84:a[1,83]
85:c[1,84]
86:a[81,85]
87:a[1,86]
88:c[1,87]
89:a[2,88:2]
90:c[1,89]
91:a[1,90]
92:c[1,91]
93:a[1,92]
94:b[1,93]
95:c[1,94]
96:b[1,2]
97:a[1,96:2]
98:b[95,97]
99:c[1,98]
100:a[1,99]
101:a[1,30:3]
102:c[1,101:3]
103:b[1,102:2]
104:c[1,103:3]
105:b[1,104:3]
106:a[1,105]
107:c[1,106]
108:a[1,107]
109:b[1,108]
110:b[1,2:4]
111:c[1,110]
112:a[109,111]
113:a[1,112]
114:a[1,51:3]
115:c[1,114]
116:a[1,115:4]
117:b[1,116]
118:c[1,117]
119:b[113,118]
120:b[100,119]
121:b[77:2,120]
122:a[121,1]
123:c[1,122]
124:a[1,123:2]
125:c[56:2,124]
126:c[1,125]
127:a[1,126]
128:c[50,127]
129:b[128,1]

CRICOS Provider Number: 00098G