Assignment 2

SAX Parse into Memory: Trees vs DAGs
Due Date: March 29th, 2010


We deal with very simple XML files: they only contain element nodes (plus whitespace and return ('\n') characters).
For instance:

<a>
  <b><c/><d/></b>
  <b><d/><c/></b>
  <b><d/><c/></b>
</a>

We want to find the minimal unique DAG (directed acyclic graph) and
  1. print statistics about it (4 points)
  2. print the DAG into a (fixed) table (2 points)
  3. implement multiplicity counters and print statistics and table (4 points)
  4. find the minimal DAG of the firstChild-nextSibling binary tree encoding of the XML element tree, with multiplicities. (Bonus: 2 points)
In the minimal unique DAG every distinct subtree is represented only once (and this also holds for every subtree of that subtree, recursively). The minimal DAG of the above tree has one occurrence of the subtree consisting of a c-node, and has one occurrence of the subtree consisting of a d-node. Further, it has one occurrence of the subtree consisting of a b-node with left-child a c-node and right-child a d-node. It has one occurrence of the subtree consisting of a b-node with left-child d-node and right-child c-node; this tree is reference twice in the DAG (it is "shared") , because the subtree occurs two times in the original tree (as second and as third subtree of the root a-node). Last but not least, there is the tree with root a-node; its first child points to the DAG representing the first b-rooted subtree, and second and third child pointers are both to the DAG representation of the second b-rooted subtree.

You should print the following statistics about the DAG.
If tree2dag is your executable and "tiny.xml" is the above XML document, then running

tree2dag -s tiny.xml

should generate exactly this output:

Tree nodes: 10
DAG nodes: 5
DAG edges: 7
Height: 3
Number of labels: 4
Max. sharing: 3 (node 1)
Max. size of sharing: 3 (node 4)

The "Max. sharing" means the maximal number of times that a node is referenced. This number is not obtained by simply counting the number of times that "1" appears within a bracket [..] in the DAG table below! Instead, you need to take care of the fact that node 4 is referenced twice, and node 4 itself references node 1; thus, node 1 is referenced three times: once by node 3, and twice through node 4. If there are several nodes with this property, then take the one with the lowest node number (in the example: both nodes 1 and 2 are shared 3 times, but node "1" has the lower node number and therefore is printed).
The "Max. size of sharing" is the size of the largest shared tree ("shared" means referenced more than once). In the example the tree of node "4" is referenced more than once, and is the largest such tree. As before, if there are several nodes with this property, then print the one with the lowest node number.

Any DAG can conveniently be represented in a table. Simply number subtrees, starting with 1 at the leftmost (leaf) subtree. In our example this means that the subtree consisting of one c-node is at position 1 of the table. The next complete subtree that we meet in a depth-first left-to-right traversal (=document order=order of SAX events) is a subtree consisting of one d-node: it gets number 2 in the table. The next complete subtree is the b-subtree with c- and d-child. It is represented at position 3 of the DAG table; we write b[1,2] to mean a b-node with first child at position 1 and second child at position 2 (which are the c- and d-nodes)

You should print the following, when using "-p" option.
That is, running

tree2dag -p tiny.xml

should generate exactly this output:

1:c
2:d
3:b[1,2]
4:b[2,1]
5:a[3,4,4]

As discussed in the lecture, your program should build the minimal unique DAG during SAX parsing,
using a hash table to store all distinct subtrees.
It should NOT build up the whole tree in memory, and then compute the DAG!
Hence, your program should use only memory proportional to the size of the minimal unique DAG.

Instead of making your own hash table/function, you may use java/C++ internals such as, e.g., java.util.Hashtable.

DAGs with Multiplicity Counters

Consider the following tree in the file "tiny02.xml":

<a>
  <b><c/><c/><d/><d/><d/></b>
  <b><d/><c/></b>
  <b><d/><c/></b>
  <b><d/><c/></b>
  <b><d/><c/></b>
  <c/>
</a>

We want to count the (maximal) numbers of consecutive same subtrees ("multiplicities") at a node. The first b-node has 2 consecutive c-subtrees, followed by 3 consecutive d-node subtrees. Thus, we change in the table the b-subtree from b[1,2] into b[1:2,2:3]. The latter is a shorthand notation for b[1,1,2,2,2]. The 2nd, 3rd, 4th, and 5th subtrees of the root a-node are all identical (namely, the tree at pointer "4"). Hence, we get 4:4 at the second subtree position of the a-node; such a "multi-edge" is counted as one edge only (it is a labeled edge now, by its multiplicity counter).

When using the option "-mp", you should print the following table for the DAG with multiplicities.
Running

tree2dag -mp tiny02.xml

should generate exactly this output:

1:c
2:d
3:b[1:2,2:3]
4:b[2,1]
5:a[3,4:4,1]

When using the option "-ms" , you should print the following statistics about the DAG with multiplicities;
Running

tree2dag -ms tiny02.xml

should generate exactly this output:

Tree nodes: 20
DAG nodes: 5
DAG edges: 7
Height: 3
Number of labels: 4
Multiplicities: 3
Max. Multiplicity: 4
Sharings wo Multiplicities: 3

Here, "Sharings wo Multiplicities" means the number of references to shared nodes that have multiplicity 1, i.e., count how many times a shared node appears within brackets [..] in the DAG table, but is not followed by a multiplicity ":x" (x is always >2). In the example, node 1 is a shared node and is referenced by nodes 5 and 4, and node 2 is a shared node which is referenced by node 4, thus in total 3 (all other shared nodes are with multiplicities).
Update: By "shared", we mean that a node is shared if the corresponding tree appears at least twice in the original document.
For instance:
<a>
  <b><c/></b>
  <b><c/></b>
</a>

corresponds to the dag:
1:c
2:b[1]
3:a[2:2]
Here the node 1:c is shared since it is referenced by node 2 which is referenced twice in node 3. A simple way to see if a node is shared is to maintain a counter for each node in your hash table. When you introduce a new node (say c with ID 1), you create a counter to 1. And every time you see a node, you increment its counter, so when you see c[] for the second time, the counter becomes two. Now to display the "Sharing wo Multiplicities" you can apply the simple algorithm:
for each row in the table
  for each number i between brackets in the row
   if i doesn't have a multiplicity and its counter is >=2
    increment Sharing Wo Multiplicities.

Binary Tree Encodings, and their DAGs with Multiplicity Counters (Bonus Part)

This only concerns the Bonus part. You should now SAX parse the XML input into a binary tree (using the first-child/next-sibling encoding) and simultaneously compute the minimal DAG (with multiplicities) of that binary tree.

Consider again our first example:

<a>
  <b><c/><d/></b>
  <b><d/><c/></b>
  <b><d/><c/></b>
</a>

The binary encoding of this tree, written as an expression, is a(b(c(_,d(_,_),b(d(_,c(_,_)),b(d(_,c(_,_))))))).
Written in XML syntax we obtain:

<a><b><c><_/><d><_/><_/></d></c>
<b><d><_/><c><_/><_/></c></d>
<b><d><_/><c><_/><_/></c></d><_/>
</b></b></b><_/></a>

Here is a pictorial representation of this binary tree (where two times a left _-child of a d-node was skipped)

      a
     / \
    b   _
  /   \      
 c     b
/ \   /  \
_ d  d    b
  /\  \   /\
 _  _  c d  _
       /\ \
      _  _ c
           /\
          _  _

As before, you can run your algorithm to obtain the minimal unique DAG of this tree:

1:_
2:d[1,1]
3:c[1,2]
4:c[1,1]
5:d[1,4]
6:b[5,1]
7:b[5,6]
8:b[3,7]
9:a[8,1]

which has the following statistics:

Tree nodes: 10
Binary nodes: 21
DAG nodes: 9
DAG edges: 16
Height: 7
Number of labels: 4
Max. sharing: 11 (node 1)
Max. size of sharing: 5 (node 5)

Note that this is the output that you would get if you fed the binary tree to your program in "normal mode" ("-p" and "-s").
It is different from the output we ask you to give from the binary tree computed in memory, which is described after.

Multiplicities now mean repetition of labels with their left subtree, along a right-branch path of the tree.
Thus, in the above tree we can not use multiplicities and represent d[1,1] as d[1:2].

But, we can use a multiplicity for the two b-nodes which both have pointer 5 as left subtree:

6:b[5,1]
7:b[5,6]
8:b[3,7]

becomes, when using multiplicities:

6:b[5,1]
8:b[3,6:2]

Hence, this is the correct DAG with multiplicities that you should print.
For binary DAGs with multiplicities we use the "-bp" option. Thus, running

tree2dag -bp tiny.xml

should generate exactly this output:

1:_
2:d[1,1]
3:c[1,2]
4:c[1,1]
5:d[1,4]
6:b[5,1]
7:b[3,6:2]
8:a[7,1]

and the "-bs" option should generate the corresponding statistics. Thus, running

tree2dag -bs tiny.xml

should generate exactly this output:

Tree nodes: 10
Binary nodes: 21
DAG nodes: 8
DAG edges: 14
Height: 7
Number of labels: 4
Max. sharing: 11 (node 1)
Max. size of sharing: 7 (node 6)
Multiplicities: 1
Max. Multiplicity: 2

Remark The number of labels is 4, be careful not to count "_" as an actual label, since it is not present in the original document.

As another example consider the following XML document (named "tiny03.xml")

<a>
  <b><c/></b>
  <b><c/></b>
  <b><c/></b>
  <b><c/></b>
  <b><c/></b>
  <b><c/></b>
  <b><c/></b>
  <b><c/></b>
</a>

The corresponding binary tree encoding is:

     a
    / \
   b   _
  / \
 c   .
/\    .
_ _   b
     / \
    c   _
   /\  
  _ _

with 8 b-nodes on the right-branch which all have the same left subtree c(_,_).

This is the final table that you should get when running tree2dag -bp file.xml

1:_
2:c[1,1]
3:b[2,1]
4:a[3:8,1]

and these are the statistics you should get when running tree2dag -bs file.xml

Tree nodes: 17
Binary nodes: 35
DAG nodes: 4
DAG edges: 6
Height: 11
Number of labels: 3
Max. sharing: 18 (node 1)
Max. size of sharing: 5 (node 3)
Multiplicities: 1
Max. Multiplicity: 8

Notes

Please make sure that your program accepts exactly the options as mentioned above (-s,-p,-ms,-mp,-bs,-bp)
and that it generates EXACTLY, letter by letter, the same output as shown here.
Otherwise the automarker will subtract points from your grade. Other comments:

Test Runs

Here are some runs on the file 1998statistics.xml

wagner % java tree2dag -s 1998statistics.xml
Tree nodes: 28306
DAG nodes: 82
DAG edges: 1377
Height: 6
Number of labels: 46
Max. sharing: 1226 (node 6)
Max. size of sharing: 23 (node 28)
wagner % java tree2dag -ms 1998statistics.xml
Tree nodes: 28306
DAG nodes: 82
DAG edges: 726
Height: 6
Number of labels: 46
Multiplicities: 305
Max. Multiplicity: 9
Sharings wo Multiplicities: 382
wagner % java tree2dag -bs 1998statistics.xml
Tree nodes: 28306
Binary nodes: 56613
DAG nodes: 654
DAG edges: 1306
Height: 83
Number of labels: 46
Max. sharing: 28307 (node 1)
Max. size of sharing: 2071 (node 502)
Multiplicities: 289
Max. Multiplicity: 9
wagner % 

Here the run on a small random tree:

wagner % java tree2dag -s s200.xml
Tree nodes: 209
DAG nodes: 24
DAG edges: 208
Height: 7
Number of labels: 3
Max. sharing: 73 (node 3)
Max. size of sharing: 1 (node 1)
wagner % java tree2dag -ms s200.xml
Tree nodes: 209
DAG nodes: 24
DAG edges: 146
Height: 7
Number of labels: 3
Multiplicities: 41
Max. Multiplicity: 5
Sharings wo Multiplicities: 85
wagner % java tree2dag -bs s200.xml
Tree nodes: 209
Binary nodes: 419
DAG nodes: 129
DAG edges: 256
Height: 37
Number of labels: 3
Max. sharing: 210 (node 1)
Max. size of sharing: 231 (node 123)
Multiplicities: 41
Max. Multiplicity: 5
wagner % java tree2dag -p s200.xml
1:b
2:a
3:c
4:a[1,2,1,3,3,2,3,3,2,2,1,3,1,3,3,3]
5:b[1,1,2,2,2,2,1,2,1,1,3,2,4,1,2,3,3,1,3,3]
6:c[3,2,1,3,2,3,3]
7:c[1,2,1,3,3,1]
8:c[5,6,7,2,3,2,2,2,1,2,2,2,2,2,1,1,3,2,2,2]
9:a[1,1,8]
10:a[2]
11:c[9,1,1,10]
12:b[3,1]
13:c[2,2,1,3,2,12,3,2]
14:b[1,1,1,3,3,2,3,1,1,2]
15:b[2,3,3,2,3,3,2,3,3,3]
16:b[3,3,14,1,2,1,2,1,3,3,1,1,15,2,2,3,2]
17:a[3]
18:a[1,3,2,2,2,1,3,2]
19:b[3,1,2,3,2,3,17,3,3,2,18,3,2,3,2,2,2,1]
20:b[2,3,19,2,1,1,3]
21:a[1,2,3,2,1,3,3,3,1,1,1,3,3,2,2,2,1,1,1]
22:b[2,21,3,1,3,3,3,3]
23:a[16,20,22,3,1,2,3,3,3,3,2,3,3,3,2]
24:b[11,2,3,13,2,3,3,23]
wagner % java tree2dag -mp s200.xml
1:b
2:a
3:c
4:a[1,2,1,3:2,2,3:2,2:2,1,3,1,3:3]
5:b[1:2,2:4,1,2,1:2,3,2,4,1,2,3:2,1,3:2]
6:c[3,2,1,3,2,3:2]
7:c[1,2,1,3:2,1]
8:c[5,6,7,2,3,2:3,1,2:5,1:2,3,2:3]
9:a[1:2,8]
10:a[2]
11:c[9,1:2,10]
12:b[3,1]
13:c[2:2,1,3,2,12,3,2]
14:b[1:3,3:2,2,3,1:2,2]
15:b[2,3:2,2,3:2,2,3:3]
16:b[3:2,14,1,2,1,2,1,3:2,1:2,15,2:2,3,2]
17:a[3]
18:a[1,3,2:3,1,3,2]
19:b[3,1,2,3,2,3,17,3:2,2,18,3,2,3,2:3,1]
20:b[2,3,19,2,1:2,3]
21:a[1,2,3,2,1,3:3,1:3,3:2,2:3,1:3]
22:b[2,21,3,1,3:4]
23:a[16,20,22,3,1,2,3:4,2,3:3,2]
24:b[11,2,3,13,2,3:2,23]
wagner % java tree2dag -bp s200.xml
1:_
2:c[1,1]
3:b[1,2:3]
4:c[1,3]
5:b[1,4]
6:a[1,5]
7:c[1,6:2]
8:a[1,7:2]
9:c[1,8]
10:b[1,9:2]
11:a[1,10]
12:b[1,11]
13:b[1,2:2]
14:c[1,13]
15:a[1,14:2]
16:b[1,15]
17:a[12,16]
18:a[1,17]
19:c[1,18]
20:b[1,19]
21:a[1,20:2]
22:b[1,21]
23:a[1,22]
24:b[1,23:4]
25:a[1,2:2]
26:c[1,25]
27:b[1,26]
28:a[1,27]
29:c[1,28]
30:b[1,1]
31:c[1,30]
32:b[1,31:2]
33:a[1,32]
34:b[1,33]
35:a[1,1]
36:c[1,35:3]
37:b[1,36]
38:a[1,37:2]
39:b[1,38:5]
40:a[1,39]
41:c[1,40:3]
42:a[1,41]
43:c[34,42]
44:c[29,43]
45:b[24:2,44]
46:c[45,1]
47:b[1,46]
48:a[35,1]
49:b[1,48]
50:a[47:2,49:2]
51:c[1,35]
52:b[31,51]
53:a[1,52]
54:c[1,53]
55:b[1,54]
56:a[1,55]
57:b[1,35]
58:c[1,57:2]
59:a[1,58]
60:c[1,59]
61:b[1,60:2]
62:a[1,2:3]
63:c[1,62]
64:a[1,63:2]
65:c[1,64]
66:a[1,65:2]
67:a[1,51]
68:b[66,67:2]
69:b[1,68]
70:c[1,69:2]
71:b[1,70:2]
72:a[1,71]
73:b[1,72]
74:a[1,73]
75:b[1,74]
76:b[61:3,75]
77:c[1,76]
78:b[1,51]
79:a[1,78]
80:c[1,79:3]
81:b[1,80]
82:a[1,30]
83:c[1,82:3]
84:a[1,83]
85:c[1,84]
86:a[81,85]
87:a[1,86]
88:c[1,87]
89:a[2,88:2]
90:c[1,89]
91:a[1,90]
92:c[1,91]
93:a[1,92]
94:b[1,93]
95:c[1,94]
96:b[1,2]
97:a[1,96:2]
98:b[95,97]
99:c[1,98]
100:a[1,99]
101:a[1,30:3]
102:c[1,101:3]
103:b[1,102:2]
104:c[1,103:3]
105:b[1,104:3]
106:a[1,105]
107:c[1,106]
108:a[1,107]
109:b[1,108]
110:b[1,2:4]
111:c[1,110]
112:a[109,111]
113:a[1,112]
114:a[1,51:3]
115:c[1,114]
116:a[1,115:4]
117:b[1,116]
118:c[1,117]
119:b[113,118]
120:b[100,119]
121:b[77:2,120]
122:a[121,1]
123:c[1,122]
124:a[1,123:2]
125:c[56:2,124]
126:c[1,125]
127:a[1,126]
128:c[50,127]
129:b[128,1]

CRICOS Provider Number: 00098G