Assignment 4

XPath Evaluation over Main Memory Structures
Due Date: May 18th, 2009


An XPath query (or expression) selects a set of nodes in a given XML document.
An XPath evaluator takes as arguments an XML document and a query, and determines
as result the nodes in the document that are selected by the query.
You are asked to implement XPath evaluators for four restricted classes of XPath expressions.
The first two deal with element and text nodes only; the last two deal with element nodes only.
Thus, you may assume that all XML documents used in this assignment contain element and text nodes only.
Note: every query contains at least one node test ("/" alone is not used; hence, the root node is never selected).
  1. Simple Root Paths are of the form /a/b/a/c/*/d or /e/b/d/*/*/e/text() (3 Points)
  2. Simple Paths are of the form //a/b/a/c/*/d or //e/b/d/*/*/e/text().
    Thus, each query starts with "//", and otherwise has the same form as a simple root path.
    (3 Points)
  3. Slash, Slashslash, and Simple Filters of the form //a[./d/e]/b//a[./f/*//g/h][./u/h]//c/*/d (3 Points)
  4. Slash, Slashslash, Simple Filters, and following-sibling (evaluated over DAG)
    //a/following-sibling::e//f[./d/e/following-sibling::*]//a[./*//g/h][./h]//c/*/d
    Same restrictions as Part 3 for node tests and filters (but with following-sibling axis now).
    The challenge of Part 4 is to implement an evaluator that works over a DAG data structure.
    Instead of an XML file your input is a DAG table of the form outputted in Assignment 2 (option "-mp", w. multiplicities!)
    Your program should read the DAG into memory (*without* unfolding it into a tree!), and should then
    evaluate the XPath expression on the DAG data structure.
    (3 Points)
  5. Streaming for Simple Paths (2 Bonus Points) See below for details.
For points 1-3, the arguments to your program are the name of an XML file, and an XPath query string.
The program is supposed to load the XML file into memory, using either DOM, or SAX with your own data structure.
To implement the evaluator you can follow the idea of top-down evaluation, as shown for simple paths in Lecture 7.
Alternatively, you can use the node set based approach discussed in that lecture, or any other approach you find suitable.
(of course you are NOT allowed to use Xalan or other libraries for doing the XPath evaluation!!! --you may use it for
parsing XPath expressions, if you like)

Hints for handling star ("*") If you want to precompute the KMP-table, then in the presence of stars ("*"s) you need
to compute several tables (essentially, instantiate each star with every possible letter that occurs in the query). Thus, if there
are several stars, then you will have many tables. Another possibility is to compute a table dynamically during matching,
when it is clear which letter appears at each star-position. --- If you follow the automaton approach, then you need to split
a star-transition into several transitions (each to their own state), one for each letter that occurs in the query.

How to Print Query Results:

The result nodes of a given query should be reported in document order, and by printing the subtree at that node.
Before the n-th result subtree, print the number n followed by a colon, followed by a return.
The result subtrees should be printed line by line, as in "-p2" of Assignment 1.

Consider the following XML file, called "h.xml".

<a>
  <b>Hello</b>
  <c></c>
  <b>World</b>
</a>

Here are examples of what your program should print as result, assuming the executable is called XPathEval.
Note that we will always enclose the query in quotes ("") when we test your program.

> XPathEval h.xml "/a/b"
1:
<b>
Hello
</b>
2:
<b>
World
</b>

> XPathEval h.xml "/a/*/text()"
1:
Hello
2:
World

> XPathEval h.xml "//*"
1:
<a>
<b>
Hello
</b>
<c>
</c>
<b>
World
</b>
</a>
2:
<b>
Hello
</b>
3:
<c>
</c>
4:
<b>
World
</b>

For Point 4, we would like you to implement two different ways of outputting the results:

(1) as above, i.e., XML line-by-line fragments of the subtrees at the selected nodes ("-t" option) and
(2) as a list of DAG nodes; a node is of the form N1.N2.N3...Nn where N1 is the last row number of the DAG,
N2 is a child number of the root node, N3 is a child number of the N2th child of the root, etc.

As before, always print the result nodes in document order.

Here is an example, consider this following DAG (stored in a file "dag.txt").

1:c
2:d
3:b[1:2,2:3]
4:b[2,1]
5:a[3,4:4,1]

> XPathEval -t dag.txt "//b[./following-sibling::c]"
1:
<b>
<c>
</c>
<c>
</c>
<d>
</d>
<d>
</d>
<d>
</d>
</b>
2:
<b>
<d>
</d>
<c>
</c>
</b>
3:
<b>
<d>
</d>
<c>
</c>
</b>
4:
<b>
<d>
</d>
<c>
</c>
</b>
5:
<b>
<d>
</d>
<c>
</c>
</b>

> XPathEval -d dag.txt "//b[./following-sibling::c]"
[5.1,5.2,5.3,5.4,5.5]

> XPathEval -d dag.txt "//b[./following-sibling::c]/d[./following-sibling::c]"
[5.2.1,5.3.1,5.4.1,5.5.1]

> XPathEval -t dag.txt "//b[./following-sibling::c]/*[./following-sibling::d]"
1:
<c>
</c>
2:
<c>
</c>
3:
<d>
</d>
4:
<d>
</d>

> XPathEval -d dag.txt "//b[./following-sibling::c]/*[./following-sibling::d]"
[5.1.1,5.1.2,5.1.3,5.1.4]

> XPathEval -d dag.txt "//f"
[]

More Sample Runs

For this document and For this DAG and

Bonus Points

Here you are asked to implement a streaming XPath evaluator. This means that the evaluator should use as little memory as possible;
in particular, it should not store any XML tree in memory. For XML documents (of bounded depth, say, <=20), your program should
only use constant amount of memory. We will test your program with bounded depth documents of arbitrary size (say >1GB) to check
whether the memory consumption of your program is really independent of the size of the document (but only depends on the depth). Example Runs

> cat h.xml | XPathEval -s "/a/b"
1: 1
2: 3

> cat h.xml | XPathEval -s "/a/*/text()"
1: 1.1
2: 3.1

> cat h.xml | XPathEval -s "//*"
1: 0
2: 1
3: 2
4: 3

> cat tr.xml
<a>
  <b><c/><c/><d/><d/><d/></b>
  <e><d/><c/></e>
  <e><d/><c/></e>
  <e><d/><c/></e>
  <e><d/><c/></e>
  <c/>
  <b><b><c/><d/></b></b>
</a> 

> cat tr.xml | XPathEval -s "//c"
1: 1.1
2: 1.2
3: 2.2
4: 3.2
5: 4.2
6: 5.2
7: 6
8: 7.1.1

> cat tr.xml | XPathEval -s "//b/d"
1: 1.3
2: 1.4
3: 1.5
4: 7.1.2


CRICOS Provider Number: 00098G