Assignment 4

XPath Evaluation over Main Memory Structures
Due Date: May 19th, 2010


An XPath query (or expression) selects a set of nodes in a given XML document.
An XPath evaluator takes as arguments an XML document and a query, and determines
as result the nodes in the document that are selected by the query.
You are asked to implement XPath evaluators for five restricted classes of XPath expressions.
The evaluators should print result nodes by their pre-order number, each number in a seperate line (and all in pre-order).
The "virtual" root node has pre-order number 0, the first actual element node of the document has pre-oder number 1.
All XPath expressions deal with element nodes only. We will only test with documents that consist of element nodes only.
All element names are one-letter only, in all documents and in all queries.
In queries, all node tests are one-letter element names (like "a" or "e") or the star ("*").
Note: every query contains at least one node test, i.e., "/" alone is not used as a query.
  1. Simple Root Paths are of the form, e.g., /a/b/a/c/*/d (3 Points)
  2. Simple Paths are of the form, e.g., //a/b/a/c/*/d (4 Points)
  3. Slash, Slashslash, and Simple Filters of the form, e.g., //a[./d/e]/b//a[./f/*//g/h][./u/h]//c/*/d (4 Points)
  4. Streaming for Slash, Slashslash, and Parent/Ancestor-Filters of the form, e.g.,
    //a/e//f[./ancestor::f/parent::*]//a[./ancestor::g][./parent::g]//c/*/d
    (5 Points)
  5. Streaming as in 4. but also Following-Sibling and Preceding-Sibling-Filters of the form, e.g.,
    /a/e//f[./ancestor::f/preceding-sibling::g]//a/following-sibling::c/*/d
    (4 Points)
IMPORTANT: "Streaming" in 4. and 5. means that your program should only use memory proportional to the
height of the XML document. Thus, for 4. and 5., you may NOT load the document into memory!
If the document is not deep, then your program should use little memory (even for huge documents of several Gigabytes)!

The arguments to your program are the name of an XML file, and an XPath query string.
For Parts 1., 2., and 3., the the program may load the XML file into memory using either DOM, or SAX with your own data structure.
To implement the evaluator you can follow the idea of top-down evaluation, as shown for simple paths in Lecture 7.
Alternatively, you can use the node set based approach discussed in that lecture, or any other approach you find suitable.
(of course you are NOT allowed to use Xalan or other libraries for doing the XPath evaluation!!!)

Hints for handling star ("*") If you want to precompute the KMP-table, then in the presence of stars ("*") you need
to compute several tables (essentially, instantiate each star with every possible letter that occurs in the query). Thus, if there
are several stars, then you will have many tables. Another possibility is to compute a table dynamically during matching,
when it is clear which letter appears at each star-position. --- If you follow the automaton approach, then you need to split
a star-transition into several transitions (each to their own state), one for each letter that occurs in the query.

How to Print Query Results:

The result nodes of a given query should be reported in document order, and by printing the pre-order numbers of the selected nodes.
Each pre-order number should be printed in a seperate line. The "virtual" root node has pre-order number 0,
the first actual element node of the document has pre-oder number 1.

Consider the following XML file, called "h.xml".

<a><b></b><c></c><b></b></a>

Here are examples of what your program should print as result, assuming the executable is called XPathEval.
Note that we will always enclose the query in quotes ("") when we test your program.

> XPathEval h.xml "/a/b"
2
4

> XPathEval h.xml "//c"
3

> XPathEval h.xml "//*"
1
2
3
4

More Sample Runs

For this document and:
CRICOS Provider Number: 00098G