Assignment 4

XPath Evaluation over Main Memory Structures
Due Date: May 19th, 2010

An XPath query (or expression) selects a set of nodes in a given XML document.
An XPath evaluator takes an XML document and a query as arguments and returns
the nodes in the document that the query selects.
You are asked to implement XPath evaluators for five restricted classes of XPath expressions.
The evaluators should print the result nodes by their pre-order numbers, one number per line, in pre-order.
The "virtual" root node has pre-order number 0; the first actual element node of the document has pre-order number 1.
All XPath expressions deal with element nodes only. We will only test with documents that consist of element nodes only.
All element names are one-letter only, in all documents and in all queries.
In queries, all node tests are one-letter element names (like "a" or "e") or the star ("*").
Note: every query contains at least one node test, i.e., "/" alone is not used as a query.

1. Simple Root Paths are of the form, e.g., /a/b/a/c/*/d
- they start with "/", i.e., evaluation starts at the root node
- the only available axis is the child axis ("/")
(3 Points)
2. Simple Paths are of the form, e.g., //a/b/a/c/*/d
- they start with "//" and are otherwise like the Simple Root Paths of 1.
(4 Points)
3. Slash, Slashslash, and Simple Filters of the form, e.g., //a[./d/e]/b//a[./f/*//g/h][./u/h]//c/*/d
- they start with "/" or with "//"
- slash ("/") and slashslash ("//") may appear everywhere in the query
- filters start with "[./t" or with "[.//t", where t is the star ("*") or a one-letter element name
- a filter contains exactly one XPath expression over / and //, i.e., there is no nesting of filters.
(4 Points)
4. Streaming for Slash, Slashslash, and Parent/Ancestor-Filters of the form, e.g.,
//a/e//f[./ancestor::f/parent::*]//a[./ancestor::g][./parent::g]//c/*/d
- they start with "/" or with "//"
- slash ("/") and slashslash ("//") may appear everywhere in the query (but not inside filters!)
- filters start with "[./parent::t" or with "[./ancestor::t", where t is the star ("*") or a one-letter element name
- a filter contains exactly one XPath expression, which uses only the parent and ancestor axes.
(5 Points)
5. Streaming as in 4. but also Following-Sibling and Preceding-Sibling-Filters of the form, e.g.,
/a/e//f[./ancestor::f/preceding-sibling::g]//a/following-sibling::c/*/d
- they start with "/" or with "//"
- slash ("/"), slashslash ("//"), and following-sibling may appear everywhere in the query (but not inside filters!)
- filters start with "[./parent::t" or with "[./ancestor::t" or with "[./preceding-sibling::t", where t is the star ("*") or a one-letter element name
- a filter contains exactly one XPath expression, which uses only the parent, the ancestor, and the preceding-sibling axes.
(4 Points)
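
Before evaluating any of the query classes above, the query string has to be split into steps. The following Python sketch is one possible way to do that; it is not prescribed by the assignment. The function name parse_steps, the tuple layout (axis, test, filters), and the axis labels "child" and "descendant" (for "/" and "//") are assumptions made for illustration only.

```python
import re

# One step: "/" or "//", an optional explicit axis, then a one-letter name or "*".
STEP = re.compile(
    r"(//|/)(?:(parent|ancestor|preceding-sibling|following-sibling)::)?([A-Za-z*])"
)

def parse_steps(text, allow_filters=True):
    """Return a list of (axis, test, filters) tuples for a query string.

    Each filter is itself a list of steps (with empty filter lists),
    because the assignment rules out nested filters.
    """
    steps, pos = [], 0
    while pos < len(text):
        match = STEP.match(text, pos)
        if match is None:
            raise ValueError(f"cannot parse query at offset {pos}: {text!r}")
        sep, axis, test = match.groups()
        if axis is None:
            # Assumed labels: "//" is treated as a descendant step, "/" as a child step.
            axis = "descendant" if sep == "//" else "child"
        elif sep == "//":
            raise ValueError("explicit axes are only expected after a single '/'")
        pos = match.end()
        filters = []
        # Filters always begin with "[." and end at the next "]".
        while allow_filters and text.startswith("[.", pos):
            close = text.index("]", pos)
            filters.append(parse_steps(text[pos + 2:close], allow_filters=False))
            pos = close + 1
        steps.append((axis, test, filters))
    return steps

print(parse_steps("//a[./d/e]/b"))
```

Keeping the filters attached to the step they belong to means the same structure can serve all five parts; only the set of axes that actually occur differs.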

IMPORTANT: "Streaming" in 4. and 5. means that your program should only use memory proportional to the
height of the XML document. Thus, for 4. and 5., you may NOT load the document into memory!
If the document is not deep, then your program should use little memory (even for huge documents of several Gigabytes)!
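
To make the memory requirement concrete, here is a minimal Python sketch of streaming evaluation for filter-free queries over "/" and "//" only. It does not handle the filters of 4. and 5. It keeps one small set of query positions per open element, so memory grows with document height (times query length) and not with document size. The names StreamingEvaluator, parse_simple_path, and evaluate_stream are hypothetical, and the use of Python's xml.sax module is an assumption of this sketch.

```python
import re
import xml.sax

def parse_simple_path(query):
    """Split a filter-free query such as //a/b/* into (axis, test) steps."""
    return [("descendant" if sep == "//" else "child", test)
            for sep, test in re.findall(r"(//|/)([A-Za-z*])", query)]

class StreamingEvaluator(xml.sax.ContentHandler):
    def __init__(self, steps, emit=print):
        super().__init__()
        self.steps = steps
        self.emit = emit      # called with each selected pre-order number
        self.preorder = 0     # 0 is the virtual root
        self.stack = [{0}]    # one set of matched-prefix lengths per open node

    def startElement(self, name, attrs):
        self.preorder += 1
        active = set()
        for i in self.stack[-1]:
            if i == len(self.steps):
                continue                  # query already fully matched above
            axis, test = self.steps[i]
            if test in ("*", name):
                active.add(i + 1)         # this element matches step i
            if axis == "descendant":
                active.add(i)             # "//" may still match further down
        self.stack.append(active)
        if len(self.steps) in active:
            self.emit(self.preorder)      # start events arrive in pre-order

    def endElement(self, name):
        self.stack.pop()

def evaluate_stream(xml_bytes, query, emit=print):
    handler = StreamingEvaluator(parse_simple_path(query), emit)
    xml.sax.parseString(xml_bytes, handler)

evaluate_stream(b"<a><b></b><c></c><b></b></a>", "/a/b")
```

For a file on disk, xml.sax.parse(filename, handler) would replace parseString. Because the stack holds exactly one entry per open ancestor, parent and ancestor filters could in principle be checked against it if element names were stored alongside the sets.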

The arguments to your program are the name of an XML file, and an XPath query string.
For Parts 1., 2., and 3., the program may load the XML file into memory using either DOM, or SAX with your own data structure.
To implement the evaluator you can follow the idea of top-down evaluation, as shown for simple paths in Lecture 7.
Alternatively, you can use the node set based approach discussed in that lecture, or any other approach you find suitable.
(Of course, you are NOT allowed to use Xalan or any other library to do the XPath evaluation itself!)
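
As an illustration of a node set based evaluation for Parts 1. to 3., the following Python sketch builds its own tree with pre-order numbers and applies the query step by step. It is a naive sketch, not the algorithm from the lecture. The names Node, build_tree, and evaluate are hypothetical, and the query is assumed to be already parsed into (axis, test, filters) tuples, where axis is "child" or "descendant" and each filter is a list of such tuples. xml.etree.ElementTree is used only to read the document, not to evaluate XPath.

```python
import xml.etree.ElementTree as ET

class Node:
    """One element of the in-memory tree, labelled with its pre-order number."""
    def __init__(self, name, pre):
        self.name = name
        self.pre = pre
        self.children = []

def build_tree(xml_text):
    """Parse the document and return the virtual root (pre-order number 0)."""
    counter = 0
    def convert(elem):
        nonlocal counter
        counter += 1                      # number the node before its children
        node = Node(elem.tag, counter)
        node.children = [convert(child) for child in elem]
        return node
    root = Node(None, 0)
    root.children.append(convert(ET.fromstring(xml_text)))
    return root

def descendants(node):
    for child in node.children:
        yield child
        yield from descendants(child)

def evaluate(context, steps):
    """Return the nodes selected by steps, starting at context, in document order."""
    nodes = [context]
    for axis, test, filters in steps:
        selected = {}                     # pre-order number -> node, removes duplicates
        for node in nodes:
            candidates = node.children if axis == "child" else descendants(node)
            for cand in candidates:
                if test in ("*", cand.name):
                    selected[cand.pre] = cand
        # A filter holds if its path selects at least one node from the candidate.
        nodes = [selected[pre] for pre in sorted(selected)
                 if all(evaluate(selected[pre], f) for f in filters)]
    return nodes

root = build_tree("<a><b><c/></b><c/><b/></a>")
query = [("descendant", "a", [[("child", "b", []), ("child", "c", [])]]),
         ("child", "c", [])]              # hand-parsed form of //a[./b/c]/c
for node in evaluate(root, query):
    print(node.pre)                       # prints 4
```

Recomputing descendants for every context node can take quadratic time on deep documents; a top-down traversal avoids that, at the cost of slightly more bookkeeping.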

Hints for handling star ("*"): If you want to precompute the KMP table, then in the presence of stars ("*") you need
to compute several tables (essentially, instantiate each star with every possible letter that occurs in the query). Thus, if there
are several stars, you will have many tables. Another possibility is to compute the table dynamically during matching,
when it is clear which letter appears at each star position.
If you follow the automaton approach, then you need to split
a star transition into several transitions (each to its own state), one for each letter that occurs in the query.
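
The automaton approach for a Simple Path such as //a/*/b can be sketched in Python as follows. This is a plain subset construction over the positions of the pattern, not necessarily the construction from the lecture. Each star is expanded into one transition per letter of the query, plus one extra transition for all letters that do not occur in the query. That extra OTHER symbol, and the names build_dfa and matches, are assumptions of this sketch.

```python
OTHER = "?"  # assumed placeholder for any element name that is not in the query

def build_dfa(pattern):
    """pattern is the list of node tests, e.g. ['a', '*', 'b'] for //a/*/b.

    A DFA state is the set of pattern prefixes (by length) that currently match.
    Position 0 is always included because "//" lets a match start anywhere.
    """
    n = len(pattern)
    alphabet = sorted(set(pattern) - {"*"}) + [OTHER]
    start = frozenset({0})
    table, todo = {}, [start]
    while todo:
        state = todo.pop()
        if state in table:
            continue
        table[state] = {}
        for letter in alphabet:
            nxt = {0}
            for i in state:
                # A star matches every letter, including OTHER.
                if i < n and pattern[i] in ("*", letter):
                    nxt.add(i + 1)
            table[state][letter] = frozenset(nxt)
            todo.append(frozenset(nxt))
    return start, table

def matches(pattern, labels):
    """True if the root-to-node path labels ends in a match of the pattern."""
    state, table = build_dfa(pattern)
    for name in labels:
        state = table[state][name if name in table[state] else OTHER]
    return len(pattern) in state

print(matches(["a", "*", "b"], ["x", "a", "c", "b"]))  # True
print(matches(["a", "*", "b"], ["a", "b"]))            # False
```

During a document traversal one would keep a stack of DFA states, one per open element, instead of re-running the automaton from the root for every node.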

How to Print Query Results:

The result nodes of a given query should be reported in document order, by printing the pre-order numbers of the selected nodes.
Each pre-order number should be printed on a separate line. The "virtual" root node has pre-order number 0;
the first actual element node of the document has pre-order number 1.

Consider the following XML file, called "h.xml".

<a><b></b><c></c><b></b></a>

Here are examples of what your program should print as result, assuming the executable is called XPathEval.
Note that we will always enclose the query in quotes ("") when we test your program.

> XPathEval h.xml "/a/b"
2
4

> XPathEval h.xml "//c"
3

> XPathEval h.xml "//*"
1
2
3
4
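
The numbers in the examples above follow from counting start tags in the order in which they appear. A minimal Python sketch of this numbering for h.xml, using xml.sax (the class name PreorderNumberer is hypothetical):

```python
import xml.sax

class PreorderNumberer(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.next_number = 1   # 0 is reserved for the virtual root
        self.numbered = []     # (pre-order number, element name) pairs

    def startElement(self, name, attrs):
        # Start tags arrive in document order, which is pre-order.
        self.numbered.append((self.next_number, name))
        self.next_number += 1

handler = PreorderNumberer()
xml.sax.parseString(b"<a><b></b><c></c><b></b></a>", handler)
for number, name in handler.numbered:
    print(number, name)
```

This prints 1 a, 2 b, 3 c, 4 b, one pair per line, which agrees with the results shown for "/a/b", "//c", and "//*".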

More Sample Runs

For this document, results are provided for the following queries:

simpleRoot1=/*/a/b/*
simpleRoot2=/b/b/c
simpleRoot3=/*/*/*/*/*/*/*/*/b/b/c
Simple1=//a/c/*
Simple2=//b/c/b
thirdPart1=//a[./b/c]
thirdPart2=//a[./b]//c[./c/b]
fourthPart1=//a[./parent::b]//b[./ancestor::a/parent::c]
fourthPart2=//b[./parent::b/parent::a/parent::b]/c/*[./ancestor::a]
fifthPart1=//a[./parent::b]/following-sibling::b[./ancestor::a/preceding-sibling::c]
fifthPart2=/*//a/following-sibling::c[./ancestor::a/parent::b]//*[./preceding-sibling::c]

