Assignment 5

XPath to SQL Translation
Due Date: June 3rd, 2010




For a starting point, here is a sample solution to Assignment 3.

In this assignment we want to translate XPath queries that use /, //, following, preceding, following-sibling, and filters ([..]) into SQL queries.
In a filter there is only one path expression which starts with "." or with "/".
Filter are not nested, but may appear in sequence, e.g., //a[./b][//c]//d.
Node tests are restricted to text(), or star ("*"), or one letter (e.g., "a").
Your program will receive as input the name of an XML file, and the XPath expression (written in quotes "/a//b/...").
The program should generate as output a sequence of SQL INSERT-statements, followed by one SQL query.
The INSERT-statements generate a table representation of the XML file. For instance, you can generate
a PRE/POST/PARENT-table as in Assignment 3. You may also add other columns to that table, if you feel appropriate.
The SQL query should produce as results exactly the pre-number of the nodes that are selected by the given XPath query.

Your program only needs to output text, as shown below. You do not need to interfere with the database.
If the output text is fed into mysql, as shown below, then it must produce the correct pre-numbers selected by the XPath query.

Note that we do not care about attributes here. So you do not need to represent them in a table,
but may simply ignore them.

If your program works for //, following, and preceding, then this is worth 3 Points.
If your program works for / and following-sibling, then that is worth 3 Points.
If additionally filters are supported then that is worth another 4 Points.

Bonus You can earn 2 bonus points, if you manage to add in filters (1) equality test of string values (2) the contains-predicate.

As an example, consider the small "book.xml" document from Assignment 1:

<book isbn="1-2345-6789-0" year="1994">
<title>TCP/IP Illustrated</title>
<author><last>Stevens</last><first>John</first></author>
<publisher>Addison-Wesley</publisher>
<price currency="USD">65.95</price>
</book>

As with Assignment 3, you should generate a table with name z1234567_book_tbl, where the number after the "z" is your student ID.
Do not forget to insert the additional root node with PRE-value 0.

Your program should print out text, similar to the one below (assuming the executable is named "XPath2Rel"):

> XPath2Rel book.xml "//author"
CREATE TABLE z1234567_book_tbl (
  pre INTEGER PRIMARY KEY,
  post INTEGER,
  parent INTEGER,
  tag VARCHAR(256),
  text VARCHAR(1000)
);

INSERT INTO z1234567_book_tbl (pre, post, parent, tag, text)
VALUES (0, 13, null, null, null);

INSERT INTO z1234567_book_tbl (pre, post, parent, tag, text)
VALUES (1, 12, 0, "book", null);

INSERT INTO z1234567_book_tbl (pre, post, parent, tag, text)
VALUES (2, 2, 1, "title", null);

INSERT INTO z1234567_book_tbl (pre, post, parent, tag, text)
VALUES (3, 1, 2, null, "TCP/IP Illustrated");

INSERT INTO z1234567_book_tbl (pre, post, parent, tag, text)
VALUES (4, 7, 1, "author", null);

INSERT INTO z1234567_book_tbl (pre, post, parent, tag, text)
VALUES (5, 4, 4, "last", null);

INSERT INTO z1234567_book_tbl (pre, post, parent, tag, text)
VALUES (6, 3, 5, null, "Stevens");

INSERT INTO z1234567_book_tbl (pre, post, parent, tag, text)
VALUES (7, 6, 4, "first", null);

INSERT INTO z1234567_book_tbl (pre, post, parent, tag, text)
VALUES (8, 5, 7, null, "John");

INSERT INTO z1234567_book_tbl (pre, post, parent, tag, text)
VALUES (9, 9, 1, "publisher", null);

INSERT INTO z1234567_book_tbl (pre, post, parent, tag, text)
VALUES (10, 8, 9, null, "Addison-Wesley");

INSERT INTO z1234567_book_tbl (pre, post, parent, tag, text)
VALUES (11, 11, 1, "price", null);

INSERT INTO z1234567_book_tbl (pre, post, parent, tag, text)
VALUES (12, 10, 11, null, "65.95");

SELECT DISTINCT r2.pre FROM z1234567_book_tbl r1, z1234567_book_tbl r2
WHERE r1.pre=0
AND r2.pre>r1.pre
AND r2.post<r1.post
AND r2.tag="author"
ORDER BY r2.pre;

DROP TABLE z1234567_book_tbl;

When we execute these SQL statements on the database, then the following output is obtained:

> XPath2Rel book.xml "//author" | mysql -h cs4317.srvr -u stu1 -pstu1pass studentdb1
pre
4
>

NOTE it is not important that you produce exactly the insert statements or query as above, the only thing that matters
is that the result of your query is the correct sequence of pre-numbers!
Thus, for the above example, the query must produce "4", because that is the pre-number of the only author-node.
As with Assignment 3, use the following logins for mysql on CSE machines:

server: cs4317.srvr
database: studentdb1
user: stu1
password: stu1pass

More Examples

Note that the wildcard ("*") only selects element nodes, and NOT text nodes.
Thus, you have to explicitly check in the query that nodes selected by "*" are element nodes.
For instance, if we translate the query "//*" over the document of above, then we should obtain this query:
SELECT DISTINCT r2.pre FROM book_tbl r1, book_tbl r2
WHERE r1.pre=0
AND r2.pre>r1.pre
AND r2.post<r1.post
AND r2.tag IS NOT NULL
ORDER BY r2.pre;
Executing the above query gives:

> XPath2Rel book.xml "//*" | mysql -h cs4317.srvr -u stu1 -pstu1pass studentdb1
pre
1
2
4
5
7
9
11
>

Here is an example that uses // and preceding. Remember, you are allowed to generated SQL queries different from the one below, as long as the results are the same!

XPath: "//a//b//preceding::*//d//preceding::e"
SELECT DISTINCT r8.pre
FROM z12345678_book_tbl r1, z12345678_book_tbl r2, z12345678_book_tbl r3, z12345678_book_tbl r4, z12345678_book_tbl r5, z12345678_book_tbl r6, z12345678_book_tbl r7, z12345678_book_tbl r8
WHERE r1.pre = 0
AND r2.pre > r1.pre AND r2.post < r1.post AND r2.tag = "a"
AND r3.pre > r2.pre AND r3.post < r2.post AND r3.tag = "b"
AND r4.pre >= r3.pre AND r4.post <= r3.post
AND r5.pre < r4.pre AND r5.post < r4.post AND r5.tag IS NOT NULL
AND r6.pre > r5.pre AND r6.post < r5.post AND r6.tag = "d"
AND r7.pre >= r6.pre AND r7.post <= r6.post
AND r8.pre < r7.pre AND r8.post < r7.post AND r8.tag = "e"
ORDERED BY r8.pre;

Here are some examples on book.xml document; recall, however, that your program need only
support 1-letter node tests, that is, we will not use queries as below which are using longer element names.

XPath: "//book/author[./first]"
SELECT DISTINCT r3.pre
FROM z12345678_book_tbl r1, z12345678_book_tbl r2, z12345678_book_tbl r3
WHERE r1.pre = 0
AND r2.pre > r1.pre AND r2.post < r1.post AND r2.tag = "book"
AND r3.parent = r2.pre AND r3.tag = "author"
AND EXISTS
(SELECT DISTINCT sr1.pre
FROM z12345678_book_tbl sr1
WHERE sr1.parent = r3.pre AND sr1.tag = "first")
ORDER BY r3.pre;

XPath: "//*[/book/price]"
SELECT DISTINCT r2.pre
FROM z12345678_book_tbl r1, z12345678_book_tbl r2
WHERE r1.pre = 0
AND r2.pre > r1.pre AND r2.post < r1.post AND r2.tag IS NOT NULL
AND EXISTS
(SELECT DISTINCT sr2.pre
FROM z12345678_book_tbl sr1, z12345678_book_tbl sr2
WHERE sr1.parent = r1.pre AND sr1.tag = "book"
AND sr2.parent = sr1.pre AND sr2.tag = "price")
ORDER BY r2.pre;

XPath: "//*/text()"
SELECT DISTINCT r3.pre
FROM z12345678_book_tbl r1, z12345678_book_tbl r2, z12345678_book_tbl r3
WHERE r1.pre = 0
AND r2.pre > r1.pre AND r2.post < r1.post AND r2.tag IS NOT NULL
AND r3.parent = r2.pre AND r3.tag IS NULL
ORDER BY r3.pre;

XPath: "/book/title/following-sibling::*/text()"
SELECT DISTINCT r5.pre
FROM z12345678_book_tbl r1, z12345678_book_tbl r2, z12345678_book_tbl r3, z12345678_book_tbl r4, z12345678_book_tbl r5
WHERE r1.pre = 0
AND r2.parent = r1.pre AND r2.tag = "book"
AND r3.parent = r2.pre AND r3.tag = "title"
AND r4.parent = r3.parent AND r4.pre > r3.pre AND r4.tag IS NOT NULL
AND r5.parent = r4.pre AND r5.text IS NOT NULL
ORDER BY r5.pre;

XPath: "//*[. = "StevensJohn"]"
SELECT DISTINCT r2.pre
FROM z12345678_book_tbl r1, z12345678_book_tbl r2
WHERE r1.pre = 0
AND r2.pre > r1.pre AND r2.post < r1.post
AND
(SELECT GROUP_CONCAT(text separator '')
FROM z12345678_book_tbl sr1
WHERE sr1.pre > r2.pre AND sr1.post < r2.post) = 'StevensJohn'
ORDER BY r2.pre;

Some sample XML files: books.xml s200.xml u1000.xml
CRICOS Provider Number: 00098G