Wei Wang     (PhD, HKUST, 2004)

Final Year Projects For Honor Students

If you are interested in doing your final year project on databases, please feel free to drop by and talk to me.

Below is a non-exhaustive list of projects. Of course, you are welcome to propose your own projects and discuss them with me.

All the projects are research-oriented. The minuses are that you need to read a lot, think a lot and implement a solution. The pluses are that you will be exposed to the frontier of database research, learn how to do database research and see how a single observation or idea can make a difference.

List of Projects:

  1. Efficient Processing of XSLT Queries over Large XML Document Repositories
  2. Efficient Query Processing over Graph-structured XML Data
  3. Intelligent Text Mining
  4. Approximate Aggregate Query Processing in the Data Warehouse Environment


1. Efficient Processing of XSLT Queries over Large XML Document Repositories

For a light-weight introduction to XML, please click here.

XSLT stands for XSL Transformations. It is a W3C Recommendation. XSLT is a (query) language for transforming XML documents into other XML documents.

The prevalent usage of XSLT is to transform XML documents into customized HTML pages which are displayed on the Web. As a metaphor in the relational database context, one can think of the original XML documents as the relational tables in the RDBMS, and think of the resultant HTML pages as the views defined over the tables. Most of the advantages of relational views carry over to XSLT-based solutions in the XML context.

However, most current XSLT evaluation engines employ only naive methods; in particular, they use no database techniques. This project therefore aims to investigate the possibility of constructing high-performance XSLT processing engines using traditional and novel database techniques.

A good understanding of relational query processing and optimization techniques is required. Familiarity with a functional programming language is a plus.

References:

  1. XML Bibliography Site by Vassilis Christophides.
  2. The Sarvega XSLT Benchmark Study.


2. Efficient Query Processing over Graph-structured XML Data

For a light-weight introduction to XML, please click here.

Currently, the most commonly used model for an XML document is an ordered, node-labeled tree. However, much data can only be modeled as graphs. For example, in the IMDB movie database, the same actor appears in multiple movies. If we model movies and actors as nodes, an actor node will therefore have multiple incoming edges. In XML, we can use IDREF or XPointer/XLink to represent the graph-structured IMDB movie database. The graph model helps us to formulate interesting queries easily and execute them efficiently.
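To make the IDREF idea concrete, here is a minimal sketch in Python. The toy document, the element names (db, actor, movie) and the attribute names (id, cast) are hypothetical and are not taken from the real IMDB data. Note also that the standard-library parser does not read a DTD, so treating "id" as an ID and "cast" as IDREFS is a convention assumed by this sketch.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical toy document; names are illustrative only.
DOC = """
<db>
  <actor id="a1"><name>Actor One</name></actor>
  <actor id="a2"><name>Actor Two</name></actor>
  <movie id="m1" cast="a1 a2"><title>Movie One</title></movie>
  <movie id="m2" cast="a1"><title>Movie Two</title></movie>
</db>
"""

def build_reference_graph(xml_text):
    """Return (elements by id, incoming reference edges per target id)."""
    root = ET.fromstring(xml_text)
    # Every element carrying an "id" attribute becomes a graph node.
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    incoming = defaultdict(list)
    for el in root.iter():
        # "cast" is treated as a whitespace-separated list of references.
        for target in el.get("cast", "").split():
            if target in by_id:
                incoming[target].append(el.get("id"))
    return by_id, dict(incoming)

by_id, incoming = build_reference_graph(DOC)
for actor_id, movies in sorted(incoming.items()):
    print(actor_id, "is referenced by", movies)
```

The actor with id "a1" ends up with two incoming edges, so the data can no longer be described as a tree once the references are followed.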

However, not much work has been devoted to studying query processing and optimization for such graph-structured XML data. Even worse, existing techniques for tree-structured XML data fail to work in this more general model. This project therefore aims to revisit query processing and optimization for graph-structured XML data. In particular, we will start with the storage and indexing issues before moving on to tackle the difficult query processing issues.

Good knowledge of discrete mathematics (including algebra and graph theory) is required.

References:

  1. XML Bibliography Site by Vassilis Christophides.


3. Intelligent Text Mining

To have a feeling of text mining and its research, please click here.

Data mining is another hot research area with numerous applications. For instance, if you receive a promotional email or letter, you have probably been "classified" as a potential customer for those goods/services/... by a data mining tool. Text mining deals with textual data, either plain text documents or text documents with structure (say, a LaTeX source file). It is also related to "Web Mining", where we have additional pointers among HTML pages.

However, the results produced by current text mining tools are often unsatisfactory. One fundamental reason might be that their methods are based on the "bag of words" model, in which each document is viewed as a bag (i.e., multiset) of words. This model is simple and thus enables efficient processing of large collections of documents; however, it ignores important linguistic characteristics as well as semantic meaning, which can lead to imprecise results. Google is a good example of the advantages and disadvantages of the model.
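The limitation can be seen in a few lines of Python. This is only a sketch: the tokenizer (lower-casing and keeping runs of the letters a-z) and the function name bag_of_words are assumptions made for illustration, not part of any particular text mining tool.

```python
import re
from collections import Counter

def bag_of_words(text):
    """View a document as a multiset of lower-cased words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

a = bag_of_words("The dog bites the man")
b = bag_of_words("The man bites the dog")

# Word order is discarded, so the two documents become indistinguishable.
print(a == b)
```

The two sentences describe opposite events, yet they map to the same bag, which is exactly the kind of linguistic information the model throws away.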

Therefore, this project aims at developing advanced models, effective methods and novel applications that distill and extract meaningful information from large collections of text documents.

Good knowledge of data structures and algorithms is indispensable. Knowledge of data/text mining, information retrieval, machine learning/artificial intelligence is definitely useful.

References:

  1. A Roadmap to Text Mining and Web Mining by U. Y. Nahm.
  2. XML Bibliography Site by Vassilis Christophides.
  3. XKeyword Project from UCSD and its online demo.


4. Approximate Aggregate Query Processing in the Data Warehouse Environment

A Data Warehouse is a copy of transaction data specifically structured for querying and reporting. It is designed for managers and decision makers to extract information quickly and easily in order to answer questions about their business. As such, data warehouses are the cornerstone of modern DSS (Decision Support System) and BI (Business Intelligence) systems.

Data warehouses have unique features that set them apart from conventional relational databases. For instance, while normalization theory is central to relational database design, tables in a data warehouse are typically organized in a highly de-normalized way!

Approximate query processing is yet another feature of the data warehouse environment. "The total sales of electronics in Australia is about 1.2 million dollars" is an example of an approximate answer. As far as decision making is concerned, it is often not necessary to be precise to the dollar.

Answering queries approximately requires very different techniques. For example, if we keep a small random sample of the original (huge) table, we can give an approximate answer to the above aggregate SUM query such that, with high probability, the difference from the true SUM value is small.
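The sampling idea can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any particular system: the table is a synthetic list of sales amounts, the sample is drawn uniformly without replacement, and the error margin uses a normal approximation that ignores the finite-population correction. The function name approximate_sum is hypothetical.

```python
import random
import statistics

def approximate_sum(table, sample_size, rng):
    """Estimate SUM(table) from a uniform random sample.

    Returns (estimate, margin), where margin is a rough 95% error bound
    based on a normal approximation (an assumption of this sketch).
    """
    n = len(table)
    sample = rng.sample(table, sample_size)
    # Scale the sample mean up to the size of the full table.
    estimate = n * statistics.fmean(sample)
    std_err = n * statistics.stdev(sample) / sample_size ** 0.5
    return estimate, 1.96 * std_err

rng = random.Random(42)
# Synthetic "sales" column standing in for a huge fact table.
sales = [rng.uniform(0, 100) for _ in range(100_000)]

estimate, margin = approximate_sum(sales, 1_000, rng)
exact = sum(sales)
print(f"exact={exact:.0f} estimate={estimate:.0f} +/- {margin:.0f}")
```

Here only 1% of the rows are read, and the margin shrinks roughly with the square root of the sample size, which is the trade-off between cost and accuracy that this project is about.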

This project aims at investigating approximate aggregate query processing in a data warehouse environment. Students are expected to integrate and improve existing approximate query processing techniques, as well as to discover novel ones.

Good knowledge of probability theory and statistics is required. Students with a good mathematics background and/or knowledge of data streams are preferred.

Reference:

  1. Data Warehousing and OLAP: A Research-Oriented Bibliography.


Last modified: Tue May 25 12:35:23 EST 2004
