Thesis Topic Details

Topic ID:
Efficient search on multi-lingual documents
Raymond Wong
Research Area:
Information Retrieval, Web Applications, Algorithms
Associated Staff
Wei Wang
Topic Details
R & D
Group Suitable:
Good knowledge on data structures plus one programming language (prefer Java or C/C++)
Large character sets (for example, Chinese, Vietnamese, Japanese and Korean) have several thousands of characters. Since an 8 bit byte can only represent 256 code points, we need more space to represent these large character sets.

Wide character encodings, which are fixed width, use correspondingly more bytes to hold the code points of the encoding being used. For example, 4 bytes can represent 4,294,967,296 unique code points. In fact, most Microsoft Windows application programming interfaces, as well as the Java and .Net Framework platforms, require that wide char- acter variables be defined as 16-bit values, and that charac- ters be encoded using UTF-16. Modern Unix-like systems generally require that 32 bit values are encoded using UTF- 32.

On the other hand, multi-byte encodings (usually variable length) use a sequence of bytes to represent a code point that cannot be represented in 8 bits. The first use of multi-byte encodings was in fact for the encoding of Chinese, Japanese and Korean, which have large character sets that are well in excess of 256 characters. Multi-byte encodings can be made more space efficient, but at the cost of increased complex- ity of processing. Furthermore, existing software that can handle 8 bit wide chunks can generally be ported to handle multi-byte encodings with relative ease, since each "chunk" in a multi-byte encoding is typically 8 bits in width.
As more and more information is encoded in these wider encodings, being able to search keywords from texts encoded in these encodings is vital.

Recently, the Burrows Wheeler transform (BWT) has become popular in text compression, full-text search, indexing and compressing XML documents, and DNA sequence matching. In particular, full-text search on BWT encoded documents can be performed very efficiently using backward search.

This project will extend our previous efforts on using BWT to search for multi-byte encoded texts to become a full multi-byte, compact search engine. Related documentation and source code can be found at:
Past Student Reports
No Reports Available. Contact the supervisor for more information.

Check out all available reports in the CSE Thesis Report Library.

NOTE: only current CSE students can login to view and select reports to download.