Learning-based Transformation for Text Documents

Liping Ma, John Shepherd, Raymond Wong

Content- and Semantic-based Information Retrieval, held in conjunction with 6th World Multi-conference on Systemics, Cybernetics, and Informatics (SCI 2002),
Orlando, Florida, July 2002



This paper presents a method for automatically transforming semistructured (not necessarily tagged) text documents into content-tagged documents, using techniques from machine learning and computational linguistics. The method has two phases. First, a learning-based segmentation module extracts regions and sequences from the documents. Second, a transformation-based learning (TBL) translator converts the region-marked documents to XML; the translator remains effective even with a small set of training examples.
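
To make the second phase more concrete, the following is a minimal sketch of transformation-based learning in the general style of Brill's algorithm. It is not the translator described in the paper. The region labels (TITLE, AUTHOR, BODY), the single rule template ("change tag A to B when the previous tag is C"), and all function names are assumptions chosen for illustration.

```python
from collections import Counter
from xml.sax.saxutils import escape


def apply_rule(tags, rule):
    """Rewrite from_tag to to_tag wherever the preceding tag is prev_tag."""
    from_tag, to_tag, prev_tag = rule
    return [
        to_tag if i > 0 and tag == from_tag and tags[i - 1] == prev_tag else tag
        for i, tag in enumerate(tags)
    ]


def count_errors(tags, gold):
    """Number of positions where the current tags disagree with the reference."""
    return sum(t != g for t, g in zip(tags, gold))


def learn_rules(initial, gold, max_rules=10):
    """Greedily pick the rule with the largest net error reduction, then repeat."""
    current = [list(seq) for seq in initial]
    rules = []
    for _ in range(max_rules):
        # Propose one candidate rule for every remaining error.
        candidates = Counter()
        for cur, ref in zip(current, gold):
            for i in range(1, len(cur)):
                if cur[i] != ref[i]:
                    candidates[(cur[i], ref[i], cur[i - 1])] += 1
        # Score candidates (most frequently proposed first) on the training set.
        best, best_gain = None, 0
        for rule, _ in candidates.most_common():
            gain = sum(
                count_errors(cur, ref) - count_errors(apply_rule(cur, rule), ref)
                for cur, ref in zip(current, gold)
            )
            if gain > best_gain:
                best, best_gain = rule, gain
        if best is None:  # no rule improves the tagging any further
            break
        rules.append(best)
        current = [apply_rule(cur, best) for cur in current]
    return rules


def tag(tags, rules):
    """Apply the learned rules in the order they were learned."""
    for rule in rules:
        tags = apply_rule(tags, rule)
    return tags


def to_xml(regions, tags):
    """Wrap each region's text in an element named after its (hypothetical) tag."""
    return "".join(
        f"<{t.lower()}>{escape(r)}</{t.lower()}>" for r, t in zip(regions, tags)
    )


# Toy training data: a baseline that labels every non-title region BODY.
initial = [["TITLE", "BODY", "BODY", "BODY"], ["TITLE", "BODY", "BODY"]]
gold = [["TITLE", "AUTHOR", "BODY", "BODY"], ["TITLE", "AUTHOR", "BODY"]]
rules = learn_rules(initial, gold)

regions = ["Learning & Tagging", "A. Author", "First paragraph."]
labels = tag(["TITLE", "BODY", "BODY"], rules)
xml = to_xml(regions, labels)
```

On this toy data the learner finds a single rule, relabelling a BODY region as AUTHOR when it directly follows a TITLE region. Two training sequences are enough to learn it, which loosely illustrates why a rule-based learner can work with few examples.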

Keywords: information extraction, machine learning, semistructured data

