Learning-based Transformation for Text Documents
Liping Ma,
John Shepherd,
Raymond Wong
Content- and Semantic-based Information Retrieval,
held in conjunction with
6th World Multi-conference on Systemics, Cybernetics, and Informatics (SCI 2002),
Orlando, Florida, July 2002
This paper presents a method for automatically transforming
semi-structured (not necessarily tagged) text documents
into content-tagged documents, based
on techniques from machine learning and computational
linguistics. The method consists of two phases. First, a
learning-based segmentation module extracts
regions and sequences from the documents. Second,
a transformation-based learning (TBL) translator converts
the region-marked documents to XML; this translator
remains effective even when trained on a small set of
examples.
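
The abstract does not give the translator's rule templates, so the following is only a minimal sketch of the general transformation-based learning idea: start from a crude initial labelling of regions, then greedily learn an ordered list of correction rules, each chosen because it removes the most remaining errors on the training data. The region labels (TITLE, AUTHOR, TEXT), the single rule template "change label old to new when the preceding label is prev", and the function names are all assumptions made for illustration; they are not taken from the paper.

```python
from itertools import product

def apply_rule(rule, labels):
    """Change label `old` to `new` wherever the preceding label is `prev`."""
    prev, old, new = rule
    out = list(labels)
    for i in range(1, len(labels)):
        # Conditions are read from the input sequence, so the rule fires
        # everywhere at once instead of cascading left to right.
        if labels[i] == old and labels[i - 1] == prev:
            out[i] = new
    return out

def errors(labels, gold):
    """Count positions where the current labels disagree with the gold labels."""
    return sum(a != b for a, b in zip(labels, gold))

def learn_tbl(initial, gold, tagset, max_rules=10):
    """Greedily learn an ordered rule list (hypothetical helper)."""
    current = [list(seq) for seq in initial]
    learned = []
    for _ in range(max_rules):
        best_rule = None
        best_err = sum(errors(c, g) for c, g in zip(current, gold))
        # Try every instance of the single assumed rule template.
        for prev, old, new in product(tagset, repeat=3):
            if old == new:
                continue
            rule = (prev, old, new)
            err = sum(errors(apply_rule(rule, c), g)
                      for c, g in zip(current, gold))
            if err < best_err:
                best_rule, best_err = rule, err
        if best_rule is None:  # no rule reduces the error any further
            break
        learned.append(best_rule)
        current = [apply_rule(best_rule, c) for c in current]
    return learned

def tag(labels, rules):
    """Apply the learned rules in the order they were learned."""
    for rule in rules:
        labels = apply_rule(rule, labels)
    return labels

# Toy training data: one document, initially labelled almost entirely TEXT.
tagset = ["TITLE", "AUTHOR", "TEXT"]
initial = [["TITLE", "TEXT", "TEXT", "TEXT", "TEXT"]]
gold = [["TITLE", "AUTHOR", "AUTHOR", "TEXT", "TEXT"]]

rules = learn_tbl(initial, gold, tagset)
relabelled = tag(["TITLE", "TEXT", "TEXT", "TEXT"], rules)
```

In this toy run a single training document is enough to learn two ordered rules, which gives a small-scale feel for why a rule-based learner of this kind can work with few examples; it says nothing about the accuracy reported in the paper.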
Keywords:
information extraction, machine learning, semi-structured data