Most concept learners are propositional. However, there are many domains that do not easily fit in this model, for example: multivariate time series, optical character recognition, sequence recognition, basket analysis and web logs. Consequently, researchers hoping to use concept learners on these domains have few choices: apply ad-hoc preprocessing, write a learner specifically designed for the domain, or use a learner with a more powerful representation, such as relational learning or graph-based induction.
Each of these has problems. Ad-hoc preprocessing is frequently used. A review of the UCI repository [Blake & Merz, 1998] reveals that there are at least seven domains that are not in a propositional form, but are preprocessed into it1. Writing a custom learner works, but is labour-intensive. Relational learning techniques tend to be very sensitive to noise and to the particular clausal representation selected. They are typically unable to process large data sets in a reasonable time frame, and/or require the user to set limits on the search such as refinement rules [Cohen, 1995]. Furthermore, their most powerful feature - the use of relations - is rarely used; it is their more expressive observation language that is used.
Therefore, we propose an generic preprocessing technique that expands the observation language to allow for domains where instances exhibit recurring substructures. For example, with Chinese character recognition, the recurring substructure is a stroke. These substructures are extracted, and a novel clustering algorithm is used to construct synthetic attributes based on the presence or absence of certain substructures. Standard propositional learners can then be applied. Not only is coverage expanded, but concept descriptions can be generated for many domains when the only previously applicable techniques - typically hidden Markov models or other statistical approaches - were not able to. Furthermore, these concepts are expressed using the same substructures identified by the user. Since these substructures are frequently the same concepts humans use themselves in classifying instances, this results in very readable descriptions.
We begin with an overview of applying metafeatures. A more formal explanation is given, and some refinements of metafeature application are discussed. Experimental results are presented, and we review related work. Finally, we conclude and make some suggestions for future work.