The formal definition of a metafeature is very simple. It consists of:
For example, each of the entries in Table 2 is
an instantiated feature for the metafeature LoudRun, and
Figure 1 shows the parameter space. A
simplified version of extraction function is:
Once these features are extracted, all of the instantiated features from all of the training instances are plotted in the parameter space. Selection of appropriate synthetic features is then performed. In Figure 2, the space was segmented into three regions based on the selection of three centroids. We did not discuss how we arrived at these; in fact, A, B and C were arrived at by applying K-means clustering.
However, clustering is not the most appropriate means of selecting centroids. Our objective is to create useful and discriminative features for learning, and clustering instances regardless of class may not produce discriminative features. For example, if each cluster has a uniform distribution of classes, then learning will be difficult since there is no basis for discrimination between classes.
Hence a more effective approach is to segment the space in a manner directed towards generating features that are specifically discriminative. We term this approach directed segmentation. One way to accomplish this is to choose a segmentation that separates the instantiated features so that in any one region, there are only instances belonging to a single class. An example of this is shown in Figure 4 shows an example of such a segmentation. Clearly, a segmentation such as this would make work easier for the learner.
However, there are three problems with this approach. The first is that it may not be possible to segment the space in such a way. Secondly, there is an obvious segmentation that is totally discriminatory: make each instantiated feature a centroid; while this is possible, it would perhaps generate too many features for the learner to use. Both of these problems are addressed by realising that a good segmentation has the same properties a good decision tree has: it segments the data into regions with disparate class distributions. Hence, any information-based measure, such as gain, gain ratio, or chi-squared distribution can be applied; simply treating each region as a ``split'' and using the class distribution of each region to calculate the information-based measure.
The third problem is that there is no obvious search mechanism to find a good directed segmentation. Hence in this work, we use a very simple algorithm: select a random subset with, say, between 2 and 10 instantiated features, calculate the segmentation and evaluate the disparity measure. This process is repeated, retaining the best segmentation seen so far. Although this sounds very rough, such random algorithms have been used before with some success in machine learning (e.g. ashwin:aleph), and indeed is the subject of a field of study known as ordinal optimisation [Ho et al., 1992]. It is also easy to show that if we examine 1000 random subsets, there is a better than 99.99% chance that a segmentation will be found that is in the top 1%.
Once the synthetic features have been selected, attribution is performed. Each synthetic feature becomes a column in the attribute vector for each instance. For each synthetic feature, the attribute value is set to true if it has an instance that lies in the region around that synthetic feature and false otherwise. This creates a feature vector which can be provided to a learner. Finally, we can run a learner on the output; or alternatively, combine this feature vector with the results of other metafeatures or conventional attribute values to make a single vector, and then use the learner.