The key insight to the application of metafeatures is realising that the distribution of instantiated features in the parameter space is probably not uniform. This can be exploited to create features that may help in classification.
An initial approach might be to use normal clustering in the parameter space; i.e., clustering that groups points in such a way that distance between points within the same cluster is small, and distances between points in different clusters is large. However, this is merely one approach to answering the real question that we are interested in: ``Which are the instantiated features whose presence or absence indicate that the instances belong to a particular class?'' The clustering approach can certainly be used to answer this question - the theory being that the different clusters represent the different types of instantiated features. This is an example of undirected segmentation.
However, a more refined approach is possible: one that takes into account the class of the instantiated features. One way to do this is to augment the parameter space with the class as an extra dimension. However, this would likely not work either, as the distance metric between classes is not obvious; and the weight given to the class label relative to other features in the parameter space is not easy to determine. Likely, such an algorithm would be caught between ignoring class completely, leading to traditional clustering, and paying too much attention to the class, leading to too many clusters.
Another approach is to re-pose the question as: ``Are there regions in the parameter space where the class distribution is significantly different to what is statistically expected?'' This question is very similar to typical problems of supervised classification; in particular, if we pose the question as ``Are there attribute-value pairs for which the class distribution is significantly different to what we would expect?'', then this is the same question that is the basis of many decision tree and rule induction systems. This is an example of directed segmentation: segmentation directed towards creating good features for learning.
This suggests that rather than looking at Figure 4.1 where class is completely ignored, a more appropriate way to look at the data is that shown in Figure 4.2.
Suddenly, one can see immediately what a learner could employ in order to make the classification. We see that all the instantiated features coming from training instances whose class label was Happy occupy the top left hand region of the space. However, there are some practical issues. How, given a parameter space, is a region defined?
The solution used in this work is as follows: We have to select a
group of synthetic events to query on, and we have to
divide the space into regions. Hence, one approach is to select a
number of instantiated features which we will term
centroids
. Let the set of all instantiated features be
. Let the
set of points we select as centroids be
, where there are
points selected. Define the set of regions
be defined as follows:
In other words, each region
is defined as the set of points in
the parameter space
for which
is the closest point (using
the distance metric
) in the set
. This is the
definition of a Voronoi diagram. A good review of Voronoi diagrams and
their (many) applications can be found in [AK00].
If these regions were selected randomly, it would be expected that the class distribution in each region would be similar to the global class distribution. However, if the distribution of instantiated features differed significantly from the global class distribution, then this would be interesting and useful for a machine learning algorithm. Asking whether a training stream has an instantiated feature within a region would be informative of class of the training stream that the instantiated feature came from.
In general, the objective is to find the subset of the instantiated
features that would give us the most ``different'' class distribution
for each region. This would then give something for the learner in the
subsequent stage to use for the basis for learning. However, one
might then suggest that the solution to this problem is simple: Make
- in other words, make each instantiated feature a centroid.
Then each region will have exactly one instantiated feature in it;
with a very non-random class distribution. Hence we need a measure
that ``trades off'' number of regions for size. We will term this
measure the disparity measure. The disparity measure measures
the difference between the expected and observed class distribution.
For now, let us assume that we have such a measure, say, called
. Then we are looking to find:
The above equation looks simple, but there is a
problem. The problem is the
in the above equation: this
is the set of all subsets of
. In general, if there are
elements
in
(i.e.
), then there are
elements in
. If we wanted to do an exhaustive search, we would
therefore have to look at a number of subsets that was exponential in
the number of instantiated features. For example, if there are only
100 elements in
, we would still have to check more than
possible sets. Even if we were to constrain it to a reasonable size,
say no more than 10 regions,
is still combinatorially large in
.
Still, it might be possible to search part of the feature space and
come up with a reasonable set
that is effective
in constructing features.