next up previous contents
Next: Disparity Measures Up: Metafeatures: A Novel Feature Previous: Extraction from a noisy   Contents


Using Metafeatures

The key insight to the application of metafeatures is realising that the distribution of instantiated features in the parameter space is probably not uniform. This can be exploited to create features that may help in classification.

An initial approach might be to use normal clustering in the parameter space; i.e., clustering that groups points in such a way that distance between points within the same cluster is small, and distances between points in different clusters is large. However, this is merely one approach to answering the real question that we are interested in: ``Which are the instantiated features whose presence or absence indicate that the instances belong to a particular class?'' The clustering approach can certainly be used to answer this question - the theory being that the different clusters represent the different types of instantiated features. This is an example of undirected segmentation.

However, a more refined approach is possible: one that takes into account the class of the instantiated features. One way to do this is to augment the parameter space with the class as an extra dimension. However, this would likely not work either, as the distance metric between classes is not obvious; and the weight given to the class label relative to other features in the parameter space is not easy to determine. Likely, such an algorithm would be caught between ignoring class completely, leading to traditional clustering, and paying too much attention to the class, leading to too many clusters.

Another approach is to re-pose the question as: ``Are there regions in the parameter space where the class distribution is significantly different to what is statistically expected?'' This question is very similar to typical problems of supervised classification; in particular, if we pose the question as ``Are there attribute-value pairs for which the class distribution is significantly different to what we would expect?'', then this is the same question that is the basis of many decision tree and rule induction systems. This is an example of directed segmentation: segmentation directed towards creating good features for learning.

This suggests that rather than looking at Figure 4.1 where class is completely ignored, a more appropriate way to look at the data is that shown in Figure 4.2.

Suddenly, one can see immediately what a learner could employ in order to make the classification. We see that all the instantiated features coming from training instances whose class label was Happy occupy the top left hand region of the space. However, there are some practical issues. How, given a parameter space, is a region defined?

The solution used in this work is as follows: We have to select a group of synthetic events to query on, and we have to divide the space into regions. Hence, one approach is to select a number of instantiated features which we will term centroids[*]. Let the set of all instantiated features be $ I$. Let the set of points we select as centroids be $ C = \{c_1, c_2, c_3, ...,
c_k\}$, where there are $ k$ points selected. Define the set of regions $ R = \{R_1, ..., R_k\}$ be defined as follows:

$\displaystyle R_i = \{ x \in P \vert \mathit{closest}(x, C) = c_i\}
$

and

$\displaystyle \mathrm{closest}(x, C) = \mathop{\rm argmin}_{c \in C} \mathit{dist}(x, c)
$

where $ \mathit{dist}$ is the distance metric we are using for the parameter space.

In other words, each region $ R_i$ is defined as the set of points in the parameter space $ P$ for which $ c_i$ is the closest point (using the distance metric $ \mathit{dist}$) in the set $ C$. This is the definition of a Voronoi diagram. A good review of Voronoi diagrams and their (many) applications can be found in [AK00].

If these regions were selected randomly, it would be expected that the class distribution in each region would be similar to the global class distribution. However, if the distribution of instantiated features differed significantly from the global class distribution, then this would be interesting and useful for a machine learning algorithm. Asking whether a training stream has an instantiated feature within a region would be informative of class of the training stream that the instantiated feature came from.

In general, the objective is to find the subset of the instantiated features that would give us the most ``different'' class distribution for each region. This would then give something for the learner in the subsequent stage to use for the basis for learning. However, one might then suggest that the solution to this problem is simple: Make $ C=I$ - in other words, make each instantiated feature a centroid. Then each region will have exactly one instantiated feature in it; with a very non-random class distribution. Hence we need a measure that ``trades off'' number of regions for size. We will term this measure the disparity measure. The disparity measure measures the difference between the expected and observed class distribution. For now, let us assume that we have such a measure, say, called $ \ensuremath{\mathit{DispMeas}}$. Then we are looking to find:

$\displaystyle R = \mathop{\rm argmax}_{C \in \mathbb{P}(I)} \ensuremath{\mathit{DispMeas}}(C)
$

In other words, we are looking to find the subset of $ I$ (the set of all instantiated features), C, for which the disparity measure is the greatest.

The above equation looks simple, but there is a problem. The problem is the $ \mathbb{P}(I)$ in the above equation: this is the set of all subsets of $ I$. In general, if there are $ n$ elements in $ I$ (i.e. $ n = \Vert I\Vert$), then there are $ 2^n$ elements in $ \mathbb{P}(I)$. If we wanted to do an exhaustive search, we would therefore have to look at a number of subsets that was exponential in the number of instantiated features. For example, if there are only 100 elements in $ I$, we would still have to check more than $ 10^{30}$ possible sets. Even if we were to constrain it to a reasonable size, say no more than 10 regions, $ C$ is still combinatorially large in $ I$.

Still, it might be possible to search part of the feature space and come up with a reasonable set $ C$ that is effective in constructing features.


next up previous contents
Next: Disparity Measures Up: Metafeatures: A Novel Feature Previous: Extraction from a noisy   Contents
Mohammed Waleed Kadous 2002-12-10