Another way to improve the accuracy and comprehensibility that has been explored in machine learning is the use of feature subset selection. Sometimes, using all the features available has a detrimental effect on accuracy when compared to using some subset of the features. This is because some features are dependent on other features - and hence, there is no need to include the feature - or are noisy. Furthermore, the more features there are, the more likely that some feature will randomly fit the data and hence the probability of overfitting increases. Thus removing these features would lead to both improved accuracy, and clearer descriptions of learnt concepts, since there are less features that the user has to deal with. Furthermore, since we are generating many hundreds of features, many of which are likely to be irrelevant, we would expect to benefit from finding some optimal subset of features.
However, there is a problem: feature subset selection is a difficult
problem. If there are
features, then there are
possible
feature subsets. Searching this space for a ``good'' set of features
can be extremely time-consuming. There are two possible approaches to
feature subset selection: the wrapper approach and the filter approach
[JKP94]. In the wrapper approach, the learner is
applied to subsets of features and tested on an hold-out set (or, if
execution time is not an issue, using cross-validation). From the
results of these tests, a good subset of features is selected. For
example, for forward selection, a classifier is built for each
feature individually; and the most accurate feature is ``accepted''
into the subset of good features. That feature is removed, and the
process is repeated, adding each of the remaining features and
evaluating its performance. The ``best'' set of two features is thus
created. This proceeds incrementally until a feature set with maximal
accuracy is achieved. Similarly, backward selection proceeds by
eliminating one feature at a time, finding the least beneficial
feature and eliminating it, and repeating the process, eliminating the
least accurate features until eliminating further features decreases
accuracy.
The other approach to feature subset selection is the ``filter''
approach, which looks at correlations between features and other
performance measures to decide a priori, independent of any
particular learner, what a good subset of features is. Although such
techniques do not perform nearly as well as the wrapper approach, they
are much faster. In general, if there are
features, the wrapper
approach requires
runs of the learner, whereas the filter
approach requires only one run of the learner and some simple
statistics on the dataset to be calculated.