
Issues with region description

A side effect of region segmentation is that every region is accepted as a synthetic feature. Consider Figure A.3, and suppose that one centroid $c_1$ is placed at (5, 0.35), $c_2$ at (10, 0.1) and $c_3$ at (40, -0.1).

It is clear that the centroids $c_1$ and $c_2$ make good synthetic features. But $c_3$, although important for a good segmentation (indeed, it is necessary to define the concept boundaries), does not itself make a useful synthetic feature. This is because the region around $c_3$ contains a mix of instances from different classes, so its usefulness as a discriminative attribute is not likely to be high. However, without $c_3$, all of the points formerly associated with $c_3$ would now be associated with $c_2$, including the mixed-class area centred around (40, -0.1). This would make $c_2$ a less useful discriminative feature. Hence, $c_3$ is useful overall because it makes other features highly discriminative, in particular $c_2$, but $c_3$ itself is not a discriminative feature.
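The effect can be illustrated with a small sketch. The data points below are hypothetical, chosen only to mimic the situation described (they are not the data of Figure A.3), and plain Euclidean distance to the nearest centroid is an assumption about how instances are associated with regions. The helper `region_purity` is likewise an illustrative name; it reports, for each region, the fraction of its instances that belong to the majority class.

```python
from collections import Counter
from math import dist

# Centroid positions taken from the text.
CENTROIDS = {"c1": (5, 0.35), "c2": (10, 0.1), "c3": (40, -0.1)}

# Hypothetical labelled instances (x, y, class): pure groups near c1 and c2,
# and a mixed-class group near c3.
POINTS = [
    (4, 0.40, "A"), (5, 0.30, "A"), (6, 0.35, "A"),
    (9, 0.10, "B"), (10, 0.05, "B"), (11, 0.15, "B"),
    (38, -0.10, "A"), (40, -0.15, "B"), (42, -0.05, "A"), (41, -0.10, "B"),
]

def region_purity(points, centroids):
    """Assign each point to its nearest centroid (Euclidean distance, an
    assumption) and return the majority-class fraction of each region."""
    regions = {name: [] for name in centroids}
    for x, y, label in points:
        nearest = min(centroids, key=lambda name: dist((x, y), centroids[name]))
        regions[nearest].append(label)
    return {
        name: max(Counter(labels).values()) / len(labels)
        for name, labels in regions.items()
        if labels  # skip regions that received no points
    }

# With all three centroids: c1 and c2 are pure, c3 is evenly mixed.
with_c3 = region_purity(POINTS, CENTROIDS)

# Without c3, the mixed instances fall into c2's region and dilute it.
without_c3 = region_purity(
    POINTS, {name: pos for name, pos in CENTROIDS.items() if name != "c3"}
)

print(with_c3)     # c1: 1.0, c2: 1.0, c3: 0.5
print(without_c3)  # c1: 1.0, c2: 5/7 (about 0.71)
```

On this toy data, the region of $c_3$ is only 50% pure, so it is a poor feature on its own, yet removing the centroid from the segmentation lowers the purity of $c_2$ from 1.0 to about 0.71.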

If the learner we are using were smart enough, it would eliminate this synthetic feature by itself. However, Quinlan points out [Qui93] that the more attributes are given to the learner, the greater the probability that one of them performs well on the training data but poorly on the test data. Hence, if we can remove $c_3$ before it reaches the learner, doing so would likely improve our classification performance. Using fewer features may also aid the production of more comprehensible rules, since it would likely lead to shallower trees.
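One simple way to see why extra attributes raise this risk is the following back-of-the-envelope calculation. It rests on a simplifying assumption that is not made in [Qui93]: each of $k$ uninformative attributes independently has the same small probability $p$ of looking predictive on the training data purely by chance. Under that assumption,

$$ P(\text{at least one spuriously good attribute}) = 1 - (1 - p)^k , $$

which increases with $k$. For instance, with the illustrative value $p = 0.05$, going from $k = 2$ to $k = 3$ such attributes raises this probability from about 0.098 to about 0.143.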


Mohammed Waleed Kadous 2002-12-10