Disparity measures are a recurring theme in supervised learning. For example, decision-tree building algorithms like C4.5 [Qui93] proceed by dividing the instances into two or more regions, based on attribute values, and use a disparity measure to choose among candidate divisions. The disparity measure used by C4.5 is the gain ratio; however, several other disparity measures are possible.
In fact, there is a whole area of study devoted to different
disparity measures. White and Liu [WL94] provide an extensive survey
of disparity measures, and we adopt their notation. There are many
similarities between decision tree induction (which is built on
segmentation based on attribute values) and the creation of regions
around centroids (which is built on segmentation based on a Euclidean
distance measure). The only adaptation we make is that rather than
talking about attributes, we use regions; and, indeed, this points to
an interesting analogy.
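The analogy can be made concrete with a minimal sketch, which is an illustration and not the implementation used in this work. It assigns each instance to the region of its nearest centroid under Euclidean distance, then scores the resulting partition with the gain ratio (information gain divided by split information), treating the region index exactly as a decision tree would treat an attribute value. The function names, sample points and centroids are hypothetical, and the additional heuristics that C4.5 applies when selecting splits are omitted.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of hashable labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def nearest_region(point, centroids):
    """Index of the centroid closest to `point` in Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda j: math.dist(point, centroids[j]))


def gain_ratio(classes, regions):
    """Information gain of the partition divided by its split information."""
    total = len(classes)
    # Group the class labels by the region each instance falls into.
    by_region = {}
    for cls, reg in zip(classes, regions):
        by_region.setdefault(reg, []).append(cls)
    # Expected class entropy remaining after the partition.
    remainder = sum(len(members) / total * entropy(members)
                    for members in by_region.values())
    gain = entropy(classes) - remainder
    # Split information is the entropy of the region assignment itself.
    split_info = entropy(regions)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical data: four 2-D instances, two classes, two centroids.
points = [(0.0, 0.1), (0.2, 0.0), (1.0, 1.1), (0.9, 1.0)]
classes = ["a", "a", "b", "b"]
centroids = [(0.0, 0.0), (1.0, 1.0)]

regions = [nearest_region(p, centroids) for p in points]
ratio = gain_ratio(classes, regions)
```

Here the regions separate the two classes perfectly, so the partition attains the highest gain ratio available for two equally sized regions; a partition whose regions each contain both classes in equal proportion would score zero.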
Suppose that we are dealing with a problem with several classes and
that we have several different regions. Then Table 4.4 represents
the cross-classification of classes and regions.
In the contingency table, the two dimensions index class and region
respectively; each entry is the frequency count of the instantiated
features in a given region coming from an instance with a given
class. Also, define: