Next: Gain Ratio Up: Metafeatures: A Novel Feature Previous: Using Metafeatures   Contents


Disparity Measures

Disparity measures are a recurring theme in supervised learning. For example, decision-tree building algorithms like C4.5 [Qui93] proceed by dividing the instances into two or more regions based on attribute values. The disparity measure used by C4.5 is the gain ratio; however, there are several other possible disparity measures.

In fact, disparity measures are an area of study in their own right. White and Liu [WL94] provide an extensive survey of disparity measures, and we adopt their notation. There are many similarities between decision tree induction (which segments instances based on attribute values) and the creation of regions around centroids (which segments instances based on a Euclidean distance measure). The only adaptation we make is that rather than talking about attributes $ a_1, \ldots, a_m$, we use regions $ R_1, \ldots, R_m$; this substitution points to an analogy between the two approaches.
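
As a rough illustration of segmentation by Euclidean distance, the sketch below assigns each point to the region of its nearest centroid. The function name `nearest_region`, the centroid coordinates, and the sample points are all hypothetical and chosen only for illustration; this is not necessarily how the regions are constructed in this work.

```python
import math

def nearest_region(point, centroids):
    """Return the index j of the centroid closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

# Hypothetical centroids standing in for regions R_1 and R_2 (indices 0 and 1).
centroids = [(0.0, 0.0), (5.0, 5.0)]
# Hypothetical points to be segmented.
points = [(0.5, 1.0), (4.0, 6.0), (2.0, 2.0)]

# Region index for each point, in order.
regions = [nearest_region(p, centroids) for p in points]
```

Here every point falls into exactly one region, just as every instance reaching a decision-tree node follows exactly one branch.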

Suppose that we are dealing with a problem with $ k$ classes and that we have $ m$ different regions. Then Table 4.4 represents the cross-classification of classes and regions.


Table 4.4: A general contingency table

             $ R_1$      $ R_2$      $ \hdots$   $ R_m$      Total
$ L_1$       $ n_{11}$   $ n_{12}$   $ \hdots$   $ n_{1m}$   $ n_{1.}$
$ L_2$       $ n_{21}$   $ n_{22}$   $ \hdots$   $ n_{2m}$   $ n_{2.}$
$ \vdots$    $ \vdots$   $ \vdots$   $ \hdots$   $ \vdots$   $ \vdots$
$ L_k$       $ n_{k1}$   $ n_{k2}$   $ \hdots$   $ n_{km}$   $ n_{k.}$
Total        $ n_{.1}$   $ n_{.2}$   $ \hdots$   $ n_{.m}$   $ n_{..}$


In the contingency table, $ L_i$ ($ i = 1, \ldots, k$) and $ R_j$ ($ j = 1, \ldots, m$) represent the classes and regions respectively; $ n_{ij}$ ($ i = 1, \ldots, k$; $ j = 1, \ldots, m$) represents the frequency count of the instantiated features in region $ R_j$ coming from an instance with class $ L_i$. Also, define:

$\displaystyle n_{i.} = \sum_{j=1}^m n_{ij}$

$\displaystyle n_{.j} = \sum_{i=1}^k n_{ij}$

and

$\displaystyle n_{..} = \sum_{i=1}^k\sum_{j=1}^m n_{ij} = N$
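
The counts and marginal totals above can be sketched in a few lines of Python. This is a minimal illustration, assuming classes and regions are encoded as zero-based integer indices; the helper names `contingency_table` and `marginals` and the sample data are hypothetical.

```python
def contingency_table(labels, regions, k, m):
    """Build the k x m table of counts n[i][j] from paired class/region indices."""
    n = [[0] * m for _ in range(k)]
    for i, j in zip(labels, regions):
        n[i][j] += 1
    return n

def marginals(n):
    """Return (row totals n_{i.}, column totals n_{.j}, grand total n_{..})."""
    row_totals = [sum(row) for row in n]
    col_totals = [sum(col) for col in zip(*n)]
    return row_totals, col_totals, sum(row_totals)

# Hypothetical data: class index and region index for six instantiated features.
labels = [0, 0, 1, 1, 1, 0]
regions = [0, 1, 1, 1, 0, 0]

n = contingency_table(labels, regions, k=2, m=2)
row_totals, col_totals, total = marginals(n)
```

Note that the grand total can be obtained equally from the row totals or the column totals, which gives a cheap consistency check on the table.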

We can also define the following probabilities:

$\displaystyle p_{ij} = \frac{n_{ij}}{n_{..}}$

$\displaystyle p_{i.} = \frac{n_{i.}}{n_{..}}$

$\displaystyle p_{.j} = \frac{n_{.j}}{n_{..}}$
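
As a small worked example of these probabilities, the sketch below normalises a table of counts by the grand total. The function name `probabilities` and the 2 x 2 table of counts are hypothetical and serve only to illustrate the definitions.

```python
def probabilities(n):
    """Return (p_ij, p_i., p_.j) for a table of counts n[i][j]."""
    total = sum(sum(row) for row in n)  # n_{..}
    if total == 0:
        raise ValueError("contingency table is empty")
    p = [[count / total for count in row] for row in n]          # p_ij
    p_row = [sum(row) / total for row in n]                      # p_i.
    p_col = [sum(col) / total for col in zip(*n)]                # p_.j
    return p, p_row, p_col

# Hypothetical counts for k = 2 classes and m = 2 regions.
counts = [[2, 1],
          [1, 2]]

p, p_row, p_col = probabilities(counts)
```

Both sets of marginal probabilities sum to one, as does the full table of joint probabilities.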



Mohammed Waleed Kadous 2002-12-10