
Gain Ratio

Given this notation, we can define the following information measures for the cells, the classes and the regions:

$\displaystyle H_{cell} = - \sum_{i=1}^k\sum_{j=1}^m p_{ij} \log_2(p_{ij})$

$\displaystyle H_{C} = - \sum_{i=1}^k p_{i.} \log_2(p_{i.})$

$\displaystyle H_{A} = - \sum_{j=1}^m p_{.j} \log_2(p_{.j})$

The information gain is the information shared between the class and the region, that is, the sum of the information about class and the information about region, less the information stored in the cells:

$\displaystyle H_{T} = H_{C} + H_{A} - H_{cell}$
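As a rough illustration of the four measures above, the following Python sketch computes them from a table of counts. The function names (`entropy`, `information_measures`), the row/column layout and the example table are assumptions made for illustration only; they are not taken from the experiments described here.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_measures(counts):
    """counts[i][j] = instances of class i in region j (hypothetical layout)."""
    total = sum(sum(row) for row in counts)
    p = [[c / total for c in row] for row in counts]   # p_{ij}
    p_class = [sum(row) for row in p]                  # p_{i.}
    p_region = [sum(col) for col in zip(*p)]           # p_{.j}
    h_cell = entropy(x for row in p for x in row)
    h_c = entropy(p_class)
    h_a = entropy(p_region)
    h_t = h_c + h_a - h_cell                           # information gain
    return h_cell, h_c, h_a, h_t

# Illustrative table: 2 classes (rows), 2 regions (columns).
counts = [[3, 1],
          [1, 3]]
h_cell, h_c, h_a, h_t = information_measures(counts)
print(f"H_cell={h_cell:.3f} H_C={h_c:.3f} H_A={h_a:.3f} H_T={h_t:.3f}")
```

For this table both marginal entropies are 1 bit and the information gain is roughly 0.189 bits, reflecting the fact that each region is only partly dominated by one class.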

This is the heuristic that was originally used by Quinlan in ID3 [Qui86]. However, it has one significant drawback: it does not take into account the number of regions. If information gain were used "raw", without regard to the number of regions, it would be biased towards having a huge number of regions. Imagine a region for each point in the space, such that each region has only one instantiated feature and therefore one class. Substituting into the above formula, we get $H_{cell} = H_A$, so such a selection of regions has an information gain of $H_C$ (1 bit for two equally likely classes), which is the largest value possible. But it is of no use to us.

Hence, in C4.5, Quinlan introduces the gain ratio. The gain ratio compensates for the number of regions by normalising by the information encoded in the split itself, $H_A$. In the above notation, the gain ratio is $ \frac{H_T}{H_A}$.
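The bias can be seen in a small numerical sketch. The Python below compares two hypothetical splits of eight points from two equally likely classes: one with a single pure region per class, and one with a region for every point. The helper `gain_and_ratio` and both tables are illustrative assumptions, not part of the original experiments.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gain_and_ratio(counts):
    """Return (H_T, H_T / H_A) for counts[i][j] = class i, region j.

    Assumes at least two non-empty regions, so that H_A is non-zero.
    """
    total = sum(sum(row) for row in counts)
    p = [[c / total for c in row] for row in counts]
    h_c = entropy(sum(row) for row in p)
    h_a = entropy(sum(col) for col in zip(*p))
    h_cell = entropy(x for row in p for x in row)
    h_t = h_c + h_a - h_cell
    return h_t, h_t / h_a

# Eight points, two classes with four points each.
two_regions = [[4, 0],
               [0, 4]]                        # one pure region per class
one_per_point = [[1, 1, 1, 1, 0, 0, 0, 0],
                 [0, 0, 0, 0, 1, 1, 1, 1]]    # one region per point

gain_2, ratio_2 = gain_and_ratio(two_regions)
gain_8, ratio_8 = gain_and_ratio(one_per_point)
print(f"two regions:   gain={gain_2:.3f} ratio={ratio_2:.3f}")
print(f"one per point: gain={gain_8:.3f} ratio={ratio_8:.3f}")
```

Both splits reach the same information gain of 1 bit, so the raw gain cannot tell them apart. The gain ratio is 1 for the two-region split but only 1/3 for the one-region-per-point split, because the latter spends 3 bits of split information to recover 1 bit about the class.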

In these experiments with metafeatures, therefore, we used the gain ratio as one of our disparity measures. The higher the gain ratio, the more likely the subdivision into regions is useful for classification.


Mohammed Waleed Kadous 2002-12-10