
Chi-Square test

Another approach, this one from statistics, is the chi-square test. The chi-square test measures the difference between the observed and the expected values in each of the cells of the contingency table. By comparing the resulting statistic against the chi-square distribution, we obtain the probability that a distribution of instantiated features at least as uneven as the one we see in the contingency table would arise by chance alone. The smaller this probability is, the less likely it is that the distribution is due to chance. This probability is known as the p-value (or observed significance level) of the test.

The $ \chi^2$ statistic can be computed as:

$\displaystyle \chi^2 = \sum_i \sum_j \frac{(E_{ij}-O_{ij})^2}{E_{ij}}
$

where $ O_{ij}$ is the observed number of instantiated features belonging to region $ R_j$ in class $ L_i$, i.e. $ O_{ij} = n_{ij}$, and $ E_{ij}$ is the value that we would expect for $ n_{ij}$ if the region were independent of the class. Hence

$\displaystyle E_{ij} = \frac{n_{.j}n_{i.}}{n_{..}}
$
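The two formulas above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in this work: the function name `chi_square_statistic` and the nested-list layout (`observed[i][j]` holding $ n_{ij}$) are assumptions made for the example, and the sketch assumes that every class and every region contains at least one instantiated feature, so that no $ E_{ij}$ is zero.

```python
def chi_square_statistic(observed):
    """Return the chi-square statistic of a contingency table.

    observed[i][j] is n_ij, the count for class L_i and region R_j
    (a hypothetical layout chosen for this sketch).
    """
    row_totals = [sum(row) for row in observed]        # n_i. (per class)
    col_totals = [sum(col) for col in zip(*observed)]  # n_.j (per region)
    total = sum(row_totals)                            # n_..

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            # Expected count if region and class were independent.
            e_ij = row_totals[i] * col_totals[j] / total
            chi2 += (e_ij - o_ij) ** 2 / e_ij
    return chi2


# Two classes, two regions: every expected count is 15.
print(chi_square_statistic([[10, 20], [20, 10]]))  # approximately 6.667
```

A table whose rows are proportional to one another, such as `[[10, 20], [20, 40]]`, gives a statistic of zero, since every observed count then equals its expected count.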

In statistics, there is not one chi-square distribution but one for each number of degrees of freedom. It can be shown that the degrees of freedom $ \nu$ in this case are $ (k-1)(m-1)$. Once we have computed the $ \chi^2$ statistic for our contingency table, we can use the $ \chi^2$ distribution with $ \nu$ degrees of freedom to compute the probability that a table at least this far from independence would arise by random chance. The smaller this probability is, the more confident we can be that the distribution before us is useful for discriminative purposes.
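
Written out, this probability is the upper-tail area of the $ \chi^2$ density $ f_\nu$ beyond the computed statistic. The symbols $ p$ and $ \chi^2_{obs}$ (the value obtained from the table) are labels introduced here only for illustration:

$\displaystyle p = \int_{\chi^2_{obs}}^{\infty} f_\nu(x)\,dx, \qquad f_\nu(x) = \frac{x^{\nu/2-1}e^{-x/2}}{2^{\nu/2}\,\Gamma(\nu/2)}
$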

Unlike information gain, the $ \chi^2$ test does not, at least theoretically, suffer from a bias towards a larger number of regions. However, it is more difficult and time-consuming to calculate than the information gain or gain ratio, since computing the probability mentioned above requires integrating the probability density function of the $ \chi^2$ distribution.
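
As an illustration of what that integration involves, the sketch below evaluates the upper-tail probability for a whole number of degrees of freedom. It is one possible approach, not necessarily the one used in this work: instead of numerical quadrature it relies on the standard recurrence $ Q(a+1,x) = Q(a,x) + x^a e^{-x}/\Gamma(a+1)$ for the regularised upper incomplete gamma function, starting from $ Q(1,x) = e^{-x}$ or $ Q(1/2,x) = \mathrm{erfc}(\sqrt{x})$. The function name `chi_square_p_value` is hypothetical, and only the Python standard library is assumed.

```python
import math


def chi_square_p_value(chi2, dof):
    """Upper-tail probability of a chi-square variable with `dof`
    degrees of freedom exceeding `chi2` (hypothetical helper).

    Uses p = Q(dof / 2, chi2 / 2), built up with the recurrence
    Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / Gamma(a + 1).
    """
    if dof < 1 or dof != int(dof):
        raise ValueError("dof must be a positive integer")
    if chi2 <= 0:
        return 1.0  # the whole distribution lies above zero

    x = chi2 / 2.0
    if dof % 2 == 0:
        a = 1.0
        q = math.exp(-x)              # Q(1, x)
    else:
        a = 0.5
        q = math.erfc(math.sqrt(x))   # Q(1/2, x)

    # Step a up to dof / 2; work in logs to avoid overflow for large dof.
    while a < dof / 2.0:
        q += math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))
        a += 1.0
    return q


# 3.84 is close to the familiar 5% critical value for one degree of freedom.
print(chi_square_p_value(3.84, 1))  # close to 0.05
```

Even in this form, each evaluation costs a loop over roughly $ \nu/2$ terms plus calls to the exponential and log-gamma functions, on top of computing the statistic itself.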


Mohammed Waleed Kadous 2002-12-10