next up previous contents
Next: Further refinements Up: Examples Previous: With K-Means   Contents

With directed segmentation

Now, we do the same thing, using a directed segmentation approach. We will set $ n_{min} = 2$, $ n_{max} = 3$ and $ numTrials = 3$. We will look at several disparity measures.

It should be noted that one of the advantages of directed over undirected segmentation is that it chooses an instance which is an observed instantiated feature. Quinlan [Qui93] notes that experts who look at the results of learning with continuous attributes find it easier to understand cutpoints that can be found in the data, rather than averages, since it may be that certain averages are physically impossible (e.g. in the case of the previous result, a LoudRun that has a duration of 3.33 timesteps). Although his results were with cutpoints, and ours are with regions, the result still holds. We will use the random search algorithm proposed above, but run three iterations in parallel for ease of comparison.

The first stage of the algorithm is to generate 3 random subsets with between 2 and 3 elements. The results of this step are shown in Table 4.5.


Table 4.5: Random sets of centroids generated for the Tech Support domain.
Trial Centroids
1 $ {c_1=(4,4),c_2=(5,1)}$
2 $ {c_1=(4,4),c_2=(3,1),c_3=(9,4)}$
3 $ {c_1=(3,3),c_2=(12,1)}$


For each of these, it is possible to generate a region boundary diagram. Figure 4.14, 4.15 and 4.16 show the parameter space using trials 1,2 and 3 respectively.

Figure 4.14: Trial 1 points and region boundaries.
\begin{figure}\begin{center}
\leavevmode \epsfxsize =5in \epsfbox{blues-rand1.eps}\par\centering\centering\end{center}\end{figure}

Figure 4.15: Trial 2 points and region boundaries.
\begin{figure}\begin{center}
\leavevmode \epsfxsize =5in \epsfbox{blues-rand2.eps}\par\centering\centering\end{center}\end{figure}

Figure 4.16: Trial 3 points and region boundaries.
\begin{figure}\begin{center}
\leavevmode \epsfxsize =5in \epsfbox{blues-rand3.eps}\par\centering\centering\end{center}\end{figure}

Now we can calculate the contingency tables for each trial. We do this by counting the number of instances of each class in each region. For example, in Table 4.6, there is 1 instantiated feature in the region around centroid 1 ($ R_1$) which came from a stream whose class was Angry, and 3 whose original class was Happy[*]. These go into the first column. Similarly, in region 2 ($ R_2$), the region surrounding $ c_2$, there are 7 ``Angry'' instantiated features and no ``Happy'' instantiated features (counting the border case as belonging to this region). This goes into the second column. The final column is just the row totals (i.e. $ 1 + 7$ and $ 3 + 0$) and the final row is just the column totals (i.e. $ 1 + 3$ and $ 7 + 0$). The value in the bottom right hand corner is the sum of either the row totals or the column totals. These should of course be the same value and it should be equal to the total number of instantiated features.


Table 4.6: Contingency table for Trial 1.
  $ R_1$ $ R_2$  
Angry $ 1$ $ 7$ $ 8$
Happy $ 3$ 0 $ 3$
  $ 4$ $ 7$ $ 11$



Table 4.7: Contingency table for Trial 2.
  $ R_1$ $ R_2$ $ R_3$  
Angry 0 $ 5$ $ 3$ $ 8$
Happy $ 3$ 0 0 $ 3$
  $ 3$ $ 5$ $ 3$ $ 11$



Table 4.8: Contingency table for Trial 3.
  $ R_1$ $ R_2$  
Angry $ 5$ $ 3$ $ 8$
Happy $ 3$ 0 $ 3$
  $ 8$ $ 3$ $ 11$


Finally, we can calculate a number of disparity measures from the contingency tables, based on the methods discussed in Section 4.7. The results are shown in Table 4.9.


Table 4.9: Evaluation of trials 1, 2 and 3 using Gain, Gain Ratio and Chi-Squared disparity measures.
Trial Information Gain Chi-squared
  Gain Ratio Statistic DoF Prob $ -\log_{2}$ Prob
1 0.550 0.582 9.79 1 0.0176 9.15
2 0.845 0.549 11.0 2 0.0408 7.93
3 0.151 0.179 1.55 1 0.22 2.18


For example, to work out the information gain in the first trial, this can be computed using:

\begin{displaymath}
\begin{array}{lll}
\mathit{Gain} & = & H_{C} + H_{A} - H_{ce...
...50 \\
& & \mathrm{where}\ i(x) = x \log_2(x)\\
\end{array}\end{displaymath}

The gain ratio can be computed by dividing the information gain by $ H_A$, which in this case is: $ -(i(\frac{4}{11}) + i(\frac{7}{11})) =
0.945$. Hence the gain ratio is $ \frac{0.550}{0.945} = 0.582$.

Computing the $ \chi^2$ heuristic is more involved. For first cell in trial 1, $ E_{ij} = \frac{8 \times 4}{11} = 2.91$. We can then compute the $ \chi^2$ component for that particular cell using $ \frac{(e-o)^2}{e}$ as $ \frac{(1-2.91)^2}{2.91} = 1.25$. If we repeat this total for the remaining three cells in trial 1, and sum them, we get a total of 9.79. This number is known as the $ \chi^2$ statistic. Note also that there are two classes and two regions, hence the degree of freedom of the statistic is $ (2-1)(2-1) = 1$.

Usually this statistic is compared against a value in a critical-values table in a significance test. For example, it is used to decide whether there is a less than 5 per cent chance that such a distribution is random. However, we want to know for a particular $ \chi^2$ statistic and the degrees of freedom probability that this process was the result of random chance. This can be computed easily using the cumulative distribution function of the $ \chi^2$ distribution. The result is shown in the fourth column of Table 4.9. Finally, for comparison, we take the negative log of the probability. This is mainly for to match the direction of the other heuristics (bigger is better) and because it just as easy to compute without suffering floating-point difficulties in the implementation.

Looking at Table 4.9, it seems that the results for Trial 3 are inferior to the other two. However, Trials 1 and 2 are quite close; information gain puts Trial 2 ahead of Trial 1, unlike the other two measures. A tradeoff has occurred: even though Trial 2 produced a ``purer'' result (in that there is only one class in each region), it did so using 3 centroids, Trial 1 accomplishes an ``almost as pure'' result using only 2 centroids.

These regions, much as for the previous case with K-means, can now be used to create synthetic events. If we use Trial 2, for example, we get the features shown in Table 4.10.


Table 4.10: Attribution of synthetic features for the Tech Support domain in Trial 2.
Stream Class Synthetic Events
  Region 1 Region 2 Region 3
1 Happy Yes No No
2 Angry No Yes Yes
3 Angry No Yes Yes
4 Happy Yes No No
5 Happy Yes No No
6 Angry No Yes Yes


Feeding it to C4.5 produces Figure 4.17. Note that we can easily post-process the result to produce something that is far more readable by using the LoudRun metafeature, which is shown in Figure 4.18.

Figure 4.17: Rule for telling happy and angry customers apart, using synthetic features from trial 2.
\begin{figure}\begin{center}
\begin{boxedverbatim}rgn1 = yes: Happy (3.0)
rgn1 = no: Angry (3.0)\end{boxedverbatim}\end{center}\end{figure}

Note that if we did the same for the K-means value, we would get the slightly nonsensical definition shown in Figure 4.19.

Figure 4.18: Human readable form of rules in Figure 4.17.
\begin{figure}\begin{centering}
\begin{boxedverbatim}Does have a Loud run
ap...
...Happy (3.0)
Otherwise: Angry (3.0)\end{boxedverbatim}\end{centering}\end{figure}

Figure 4.19: Human readable form of rules in Figure 4.4.
\begin{figure}\begin{centering}
\begin{boxedverbatim}Does have a True run
ap...
...Angry (3.0)
Otherwise: Happy (3.0)\end{boxedverbatim}\end{centering}\end{figure}


next up previous contents
Next: Further refinements Up: Examples Previous: With K-Means   Contents
Mohammed Waleed Kadous 2002-12-10