next up previous contents
Next: Bounds on variation Up: Further refinements Previous: Further refinements   Contents


Region membership measures

In the earlier parts of this chapter, training streams either had an instantiated feature belonging to a particular region, or they didn't. For this reason, Table 4.10 contains only yes and no values - a binary membership test. However, it is possible to ascribe to each event a membership value that denotes the confidence that the point belongs to a particular region. It is generally the case that the closer the instance is to a particular centroid, the more confident one can be that it belongs to that region. Work in areas such as fuzzy set theory [Zad65] and rough sets [Paw82] have similarities to the idea of using non-binary region memberships.

A first guess at a region membership measure is normalised Euclidean distance from the centroid. By normalised, we mean that the distance along each axis is divided by its standard deviation. Doing so ensures that no one feature dominates the calculation of distance. This copes with differences in scale of different measurements. For example, let us assume that position is initially measured in metres; but is later changed to use millimetres. If we fail to adjust for the fact that millimetres are three orders of magnitude as large as metres, then in our distance calculations the position would have three orders of magnitude more impact on the distance. However, if we divide by the observed standard deviation in the data, then this will adjust for the fact that different axes may have different units.

Normalised distance from the centroid is a good first approximation, however it has some notable problems. Consider a simple vertical region boundary as shown in Figure 4.20.

Figure 4.20: Distance measures in the parameter space. $ c_1$ and $ c_2$ are two centroids, A and B are instantiated features.
\begin{figure}\begin{center}
\leavevmode \epsfxsize =4in \epsfbox{distance-measure.eps}\par\centering\centering\end{center}\end{figure}

Point A is approximately three times as close to $ c_1$ as B. Hence normalised distance implies that A is far more likely than B to be a member of the region around $ c_1$. We know that though A is closer to $ c_1$, it is also much closer to a region boundary than B is, hence it would intuitively make sense if we were less confident about the membership of A then we were of B. How can we define a more appropriate region membership measure?

Recall that the region boundary tells us the points which are equidistant from two centroids, in other words, for points along the region boundary the distance between two of the centroids is equal. This suggests an alternative region membership measure: the relationship between the normalised distance to the nearest centroid and the normalised distance to the second nearest centroid.

Consider the following region membership measure. Let $ d_{1}$ be the distance to the nearest centroid $ c_{n1}$ and $ d_{2}$ be the distance to the second nearest centroid $ c_{n2}$. Then consider the measure:

$\displaystyle D = \log_2(\frac{d_2}{d_1})
$

We will term this measure the relative membership. When a point is on the the region boundary by definition the distance to $ c_{n1}$ and $ c_{n2}$ is equal. Hence the ratio $ \frac{d_2}{d_1}$ will be equal to 1, hence $ D=0$. Consider a point like A in Figure 4.20. It is slightly closer to $ c_1$ than $ c_2$; assume that the distance to $ c_1$ is 5 and the distance to $ c_2$ is 6. Then the relative membership is $ \log_2(6/5) = 0.263$. Now consider a point like B, which is, say 30 units from $ c_2$ and 15 units from $ c_1$. Its relative membership will be $ \log_2(30/15) = 1$. Hence, one is more confident that B is in the region defined by $ c_1$ than A, which matches our intuitions.

Note that if the point we are looking at is actually the centroid of a region, then the relative membership is $ \log_2(\frac{d2}{0}) = \infty$ This also makes sense; as we are absolutely sure that the point $ c_1$ lies in its own region $ R_1$.

With this minor change, relative membership can be used in place of binary membership. There are some complications, however. What happens if there is more than one instantiated feature lying in a particular region? With relative membership, two instances in the same region can have different relative memberships, unlike binary memberships. For example, consider $ s_3$ from Tech Support domain. Its events are $ \{(2,1), (5,1), (12,3)\}$. Now consider that we use the segmentation that we examined in Trial 2 of the random search (see Figure 4.15). Then both of the instances $ (2,1)$ and $ (5,1)$ lie within region 2. If the relative membership for each of these is calculated, we would get two different values. For the point $ (2,1)$, the relative membership (using unnormalised Euclidean distance for ease) is 1.84, while for $ (5,1)$ it is 0.66. Table 4.12 shows the relative membership for each point within $ R_2$ for trial 2. There are several possible solutions to this problem; but a simple one is to take the point with the greatest relative membership[*]. So, when constructing a table for Trial 2 as before, rather than a yes/no value, we would now put in a ``1.84''. Table 4.11 shows the table with relative rather than binary attributes. Note that if there is no instance in the space, we simply enter a ``0''. This is equivalent to saying that there are no instantiated features within that region.


Table 4.11: Numerical attribution of synthetic features for the Tech Support domain in Trial 2.
Stream Class Synthetic Events
  Region 1 Region 2 Region 3
1 Happy 0.5 0 0
2 Angry 0 $ \infty$ $ \infty$
3 Angry 0 1.84 1.35
4 Happy $ \infty$ 0 0
5 Happy $ \infty$ 0 0
6 Angry 0 0.265 2.35


Feeding this into C4.5, the results in Figure 4.21 are produced.

Figure 4.21: Rule for telling happy and angry customers apart, using synthetic features from trial 2 and using relative membership.
\begin{figure}\begin{centering}
\par\begin{boxedverbatim}rgn1 <= 0: Angry (3.0)
rgn1 > 0: Happy (3.0)\end{boxedverbatim}\par\end{centering}\end{figure}

This result seems surprising. In this case, the rule learnt is identical to the binary attribute rule learnt. rgn1 > 0 means that there is an instance within region 1; whereas rgn1 <=0 means that there is no instance in region 1. This also correctly indicates that by using relative membership, the hypothesis language expands greatly and is a strict superset of the binary membership system. Although there is no use made of it in this particular case, it is quite possible that a rule may take the form rgn1 > 1, in which case it is not simply sufficient for there to be an instance within region 1, but it would also be necessary for there to be an instance within region 1 which also has a relative membership greater than 1. In other words, not only must the instance lie in region 1, but it must be twice as close to the point $ c_1$ as to any of the other centroids.

While it may at first seem that this can impede readability, it can be employed, as with the next section to assist in making the concept description more useful.


next up previous contents
Next: Bounds on variation Up: Further refinements Previous: Further refinements   Contents
Mohammed Waleed Kadous 2002-12-10