In the earlier parts of this chapter, a training stream either had an instantiated feature belonging to a particular region or it did not. For this reason, Table 4.10 contains only yes and no values: a binary membership test. However, it is possible to ascribe to each event a membership value that denotes the confidence that the point belongs to a particular region. In general, the closer an instance is to a particular centroid, the more confident one can be that it belongs to that region. Work in areas such as fuzzy set theory [Zad65] and rough sets [Paw82] is similar in spirit to the idea of using non-binary region memberships.
A first guess at a region membership measure is normalised Euclidean distance from the centroid. By normalised, we mean that the distance along each axis is divided by the standard deviation of the data along that axis. Doing so ensures that no one feature dominates the calculation of distance, and it copes with differences in scale between measurements. For example, assume that position is initially measured in metres, but is later measured in millimetres. The same positions expressed in millimetres are numerically three orders of magnitude larger than in metres, so if we fail to adjust for this, position would have three orders of magnitude more impact on our distance calculations. However, if we divide by the observed standard deviation in the data, this adjusts for the fact that different axes may have different units.
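The effect of normalisation can be illustrated with a minimal sketch. The data values, the second axis (a duration in seconds) and the function names below are hypothetical and chosen only for illustration; the sketch also assumes that every axis has a non-zero standard deviation.

```python
import math
import statistics


def raw_distance(point, centroid):
    """Plain Euclidean distance, with no adjustment for axis scale."""
    return math.sqrt(sum((p - c) ** 2 for p, c in zip(point, centroid)))


def axis_std_devs(points):
    """Sample standard deviation of the data along each axis."""
    return [statistics.stdev(axis) for axis in zip(*points)]


def normalised_distance(point, centroid, std_devs):
    """Euclidean distance after dividing each axis difference by that
    axis's standard deviation (assumed non-zero)."""
    return math.sqrt(
        sum(((p - c) / s) ** 2 for p, c, s in zip(point, centroid, std_devs))
    )


# Hypothetical instances: (position in metres, duration in seconds).
points_m = [(0.10, 2.0), (0.30, 5.0), (0.20, 3.0), (0.40, 6.0)]
centroid_m = (0.25, 4.0)

# The same instances with position expressed in millimetres.
points_mm = [(x * 1000.0, t) for x, t in points_m]
centroid_mm = (250.0, 4.0)

raw_m = raw_distance(points_m[0], centroid_m)
raw_mm = raw_distance(points_mm[0], centroid_mm)

d_m = normalised_distance(points_m[0], centroid_m, axis_std_devs(points_m))
d_mm = normalised_distance(points_mm[0], centroid_mm, axis_std_devs(points_mm))
```

In this sketch the raw distance is dominated by the position axis once it is expressed in millimetres, whereas the normalised distance is the same (up to floating-point rounding) in either unit.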
Normalised distance from the centroid is a good first approximation; however, it has some notable problems. Consider a simple vertical region boundary as shown in Figure 4.20.
[Figure 4.20]
Point A is approximately three times as close to $c_1$ as B is. Hence normalised distance implies that A is far more likely than B to be a member of the region around $c_1$. However, although A is closer to $c_1$, it is also much closer to a region boundary than B is, so it would intuitively make sense to be less confident about the membership of A than about that of B. How can we define a more appropriate region membership measure?
Recall that the region boundary consists of the points that are equidistant from two centroids; in other words, for points along the region boundary, the distances to the two nearest centroids are equal. This suggests an alternative region membership measure: the relationship between the normalised distance to the nearest centroid and the normalised distance to the second nearest centroid.
Consider the following region membership measure. Let $d_1$ be the distance to the nearest centroid $c_1$ and $d_2$ be the distance to the second nearest centroid $c_2$. Then consider the measure:

$$RM = \log_2 \frac{d_2}{d_1}$$

We will term this measure the relative membership. When a point is on the region boundary, by definition the distances to $c_1$ and $c_2$ are equal. Hence the ratio $d_2/d_1$ will be equal to 1, and so $RM = 0$. Consider a point like A in Figure 4.20. It is slightly closer to $c_1$ than to $c_2$; assume that the distance to $c_1$ is 5 and the distance to $c_2$ is 6. Then the relative membership is $\log_2(6/5) \approx 0.26$. Now consider a point like B, which is, say, 30 units from $c_2$ and 15 units from $c_1$. Its relative membership will be $\log_2(30/15) = 1$. Hence, one is more confident that B is in the region defined by $c_1$ than A, which matches our intuitions.
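The calculation can be sketched in a few lines. This sketch assumes that relative membership is the base-2 logarithm of the ratio of the second nearest to the nearest centroid distance; that form is an assumption here, chosen because it gives 0 on a region boundary and 1 when the nearest centroid is twice as close. The function name and the handling of a zero distance are likewise illustrative.

```python
import math


def relative_membership(distances):
    """Relative membership of a point, given its distances to every centroid.

    Assumed form: log2(d2 / d1), where d1 is the distance to the nearest
    centroid and d2 the distance to the second nearest centroid.
    """
    if len(distances) < 2:
        raise ValueError("at least two centroids are needed")
    d1, d2 = sorted(distances)[:2]
    if d1 == 0.0:
        # The point coincides with a centroid; this sketch treats the
        # membership as unbounded.
        return math.inf
    return math.log2(d2 / d1)


rm_boundary = relative_membership([7.0, 7.0])  # equidistant from both centroids
rm_a = relative_membership([5.0, 6.0])         # a point like A
rm_b = relative_membership([30.0, 15.0])       # a point like B
```

Under this assumption the boundary point scores 0, the point like A scores roughly 0.26 and the point like B scores 1, so B is ranked as the more confident member.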
Note that if the point we are looking at is actually the centroid of a region, then its distance to the nearest centroid is zero and the relative membership is $\infty$. This also makes sense, as we are absolutely sure that the point lies in its own region.
With this minor change, relative membership can be used in place of
binary membership. There are some complications, however. What happens
if there is more than one instantiated feature lying in a particular
region? With relative membership, two instances in the same region
can have different relative memberships, unlike binary memberships.
For example, consider
from Tech Support domain. Its events are
. Now consider that we use the segmentation
that we examined in Trial 2 of the random search (see Figure
4.15). Then both of the instances
and
lie within region 2. If the relative membership for each of
these is calculated, we would get two different values. For the point
, the relative membership (using unnormalised Euclidean
distance for ease) is 1.84, while for
it is 0.66. Table
4.12 shows the relative membership for each point
within
for Trial 2. There are several possible solutions to this problem, but a simple one is to take the point with the greatest relative membership. So, when constructing a table for Trial 2 as before, rather than a yes/no value, we would now enter "1.84". Table 4.11 shows the table with relative rather than binary attributes. Note that if there is no instance in the region, we simply enter "0". This is equivalent to saying that there are no instantiated features within that region.
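The rule for filling in one row of such a table can be sketched as follows. The function name, the input representation (pairs of region number and relative membership) and the choice of three regions are hypothetical; only the two membership values, 1.84 and 0.66 in region 2, come from the example in the text.

```python
def region_attributes(instances, regions):
    """Build one table row for a single training stream.

    instances: (region, relative_membership) pairs, one per instantiated
               feature in the stream.
    regions:   all region numbers in the segmentation.

    Each region's attribute is the greatest relative membership among the
    instances that fall in it, or 0.0 if no instance falls in it.
    """
    attributes = {region: 0.0 for region in regions}
    for region, membership in instances:
        attributes[region] = max(attributes[region], membership)
    return attributes


# Two instances in region 2, using the values quoted in the text; the
# set of regions [1, 2, 3] is an assumption made for this example.
row = region_attributes([(2, 1.84), (2, 0.66)], regions=[1, 2, 3])
```

One consequence of this encoding is that an instance lying exactly on a region boundary (relative membership 0) is indistinguishable from having no instance in that region.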
Feeding this table into C4.5 produces the results shown in Figure 4.21.
[Figure 4.21]
This result may seem surprising: in this case, the rule learnt is identical to the rule learnt with binary attributes. rgn1 > 0 means that there is an instance within region 1, whereas rgn1 <= 0 means that there is no instance in region 1. This also illustrates that, by using relative membership, the hypothesis language expands greatly and becomes a strict superset of that of the binary membership system. Although no use is made of it in this particular case, it is quite possible that a rule may take the form rgn1 > 1, in which case it is not sufficient for there to be an instance within region 1; there must be an instance within region 1 that also has a relative membership greater than 1. In other words, not only must the instance lie in region 1, but it must be more than twice as close to the centroid of region 1 as to any of the other centroids.
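As a short worked step, assume that relative membership takes the base-2 logarithm form $RM = \log_2(d_2/d_1)$, where $d_1$ and $d_2$ are the distances to the nearest and second nearest centroids. This form is an assumption, consistent with a value of 0 on the boundary and with the "twice as close" reading above. A threshold on the attribute then translates into a threshold on the distance ratio:

$$RM > 1 \iff \log_2 \frac{d_2}{d_1} > 1 \iff d_2 > 2\,d_1, \qquad RM > 0 \iff d_2 > d_1 .$$

More generally, under this assumption a test of the form rgn1 > t requires some instance in region 1 whose second nearest centroid is more than $2^t$ times as far away as the centroid of region 1.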
While it may at first seem that this impedes readability, it can be employed, as in the next section, to help make the concept description more useful.