The output of the k-means algorithm is a set of clusters, each cluster described in terms of its centroid (described as a point in the parameter space) and the points that belong to cluster. The standard deviation for each dimension of each cluster is also computed.
It is possible to make a rough estimate of the confidence of membership of a particular event to a given cluster. This estimate is based on the following assumptions:
Our estimate of confidence is based on calculating the probability that an element of this cluster could be as far or further out from the centroid of this cluster as the given event.
Figure 6.4: An example illustrating how the
confidence estimate can be made
In figure 6.4
let x1 and x2 be two dimensions of the
parameter space. The probability we are trying to estimate is the
probability that a point in the shaded area belongs to the cluster.
Let
be the standard deviation of x1 and
be the standard deviation of x2. We can express the difference along
the x1 axis between p and c as a multiple of the standard deviation,
so that
. Similarly, we can express the
distance along the x2 axis as
. If
x1 and x2 are independent and the cluster has a Gaussian
distribution, then we can estimate the probability that an element
belonging to the cluster is in the grey area as:
where Z is the standard normal distribution (i.e. a normal distribution with a mean of zero and a standard deviation of 1).
In other words, it's the probability that the point of consideration
is more than a standard deviations away from the centroid in the
x1 dimension and more than b standard deviations away from the
centroid in the x2 dimension. The values of
can be
found easily using table lookup. This method can be generalised to
higher dimensions.
To understand how this works, consider some examples: imagine we have
an event that matches the centroid of the cluster exactly. Then
a = b = 0. But
will occur with a probability 1. Thus
our confidence that an event that matches the centroid of the cluster
belongs to the cluster is 1. This is intuitive, since the centroid can
be thought of as the most obvious example of an element belonging to
the cluster.
Now consider what happens when p is
away
from c in the x1 dimension and
in the x2
dimension. From a statistics text it can be determined that
. Thus the probability that a point lies
or further away from the centroid, but
still belongs to this cluster is
. Alternatively, a
point which is
away in the x1 dimension, but has the
same x2 value as the centroid will have a confidence of
.
For convenience, we take the logarithm of the confidence. This
simplifies the mathematics (since multiplication becomes addition)
and also avoids underflow problems (with very small probabilities)
when these techniques are implemented using floating point
operations. Thus, since the range of confidence is [0..1], the range
of the logarithm of the confidence is
.
Again, we use a per-class technique. Thus we only check for the presence or absence of clusters coming from the appropriate class. For each training instance and for each cluster, we find the event (if there is one) in that training instance which has the highest probability of belonging to that cluster. This becomes our estimate of the confidence that the training instance has an event belonging to that cluster.