In defining the above task, we have implied that, in some way, the
learnt function $f_{pred}$, which predicts a class for each stream,
should ``approximate'' the actual class function $f_{actual}$.
What exactly do we mean by this, and how do we
measure how closely the predicted class functions match the actual
class functions?
In general we can't, unless
.
However, we can at least define a theoretical measure of
accuracy.
On a single element $S$, one way to measure success would be to
say that the prediction is accurate if the predicted class equals the
actual class, i.e. $f_{pred}(S) = f_{actual}(S)$, and
inaccurate otherwise.
This works for most cases. In some domains, however, the above is too
simplistic: not all inaccuracies are equally bad. For example,
consider working on a medical
temporal classification application involving a diagnosis, where
the set of classes is $\{\textrm{yes}, \textrm{no}\}$, with ``yes'' indicating that the patient has some
condition and ``no'' indicating that they do not. A ``false positive''
classification (i.e. misclassifying a negative as a positive) may not
be as bad as a ``false negative'' (i.e. misclassifying a positive as
a negative). In the sign language domain, misclassifying ``bad'' as
``unwell'' may not be as bad as misclassifying ``bad'' as ``good''.
To solve this problem, we introduce a function $cost(i,j)$
which tells us the cost of misclassifying an $i$ as a $j$. The
function need not be symmetric in $i$ and $j$, i.e. we may have
$cost(i,j) \neq cost(j,i)$. Typically, of course, $cost(i,i) = 0$.
We can represent the above simple case (where all errors are equally bad) as:
\[
cost(i,j) = \begin{cases} 0 & \textrm{if } i = j \\ 1 & \textrm{otherwise} \end{cases}
\]
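A minimal Python sketch of such cost functions follows. It shows the uniform case, where every error costs the same, next to an asymmetric table for the diagnosis example. The function names and the numeric costs (1 for a false positive, 5 for a false negative) are illustrative assumptions and do not come from the text.

```python
def zero_one_cost(actual, predicted):
    """Uniform case: cost 0 for a correct class, 1 for any error."""
    return 0 if actual == predicted else 1


# Hypothetical asymmetric costs for the diagnosis example, keyed by
# (actual class, predicted class). The values are made up for illustration.
DIAGNOSIS_COST = {
    ("yes", "yes"): 0.0,
    ("no", "no"): 0.0,
    ("no", "yes"): 1.0,  # false positive
    ("yes", "no"): 5.0,  # false negative, assumed here to be worse
}


def diagnosis_cost(actual, predicted):
    """Look up the cost of classifying an `actual` as a `predicted`."""
    return DIAGNOSIS_COST[(actual, predicted)]
```

With a table like this, the cost of classifying ``yes'' as ``no'' differs from the cost of classifying ``no'' as ``yes'', so the function is not symmetric, while the cost of a correct classification remains zero.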
Another complication is that sometimes it does not make sense to
optimise equally over the whole of the stream set
$\mathcal{S}$. For
example, it would be better to get higher accuracy on frequently
occurring signs than on infrequent ones. So to give a more accurate
measure of accuracy, this too must be included. We use the function
$P(S)$ to indicate the probability that a stream $S$ has of
occurring in the stream set
$\mathcal{S}$.
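Our goal can therefore be defined as finding:
\[
\arg\min_{f_{pred}} \sum_{S \in \mathcal{S}} cost(f_{actual}(S), f_{pred}(S)) \, P(S)
\]
In other words, we are trying to find the function
$f_{pred}$ which
minimises the sum, over the whole of the stream set $\mathcal{S}$,
of the cost of misclassifying each stream's actual class
$f_{actual}(S)$ as $f_{pred}(S)$, times the stream's
probability of occurrence $P(S)$.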
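The objective above can be sketched in Python for a small, finite stream set. This is only an illustration under stated assumptions: the stream identifiers, class labels, probabilities and the two candidate classifiers are all hypothetical, and the candidates are compared by enumeration rather than learnt.

```python
def zero_one_cost(actual, predicted):
    """Uniform cost: 0 for a correct class, 1 for any error."""
    return 0 if actual == predicted else 1


def expected_cost(predict, actual, prob, cost):
    """Sum over all streams of cost(actual class, predicted class) * P(S)."""
    return sum(cost(actual[s], predict(s)) * p for s, p in prob.items())


# Toy stream set: identifiers stand in for streams (assumed data).
actual = {"s1": "good", "s2": "bad", "s3": "unwell"}
prob = {"s1": 0.6, "s2": 0.3, "s3": 0.1}  # probabilities sum to 1


def always_good(stream):
    """Candidate that is right only on the most frequent stream."""
    return "good"


def wrong_on_frequent(stream):
    """Candidate that is wrong only on the most frequent stream."""
    return "bad" if stream == "s1" else actual[stream]


candidates = {
    "always_good": always_good,
    "wrong_on_frequent": wrong_on_frequent,
}

# Pick the candidate with the lowest probability-weighted cost.
best = min(
    candidates,
    key=lambda name: expected_cost(candidates[name], actual, prob, zero_one_cost),
)
```

Here the first candidate misclassifies two of the three streams yet has the lower expected cost, because its errors fall on the infrequent streams. Passing a different cost function changes the ranking criterion without changing `expected_cost` itself.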