Hidden Markov models (HMMs) are the most popular means of temporal classification. They have found application in areas like speech, handwriting and gesture recognition. An excellent text on HMMs is [RJ86].
Informally speaking, a hidden Markov model is a variant of a finite state machine. However, unlike finite state machines, they are not deterministic. A normal finite state machine emits a deterministic symbol in a given state. Further, it then deterministically transitions to another state. Hidden Markov models do neither deterministically, rather they both transition and emit under a probabilistic model.
Usually, with a finite state machine, a string of symbols can be given and it can be easily determined (a) whether the string could have been generated by the finite state machine in the first place (b) if it could have been generated by the finite state machine, what the sequence of state transitions it undertook were. With a hidden Markov model, (a) is replaced with a probability that the HMM generated the string and (b) is replaced with nothing: in general the exact sequence of state transitions undertaken is ``hidden'', hence the name.
![]() |
Formally, a HMM consists of the following parts:
Figure 3.1 shows an example of an HMM. Assume that
the starting state is 1 and the ending state is 3. For this simple
example, the output symbols (
) are
.
In Figure
3.1 the transition probabilities (the
) are
shown above each transition. Under each state are the
emission probabilities.
The HMM can be used to calculate the probability that a particular output
sequence was generated by the HMM. Consider the sequence aabc.
Some different state sequences that generate the sequence aabc are
also shown in Figure 3.1. For the moment, assume that
we know the state sequence as well as the the observed sequence. Under
these circumstances, the total probability of the observed sequence
can easily be determined - it is the product of the probability of
emission in each state multiplied by the probability of transition to
the next state, then multiplied by the emission probability in the
next state. For example, consider the sequence aabc and the
state sequence 1, 1, 2, 3. Then we start in state 1, emit an
(probability 0.8), then a transition to state 1 (probability 0.5),
then emit another
in state 1 (probability 0.8), then transition to
state 2 (probability 0.3), then emit a
in state 2 (probability 0.6),
then transition to state 3 (probability 0.5), then emit an
in state
3 (probability 0.1). We multiply all of these together to get the
final probability for that sequence of states. The total probability
that the HMM will generate the sequence aabc is the sum of the
probabilities of all the paths that produce the sequence aabc.
In general, given the transition and the emission probabilities of an
HMM, we can work out the probability that a sequence of observations
(where
is the lengths
of the sequence; or in our parlance,
is the number of frames) was
generated by a given model
, often expressed as
:
where
are all the valid state sequences starting
in a starting state and ending in an ending state. Each state sequence
therefore has the form
, where
is the state at
time
.
There are algorithms for calculating the sum of the probabilities over all possible paths. However, the algorithms are slow, so typically, only the probability of most probable path is calculated as the ``merit'' figure. This has been shown not to impact too adversely on accuracy, while decreasing the computational complexity substantially. The algorithm for doing so is called the Viterbi algorithm.
HMMs can be employed for classification in the following way: given
several models
, it is
possible to determine the model which will produce a given sequence of
observations with the highest probability. Thus, if for each class
there is a model with the states, transitions and probabilities set
appropriately, the Viterbi algorithm can be used to calculate the
model that most probably ``generated'' the sequence of observations.
It is easy to see how this can be extended to a weak temporal
classification system - a single model is constructed for each class,
and given an unlabelled instance, the probability that that sequence
of output symbols was generated by each HMM is calculated. The model
with the highest probability is the class prediction.
There are three remaining issues to resolve. Firstly, how is the correct model chosen, i.e. what are the appropriate states and transitions? There is no definitive answer to this problem, and in general, experts use domain knowledge, experience and trial and error.
Secondly, once we have selected a model, how do we set the transition
(
) and emission probabilities (
) so as to maximise
the observed sequences that are associated with a particular class?
The Baum-Welch reestimation algorithm provides a mechanism
for doing so efficiently. It takes the sequence of observations and
allocates observations to particular states based on the current
values of
and
using the Viterbi algorithm. It then
updates
and
to reflect this allocation. The process
is run again until the probability converges. From the description it
should be clear that it is an E-M (expectation-maximisation) style
algorithm.
Thirdly, so far we have only considered simple observed symbols from a fixed alphabet. In real life, the observed symbols are actually far more complex than discrete symbols; using our notation, each observed symbol corresponds to an entire frame; usually consisting of a vector of real and discrete values. Can hidden Markov models be extended to cope with these as observations instead of single symbols?
One approach that is sometimes used is vector quantisation, which is a general approach for discretisation of multivariate data. Using this approach, a certain number of vectors is chosen, typically a power of 2; e.g. 64. These 64 vectors form the codebook. Each frame is represented as the element of the codebook that is most similar to it (according to some metric). Each entry in the codebook becomes an output symbol. There are well-developed techniques for generating appropriate vector codebooks from data.
The most common approach is to realise that the important aspect of
the observed symbol is not the symbol itself, but the probability
associated with that symbol. All that the algorithm requires to
function is an estimate of the emission probability in a given state.
Hence, if
is represented as a function over frames that
returns the probability of that frame, rather than over individual
symbols, the probability can be estimated as before, with no other
changes to the algorithm.
This is accomplished by modelling the distribution of each state as a
Gaussian mixture (the same as a Gaussian distribution, but for
vectors, rather than a scalars). Thus, for each element of the feature
vector (or frame) the mean and standard deviation is calculated for
that state. Hence, the total probability of a given frame can be
calculated using the Gaussian mixture
. Such an HMM is termed a
continuous-density hidden Markov model (CD-HMM) and it is the most
commonly used HMM model.
These solutions have all been implemented in practical systems. HTK (hidden Markov model toolkit) is a freely available but very complete implementation of an HMM system that includes all of the above [YKO$^+$].