Hidden Markov models (HMMs) are a well-developed technology for classification of multivariate data that have been used extensively in speech recognition, handwriting recognition and even sign recognition. They have been applied for decades. Due to space considerations, an in-depth analysis of HMMs isn't possible. The interested reader may wish to have a look at [RJ86].
HMMs consist of states, possible transitions between states (and the probability of those transitions being taken) and a probability that in a particular state, a particular observation is made. An observation can be anything of interest. For example, it could be a frame in our definition, or it could be a symbol of some kind. HMMs are termed hidden because the state of the HMM can not, in general, be known by looking at the observations. They are Markov in the sense that the probability of observing an output depends only on the current state and not on previous states. By looking at the observations, using an algorithm known as the Viterbi Algorithm, an estimate of the probability that a particular instance or stream observed was generated by that HMM can be computed. We can have one HMM for every class we are interested in, then compute the probability of the particular stream we saw being generated by each of the HMMs, then choose the most probable one.
Figure 2.1: A simple HMM where the only observations are j, k,
or l
For example, consider figure 2.1. This is a very
simple hidden Markov model, with three states. At each timestep, a j,
k, or l is observed. We can work out the probability that the
sequence [j,j,k,l] was generated by this HMM by considering
the most probable state transitions that would generate this output,
and computing the probability that it would occur (which is the
probability of each output times the probability of the
transitions). In this case, the most probable sequence of state
transitions (given we start in state 1) is: [1,2,3,3]. The probability
of this is
. If we had other HMMs, we could compare the
probability for each, and choose the HMM that produced the highest
probability for the given observed sequence. If we had trained each
HMM to have an appropriate set of parameters for a particular class,
then by choosing the HMM which generated the observations with the
highest probability, we would also know the class.
Note that observations can be very complex, and need not be as simple as a single letter j, k or l. In general, people who use HMMs model multivariate streams by representing each frame as an observation. The probability of a particular frame is estimated by using a Gaussian mixture over the channels.
The Baum-Welch reestimation algorithm provides a way for the probabilities of both transitions and of observations within states to be estimated from training data.
HMMs have proved to be very effective in practice, producing high levels of classification accuracy, but they fail to meet several of the goals discussed in section 1.1, namely: