next up previous contents
Next: Advantages of HMMs Up: Established techniques Previous: Established techniques   Contents


Hidden Markov Models

Hidden Markov models (HMMs) are the most popular means of temporal classification. They have found application in areas like speech, handwriting and gesture recognition. An excellent text on HMMs is [RJ86].

Informally speaking, a hidden Markov model is a variant of a finite state machine. However, unlike finite state machines, they are not deterministic. A normal finite state machine emits a deterministic symbol in a given state. Further, it then deterministically transitions to another state. Hidden Markov models do neither deterministically, rather they both transition and emit under a probabilistic model.

Usually, with a finite state machine, a string of symbols can be given and it can be easily determined (a) whether the string could have been generated by the finite state machine in the first place (b) if it could have been generated by the finite state machine, what the sequence of state transitions it undertook were. With a hidden Markov model, (a) is replaced with a probability that the HMM generated the string and (b) is replaced with nothing: in general the exact sequence of state transitions undertaken is ``hidden'', hence the name.

Figure 3.1: A diagram illustrating an HMM and the different ways a,a,b,c can be generated by the HMM.
\begin{figure}\begin{center}
\leavevmode \epsfxsize =5in \epsfbox{hmms-1.eps}\par\centering\centering\end{center}\end{figure}

Formally, a HMM consists of the following parts:

Figure 3.1 shows an example of an HMM. Assume that the starting state is 1 and the ending state is 3. For this simple example, the output symbols ($ Y$) are $ \{a, b, c\}$. In Figure 3.1 the transition probabilities (the $ a_{ij}$) are shown above each transition. Under each state are the emission probabilities.

The HMM can be used to calculate the probability that a particular output sequence was generated by the HMM. Consider the sequence aabc. Some different state sequences that generate the sequence aabc are also shown in Figure 3.1. For the moment, assume that we know the state sequence as well as the the observed sequence. Under these circumstances, the total probability of the observed sequence can easily be determined - it is the product of the probability of emission in each state multiplied by the probability of transition to the next state, then multiplied by the emission probability in the next state. For example, consider the sequence aabc and the state sequence 1, 1, 2, 3. Then we start in state 1, emit an $ a$ (probability 0.8), then a transition to state 1 (probability 0.5), then emit another $ a$ in state 1 (probability 0.8), then transition to state 2 (probability 0.3), then emit a $ b$ in state 2 (probability 0.6), then transition to state 3 (probability 0.5), then emit an $ c$ in state 3 (probability 0.1). We multiply all of these together to get the final probability for that sequence of states. The total probability that the HMM will generate the sequence aabc is the sum of the probabilities of all the paths that produce the sequence aabc.

In general, given the transition and the emission probabilities of an HMM, we can work out the probability that a sequence of observations $ O = [o_1, ..,o_{t_{max}}]$ (where $ t_{max}$ is the lengths of the sequence; or in our parlance, $ t_{max}$ is the number of frames) was generated by a given model $ \lambda$, often expressed as $ \mathit{Pr}(O\vert\lambda)$:

$\displaystyle \mathit{Pr}(\mathbf{O}\vert\lambda)= \sum_{s \in \mathit{valid}(S)} \prod_{t=1}^T a_{s_{t-1}s_t}b_{s_t}(o_t)
$

where $ \mathit{valid}(S)$ are all the valid state sequences starting in a starting state and ending in an ending state. Each state sequence therefore has the form $ [s_1,...,s_{t_{max}}]$, where $ s_i$ is the state at time $ i$.

There are algorithms for calculating the sum of the probabilities over all possible paths. However, the algorithms are slow, so typically, only the probability of most probable path is calculated as the ``merit'' figure. This has been shown not to impact too adversely on accuracy, while decreasing the computational complexity substantially. The algorithm for doing so is called the Viterbi algorithm.

HMMs can be employed for classification in the following way: given several models[*], it is possible to determine the model which will produce a given sequence of observations with the highest probability. Thus, if for each class there is a model with the states, transitions and probabilities set appropriately, the Viterbi algorithm can be used to calculate the model that most probably ``generated'' the sequence of observations. It is easy to see how this can be extended to a weak temporal classification system - a single model is constructed for each class, and given an unlabelled instance, the probability that that sequence of output symbols was generated by each HMM is calculated. The model with the highest probability is the class prediction.

There are three remaining issues to resolve. Firstly, how is the correct model chosen, i.e. what are the appropriate states and transitions? There is no definitive answer to this problem, and in general, experts use domain knowledge, experience and trial and error.

Secondly, once we have selected a model, how do we set the transition ($ a_{ij}$) and emission probabilities ($ b_j(o_t)$) so as to maximise the observed sequences that are associated with a particular class? The Baum-Welch reestimation algorithm provides a mechanism for doing so efficiently. It takes the sequence of observations and allocates observations to particular states based on the current values of $ a_{ij}$ and $ b_j(o_t)$ using the Viterbi algorithm. It then updates $ a_{ij}$ and $ b_j(o_t)$ to reflect this allocation. The process is run again until the probability converges. From the description it should be clear that it is an E-M (expectation-maximisation) style algorithm.

Thirdly, so far we have only considered simple observed symbols from a fixed alphabet. In real life, the observed symbols are actually far more complex than discrete symbols; using our notation, each observed symbol corresponds to an entire frame; usually consisting of a vector of real and discrete values. Can hidden Markov models be extended to cope with these as observations instead of single symbols?

One approach that is sometimes used is vector quantisation, which is a general approach for discretisation of multivariate data. Using this approach, a certain number of vectors is chosen, typically a power of 2; e.g. 64. These 64 vectors form the codebook. Each frame is represented as the element of the codebook that is most similar to it (according to some metric). Each entry in the codebook becomes an output symbol. There are well-developed techniques for generating appropriate vector codebooks from data.

The most common approach is to realise that the important aspect of the observed symbol is not the symbol itself, but the probability associated with that symbol. All that the algorithm requires to function is an estimate of the emission probability in a given state. Hence, if $ b_j(o_t)$ is represented as a function over frames that returns the probability of that frame, rather than over individual symbols, the probability can be estimated as before, with no other changes to the algorithm.

This is accomplished by modelling the distribution of each state as a Gaussian mixture (the same as a Gaussian distribution, but for vectors, rather than a scalars). Thus, for each element of the feature vector (or frame) the mean and standard deviation is calculated for that state. Hence, the total probability of a given frame can be calculated using the Gaussian mixture[*]. Such an HMM is termed a continuous-density hidden Markov model (CD-HMM) and it is the most commonly used HMM model.

These solutions have all been implemented in practical systems. HTK (hidden Markov model toolkit) is a freely available but very complete implementation of an HMM system that includes all of the above [YKO$^+$].



Subsections
next up previous contents
Next: Advantages of HMMs Up: Established techniques Previous: Established techniques   Contents
Mohammed Waleed Kadous 2002-12-10