However, there are several problems with HMMs:
- They make very large assumptions about the data:
- They make the Markovian assumption: that the emission and the
transition probabilities depend only on the current state. This
has subtle effects; for example, the probability of staying in a
given state falls off exponentially with duration (the
probability of staying in state 1 for $n$ timesteps is
$a_{11}^{n}$, where $a_{11}$ is the self-transition probability of
state 1), which does not map well to the many real-world domains
where a linear decrease in probability with duration is more
appropriate.
- The Gaussian mixture assumption for continuous-density hidden
Markov models is a huge one. We cannot always assume that the
values are normally distributed. Because of the way Gaussian
mixture models work, they must either assume that the channels are
independent of one another, or use full covariance matrices, which
introduces many more parameters.
- The number of parameters that need to be set in an HMM is huge.
For example, even in the very simple three-state HMM shown in Figure
3.1, there are a total of 15 parameters that need to be estimated.
For a simple four-state HMM with five continuous channels, there
would be a total of 50 parameters to estimate. Note also that 40 of
these parameters are means and standard deviations, which are
themselves aggregate values. Also, because of the way the Viterbi
algorithm allocates frames to states, the frames associated with a
state can often change, which makes the parameter estimates even
more unstable. HMM practitioners often use the technique of
"parameter-tying" to reduce the number of variables that need to be
learnt, by forcing the emission probabilities in one state to be the
same as those in another. For example, given the two words "cat" and
"mad", the parameters of the states associated with the "a" sound
could be tied together.
- As a result of the above, the amount of data required to train an
HMM is very large. This can be seen by considering typical speech
recognition corpora that are used for training. The TIMIT database
[DAR], for instance, has a total of 630 speakers reading a text; the
ISOLET database [CMF90] for isolated letter recognition has 300
examples per letter. Many other domains do not have such large
datasets readily available.
- HMMs use only positive data for training. In other words, HMM
training maximises the probability of the observed examples that
belong to a class, but it does not minimise the probability of
observing instances from other classes.
- While in some domains the number of states and transitions can
be found using an educated guess or trial and error, in general
there is no principled way to determine them. Furthermore, the
states and transitions depend on the class being learnt. For
example, is there any reason why the words "cat" and "watermelon"
would have similar states and transitions?
- While the basic theory is elegant, by the time you get to an
implementation, several additions have been made to the simple
algorithm. We have already discussed parameter-tying and the
handling of continuous values, but there are also adjustments to the
state duration model and the addition of null emissions.
- The concept learnt by a hidden Markov model is expressed as its
emission and transition probabilities. This representation is
difficult for a person to interpret if the aim is to understand
what the model has learnt. In speech recognition, this issue is of
little significance, but in other domains, comprehensibility may be
even more important than accuracy.
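
Two of the points above can be checked with a little arithmetic. The following Python sketch is only an illustration under stated assumptions: the function names and the self-transition value of 0.9 are hypothetical, and the emission count assumes one mean and one standard deviation per channel per state (independent channels, a single Gaussian each), which is one reading consistent with the 40 means and standard deviations quoted for the four-state, five-channel example.

```python
def stay_probability(a_ii: float, n: int) -> float:
    """Probability of remaining in a state for n consecutive timesteps,
    given a self-transition probability a_ii (the a_11^n of the text)."""
    if not 0.0 <= a_ii <= 1.0:
        raise ValueError("a_ii must be a probability in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return a_ii ** n


def gaussian_emission_parameter_count(num_states: int, num_channels: int) -> int:
    """Count means and standard deviations, assuming one (mean, std) pair
    per channel per state. This layout is an assumption, not taken from
    the source."""
    return num_states * num_channels * 2


if __name__ == "__main__":
    a_11 = 0.9  # hypothetical self-transition probability for state 1
    for n in (1, 5, 10, 20):
        # Each extra timestep multiplies the probability by a_11, so the
        # decay is exponential in n rather than linear.
        print(f"n={n:2d}  P(stay)={stay_probability(a_11, n):.4f}")

    # Four states, five continuous channels.
    print(gaussian_emission_parameter_count(4, 5))
```

Whatever value is chosen for the self-transition probability, the ratio between successive durations is constant, which is exactly the exponential fall-off described in the first point; a linear duration model cannot be expressed by a single self-transition.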
Mohammed Waleed Kadous
2002-12-10