A very recent approach (February 1995) was suggested by Starner in his Master's thesis [Sta95] and, together with Pentland, in a technical report [SP95].
First, they extended the HMM to handle not just simple output symbols but distributions over multiple variables, allowing them to use features extracted directly from the data instead of having to pre-process the data into a sequence of output symbols.
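As an illustration of what replacing output symbols with distributions over several variables involves, the following is a minimal sketch of the forward algorithm for an HMM whose states emit feature vectors directly. It is not taken from [Sta95] or [SP95]: the diagonal-covariance Gaussian output densities, the function names and the toy models are all assumptions made for the example.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    top = max(values)
    if top == -math.inf:
        return -math.inf
    return top + math.log(sum(math.exp(v - top) for v in values))

def log_likelihood(obs, log_start, log_trans, means, variances):
    """Forward algorithm: log P(obs | model) for a continuous-output HMM."""
    n = len(log_start)
    # Initialise with the start probabilities and the first observation.
    alpha = [
        log_start[s] + log_gaussian(obs[0], means[s], variances[s])
        for s in range(n)
    ]
    # Propagate through the remaining observations.
    for x in obs[1:]:
        alpha = [
            logsumexp([alpha[p] + log_trans[p][s] for p in range(n)])
            + log_gaussian(x, means[s], variances[s])
            for s in range(n)
        ]
    return logsumexp(alpha)

# Hypothetical two-state left-to-right models over a one-dimensional feature.
NEG_INF = -math.inf
log_start = [0.0, NEG_INF]
log_trans = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
variances = [[1.0], [1.0]]
observations = [[0.1], [0.0], [4.9], [5.2]]

score_a = log_likelihood(observations, log_start, log_trans, [[0.0], [5.0]], variances)
score_b = log_likelihood(observations, log_start, log_trans, [[5.0], [0.0]], variances)
```

With one such model per sign, recognition amounts to choosing the model with the highest score; here the observations rise from about 0 to about 5, so the first model scores higher than the second.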
In terms of hardware, a colour camera was used, and users wore a yellow glove on their right hand and an orange glove on their left. Five images per second were taken and fed into an SGI Indigo 2. A small selection of features was extracted from these images, such as the x and y position of each hand, and the eccentricity and axis of its bounding ellipse.
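Features of this kind can be computed from the pixels belonging to one glove once they have been separated by colour. The sketch below is only one plausible way of doing so: it assumes the ellipse is derived from the second-order moments of the pixel positions and that the "axis" is the angle of the major axis, neither of which is stated in the source. The function name and the returned keys are likewise hypothetical.

```python
import math

def hand_features(pixels):
    """Centroid, major-axis angle and eccentricity of a blob of (x, y) pixels."""
    n = len(pixels)
    cx = sum(x for x, _ in pixels) / n
    cy = sum(y for _, y in pixels) / n
    # Central second-order moments (the covariance matrix of the blob).
    sxx = sum((x - cx) ** 2 for x, _ in pixels) / n
    syy = sum((y - cy) ** 2 for _, y in pixels) / n
    sxy = sum((x - cx) * (y - cy) for x, y in pixels) / n
    # The eigenvalues of the 2x2 covariance matrix are proportional to the
    # squared lengths of the major and minor axes of the fitted ellipse.
    spread = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    major = (sxx + syy) / 2 + spread
    minor = max((sxx + syy) / 2 - spread, 0.0)
    # Orientation of the major axis, in radians from the x axis.
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    eccentricity = math.sqrt(1 - minor / major) if major > 0 else 0.0
    return {"x": cx, "y": cy, "angle": angle, "eccentricity": eccentricity}

# Toy blobs: a horizontal line, a diagonal line and a filled square.
line = hand_features([(x, 0) for x in range(10)])
diagonal = hand_features([(i, i) for i in range(10)])
square = hand_features([(x, y) for x in range(4) for y in range(4)])
```

A thin line gives an eccentricity of 1 and a filled square gives 0, so the value indicates how elongated the hand appears in the image.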
In this case, a raw accuracy of 91.3 per cent was achieved. By
imposing a strict grammar, it was shown that accuracy rates in
excess of 99 per cent were possible, with real-time performance. A
selection of 40 signs was used, and the simplifying assumption was
made that each sign belongs to only one grammatical class.
The signs were selected to allow a large set of coherent sentences to
be constructed. Furthermore, the grammar was strictly ``pronoun,
verb, noun, adjective, pronoun'', with pronouns and adjectives
possibly empty. It is suggested that the well-known bigram and
trigram techniques could be used, whereby the preceding one or two
signs are used to estimate the probability of the current sign.
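To make the bigram idea concrete, the sketch below estimates the probability of each sign given the sign before it by counting adjacent pairs in a set of example sentences. The sign glosses, the start marker and the function name are invented for the illustration, and no smoothing is applied, so pairs that never occur are simply absent from the result.

```python
from collections import Counter

def bigram_probabilities(sentences):
    """Estimate P(current sign | previous sign) from example sentences."""
    pair_counts = Counter()
    context_counts = Counter()
    for sentence in sentences:
        # "<s>" is a hypothetical marker for the start of a sentence.
        tokens = ["<s>"] + list(sentence)
        for prev, cur in zip(tokens, tokens[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    return {
        pair: count / context_counts[pair[0]]
        for pair, count in pair_counts.items()
    }

# Hypothetical training sentences written as sign glosses.
probs = bigram_probabilities([
    ["I", "want", "book"],
    ["I", "want", "car"],
    ["you", "want", "book"],
])
```

In this toy data "want" always follows "I", so that pair has probability 1, while "book" follows "want" in two of three cases. A trigram model would condition on the two preceding signs in the same way.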
The system makes no attempt to consider the movements of the fingers, however. This is a limitation in a number of ways: for large-lexicon systems, finger position becomes increasingly important, and the system has no way of handling finger-spelling.
Still, the application of Hidden Markov Models was shown to be very
promising, and it may be effective to use them even in glove-based
sign recognition systems, since HMMs are not tied to the use of video
at all; they operate on attributes extracted from the motion. This
means that the HMM technique would be particularly well suited to
migrating to large-lexicon Auslan recognition. In fact, with very
little modification, the features explored in this thesis can be used
as the basis for the features used in the HMM.