Ambiguity Resolution  Statistical Methods
Reference: Allen, Chapter 7
Aim: 
To develop enough theory to describe algorithms for disambiguating the
syntactic categories of the words, choosing the most likely parse, and
enhancing parser efficiency, using statistical techniques, and frequency data
on the number of times that:
 a word occurs in different lexical categories;
 a lexical category occurs following another lexical category (or sequence
of categories).
 a particular grammar rule is used

Keywords:
Bayes' rule,
bigram,
conditional probability,
corpus,
hidden Markov model,
HMM,
independence (statistical),
lexical generation probability,
ngram,
output probability,
partofspeech tagging,
statistical NLP,
structural ambiguity,
tagged corpus,
trigram,
Viterbi algorithm

Plan
 revision of probability: conditional probability, Bayes' rule, independence
 partofspeech identification, corpus, tagged corpus
 estimating probabilities from word frequencies in a corpus
 partofspeech tagging
 bigrams, trigrams, ngrams
 estimating bigram probabilities
 experiments with a synthetic corpus
 graphical representation of bigram probabilities
 Markov assumption, hidden Markov models
 Viterbi algorithm
 forward probability, backward probability
 probabilistic contextfree grammars
 inside probability
 bestfirst parsing
 contextdependent bestfirst parser
7.1 Basic Probability Theory
 Probabilities are expressed as numbers between 0 (impossible) and 1 (certain).
 A random variable basically means a set of mutually exclusive
possible outcomes of some event, such as tossing a coin.
 Example: tossing a coin  the outcomes are H (heads) and T (tails).
 Each outcome has a probability: since H and T are equally likely, and the
outcome {H or T} is certain (probability 1)
Pr(H) = Pr(T) = 0.5.
 In general, if there are n possible outcomes
e_{1}, ..., e_{n}, then
Pr(e_{1}) + ... + Pr(e_{n}) = 1.
Revision of Conditional Probability
Revision of Bayes' Rule
By the definition,
Pr(A  B) = Pr(A ∧ B) / Pr(B)
Pr(B  A) = Pr(B ∧ A) / Pr(A)
so
Pr(A  B) / Pr(B  A) = Pr(A) / Pr(B), or
Pr(A  B) = Pr(B  A) × Pr(A) / Pr(B) 
This is called Bayes' rule, and it is sometimes useful in estimating
probabilities in the absence of some of the information.
Independence
Applying Statistical Concepts to NLP
 Problem: Given a sentence with ambiguous words, determine the
most likely lexical category for each word. One reason for wanting
do this is that a common reason for multiple parses is lexical category
ambiguity.
 This is called partofspeech (POS) identification.
 Say you need the partofspeech of words that can
be either N or V.
 This can be formalized using two random variables:
 one, C, ranges over the parts of speech (N and V)
 the other, W, ranges over all the possible words.
Applying Statistical Concepts to NLP 2
 Consider an example where W = flies.
The problem can be stated as determining which is greater, out of
Pr(C = N  W= flies)
Pr(C = V  W = flies)
 Often we shall write things like this as Pr(N  flies) for short.
 Using the definition of conditional probability, we have
Pr(N  flies) = Pr(flies ∧ N) / Pr(flies)
Pr(V  flies) = Pr(flies ∧ V) / Pr(flies)
 So what we really want is Pr(flies ∧ N) versus Pr(flies
∧ V).
Corpora and Tagging
 These probabilities can be estimated from a tagged corpus.
 A corpus is a collection of natural language sentences.
The collection might be of newspaper articles, for example.
 "Tagged" means that each word in the corpus is labelled
("tagged") with its part of speech.
 The label on a word is called its "tag".

Starting point for links to Jane Austen (early writings) Corpus Pages
Corpora and Tagging 2
 Suppose we have a corpus of simple sentences containing 1,273,000
words, including say 1000 uses of the word flies:
400 of them as an N and 600 of them as a V.
Pr(flies) = 1000 / 1,273,000 = 0.00078554595
Pr(flies ∧ N) = 400 / 1,273,000 = 0.00031421838
Pr(flies ∧ V) = 600 / 1,273,000 = 0.00047132757
 Our best guess will thus be that V is the most likely
partofspeech. (And we should be right about 60% of the time.)
Corpora and Tagging 3
7.2 Estimating Probabilities
 Probabilitybased methods rely on our ability to estimate the
probabilities. Accurate estimates assume vast amounts of data.
This is OK for common words. For uncommon words, a problem arises.
 The Brown corpus is a fairly typical corpus of about
a million words, but only about 49,000 distinct words.
 So, on average, each word occurs about 20 times.
 But in fact, over 40,000 words occur 5 times or less (with
any part of speech). For these words, the probability
estimates will be very crude.
7.2 Estimating Probabilities 2
 If you increase the size of the corpus, the uncommon words that we
were worried about will have more occurrences, but new, even more
unusual words will be introduced, and the problem is still there.
 Some lowfrequency words may occur zero times with one of the
partsofspeech that they can in fact take. The probability estimate
is then 0 for that part of speech.
 One technique for dealing with this is to
add a small number, like 0.5, to the count of each part of speech for a word
before computing the probabilities.
7.3 Automatic PartofSpeech Tagging
 Automatic POStagging can be done by selecting the most likely sequence
of syntactic categories for the words in a sentence, and using the syntactic
categories in that sequence as the tags for the words in the sentence.
 Our method in section 7.1 (always select the most frequent
category) works 90% of the time on a per word basis (mainly because lots of
words are unambiguous). So to be worth considering, a method needs to improve
on 90%.
 The standard method is to use some of the local context for the word.
With our flies example, if the word was preceded by the, then the
probability of N increases dramatically.
Computing Probabilities of Tag Sequences

Let w_{1}, ..., w_{T} be a sequence of words. We want to
find the sequence of lexical categories C_{1}, ..., C_{T}
that maximizes
1. Pr(C_{1}, ..., C_{T}  w_{1}, ..., w_{T})

Unfortunately, it would take far too much data to generate reasonable estimates
for such sequences directly from a corpus, so direct methods cannot be applied.
Computing Probabilities of Tag Sequences 2

However, there are
approximation methods. To develop them, we use
Bayes' rule, which says that this conditional probability equals
2. Pr(C_{1}, ..., C_{T}) × Pr(w_{1}, ..., w_{T}  C_{1}, ..., C_{T}) / Pr(w_{1}, ..., w_{T})

The common denominator Pr(w_{1}, ..., w_{T}) does not affect the maximisation problem, so in
effect we want to find a sequence C_{1}, ..., C_{T} that maximizes
3. Pr(C_{1}, ..., C_{T}) × Pr(w_{1}, ..., w_{T}  C_{1}, ..., C_{T})

These can be approximated by probabilities that are easier to
collect by making some independence assumptions. These independence assumptions
are not valid, but the estimates work well in practice.
Bigrams and Trigrams
 The first factor in formula 3, Pr(C_{1}, ..., C_{T}),
can be approximated by a sequence of probabilities based on a limited
number of previous categories, say 1 or 2 previous categories.
 The bigram model uses Pr(C_{i}  C_{i–1}):
i.e. (Pr(w_{i} is C_{i}  w_{i–1} is C_{i–1}).
 The trigram model uses Pr(C_{i}  C_{i–2}
C_{i–1}). Such models are called ngram models.
Bigrams
 Using bigrams, the following approximation can be used:
Pr(C_{1}, ..., C_{T}) ≈
Π _{i=1,T} Pr(C_{i}
 C_{i–1})
 To account for the beginning of a sentence, we invent a pseudocategory
<start> at position 0 as the value of C_{0}.
 Then for the sequence ART N V N, using bigrams:
Pr(ART N V N) ≈
Pr(ART  <start>) × Pr(N  ART) × Pr(V  N) × Pr(N  V)
Approximating Probabilities
Approximating Probabilities 2
Bigram Frequencies with a Toy Corpus
 Table 7.4 in Allen gives some bigram frequencies for an artificially
generated corpus of simple sentences.
 The corpus consisted of 300 sentences and had words in only four
categories: N, V, ART, and P, including 833 nouns, 300 verbs, 558
articles, and 307 prepositions for a total of 1998 words.
 To deal with the problem of sparse data, any bigram not listed in
Figure 7.4 was assumed to have a token probability of 0.0001.
Table 7.4 from Allen
Category
 Count at i
 Pair
 Count at i,i+1
 Bigram
 Estimate

<start>
 300
 <start>,ART
 213
 Pr(Art<start>)
 .71

<start>
 300
 <start>,N
 87
 Pr(N<start>)
 .29

ART
 558
 ART,N
 558
 Pr(NART)
 1

N
 833
 N,V
 358
 Pr(VN)
 .43

N
 833
 N,N
 108
 Pr(NN)
 .13

N
 833
 N,P
 366
 Pr(PN)
 .44

V
 300
 V,N
 75
 Pr(NV)
 .35

V
 300
 V,ART
 194
 Pr(ARTV)
 .65

P
 307
 P,ART
 226
 Pr(ARTP)
 .74

P
 307
 P,N
 81
 Pr(NP)
 .26

Table 7.4 Bigram probabilities from the generated corpus
Estimating Lexical Generation Probabilities
 The lexical generation probabilities Pr(W_{i}  C_{i})
can be estimated simply by counting the number of occurrences of each word
by category.
 Figure 7.5 in Allen gives some counts for individual words, from
which the probability estimates in Table 7.6 in Allen are generated.


Table 7.5 Some corpus word counts
 N
 V
 ART
 P
 TOTAL


flies
 21
 23
 0
 0
 44
 fruit
 49
 5
 1
 0
 55
 like
 10
 30
 0
 31
 61
 a
 1
 0
 201
 0
 202
 the
 1
 0
 300
 2
 303
 flower
 53
 15
 0
 0
 68
 flowers
 42
 16
 0
 0
 58
 birds
 64
 1
 0
 0
 65
 others
 592
 210
 56
 284
 1142
 TOTAL
 833
 300
 558
 307
 1998


Estimating Lexical Generation Probabilities 2
Table 7.6 The lexical generation probabilities
Pr(the  ART)  0.54   Pr(a  ART)  0.360

Pr(flies  N)  0.025   Pr(a  N)  0.001

Pr(flies  V)  0.076   Pr(flower  N)  0.063

Pr(like  V)  0.1   Pr(flower  V)  0.05

Pr(like  P)  0.068   Pr(birds  N)  0.076

Pr(like  N)  0.012   


Given N categories and T words, there are N^{T} possible tagsequences.
 Luckily, it is not necessary to consider all of these because of the
independence assumptions that were made about the data.
Finitestate Machine Model
 Since we are only dealing with bigram probabilities, the probability
that the ith word is in category C_{i} depends only
on the category of the i–1^{st} word,
C_{i–1}.
 Thus the process can be modelled by a special type of finite state
machine, as shown in Fig. 7.7 (below).
 Each node represents a possible lexical category, and the transition
probabilities (i.e. the bigram probabilities from Table 7.4) show the
probability of one category following another.
Figure 7.7 from Allen
Figure 7.7: A network capturing the bigram probabilities
Probabilities from Transition Probability Networks
 With such a network, you can compute the probability of any sequence of
categories simply by finding the path through the network indicated by the
sequence and multiplying the transition probabilities together.
 Thus, the sequence ART N V N would have probability
0.71 × 1 × 0.43 × 0.35 = 0.107.
 The validity of this depends on the assumption that the probability
of a category occurring depends only on the immediately preceding
category.
 This assumption is referred to as the Markov assumption,
and networks like Fig. 7.7 are called Markov Chains.
Probabilities from Transition Probability Networks 2
 The network representation can now be extended to include the
lexical generation probabilities as well: we allow each node to have
an output probability which gives a probability to each
possible output that could correspond to the node.
 For instance the node N in Fig. 7.7 would be associated with a
probability table that gives, for each word, how likely that word is to
be selected if we randomly select a noun.
 The output probabilities are exactly the lexical generation
probabilities from Fig. 7.6. A network like that in
Figure 7.7 with output probabilities associated with each node
is called a Hidden Markov Model (HMM).
Probabilities from Transition Probability Networks 3
Probabilities from Transition Probability Networks 4
Back to Mostlikely Tag Sequences
 Now we can resume the discussion of how to find the most
likely sequence of tags for a sequence of words.
 The key insight is that because of the Markov assumption, you do not
have to enumerate all the possible sequences. Sequences that end in the
same category can be collapsed together since the next category
only depends on the previous one in the sequence.
 So if you just keep track of the most likely sequence found so far
for each possible ending category, you can ignore all the other less
likely sequences.
 Fig. 7.8 does Flies like a flower in this way.
Figure 7.8: Encoding 256 possible sequences, using the Markov
assumption
Finding the Most Likely Tag Sequence
 To find the most likely sequence, sweep forward through the words
one at a time finding the most likely sequence for each ending category.
 In other words, you find the four best sequences for the two words
Flies like: the best ending with like as a V, the best as an N,
the best as a P and the best as an ART.
 You then use this information to find the four best sequences for
the the words flies like a, each one ending in a different category.
 This process is repeated until all the words are accounted for.
Finding the Most Likely Tag Sequence 2
 This algorithm is usually called the Viterbi algorithm.
 For a problem involving T words, and N lexical
categories, the Viterbi algorithm is guaranteed to find the most
likely sequence using kTN^{2} steps for some constant
k.
The Viterbi Algorithm
(Fig 7.9 in Allen)
Given word sequence
w_{1}, ..., w_{T},
lexical categories
L_{1}, ..., L_{N},
lexical probabilities Prob(w_{t}  L_{i})
and bigram probabilities
Prob(L_{i}  L_{j}), find
the most likely sequence of lexical categories
C_{1}, ..., C_{T}
for the word sequence.
Initialization Step
for lexcat = 1 to N do
SeqScore(lexcat, 1) =
Prob(w_{1}  L_{lexcat}) ×
Prob(L_{lexcat}  <start>)
BackPtr(lexcat, 1) = 0
Iteration Step
for t = 2 to T do
for lexcat = 1 to N do
SeqScore(lexcat, t) =
Max_{j=1,N}(SeqScore(j, t–1) ×
Prob(L_{lexcat}  L_{j})) ×
Prob(w_{t}  L_{lexcat})
BackPtr(lexcat, t) =
index of j that gave the max above
Sequence Identification Step
C_{T} = lexcat that maximizes SeqScore(i, T)
for word = T – 1 to 1 do
C_{word} =
BackPtr(C_{word+1}, word+1)

Notes on Viterbi Algorithm
 It's not clear what happens to BackPtr(,) if there is a tie for the maximizer
of SeqScore(lexcat, t).
 If SeqScore(lexcat, t) = 0, this says the sequence is impossible, so there
is not point in having a BackPtr in this case, and there is no point considering
sequences with C_{t} = lexcat any further.
 In the tables in the examples below, the name of the lexcat is used rather than
the number.
Simulating the Viterbi Algorithm 1


lexcat  SeqScore (lexcat,1)  BackPtr (lexcat,1) 
V  7.6×10^{–6}  ∅ 
N  0.00725  ∅ 
P  0  ∅ 
ART  0  ∅ 

Simulating the Viterbi Algorithm 2
Sample computation in iteration step:
SeqScore(V, 2)  =
Max[SeqScore(N, 1) × Prob(VN),
SeqScore(V, 1) × Prob(VV)] × Prob(likeV) 
 = Max[0.00725 × 0.43, 7.6 × 10^{–6} × 0.0001] × 0.1 
 = 3.12 × 10^{–4} 
The category that maximizes SeqScore(V, 2) is N, so BackPtr(V, 2) is N.
Simulating the Viterbi Algorithm 3


lexcat  SeqScore (lexcat,1)  SeqScore (lexcat,2)  BackPtr (lexcat,2) 
V  7.6×10^{–6}  0.00031  N 
N  0.00725  1.3×10^{–5}  N 
P  0  0.00022  N 
ART  0  0  ∅ 

Simulating the Viterbi Algorithm 4
lexcat  SeqScore (lexcat,1)  SeqScore (lexcat,2)  SeqScore (lexcat,3)  BackPtr (lexcat,3) 
V  7.6×10^{–6}  0.00031  0  ∅ 
N  0.00725  1.3×10^{–5}  1.2×10^{–7}  V 
P  0  0.00022  0  ∅ 
ART  0  0  7.2×10^{–5}  V 
Simulating the Viterbi Algorithm 5
lexcat  SeqScore (lexcat,1)  SeqScore (lexcat,2)  SeqScore (lexcat,3)  SeqScore (lexcat,4)  BackPtr (lexcat,4) 
V  7.6×10^{–6}
 0.00031  0  2.6×10^{–9}  ART 
N  0.00725  1.3×10^{–5}
 1.2×10^{–7}  4.3×10^{–6}  ART 
P  0  0.00022
 0  0  ∅ 
ART  0  0
 7.2×10^{–5}  0  ∅ 
Simulating the Viterbi Algorithm 6
 To find C_{4}, you look for the lexcat that maximizes
SeqScore(lexcat, 4).
 This turns out to be lexcat = N.
 Then we follow the BackPtr chain:
BackPtr(N, 4) = ART, so C_{3} = ART;
BackPtr(ART, 3) = V, so C_{2} = V;
BackPtr(V, 2) = N, so C_{1} = N;
(and BackPtr(N, 1) = ∅).
 So the most probable tag sequence is N V ART N.
 Algorithms like this (with a decentsized corpus) typically tag
correctly ≥ 95% of the time on a perword basis (up from 90%).
7.4 Obtaining Lexical Probabilities
Obtaining Lexical Probabilities 2
Obtaining Lexical Probabilities 3
Obtaining Lexical Probabilities 4
 More precisely, we define the forward
probability α_{i}(t)  the
probability of producing the words w_{1}, ...,
w_{t}, ending in state
w_{t}/L_{i}, as follows:
α_{i}(t) = Pr(w_{t}/L_{i}, w_{1}, ..., w_{t}).
 For example, with The flies like flowers, α_{V}(3)
would be Prob(w_{t}/L_{V},
w_{1}, ..., w_{t})
which is the sum of the probabilities computed for all 16 sequences ending
in V,
viz. ART ART V, ART N V, ART P V, ART V V, N ART V, N N V, N P V, N V V,
P ART V, P N V, P P V, P V V, V ART V, V N V, V P V, and V V V.
 We can derive that:
Pr(w_{t}/L_{i}  w_{1}, ..., w_{t}) ≈
α_{i}(t) / Σ_{j=1,N} α_{j}(t)
and a variant of the Viterbi algorithm can be used to compute the forward
probabilities: see Fig. 7.14 below.
Fig. 7.14 from Allen
The forward algorithm for computing the lexical probabilities
Initialization Step
for lexcat = 1 to N do
SeqSum(lexcat, 1) =
Prob(w_{1}  L_{lexcat}) ×
Prob(L_{lexcat}  <start>)
Computing the Forward Probabilities
for t = 2 to T do
for lexcat = 1 to N do
SeqSum(lexcat, t) =
Σ_{j=1,N}(SeqSum(j, t–1) ×
Prob(L_{lexcat}  L_{j})) ×
Prob(w_{t}  L_{lexcat})
Converting Forward Probabilities into Lexical Probabilities
for t = 1 to T do
for lexcat = 1 to N do
Prob(C_{t} = L_{lexcat}) =
SeqSum(lexcat, t) /
Σ_{j=1,N} SeqSum(j, t)

Fig. 7.15 from Allen
Computing the sums of the probabilities of the sequences
Table 7.16 from Allen
Contextdependent estimates for lexical categories
(cf. Table 7.13)
Prob(the/ART  the)  ≈  1.0 
Prob(flies/N  the flies)  ≈  0.9988 
Prob(flies/V  the flies)  ≈  0.0011 
Prob(like/V  the flies like)  ≈  0.575 
Prob(like/P  the flies like)  ≈  0.4 
Prob(like/N  the flies like)  ≈  0.022 
Prob(flowers/N  the flies like flowers)  ≈  0.967 
Prob(flowers/V  the flies like flowers)  ≈  0.033 
Obtaining Lexical Probabilities 5
7.5 Probabilistic ContextFree Grammars
Probabilistic ContextFree Grammars 2
 Grammar 7.17 shows a probabilistic CFG based on a parsed version of our example
corpus:
 Rule  Count for LHS  Count for Rule  Probability 
1.
 S → NP VP
 300
 300
 1

2.
 VP → V
 300
 116
 .386

3.
 VP → V NP
 300
 118
 .393

4.
 VP → V NP PP
 300
 66
 .22

5.
 NP → NP PP
 1023
 241
 .24

6.
 NP → N N
 1023
 92
 .09

7.
 NP → N
 1023
 141
 .14

8.
 NP → ART N
 1023
 558
 .55

9.
 PP → P NP
 307
 307
 1

Probabilistic ContextFree Grammars 3
 You can then develop algorithms similar in functioning to the Viterbi
algorithm that will find the most likely parse tree for a
sentence.
 Certain invalid independence assumptions are involved. In
particular, you must assume that the probability of a constituent
being derived by a rule R_{j} is independent of how the
constituent is used as a subconstituent.
 For example, this assumption
would imply that the probabilities of NP rules are the same whether
the NP is the subject, the object of a verb, or the object of a
preposition, whereas, for example, subject NPs are much more likely
to be pronouns than other NPs.
 Other issues that should ultimately be considered include
the subcategorisation of the V in a VP. Rule 3 would not be used
by an intransitive verb.
Probabilistic ContextFree Grammars 4
 The formalism can be based on the probability that a constituent
C generates a sequence of words w_{i},
w_{i+1}, ... , w_{j}, written as
w_{i,j}. This probability is called the inside
probability and written Pr(w_{i,j}  C).
 E.g. Pr(a flower  NP) = Pr(a flower  NP → ART N) +
Pr(a flower  NP → N N)
 Here Pr(a flower  NP → ART N) means Pr(Rule 8  NP) ×
Pr(a  ART) × Pr(flower  N)
 Similarly,
Pr(a flower blooms  S) =
 Pr(Rule 1  S) × Pr(a flower  NP) × Pr(blooms  VP) + 
 Pr(Rule 1  S) × Pr(a  NP) × Pr(flower blooms  VP) 
Probabilistic ContextFree Grammars 5
 We're more interested in the probabilities of particular parses
than the overall probability of a particular sentence.
 The probabilities of specific parse trees for a sentence can be found
using a standard chart parsing algorithm, where the probability of each
constituent is computed from the probability of its subconstituents together
with the probability of the rule used, and recorded, with the contituent,
on the chart.
 When entering an item E of
category C on the chart, using rule i with n
subconstituents corresponding to chart entries E_{1},
..., E_{n}:
Pr(E) = Pr(Rule i  C) × Pr(E_{1}) × ... × Pr(E_{n})
Probabilistic ContextFree Grammars 6
Three ways to generate a flower wilted as an S
Probabilistic ContextFree Grammars 7
 Ultimately a parser built using these techniques doesn't work as
well as you might hope, though it does help.
 Typically it has been found that these techniques identify the
correct parse 50% of the time.
 One critical issue is that the model assumes that the probability
of a particular verb being used in a VP rule is independent of
which rule is being considered.
Probabilistic ContextFree Grammars 8
 For example, Grammar 7.17 says that rule 3 is used 39% of the
time, rule 4 22%, and rule 5 24%.
 These between them mean that irrespective of what words are used,
an input sequence of the form V NP PP will always be parsed with the
PP attached to the verb, since the V NP PP interpretation has a
probability of 0.22, whereas the version that attaches the PP to
the NP has a probability of 0.39 × 0.24 = 0.093.
 This is true even with verbs that rarely take a _np_pp complement.
7.6 BestFirst Parsing
 So far, probabilistic CFGs have done nothing for parser efficiency.
 Algorithms developed to explore highprobability constituents
first are called bestfirst parsing algorithms.
The hope is that the best parse will be found first and quickly.
 Our basic chart parsing algorithm can be modified to consider
most likely constituents first.
 The central idea is to make the agenda a priority queue 
the highest rated elements are always first in the queue, and the
parser always removes the highestranked constituent from the agenda
before adding it to the chart.
BestFirst Parsing 2
 Some slight further modifications are necessary to keep the
parser working.
 With the modified agenda system, if the last word in the sentence
has the highest score, it will be added to the chart first.
 Previous versions of the chart parsing algorithm have assumed
that we could safely work lefttoright. This no longer holds. Thus,
whenever an active arc is added to the chart (e.g. by "moving the dot")
it may be that the next constituent needed to extend the arc is already
in the chart, and we must check for this.
BestFirst Parsing 3
 The complete new algorithm is shown in Fig. 7.22.
 Adopting a bestfirst strategy makes a significant improvement
in parser efficiency. Using grammar 7.17 and the lexicon information
from the corpus, the sentence The man put a bird in the house
is parsed correctly after generating 65 constituents. The standard
bottomup chart parser generates 158 constituents from the same sentence.
 The bestfirst parser is guaranteed to find the highest
probability parse (see Allen p. 214 for proof, if interested).
to add a constituent C from position p_{1}
to p_{2}:
 Insert C into the chart from position p_{1} to
p_{2}.
 For any active arc X → X_{1} ... •
C ... X_{n} from position p_{0}
to p_{1}, add a new active arc X →
X_{1} ... C • ... X_{n}
from position p_{0} to p_{2}.
to add an active arc X → X_{1} ...
C • C' ... X_{n} to the chart from
p_{0} to p_{2}:
 If C is the last constituent (that is, the arc is completed),
add a new constituent of type X to the agenda.
 Otherwise, if there is a constituent Y of category C'
in the chart from p_{2} to p_{3},
then recursively add an active arc X →
X_{1} ... C C' •
... X_{n} from p_{0} to
p_{3} (which may of course add further arcs or
create further constituents).

Figure 7.22 The new arc extension algorithm
7.7 A Simple ContextDependent BestFirst Parser
 The bestfirst algorithm improves efficiency but not accuracy.
 This section explores a simple alternative method of computing rule
probabilities that uses more contextdependent lexical information.
 The idea exploits the fact that the first word in a constituent is
often the head word and thus has a big effect on the
probabilities of rules that account for the constituent.
A Simple ContextDependent BestFirst Parser 2
 This suggests a new probability measure for rules that takes
into account the first word: Pr(R  C, w):
 How this helps: for instance, in the corpus, singular nouns rarely
occur alone as a noun phrase (i.e. NP → N),
and plural nouns are rarely used as a noun modifier (i.e. starting
rule NP → N N) 
this can be seen from the statistics in Fig. 7.23:
Figure 7.23 Some rule estimates based on the first word
Rule
 the
 house
 peaches
 flowers

NP → N
 0
 0
 .65
 .76

NP → N N
 0
 .82
 0
 0

NP → NP PP
 .23
 .18
 .35
 .24

NP → ART N
 .76
 0
 0
 0

Rule
 ate
 bloom
 like
 put

VP → V
 .28
 .84
 0
 .03

VP → V NP
 .57
 .10
 .9
 .03

VP → V NP PP
 .14
 .05
 .1
 .93

A Simple ContextDependent BestFirst Parser 3
Summary: Ambiguity Resolution  Statistical Methods

The concepts of independence and conditional probability, together with
frequency statistics on partsofspeech and grammar rules obtained from an NL
corpus, allow us to work out the most likely lexical categories for words in a
sentence, to order the parsing agenda for bestfirst parsing. We found in all
the cases we examined that contextdependent methods work best.

Exercises from Allen (with some minor changes)
Some of these exercises might take quite a while. They are offered
in part as challenge material, and should not be seen as compulsory.
 Hand simulate the Viterbi algorithm using the data and
probability estimates in Tables 7.47.6 on the sentence
Flower flowers like flowers. Draw transition networks
as in "Simulating the Viterbi Algorithm 16" for this sentence,
and figure out which parts of speech the algorithm identifies for
each word.
 Using the bigram and lexicalgeneration probabilities given in
this section, calculate the word probabilities using the forward
algorithm for the sentence The a flies like flowers
(involving a very rare use of the word a as a noun, as in
the a flies, the b flies, and so on). Remember to
use 0.0001 as a probability for any bigram not in the table.
Are the results you get reasonable? If not, what is the problem,
and how might it be fixed?
 Using the small tagged corpus at
http://www.cse.unsw.edu.au/~billw/cs9414/notes/corpora/austen_tagged
(about 8000 words),
write code to collect bigram and lexical statistics, and implement
a partofspeech tagged using the Viterbi algorithm. Test your
algorithm on the same corpus. How many sentences are tagged
correctly? How many words are tagged correctly? Would you expect
to get better accuracy results on the the sentences that involved
an error if you had access to more data? You can investigate this
and related issues by:
 keeping some of the sentences (e.g. the last 50, say) back
from the statistics collection) and comparing the performance
on the subcorpus with performance (with full statistics)
on the whole corpus;
 keeping some of the sentences, e.g. the last 50, back
from the statistics collection, and examining the performance
of your system on these last 50 sentences.
CRICOS Provider Code No. 00098G
Copyright (C) Bill Wilson, 20052012, except where another source is acknowledged.
Acknowledgement: All numbered sections, tables, and figures are
reproduced from James Allen's "Natural Language Processing" (ed. 2,
Benjamin Cummings, 1995) which is used as a text book for the course
for which these web pages form lecture notes. You are advised to obtain
access to a copy of this book (e.g. by buying it or borrowing it from
a library) to assist in understanding the material.