Ambiguity Resolution - Statistical Methods
Reference: Allen, Chapter 7
| Aim: |
To develop enough theory to describe algorithms for disambiguating the
syntactic categories of the words, choosing the most likely parse, and
enhancing parser efficiency, using statistical techniques, and frequency data
on the number of times that:
- a word occurs in different lexical categories;
- a lexical category occurs following another lexical category (or sequence
of categories).
- a particular grammar rule is used
|
| Keywords:
Bayes' rule,
bigram,
conditional probability,
corpus,
hidden Markov model,
HMM,
independence (statistical),
lexical generation probability,
n-gram,
output probability,
part-of-speech tagging,
statistical NLP,
structural ambiguity,
tagged corpus,
trigram,
Viterbi algorithm
|
Plan
- revision of probability: conditional probability, Bayes' rule, independence
- part-of-speech identification, corpus, tagged corpus
- estimating probabilities from word frequencies in a corpus
- part-of-speech tagging
- bigrams, trigrams, n-grams
- estimating bigram probabilities
- experiments with a synthetic corpus
- graphical representation of bigram probabilities
- Markov assumption, hidden Markov models
- Viterbi algorithm
- forward probability, backward probability
- probabilistic context-free grammars
- inside probability
- best-first parsing
- context-dependent best-first parser
7.1 Basic Probability Theory
- Probabilities are expressed as numbers between 0 (impossible) and 1 (certain).
- A random variable basically means a set of mutually exclusive
possible outcomes of some event, such as tossing a coin.
- Example: tossing a coin - the outcomes are H (heads) and T (tails).
- Each outcome has a probability: since H and T are equally likely, and the
outcome {H or T} is certain (probability 1)
Pr(H) = Pr(T) = 0.5.
- In general, if there are n possible outcomes
e1, ..., en, then
Pr(e1) + ... + Pr(en) = 1.
Revision of Conditional Probability
Revision of Bayes' Rule
By the definition,
Pr(A | B) = Pr(A ∧ B) / Pr(B)
Pr(B | A) = Pr(B ∧ A) / Pr(A)
so
Pr(A | B) / Pr(B | A) = Pr(A) / Pr(B), or
| Pr(A | B) = Pr(B | A) × Pr(A) / Pr(B) |
This is called Bayes' rule, and it is sometimes useful in estimating
probabilities in the absence of some of the information.
Independence
Applying Statistical Concepts to NLP
- Problem: Given a sentence with ambiguous words, determine the
most likely lexical category for each word.
- This is called part-of-speech (POS) identification.
- Say you need the part-of-speech of words that can
be either N or V.
- This can be formalized using two random variables:
- one, C, ranges over the parts of speech (N and V)
- the other, W, ranges over all the possible words.
Applying Statistical Concepts to NLP 2
- Consider an example where W = flies.
The problem can be stated as determining which is greater, out of
Pr(C = N | W= flies)
Pr(C = V | W = flies)
- Often we shall write things like this as Pr(N | flies) for short.
- Using the definition of conditional probability, we have
Pr(N | flies) = Pr(flies ∧ N) / Pr(flies)
Pr(V | flies) = Pr(flies ∧ V) / Pr(flies)
- So what we really want is Pr(flies ∧ N) versus Pr(flies
∧ V).
Corpora and Tagging
- These probabilities can be estimated from a tagged corpus.
- A corpus is a collection of natural language sentences.
The collection might be of newspaper articles, for example.
- "Tagged" means that each word in the corpus is labelled
("tagged") with its part of speech.
- The label on a word is called its "tag".
-
Starting point for links to Jane Austen (early writings) Corpus Pages
Corpora and Tagging 2
- Suppose we have a corpus of simple sentences containing 1,273,000
words, including say 1000 uses of the word flies:
400 of them as an N and 600 of them as a V.
Pr(flies) = 1000 / 1,273,000 = 0.00078554595
Pr(flies ∧ N) = 400 / 1,273,000 = 0.00031421838
Pr(flies ∧ V) = 600 / 1,273,000 = 0.00047132757
- Our best guess will thus be that V is the most likely
part-of-speech. (And we should be right about 60% of the time.)
Corpora and Tagging 3
7.2 Estimating Probabilities
- Probability-based methods rely on our ability to estimate the
probabilities. Accurate estimates assume vast amounts of data.
This is OK for common words. For uncommon words, a problem arises.
- The Brown corpus is a fairly typical corpus of about
a million words, but only about 49,000 distinct words.
- So, on average, each word occurs about 20 times.
- But in fact, over 40,000 words occur 5 times or less (with
any part of speech). For these words, the probability
estimates will be very crude.
7.2 Estimating Probabilities 2
- If you increase the size of the corpus, the uncommon words that we
were worried about will have more occurrences, but new, even more
unusual words will be introduced, and the problem is still there.
- Some low-frequency words may occur zero times with one of the
parts-of-speech that they can in fact take. The probability estimate
is then 0 for that part of speech.
- One technique for dealing with this is to
add a small number, like 0.5, to the count of each part of speech for a word
before computing the probabilities.
7.3 Automatic Part-of-Speech Tagging
- Automatic POS-tagging can be done by selecting the most likely sequence
of syntactic categories for the words in a sentence, and using the syntactic
categories in that sequence as the tags for the words in the sentence.
- Our method in section 7.1 (always select the most frequent
category) works 90% of the time on a per word basis (mainly because lots of
words are unambiguous). So to be worth considering, a method needs to improve
on 90%.
- The standard method is to use some of the local context for the word.
With our flies example, if the word was preceded by the, then the
probability of N increases dramatically.
Computing Probabilities of Tag Sequences
-
Let w1, ..., wT be a sequence of words. We want to
find the sequence of lexical categories C1, ..., CT
that maximizes
1. Pr(C1, ..., CT | w1, ..., wT)
-
Unfortunately, it would take far too much data to generate reasonable estimates
for such sequences directly from a corpus, so direct methods cannot be applied.
Computing Probabilities of Tag Sequences 2
-
However, there are
approximation methods. To develop them, we use
Bayes' rule, which says that this conditional probability equals
2. Pr(C1, ..., CT) × Pr(w1, ..., wT | C1, ..., CT) / Pr(w1, ..., wT)
-
The common denominator Pr(w1, ..., wT) does not affect the maximisation problem, so in
effect we want to find a sequence C1, ..., CT that maximizes
3. Pr(C1, ..., CT) × Pr(w1, ..., wT | C1, ..., CT)
-
These can be approximated by probabilities that are easier to
collect by making some independence assumptions. These independence assumptions
are not valid, but the estimates work well in practice.
Bigrams and Trigrams
- The first factor in formula 3, Pr(C1, ..., CT),
can be approximated by a sequence of probabilities based on a limited
number of previous categories, say 1 or 2 previous categories.
- The bigram model uses Pr(Ci | Ci-1).
- The trigram model uses Pr(Ci | Ci-2
Ci-1). Such models are called n-gram models.
Bigrams
- Using bigrams, the following approximation can be used:
Pr(C1, ..., CT) ≈
Π i=1,T Pr(Ci
| Ci-1)
- To account for the beginning of a sentence, we invent a pseudocategory
<start> at position 0 as the value of C0.
- Then for the sequence ART N V N, using bigrams:
Pr(ART N V N) ≈
Pr(ART | <start>) × Pr(N | ART) × Pr(V | N) × Pr(N | V)
Approximating Probabilities
Approximating Probabilities 2
Bigram Frequencies with a Toy Corpus
- Table 7.4 in Allen gives some bigram frequencies for an artificially
generated corpus of simple sentences.
- The corpus consisted of 300 sentences and had words in only four
categories: N, V, ART, and P, including 833 nouns, 300 verbs, 558
articles, and 307 prepositions for a total of 1998 words.
- To deal with the problem of sparse data, any bigram not listed in
Figure 7.4 was assumed to have a token probability of 0.0001.
Table 7.4 from Allen
| Category
| Count at i
| Pair
| Count at i,i+1
| Bigram
| Estimate
|
| <start>
| 300
| <start>,ART
| 213
| Pr(Art|<start>)
| .71
|
| <start>
| 300
| <start>,N
| 87
| Pr(N|<start>)
| .29
|
| ART
| 558
| ART,N
| 558
| Pr(N|ART)
| 1
|
| N
| 833
| N,V
| 358
| Pr(V|N)
| .43
|
| N
| 833
| N,N
| 108
| Pr(N|N)
| .13
|
| N
| 833
| N,P
| 366
| Pr(P|N)
| .44
|
| V
| 300
| V,N
| 75
| Pr(N|V)
| .35
|
| V
| 300
| V,ART
| 194
| Pr(ART|V)
| .65
|
| P
| 307
| P,ART
| 226
| Pr(ART|P)
| .74
|
| P
| 307
| P,N
| 81
| Pr(N|P)
| .26
|
Table 7.4 Bigram probabilities from the generated corpus
Estimating Lexical Generation Probabilities
- The lexical generation probabilities Pr(Wi | Ci)
can be estimated simply by counting the number of occurrences of each word
by category.
- Figure 7.5 in Allen gives some counts for individual words, from
which the probability estimates in Table 7.6 in Allen are generated.
|
|
Table 7.5 Some corpus word counts
|
| N
| V
| ART
| P
| TOTAL
|
|---|
| flies
| 21
| 23
| 0
| 0
| 44
| | fruit
| 49
| 5
| 1
| 0
| 55
| | like
| 10
| 30
| 0
| 31
| 61
| | a
| 1
| 0
| 201
| 0
| 202
| | the
| 1
| 0
| 300
| 2
| 303
| | flower
| 53
| 15
| 0
| 0
| 68
| | flowers
| 42
| 16
| 0
| 0
| 58
| | birds
| 64
| 1
| 0
| 0
| 65
| | others
| 592
| 210
| 56
| 284
| 1142
| | TOTAL
| 833
| 300
| 558
| 307
| 1998
|
|
Estimating Lexical Generation Probabilities 2
Table 7.6 The lexical generation probabilities
| Pr(the | ART) | 0.54 | | Pr(a | ART) | 0.360
|
| Pr(flies | N) | 0.025 | | Pr(a | N) | 0.001
|
| Pr(flies | V) | 0.076 | | Pr(flower | N) | 0.063
|
| Pr(like | V) | 0.1 | | Pr(flower | V) | 0.05
|
| Pr(like | P) | 0.068 | | Pr(birds | N) | 0.076
|
| Pr(like | N) | 0.012 | | |
|
-
Given N categories and T words, there are NT possible tag-sequences.
- Luckily, it is not necessary to consider all of these because of the
independence assumptions that were made about the data.
Finite-state Machine Model
- Since we are only dealing with bigram probabilities, the probability
that the i-th word is in category Ci depends only
on the category of the i-1st word,
Ci-1.
- Thus the process can be modelled by a special type of finite state
machine, as shown in Fig. 7.7 (below).
- Each node represents a possible lexical category, and the transition
probabilities (i.e. the bigram probabilities from Table 7.4) show the
probability of one category following another.
Figure 7.7 from Allen

Figure 7.7: A network capturing the bigram probabilities
Probabilities from Transition Probability Networks
- With such a network, you can compute the probability of any sequence of
categories simply by finding the path through the network indicated by the
sequence and multiplying the transition probabilities together.
- Thus, the sequence ART N V N would have probability
0.71 × 1 × 0.43 × 0.35 = 0.107.
- The validity of this depends on the assumption that the probability
of a category occurring depends only on the immediately preceding
category.
- This assumption is referred to as the Markov assumption,
and networks like Fig. 7.7 are called Markov Chains.
Probabilities from Transition Probability Networks 2
- The network representation can now be extended to include the
lexical generation probabilities as well: we allow each node to have
an output probability which gives a probability to each
possible output that could correspond to the node.
- For instance the node N in Fig. 7.7 would be associated with a
probability table that gives, for each word, how likely that word is to
be selected if we randomly select a noun.
- The output probabilities are exactly the lexical generation
probabilities from Fig. 7.6. A network like that in
Figure 7.7 with output probabilities associated with each node
is called a Hidden Markov Model (HMM).
Probabilities from Transition Probability Networks 3
Probabilities from Transition Probability Networks 4
Back to Most-likely Tag Sequences
- Now we can resume the discussion of how to find the most
likely sequence of tags for a sequence of words.
- The key insight is that because of the Markov assumption, you do not
have to enumerate all the possible sequences. Sequences that end in the
same category can be collapsed together since the next category
only depends on the previous one in the sequence.
- So if you just keep track of the most likely sequence found so far
for each possible ending category, you can ignore all the other less
likely sequences.
- Fig. 7.8 does Flies like a flower in this way.
Figure 7.8: Encoding 256 possible sequences, using the Markov
assumption
Finding the Most Likely Tag Sequence
- To find the most likely sequence, sweep forward through the words
one at a time finding the most likely sequence for each ending category.
- In other words, you find the four best sequences for the two words
Flies like: the best ending with like as a V, the best as an N,
the best as a P and the best as an ART.
- You then use this information to find the four best sequences for
the the words flies like a, each one ending in a different category.
- This process is repeated until all the words are accounted for.
Finding the Most Likely Tag Sequence 2
- This algorithm is usually called the Viterbi algorithm.
- For a problem involving T words, and N lexical
categories, the Viterbi algorithm is guaranteed to find the most
likely sequence using kTN2 steps for some constant
k.
The Viterbi Algorithm
(Fig 7.9 in Allen)
|
Given word sequence
w1, ..., wT,
lexical categories
L1, ..., LN,
lexical probabilities Prob(wt | Li)
and bigram probabilities
Prob(Li | Lj), find
the most likely sequence of lexical categories
C1, ..., CT
for the word sequence.
Initialization Step
for lexcat = 1 to N do
SeqScore(lexcat, 1) =
Prob(w1 | Llexcat) ×
Prob(Llexcat | <start>)
BackPtr(lexcat, 1) = 0
Iteration Step
for t = 2 to T do
for lexcat = 1 to N do
SeqScore(lexcat, t) =
Maxj=1,N(SeqScore(j, t–1) ×
Prob(Llexcat | Lj)) ×
Prob(wt | Llexcat)
BackPtr(lexcat, t) =
index of j that gave the max above
Sequence Identification Step
CT = lexcat that maximizes SeqScore(i, T)
for word = T – 1 to 1 do
Cword =
BackPtr(Cword+1, word+1)
|
Notes on Viterbi Algorithm
- It's not clear what happens to BackPtr(-,-) if there is a tie for the maximizer
of SeqScore(lexcat, t).
- If SeqScore(lexcat, t) = 0, this says the sequence is impossible, so there
is not point in having a BackPtr in this case, and there is no point considering
sequences with Ct = lexcat any further.
- In the tables in the examples below, the name of the lexcat is used rather than
the number.
Simulating the Viterbi Algorithm 1
|
|
| lexcat | SeqScore (lexcat,1) | BackPtr (lexcat,1) |
| V | 7.6×10–6 | ∅ |
| N | 0.00725 | ∅ |
| P | 0 | ∅ |
| ART | 0 | ∅ |
|
Simulating the Viterbi Algorithm 2
Sample computation in iteration step:
| SeqScore(V, 2) | =
Max[SeqScore(N, 1) × Prob(V|N),
SeqScore(V, 1) × Prob(V|V)] × Prob(like|V) |
| | = Max[0.00725 × 0.43, 7.6 × 10–6 × 0.0001] × 0.1 |
| | = 3.12 × 10–4 |
The category that maximizes SeqScore(V, 2) is N, so BackPtr(V, 2) is N.
Simulating the Viterbi Algorithm 3
|
|
| lexcat | SeqScore (lexcat,1) | SeqScore (lexcat,2) | BackPtr (lexcat,2) |
| V | 7.6×10–6 | 0.00031 | N |
| N | 0.00725 | 1.3×10–5 | N |
| P | 0 | 0.00022 | N |
| ART | 0 | 0 | ∅ |
|
Simulating the Viterbi Algorithm 4
| lexcat | SeqScore (lexcat,1) | SeqScore (lexcat,2) | SeqScore (lexcat,3) | BackPtr (lexcat,3) |
| V | 7.6×10–6 | 0.00031 | 0 | ∅ |
| N | 0.00725 | 1.3×10–5 | 1.2×10–7 | V |
| P | 0 | 0.00022 | 0 | ∅ |
| ART | 0 | 0 | 7.2×10–5 | V |
Simulating the Viterbi Algorithm 5
| lexcat | SeqScore (lexcat,1) | SeqScore (lexcat,2) | SeqScore (lexcat,3) | SeqScore (lexcat,4) | BackPtr (lexcat,4) |
| V | 7.6×10–6
| 0.00031 | 0 | 2.6×10–9| ART | |
| N | 0.00725 | 1.3×10–5
| 1.2×10–7 | 4.3×10–6| ART | |
| P | 0 | 0.00022
| 0 | 0 | ∅ |
| ART | 0 | 0
| 7.2×10–5 | 0 | ∅ |
Simulating the Viterbi Algorithm 6
- To find C4, you look for the lexcat that maximizes
SeqScore(lexcat, 4).
- This turns out to be lexcat = N.
- Then we follow the BackPtr chain:
BackPtr(N, 4) = ART, so C3 = ART;
BackPtr(ART, 3) = V, so C2 = V;
BackPtr(V, 2) = N, so C1 = N;
(and BackPtr(N, 1) = ∅).
- So the most probable tag sequence is N V ART N.
- Algorithms like this (with a decent-sized corpus) typically tag
correctly ≥ 95% of the time on a per-word basis (up from 90%).
7.4 Obtaining Lexical Probabilities
- We have seen the simple method of estimating lexical
probabilities by counting the number of times that each word w appears
in a corpus in each category and dividing the counts by the total number
of times that the word w appears:
Pr(Lj | w) ≈ count(Lj ∧ w) / Σ
i=1,N count(Li ∧ w)
where L1, ..., LN are the possible
categories. This leads to estimates like the following:
Table 7.13 from Allen
| Prob(N | flies) | = | 0.48 |
| Prob(V | flies) | = | 0.52 |
| Prob(V | like) | = | 0.49 |
| Prob(P | like) | = | 0.34 |
| Prob(N | like) | = | 0.16 |
|
| Prob(ART | the) | = | 0.99 |
| Prob(ART | a) | = | 0.995 |
| Prob(N | a) | = | 0.005 |
| Prob(N | flower) | = | 0.78 |
| Prob(V | flower) | = | 0.22 |
|
- A better estimate: compute how likely it is that category
Li occurred at position t over all sequences of
categories for the input w1, w2,
..., wt.
Obtaining Lexical Probabilities 2
Obtaining Lexical Probabilities 3
Obtaining Lexical Probabilities 4
- More precisely, we define the forward
probability αi(t) - the
probability of producing the words w1, ...,
wt, ending in state
wt/Li, as follows:
αi(t) = Pr(wt/Li, w1, ..., wt).
- For example, with The flies like flowers, αV(3)
would be Prob(wt/LV,
w1, ..., wt)
which is the sum of the probabilities computed for all 16 sequences ending
in V,
viz. ART ART V, ART N V, ART P V, ART V V, N ART V, N N V, N P V, N V V,
P ART V, P N V, P P V, P V V, V ART V, V N V, V P V, and V V V.
- We can derive that:
Pr(wt/Li | w1, ..., wt) ≈
αi(t) / Σj=1,N αj(t)
and a variant of the Viterbi algorithm can be used to compute the forward
probabilities: see Fig. 7.14 below.
Fig. 7.14 from Allen
The forward algorithm for computing the lexical probabilities
|
Initialization Step
for lexcat = 1 to N do
SeqSum(lexcat, 1) =
Prob(w1 | Llexcat) ×
Prob(Llexcat | <start>)
Computing the Forward Probabilities
for t = 2 to T do
for lexcat = 1 to N do
SeqSum(lexcat, t) =
Σj=1,N(SeqSum(j, t–1) ×
Prob(Llexcat | Lj)) ×
Prob(wt | Llexcat)
Converting Forward Probabilities into Lexical Probabilities
for t = 1 to T do
for lexcat = 1 to N do
Prob(Ct = Llexcat) =
SeqSum(lexcat, t) /
Σj=1,N SeqSum(j, t)
|
Fig. 7.15 from Allen
Computing the sums of the probabilities of the sequences
Table 7.16 from Allen
Context-dependent estimates for lexical categories
(cf. Table 7.13)
| Prob(the/ART | the) | ≈ | 1.0 |
| Prob(flies/N | the flies) | ≈ | 0.9988 |
| Prob(flies/V | the flies) | ≈ | 0.0011 |
| Prob(like/V | the flies like) | ≈ | 0.575 |
| Prob(like/P | the flies like) | ≈ | 0.4 |
| Prob(like/N | the flies like) | ≈ | 0.022 |
| Prob(flowers/N | the flies like flowers) | ≈ | 0.967 |
| Prob(flowers/V | the flies like flowers) | ≈ | 0.033 |
Obtaining Lexical Probabilities 5
7.5 Probabilistic Context-Free Grammars
Probabilistic Context-Free Grammars 2
- Grammar 7.17 shows a probabilistic CFG based on a parsed version of our example
corpus:
| | Rule | Count for LHS | Count for Rule | Probability |
| 1.
| S → NP VP
| 300
| 300
| 1
|
| 2.
| VP → V
| 300
| 116
| .386
|
| 3.
| VP → V NP
| 300
| 118
| .393
|
| 4.
| VP → V NP PP
| 300
| 66
| .22
|
| 5.
| NP → NP PP
| 1023
| 241
| .24
|
| 6.
| NP → N N
| 1023
| 92
| .09
|
| 7.
| NP → N
| 1023
| 141
| .14
|
| 8.
| NP → ART N
| 1023
| 558
| .55
|
| 9.
| PP → P NP
| 307
| 307
| 1
|
Probabilistic Context-Free Grammars 3
- You can then develop algorithms similar in functioning to the Viterbi
algorithm that will find the most likely parse tree for a
sentence.
- Certain invalid independence assumptions are involved. In
particular, you must assume that the probability of a constituent
being derived by a rule Rj is independent of how the
constituent is used as a subconstituent.
- For example, this assumption
would imply that the probabilities of NP rules are the same whether
the NP is the subject, the object of a verb, or the object of a
preposition, whereas, for example, subject NPs are much more likely
to be pronouns than other NPs.
- Other issues that should ultimately be considered include
the subcategorisation of the V in a VP. Rule 3 would not be used
by an intransitive verb.
Probabilistic Context-Free Grammars 4
- The formalism can be based on the probability that a constituent
C generates a sequence of words wi,
wi+1, ... , wj, written as
wi,j. This probability is called the inside
probability and written Pr(wi,j | C).
- E.g. Pr(a flower | NP) = Pr(a flower | NP → ART N) +
Pr(a flower | NP → N N)
- Here Pr(a flower | NP → ART N) means Pr(Rule 8 | NP) ×
Pr(a | ART) × Pr(flower | N)
- Similarly,
| Pr(a flower blooms | S) =
| Pr(Rule 1 | S) × Pr(a flower | NP) × Pr(blooms | VP) + |
| | Pr(Rule 1 | S) × Pr(a | NP) × Pr(flower blooms | VP) |
Probabilistic Context-Free Grammars 5
- We're more interested in the probabilities of particular parses
than the overall probability of a particular sentence.
- The probabilities of specific parse trees for a sentence can be found
using a standard chart parsing algorithm, where the probability of each
constituent is computed from the probability of its subconstituents together
with the probability of the rule used, and recorded, with the contituent,
on the chart.
- When entering an item E of
category C on the chart, using rule i with n
subconstituents corresponding to chart entries E1,
..., En:
Pr(E) = Pr(Rule i | C) × Pr(E1) × ... × Pr(En)
Probabilistic Context-Free Grammars 6
Three ways to generate a flower wilted as an S
Probabilistic Context-Free Grammars 7
- Ultimately a parser built using these techniques doesn't work as
well as you might hope, though it does help.
- Typically it has been found that these techniques identify the
correct parse 50% of the time.
- One critical issue is that the model assumes that the probability
of a particular verb being used in a VP rule is independent of
which rule is being considered.
Probabilistic Context-Free Grammars 8
- For example, Grammar 7.17 says that rule 3 is used 39% of the
time, rule 4 22%, and rule 5 24%.
- These between them mean that irrespective of what words are used,
an input sequence of the form V NP PP will always be parsed with the
PP attached to the verb, since the V NP PP interpretation has a
probability of 0.22, whereas the version that attaches the PP to
the NP has a probability of 0.39 × 0.24 = 0.093.
- This is true even with verbs that rarely take a _np_pp complement.
7.6 Best-First Parsing
- So far, probabilistic CFGs have done nothing for parser efficiency.
- Algorithms developed to explore high-probability constituents
first are called best-first parsing algorithms.
The hope is that the best parse will be found first and quickly.
- Our basic chart parsing algorithm can be modified to consider
most likely constituents first.
- The central idea is to make the agenda a priority queue -
the highest rated elements are always first in the queue, and the
parser always removes the highest-ranked constituent from the agenda
before adding it to the chart.
Best-First Parsing 2
- Some slight further modifications are necessary to keep the
parser working.
- With the modified agenda system, if the last word in the sentence
has the highest score, it will be added to the chart first.
- Previous versions of the chart parsing algorithm have assumed
that we could safely work left-to-right. This no longer holds. Thus,
whenever an active arc is added to the chart (e.g. by "moving the dot")
it may be that the next constituent needed to extend the arc is already
in the chart, and we must check for this.
Best-First Parsing 3
- The complete new algorithm is shown in Fig. 7.22.
- Adopting a best-first strategy makes a significant improvement
in parser efficiency. Using grammar 7.17 and the lexicon information
from the corpus, the sentence The man put a bird in the house
is parsed correctly after generating 65 constituents. The standard
bottom-up chart parser generates 158 constituents from the same sentence.
- The best-first parser is guaranteed to find the highest
probability parse (see Allen p. 214 for proof, if interested).
|
to add a constituent C from position p1
to p2:
- Insert C into the chart from position p1 to
p2.
- For any active arc X → X1 ... •
C ... Xn from position p0
to p1, add a new active arc X →
X1 ... C • ... Xn
from position p0 to p2.
to add an active arc X → X1 ...
C • C' ... Xn to the chart from
p0 to p2:
- If C is the last constituent (that is, the arc is completed),
add a new constituent of type X to the agenda.
- Otherwise, if there is a constituent Y of category C'
in the chart from p2 to p3,
then recursively add an active arc X →
X1 ... C C' •
... Xn from p0 to
p3 (which may of course add further arcs or
create further constituents).
|
Figure 7.22 The new arc extension algorithm
7.7 A Simple Context-Dependent Best-First Parser
- The best-first algorithm improves efficiency but not accuracy.
- This section explores a simple alternative method of computing rule
probabilities that uses more context-dependent lexical information.
- The idea exploits the fact that the first word in a constituent is
often the head word and thus has a big effect on the
probabilities of rules that account for the constituent.
A Simple Context-Dependent Best-First Parser 2
- This suggests a new probability measure for rules that takes
into account the first word: Pr(R | C, w):
- How this helps: for instance, in the corpus, singular nouns rarely
occur alone as a noun phrase (i.e. NP → N),
and plural nouns are rarely used as a noun modifier (i.e. starting
rule NP → N N) -
this can be seen from the statistics in Fig. 7.23:
Figure 7.23 Some rule estimates based on the first word
| Rule
| the
| house
| peaches
| flowers
|
| NP → N
| 0
| 0
| .65
| .76
|
| NP → N N
| 0
| .82
| 0
| 0
|
| NP → NP PP
| .23
| .18
| .35
| .24
|
| NP → ART N
| .76
| 0
| 0
| 0
|
| Rule
| ate
| bloom
| like
| put
|
| VP → V
| .28
| .84
| 0
| .03
|
| VP → V NP
| .57
| .10
| .9
| .03
|
| VP → V NP PP
| .14
| .05
| .1
| .93
|
A Simple Context-Dependent Best-First Parser 3
Summary: Ambiguity Resolution - Statistical Methods
| The concepts of independence and conditional probability, together with
frequency statistics on parts-of-speech and grammar rules obtained from an NL
corpus, allow us to work out the most likely lexical categories for words in a
sentence, to order the parsing agenda for best-first parsing. We found in all
the cases we examined that context-dependent methods work best.
| |
Exercises from Allen (with some minor changes)
Some of these exercises might take quite a while. They are offered
in part as challenge material, and should not be seen as compulsory.
- Hand simulate the Viterbi algorithm using the data and
probability estimates in Tables 7.4-7.6 on the sentence
Flower flowers like flowers. Draw transition networks
as in "Simulating the Viterbi Algorithm 1-6" for this sentence,
and figure out which parts of speech the algorithm identifies for
each word.
- Using the bigram and lexical-generation probabilities given in
this section, calculate the word probabilities using the forward
algorithm for the sentence The a flies like flowers
(involving a very rare use of the word a as a noun, as in
the a flies, the b flies, and so on). Remember to
use 0.0001 as a probability for any bigram not in the table.
Are the results you get reasonable? If not, what is the problem,
and how might it be fixed?
- Using the small tagged corpus at
http://www.cse.unsw.edu.au/~billw/cs9414/notes/corpora/austen_tagged
(about 8000 words),
write code to collect bigram and lexical statistics, and implement
a part-of-speech tagged using the Viterbi algorithm. Test your
algorithm on the same corpus. How many sentences are tagged
correctly? How many words are tagged correctly? Would you expect
to get better accuracy results on the the sentences that involved
an error if you had access to more data? You can investigate this
and related issues by:
- keeping some of the sentences (e.g. the last 50, say) back
from the statistics collection) and comparing the performance
on the sub-corpus with performance (with full statistics)
on the whole corpus;
- keeping some of the sentences, e.g. the last 50, back
from the statistics collection, and examining the performance
of your system on these last 50 sentences.
CRICOS Provider Code No. 00098G
Copyright (C) Bill Wilson, 2005-2008, except where another source is acknowledged.
Acknowledgement: All numbered sections, tables, and figures are
reproduced from James Allen's "Natural Language Processing" (ed. 2,
Benjamin Cummings, 1995) which is used as a text book for the course
for which these web pages form lecture notes. You are advised to obtain
access to a copy of this book (e.g. by buying it or borrowing it from
a library) to assist in understanding the material.