## Ambiguity Resolution - Statistical Methods

Reference: Allen, Chapter 7

Aim: To develop enough theory to describe algorithms for disambiguating the syntactic categories of words, choosing the most likely parse, and enhancing parser efficiency, using statistical techniques and frequency data on the number of times that:

• a word occurs in different lexical categories;
• a lexical category occurs following another lexical category (or sequence of categories);
• a particular grammar rule is used.

Keywords: Bayes' rule, bigram, conditional probability, corpus, hidden Markov model, HMM, independence (statistical), lexical generation probability, n-gram, output probability, part-of-speech tagging, statistical NLP, structural ambiguity, tagged corpus, trigram, Viterbi algorithm

#### Plan

• revision of probability: conditional probability, Bayes' rule, independence
• part-of-speech identification, corpus, tagged corpus
• estimating probabilities from word frequencies in a corpus
• part-of-speech tagging
• bigrams, trigrams, n-grams
• estimating bigram probabilities
• experiments with a synthetic corpus
• graphical representation of bigram probabilities
• Markov assumption, hidden Markov models
• Viterbi algorithm
• forward probability, backward probability
• probabilistic context-free grammars
• inside probability
• best-first parsing
• context-dependent best-first parser

### 7.1 Basic Probability Theory

• Probabilities are expressed as numbers between 0 (impossible) and 1 (certain).

• A random variable is, roughly, a variable ranging over a set of mutually exclusive possible outcomes of some event, such as tossing a coin.

• Example: tossing a coin - the outcomes are H (heads) and T (tails).

• Each outcome has a probability: since H and T are equally likely, and the outcome {H or T} is certain (probability 1),
Pr(H) = Pr(T) = 0.5.

• In general, if there are n possible outcomes e1, ..., en, then
Pr(e1) + ... + Pr(en) = 1.

#### Revision of Conditional Probability

• Suppose a horse Harry ran 100 races and won 20. So Pr(Win) = 0.2.

• Suppose also that 30 of these races were in wet conditions, and of these, Harry won 15. Clearly Harry does better in the rain: the probability that Harry would win given that it was raining would be 0.5.

• Conditional probability captures this idea. Pr(Win | Rain) = 0.5. Read "|" as "given".

• Formally, the conditional probability of an event e given an event e' is defined by the formula

Pr(e | e') = Pr(ee') / Pr(e')

#### Revision of Bayes' Rule

By the definition,

Pr(A | B) = Pr(AB) / Pr(B)

Pr(B | A) = Pr(BA) / Pr(A)

so

Pr(A | B) / Pr(B | A) = Pr(A) / Pr(B), or

 Pr(A | B) = Pr(B | A) × Pr(A) / Pr(B)

This is called Bayes' rule, and it is sometimes useful in estimating probabilities in the absence of some of the information.

#### Independence

• Two events are said to be independent of each other if:

the occurrence of one does not affect the probability of the occurrence of the other:

Pr(e | e') = Pr(e)

• Using the definition of conditional probability, this is equivalent to saying:

Pr(ee') = Pr(e) × Pr(e')

#### Applying Statistical Concepts to NLP

• Problem: Given a sentence with ambiguous words, determine the most likely lexical category for each word. One reason for wanting to do this is that a common reason for multiple parses is lexical category ambiguity.

• This is called part-of-speech (POS) identification.

• Say you need the part-of-speech of words that can be either N or V.

• This can be formalized using two random variables:

1. one, C, ranges over the parts of speech (N and V)

2. the other, W, ranges over all the possible words.

#### Applying Statistical Concepts to NLP 2

• Consider an example where W = flies. The problem can be stated as determining which is greater, out of

Pr(C = N | W= flies)
Pr(C = V | W = flies)

• Often we shall write things like this as Pr(N | flies) for short.

• Using the definition of conditional probability, we have

Pr(N | flies) = Pr(flies ∧ N) / Pr(flies)
Pr(V | flies) = Pr(flies ∧ V) / Pr(flies)

• So what we really want is Pr(flies ∧ N) versus Pr(flies ∧ V).

#### Corpora and Tagging

• These probabilities can be estimated from a tagged corpus.

• A corpus is a collection of natural language sentences.
The collection might be of newspaper articles, for example.

• "Tagged" means that each word in the corpus is labelled ("tagged") with its part of speech.

• The label on a word is called its "tag".


#### Corpora and Tagging 2

• Suppose we have a corpus of simple sentences containing 1,273,000 words, including say 1000 uses of the word flies:
400 of them as an N and 600 of them as a V.

Pr(flies) = 1000 / 1,273,000 = 0.00078554595
Pr(flies ∧ N) = 400 / 1,273,000 = 0.00031421838
Pr(flies ∧ V) = 600 / 1,273,000 = 0.00047132757

• Our best guess will thus be that V is the most likely part-of-speech. (And we should be right about 60% of the time.)

#### Corpora and Tagging 3

• We could have seen that directly from the counts, but in more complicated instances, we shall need the actual probabilities.

• Incidentally, we can now estimate Pr(V | flies):

 Pr(V | flies) = Pr(flies ∧ V) / Pr(flies) = 0.00047132757 / 0.00078554595 = 0.6

which also = count(flies ∧ V) / count(flies)

### 7.2 Estimating Probabilities

• Probability-based methods rely on our ability to estimate the probabilities involved. Accurate estimates require vast amounts of data. This is fine for common words, but for uncommon words a problem arises.

• The Brown corpus is a fairly typical corpus of about a million words, but only about 49,000 distinct words.

• So, on average, each word occurs about 20 times.

• But in fact, over 40,000 words occur 5 times or less (with any part of speech). For these words, the probability estimates will be very crude.

### 7.2 Estimating Probabilities 2

• If you increase the size of the corpus, the uncommon words that we were worried about will have more occurrences, but new, even more unusual words will be introduced, and the problem is still there.

• Some low-frequency words may occur zero times with one of the parts-of-speech that they can in fact take. The probability estimate is then 0 for that part of speech.

• One technique for dealing with this is to add a small number, like 0.5, to the count of each part of speech for a word before computing the probabilities.

### 7.3 Automatic Part-of-Speech Tagging

• Automatic POS-tagging can be done by selecting the most likely sequence of syntactic categories for the words in a sentence, and using the syntactic categories in that sequence as the tags for the words in the sentence.

• Our method in section 7.1 (always select the most frequent category) works 90% of the time on a per word basis (mainly because lots of words are unambiguous). So to be worth considering, a method needs to improve on 90%.

• The standard method is to use some of the local context for the word. With our flies example, if the word was preceded by the, then the probability of N increases dramatically.

#### Computing Probabilities of Tag Sequences

• Let w1, ..., wT be a sequence of words. We want to find the sequence of lexical categories C1, ..., CT that maximizes

1. Pr(C1, ..., CT | w1, ..., wT)

• Unfortunately, it would take far too much data to generate reasonable estimates for such sequences directly from a corpus, so direct methods cannot be applied.

#### Computing Probabilities of Tag Sequences 2

• However, there are approximation methods. To develop them, we use Bayes' rule, which says that this conditional probability equals

2. Pr(C1, ..., CT) × Pr(w1, ..., wT | C1, ..., CT) / Pr(w1, ..., wT)

• The common denominator Pr(w1, ..., wT) does not affect the maximisation problem, so in effect we want to find a sequence C1, ..., CT that maximizes

3. Pr(C1, ..., CT) × Pr(w1, ..., wT | C1, ..., CT)

• These can be approximated by probabilities that are easier to collect by making some independence assumptions. These independence assumptions are not valid, but the estimates work well in practice.

#### Bigrams and Trigrams

• The first factor in formula 3, Pr(C1, ..., CT), can be approximated by a sequence of probabilities based on a limited number of previous categories, say 1 or 2 previous categories.

• The bigram model uses Pr(Ci | Ci–1), i.e. Pr(wi is Ci | wi–1 is Ci–1).

• The trigram model uses Pr(Ci | Ci–2 Ci–1). Such models are called n-gram models.

#### Bigrams

• Using bigrams, the following approximation can be used:

Pr(C1, ..., CT) ≈ Π i=1,T Pr(Ci | Ci–1)

• To account for the beginning of a sentence, we invent a pseudocategory <start> at position 0 as the value of C0.

• Then for the sequence ART N V N, using bigrams:

Pr(ART N V N) ≈
Pr(ART | <start>) × Pr(N | ART) × Pr(V | N) × Pr(N | V)

#### Approximating Probabilities

• The second factor in formula 3,

Pr(w1, ..., wT | C1, ..., CT)

can be approximated by assuming that a word appears in a category independent of the words in the preceding or succeeding categories.

• It is approximated by

Πi=1,T Pr(wi | Ci).

#### Approximating Probabilities 2

• With these approximations, we are down to finding the sequence C1, ..., CT maximizing

Πi=1,T Pr(Ci | Ci–1) Pr(wi | Ci)

• These probabilities can be estimated from a tagged corpus.

• For example, the probability that a V follows an N can be estimated as follows:

Pr(Ci = V | Ci–1 = N) ≈
Count(N at position i–1 and V at i) / Count(N at position i–1)

#### Bigram Frequencies with a Toy Corpus

• Table 7.4 in Allen gives some bigram frequencies for an artificially generated corpus of simple sentences.

• The corpus consisted of 300 sentences and had words in only four categories: N, V, ART, and P, including 833 nouns, 300 verbs, 558 articles, and 307 prepositions for a total of 1998 words.

• To deal with the problem of sparse data, any bigram not listed in Table 7.4 was assumed to have a token probability of 0.0001.

#### Table 7.4 from Allen

| Category | Count at i | Pair | Count at i, i+1 | Bigram | Estimate |
|---|---|---|---|---|---|
| <start> | 300 | <start>, ART | 213 | Pr(ART \| <start>) | .71 |
| <start> | 300 | <start>, N | 87 | Pr(N \| <start>) | .29 |
| ART | 558 | ART, N | 558 | Pr(N \| ART) | 1 |
| N | 833 | N, V | 358 | Pr(V \| N) | .43 |
| N | 833 | N, N | 108 | Pr(N \| N) | .13 |
| N | 833 | N, P | 366 | Pr(P \| N) | .44 |
| V | 300 | V, N | 75 | Pr(N \| V) | .35 |
| V | 300 | V, ART | 194 | Pr(ART \| V) | .65 |
| P | 307 | P, ART | 226 | Pr(ART \| P) | .74 |
| P | 307 | P, N | 81 | Pr(N \| P) | .26 |

Table 7.4 Bigram probabilities from the generated corpus

#### Estimating Lexical Generation Probabilities

• The lexical generation probabilities Pr(Wi | Ci) can be estimated simply by counting the number of occurrences of each word by category.

• Table 7.5 in Allen gives some counts for individual words, from which the probability estimates in Table 7.6 in Allen are generated.

Table 7.5 Some corpus word counts

| | N | V | ART | P | TOTAL |
|---|---|---|---|---|---|
| flies | 21 | 23 | 0 | 0 | 44 |
| fruit | 49 | 5 | 1 | 0 | 55 |
| like | 10 | 30 | 0 | 21 | 61 |
| a | 1 | 0 | 201 | 0 | 202 |
| the | 1 | 0 | 300 | 2 | 303 |
| flower | 53 | 15 | 0 | 0 | 68 |
| flowers | 42 | 16 | 0 | 0 | 58 |
| birds | 64 | 1 | 0 | 0 | 65 |
| others | 592 | 210 | 56 | 284 | 1142 |
| TOTAL | 833 | 300 | 558 | 307 | 1998 |

#### Estimating Lexical Generation Probabilities 2

Table 7.6 Estimated lexical generation probabilities

| Probability | Estimate |
|---|---|
| Pr(the \| ART) | 0.54 |
| Pr(a \| ART) | 0.360 |
| Pr(a \| N) | 0.001 |
| Pr(flies \| N) | 0.025 |
| Pr(flies \| V) | 0.076 |
| Pr(flower \| N) | 0.063 |
| Pr(flower \| V) | 0.05 |
| Pr(like \| V) | 0.1 |
| Pr(like \| P) | 0.068 |
| Pr(like \| N) | 0.012 |
| Pr(birds \| N) | 0.076 |

• Given N categories and T words, there are N^T possible tag sequences.

• Luckily, it is not necessary to consider all of these because of the independence assumptions that were made about the data.

#### Finite-state Machine Model

• Since we are only dealing with bigram probabilities, the probability that the i-th word is in category Ci depends only on the category of the (i–1)-th word, Ci–1.

• Thus the process can be modelled by a special type of finite state machine, as shown in Fig. 7.7 (below).

• Each node represents a possible lexical category, and the transition probabilities (i.e. the bigram probabilities from Table 7.4) show the probability of one category following another.

#### Figure 7.7 from Allen

Figure 7.7: A network capturing the bigram probabilities

#### Probabilities from Transition Probability Networks

• With such a network, you can compute the probability of any sequence of categories simply by finding the path through the network indicated by the sequence and multiplying the transition probabilities together.

• Thus, the sequence ART N V N would have probability 0.71 × 1 × 0.43 × 0.35 = 0.107.

• The validity of this depends on the assumption that the probability of a category occurring depends only on the immediately preceding category.

• This assumption is referred to as the Markov assumption, and networks like Fig. 7.7 are called Markov Chains.

#### Probabilities from Transition Probability Networks 2

• The network representation can now be extended to include the lexical generation probabilities as well: we allow each node to have an output probability which gives a probability to each possible output that could correspond to the node.

• For instance the node N in Fig. 7.7 would be associated with a probability table that gives, for each word, how likely that word is to be selected if we randomly select a noun.

• The output probabilities are exactly the lexical generation probabilities from Table 7.6.

• A network like that in Figure 7.7 with output probabilities associated with each node is called a Hidden Markov Model (HMM).

#### Probabilities from Transition Probability Networks 3

• The probability that the sequence N V ART N generates the output Flies like a flower is computed as follows:

• The probability of the path N V ART N, given Fig. 7.7, is

.29 × .43 × .65 × 1 = 0.081.

• The probability of the output being Flies like a flower is computed from the output probabilities in Table 7.6:

Pr(flies | N) × Pr(like | V) × Pr(a | ART) × Pr(flower | N) = 0.025 × .1 × .36 × .063 = 5.4×10–5

#### Probabilities from Transition Probability Networks 4

• Multiplying 0.081 and 5.4×10–5 together gives us the likelihood that the HMM would generate the sequence: 4.37×10–6.

• More generally, the approximate probability of generating a sentence w1, ..., wT together with the sequence of tags C1, ..., CT is

Πi=1,T Pr(Ci | Ci–1) Pr(wi | Ci)

#### Back to Most-likely Tag Sequences

• Now we can resume the discussion of how to find the most likely sequence of tags for a sequence of words.

• The key insight is that because of the Markov assumption, you do not have to enumerate all the possible sequences. Sequences that end in the same category can be collapsed together since the next category only depends on the previous one in the sequence.

• So if you just keep track of the most likely sequence found so far for each possible ending category, you can ignore all the other less likely sequences.

• Fig. 7.8 does Flies like a flower in this way.

Figure 7.8: Encoding 256 possible sequences, using the Markov assumption

#### Finding the Most Likely Tag Sequence

• To find the most likely sequence, sweep forward through the words one at a time finding the most likely sequence for each ending category.

• In other words, you find the four best sequences for the two words Flies like: the best ending with like as a V, the best as an N, the best as a P and the best as an ART.

• You then use this information to find the four best sequences for the words flies like a, each one ending in a different category.

• This process is repeated until all the words are accounted for.

#### Finding the Most Likely Tag Sequence 2

• This algorithm is usually called the Viterbi algorithm.

• For a problem involving T words and N lexical categories, the Viterbi algorithm is guaranteed to find the most likely sequence using kTN^2 steps, for some constant k.

### The Viterbi Algorithm

(Fig. 7.9 in Allen)

Given word sequence w1, ..., wT, lexical categories L1, ..., LN, lexical probabilities Prob(wt | L_i) and bigram probabilities Prob(L_i | L_j), find the most likely sequence of lexical categories C1, ..., CT for the word sequence.

```
Initialization Step
for lexcat = 1 to N do
    SeqScore(lexcat, 1) = Prob(w1 | L_lexcat) × Prob(L_lexcat | ∅)
    BackPtr(lexcat, 1) = 0

Iteration Step
for t = 2 to T do
    for lexcat = 1 to N do
        SeqScore(lexcat, t) = Max[j=1,N] (SeqScore(j, t–1) × Prob(L_lexcat | L_j))
                              × Prob(wt | L_lexcat)
        BackPtr(lexcat, t) = index of the j that gave the max above

Sequence Identification Step
C_T = the lexcat that maximizes SeqScore(lexcat, T)
for word = T–1 down to 1 do
    C_word = BackPtr(C_word+1, word+1)
```

### Notes on Viterbi Algorithm

1. It's not clear what happens to BackPtr(-,-) if there is a tie for the maximizer of SeqScore(lexcat, t).

2. If SeqScore(lexcat, t) = 0, this says the sequence is impossible, so there is no point in having a BackPtr in this case, and no point in considering sequences with Ct = lexcat any further.

3. In the tables in the examples below, the name of the lexcat is used rather than the number.

### Simulating the Viterbi Algorithm 1

After the initialization step, for Flies like a flower:

| lexcat | SeqScore(lexcat,1) | BackPtr(lexcat,1) |
|---|---|---|
| V | 7.6×10–6 | 0 |
| N | 0.00725 | 0 |
| P | 0 | 0 |
| ART | 0 | 0 |

### Simulating the Viterbi Algorithm 2

Sample computation in iteration step:

SeqScore(V, 2)
  = Max[SeqScore(N, 1) × Prob(V|N), SeqScore(V, 1) × Prob(V|V)] × Prob(like|V)
  = Max[0.00725 × 0.43, 7.6×10–6 × 0.0001] × 0.1
  = 3.12×10–4

The previous category that gives the maximum is N, so BackPtr(V, 2) = N.

### Simulating the Viterbi Algorithm 3

| lexcat | SeqScore(lexcat,1) | SeqScore(lexcat,2) | BackPtr(lexcat,2) |
|---|---|---|---|
| V | 7.6×10–6 | 0.00031 | N |
| N | 0.00725 | 1.3×10–5 | N |
| P | 0 | 0.00022 | N |
| ART | 0 | 0 | |

### Simulating the Viterbi Algorithm 4

| lexcat | SeqScore(lexcat,1) | SeqScore(lexcat,2) | SeqScore(lexcat,3) | BackPtr(lexcat,3) |
|---|---|---|---|---|
| V | 7.6×10–6 | 0.00031 | 0 | |
| N | 0.00725 | 1.3×10–5 | 1.2×10–7 | V |
| P | 0 | 0.00022 | 0 | |
| ART | 0 | 0 | 7.2×10–5 | V |

### Simulating the Viterbi Algorithm 5

| lexcat | SeqScore(lexcat,1) | SeqScore(lexcat,2) | SeqScore(lexcat,3) | SeqScore(lexcat,4) | BackPtr(lexcat,4) |
|---|---|---|---|---|---|
| V | 7.6×10–6 | 0.00031 | 0 | 2.6×10–9 | ART |
| N | 0.00725 | 1.3×10–5 | 1.2×10–7 | 4.3×10–6 | ART |
| P | 0 | 0.00022 | 0 | 0 | |
| ART | 0 | 0 | 7.2×10–5 | 0 | |

### Simulating the Viterbi Algorithm 6

• To find C4, you look for the lexcat that maximizes SeqScore(lexcat, 4).

• This turns out to be lexcat = N.

• Then we follow the BackPtr chain:

BackPtr(N, 4) = ART, so C3 = ART;
BackPtr(ART, 3) = V, so C2 = V;
BackPtr(V, 2) = N, so C1 = N;
(and BackPtr(N, 1) = ∅).

• So the most probable tag sequence is N V ART N.

• Algorithms like this (with a decent-sized corpus) typically tag correctly ≥ 95% of the time on a per-word basis (up from 90%).

### 7.4 Obtaining Lexical Probabilities

• Rather than the single most likely tag sequence, we may want the probability that an individual word belongs to each of its possible categories. This can be estimated by summing the probabilities of all the tag sequences that assign that category to the word.

### Obtaining Lexical Probabilities 2

• For example, consider the probability that flies is a noun in The flies like flowers: sum the probabilities of all sequences that end with flies as a noun.

• Using the data in Tables 7.4 and 7.6, the possibilities are:

The/ART flies/N: Pr(ART|∅) × Pr(the|ART) × Pr(N|ART) × Pr(flies|N)
  = 0.71 × 0.54 × 1 × 0.025 = 9.585×10–3

The/N flies/N: Pr(N|∅) × Pr(the|N) × Pr(N|N) × Pr(flies|N)
  = 0.29 × 0.0033 × 0.13 × 0.025 = 3.11×10–6

The/P flies/N: Pr(P|∅) × Pr(the|P) × Pr(N|P) × Pr(flies|N)
  = 0.0001 × 0.0066 × 0.26 × 0.025 = 4.29×10–9

• The sum of these doesn't differ significantly from the first term, 9.585×10–3.

### Obtaining Lexical Probabilities 3

• Similarly, 3 sequences with non-zero probability end with flies as a V, with a total probability of 1.13×10–5.

• No other non-zero probability sequences ending in flies exist, so the sum 9.585×10–3 + 1.13×10–5 = 9.596×10–3 is the probability of the sequence the flies.

• Then the probability that flies is a noun in this case is

Pr(flies/N | the flies) = Pr(flies/N ∧ the flies) / Pr(the flies)
  = 9.585×10–3 / 9.596×10–3 = 0.99885

and so Pr(flies/V | the flies) = 0.00115.

• Compare these with the estimates Prob(N | flies) = 0.48 and Prob(V | flies) = 0.52 from Table 7.13 in Allen.

### Obtaining Lexical Probabilities 4

• More precisely, we define the forward probability αi(t) - the probability of producing the words w1, ..., wt, ending in state wt/Li, as follows:

αi(t) = Pr(wt/Li, w1, ..., wt).

• For example, with The flies like flowers, αV(3) = Pr(like/V, the flies like), which is the sum of the probabilities computed for all 16 sequences ending in V, viz. ART ART V, ART N V, ART P V, ART V V, N ART V, N N V, N P V, N V V, P ART V, P N V, P P V, P V V, V ART V, V N V, V P V, and V V V.

• We can derive that:

Pr(wt/Li | w1, ..., wt) ≈ αi(t) / Σj=1,N αj(t)

and a variant of the Viterbi algorithm can be used to compute the forward probabilities: see Fig. 7.14 below.

### Fig. 7.14 from Allen

```
Initialization Step
for lexcat = 1 to N do
    SeqSum(lexcat, 1) = Prob(w1 | L_lexcat) × Prob(L_lexcat | ∅)

Computing the Forward Probabilities
for t = 2 to T do
    for lexcat = 1 to N do
        SeqSum(lexcat, t) = Σ[j=1,N] (SeqSum(j, t–1) × Prob(L_lexcat | L_j))
                            × Prob(wt | L_lexcat)

Converting Forward Probabilities into Lexical Probabilities
for t = 1 to T do
    for lexcat = 1 to N do
        Prob(Ct = L_lexcat) = SeqSum(lexcat, t) / Σ[j=1,N] SeqSum(j, t)
```

### Fig. 7.15 from Allen

Figure 7.15: Computing the sums of the probabilities of the sequences

### Table 7.16 from Allen

Prob(the/ART | the) ≈ 1
Prob(flies/N | the flies) ≈ 0.9988
Prob(flies/V | the flies) ≈ 0.0011
Prob(like/V | the flies like) ≈ 0.575
Prob(like/P | the flies like) ≈ 0.4
Prob(like/N | the flies like) ≈ 0.022
Prob(flowers/N | the flies like flowers) ≈ 0.967
Prob(flowers/V | the flies like flowers) ≈ 0.033

### Obtaining Lexical Probabilities 5

• You could also consider the backward probability βi(t) - the probability of producing the words wt, ..., wT given state wt/Li; it is computed by working backwards through the sentence, from wT to wt.

• In the end, a better method of estimating the lexical probabilities would be to consider the whole sentence, not just the words up to position t starting from one end or the other of the sentence. In this case, you could use the estimate:

Pr(wt/Li) = (αi(t) × βi(t)) / Σj=1,N (αj(t) × βj(t))

### 7.5 Probabilistic Context-Free Grammars

• Just as we can accumulate statistics on the use of lexical grammatical categories (e.g. how many times N → "flies" and N → "flowers" etc. are used) so it is possible to get statistics on phrasal category use (e.g. how many times NP → ART N and NP → N N are used).

• One can obtain the statistics by counting the number of times each rule is used in a corpus containing parsed sentences.

• Suppose that there are m rules R1, ..., Rm with left-hand side C. You can estimate the probability of using rule Rj to derive C by

Pr(Rj | C) = (#times Rj used) / Σi=1,m (#times Ri used)

### Probabilistic Context-Free Grammars 2

• Grammar 7.17 shows a probabilistic CFG based on a parsed version of our example corpus:

| Rule | Count for LHS | Count for Rule | Probability |
|---|---|---|---|
| 1. S → NP VP | 300 | 300 | 1 |
| 2. VP → V | 300 | 116 | .386 |
| 3. VP → V NP | 300 | 118 | .393 |
| 4. VP → V NP PP | 300 | 66 | .22 |
| 5. NP → NP PP | 1023 | 241 | .24 |
| 6. NP → N N | 1023 | 92 | .09 |
| 7. NP → N | 1023 | 141 | .14 |
| 8. NP → ART N | 1023 | 558 | .55 |
| 9. PP → P NP | 307 | 307 | 1 |

### Probabilistic Context-Free Grammars 3

• You can then develop algorithms similar in functioning to the Viterbi algorithm that will find the most likely parse tree for a sentence.

• Certain invalid independence assumptions are involved. In particular, you must assume that the probability of a constituent being derived by a rule Rj is independent of how the constituent is used as a subconstituent.

• For example, this assumption would imply that the probabilities of NP rules are the same whether the NP is the subject, the object of a verb, or the object of a preposition, whereas, for example, subject NPs are much more likely to be pronouns than other NPs.

• Other issues that should ultimately be considered include the subcategorisation of the V in a VP. Rule 3 would not be used by an intransitive verb.

### Probabilistic Context-Free Grammars 4

• The formalism can be based on the probability that a constituent C generates a sequence of words wi, wi+1, ... , wj, written as wi,j. This probability is called the inside probability and written Pr(wi,j | C).

• E.g. Pr(a flower | NP) = Pr(a flower | NP → ART N) + Pr(a flower | NP → N N)

• Here Pr(a flower | NP → ART N) means Pr(Rule 8 | NP) × Pr(a | ART) × Pr(flower | N)

• Similarly,

Pr(a flower blooms | S) = Pr(Rule 1 | S) × Pr(a flower | NP) × Pr(blooms | VP)
  + Pr(Rule 1 | S) × Pr(a | NP) × Pr(flower blooms | VP)

### Probabilistic Context-Free Grammars 5

• We're more interested in the probabilities of particular parses than the overall probability of a particular sentence.

• The probabilities of specific parse trees for a sentence can be found using a standard chart parsing algorithm, where the probability of each constituent is computed from the probabilities of its subconstituents together with the probability of the rule used, and recorded, with the constituent, on the chart.

• When entering an item E of category C on the chart, using rule i with n subconstituents corresponding to chart entries E1, ..., En:

Pr(E) = Pr(Rule i | C) × Pr(E1) × ... × Pr(En)

### Probabilistic Context-Free Grammars 6

Three ways to generate a flower wilted as an S

### Probabilistic Context-Free Grammars 7

• Ultimately a parser built using these techniques doesn't work as well as you might hope, though it does help.

• Typically it has been found that these techniques identify the correct parse 50% of the time.

• One critical issue is that the model assumes that the probability of a particular verb being used in a VP rule is independent of which rule is being considered.

### Probabilistic Context-Free Grammars 8

• For example, Grammar 7.17 says that rule 3 is used 39% of the time, rule 4 22%, and rule 5 24%.

• Between them, these numbers mean that, irrespective of what words are used, an input sequence of the form V NP PP will always be parsed with the PP attached to the verb, since the V NP PP interpretation has a probability of 0.22, whereas the version that attaches the PP to the NP has a probability of 0.39 × 0.24 = 0.093.

• This is true even with verbs that rarely take an NP PP complement.

### 7.6 Best-First Parsing

• So far, probabilistic CFGs have done nothing for parser efficiency.

• Algorithms developed to explore high-probability constituents first are called best-first parsing algorithms. The hope is that the best parse will be found first and quickly.

• Our basic chart parsing algorithm can be modified to consider most likely constituents first.

• The central idea is to make the agenda a priority queue - the highest rated elements are always first in the queue, and the parser always removes the highest-ranked constituent from the agenda before adding it to the chart.

### Best-First Parsing 2

• Some slight further modifications are necessary to keep the parser working.

• With the modified agenda system, if the last word in the sentence has the highest score, it will be added to the chart first.

• Previous versions of the chart parsing algorithm have assumed that we could safely work left-to-right. This no longer holds. Thus, whenever an active arc is added to the chart (e.g. by "moving the dot") it may be that the next constituent needed to extend the arc is already in the chart, and we must check for this.

### Best-First Parsing 3

• The complete new algorithm is shown in Fig. 7.22.

• Adopting a best-first strategy makes a significant improvement in parser efficiency. Using grammar 7.17 and the lexicon information from the corpus, the sentence The man put a bird in the house is parsed correctly after generating 65 constituents. The standard bottom-up chart parser generates 158 constituents from the same sentence.

• The best-first parser is guaranteed to find the highest probability parse (see Allen p. 214 for proof, if interested).

```
to add a constituent C from position p1 to p2:
    Insert C into the chart from position p1 to p2.
    For any active arc X → X1 ... • C ... Xn from position p0 to p1,
        add a new active arc X → X1 ... C • ... Xn from position p0 to p2.

to add an active arc X → X1 ... C • C' ... Xn to the chart from p0 to p2:
    If C is the last constituent (that is, the arc is completed),
        add a new constituent of type X to the agenda.
    Otherwise, if there is a constituent Y of category C' in the chart
        from p2 to p3, then recursively add an active arc
        X → X1 ... C C' • ... Xn from p0 to p3
        (which may of course add further arcs or create further constituents).
```

Figure 7.22 The new arc extension algorithm

### 7.7 A Simple Context-Dependent Best-First Parser

• The best-first algorithm improves efficiency but not accuracy.

• This section explores a simple alternative method of computing rule probabilities that uses more context-dependent lexical information.

• The idea exploits the fact that the first word in a constituent is often the head word and thus has a big effect on the probabilities of rules that account for the constituent.

### A Simple Context-Dependent Best-First Parser 2

• This suggests a new probability measure for rules that takes into account the first word: Pr(R | C, w).

• How this helps: for instance, in the corpus, singular nouns rarely occur alone as a noun phrase (i.e. rule NP → N), and plural nouns are rarely used as a noun modifier (i.e. starting rule NP → N N) - this can be seen from the statistics in Figure 7.23:

Figure 7.23 Some rule estimates based on the first word

| Rule | the | house | peaches | flowers |
|---|---|---|---|---|
| NP → N | 0 | 0 | .65 | .76 |
| NP → N N | 0 | .82 | 0 | 0 |
| NP → NP PP | .23 | .18 | .35 | .24 |
| NP → ART N | .76 | 0 | 0 | 0 |

| Rule | ate | bloom | like | put |
|---|---|---|---|---|
| VP → V | .28 | .84 | 0 | .03 |
| VP → V NP | .57 | .10 | .9 | .03 |
| VP → V NP PP | .14 | .05 | .1 | .93 |

### A Simple Context-Dependent Best-First Parser 3

• Also, these context-sensitive rules encode verb preferences for different subcategorizations - see the bottom half of Fig. 7.23.

• A parser based on these probabilities does substantially better at getting PP attachment right (but still gets 34% wrong).

• It also turns out to be more efficient (36 constituents, compared with 65 (best-first) and 158 (vanilla parser) for The man put the bird in the house).

Further details in Allen, pp. 217-218.

### Summary: Ambiguity Resolution - Statistical Methods

The concepts of independence and conditional probability, together with frequency statistics on parts-of-speech and grammar rules obtained from an NL corpus, allow us to work out the most likely lexical categories for the words in a sentence, and to order the parsing agenda for best-first parsing. In all the cases we examined, context-dependent methods worked best.

### Exercises from Allen (with some minor changes)

Some of these exercises might take quite a while. They are offered in part as challenge material, and should not be seen as compulsory.

1. Hand simulate the Viterbi algorithm using the data and probability estimates in Tables 7.4-7.6 on the sentence Flower flowers like flowers. Draw transition networks as in "Simulating the Viterbi Algorithm 1-6" for this sentence, and figure out which parts of speech the algorithm identifies for each word.

2. Using the bigram and lexical-generation probabilities given in this section, calculate the word probabilities using the forward algorithm for the sentence The a flies like flowers (involving a very rare use of the word a as a noun, as in the a flies, the b flies, and so on). Remember to use 0.0001 as a probability for any bigram not in the table. Are the results you get reasonable? If not, what is the problem, and how might it be fixed?

3. Using the small tagged corpus at http://www.cse.unsw.edu.au/~billw/cs9414/notes/corpora/austen_tagged (about 8000 words), write code to collect bigram and lexical statistics, and implement a part-of-speech tagger using the Viterbi algorithm. Test your algorithm on the same corpus. How many sentences are tagged correctly? How many words are tagged correctly? Would you expect to get better accuracy results on the sentences that involved an error if you had access to more data? You can investigate this and related issues by:
• keeping some of the sentences (e.g. the last 50) back from the statistics collection and comparing the performance on the sub-corpus with performance (with full statistics) on the whole corpus;
• keeping some of the sentences (e.g. the last 50) back from the statistics collection, and examining the performance of your system on these last 50 sentences.


Copyright (C) Bill Wilson, 2005-2012, except where another source is acknowledged.

Acknowledgement: All numbered sections, tables, and figures are reproduced from James Allen's "Natural Language Processing" (ed. 2, Benjamin Cummings, 1995), which is used as a textbook for the course for which these web pages form lecture notes. You are advised to obtain access to a copy of this book (e.g. by buying it or borrowing it from a library) to assist in understanding the material.