Ambiguity Resolution - Statistical Methods

Reference: Allen, Chapter 7

Aim:
To develop enough theory to describe algorithms for disambiguating the syntactic categories of words, choosing the most likely parse, and enhancing parser efficiency, using statistical techniques and frequency data on the number of times that:
  1. a word occurs in different lexical categories;
  2. a lexical category follows another lexical category (or sequence of categories);
  3. a particular grammar rule is used.
Keywords: Bayes' rule, bigram, conditional probability, corpus, hidden Markov model, HMM, independence (statistical), lexical generation probability, n-gram, output probability, part-of-speech tagging, statistical NLP, structural ambiguity, tagged corpus, trigram, Viterbi algorithm


Plan


7.1 Basic Probability Theory


Revision of Conditional Probability


Revision of Bayes' Rule

By the definition of conditional probability,

Pr(A | B) = Pr(AB) / Pr(B)

Pr(B | A) = Pr(BA) / Pr(A)

so

Pr(A | B) / Pr(B | A) = Pr(A) / Pr(B), or

Pr(A | B) = Pr(B | A) × Pr(A) / Pr(B)

This is called Bayes' rule. It is useful when we need a conditional probability such as Pr(A | B) but can more easily estimate the reverse conditional Pr(B | A), together with Pr(A) and Pr(B).
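As a concrete check, the following Python sketch (written for these notes, not from Allen) estimates Pr(N | flies) both directly and via Bayes' rule, using the word and category counts that appear later in Table 7.5; all quantities are relative-frequency estimates.

# A minimal numeric check of Bayes' rule, using corpus counts from Table 7.5 below.
count_flies_N = 21     # occurrences of "flies" tagged N
count_flies = 44       # occurrences of "flies" in any category
count_N = 833          # all N tokens in the corpus
total = 1998           # all tokens in the corpus

direct = count_flies_N / count_flies              # Pr(N | flies) estimated directly

pr_flies_given_N = count_flies_N / count_N        # Pr(flies | N)
pr_N = count_N / total                            # Pr(N)
pr_flies = count_flies / total                    # Pr(flies)
via_bayes = pr_flies_given_N * pr_N / pr_flies    # Bayes' rule

print(direct, via_bayes)                          # both are 21/44, about 0.477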


Independence


Applying Statistical Concepts to NLP


Applying Statistical Concepts to NLP 2


Corpora and Tagging


Corpora and Tagging 2


Corpora and Tagging 3


7.2 Estimating Probabilities


7.2 Estimating Probabilities 2


7.3 Automatic Part-of-Speech Tagging


Computing Probabilities of Tag Sequences


Computing Probabilities of Tag Sequences 2


Bigrams and Trigrams


Bigrams


Approximating Probabilities


Approximating Probabilities 2


Bigram Frequencies with a Toy Corpus


Table 7.4 from Allen

Category   Count at i   Pair          Count at i,i+1   Bigram            Estimate
<start>    300          <start>,ART   213              Pr(ART|<start>)   .71
<start>    300          <start>,N     87               Pr(N|<start>)     .29
ART        558          ART,N         558              Pr(N|ART)         1
N          833          N,V           358              Pr(V|N)           .43
N          833          N,N           108              Pr(N|N)           .13
N          833          N,P           366              Pr(P|N)           .44
V          300          V,N           75               Pr(N|V)           .35
V          300          V,ART         194              Pr(ART|V)         .65
P          307          P,ART         226              Pr(ART|P)         .74
P          307          P,N           81               Pr(N|P)           .26

Table 7.4 Bigram probabilities from the generated corpus
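The estimates in Table 7.4 are relative frequencies: Prob(category at i+1 | category at i) is estimated as Count(i, i+1) / Count(i). A short Python sketch of this computation (written for these notes; the function and data layout are illustrative, not from Allen):

from collections import defaultdict

def bigram_estimates(tagged_sentences):
    # tagged_sentences: a list of sentences, each a list of (word, category) pairs.
    # Returns estimates of Prob(next category | category), as in Table 7.4.
    cat_count = defaultdict(int)    # the "Count at i" column
    pair_count = defaultdict(int)   # the "Count at i,i+1" column
    for sentence in tagged_sentences:
        cats = ["<start>"] + [cat for (_, cat) in sentence]
        for c in cats:
            cat_count[c] += 1
        for c1, c2 in zip(cats, cats[1:]):
            pair_count[(c1, c2)] += 1
    return {(c2, c1): n / cat_count[c1] for (c1, c2), n in pair_count.items()}

# Example: on the single tagged sentence the/ART flies/N like/V flowers/N,
# bigram_estimates(...)[("N", "ART")] is 1.0, i.e. Prob(N | ART) = 1.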


Estimating Lexical Generation Probabilities

  • The lexical generation probabilities Pr(W_i | C_i) can be estimated simply by counting the number of occurrences of each word by category.

  • Figure 7.5 in Allen (reproduced as Table 7.5 below) gives some counts for individual words, from which the probability estimates in Table 7.6 are derived.

 
Table 7.5 Some corpus word counts

Word      N     V     ART   P     TOTAL
flies     21    23    0     0     44
fruit     49    5     1     0     55
like      10    30    0     21    61
a         1     0     201   0     202
the       1     0     300   2     303
flower    53    15    0     0     68
flowers   42    16    0     0     58
birds     64    1     0     0     65
others    592   210   56    284   1142
TOTAL     833   300   558   307   1998

Estimating Lexical Generation Probabilities 2

Table 7.6 The lexical generation probabilities

Pr(the | ART)     0.54      Pr(a | ART)       0.360
Pr(flies | N)     0.025     Pr(a | N)         0.001
Pr(flies | V)     0.076     Pr(flower | N)    0.063
Pr(like | V)      0.1       Pr(flower | V)    0.05
Pr(like | P)      0.068     Pr(birds | N)     0.076
Pr(like | N)      0.012
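These are again relative frequencies: Prob(w | C) is estimated as the count of w tagged C divided by the total count of category C. A small illustrative Python sketch (not Allen's code), using counts laid out as in Table 7.5:

from collections import defaultdict

def lexical_generation_estimates(word_counts):
    # word_counts: a dict mapping each word to a dict of category -> count,
    # e.g. {"flies": {"N": 21, "V": 23}, "the": {"N": 1, "ART": 300, "P": 2}, ...}
    # Returns estimates of Prob(word | category), as in Table 7.6.
    cat_totals = defaultdict(int)
    for counts in word_counts.values():
        for cat, n in counts.items():
            cat_totals[cat] += n
    return {(word, cat): n / cat_totals[cat]
            for word, counts in word_counts.items()
            for cat, n in counts.items()}

# With all the rows of Table 7.5 (including "others"), this gives, for example,
# Prob(the | ART) = 300/558 ≈ 0.54 and Prob(flies | N) = 21/833 ≈ 0.025.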


Finite-state Machine Model


Figure 7.7 from Allen


Figure 7.7: A network capturing the bigram probabilities


Probabilities from Transition Probability Networks


Probabilities from Transition Probability Networks 2


Probabilities from Transition Probability Networks 3


Probabilities from Transition Probability Networks 4


Back to Most-likely Tag Sequences


Figure 7.8: Encoding 256 possible sequences, using the Markov assumption


Finding the Most Likely Tag Sequence


Finding the Most Likely Tag Sequence 2


The Viterbi Algorithm

(Fig. 7.9 in Allen)
Given a word sequence w_1, ..., w_T, lexical categories L_1, ..., L_N, lexical probabilities Prob(w_t | L_i), and bigram probabilities Prob(L_i | L_j), find the most likely sequence of lexical categories C_1, ..., C_T for the word sequence.

Initialization Step
for lexcat = 1 to N do
    SeqScore(lexcat, 1) = Prob(w_1 | L_lexcat) × Prob(L_lexcat | <start>)
    BackPtr(lexcat, 1) = 0

Iteration Step
for t = 2 to T do
    for lexcat = 1 to N do
        SeqScore(lexcat, t) = Max_{j=1..N} [ SeqScore(j, t-1) × Prob(L_lexcat | L_j) ] × Prob(w_t | L_lexcat)
        BackPtr(lexcat, t) = the j that gave the maximum above

Sequence Identification Step
C_T = the lexcat i that maximizes SeqScore(i, T)
for word = T-1 down to 1 do
    C_word = BackPtr(C_(word+1), word+1)


Notes on Viterbi Algorithm

  1. The algorithm does not specify what BackPtr(lexcat, t) should be if two or more values of j tie for the maximum of SeqScore(j, t-1) × Prob(L_lexcat | L_j).

  2. If SeqScore(lexcat, t) = 0, this says the sequence is impossible, so there is no point in having a BackPtr in this case, and no point in considering sequences with C_t = lexcat any further.

  3. In the tables in the examples below, the name of the lexcat is used rather than the number.
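For concreteness, here is a Python sketch of the algorithm (written for these notes, not taken from Allen). It indexes SeqScore and BackPtr by category name rather than by number (cf. note 3), uses 0.0001 for any bigram not in the table (as in Exercise 2 at the end of these notes), and treats a word/category pair missing from the lexical table as having probability 0.

def viterbi(words, categories, lexprob, bigram, default_bigram=0.0001):
    # words: the input word sequence w_1 ... w_T
    # categories: the list of lexical categories L_1 ... L_N
    # lexprob[(w, C)] = Prob(w | C); bigram[(C, Cprev)] = Prob(C | Cprev)
    T = len(words)
    seqscore = [{} for _ in range(T)]
    backptr = [{} for _ in range(T)]

    # Initialization step
    for c in categories:
        seqscore[0][c] = (lexprob.get((words[0], c), 0.0)
                          * bigram.get((c, "<start>"), default_bigram))
        backptr[0][c] = None

    # Iteration step
    for t in range(1, T):
        for c in categories:
            best_prev, best = None, 0.0
            for p in categories:
                score = seqscore[t - 1][p] * bigram.get((c, p), default_bigram)
                if score > best:
                    best_prev, best = p, score
            seqscore[t][c] = best * lexprob.get((words[t], c), 0.0)
            backptr[t][c] = best_prev

    # Sequence identification step
    tags = [max(categories, key=lambda c: seqscore[T - 1][c])]
    for t in range(T - 1, 0, -1):
        tags.append(backptr[t][tags[-1]])
    return list(reversed(tags))

Run on the sentence flies like a flower with the probabilities of Tables 7.4 and 7.6, this should reproduce the SeqScore values in the hand simulation below and return the tags N, V, ART, N.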


Simulating the Viterbi Algorithm 1

Step 1 of the Viterbi algorithm, for the sentence Flies like a flower (word 1: flies)

lexcat   SeqScore(lexcat,1)   BackPtr(lexcat,1)
V        7.6×10⁻⁶             0
N        0.00725              0
P        0                    0
ART      0                    0

Simulating the Viterbi Algorithm 2

Sample computation in the iteration step (t = 2, word like):

SeqScore(V, 2) = Max[SeqScore(N, 1) × Prob(V|N),
                     SeqScore(V, 1) × Prob(V|V)] × Prob(like|V)
               = Max[0.00725 × 0.43, 7.6×10⁻⁶ × 0.0001] × 0.1
               = 3.12 × 10⁻⁴

(The terms for P and ART are omitted because SeqScore(P, 1) = SeqScore(ART, 1) = 0, and Prob(V|V), which is not in Table 7.4, is given the default value 0.0001.)

The previous category that gives the maximum is N, so BackPtr(V, 2) = N.


Simulating the Viterbi Algorithm 3

Step 2 of the Viterbi algorithm (word 2: like)

lexcat   SeqScore(lexcat,1)   SeqScore(lexcat,2)   BackPtr(lexcat,2)
V        7.6×10⁻⁶             0.00031              N
N        0.00725              1.3×10⁻⁵             N
P        0                    0.00022              N
ART      0                    0

Simulating the Viterbi Algorithm 4

Step 3 of the Viterbi algorithm (word 3: a)

lexcat   SeqScore(lexcat,1)   SeqScore(lexcat,2)   SeqScore(lexcat,3)   BackPtr(lexcat,3)
V        7.6×10⁻⁶             0.00031              0
N        0.00725              1.3×10⁻⁵             1.2×10⁻⁷             V
P        0                    0.00022              0
ART      0                    0                    7.2×10⁻⁵             V


Simulating the Viterbi Algorithm 5

Step 4 of the Viterbi algorithm (word 4: flower)

lexcat   SeqScore(lexcat,1)   SeqScore(lexcat,2)   SeqScore(lexcat,3)   SeqScore(lexcat,4)   BackPtr(lexcat,4)
V        7.6×10⁻⁶             0.00031              0                    2.6×10⁻⁹             N
N        0.00725              1.3×10⁻⁵             1.2×10⁻⁷             4.3×10⁻⁶             ART
P        0                    0.00022              0                    0
ART      0                    0                    7.2×10⁻⁵             0


Simulating the Viterbi Algorithm 6


7.4 Obtaining Lexical Probabilities


Obtaining Lexical Probabilities 2


Obtaining Lexical Probabilities 3


Obtaining Lexical Probabilities 4


Fig. 7.14 from Allen

The forward algorithm for computing the lexical probabilities

Initialization Step
for lexcat = 1 to N do
    SeqSum(lexcat, 1) = Prob(w_1 | L_lexcat) × Prob(L_lexcat | <start>)

Computing the Forward Probabilities
for t = 2 to T do
    for lexcat = 1 to N do
        SeqSum(lexcat, t) = Σ_{j=1..N} [ SeqSum(j, t-1) × Prob(L_lexcat | L_j) ] × Prob(w_t | L_lexcat)

Converting Forward Probabilities into Lexical Probabilities
for t = 1 to T do
    for lexcat = 1 to N do
        Prob(C_t = L_lexcat) = SeqSum(lexcat, t) / Σ_{j=1..N} SeqSum(j, t)
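A matching Python sketch of the forward computation (again written for these notes, with the same conventions as the Viterbi sketch above: dictionaries keyed by category names, 0.0001 for missing bigrams, 0 for missing lexical entries):

def forward_lexical_probs(words, categories, lexprob, bigram, default_bigram=0.0001):
    # Returns, for each word position t, a dict giving Prob(C_t = L) for each category L.
    T = len(words)
    seqsum = [{} for _ in range(T)]

    # Initialization step
    for c in categories:
        seqsum[0][c] = (lexprob.get((words[0], c), 0.0)
                        * bigram.get((c, "<start>"), default_bigram))

    # Computing the forward probabilities
    for t in range(1, T):
        for c in categories:
            total = sum(seqsum[t - 1][p] * bigram.get((c, p), default_bigram)
                        for p in categories)
            seqsum[t][c] = total * lexprob.get((words[t], c), 0.0)

    # Converting forward probabilities into lexical probabilities
    result = []
    for t in range(T):
        norm = sum(seqsum[t].values())
        result.append({c: (seqsum[t][c] / norm if norm > 0 else 0.0)
                       for c in categories})
    return result

On the flies like flowers, this should give context-dependent estimates comparable to those in Table 7.16 below.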


Fig. 7.15 from Allen


Computing the sums of the probabilities of the sequences


Table 7.16 from Allen

Context-dependent estimates for lexical categories (cf. Table 7.13)
Prob(the/ART | the)                       1.0
Prob(flies/N | the flies)                 0.9988
Prob(flies/V | the flies)                 0.0011
Prob(like/V | the flies like)             0.575
Prob(like/P | the flies like)             0.4
Prob(like/N | the flies like)             0.022
Prob(flowers/N | the flies like flowers)  0.967
Prob(flowers/V | the flies like flowers)  0.033

Obtaining Lexical Probabilities 5


7.5 Probabilistic Context-Free Grammars


Probabilistic Context-Free Grammars 2


Probabilistic Context-Free Grammars 3


Probabilistic Context-Free Grammars 4


Probabilistic Context-Free Grammars 5


Probabilistic Context-Free Grammars 6

Figure 7.19 from Allen
Three ways to generate a flower wilted as an S


Probabilistic Context-Free Grammars 7


Probabilistic Context-Free Grammars 8


7.6 Best-First Parsing


Best-First Parsing 2


Best-First Parsing 3


to add a constituent C from position p1 to p2:

  1. Insert C into the chart from position p1 to p2.
  2. For any active arc X → X1 ... • C ... Xn from position p0 to p1, add a new active arc X → X1 ... C • ... Xn from position p0 to p2.

to add an active arc X → X1 ... C • C' ... Xn to the chart from p0 to p2:

  1. If C is the last constituent (that is, the arc is completed), add a new constituent of type X to the agenda.
  2. Otherwise, if there is a constituent Y of category C' in the chart from p2 to p3, then recursively add an active arc X → X1 ... C C' • ... Xn from p0 to p3 (which may, of course, add further arcs or create further constituents).

Figure 7.22 The new arc extension algorithm
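The following rough Python sketch shows one way the two operations in Figure 7.22 could be realised; the data structures, names, and the detail of putting completed constituents on an agenda for the best-first parser to rank are assumptions made for illustration, not Allen's code.

from dataclasses import dataclass

@dataclass(frozen=True)
class Arc:
    lhs: str        # X
    rhs: tuple      # (X1, ..., Xn)
    dot: int        # how many right-hand-side constituents have been matched
    start: int      # p0
    end: int        # rightmost position reached so far

constituents = []   # chart entries (category, from, to)
active_arcs = []    # chart entries of type Arc
agenda = []         # completed constituents awaiting ranking by the best-first parser

def add_constituent(cat, p1, p2):
    # Insert the constituent, then extend any active arc that was waiting for
    # a constituent of this category at position p1.
    constituents.append((cat, p1, p2))
    for arc in list(active_arcs):
        if arc.end == p1 and arc.rhs[arc.dot] == cat:
            add_active_arc(Arc(arc.lhs, arc.rhs, arc.dot + 1, arc.start, p2))

def add_active_arc(arc):
    # A completed arc yields a new constituent on the agenda; otherwise store
    # the arc and try to extend it with constituents already in the chart.
    if arc.dot == len(arc.rhs):
        agenda.append((arc.lhs, arc.start, arc.end))
        return
    active_arcs.append(arc)
    wanted = arc.rhs[arc.dot]
    for (cat, p2, p3) in list(constituents):
        if cat == wanted and p2 == arc.end:
            add_active_arc(Arc(arc.lhs, arc.rhs, arc.dot + 1, arc.start, p3))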


7.7 A Simple Context-Dependent Best-First Parser


A Simple Context-Dependent Best-First Parser 2


Figure 7.23 Some rule estimates based on the first word
Rule            the    house   peaches   flowers
NP → N          0      0       .65       .76
NP → N N        0      .82     0         0
NP → NP PP      .23    .18     .35       .24
NP → ART N      .76    0       0         0

Rule            ate    bloom   like      put
VP → V          .28    .84     0         .03
VP → V NP       .57    .10     .9        .03
VP → V NP PP    .14    .05     .1        .93
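These figures are relative frequencies of the form Prob(rule | first word of the constituent), with the rules for a given left-hand-side category summing to (roughly) 1 for each word. A hypothetical Python sketch of how such a table could be collected from a corpus of parses (the data layout is an assumption made for illustration):

from collections import defaultdict

def rule_given_first_word(observations):
    # observations: a list of (rule, first_word) pairs gathered from parse trees,
    # where a rule is represented as (lhs, rhs), e.g. ("NP", ("ART", "N")).
    # Returns estimates of Prob(rule | lhs, first_word), as in Figure 7.23.
    lhs_word_total = defaultdict(int)
    pair_count = defaultdict(int)
    for rule, word in observations:
        lhs = rule[0]
        lhs_word_total[(lhs, word)] += 1
        pair_count[(rule, word)] += 1
    return {(rule, word): n / lhs_word_total[(rule[0], word)]
            for (rule, word), n in pair_count.items()}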


A Simple Context-Dependent Best-First Parser 3


Summary: Ambiguity Resolution - Statistical Methods
The concepts of independence and conditional probability, together with frequency statistics on parts of speech and grammar rules obtained from an NL corpus, allow us to work out the most likely lexical categories for the words in a sentence, and to order the parsing agenda for best-first parsing. In all the cases we examined, context-dependent methods worked best.


Exercises from Allen (with some minor changes)

Some of these exercises might take quite a while. They are offered in part as challenge material, and should not be seen as compulsory.

  1. Hand-simulate the Viterbi algorithm, using the data and probability estimates in Tables 7.4-7.6, on the sentence Flower flowers like flowers. Draw transition networks as in "Simulating the Viterbi Algorithm 1-6" for this sentence, and work out which part of speech the algorithm identifies for each word.

  2. Using the bigram and lexical-generation probabilities given in this section, calculate the word probabilities using the forward algorithm for the sentence The a flies like flowers (involving a very rare use of the word a as a noun, as in the a flies, the b flies, and so on). Remember to use 0.0001 as a probability for any bigram not in the table. Are the results you get reasonable? If not, what is the problem, and how might it be fixed?

  3. Using the small tagged corpus at http://www.cse.unsw.edu.au/~billw/cs9414/notes/corpora/austen_tagged (about 8000 words), write code to collect bigram and lexical statistics, and implement a part-of-speech tagger using the Viterbi algorithm. Test your tagger on the same corpus. How many sentences are tagged correctly? How many words are tagged correctly? Would you expect to get better accuracy on the sentences that involved an error if you had access to more data? You can investigate this and related issues by:



Copyright (C) Bill Wilson, 2005-2012, except where another source is acknowledged.

Acknowledgement: All numbered sections, tables, and figures are reproduced from James Allen's "Natural Language Processing" (ed. 2, Benjamin Cummings, 1995) which is used as a text book for the course for which these web pages form lecture notes. You are advised to obtain access to a copy of this book (e.g. by buying it or borrowing it from a library) to assist in understanding the material.