| Reference: Chapter 3 of Allen, J., Natural Language Understanding, 2nd edition, Benjamin Cummings, 1995. |
| Aim: |
|
To describe several types of formal grammars for natural language
processing, parse trees, and a number of parsing methods, including
a bottom-up chart parser in some detail. We show the use of Prolog for syntactic parsing of natural language text. Other issues in parsing, including PP attachment, are briefly discussed. |
| Keywords: accepter, active arc, active chart, alphabet (in grammar), ATN, augmented grammar, augmented transition networks, bottom-up parser, CFG, chart, chart parsing, Chomsky hierarchy, constituent, context-free, context-sensitive, CSG, derivation, distinguished non-terminal, generalized phrase structure grammar, generate, GPSG, grammar rule, HDPSG, head-driven phrase structure grammar, language generated by a grammar, left-to-right parsing, lexical category, lexical functional grammar, lexical insertion rule, lexical symbol, lexicon, LFG, Marcus parsing, non-terminal, parse tree, parser, phrasal category, pre-terminal, predictive parser, production, regular, rewriting process, right-linear grammar, right-to-left parsing, robustness, semantic grammar, sentential form, shift-reduce parser, start symbol, systemic grammar, terminal, top-down parser, unrestricted grammar |
| Plan: |
|
Rules for describing a language.
We are interested in syntactic grammar - here the rules describe the syntax of the grammar. There are also semantic grammars, text grammars, etc.
An algorithm for analysing sentences given a grammar:
Here is the same tree in list notation ...
(S (NP (NAME John))
(VP (V fed)
(NP (DET the))
(N numbat))))
and in a Prolog notation ...
s(np(name(john)),
vp(v(fed),
np(det(the),
n(numbat))))
NB: such a tree will usually look like this:s(np(name(john)), vp(v(fed), np(art(the), n(numbat))))
| * Numbats are ant-eating marsupials now confined to the south-west corner of Western Australia, though previously they were more widespread. |
www.wilderness.org.au
|
| 1. S → NP VP
2. VP→ V NP 3. NP→ NAME 4. NP→ DET N | grammar rules |
| 5. NAME → John
6. V → fed 7. DET → the 8. N → numbat | lexical entries |
T = {fed, John, numbat, the}
N = {S, NP, VP, N, V, NAME, DET}
P = {S → NP VP,
VP→ V NP, NP→ NAME,
NP→ DET N,
NAME→ John V→ fed,
DET→ the, N→ numbat} .
Notice how the productions were split into grammar rules and lexicon above. N, V, NAME and DET are called pre-terminal or lexical symbols.
|
In programming language grammars, there would be context-free
rules like:
WhileStatement → while Condition do StatementList
end Optional Homework for Digression: Write grammar rules for the Prolog programming language. You will need grammar rules to describe:
|
| Examples of Strings | Length |
|---|---|
| numbat S the | 3 |
| S | 1 |
| ε | 0 |
| NP VP | 2 |
| John fed the numbat | 4 |
| numbat the fed John | 4 |
| numbat the John S S John NP DET DET S S John fed | 13 |
| unrestricted grammars | context-sensitive grammars | |
| context-free grammars | regular grammars |
With context-free grammars, these form the Chomsky hierarchy of grammars. The four types of grammar differ in the type of rewriting rule α → β that is allowed.
Since the restrictions which define the grammar types apply to the rules, it makes sense to talk of unrestricted, context-sensitive, context-free, and regular rules.
No restrictions on the form that the Rules α → β can take.
Unrestricted grammars are not widely used: their extreme power makes them difficult to parse with.
All rules must be of the form N → α, where N is a nonterminal symbol and α is any string.
Mostly we will be considering context-free rules in this course.
All rules take one of two forms:
Regular grammar rules are not powerful enough to conveniently describe natural languages (or even programming languages).
They can sometimes be used to describe portions of languages, and have the advantage that they lead to fast parsing.
To derive a sentence from a grammar, start with the start symbol S, and refer to it as the current string. Repeatedly perform the following rewriting process:
Repeat until there are no non-terminals remaining in the current string. The current string is then a sentence in the language generated by the grammar. (Before this, it is termed a sentential form.)
| Current string | Rewriting | |
|---|---|---|
| S | ⇒ NP VP | |
| ⇒ NAME VP | ||
| ⇒ John VP | ||
| ⇒ John V NP | ||
| ⇒ John fed NP | ||
| ⇒ John fed DET N | ||
| ⇒ John fed the N | ||
| ⇒ John fed the numbat |
You are now in a position to do the exercises on grammars at http://www.cse.unsw.edu.au/~cs9414/Exercises/grammars.html
Parsing might be the reverse of this process (doing the steps shown above in reverse would constitute a bottom-up right-to-left parse of John fed the numbat.)
Example: Using this grammar:
S → NP VP
NP → DET N | NAME
PP → PREP NP
VP → V | V NP | V NP PP | V PP
we parse this sentence: 1 The 2 dogs 3 barked. 4
| Backup states | Position | ||
|---|---|---|---|
| S | → NP VP | 1 | |
| → DET N VP | 1 | ||
| NAME VP | 1 | ||
| → (The) N VP | 2 | ||
| NAME VP | 1 | → (dogs) VP | 3 |
| NAME VP | 1 | → V | 3 |
| V NP | 3 | ||
| V NP PP | 3 | ||
| V PP | 3 | ||
| NAME VP | 1 | → (barked.) | 4 |
| The dogs barked | → DET N V |
| → NP V | |
| → NP VP | |
| → S |
Using bottom-up parsing methods, all CFGs can be parsed in n3 steps, where n is the length of the sentence, whereas predictive parsing can take exponential time (i.e. much slower). The reason predictive parsing may take exponential time is that it may re-parse pieces of the sentence, particularly confusing sentences (like the horse [that was] raced past the barn fell ):
| Indication of parsing state | Comment |
|---|---|
| The horse raced past the barn ... | "past the barn" parsed as PP, "raced" parsed as [main] verb |
| The horse raced past the barn fell | oops! |
| The horse ... | backtrack to this point and start over |
| The horse raced past the barn fell | "past the barn" parsed as PP again: "raced" parsed as ADJ [from past participle] starting the ADJP ["raced past the barn"] |
1: Q. How much are apples?
2: A. Thirty cents each.
3: Q. Plums?
Parser inputs: sentence, lexicon, grammar.
Parser operations:
N1: N → fly FROM 2 TO 3, and
V1: V → fly FROM 2 TO 3
ARC1: NP → DET1 • ADJ N FROM m TO n
is added to the active chart. (In our example sentence, m would be 0 and n would be 1.) The "•" in an active arc marks the boundary between found constituents and constituents not (yet) found.

ARC1: NP → DET1 • ADJ N FROM m TO n
and there is a constituent in the chart of type ADJ (i.e. the first item after the •), say
ADJ1: ADJ → green FROM n TO p
such that the FROM position in the constituent matches the TO position in the active arc, then the "•" can be advanced, creating a new active arc:
ARC2: NP → DET1 ADJ1 • N FROM m TO p.
DET ADJ N" >
ARC3: NP → DET1 ADJ1 N1• FROM 0 TO 3
then this arc is converted to a constituent.
NP1: NP → DET1 ADJ1 N1 FROM 0 TO 3.

Not all active arcs are ever completed in this sense.
ARC4: S → NP1 • VP FROM 0 TO 3

Grammar:
1. S → NP VP |
Lexicon:
the... DET |
Sentence: 0 The 1 large 2 can 3 can 4 hold 5 the 6 water 7
Steps in Parsing:
*sentence*: the *position*: 1 *constituents*: DET1: DET → "the" from 0 to 1 *active-arcs*: ARC1: NP → DET1 • ADJ N from 0 to 1 [rule 2] ARC2: NP → DET1 • N from 0 to 1 [rule 3] *sentence*: the large *position*: 2 *constituents*: add ADJ1: ADJ → "large" from 1 to 2 *active-arcs*: add ARC3: NP → DET1 ADJ1 • N from 0 to 2 [arc1*→] ARC4: NP → ADJ1 • N from 1 to 2 [rule 4]
*sentence*: the large can *position*: 3 *constituents*: add NP2: NP → DET1 ADJ1 N1 from 0 to 3 [arc3*→] NP1: NP → ADJ1 N1 from 1 to 3 [arc4*→] N1: N → "can" from 2 to 3 AUX1: AUX → "can" from 2 to 3 V1: V → "can" from 2 to 3 *active-arcs*: add ARC5: VP → V1 • NP from 2 to 3 [rule 6] ARC6: VP → AUX1 • V NP from 2 to 3 [rule 5] ARC7: S → NP1 • VP from 1 to 3 [rule 1] ARC8: S → NP2 • VP from 0 to 3 [rule 1]
*sentence*: the large can can *position*: 4 *constituents*: add N2: N → "can" from 3 to 4 AUX2: AUX → "can" from 3 to 4 V2: V → "can" from 3 to 4 *active-arcs*: add ARC9: VP → AUX1 V2 • NP from 2 to 4 [arc6•→] ARC10: VP → V2 • NP from 3 to 4 [rule 6] ARC11: VP → AUX2 • V NP from 3 to 4 [rule 5]
*sentence*: the large can can hold *position*: 5 *constituents*: add N3: N → "hold" from 4 to 5 V3: V → "hold" from 4 to 5 *active-arcs*: add ARC12: VP → AUX2 V3 • NP from 3 to 5 [arc11•→] ARC13: VP → V3 • NP from 4 to 5 [rule 6]
*sentence*: the large can can hold the *position*: 6 *constituents*: add DET2: DET → "the" from 5 to 6 *active-arcs*: add ARC14: NP → DET2 • ADJ N from 5 to 6 [rule 2] ARC15: NP → DET2 • N from 5 to 6 [rule 3]
*sentence*: the large can can hold the water *position*: 7 *constituents*: add S2: S → NP1 VP2 from 1 to 7 [arc7*→] S1: S → NP2 VP2 from 0 to 7 [arc8*→] VP2: VP → AUX2 V3 NP3 from 3 to 7 [arc12*→] VP1: VP → V3 NP3 from 4 to 7 [arc13*→] NP3: NP → DET2 N4 from 5 to 7 [arc15*→] N4: N → "water" from 6 to 7 V4: V → "water" from 6 to 7 *active-arcs*: add ARC16: VP → V4 • NP from 6 to 7 [rule 6] ARC17: S → NP3 • VP from 5 to 7 [rule 1]
Doing the same steps diagrammatically:

the large can

the large can can

the large can can hold

the large can can hold the

the large can can hold the water

You are now in a position to do the exercises on grammars at http://www.cse.unsw.edu.au/~cs9414/Exercises/parsing.html
A frame-like, or slot/filler representation works: John fed the numbat
(S SUBJ (NP NAME "John")
MAIN-V feed
TENSE past
OBJ (NP DET the HEAD numbat))
s(subj(np(name('John'))),
mainv(feed),
tense(past),
obj(np(det(the), head(numbat))))
| S → NP VP | ..... | s(P1,P3) :- np(P1,P2), vp(P2,P3). |
| VP→ V NP | ..... | vp(P1,P3) :- v(P1,P2), np(P2,P3). |
| NP→ NAME | ..... | np(P1,P2) :- propernoun(P1,P2). |
| NP→ DET N | ..... | np(P1,P3) :- det(P1,P2), noun(P2,P3). |
| NAME → John | ..... | isname(john). |
| V → fed | ..... | isverb(fed). |
| DET → the | ..... | isdet(the). |
| N → numbat | ..... | isnoun(numbat). |
det(From, To) :-
word(Word, From, To),
isdet(Word).
This predicate is true only if the word between
the specified position is of that category.
word(Word, From, To)
signifies that Word is there in the input sentence between positions
From and To.
noun(From, To) :-
word(Word, From, To), isnoun(Word).
v(From, To) :-
word(Word, From, To), isverb(Word).
propernoun(From, To) :-
word(Word, From, To), isname(Word).
word(john, 1, 2).
word(fed, 2, 3).
word(the, 3, 4).
word(numbat, 4, 5).
?- s(1,5).
This will result in Yes.
word facts like those above.
To get at the structure of the sentence, add extra arguments to pass the structure back across necks as it is built:
s(P1,P3,s(NP,VP)) :-
np(P1,P2,NP), vp(P2,P3,VP).
vp(P1,P3,vp(v(Verb),NP)) :-
v(P1,P2,Verb), np(P2,P3,NP).
np(P1,P2,np(name(Name))) :-
proper(P1,P2,Name).
np(P1,P3,np(det(Det),noun(Noun))) :-
det(P1,P2,Det), noun(P2,P3,Noun).
det(From, To, Word) :-
word(Word, From, To), isdet(Word).
noun(From, To, Word) :-
word(Word, From, To), isnoun(Word).
v(From, To, Word) :-
word(Word, From, To), isverb(Word).
proper(From, To, Word) :-
word(Word, From, To), isname(Word).
?- s(1, 5, Parse).
Parse=
s(np(name(john)), vp(v(fed), np(det(the), noun(numbat))))
http://www.cse.unsw.edu.au/~billw/cs9414/notes/nlp/dcg2.pl
The boy saw the man on the hill with the telescope

The two visual interpretations correspond to two different parses (coming from different grammar rules (VP → V NP PP and NP → NP PP):
| or | ![]() |
| Summary: Grammars and Parsing |
| There are many approaches to parsing and many grammatical formalisms. Some problems in deciding the structure of a sentence turn out to be undecidable at the syntactic level. We have concentrated on a bottom-up chart parser based on a context-free grammar. We will subsequently extend this parser to augmented grammars. |
CRICOS Provider Code No. 00098G
Last updated:
Copyright © Bill Wilson, 2007, except where another source is acknowledged.