Grammars and Parsing

Reference: Chapter 3 of Allen, J., Natural Language Understanding, 2nd edition, Benjamin Cummings, 1995.
Aim:
To describe several types of formal grammars for natural language processing, parse trees, and a number of parsing methods, including a bottom-up chart parser in some detail.
We show the use of Prolog for syntactic parsing of natural language text. Other issues in parsing, including PP attachment, are briefly discussed.
Keywords: accepter, active arc, active chart, alphabet (in grammar), ATN, augmented grammar, augmented transition networks, bottom-up parser, CFG, chart, chart parsing, Chomsky hierarchy, constituent, context-free, context-sensitive, CSG, derivation, distinguished non-terminal, generalized phrase structure grammar, generate, GPSG, grammar rule, HDPSG, head-driven phrase structure grammar, language generated by a grammar, left-to-right parsing, lexical category, lexical functional grammar, lexical insertion rule, lexical symbol, lexicon, LFG, Marcus parsing, non-terminal, parse tree, parser, phrasal category, pre-terminal, predictive parser, production, regular, rewriting process, right-linear grammar, right-to-left parsing, robustness, semantic grammar, sentential form, shift-reduce parser, start symbol, systemic grammar, terminal, top-down parser, unrestricted grammar
Plan:
  • parsers, parse trees
  • context-free grammars, context-free rules
  • lexical and phrasal categories
  • Chomsky hierarchy: unrestricted > context-sensitive > context-free > regular
  • derivation, sentential forms
  • top-down parsing: predictive parser
  • bottom-up parsing
  • garden-path sentences, well-formed substring table
  • chart parsing, bottom-up chart parser
  • example of chart parser use
  • parsing in Prolog
  • problems and limitations in syntactic parsing

Grammar

Rules for describing a language.

We are interested in syntactic grammar - here the rules describe the syntax of the language. There are also semantic grammars, text grammars, etc.


Parser

pictorial view of tree shown below in text form

Different Tree Notations

Here is the same tree in list notation ...

(S (NP (NAME John))
   (VP (V fed)
       (NP (DET the)
           (N numbat))))
and in a Prolog notation ...
s(np(name(john)),
  vp(v(fed),
     np(det(the),
        n(numbat))))
NB: such a tree will usually look like this:
s(np(name(john)), vp(v(fed), np(det(the), n(numbat))))
in Prolog output, unless special steps are taken to "pretty-print" it to display the structure.
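Such a pretty-printer is easy to sketch. Here is one possibility in Python (Python rather than Prolog, and the helper name `pretty`, are assumptions of this sketch, not from the notes), with the tree encoded as nested tuples:

```python
def pretty(tree, indent=0):
    """Return an indented, one-node-per-line rendering of a parse tree.

    A tree is either a leaf (a plain string) or a tuple
    (label, child1, child2, ...)."""
    label, *children = tree if isinstance(tree, tuple) else (tree,)
    lines = [" " * indent + str(label)]
    for child in children:
        lines.extend(pretty(child, indent + 2))
    return lines

tree = ("s", ("np", ("name", "john")),
             ("vp", ("v", "fed"),
                    ("np", ("det", "the"), ("n", "numbat"))))

print("\n".join(pretty(tree)))
```

Each node appears on its own line, indented two spaces more than its parent, making the constituent structure visible.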

* Numbats are ant-eating marsupials now confined to the south-west corner of Western Australia, though previously they were more widespread. Numbat picture from www.wilderness.org.au.


Describing Syntax: Context Free Grammars (CFGs)

1. S → NP VP
2. VP → V NP
3. NP → NAME
4. NP → DET N
(grammar rules)
5. NAME → John
6. V → fed
7. DET → the
8. N → numbat
(lexical entries)
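For illustration, the grammar rules and lexical entries above could be encoded as data, for example in Python (a sketch; the names GRAMMAR, LEXICON and categories are ours, not from the notes):

```python
# Phrasal rules: each non-terminal maps to a list of right-hand sides.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "VP": [["V", "NP"]],
    "NP": [["NAME"], ["DET", "N"]],
}

# Lexicon: each word maps to its lexical categories.
LEXICON = {
    "John": ["NAME"],
    "fed": ["V"],
    "the": ["DET"],
    "numbat": ["N"],
}

def categories(word):
    """Lexical categories of a word (the lexical insertion rules)."""
    return LEXICON.get(word, [])

print(categories("fed"))   # → ['V']
```

Splitting the productions into GRAMMAR and LEXICON mirrors the split between grammar rules and lexical entries in the slide above.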


Context Free Grammars (CFGs)

Definition: A CFG is a 5-tuple (P, A, N, T, S) where:

  • A is the alphabet - a finite set of symbols;
  • N ⊆ A is the set of non-terminal symbols;
  • T = A − N is the set of terminal symbols;
  • P is a finite set of productions of the form X → α, where X is a non-terminal and α is a string of symbols from A;
  • S ∈ N is the start symbol, or distinguished non-terminal.


CFG Example

E.g.

T = {fed, John, numbat, the}

N = {S, NP, VP, N, V, NAME, DET}

P = {S → NP VP, VP → V NP, NP → NAME, NP → DET N,
       NAME → John, V → fed, DET → the, N → numbat}.

Notice how the productions were split into grammar rules and lexicon above. N, V, NAME and DET are called pre-terminal or lexical symbols.


Digression

In programming language grammars, there would be context-free rules like:

WhileStatement → while Condition do StatementList end
StatementList → Statement
StatementList → Statement ; StatementList

Optional Homework for Digression: Write grammar rules for the Prolog programming language. You will need grammar rules to describe:

  • a Prolog rule
  • the goal list in the body of the Prolog rule
  • Prolog terms
  • arithmetic expressions like N - 1 (several rules here)
  • relational expressions like N = M (several rules here)
  • ...
What would the lexicon be like?

Definition of String

A string is a finite sequence of symbols, each drawn from the alphabet N ∪ T.

Examples of Strings                                Length
numbat S the                                       3
S                                                  1
ε (the empty string)                               0
NP VP                                              2
John fed the numbat                                4
numbat the fed John                                4
numbat the John S S John NP DET DET S S John fed   13


Strings 2


Types of grammars: the Chomsky Hierarchy

  • unrestricted grammars
  • context-sensitive grammars
  • context-free grammars
  • regular grammars

These four types of grammar form the Chomsky hierarchy. They differ in the type of rewriting rule α → β that is allowed.

Since the restrictions which define the grammar types apply to the rules, it makes sense to talk of unrestricted, context-sensitive, context-free, and regular rules.

Unrestricted Grammars

No restrictions on the form that the rules α → β can take.

Unrestricted grammars are not widely used: their extreme power makes them difficult to parse with.


Context Sensitive Grammars


Context-Free Grammar or Phrase Structure Grammar

All rules must be of the form N → α, where N is a non-terminal symbol and α is any string of terminal and/or non-terminal symbols.

Mostly we will be considering context-free rules in this course.


Regular Grammar or Right Linear Grammar

All rules take one of two forms:

  1. N → t,
  2. N → t M, where N and M are non-terminals and t is a terminal.

Regular grammar rules are not powerful enough to conveniently describe natural languages (or even programming languages).

They can sometimes be used to describe portions of languages, and have the advantage that they lead to fast parsing.
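For example, recognition with a right-linear grammar amounts to running a finite-state machine over the input, one transition per word. Here is a Python sketch (the mini-grammar S → the X, X → numbat Y, Y → slept is invented for illustration, not from the notes):

```python
# Each rule N -> t M becomes a transition: in state N, reading word t
# moves to state M. Each rule N -> t accepts if t is the last word.
RULES = {("S", "the"): "X", ("X", "numbat"): "Y"}   # S -> the X, X -> numbat Y
FINAL_RULES = {("Y", "slept")}                      # Y -> slept

def accepts(words):
    """Single left-to-right scan: O(n) time, no backtracking needed."""
    state = "S"
    for i, w in enumerate(words):
        if i == len(words) - 1:
            return (state, w) in FINAL_RULES
        if (state, w) not in RULES:
            return False
        state = RULES[(state, w)]
    return False

print(accepts(["the", "numbat", "slept"]))   # → True
```

The single scan with a single current state is what makes regular-grammar parsing fast.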


Deriving sentences from a grammar

To derive a sentence from a grammar, start with the start symbol S, and refer to it as the current string. Repeatedly perform the following rewriting process:

  • choose an occurrence of a non-terminal X in the current string, and a production X → β in the grammar;
  • replace that occurrence of X with β.

Repeat until there are no non-terminals remaining in the current string. The current string is then a sentence in the language generated by the grammar. (Before this, it is termed a sentential form.)
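The rewriting process can be sketched in Python (an illustration; the helper `derive` is ours, and the rule choices below reproduce the derivation of "John fed the numbat"):

```python
def derive(start, steps):
    """Apply (lhs, rhs) rewriting steps to the leftmost occurrence of
    each lhs, returning every sentential form along the way."""
    current = [start]
    forms = [current[:]]
    for lhs, rhs in steps:
        i = current.index(lhs)      # leftmost occurrence of lhs
        current[i:i + 1] = rhs      # replace it by the rule's RHS
        forms.append(current[:])
    return forms

steps = [("S", ["NP", "VP"]), ("NP", ["NAME"]), ("NAME", ["John"]),
         ("VP", ["V", "NP"]), ("V", ["fed"]), ("NP", ["DET", "N"]),
         ("DET", ["the"]), ("N", ["numbat"])]

for form in derive("S", steps):
    print(" ".join(form))
```

The printed lines are exactly the sentential forms of the derivation example that follows, ending in the sentence "John fed the numbat".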


Derivation Example

Current string           Symbol rewritten
S ⇒ NP VP                S
  ⇒ NAME VP              NP
  ⇒ John VP              NAME
  ⇒ John V NP            VP
  ⇒ John fed NP          V
  ⇒ John fed DET N       NP
  ⇒ John fed the N       DET
  ⇒ John fed the numbat  N

You are now in a position to do the exercises on grammars at http://www.cse.unsw.edu.au/~cs9414/Exercises/grammars.html


Parsing as Reverse of Derivation

Parsing might be the reverse of this process (doing the steps shown above in reverse would constitute a bottom-up right-to-left parse of John fed the numbat.)


Top-Down Parsing with CFGs

Basic Top-Down "predictive" parser


Predictive Parsing Example

Example: Using this grammar:

S → NP VP
NP → DET N | NAME
PP → PREP NP
VP → V | V NP | V NP PP | V PP

we parse this sentence: 1 The 2 dogs 3 barked. 4


Predictive Parsing Example 2

Current state            Backup states    Position
S → NP VP                                 1
  → DET N VP             NAME VP 1        1
  → (The) N VP           NAME VP 1        2
  → (dogs) VP            NAME VP 1        3
  → V                    V NP 3           3
                         V NP PP 3
                         V PP 3
                         NAME VP 1
  → (barked.)                             4

Bottom-up parsing

The dogs barked → DET N V
                → NP V
                → NP VP
                → S
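These reductions can be mimicked by a naive Python sketch (the helper `reduce_once` is ours; note that greedily applying the first matching rule happens to work on this sentence, but a real bottom-up parser needs search or a chart):

```python
# Rules written as (lhs, rhs): a substring matching rhs reduces to lhs.
RULES = [("DET", ["the"]), ("N", ["dogs"]), ("V", ["barked"]),
         ("NP", ["DET", "N"]), ("VP", ["V"]), ("S", ["NP", "VP"])]

def reduce_once(symbols):
    """Apply the first rule whose right-hand side occurs in symbols."""
    for lhs, rhs in RULES:
        for i in range(len(symbols) - len(rhs) + 1):
            if symbols[i:i + len(rhs)] == rhs:
                return symbols[:i] + [lhs] + symbols[i + len(rhs):]
    return symbols  # no reduction applies

symbols = ["the", "dogs", "barked"]
while True:
    reduced = reduce_once(symbols)
    if reduced == symbols:
        break
    symbols = reduced
print(symbols)   # → ['S']
```

Reaching the single symbol S means the word string has been accepted as a sentence.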


Parsing Efficiency

Using bottom-up chart parsing methods, any CFG can be parsed in O(n³) steps, where n is the length of the sentence, whereas predictive parsing can take exponential time (i.e. much slower). The reason predictive parsing may take exponential time is that it may re-parse pieces of the sentence, particularly in garden-path sentences (like the horse [that was] raced past the barn fell):

Indication of parsing state          Comment
The horse raced past the barn ...    "past the barn" parsed as PP, "raced" parsed as [main] verb
The horse raced past the barn fell   oops!
The horse ...                        backtrack to this point and start over
The horse raced past the barn fell   "past the barn" parsed as PP again; "raced" parsed as ADJ [from past participle] starting the ADJP ["raced past the barn"]
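The O(n³) bound is easiest to see in the CKY algorithm, a bottom-up tabular method closely related to chart parsing (CKY is our illustration here, not something these notes cover in detail; it requires the grammar in Chomsky normal form, i.e. rules A → B C or A → word):

```python
# A tiny CNF grammar for "the dogs barked" (invented for illustration).
BINARY = [("S", "NP", "VP"), ("NP", "DET", "N")]
LEXICAL = [("DET", "the"), ("N", "dogs"), ("VP", "barked")]

def cky(words):
    n = len(words)
    # table[i][j] = set of categories that can span words[i:j]
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        table[i][i + 1] = {a for a, word in LEXICAL if word == w}
    for width in range(2, n + 1):          # span length
        for i in range(n - width + 1):     # span start
            j = i + width
            for k in range(i + 1, j):      # split point: O(n^3) overall
                for a, b, c in BINARY:
                    if b in table[i][k] and c in table[k][j]:
                        table[i][j].add(a)
    return table

table = cky(["the", "dogs", "barked"])
print("S" in table[0][3])   # → True
```

The three nested loops over span length, span start, and split point are where the n³ comes from; each span is analysed once, never re-parsed.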


Chart Parsing

(Section 3.4 of Allen)


Chart Parsing Advantages


A Bottom-Up Chart-Based Parsing Algorithm


Chart Parser 2


Chart Parser 3

Parser inputs: sentence, lexicon, grammar.

Parser operations:

  1. The algorithm operates on two data structures: the active chart - a collection of active arcs (see (4) below) and the constituents (see (3) and (6)). Both are initially empty.

  2. The grammar is considered to include lexical insertion rules: for example, if fly is a word in the lexicon/vocabulary being used, and if its lexical entry includes the fact that fly may be a N or a V, then rules of the form N  →   fly and V  →   fly are considered to be part of the grammar.


Chart Parser 4

  3. As a word (like fly) is scanned, constituents corresponding to its lexical categories are created:

    N1: N  →   fly FROM 2 TO 3, and

    V1: V  →   fly FROM 2 TO 3

  4. If the grammar contains a rule like NP → DET ADJ N, and a constituent like DET1: DET → the FROM m TO n has been found, then an active arc

    ARC1: NP → DET1 •  ADJ  N FROM m TO n

    is added to the active chart. (In our example sentence, m would be 0 and n would be 1.) The "•" in an active arc marks the boundary between found constituents and constituents not (yet) found.

    picture of active arc for the


Chart Parser 5

  5. Advancing the "•": If the active chart has an active arc like:

    ARC1: NP → DET1 • ADJ N FROM m TO n

    and there is a constituent in the chart of type ADJ (i.e. the first item after the •), say

    ADJ1:  ADJ  →   green FROM n TO p

    such that the FROM position in the constituent matches the TO position in the active arc, then the "•" can be advanced, creating a new active arc:

    ARC2:  NP → DET1  ADJ1  •  N FROM m TO p.

    picture of active arc for 'the green'


Chart Parser 6

  6. If the process of advancing the "•" creates an active arc whose "•" is at the far right hand side of the rule: e.g.

    ARC3: NP → DET1 ADJ1 N1•  FROM 0 TO 3

    then this arc is converted to a constituent.

    NP1: NP → DET1 ADJ1  N1 FROM 0 TO 3.

    picture of constituent for the green fly

    Not all active arcs are ever completed in this sense.


Chart Parser 7

  7. Both lexical and phrasal constituents can be used in steps 4 and 5: e.g. if the grammar contains a rule S → NP VP, then as soon as the constituent NP1 discussed in step 6 is created, it will be possible to make a new active arc

    ARC4: S →   NP1 • VP FROM 0 TO 3

    active arc for np


Chart Parser 8

  8. When subsequent constituents are created, they would have names like NP2, NP3, ..., ADJ2, ADJ3, ... and so on.

  9. The aim of parsing is to get phrasal constituents (normally of type S) whose FROM is 0 and whose TO is the length of the sentence. There may be several such constituents.
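Steps 1-9 can be sketched as a small Python program (an illustration under stated assumptions, not Allen's code; the data layout and names are ours). It uses the grammar and lexicon of the worked example that follows:

```python
# Constituents are (category, from, to) triples; active arcs are
# (lhs, remaining_rhs, from, to) tuples, where remaining_rhs is
# everything to the right of the "•".
GRAMMAR = [("S", ("NP", "VP")), ("NP", ("DET", "ADJ", "N")),
           ("NP", ("DET", "N")), ("NP", ("ADJ", "N")),
           ("VP", ("AUX", "V", "NP")), ("VP", ("V", "NP"))]

LEXICON = {"the": ["DET"], "large": ["ADJ"], "can": ["AUX", "N", "V"],
           "hold": ["N", "V"], "water": ["N", "V"]}

def parse(words):
    constituents = set()   # completed constituents
    arcs = set()           # active arcs ("•" not yet at the far right)
    # step 2: lexical insertion rules seed the agenda of constituents
    agenda = [(cat, i, i + 1) for i, w in enumerate(words)
              for cat in LEXICON[w]]
    while agenda:
        cons = agenda.pop()
        if cons in constituents:
            continue
        constituents.add(cons)
        cat, frm, to = cons
        # step 4: start an arc for each rule whose RHS begins with cat
        pending = [(lhs, rhs[1:], frm, to)
                   for lhs, rhs in GRAMMAR if rhs[0] == cat]
        # step 5: advance the "•" of arcs ending where cons starts
        pending += [(lhs, rest[1:], afrm, to)
                    for lhs, rest, afrm, ato in arcs
                    if rest[0] == cat and ato == frm]
        while pending:
            lhs, rest, afrm, ato = pending.pop()
            if not rest:
                # step 6: "•" at far right - the arc becomes a constituent
                agenda.append((lhs, afrm, ato))
            elif (lhs, rest, afrm, ato) not in arcs:
                arcs.add((lhs, rest, afrm, ato))
                # a constituent already found may extend the new arc
                pending += [(lhs, rest[1:], afrm, cto)
                            for ccat, cfrm, cto in constituents
                            if ccat == rest[0] and cfrm == ato]
    return constituents

chart = parse("the large can can hold the water".split())
print(("S", 0, 7) in chart)   # → True: an S spanning the whole sentence
```

Because constituents check existing arcs and new arcs check existing constituents, the result does not depend on processing order, and each constituent and arc is built only once.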


Final State of Chart for "The green fly flies"

Constituents
DET1: DET → "the" FROM 0 TO 1
ADJ1: ADJ → "green" FROM 1 TO 2
N1: N → "fly" FROM 2 TO 3
V1: V → "flies" FROM 3 TO 4
NP1: NP → DET1 ADJ1 N1 FROM 0 TO 3
VP1: VP → V1 FROM 3 TO 4
S1: S → NP1 VP1 FROM 0 TO 4
 
Active Arcs
ARC1: NP → DET1 • ADJ N FROM 0 TO 1
ARC2: NP → DET1 ADJ1 • N FROM 0 TO 2
ARC4: S → NP1 • VP FROM 0 TO 3
 
This assumes a minimal grammar and lexicon - if "green" could be a N (noun), "fly" a V (verb), and/or "flies" a N (noun), then there would be more lexical constituents, for example. The actual grammar rules used above were NP → DET ADJ N; VP → V; and S → NP VP.

Chart Parsing Example

Grammar:

1. S → NP VP
2. NP → DET ADJ N
3. NP → DET N
4. NP → ADJ N
5. VP → AUX V NP
6. VP → V NP

 

Lexicon:

the... DET
large... ADJ
can... AUX, N, V
hold... N, V
water... N, V

Sentence: 0 The 1 large 2 can 3 can 4 hold 5 the 6 water 7


Chart Parsing Example 2

Steps in Parsing:

*sentence*: the   *position*: 1
*constituents*:
DET1: DET → "the" from 0 to 1
*active-arcs*:
ARC1: NP → DET1 • ADJ N from 0 to 1 [rule 2]
ARC2: NP → DET1 • N from 0 to 1     [rule 3]

*sentence*: the large   *position*: 2
*constituents*: add
ADJ1: ADJ → "large" from 1 to 2
*active-arcs*: add
ARC3: NP → DET1 ADJ1 • N from 0 to 2 [arc1•→]
ARC4: NP → ADJ1 • N from 1 to 2      [rule 4]

Chart Parsing Example 3

*sentence*: the large can  *position*: 3
*constituents*: add
NP2: NP → DET1 ADJ1 N1 from 0 to 3  [arc3•→]
NP1: NP → ADJ1 N1 from 1 to 3       [arc4•→]
N1: N → "can" from 2 to 3
AUX1: AUX → "can" from 2 to 3
V1: V → "can" from 2 to 3
*active-arcs*: add
ARC5: VP → V1 • NP from 2 to 3     [rule 6]
ARC6: VP → AUX1 • V NP from 2 to 3 [rule 5]
ARC7: S → NP1 • VP from 1 to 3     [rule 1]
ARC8: S → NP2 • VP from 0 to 3     [rule 1]

Chart Parsing Example 4

*sentence*: the large can can *position*: 4
*constituents*: add
N2: N → "can" from 3 to 4
AUX2: AUX → "can" from 3 to 4
V2: V → "can" from 3 to 4
*active-arcs*: add
ARC9: VP → AUX1 V2 • NP from 2 to 4 [arc6•→]
ARC10: VP → V2 • NP from 3 to 4     [rule 6]
ARC11: VP → AUX2 • V NP from 3 to 4 [rule 5]

Chart Parsing Example 5

*sentence*: the large can can hold
*position*: 5
*constituents*: add
N3: N → "hold" from 4 to 5
V3: V → "hold" from 4 to 5
*active-arcs*: add
ARC12: VP → AUX2 V3 • NP from 3 to 5 [arc11•→]
ARC13: VP → V3 • NP from 4 to 5      [rule 6]

Chart Parsing Example 6

*sentence*: the large can can hold the
*position*: 6
*constituents*: add
DET2: DET → "the" from 5 to 6
*active-arcs*: add
ARC14: NP → DET2 • ADJ N from 5 to 6 [rule 2]
ARC15: NP → DET2 • N from 5 to 6     [rule 3]

Chart Parsing Example 7

*sentence*: the large can can hold the water
*position*: 7
*constituents*: add
S2: S → NP1 VP2 from 1 to 7         [arc7•→]
S1: S → NP2 VP2 from 0 to 7         [arc8•→]
VP2: VP → AUX2 V3 NP3 from 3 to 7   [arc12•→]
VP1: VP → V3 NP3 from 4 to 7        [arc13•→]
NP3: NP → DET2 N4 from 5 to 7       [arc15•→]
N4: N → "water" from 6 to 7
V4: V → "water" from 6 to 7
*active-arcs*: add
ARC16: VP → V4 • NP from 6 to 7 [rule 6]
ARC17: S → NP3 • VP from 5 to 7 [rule 1]

Chart Parsing Example 8

Doing the same steps diagrammatically:

[chart diagrams omitted: a pair of chart pictures for each successive prefix of the sentence, from "the large" through "the large can", "the large can can", "the large can can hold", "the large can can hold the", up to "the large can can hold the water"]
You are now in a position to do the exercises on grammars at http://www.cse.unsw.edu.au/~cs9414/Exercises/parsing.html


Notes on Chart Parsing


Recording Sentence Structure

A frame-like, or slot/filler representation works: John fed the numbat

(S SUBJ (NP NAME "John")
   MAIN-V feed
   TENSE past
   OBJ (NP DET the HEAD numbat))

s(subj(np(name('John'))),
  mainv(feed),
  tense(past),
  obj(np(det(the), head(numbat))))


Grammars and Logic Programming

(Section 3.8 in Allen)


Prolog Lexicon


Rules to Build Lexical Constituents


Rules to Build Lexical Constituents 2


Words, Parsing


Parsing

To get at the structure of the sentence, add extra arguments that build the structure and pass it back as the parse proceeds:

   s(P1,P3,s(NP,VP)) :-
       np(P1,P2,NP), vp(P2,P3,VP).

   vp(P1,P3,vp(v(Verb),NP)) :-
       v(P1,P2,Verb), np(P2,P3,NP).
   np(P1,P2,np(name(Name))) :-
       proper(P1,P2,Name).
   np(P1,P3,np(det(Det),noun(Noun))) :-
       det(P1,P2,Det), noun(P2,P3,Noun).

Structure Passing with Lexical Constituents

   det(From, To, Word) :-
       word(Word, From, To), isdet(Word).

   noun(From, To, Word) :-
       word(Word, From, To), isnoun(Word).
   v(From, To, Word) :-
       word(Word, From, To), isverb(Word).
   proper(From, To, Word) :-
       word(Word, From, To), isname(Word).

Query to Start the Parser


PP attachment

The boy saw the man on the hill with the telescope

two pictures: one has man looking through telescope at boy on hill; other has man looking at hill that has boy and telescope on it


PP Attachment Parse Trees

The two visual interpretations correspond to two different parses, arising from different grammar rules (VP → V NP PP and NP → NP PP):

parse tree for sentence with telescope PP attached to VP or parse tree for sentence with telescope PP attached to NP 'the hill'

Other Syntax Representation and Parsing Methods


Limitations of Syntax in NLP


Summary: Grammars and Parsing
There are many approaches to parsing and many grammatical formalisms. Some problems in deciding the structure of a sentence turn out to be undecidable at the syntactic level. We have concentrated on a bottom-up chart parser based on a context-free grammar. We will subsequently extend this parser to augmented grammars.


CRICOS Provider Code No. 00098G
Last updated:
Copyright © Bill Wilson, 2007, except where another source is acknowledged.