Parsing
See:
• R Garside, G Leech & A McEnery (eds), Corpus Annotation, London: Longman, 1997, chapters 11 (Bateman et al) and 12 (Garside & Rayson)
• G Kennedy, An Introduction to Corpus Linguistics, London: Longman, 1998, pp. 231-244
• CF Meyer, English Corpus Linguistics, Cambridge: CUP, 2002, pp. 91-96
• R Mitkov (ed), The Oxford Handbook of Computational Linguistics, Oxford: OUP, 2003, chapter 4 (Kaplan)
• J Allen, Natural Language Understanding (2nd ed), Addison Wesley, 1994
2
Parsing
• POS tags give information about the individual words, and their internal form (eg sing vs plur, tense of verb)
• An additional level of information concerns the way the words relate to each other
– the overall structure of each sentence
– the relationships between the words
• This can be achieved by parsing the corpus
3
Parsing – overview
• What sort of information does parsing add?
• What are the difficulties relating to parsing?
• How is parsing done?
• Parsing and corpora
– partial parsing, chunking
– stochastic parsing
– treebanks
4
Structural information
• Parsing adds information about sentence structure and constituents
• Allows us to see what constructions words enter into
– eg, transitivity, passivization, argument structure for verbs
• Allows us to see how words function relative to each other
– eg, what words can modify / be modified by other words
5
[S[N Nemo_NP1 ,_, [N the_AT killer_NN1 whale_NN1 N] ,_, [Fr[N who_PNQS N][V 'd_VHD grown_VVN [J too_RG big_JJ [P for_IF [N his_APP$ pool_NN1 [P on_II [N Clacton_NP1 Pier_NNL1 N]P]N]P]J]V]Fr]N] ,_, [V has_VHZ arrived_VVN safely_RR [P at_II [N his_APP$ new_JJ home_NN1 [P in_II [N Windsor_NP1 [ safari_NN1 park_NNL1 ]N]P]N]P]V] ._. S]
Nemo , the killer whale , who ‘d grown too big for his pool on Clacton Pier , has arrived safely at his new home in Windsor safari park .
[Tree diagram of the parse above: phrase nodes S, N, Fr, V, J, P over the POS-tagged words of the sentence]
6
[Parse tree repeated from the previous slide]
given this verb, what kinds of things can be subject?
7
[Parse tree repeated from the previous slide]
verb with adjective complement:
– what verbs can participate in this construction?
– with what adjectives?
– any other constraints?
8
[Parse tree repeated from the previous slide]
verb with PP complement:
– what verbs with what prepositions?
– any constraints on the noun?
9
Parsing: difficulties
• Besides lexical ambiguities (usually resolved by the tagger), language can be structurally ambiguous
– global ambiguities due to ambiguous words and/or alternative possible combinations
– local ambiguities, especially due to attachment ambiguities and other combinatorial possibilities
– sheer weight of alternatives available in the absence of (much) knowledge
10
Global ambiguities
• Individual words can be ambiguous as to category
• In combination with each other this can lead to ambiguity:
– Time flies like an arrow
– Gas pump prices rose last time oil stocks fell
11
Local ambiguities
• Structure of individual constituents may be given, but how they fit together can be in doubt
• Classic example of PP attachment
– The man saw the girl with the telescope
– The man saw the girl in the park with a statue of the general on a horse with a sword on a stand in the morning with a red dress with a telescope
• Many other attachments potentially ambiguous
– relative clauses, adverbs, parentheticals, etc
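The combinatorics of the pile-up example above are well known: with n PPs in sequence, the number of possible attachment structures grows with the Catalan numbers. A quick sketch in Python (modelling each attachment pattern as a binary bracketing is an assumption made for illustration):

```python
from math import comb

def catalan(n):
    # n-th Catalan number: the number of binary bracketings of n+1 items,
    # and hence of attachment structures for a head followed by n PPs
    return comb(2 * n, n) // (n + 1)

for n in range(1, 7):
    print(n, catalan(n))  # 1, 2, 5, 14, 42, 132 structures

```
Even six PPs, as in the telescope example, already yield 132 candidate analyses, which is why attachment ambiguity dominates broad-coverage parsing.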
12
Difficulties
• Broad coverage necessary for parsing corpora of real text
• Long sentences:
– structures are very complex
– ambiguities proliferate
• Difficult (even for a human) to verify whether a parse is correct
– because it is complex
– because it may be genuinely ambiguous
13
How to parse
• Traditionally (in linguistics)
– hand-written grammar
– usually narrow coverage
– linguists are interested in theoretical issues regarding syntax
• Even in computational linguistics
– interest is (was?) in parsing algorithms
• In either case, grammars typically used small set of categories (N, V, Adj etc)
14
Lack of knowledge
• Humans are very good at disambiguating
• In fact they rarely even notice the ambiguity
• Usually, only one reading “makes sense”
• They use a combination of
– linguistic knowledge
– common-sense (real-world) knowledge
– contextual knowledge
• Only the first is available to computers, and then only in a limited way
15
Parsing corpora
• Using a tagger as a front-end changes things:
– Richer set of grammatical categories which reflect some morphological information
– Hand-written grammars more difficult though, because many generalisations are lost (eg now need many more rules for NP)
– Disambiguation done by the tagger in some sense pre-empts work that you might have expected the parser to do
16
Parsing corpora
• Impact of broad coverage requirement
– Broad coverage means that many more constructions are covered by the grammar
– This increases ambiguity massively
• Partial parsing may be sufficient for some needs
• Availability of corpora permits (and encourages) stochastic approach
17
Partial parsing
• Identification of constituents (noun phrases, verb groups, PPs) is often quite robust …
• Only fitting them together can be difficult
• Although some information is lost, identifying “chunks” can be useful
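Chunking of this kind can be sketched very simply, eg as a regular expression over POS tag sequences. The simplified tagset (DET, ADJ, N, V, P) and the helper below are hypothetical choices for illustration, not the CLAWS tags used in the Nemo example:

```python
import re

# A minimal noun-phrase chunker: an optional determiner, any number of
# adjectives, then one or more nouns.
NP_PATTERN = re.compile(r"(DET )?(ADJ )*(N )+")

def chunk_nps(tagged):
    """Return the NP chunks in a list of (word, tag) pairs."""
    tags = " ".join(tag for _, tag in tagged) + " "
    chunks = []
    for m in NP_PATTERN.finditer(tags):
        # each token contributes one trailing space, so counting spaces
        # before the match converts character offsets to token indices
        start = tags[:m.start()].count(" ")
        end = start + m.group().strip().count(" ") + 1
        chunks.append([w for w, _ in tagged[start:end]])
    return chunks

sent = [("the", "DET"), ("killer", "N"), ("whale", "N"),
        ("arrived", "V"), ("at", "P"), ("his", "DET"),
        ("new", "ADJ"), ("home", "N")]
print(chunk_nps(sent))  # [['the', 'killer', 'whale'], ['his', 'new', 'home']]
```
Note that the chunker says nothing about how the two NPs fit together, which is exactly the information partial parsing gives up.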
18
Stochastic parsing
• Like ordinary parsing, but competing rules are assigned a probability score
• Scores can be used to compare (and favour) alternative parses
• Where do the probabilities come from?
S → NP VP        .80
S → aux NP VP    .15
S → VP           .05
NP → det n       .20
NP → det adj n   .35
NP → n           .20
NP → adj n       .15
NP → pro         .10
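A parse's score is simply the product of the probabilities of the rules used in its derivation. A minimal sketch using the rule table above (the derivation shown is illustrative, and VP expansions are omitted because the table does not list them):

```python
# Rule probabilities copied from the table above
PCFG = {
    ("S", ("NP", "VP")): 0.80,
    ("S", ("aux", "NP", "VP")): 0.15,
    ("S", ("VP",)): 0.05,
    ("NP", ("det", "n")): 0.20,
    ("NP", ("det", "adj", "n")): 0.35,
    ("NP", ("n",)): 0.20,
    ("NP", ("adj", "n")): 0.15,
    ("NP", ("pro",)): 0.10,
}

def parse_probability(derivation):
    """Multiply the probabilities of the rules used in one derivation."""
    p = 1.0
    for rule in derivation:
        p *= PCFG[rule]
    return p

# A derivation using S -> NP VP and NP -> det n
derivation = [("S", ("NP", "VP")), ("NP", ("det", "n"))]
print(round(parse_probability(derivation), 4))  # 0.8 * 0.2 = 0.16
```
Comparing these products across the alternative parses of an ambiguous sentence is what lets the parser favour one analysis over another.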
19
Where do the probabilities come from?
1) Use a corpus of already parsed sentences: a “treebank”
– Best known example is the Penn Treebank
• Marcus et al. 1993
• Available from the Linguistic Data Consortium
• Based on Brown corpus + 1m words of Wall Street Journal + Switchboard corpus
– Count all occurrences of each expansion of a category (eg each NP rule) and divide by the total number of rules for that category
– Very laborious, so it is of course done automatically
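The counting step can be sketched as follows. Trees are represented here as nested tuples, a hypothetical format chosen for illustration (the Penn Treebank's actual format is bracketed text):

```python
from collections import Counter

def rules_in(tree):
    """Yield (lhs, rhs) for every internal node of a tree.
    A tree is (label, child, ...); leaves are plain strings."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for c in children:
        if not isinstance(c, str):
            yield from rules_in(c)

def mle_probabilities(treebank):
    """Count each rule and divide by the total count for its LHS category."""
    counts = Counter(r for t in treebank for r in rules_in(t))
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

# A toy two-sentence "treebank" (hand-made, purely illustrative)
treebank = [
    ("S", ("NP", "pro"), ("VP", "v")),
    ("S", ("NP", "det", "n"), ("VP", "v")),
]
probs = mle_probabilities(treebank)
print(probs[("NP", ("det", "n"))])  # 1 of the 2 NP expansions -> 0.5
```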
20
Where do the probabilities come from?
2) Create your own treebank from your own corpus
– Easy if all sentences are unambiguous: just count the (successful) rule applications
– When there are ambiguities, rules which contribute to the ambiguity have to be counted separately and weighted
21
Where do the probabilities come from?
3) Learn them as you go along
– Again, assumes some way of identifying the correct parse in case of ambiguity
– Each time a rule is successfully used, its probability is adjusted
– You have to start with some estimated probabilities, eg all equal
– Does need human intervention, otherwise rules become self-fulfilling prophecies
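One way to sketch this scheme: keep a count per rule, initialised equally (so all rules for a category start with the same probability), and bump a rule's count each time a human-verified parse uses it. The class and rule set below are illustrative, not from a real grammar:

```python
class OnlinePCFG:
    """Adjust rule probabilities as verified parses come in."""

    def __init__(self, rules):
        # Pseudocount of 1 per rule => all rules for an LHS start out equal
        self.counts = {rule: 1.0 for rule in rules}

    def observe(self, rule):
        """Called when a rule is used in a (human-verified) correct parse."""
        self.counts[rule] += 1.0

    def probability(self, rule):
        lhs = rule[0]
        total = sum(c for r, c in self.counts.items() if r[0] == lhs)
        return self.counts[rule] / total

rules = [("S", ("NP", "VP")), ("S", ("VP",)),
         ("NP", ("det", "n")), ("NP", ("pro",))]
g = OnlinePCFG(rules)
print(g.probability(("NP", ("det", "n"))))  # initially uniform: 0.5
g.observe(("NP", ("det", "n")))
g.observe(("NP", ("det", "n")))
print(g.probability(("NP", ("det", "n"))))  # now 3/4 = 0.75
```
The starting pseudocount is what keeps every rule's probability nonzero; without the human verification step, frequently chosen rules would simply reinforce themselves.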
22
Bootstrapping the grammar
• Start with a basic grammar, possibly written by hand, with all rules equally probable
• Parse a small amount of text, then correct it manually
– this may involve correcting the trees and/or changing the grammar
• Learn new probabilities from this small treebank
• Parse another (similar) amount of text, then correct it manually
• Adjust the probabilities based on the old and new trees combined
• Repeat until the grammar stabilizes
23
Treebanks – some examples
• Penn perhaps best known
– Wall Street Journal corpus, Brown Corpus; >1m words
• International Corpus of English (ICE)
• Lancaster Parsed Corpus and Lancaster-Leeds treebank
– parsed excerpts from LOB; 140k and 45k words resp.
• Susanne Corpus, Christine Corpus, Lucy Corpus
– related to Lancaster corpora; developed by Geoffrey Sampson
• Verbmobil treebanks
– parallel treebanks (Eng, Ger, Jap) used in speech MT project
• LinGO Redwoods: HPSG-based parsing of Verbmobil data
• Multi-Treebank
– parses in various frameworks of 60 sentences
• The PARC 700 Dependency Bank
– LFG parses of 700 sentences also found in the Penn treebank
• CHILDES
– Brown Eve corpus of children’s speech samples with dependency annotation