Parsing
See:
• R Garside, G Leech & A McEnery (eds), Corpus Annotation, London: Longman, 1997, chapters 11 (Bateman et al) and 12 (Garside & Rayson)
• G Kennedy, An Introduction to Corpus Linguistics, London: Longman, 1998, pp. 231-244
• CF Meyer, English Corpus Linguistics, Cambridge: CUP, 2002, pp. 91-96
• R Mitkov (ed), The Oxford Handbook of Computational Linguistics, Oxford: OUP, 2003, chapter 4 (Kaplan)
• J Allen, Natural Language Understanding (2nd ed), Addison Wesley, 1994
2
Parsing
• POS tags give information about the individual words, and their internal form (eg sing vs plur, tense of verb)
• An additional level of information concerns the way the words relate to each other
– the overall structure of each sentence
– the relationships between the words
• This can be achieved by parsing the corpus
3
Parsing – overview
• What sort of information does parsing add?
• What are the difficulties relating to parsing?
• How is parsing done?
• Parsing and corpora
– partial parsing, chunking
– stochastic parsing
– treebanks
4
Structural information
• Parsing adds information about sentence structure and constituents
• Allows us to see what constructions words enter into
– eg, transitivity, passivization, argument structure for verbs
• Allows us to see how words function relative to each other
– eg, what words can modify / be modified by other words
5
[S[N Nemo_NP1 ,_, [N the_AT killer_NN1 whale_NN1 N] ,_, [Fr[N who_PNQS N][V 'd_VHD grown_VVN [J too_RG big_JJ [P for_IF [N his_APP$ pool_NN1 [P on_II [N Clacton_NP1 Pier_NNL1 N]P]N]P]J]V]Fr]N] ,_, [V has_VHZ arrived_VVN safely_RR [P at_II [N his_APP$ new_JJ home_NN1 [P in_II [N Windsor_NP1 [ safari_NN1 park_NNL1 ]N]P]N]P]V] ._. S]
Nemo , the killer whale , who ‘d grown too big for his pool on Clacton Pier , has arrived safely at his new home in Windsor safari park .
[Tree diagram of the parse above: phrase nodes S, N, Fr, V, J, P over the POS-tagged words of the sentence]
6
[Parse tree repeated from the previous slide]
given this verb, what kinds of things can be subject?
7
[Parse tree repeated from the previous slide]
verb with adjective complement:
– what verbs can participate in this construction?
– with what adjectives?
– any other constraints?
8
[Parse tree repeated from the previous slide]
verb with PP complement:
– what verbs with what prepositions?
– any constraints on the noun?
9
Parsing: difficulties
• Besides lexical ambiguities (usually resolved by the tagger), language can be structurally ambiguous
– global ambiguities due to ambiguous words and/or alternative possible combinations
– local ambiguities, especially due to attachment ambiguities and other combinatorial possibilities
– sheer weight of alternatives available in the absence of (much) knowledge
10
Global ambiguities
• Individual words can be ambiguous as to category
• In combination with each other this can lead to ambiguity:
– Time flies like an arrow
– Gas pump prices rose last time oil stocks fell
11
Local ambiguities
• Structure of individual constituents may be given, but how they fit together can be in doubt
• Classic example of PP attachment
– The man saw the girl with the telescope
– The man saw the girl in the park with a statue of the general on a horse with a sword on a stand in the morning with a red dress with a telescope
• Many other attachments potentially ambiguous
– relative clauses, adverbs, parentheticals, etc
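The combinatorics of the pile-up example above are well known: with n PPs in sequence, the number of possible attachment structures grows with the Catalan numbers. A quick sketch in Python (modelling each attachment pattern as a binary bracketing is an assumption made for illustration):

```python
from math import comb

def catalan(n):
    # n-th Catalan number: the number of binary bracketings of n+1 items,
    # and hence of attachment structures for a head followed by n PPs
    return comb(2 * n, n) // (n + 1)

for n in range(1, 7):
    print(n, catalan(n))  # 1, 2, 5, 14, 42, 132 structures

```
Even six PPs, as in the telescope example, already yield 132 candidate analyses, which is why attachment ambiguity dominates broad-coverage parsing.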
12
Difficulties
• Broad coverage necessary for parsing corpora of real text
• Long sentences:
– structures are very complex
– ambiguities proliferate
• Difficult (even for a human) to verify whether a parse is correct
– because it is complex
– because it may be genuinely ambiguous
13
How to parse
• Traditionally (in linguistics)
– hand-written grammar
– usually narrow coverage
– linguists are interested in theoretical issues regarding syntax
• Even in computational linguistics
– interest is (was?) in parsing algorithms
• In either case, grammars typically used small set of categories (N, V, Adj etc)
14
Lack of knowledge
• Humans are very good at disambiguating
• In fact they rarely even notice the ambiguity
• Usually, only one reading “makes sense”
• They use a combination of
– linguistic knowledge
– common-sense (real-world) knowledge
– contextual knowledge
• Only the first is available to computers, and then only in a limited way
15
Parsing corpora
• Using a tagger as a front-end changes things:
– Richer set of grammatical categories which reflect some morphological information
– Hand-written grammars more difficult though, because many generalisations are lost (eg now need many more rules for NP)
– Disambiguation done by the tagger in some sense pre-empts work that you might have expected the parser to do
16
Parsing corpora
• Impact of broad coverage requirement
– Broad coverage means that many more constructions are covered by the grammar
– This increases ambiguity massively
• Partial parsing may be sufficient for some needs
• Availability of corpora permits (and encourages) stochastic approach
17
Partial parsing
• Identification of constituents (noun phrases, verb groups, PPs) is often quite robust …
• Only fitting them together can be difficult
• Although some information is lost, identifying “chunks” can be useful
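Chunking of this kind can be sketched very simply, eg as a regular expression over POS tag sequences. The simplified tagset (DET, ADJ, N, V, P) and the helper below are hypothetical choices for illustration, not the CLAWS tags used in the Nemo example:

```python
import re

# A minimal noun-phrase chunker: an optional determiner, any number of
# adjectives, then one or more nouns.
NP_PATTERN = re.compile(r"(DET )?(ADJ )*(N )+")

def chunk_nps(tagged):
    """Return the NP chunks in a list of (word, tag) pairs."""
    tags = " ".join(tag for _, tag in tagged) + " "
    chunks = []
    for m in NP_PATTERN.finditer(tags):
        # each token contributes one trailing space, so counting spaces
        # before the match converts character offsets to token indices
        start = tags[:m.start()].count(" ")
        end = start + m.group().strip().count(" ") + 1
        chunks.append([w for w, _ in tagged[start:end]])
    return chunks

sent = [("the", "DET"), ("killer", "N"), ("whale", "N"),
        ("arrived", "V"), ("at", "P"), ("his", "DET"),
        ("new", "ADJ"), ("home", "N")]
print(chunk_nps(sent))  # [['the', 'killer', 'whale'], ['his', 'new', 'home']]
```
Note that the chunker says nothing about how the two NPs fit together, which is exactly the information partial parsing gives up.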
18
Stochastic parsing
• Like ordinary parsing, but competing rules are assigned a probability score
• Scores can be used to compare (and favour) alternative parses
• Where do the probabilities come from?
S → NP VP        .80
S → aux NP VP    .15
S → VP           .05
NP → det n       .20
NP → det adj n   .35
NP → n           .20
NP → adj n       .15
NP → pro         .10
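A parse's score is simply the product of the probabilities of the rules used in its derivation. A minimal sketch using the rule table above (the derivation shown is illustrative, and VP expansions are omitted because the table does not list them):

```python
# Rule probabilities copied from the table above
PCFG = {
    ("S", ("NP", "VP")): 0.80,
    ("S", ("aux", "NP", "VP")): 0.15,
    ("S", ("VP",)): 0.05,
    ("NP", ("det", "n")): 0.20,
    ("NP", ("det", "adj", "n")): 0.35,
    ("NP", ("n",)): 0.20,
    ("NP", ("adj", "n")): 0.15,
    ("NP", ("pro",)): 0.10,
}

def parse_probability(derivation):
    """Multiply the probabilities of the rules used in one derivation."""
    p = 1.0
    for rule in derivation:
        p *= PCFG[rule]
    return p

# A derivation using S -> NP VP and NP -> det n
derivation = [("S", ("NP", "VP")), ("NP", ("det", "n"))]
print(round(parse_probability(derivation), 4))  # 0.8 * 0.2 = 0.16
```
Comparing these products across the alternative parses of an ambiguous sentence is what lets the parser favour one analysis over another.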
19
Where do the probabilities come from?
1) Use a corpus of already parsed sentences: a “treebank”
– Best known example is the Penn Treebank
• Marcus et al. 1993
• Available from the Linguistic Data Consortium
• Based on Brown corpus + 1m words of Wall Street Journal + Switchboard corpus
– Count all occurrences of each expansion of a category (eg each NP rule) and divide by the total number of rules for that category
– Very laborious, so it is of course done automatically
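The counting step can be sketched as follows. Trees are represented here as nested tuples, a hypothetical format chosen for illustration (the Penn Treebank's actual format is bracketed text):

```python
from collections import Counter

def rules_in(tree):
    """Yield (lhs, rhs) for every internal node of a tree.
    A tree is (label, child, ...); leaves are plain strings."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for c in children:
        if not isinstance(c, str):
            yield from rules_in(c)

def mle_probabilities(treebank):
    """Count each rule and divide by the total count for its LHS category."""
    counts = Counter(r for t in treebank for r in rules_in(t))
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

# A toy two-sentence "treebank" (hand-made, purely illustrative)
treebank = [
    ("S", ("NP", "pro"), ("VP", "v")),
    ("S", ("NP", "det", "n"), ("VP", "v")),
]
probs = mle_probabilities(treebank)
print(probs[("NP", ("det", "n"))])  # 1 of the 2 NP expansions -> 0.5
```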
20
Where do the probabilities come from?
2) Create your own treebank from your own corpus
– Easy if all sentences are unambiguous: just count the (successful) rule applications
– When there are ambiguities, rules which contribute to the ambiguity have to be counted separately and weighted
21
Where do the probabilities come from?
3) Learn them as you go along
– Again, assumes some way of identifying the correct parse in case of ambiguity
– Each time a rule is successfully used, its probability is adjusted
– You have to start with some estimated probabilities, eg all equal
– Does need human intervention, otherwise rules become self-fulfilling prophecies
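One way to sketch this scheme: keep a count per rule, initialised equally (so all rules for a category start with the same probability), and bump a rule's count each time a human-verified parse uses it. The class and rule set below are illustrative, not from a real grammar:

```python
class OnlinePCFG:
    """Adjust rule probabilities as verified parses come in."""

    def __init__(self, rules):
        # Pseudocount of 1 per rule => all rules for an LHS start out equal
        self.counts = {rule: 1.0 for rule in rules}

    def observe(self, rule):
        """Called when a rule is used in a (human-verified) correct parse."""
        self.counts[rule] += 1.0

    def probability(self, rule):
        lhs = rule[0]
        total = sum(c for r, c in self.counts.items() if r[0] == lhs)
        return self.counts[rule] / total

rules = [("S", ("NP", "VP")), ("S", ("VP",)),
         ("NP", ("det", "n")), ("NP", ("pro",))]
g = OnlinePCFG(rules)
print(g.probability(("NP", ("det", "n"))))  # initially uniform: 0.5
g.observe(("NP", ("det", "n")))
g.observe(("NP", ("det", "n")))
print(g.probability(("NP", ("det", "n"))))  # now 3/4 = 0.75
```
The starting pseudocount is what keeps every rule's probability nonzero; without the human verification step, frequently chosen rules would simply reinforce themselves.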
22
Bootstrapping the grammar
• Start with a basic grammar, possibly written by hand, with all rules equally probable
• Parse a small amount of text, then correct it manually
– this may involve correcting the trees and/or changing the grammar
• Learn new probabilities from this small treebank
• Parse another (similar) amount of text, then correct it manually
• Adjust the probabilities based on the old and new trees combined
• Repeat until the grammar stabilizes
23
Treebanks – some examples
• Penn perhaps best known
– Wall Street Journal corpus, Brown Corpus; >1m words
• International Corpus of English (ICE)
• Lancaster Parsed Corpus and Lancaster-Leeds treebank
– parsed excerpts from LOB; 140k and 45k words resp.
• Susanne Corpus, Christine Corpus, Lucy Corpus
– related to Lancaster corpora; developed by Geoffrey Sampson
• Verbmobil treebanks
– parallel treebanks (Eng, Ger, Jap) used in speech MT project
• LinGO Redwoods: HPSG-based parsing of Verbmobil data
• Multi-Treebank
– parses in various frameworks of 60 sentences
• The PARC 700 Dependency Bank
– LFG parses of 700 sentences also found in the Penn treebank
• CHILDES
– Brown Eve corpus of children’s speech samples with dependency annotation