Treebanks, parsing, etc. · – Grammar engineering • Lovingly hand-crafted decades-long efforts...

Post on 02-Aug-2020


Treebanks, parsing, etc.

Syntax and computers

• Parsing: input is sentence, output is tree (or equivalent representation)

• Browsing:
– Finding particular syntactic structures within a corpus of sentences
– Finding sentences that match a particular syntactic construction

• Information retrieval, machine translation, speech recognition, etc.

Why parsing is difficult: Newspaper headlines

• Iraqi Head Seeks Arms

• Juvenile Court to Try Shooting Defendant

• Teacher Strikes Idle Kids

• Stolen Painting Found by Tree

• Local High School Dropouts Cut in Half

• Red Tape Holds Up New Bridges

• Clinton Wins on Budget, but More Lies Ahead

• Hospitals Are Sued by 7 Foot Doctors

• Kids Make Nutritious Snacks

Ambiguous headlines

• POLICE BEGIN CAMPAIGN TO RUN DOWN JAYWALKERS
• SAFETY EXPERTS SAY SCHOOL BUS PASSENGERS SHOULD BE BELTED
• DRUNK GETS NINE MONTHS IN VIOLIN CASE
• FARMER BILL DIES IN HOUSE
• IRAQI HEAD SEEKS ARMS
• PROSTITUTES APPEAL TO POPE
• BRITISH LEFT WAFFLES ON FALKLAND ISLANDS
• LUNG CANCER IN WOMEN MUSHROOMS
• TEACHER STRIKES IDLE KIDS
• ENRAGED COW INJURES FARMER WITH AXE
• JUVENILE COURT TO TRY SHOOTING DEFENDANT
• TWO SOVIET SHIPS COLLIDE, ONE DIES

Soar 2003 Tutorial

WordNet subcat frames

1 Something ----s
2 Somebody ----s
3 It is ----ing
4 Something is ----ing PP
5 Something ----s something Adjective/Noun
6 Something ----s Adjective/Noun
7 Somebody ----s Adjective
8 Somebody ----s something
9 Somebody ----s somebody
10 Something ----s somebody
11 Something ----s something
12 Something ----s to somebody
13 Somebody ----s on something
14 Somebody ----s somebody something
15 Somebody ----s something to somebody
16 Somebody ----s something from somebody
17 Somebody ----s somebody with something
18 Somebody ----s somebody of something
19 Somebody ----s something on somebody
20 Somebody ----s somebody PP
21 Somebody ----s something PP
22 Somebody ----s PP
23 Somebody's (body part) ----s
24 Somebody ----s somebody to INFINITIVE
25 Somebody ----s somebody INFINITIVE
26 Somebody ----s that CLAUSE
27 Somebody ----s to somebody
28 Somebody ----s to INFINITIVE
29 Somebody ----s whether INFINITIVE
30 Somebody ----s somebody into V-ing something
31 Somebody ----s something with something
32 Somebody ----s INFINITIVE
33 Somebody ----s VERB-ing
34 It ----s that CLAUSE
35 Something ----s INFINITIVE

English LCS lexicon

• Theta-grid information for verbs

• Derive ucat features – used to build syntactic structure

• Co-referenced with WordNet2.0 – theta-grids are aligned with ucat features and word sense information

English LCS lexicon data

10.6.a#1#_ag_th,mod-poss(of)#exonerate#exonerate#exonerate#exonerate+ed# (2.0,00874318_exonerate%2:32:00::)

"10.6.a"
:NAME "Verbs of Possessional Deprivation: Cheat Verbs / -of"
WORDS (absolve acquit balk bereave bilk bleed burgle cheat cleanse con cull cure defraud denude deplete depopulate deprive despoil disabuse disarm disencumber dispossess divest drain ease exonerate fleece free gull milk mulct pardon plunder purge purify ransack relieve render rid rifle rob sap strip swindle unburden void wean)
THETA_ROLES ((1 "_ag_th,mod-poss()") (1 "_ag_th,mod-poss(from)") (1 "_ag_th,mod-poss(of)"))
SENTENCES "He !!+ed the people (of their rights); He !!+ed him of his sins"
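The theta-grid strings in this entry can be taken apart mechanically. The sketch below rests on an assumption the slide does not state: that "_" introduces an obligatory role, "," an optional one, and a parenthesized word names the preposition marking the role. The function name `parse_theta_grid` is hypothetical and is not part of any LCS tooling.

```python
import re

# Assumed reading of the grid notation (not stated on the slide):
#   "_" introduces an obligatory role, "," an optional one,
#   "(...)" names the preposition that marks the role (may be empty).
ROLE = re.compile(r"([_,])([a-z-]+)(?:\(([^)]*)\))?")

def parse_theta_grid(grid):
    """Split a grid such as '_ag_th,mod-poss(of)' into
    (role, obligatory, preposition) triples."""
    roles = []
    for marker, name, prep in ROLE.findall(grid):
        # findall returns "" for a missing or empty preposition group
        roles.append((name, marker == "_", prep or None))
    return roles

# The three grids listed under THETA_ROLES in the entry above
for grid in ("_ag_th,mod-poss()", "_ag_th,mod-poss(from)", "_ag_th,mod-poss(of)"):
    print(grid, "->", parse_theta_grid(grid))
```

On this reading, the three grids differ only in which preposition (none, "from", or "of") introduces the optional mod-poss role, which fits the example sentence "He !!+ed the people (of their rights)".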

Doing syntax with computers

• To do this you need a grammar
• So where do grammars come from?
– Grammar engineering
  • Lovingly hand-crafted decades-long efforts by humans to write grammars (typically in some particular grammar formalism of interest to the linguists developing the grammar).
– TreeBanks
  • Semi-automatically generated sets of parse trees for the sentences in some corpus. Typically in a generic lowest common denominator formalism (of no particular interest to any modern linguist).

TreeBanks

• TreeBanks provide a grammar (of a sort).
• Hence they provide the training data for various computer applications that use syntax.
• But they can also provide useful data for more purely linguistic pursuits.
– You might have a theory about whether or not something can happen in a particular language.
– Or a theory about the contexts in which something can happen.
– TreeBanks can give you the means to explore those theories, if you can formulate the questions in the right way and get the data you need.

A Penn Treebank sentence

( (S
    (NP-SBJ (DT The) (NN move))
    (VP (VBD followed)
      (NP
        (NP (DT a) (NN round))
        (PP (IN of)
          (NP
            (NP (JJ similar) (NNS increases))
            (PP (IN by) (NP (JJ other) (NNS lenders)))
            (PP (IN against)
              (NP (NNP Arizona) (JJ real) (NN estate) (NNS loans))))))
      (, ,)
      (S-ADV
        (NP-SBJ (-NONE- *))
        (VP (VBG reflecting)
          (NP
            (NP (DT a) (VBG continuing) (NN decline))
            (PP-LOC (IN in) (NP (DT that) (NN market)))))))
    (. .)))
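A bracketed tree like this one is easy to read into a program. The following is a minimal sketch, not the official Penn Treebank tooling: `read_tree` and `terminals` are hypothetical helper names, and trees are stored as plain `(label, children)` tuples. It assumes the only special cases are the unlabeled outer wrapper and `-NONE-` empty elements.

```python
import re

PTB = """
( (S
    (NP-SBJ (DT The) (NN move))
    (VP (VBD followed)
      (NP
        (NP (DT a) (NN round))
        (PP (IN of)
          (NP
            (NP (JJ similar) (NNS increases))
            (PP (IN by) (NP (JJ other) (NNS lenders)))
            (PP (IN against)
              (NP (NNP Arizona) (JJ real) (NN estate) (NNS loans))))))
      (, ,)
      (S-ADV
        (NP-SBJ (-NONE- *))
        (VP (VBG reflecting)
          (NP
            (NP (DT a) (VBG continuing) (NN decline))
            (PP-LOC (IN in) (NP (DT that) (NN market)))))))
    (. .)))
"""

def read_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)

    def parse(pos):
        pos += 1  # skip the opening "("
        label = ""
        if tokens[pos] not in ("(", ")"):  # the outer wrapper has no label
            label, pos = tokens[pos], pos + 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                child, pos = parse(pos)
            else:
                child, pos = tokens[pos], pos + 1
            children.append(child)
        return (label, children), pos + 1  # skip the closing ")"

    tree, _ = parse(0)
    if tree[0] == "" and len(tree[1]) == 1:
        tree = tree[1][0]  # drop the unlabeled wrapper
    return tree

def terminals(tree):
    """Return the words in order, skipping -NONE- empty elements."""
    label, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            if label != "-NONE-":
                words.append(child)
        else:
            words.extend(terminals(child))
    return words

tree = read_tree(PTB)
print(tree[0])                    # root label
print(" ".join(terminals(tree)))  # the sentence without the empty subject
```

Keeping trees as plain tuples is a deliberate simplification; a real project would more likely use an existing treebank reader.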

Equivalent representations

• PS tree (phrase-markers)

• Bracketed labeling

• Automaton

• F-structure

Bracketed labeling

[IP [NP [Det The] [N dog]] [VP [V barked] [PP [P at] [NP [Det the] [N boy]]]]]

An automaton

F-structure

Time to be flexible!

• We have learned a way to diagram parse trees; it involves certain assumptions

• Not everybody agrees with all of these assumptions

• In fact, very few people agree on very many specifics at all

• Syntax resources reflect this diversity

• Hence the need to be flexible

Classical grammar engineering

• Write rules with associated lexicon
– S → NP VP          NN → interest
– NP → (DT) NN       NNS → rates
– NP → NN NNS        NNS → raises
– NP → NNP           VBP → interest
– VP → V NP          VBZ → rates
– Simple 10 rule grammar: 592 parses for some ambiguous sentences
– Real-size broad-coverage grammar: millions of parses for a complicated sentence

A simple grammar

S → NP VP    1.0        NP → NP PP         0.4
VP → V NP    0.7        NP → astronomers   0.1
VP → VP PP   0.3        NP → ears          0.18
PP → P NP    1.0        NP → saw           0.04
P → with     1.0        NP → stars         0.18
V → saw      1.0        NP → telescope     0.1
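To see what the rule probabilities buy us, here is a minimal CKY-style chart sketch over this grammar. It computes, for each span and category, the number of distinct parses and their summed probability. The test sentence "astronomers saw stars with ears" is an assumption chosen from the lexicon above, and the names `BINARY`, `LEXICON` and `inside_chart` are illustrative only.

```python
from collections import defaultdict

# Binary rules, indexed by their right-hand side: (B, C) -> [(A, P(A -> B C))]
BINARY = {
    ("NP", "VP"): [("S", 1.0)],
    ("V", "NP"): [("VP", 0.7)],
    ("VP", "PP"): [("VP", 0.3)],
    ("P", "NP"): [("PP", 1.0)],
    ("NP", "PP"): [("NP", 0.4)],
}

# Lexical rules: word -> [(A, P(A -> word))]
LEXICON = {
    "with": [("P", 1.0)],
    "saw": [("V", 1.0), ("NP", 0.04)],
    "astronomers": [("NP", 0.1)],
    "ears": [("NP", 0.18)],
    "stars": [("NP", 0.18)],
    "telescope": [("NP", 0.1)],
}

def inside_chart(words):
    """Fill a CKY chart bottom-up.

    prob[i, j, A]  = summed probability of all A-trees over words[i:j]
    count[i, j, A] = number of distinct A-trees over words[i:j]
    """
    n = len(words)
    prob = defaultdict(float)
    count = defaultdict(int)
    for i, word in enumerate(words):
        for cat, p in LEXICON.get(word, []):
            prob[i, i + 1, cat] += p
            count[i, i + 1, cat] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point
                for (left, right), parents in BINARY.items():
                    p_left = prob.get((i, k, left), 0.0)
                    p_right = prob.get((k, j, right), 0.0)
                    if p_left == 0.0 or p_right == 0.0:
                        continue
                    for parent, p_rule in parents:
                        prob[i, j, parent] += p_rule * p_left * p_right
                        count[i, j, parent] += count[i, k, left] * count[k, j, right]
    return prob, count

words = "astronomers saw stars with ears".split()
prob, count = inside_chart(words)
print("parses:", count[0, len(words), "S"])
print("total probability:", prob[0, len(words), "S"])
```

Under these assumptions the chart finds two parses for the sentence: one attaches "with ears" to the noun phrase (0.1 × 0.7 × 0.4 × 0.18 × 0.18 = 0.0009072) and one to the verb phrase (0.1 × 0.3 × 0.7 × 0.18 × 0.18 = 0.0006804), for a total of about 0.0015876.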

Ambiguity

• Tree for: Fed raises interest rates 0.5% in effort to control inflation (NYT headline 5/17/00)

Local V/N ambiguities

Ambiguity

• Local ambiguity means that we have to deal with multiple plausible choices during the parsing process.

• Global ambiguity means that the grammar can’t tell us which of several (many?) possible parses is the correct one.

Two possible PP attachments

Sample treebank parse

Sample treebank sentence

Sample NP rules

11/2/2011 CSCI 5832 Spring 2006

Example

How many rules?

A sample parsed sentence

PP attachment ambiguity (German)

PP attachment in Chinese

Sample trees

Searching treebank corpora

• Online
– The Treebank Tool Suite
– The VISL website
– The NCLT website

• Offline
– Treebank corpus
– Search utilities: tgrep, tregex, etc.
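What a search utility does can be imitated in a few lines. The sketch below does not call tgrep or tregex; it is a hand-rolled stand-in for one kind of query, "a node labeled X that immediately dominates a node labeled Y" (written `X < Y` in tgrep- and tregex-style patterns). The helper names and the tuple tree format are assumptions, and the example tree is a fragment of the Penn Treebank sentence shown earlier.

```python
import re

def read_tree(text):
    """Parse a labeled bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)

    def parse(pos):
        label = tokens[pos + 1]  # tokens[pos] is "("
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                child, pos = parse(pos)
            else:
                child, pos = tokens[pos], pos + 1
            children.append(child)
        return (label, children), pos + 1

    return parse(0)[0]

def subtrees(tree):
    """Yield the tree and every subtree below it, in preorder."""
    yield tree
    for child in tree[1]:
        if not isinstance(child, str):
            yield from subtrees(child)

def words(tree):
    """Return the words under a node, left to right."""
    return [w for t in subtrees(tree) for w in t[1] if isinstance(w, str)]

def immediately_dominates(tree, parent, child):
    """All nodes labeled `parent` that have a direct child labeled `child`."""
    return [
        t for t in subtrees(tree)
        if t[0] == parent
        and any(not isinstance(c, str) and c[0] == child for c in t[1])
    ]

fragment = read_tree(
    "(NP (NP (JJ similar) (NNS increases))"
    " (PP (IN by) (NP (JJ other) (NNS lenders))))"
)

for hit in immediately_dominates(fragment, "NP", "PP"):
    print(" ".join(words(hit)))
```

Exact label matching is a simplification: real treebank labels carry function tags such as NP-SBJ, which the dedicated tools handle with richer pattern syntax.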

tgrep

TiGer Treebank

[Figure: TiGer annotation graph for "Im nächsten Jahr will die Regierung ihre Reformpläne umsetzen." ("Next year the government wants to implement its reform plans.")]

• Annotation on word level: part-of-speech, morphology, lemmata

Token         POS       Morphology        Lemma
Im            APPRART   Dat               in
nächsten      ADJA      Sup.Dat.Sg.Neut   nahe
Jahr          NN        Dat.Pl.Neut       Jahr
will          VMFIN     3.Sg.Pres.Ind     wollen
die           ART       Nom.Sg.Fem        die
Regierung     NN        Nom.Sg.Fem        Regierung
ihre          PPOSAT    Acc.Pl.Masc       ihr
Reformpläne   NN        Acc.Pl.Masc       Plan
umsetzen      VVINF     Inf               umsetzen
.             $.

• Node labels: phrase categories (S, VP, NP, NP, PP)

• Edge labels: syntactic functions (HD, SB, OC, OA, MO, AC, NK)

• Crossing branches for discontinuous constituency types
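Crossing branches arise when a constituent covers words that are not adjacent. The sketch below encodes one plausible reading of the TiGer graph for "Im nächsten Jahr will die Regierung ihre Reformpläne umsetzen": which child hangs from which node is an assumption reconstructed from the node and edge labels, not something recoverable with certainty from the figure. The names `GRAPH`, `covered` and `is_discontinuous` are illustrative.

```python
TOKENS = ["Im", "nächsten", "Jahr", "will", "die", "Regierung",
          "ihre", "Reformpläne", "umsetzen"]

# node -> list of (edge label, child); a child is a token index or a node name.
# Assumed attachment: the PP and the object NP belong to the VP, not to S.
GRAPH = {
    "S": [("HD", 3), ("SB", "NP_sb"), ("OC", "VP")],
    "VP": [("MO", "PP"), ("OA", "NP_oa"), ("HD", 8)],
    "PP": [("AC", 0), ("NK", 1), ("NK", 2)],
    "NP_sb": [("NK", 4), ("NK", 5)],
    "NP_oa": [("NK", 6), ("NK", 7)],
}

def covered(node):
    """Set of token indices dominated by a node (or by a single token)."""
    if isinstance(node, int):
        return {node}
    result = set()
    for _edge_label, child in GRAPH[node]:
        result |= covered(child)
    return result

def is_discontinuous(node):
    """True if the covered token indices have a gap."""
    indices = covered(node)
    return max(indices) - min(indices) + 1 != len(indices)

discontinuous = [node for node in GRAPH if is_discontinuous(node)]
for node in discontinuous:
    print(node, [TOKENS[i] for i in sorted(covered(node))])
```

On this reading only the VP is discontinuous: it covers "Im nächsten Jahr" and "ihre Reformpläne umsetzen" but not the intervening "will die Regierung", which is why the figure needs crossing branches.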

Parallel treebanks

• Translation training and studies

• Machine translation (MT) research & development

Aligning parses