ESSLLI 2006
Treebank-Based Acquisition of LFG, HPSG and CCG Resources
Advanced Course: Treebank-Based Acquisition of LFG, HPSG and CCG Resources
Josef van Genabith, Dublin City University
Yusuke Miyao, University of Tokyo
Julia Hockenmaier, University of Pennsylvania and University of Edinburgh
ESSLLI 2006, 18th European Summer School for Language, Logic and Information, University of Malaga, July – August 2006
Lecturer Contact Information
• Josef van Genabith, National Centre for Language Technology NCLT, School of Computing, Dublin City University, Dublin 9, Ireland, [email protected]
• Julia Hockenmaier, [email protected]
• Yusuke Miyao, Department of Computer Science, The University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113-0033, JAPAN, [email protected]
Motivation
• What do grammars do?
  – Grammars define languages as sets of strings
  – Grammars define what strings are grammatical and what strings are not
  – Grammars tell us about the syntactic structure of (associated with) strings
• “Shallow” vs. “Deep” grammars
• Shallow grammars do all of the above
• Deep grammars (in addition) relate text to information/meaning representation
• Information: predicate-argument-adjunct structure, deep dependency relations, logical forms, …
• In natural languages, linguistic material is not always interpreted locally where you encounter it: long-distance dependencies (LDDs)
• Resolution of LDDs is crucial for constructing accurate and complete information/meaning representations.
• Deep grammars := (text <-> meaning) + (LDD resolution)
Motivation
• Unification (Constraint-Based) Grammar Formalisms (FU, GPSG, PATR-II, …)
  – Lexical-Functional Grammar (LFG)
  – Head-Driven Phrase Structure Grammar (HPSG)
  – Combinatory Categorial Grammar (CCG)
  – Tree-Adjoining Grammar (TAG)
• Traditionally, deep constraint-based grammars are hand-crafted
• LFG ParGram, HPSG LinGO ERG, Core Language Engine CLE, Alvey Tools, RASP, ALPINO, …
• Wide-coverage, deep unification (constraint-based) grammar development is knowledge-intensive and expensive!
• Very hard to scale hand-crafted grammars to unrestricted text!
• English XLE (Riezler et al. 2002); German XLE (Forst and Rohrer 2006); Japanese XLE (Masuichi and Okuma 2003); RASP (Carroll and Briscoe 2002); ALPINO (Bouma, van Noord and Malouf, 2000)
Motivation
• Instance of “knowledge acquisition bottleneck” familiar from classical “rationalist” rule/knowledge-based AI/NLP
• Alternative to classical “rationalist” rule/knowledge-based AI/NLP
• “Empiricist” research paradigm (AI/NLP):
  – Corpora, treebanks, …, machine-learning-based and statistical approaches, …
  – Treebank-based grammar acquisition, probabilistic parsing
  – Advantage: grammars can be induced (learned) automatically
  – Very low development cost, wide-coverage, robust, but …
• Most treebank-based grammar induction/parsing technology produces “shallow” grammars
• Shallow grammars don’t resolve LDDs (but see (Johnson 2002); …), do not map strings to information/meaning representations …
Motivation
• Poses a research question:
• Can we address the knowledge acquisition bottleneck for deep grammar development by combining insights from rationalist and empiricist research paradigms?
• Specifically:
• Can we automatically acquire wide-coverage “deep”, probabilistic, constraint-based grammars from treebanks?
• How do we use them in parsing?
• Can we use them for generation?
• Can we acquire resources for different languages and treebank encodings?
• How do these resources compare with hand-crafted resources?
• …
Course Overview
Monday: Motivation, Course Overview, Introductions to TAG, LFG, CCG, HPSG and Penn-II TreeBank, TAG Resources
Tuesday: Penn-II-Based Acquisition of LFG Resources
Wednesday: Penn-II-Based Acquisition of CCG Resources
Thursday: Penn-II-Based Acquisition of HPSG Resources
Friday: Multilingual Resources, Formal Semantics, Comparing LFG, CCG, HPSG and TAG-Based Approaches, Demos, Current and Future Work, Discussion
Course Overview
Tuesday/Wednesday/Thursday
Penn-II-Based Acquisition of XXG Resources:
• Treebank Preprocessing/Clean-Up
• Treebank Annotation/Conversion
• Grammar and Lexicon Extraction
• Parsing (Architectures, Probability Models, Evaluation)
• Generation (Architectures, Probability Models, Evaluation)
• Other (Semantics, Domain Variation, …)
Grammar Formalisms
Grammar formalisms and linguistic theories
• Linguistics aims to explain natural language:
  – What is universal grammar?
  – What are language-specific constraints?
• Formalisms are mathematical theories:
  – They provide a language in which linguistic theories can be expressed (like calculus for physics)
  – They define elementary objects (trees, strings, feature structures) and recursive operations which generate complex objects from simple objects.
  – They do impose linguistic constraints (e.g. on the kinds of dependencies they can capture)
Lexicalised Grammar Formalisms:
TAG, CCG, LFG and HPSG
Lexicalised formalisms (TAG, CCG, LFG and HPSG)
• The lexicon:
  – pairs words with elementary objects
  – specifies all language-specific information (number and location of arguments, control and binding theory)
• The grammatical operations:
  – are universal
  – define (and impose constraints on) recursion
TAG, CCG, LFG and HPSG
• They describe different kinds of linguistic objects:
  – TAG is a theory of trees
  – CCG is a theory of (syntactic and semantic) types
  – LFG is a multi-level theory based on a projection architecture relating different types of linguistic objects (trees, AVMs, linear logic–based semantics)
  – HPSG uses a single, uniform formalism (typed feature structures) to describe phonological, morphological, syntactic and semantic representations (signs)
• They differ in details:
  – treatment of wh-movement, coordination, etc.
TAG, CCG, LFG and HPSG
• TAG and CCG are weakly equivalent.
• Both are mildly context-sensitive:
  – can capture Dutch crossing dependencies
  – but are still efficiently parseable (in polynomial time)
• LFG is context-sensitive
Tree-Adjoining Grammar (TAG)
(Lexicalized) Tree-Adjoining Grammar
• TAG is a tree-rewriting formalism:
  – TAG defines operations (substitution and adjunction) on trees.
  – The elementary objects in TAG are trees (not strings)
• TAG is lexicalized:
  – Each elementary tree is anchored to a lexical item (word)
  – “Extended domain of locality”: The elementary tree contains all arguments of the anchor.
  – TAG requires a linguistic theory which specifies the shape of these elementary trees.
• TAG is mildly context-sensitive:
  – can capture Dutch crossing dependencies
  – but is still efficiently parseable
A.K. Joshi and Y. Schabes (1996) Tree Adjoining Grammars. In G. Rozenberg and A. Salomaa, Eds., Handbook of Formal Languages
TAG substitution (arguments)
[Figure: substitution of initial trees at the matching leaf nodes X and Y of another tree; panels: Derived tree, Derivation tree]
TAG adjunction (modifiers)
[Figure: adjunction of an auxiliary tree (root X, foot node X*) at a node X of another tree; panels: Derived tree, Derivation tree]
A small TAG lexicon
eats: (S NP (VP (VBZ eats) NP))
John: (NP John)
always: (VP (RB always) VP*)
tapas: (NP tapas)
A TAG derivation
[Figure: the initial trees (NP John) and (NP tapas) are substituted at the two NP nodes of (S NP (VP (VBZ eats) NP)); the auxiliary tree (VP (RB always) VP*) is still separate]
A TAG derivation
[Figure: the auxiliary tree (VP (RB always) VP*) is adjoined at the VP node of (S (NP John) (VP (VBZ eats) (NP tapas)))]
A TAG derivation
[Figure: the derived tree (S (NP John) (VP (RB always) (VP (VBZ eats) (NP tapas))))]
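
The two TAG operations used in this derivation can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `Node` class and the functions `substitute`, `adjoin`, `words` and `bracket` are hypothetical names, substitution always targets the leftmost open leaf, and adjunction always targets the first matching internal node below the root.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None  # set on lexical leaves (anchors)
    foot: bool = False          # True for the foot node (X*) of an auxiliary tree


def substitute(tree: Node, initial: Node) -> bool:
    """Replace the leftmost open leaf whose label equals initial's root label."""
    for i, child in enumerate(tree.children):
        is_open = not child.children and child.word is None and not child.foot
        if is_open and child.label == initial.label:
            tree.children[i] = initial
            return True
        if substitute(child, initial):
            return True
    return False


def _plug_foot(aux: Node, subtree: Node) -> bool:
    """Replace the foot node of aux with subtree."""
    for i, child in enumerate(aux.children):
        if child.foot:
            aux.children[i] = subtree
            return True
        if _plug_foot(child, subtree):
            return True
    return False


def adjoin(tree: Node, aux: Node) -> bool:
    """Adjoin aux at the first internal node (below the root) labelled like aux's root."""
    for i, child in enumerate(tree.children):
        if child.children and child.label == aux.label:
            _plug_foot(aux, child)   # the excised subtree goes under the foot node
            tree.children[i] = aux   # the auxiliary tree takes the subtree's place
            return True
        if adjoin(child, aux):
            return True
    return False


def words(tree: Node) -> List[str]:
    if tree.word is not None:
        return [tree.word]
    return [w for c in tree.children for w in words(c)]


def bracket(tree: Node) -> str:
    if tree.word is not None:
        return f"({tree.label} {tree.word})"
    return "(" + " ".join([tree.label] + [bracket(c) for c in tree.children]) + ")"


# The small TAG lexicon from the slides.
eats = Node("S", [Node("NP"), Node("VP", [Node("VBZ", word="eats"), Node("NP")])])
john = Node("NP", word="John")
tapas = Node("NP", word="tapas")
always = Node("VP", [Node("RB", word="always"), Node("VP", foot=True)])

substitute(eats, john)    # fills the subject NP
substitute(eats, tapas)   # fills the object NP
adjoin(eats, always)      # wraps the VP
derived = bracket(eats)
print(derived)
```

A real TAG parser would record the derivation tree (which elementary tree was attached where) alongside the derived tree; this sketch only builds the derived tree.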
Combinatory Categorial Grammar (CCG)
Combinatory Categorial Grammar
• CCG is a lexicalized grammar formalism (the “rules” of the grammar are completely general, all language-specific information is given in the lexicon)
• CCG is nearly context-free (can capture Dutch crossing dependencies, but is still efficiently parseable)
• CCG has a flexible constituent structure
• CCG has a simple, unified treatment of extraction and coordination
• CCG has a transparent syntax-semantics interface (every syntactic category and operation has a semantic counterpart)
• CCG rules are monotonic (movement or traces don’t exist)
• CCG rules are type-driven, not structure-driven (this means e.g. that intransitive verbs and VPs are indistinguishable)
CCG: the machinery
• Categories: specify subcat lists of words/constituents.
• Combinatory rules: specify how constituents can combine.
• The lexicon: specifies which categories a word can have.
• Derivations: spell out process of combining constituents.
CCG categories
• Simple categories: NP, S, PP
• Complex categories: functions which return a result when combined with an argument:
  VP or intransitive verb: S\NP
  Transitive verb: (S\NP)/NP
  Adverb: (S\NP)\(S\NP)
  PPs: ((S\NP)\(S\NP))/NP, (NP\NP)/NP
• Every category has a semantic interpretation
Function application
• Combines a function with its argument to yield a result:
  (S\NP)/NP   NP      ->   S\NP
  eats        tapas        eats tapas

  NP     S\NP         ->   S
  John   eats tapas        John eats tapas
• Used in all variants of categorial grammar
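
Function application is easy to sketch in Python. The encoding below is an assumption made for illustration: atomic categories are strings, and a complex category is a `(result, slash, argument)` tuple; `forward_apply`, `backward_apply` and `show` are hypothetical helper names.

```python
def forward_apply(left, right):
    """Forward application:  X/Y  Y  ->  X"""
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    return None


def backward_apply(left, right):
    """Backward application:  Y  X\\Y  ->  X"""
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]
    return None


def show(cat, top=True):
    """Print a category in the usual notation, e.g. (S\\NP)/NP."""
    if isinstance(cat, str):
        return cat
    result, slash, arg = cat
    text = f"{show(result, False)}{slash}{show(arg, False)}"
    return text if top else f"({text})"


IV = ("S", "\\", "NP")   # intransitive verb / VP: S\NP
TV = (IV, "/", "NP")     # transitive verb: (S\NP)/NP
lexicon = {"John": "NP", "tapas": "NP", "eats": TV}

vp = forward_apply(lexicon["eats"], lexicon["tapas"])   # eats tapas
s = backward_apply(lexicon["John"], vp)                 # John eats tapas
print(show(TV), "+ NP ->", show(vp))
print("NP +", show(vp), "->", show(s))
```

Both rules return `None` when the categories do not fit, which is how a chart parser would discard a combination.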
A (C)CG derivation
Type-raising and function composition
• Type-raising: turns an argument into a function. Corresponds to case:
  NP -> S/(S\NP) (nominative)
  NP -> (S\NP)\((S\NP)/NP) (accusative)
• Function composition: composes two functions (complex categories)
  (S\NP)/PP   PP/NP   ->   (S\NP)/NP
  S/(S\NP)   (S\NP)/NP   ->   S/NP
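
A minimal Python sketch of these two rules, assuming the same tuple encoding of categories as in ordinary categorial grammar sketches (atomic categories are strings, complex ones are `(result, slash, argument)` tuples). The function names are hypothetical; the rule schemata are forward type-raising X -> T/(T\X), backward type-raising X -> T\(T/X), and forward composition X/Y Y/Z -> X/Z.

```python
def type_raise_forward(x, t):
    """X -> T/(T\\X)"""
    return (t, "/", (t, "\\", x))


def type_raise_backward(x, t):
    """X -> T\\(T/X)"""
    return (t, "\\", (t, "/", x))


def forward_compose(left, right):
    """X/Y  Y/Z  ->  X/Z"""
    if (isinstance(left, tuple) and isinstance(right, tuple)
            and left[1] == "/" and right[1] == "/" and left[2] == right[0]):
        return (left[0], "/", right[2])
    return None


VP = ("S", "\\", "NP")    # S\NP
TV = (VP, "/", "NP")      # (S\NP)/NP

# Nominative type-raising of the subject, then composition with the verb:
subject = type_raise_forward("NP", "S")          # S/(S\NP)
subject_verb = forward_compose(subject, TV)      # S/NP

# Verb taking a PP composed with a preposition:
verb_prep = forward_compose((VP, "/", "PP"), ("PP", "/", "NP"))   # (S\NP)/NP

# Accusative type-raising of an object NP:
obj = type_raise_backward("NP", VP)              # (S\NP)\((S\NP)/NP)
```

The category `S/NP` built for subject plus verb is what lets CCG treat "John eats" as a constituent in extraction and right-node raising.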
Type-raising and Composition
• Wh-movement:
• Right-node raising:
Another CCG derivation
• We will only be concerned with canonical “normal-form” derivations, which only use function composition and type-raising when syntactically necessary.
CCG: semantics
• Every syntactic category and rule has a semantic counterpart:
The CCG lexicon
• Pairs words with their syntactic categories (and semantic interpretation):
  eats   (S\NP)/NP   λx.λy.eats’xy
         S\NP        λx.eats’x
• The main bottleneck for wide-coverage CCG parsing
Why use CCG for statistical parsing?
• CCG derivations are binary trees: we can use standard chart parsing techniques.
• CCG derivations represent long-range dependencies and complement-adjunct distinctions directly:
A comparison with Penn Treebank parsers
• Standard Treebank parsers do not recover the null elements and function tags that are necessary for semantic interpretation:
Lexical-Functional Grammar (LFG)
Lexical-Functional Grammar LFG
Lexical-Functional Grammar (LFG) (Bresnan & Kaplan 1981, Bresnan 2001, Dalrymple 2001) is a unification- (or constraint-) based theory of grammar.
Two (basic) levels of representation:
• C-structure: represents surface grammatical configurations such as word order, annotated CFG data structures
• F-structure: represents abstract syntactic functions such as SUBJ(ject), OBJ(ect), OBL(ique), PRED(icate), COMP(lement), ADJ(unct) …, AVM attribute-value matrices/structures
F-structure approximates to basic predicate-argument structure, dependency representation, logical form (van Genabith and Crouch, 1996; 1997)
LFG Grammar Rules and Lexical Entries
LFG Parse Tree (with Equations/Constraints)
LFG Constraint Resolution (1/3)
LFG Constraint Resolution (2/3)
LFG Constraint Resolution (3/3)
LFG Subcategorisation & Long Distance Dependencies
• Subcategorisation:
– Semantic forms (subcat frames): sign< SUBJ, OBJ>
– Completeness: all GFs in semantic form present at local f-structure
– Coherence: only the GFs in semantic form present at local f-structure
• Long Distance Dependencies (LDDs): resolved at f-structure with Functional Uncertainty Equations (regular expressions specifying paths in f-structure).
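
The completeness and coherence conditions can be illustrated with a short Python sketch over f-structures stored as dictionaries. Everything here is an assumption for illustration: the `GOVERNABLE` set, the string format of the semantic form, and the function names are hypothetical, and only the local f-structure is checked (no recursion into embedded f-structures).

```python
# Assumed inventory of subcategorisable (governable) grammatical functions.
GOVERNABLE = {"SUBJ", "OBJ", "OBJ2", "OBL", "COMP", "XCOMP"}


def parse_semantic_form(pred):
    """'sign<SUBJ,OBJ>' -> ('sign', ['SUBJ', 'OBJ'])"""
    name, _, rest = pred.partition("<")
    args = [a.strip() for a in rest.rstrip(">").split(",") if a.strip()]
    return name.strip(), args


def is_complete(fs):
    """All GFs named in the semantic form are present in the local f-structure."""
    _, args = parse_semantic_form(fs["PRED"])
    return all(gf in fs for gf in args)


def is_coherent(fs):
    """Only the GFs named in the semantic form are present (adjuncts are not governable)."""
    _, args = parse_semantic_form(fs["PRED"])
    present = {attr for attr in fs if attr in GOVERNABLE}
    return present <= set(args)


ok = {"PRED": "sign<SUBJ,OBJ>", "SUBJ": {"PRED": "john"}, "OBJ": {"PRED": "contract"},
      "ADJUNCT": [{"PRED": "today"}]}
missing_obj = {"PRED": "sign<SUBJ,OBJ>", "SUBJ": {"PRED": "john"}}
extra_obj = {"PRED": "fall<SUBJ>", "SUBJ": {"PRED": "payroll"}, "OBJ": {"PRED": "it"}}
```

Here `ok` passes both checks, `missing_obj` violates completeness, and `extra_obj` violates coherence.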
LFG LDDs: Complement Relative Clause
Head-Driven Phrase Structure Grammar (HPSG)
Head-Driven Phrase Structure Grammar HPSG
• HPSG (Pollard and Sag 1994, Sag et al. 2003) is a unification-/constraint-based theory of grammar
• HPSG is a lexicalized grammar formalism
• HPSG aims to explain generic regularities that underlie phrase structures, lexicons, and semantics, as well as language-specific/-independent constraints
• Syntactic/semantic constraints are uniformly denoted by signs, which are represented with feature structures
• Two components of HPSG
  – Lexical entries represent word-specific constraints (corresponding to elementary objects)
  – Principles express generic grammatical regularities (corresponding to grammatical operations)
Sign
• Sign is a formal representation of combinations of phonological forms, syntactic and semantic constraints
sign
  PHON string                 (phonological form)
  SYNSEM synsem               (syntactic/semantic constraints)
    LOCAL local               (local constraints)
      CAT category            (syntactic category)
        HEAD head             (syntactic head)
          MOD synsem          (modifying constraints)
        VAL valence           (subcategorization frames)
          SPR list
          SUBJ list
          COMPS list
      CONT content            (semantic representations)
    NONLOCAL nonlocal         (non-local dependencies)
      QUE list
      REL list
      SLASH list
  DTRS dtrs                   (daughter structures)
Lexical entries
• Lexical entries express word-specific constraints
  [ PHON “loves”
    HEAD verb
    SUBJ <HEAD noun>
    COMPS <HEAD noun> ]
We use simplified notations in this lecture
Principles
• Principles describe generic regularities of grammar
  – Not corresponding to construction rules
• Head Feature Principle
  – The value of HEAD must be percolated from the head daughter
  [Figure: mother [HEAD 1] above head daughter [HEAD 1]]
• Valence Principle
  – Subcats not consumed are percolated to the mother
• Immediate Dominance (ID) Principle
  – A mother and her immediate daughters must satisfy one of the ID schemas
• Many other principles: percolation of NONLOCAL features, semantics construction, etc.
ID schemas
• ID schemas correspond to construction rules in CFGs and other grammar formalisms
  – For subject-head constructions (ex. “John runs”):
    mother [SUBJ <>] -> 1, head daughter [SUBJ <1>]
  – For head-complement constructions (ex. “loves Mary”):
    mother [COMPS 2] -> head daughter [COMPS <1 | 2>], 1
  – For filler-head constructions (ex. “what he bought”):
    mother [SLASH 2] -> 1, head daughter [SLASH <1 | 2>]
Example: HPSG parsing
• Lexical entries determine syntactic/semantic constraints of words
Lexical entries:
  John: [HEAD noun, SUBJ <>, COMPS <>]
  saw:  [HEAD verb, SUBJ <HEAD noun>, COMPS <HEAD noun>]
  Mary: [HEAD noun, SUBJ <>, COMPS <>]
Example: HPSG parsing
• Principles determine generic constraints of grammar
  John: [HEAD noun, SUBJ <>, COMPS <>]
  saw:  [HEAD verb, SUBJ <HEAD noun>, COMPS <HEAD noun>]
  Mary: [HEAD noun, SUBJ <>, COMPS <>]
Unification of “saw” and “Mary” with the head-complement schema:
  mother [HEAD 1, SUBJ 2, COMPS 4] -> head daughter [HEAD 1, SUBJ 2, COMPS <3 | 4>], 3
Example: HPSG parsing
• Principle application produces phrasal signs
  John: [HEAD noun, SUBJ <>, COMPS <>]
  saw:  [HEAD verb, SUBJ <HEAD noun>, COMPS <HEAD noun>]
  Mary: [HEAD noun, SUBJ <>, COMPS <>]
  saw Mary: [HEAD verb, SUBJ <HEAD noun>, COMPS <>]
Example: HPSG parsing
• Recursive applications of principles produce syntactic/semantic structures of sentences
  John: [HEAD noun, SUBJ <>, COMPS <>]
  saw:  [HEAD verb, SUBJ <HEAD noun>, COMPS <HEAD noun>]
  Mary: [HEAD noun, SUBJ <>, COMPS <>]
  saw Mary: [HEAD verb, SUBJ <HEAD noun>, COMPS <>]
  John saw Mary: [HEAD verb, SUBJ <>, COMPS <>]
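
The parsing example can be mimicked with a small Python sketch. It is a heavy simplification and not real HPSG: signs are plain dictionaries, valence lists hold only the required HEAD value, and an equality test stands in for feature-structure unification. The function names `head_complement` and `subject_head` are hypothetical.

```python
def head_complement(head, comp):
    """Head-complement schema: the first COMPS requirement of the head is
    satisfied by comp; HEAD and SUBJ are passed up, remaining COMPS too."""
    if not head["COMPS"] or head["COMPS"][0] != comp["HEAD"]:
        return None
    return {"HEAD": head["HEAD"], "SUBJ": head["SUBJ"], "COMPS": head["COMPS"][1:]}


def subject_head(subj, head):
    """Subject-head schema: a COMPS-saturated head combines with its subject."""
    if head["COMPS"] or head["SUBJ"] != [subj["HEAD"]]:
        return None
    return {"HEAD": head["HEAD"], "SUBJ": [], "COMPS": []}


# Simplified lexical entries from the slides.
john = {"HEAD": "noun", "SUBJ": [], "COMPS": []}
mary = {"HEAD": "noun", "SUBJ": [], "COMPS": []}
saw = {"HEAD": "verb", "SUBJ": ["noun"], "COMPS": ["noun"]}

saw_mary = head_complement(saw, mary)        # phrasal sign for "saw Mary"
sentence = subject_head(john, saw_mary)      # phrasal sign for "John saw Mary"
print(saw_mary)
print(sentence)
```

Copying `HEAD` from the head daughter plays the role of the Head Feature Principle, and passing on the unconsumed valence requirements plays the role of the Valence Principle.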
Example: LDDs
• NONLOCAL features (SLASH, REL, etc.) explain long-distance dependencies
  – WH movements
  – Topicalization
  – Relative clauses etc.
[Figure: HPSG derivation over the words “the”, “prices”, “we”, “were”, “charged”, showing HEAD, SUBJ, COMPS and SPR values with co-indexed SLASH and REL features]
Brief Intro to Penn Treebank
The Penn Treebank
• The first large syntactically annotated corpus
• Contains text from different domains:
  – Wall Street Journal (50,000 sentences, 1 million words)
  – Switchboard
  – Brown corpus
  – ATIS
• The annotation:
  – POS-tagged (Ratnaparkhi’s MXPOST)
  – Manually annotated with phrase-structure trees
  – Traces and other null elements used to represent non-local dependencies (movement, PRO, etc.)
  – Designed to facilitate extraction of predicate-argument structure
A Treebank tree
• Relatively flat structures:
  – There is no noun level
  – VP arguments and adjuncts appear at the same level
• Co-indexed null elements indicate long-range dependencies
• Function tags indicate complement-adjunct distinction (?)
Penn-II Treebank
• Until Congress acts , the government hasn't any authority to issue new debt obligations of any kind , the Treasury said .
Penn-II Treebank
ADJP ADJP-ADV ADJP-CLR ADJP-HLN ADJP-LOC ADJP-MNR ADJP-PRD ADJP-SBJ ADJP-TMP ADJP-TPC ADJP-TTL ADVP ADVP-CLR ADVP-DIR ADVP-EXT ADVP-HLN ADVP-LOC ADVP-MNR ADVP-PRD ADVP-PRP ADVP-PUT ADVP-TMP ADVP-TPC ADVP|PRT CONJP FRAG FRAG-ADV FRAG-HLN FRAG-PRD FRAG-TPC FRAG-TTL
INTJ INTJ-CLR INTJ-HLN LST NAC NAC-LOC NAC-TMP NAC-TTL NP NP-ADV NP-BNF NP-CLR NP-DIR NP-EXT NP-HLN NP-LGS NP-LOC NP-MNR NP-PRD NP-SBJ NP-TMP NP-TPC NP-TTL NP-VOC NX NX-TTL PP PP-BNF PP-CLR PP-DIR PP-DTV
PP-EXT PP-HLN PP-LGS PP-LOC PP-MNR PP-NOM PP-PRD PP-PRP PP-PUT PP-SBJ PP-TMP PP-TPC PP-TTL PRN PRT PRT|ADVP QP RRC S S-ADV S-CLF S-CLR S-HLN S-LOC S-MNR S-NOM S-PRD S-PRP S-SBJ S-TMP S-TPC
S-TTL SBAR SBAR-ADV SBAR-CLR SBAR-DIR SBAR-HLN SBAR-LOC SBAR-MNR SBAR-NOM SBAR-PRD SBAR-PRP SBAR-PUT SBAR-SBJ SBAR-TMP SBAR-TPC SBAR-TTL SBARQ SBARQ-HLN SBARQ-NOM SBARQ-PRD SBARQ-TPC SBARQ-TTL SINV SINV-ADV SINV-HLN SINV-TPC SINV-TTL SQ SQ-PRD SQ-TPC SQ-TTL
UCP UCP-ADV UCP-CLR UCP-DIR UCP-EXT UCP-HLN UCP-LOC UCP-MNR UCP-PRD UCP-PRP UCP-TMP UCP-TPC VP VP-TPC VP-TTL WHADJP WHADVP WHADVP-TMP WHNP WHPP X X-ADV X-CLF X-DIR X-EXT X-HLN X-PUT X-TMP X-TTL X-TTL
Penn-II Treebank (Simple Transitive Verb)
Penn-II Treebank (Simple Coordination)
Penn-II Treebank (Passive)
Penn-II Treebank (Subject WH-Relative Clause)
Penn-II Treebank (WH-Less Complement Relative Cl.)
Penn-II Treebank (Control and WH-Compl. Rel. Cl.)
Penn-II Treebank (Adv. Relative Clause)
Penn-II Treebank (Coord. and Right Node Raising)
The Parseval measure
• Standard evaluation metric for Treebank parsers. Two components:
  – Precision: how many of the proposed NTs are correct?
  – Recall: how many of the correct NTs are proposed?
• Measures recovery of nonterminals (span + syntactic category)
• Ignores function tags and null elements
• Has biased research towards parsers that produce linguistically shallow output (Collins, Charniak)
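
Labelled precision and recall are simple to compute once each tree is reduced to a collection of (category, start, end) triples. The sketch below assumes that reduction has already been done, with function tags stripped and null elements removed; the function name `parseval` and the toy bracketings are hypothetical.

```python
from collections import Counter


def parseval(gold, proposed):
    """Labelled precision and recall over (category, start, end) triples."""
    gold_counts, proposed_counts = Counter(gold), Counter(proposed)
    matched = sum((gold_counts & proposed_counts).values())  # multiset intersection
    precision = matched / sum(proposed_counts.values()) if proposed_counts else 0.0
    recall = matched / sum(gold_counts.values()) if gold_counts else 0.0
    return precision, recall


# Toy bracketings for a four-word sentence (word positions 0..4).
gold = [("S", 0, 4), ("NP", 0, 1), ("VP", 1, 4), ("NP", 3, 4)]
proposed = [("S", 0, 4), ("NP", 0, 1), ("ADVP", 1, 2), ("VP", 2, 4), ("NP", 3, 4)]

precision, recall = parseval(gold, proposed)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Three of the five proposed nonterminals match the gold tree and three of the four gold nonterminals are found, so a parser can score reasonably here without recovering any trace or function tag.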
Treebank-Based Acquisition
of TAG resources
Extracting a TAG from the Treebank
• Two different approaches:
  – F. Xia. Automatic Grammar Generation From Two Different Perspectives. PhD thesis, University of Pennsylvania, 2001.
  – J. Chen, S. Bangalore, K. Vijay-Shanker. Automated Extraction of Tree-Adjoining Grammars from Treebanks, Natural Language Engineering (forthcoming)
• This lecture: just the basic ideas!
Extracting a TAG from the Penn Treebank
• Input: a Treebank tree (= the TAG derived tree)
• Output: a set of elementary trees (= the TAG lexicon)
Extracting a TAG: the head
- Identify the head path (requires a head percolation table)
- Find the arguments of the head (requires an argument table)
- Ignore modifiers (requires an adjunct table)
- Merge unary productions (VP -> VP)
[Figure: head path from S through VP and VP down to VBG “making”; arguments NP-SBJ and NP]
Extracting a TAG: the head
• This is the elementary tree for the head:
Extracting a TAG: arguments
• Arguments are combined via substitution
• Recurse on the arguments:
Extracting a TAG: adjuncts
• Adjuncts require auxiliary trees (use adjunction to be combined with the head)
• Auxiliary trees require a foot node (with the same label as the root)
[Figure: auxiliary trees extracted for “is” (VBZ), “officially” (ADVP-MNR) and “the” (DT), with VP and NP root and foot nodes]
Special cases
• Coordination
• Null elements (e.g. traces for wh-movement): The trace has to be part of the elementary tree of the main verb
• Punctuation marks
Wh-movement: relative clauses
(NP (NP a charge)
    (SBAR (WHNP-2 (-NONE- 0))
          (S (NP-SBJ Mr. Coleman)
             (VP (VBZ denies)
                 (NP (-NONE- *T*-2))))))
[Figure: extracted elementary trees; the WHNP with its null element 0 and the trace *T*-2 belong to the elementary tree anchored by “denies”]
Evaluating an extracted grammar/lexicon
• Grammar/lexicon size?
  – Depends on head table, argument/adjunct distinction, treatment of null elements, mapping of Treebank labels/POS tags to categories in extracted grammar etc.
  – For TAGs, between 3,000-8,500 elementary tree types, and 100,000-130,000 lexical entries.
• Lexical coverage?
  – For TAGs, around 92-93%
• Distribution of tree types?
• Convergence?
• Quality?
  – Inspection, comparison with manual grammar
References: TAG extraction
TAG:
A.K. Joshi and Y. Schabes (1996) Tree Adjoining Grammars. In G. Rozenberg and A. Salomaa, Eds., Handbook of Formal Languages
TAG extraction:
F. Xia. Automatic Grammar Generation From Two Different Perspectives. PhD thesis, University of Pennsylvania, 2001.
J. Chen, S. Bangalore, K. Vijay-Shanker. Automated Extraction of Tree-Adjoining Grammars from Treebanks, Natural Language Engineering (forthcoming)
Also: L. Shen and A.K. Joshi, Building an LTAG Treebank, Technical Report MS-CIS-05-15, CIS Department, University of Pennsylvania, 2005
Parsing with extracted TAGs:
D. Chiang. Statistical parsing with an automatically extracted tree adjoining grammar. In Data Oriented Parsing, CSLI Publications, pages 299–316.
L. Shen and A.K. Joshi. Incremental LTAG parsing, HLT/EMNLP 2005
Penn-II-Based Acquisition of LFG Resources
Lexical-Functional Grammar
Penn-II-Based Acquisition of LFG Resources
• Introduction
• Treebank Preprocessing/Clean-Up
• Treebank Annotation/Conversion
• Grammar and Lexicon Extraction
• Parsing (Architectures, Probability Models, Evaluation)
• Generation (Architectures, Probability Models, Evaluation)
• Other (Semantics, Domain Variation, …)
Introduction: Penn-II & LFG
• If we had an f-structure annotated version of Penn-II, we could use (standard) machine learning methods to extract probabilistic, wide-coverage LFG resources
• How do we get f-structure annotated Penn-II?
• Manually? No: 50,000 trees …!
• Automatically! Yes: F-Structure annotation algorithm … !
• Penn-II is a 2nd generation treebank – contains lots of annotations to support derivation of deep meaning representations: trees, Penn-II “functional” tags, traces & coindexation – f-structure annotation algorithm can exploit those.
Introduction: Penn-II & LFG
• What is the task?
• Given a Penn-II tree, the f-structure annotation algorithm has to traverse the tree and associate all tree nodes with f-structure equations (including lexical equations at the leaves of the tree).
• A simple example
Introduction: Penn-II & LFG
Factory payrolls fell in September.
(S (NP-SBJ[↑subj=↓] (NN[↓∈↑adjunct] Factory)
                    (NNS[↑=↓] payrolls))
   (VP[↑=↓] (VBD[↑=↓] fell)
            (PP-TMP[↓∈↑adjunct] (IN[↑=↓] in)
                                (NP[↑obj=↓] (NNP[↑=↓] September)))))
Introduction: Penn-II & LFG
subj : pred : payroll
       num : pl
       pers : 3
       adjunct : 2 : pred : factory
                     num : sg
                     pers : 3
adjunct : 1 : pred : in
              obj : pred : september
                    num : sg
                    pers : 3
pred : fall
tense : past
Treebank Preprocessing/Clean-Up: Penn-II & LFG
• Penn-II treebank: often flat analyses (coordination, NPs …), a certain amount of noise: inconsistent annotations, errors …
• No treebank preprocessing or clean-up in the LFG approach
• Take Penn-II treebank as is, but
• Remove all trees with FRAG or X labelled constituents
• Frag = fragments, X = not known how to annotate
• Total of 48,424 trees as they are.
Treebank Annotation: Penn-II & LFG
• Annotation-based (rather than conversion-based)
• Automatic annotation of nodes in Penn-II treebank trees with f-structure equations
• F-structure Annotation Algorithm
• Annotation Algorithm exploits:
  – Head information
  – Categorial information
  – Configurational information
  – Penn-II functional tags
  – Trace information
Treebank Annotation: Penn-II & LFG
• Architecture of a modular algorithm to assign LFG f-structure equations to trees in the Penn-II treebank:
  Head-Lexicalisation [Magerman, 1994]
  → Left-Right Context Annotation Principles
  → Coordination Annotation Principles
  → Catch-All and Clean-Up (yields proto f-structures)
  → Traces (yields proper f-structures)
Treebank Annotation: Penn-II & LFG
• Head Lexicalisation: modified rules based on (Magerman, 1994)
Treebank Annotation: Penn-II & LFG
Left-Right Context Annotation Principles:
• Head of NP likely to be rightmost noun …
• Mother → Left Context, Head, Right Context
Treebank Annotation: Penn-II & LFG
Left-Right Annotation Matrix for NP:
  Left Context             | Head            | Right Context
  DT: ↑spec:det=↓          | NN, NNS: ↑=↓    | NP: ↓∈↑app
  QP: ↑spec:quant=↓        |                 | PP: ↓∈↑adjunct
  JJ, ADJP: ↓∈↑adjunct     |                 | S, SBAR: ↓∈↑relmod
Example:
  (NP (DT a) (ADJP (RB very) (JJ politicized)) (NN deal))
  → (NP (DT[↑spec:det=↓] a) (ADJP[↓∈↑adjunct] (RB very) (JJ politicized)) (NN[↑=↓] deal))
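
How such a matrix drives annotation can be sketched in Python. The sketch is an assumption-laden illustration, not the course's implementation: the dictionary `NP_MATRIX` encodes only the NP entries shown on the slide (with set membership written as ∈), the head is simply taken to be the rightmost daughter with a head category, and `find_head` and `annotate` are hypothetical names.

```python
NP_MATRIX = {
    "left": {"DT": "↑spec:det=↓", "QP": "↑spec:quant=↓",
             "JJ": "↓∈↑adjunct", "ADJP": "↓∈↑adjunct"},
    "head": {"NN": "↑=↓", "NNS": "↑=↓"},
    "right": {"NP": "↓∈↑app", "PP": "↓∈↑adjunct",
              "S": "↓∈↑relmod", "SBAR": "↓∈↑relmod"},
}


def find_head(daughters, head_cats):
    """Index of the rightmost daughter with a head category, or None."""
    for i in range(len(daughters) - 1, -1, -1):
        if daughters[i] in head_cats:
            return i
    return None


def annotate(daughters, matrix):
    """Pair each daughter category with an f-structure equation (None if the
    matrix has no entry, leaving the node for a later catch-all step)."""
    head = find_head(daughters, matrix["head"])
    if head is None:
        return [(d, None) for d in daughters]
    annotated = []
    for i, d in enumerate(daughters):
        side = "head" if i == head else ("left" if i < head else "right")
        annotated.append((d, matrix[side].get(d)))
    return annotated


# The example rule NP -> DT ADJP NN ("a very politicized deal").
result = annotate(["DT", "ADJP", "NN"], NP_MATRIX)
for category, equation in result:
    print(category, equation)
```

Because the matrix is indexed by position relative to the head rather than by whole rules, the same table also annotates rules that were never seen when the matrix was written.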
Treebank Annotation: Penn-II & LFG
• Construct an annotation matrix for each of the monadic categories (without -Fun tags) in Penn-II
• Each matrix is based on analysing the most frequent rule types for that category, such that the summed token frequency of these rule types is greater than 85% of the total number of rule tokens for that category
Number of rule types covering 100% and 85% of rule tokens:
Category   100%    85%
NP         6595    102
VP         10239   307
S          2602    20
ADVP       234     6
• Apply the annotation matrix to all (i.e. also unseen) rules/sub-trees, including those for NP-LOC, NP-TMP etc.
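The 85% criterion amounts to taking rule types in descending frequency until their summed token count exceeds the threshold. A minimal sketch, assuming rule tokens have already been counted per category; the function name rules_covering and the toy counts are hypothetical and are not Penn-II figures.

```python
from collections import Counter


def rules_covering(rule_counts, threshold=0.85):
    """Return the most frequent rule types whose summed token
    frequency is greater than `threshold` of all rule tokens."""
    total = sum(rule_counts.values())
    selected, covered = [], 0
    for rule, count in rule_counts.most_common():
        if covered > threshold * total:
            break
        selected.append(rule)
        covered += count
    return selected


# Toy NP rule counts (illustrative only)
toy_np = Counter({
    "NP -> DT NN": 50,
    "NP -> NNP": 25,
    "NP -> DT JJ NN": 15,
    "NP -> NP PP": 7,
    "NP -> DT ADJP NN": 3,
})
core_rules = rules_covering(toy_np)
```

On the toy data three of five rule types already cover 90% of the tokens, which is the same long-tail effect the table reports for NP (102 of 6595 types).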
Treebank Annotation: Penn-II & LFG
• Co-ordination Annotation Principles
• Often flat Penn-II analysis of coordination:
[Figure: flat coordination tree; labels: Co-ordinated Element, Object, Modifier]
Treebank Annotation: Penn-II & LFG
• Unlike constituents coordination:
Co-ordinated Element
Treebank Annotation: Penn-II & LFG
Traces Module:
• Long Distance Dependencies
• Topicalisation
• Wh- and wh-less questions
• Relative clauses
• Passivisation
• Control constructions
• ICH (interpret constituent here)
• RNR (right node raising)
• …
• Translate Penn-II traces and coindexation into corresponding reentrancy in f-structure
Treebank Annotation: WH-Relative Clauses
Treebank Annotation: Wh-Less Relative Clauses
Treebank Annotation: Control & Wh-Rel. LDD
Treebank Annotation: Adv. Relative Clause
Treebank Annotation: Right Node Raising
Treebank Annotation: Penn-II & LFG
Catch-All and Clean-Up Module:
• Penn-II Functional Tags are used to identify potential errors
– e.g. Nodes with the tag -SBJ should be annotated as the subject …
• Correction of Overgeneralisations
– e.g. Change a second OBJ annotation to OBJ2 …
– e.g. Change arguments of head nouns erroneously annotated as relative clauses to COMP arguments:
• … signs [that managers expect declines]_RELCL …
• … signs [that managers expect declines]_COMP …
• Unannotated Nodes
– Defaults …
Treebank Annotation: Penn-II & LFG
Head-Lexicalisation [Magerman, 1994]
Left-Right Context Annotation Principles
Coordination Annotation Principles
Catch-All and Clean-Up
Traces
Proto F-Structures
Proper F-Structures
Constraint Solver
Treebank Annotation: Penn-II & LFG
• Collect f-structure equations
• Send to constraint solver
• Generates f-structures
• F-structure annotation algorithm implemented in Java, constraint solver in Prolog
• ~3 min annotating approx. 50,000 Penn-II trees
• ~5 min producing approx. 50,000 f-structures
Treebank Annotation: Penn-II & LFG
Evaluation (Quantitative):
• Burke (2006)
• Coverage:
Over 99.8% of Penn-II sentences (without X and FRAG constituents) receive a single covering and connected f-structure:
No. of f-structures   Sentences   Percentage
0                     45          0.093%
1                     48329       99.804%
2                     50          0.103%
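As a quick consistency check, the percentages can be recomputed from the raw counts. A small sketch; the variable names are illustrative only.

```python
# Counts of sentences by number of f-structure fragments, from the slide.
counts = {0: 45, 1: 48329, 2: 50}

total = sum(counts.values())
percent = {n: round(100 * c / total, 3) for n, c in counts.items()}
```

The counts sum to 48,424, the number of trees annotated, and the rounded percentages agree with the figures in the table.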
Evaluation (Qualitative):
• Burke (2006)
• F-structure quality evaluation against DCU 105, a manually annotated dependency gold standard of 105 sentences randomly extracted from WSJ section 23.
• Triples are extracted from the gold standard and the automatically produced f-structures using the evaluation software from (Crouch et al. 2002) and (Riezler et al. 2002)
relation(predicate~0, argument~1)
• Results calculated in terms of Precision and Recall
Treebank Annotation: Penn-II & LFG
• Precision and Recall for DCU 105 Dependency Bank results are calculated for All Annotations and for Preds-Only
DCU 105 All Annotations Preds-Only
Precision 97.06% 94.28%
Recall 96.80% 94.28%
Treebank Annotation: Penn-II & LFG
DCU 105
Feature Precision Recall F-Score
adjunct    892/968 = 92    892/950 = 94    93
app        16/16 = 100     16/19 = 84      91
comp       88/92 = 96      88/102 = 86     91
coord      153/184 = 83    153/167 = 92    87
obj        442/459 = 96    442/461 = 96    96
obl        50/52 = 96      50/61 = 82      88
oblag      12/12 = 100     12/12 = 100     100
passive    76/79 = 96      76/80 = 95      96
poss       74/79 = 94      74/81 = 91      92
quant      40/64 = 62      40/52 = 77      69
relmod     46/48 = 96      46/50 = 92      94
subj       396/412 = 96    396/414 = 96    96
topic      13/13 = 100     13/13 = 100     100
topicrel   46/49 = 94      46/52 = 88      91
xcomp      145/153 = 95    145/146 = 99    97
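Each row of the table follows from three counts: matching triples, triples produced by the system, and triples in the gold standard. The sketch below recomputes one row and shows the same calculation on sets of dependency triples. It is not the (Crouch et al. 2002) evaluation software; the function names and the toy triples are assumptions for illustration.

```python
def prf_from_counts(matched, n_test, n_gold):
    """Precision, recall and F-score from raw triple counts."""
    precision = matched / n_test
    recall = matched / n_gold
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def prf_from_triples(gold, test):
    """Same computation, starting from two sets of dependency triples."""
    return prf_from_counts(len(gold & test), len(test), len(gold))


# 'adjunct' row: 892 matches, 968 produced, 950 in the gold standard
p, r, f = prf_from_counts(892, 968, 950)
adjunct_row = (round(100 * p), round(100 * r), round(100 * f))

# Toy triples in relation(predicate~position, argument~position) style
gold = {("subj", "sign~1", "U.N.~0"), ("obj", "sign~1", "treaty~2"),
        ("subj", "say~5", "paper~4")}
test = {("subj", "sign~1", "U.N.~0"), ("obj", "sign~1", "treaty~2"),
        ("subj", "say~5", "treaty~2")}
toy_p, toy_r, toy_f = prf_from_triples(gold, test)
```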
Treebank Annotation: Penn-II & LFG
• Following (Kaplan et al. 2004), Precision and Recall for the PARC 700 Dependency Bank are calculated for:
– all annotations
– PARC features
– preds-only
• Mapping required
• (Burke 2006)
PARC 700    PARC features
Precision   88.31%
Recall      86.38%
Grammar and Lexicon Extraction: Penn-II & LFG
Lexical Resources:
• Lexical information extremely important in modern lexicalised grammar formalisms
• LFG, HPSG, CCG, TAG, …
• Lexicon development is time consuming and extremely expensive
• Rarely if ever complete
• Familiar knowledge acquisition bottleneck …
• Subcategorisation frame induction (LFG semantic forms) from f-structure annotated version of Penn-II and -III
• Evaluation against COMLEX
Grammar and Lexicon Extraction: Penn-II & LFG
• Lexicon Construction
– Manual vs. Automated
Our Approach:
– F-Structure Annotation of Penn-II and Penn-III
– Frames not Predefined
– Functional and Categorial Information
– Parameterised for Prepositions and Particles
– Active and Passive
– Long Distance Dependencies
– Conditional Probabilities
Grammar and Lexicon Extraction: Penn-II & LFG
• Extraction Methodology
– Automatic F-Structure Annotation of Penn-II & III
– Lexical Extraction Algorithm
– Examples
• Evaluation
– Gold Standards (COMLEX, OALD)
– Experimental Architecture
– Results
Grammar and Lexicon Extraction: Penn-II & LFG
sign<subj,obj>
Grammar and Lexicon Extraction: Penn-II & LFG
• Semantic Forms: PRED<GF1, GF2, …, GFn>
• Governable Grammatical Functions (Arguments)
– SUBJ, OBJ, OBJθ, OBL, OBLθ, COMP, XCOMP, PART…
• Non-Governable Grammatical Functions (Adjuncts)
– ADJ, XADJ, APP, RELMOD, …
Grammar and Lexicon Extraction: Penn-II & LFG
Penn-II Treebank
Automatic F-Structure Annotation Algorithm
LFG F-Structures
Extraction Algorithm
Semantic Forms
Grammar and Lexicon Extraction: Penn-II & LFG
Extraction Algorithm:
For each f-structure F
    For each level of embedding in F
        Determine the local predicate PRED
        Collect all subcategorisable grammatical functions GF1, …, GFn
        Return: PRED<GF1, GF2, …, GFn>
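The algorithm above can be sketched as a recursive walk over nested attribute-value structures. This is an illustrative sketch, not the authors' implementation: f-structures are modelled as Python dicts, GOVERNABLE is an assumed list of subcategorisable functions, treating pform as the local predicate of an oblique is an assumption made so that the preposition gets its own entry, and entries with no arguments are dropped purely for brevity.

```python
# Assumed set of governable (subcategorisable) grammatical functions.
GOVERNABLE = {"subj", "obj", "obj2", "obl", "comp", "xcomp", "part"}


def extract_semantic_forms(fstr):
    """Collect PRED([GF1,...,GFn]) strings at every level of embedding."""
    forms = []
    pred = fstr.get("pred") or fstr.get("pform")  # assumption: pform acts as PRED
    if pred is not None:
        gfs = []
        for attr, value in fstr.items():
            if attr not in GOVERNABLE:
                continue
            if attr == "obl" and isinstance(value, dict) and "pform" in value:
                gfs.append("obl:" + value["pform"])  # parameterise for the preposition
            else:
                gfs.append(attr)
        if gfs:  # skip argument-less entries for brevity
            forms.append("{}([{}])".format(pred, ",".join(gfs)))
    for value in fstr.values():
        children = value if isinstance(value, list) else [value]
        for child in children:
            if isinstance(child, dict):
                forms.extend(extract_semantic_forms(child))
    return forms


# Simplified f-structure for "The inquiry soon focused on the judge"
inquiry = {
    "subj": {"spec": {"det": {"pred": "the"}}, "pred": "inquiry",
             "num": "sg", "pers": 3},
    "adjunct": [{"pred": "soon"}],
    "pred": "focus",
    "tense": "past",
    "obl": {"pform": "on",
            "obj": {"spec": {"det": {"pred": "the"}}, "pred": "judge",
                    "num": "sg", "pers": 3}},
}
semantic_forms = extract_semantic_forms(inquiry)
```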
Grammar and Lexicon Extraction: Penn-II & LFG
subj : spec : det : pred : the
       pred : inquiry
       num : sg
       pers : 3
adjunct : 1 : pred : soon
pred : focus
tense : past
obl : pform : on
      obj : spec : det : pred : the
            pred : judge
            num : sg
            pers : 3
“The inquiry soon focused on the judge” (wsj_0267_72)
Prepositions and OBLs:
focus([subj,obl:on])
on([obj])
Grammar and Lexicon Extraction: Penn-II & LFG
topic : index : [1]
        subj : spec : det : pred : the
               pred : government
               num : sing
               pers : 3
        …
        pred : have
        tense : pres
subj : spec : det : pred : the
       pred : treasury
       num : sing
       pers : 3
comp : index : [1]
       subj : spec : det : pred : the
              pred : government
              num : sing
              pers : 3
       …
       pred : have
       tense : pres
pred : say
tense : past
“Until Congress acts , the government hasn't any authority to issue new debt obligations of any kind, the Treasury said.” (wsj_0008_2)
LDDs:
say([subj,comp])
Grammar and Lexicon Extraction: Penn-II & LFG
subj : pred : pro
       pron_form : it
passive : +
to_inf : +
pred : be
xcomp : subj : pred : pro
               pron_form : it
        passive : +
        pred : consider
        tense : past
        obl : pform : as
              obj : spec : det : pred : a
                    …
                    pred : risk
                    num : sg
                    pers : 3
“… to be considered as an additional risk for the investor…” (wsj_0018_14)
Passive:
consider([subj,obl:as],p)
Grammar and Lexicon Extraction: Penn-II & LFG
subj : spec : det : pred : the
                    cat : dt
       pred : inquiry
       num : sg
       pers : 3
       cat : nn
adjunct : 1 : pred : soon
              cat : rb
pred : focus
tense : past
cat : vbd
obl : pform : on
      obj : spec : det : pred : the
                         cat : dt
            pred : judge
            num : sg
            pers : 3
            cat : nn
“The inquiry soon focused on the judge.” (wsj_0267_72)
CFG categories:
focus(v,[subj,obl:on])
focus(v,[subj(n),obl:on])
Grammar and Lexicon Extraction: Penn-II & LFG
Lexicon extracted from Penn-II (O’Donovan et al 2005):
Semantic Form                  Conditional Probability
accept([subj,obj])             0.813
accept([subj],p)               0.060
accept([subj,comp])            0.033
accept([subj,obl:as],p)        0.020
accept([subj,obj,obl:as])      0.020
accept([subj,obj,obl:from])    0.020
accept([subj])                 0.013
Others                         0.021

                  Without Prep/Part   With Prep/Part
Lemmas            3586                3586
Semantic Forms    10969               14348
Frame Types       38                  577
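The conditional probabilities in the table can be understood as relative frequencies of a frame given its lemma. A minimal sketch under that assumption; the function name and the toy observations are hypothetical, and the counts are not taken from Penn-II.

```python
from collections import Counter


def conditional_probs(observations):
    """Estimate P(frame | lemma) as count(lemma, frame) / count(lemma)."""
    pair_counts = Counter(observations)
    lemma_counts = Counter(lemma for lemma, _ in observations)
    return {
        (lemma, frame): count / lemma_counts[lemma]
        for (lemma, frame), count in pair_counts.items()
    }


# Toy observations: (lemma, frame) pairs as they might come out of extraction
observations = (
    [("accept", "[subj,obj]")] * 8
    + [("accept", "[subj,comp]")] * 2
    + [("focus", "[subj,obl:on]")] * 3
)
probs = conditional_probs(observations)
```

By construction the probabilities for each lemma sum to one, as the eight rows for accept in the table do.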
Grammar and Lexicon Extraction: Penn-II & LFG
• Evaluation for all active verbs (2992) extracted from Penn-II against COMLEX
• Largest evaluation for English subcat frame extraction system
– Carroll and Rooth (1998): 200 verbs
– Schulte im Walde (2000): over 3000 German verbs
• (VERB :ORTH “reimburse” :SUBC ((NP-NP)
(NP-PP :PVAL (“for”))
(NP)))
• (vp-frame np-np :cs ((np 2)(np 3))
:gs (:subject 1 :obj 2 :obj2 3)
:ex “she asked him his name”)
Grammar and Lexicon Extraction: Penn-II & LFG
• Following Schulte im Walde (2000):
• Experiment 1: Exclude prepositional phrases entirely (e.g. [subj,obl:on] is [subj])
• Experiment 2: Include prepositional phrase but not specific preposition (e.g. [subj,obl]).
– 2a (+ Part value)
• Experiment 3: Include details of specific preposition (e.g. [subj,obl:on])
– 3a (+ Part value)
• Relative Thresholds of 1% and 5%
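The three experiments differ only in how much prepositional detail is kept in a frame, and the relative threshold filters out low-probability frames before comparison with the gold standard. A sketch of both steps, assuming a frame is kept when its conditional probability is at least the threshold; the function names are hypothetical, and the probabilities are the accept figures from the earlier table.

```python
def normalise(gfs, experiment):
    """Map a frame's grammatical functions to the granularity of an experiment."""
    if experiment == 1:   # exclude prepositional phrases entirely
        return [g for g in gfs if not g.startswith("obl")]
    if experiment == 2:   # keep the OBL but not the specific preposition
        return [g.split(":")[0] if g.startswith("obl") else g for g in gfs]
    return list(gfs)      # experiment 3: keep the specific preposition


def apply_threshold(frame_probs, threshold):
    """Keep frames whose conditional probability reaches the relative threshold."""
    return {frame for frame, p in frame_probs.items() if p >= threshold}


ACCEPT = {
    "accept([subj,obj])": 0.813,
    "accept([subj],p)": 0.060,
    "accept([subj,comp])": 0.033,
    "accept([subj,obl:as],p)": 0.020,
    "accept([subj,obj,obl:as])": 0.020,
    "accept([subj,obj,obl:from])": 0.020,
    "accept([subj])": 0.013,
}
kept_at_1 = apply_threshold(ACCEPT, 0.01)
kept_at_5 = apply_threshold(ACCEPT, 0.05)
```

Raising the threshold from 1% to 5% keeps only two of the seven listed accept frames, which illustrates why the higher threshold trades recall for precision.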
Grammar and Lexicon Extraction: Penn-II & LFG
           Threshold of 1%          Threshold of 5%
           P       R       F        P       R       F
Exp. 1     79.0%   59.6%   68.0%    83.5%   54.7%   66.1%
Exp. 2     77.1%   50.4%   61.0%    81.4%   44.8%   57.8%
Exp. 2a    76.4%   44.5%   56.3%    80.9%   39.0%   52.6%
Exp. 3     73.7%   22.1%   34.0%    78.0%   18.3%   29.6%
Exp. 3a    73.3%   19.9%   31.3%    77.6%   16.2%   26.8%
Grammar and Lexicon Extraction: Penn-II & LFG
• Directional Prepositions (about, across, along, around, behind, below, beneath, between, beyond, by, down, from…) included in COMLEX by “default” for verbs that have at least one p-dir …
(VERB :ORTH "cycle" :SUBC ((PP :PVAL ("p-dir"))))
            Exp. 3   Exp. 3a
Recall      40.8%    35.4%
Increase    18.7%    15.5%
F-Score     54.4%    49.7%
Increase    20.4%    18.4%
Grammar and Lexicon Extraction: Penn-II & LFG
• Penn-III = Penn-II + the parsed section of the Brown Corpus
– About 300,000 words of the 1-million-word Brown Corpus
– Balanced corpus (8 genres), e.g. Humour, Science Fiction etc.
• Subcategorisation variation across domains
• More data, more verbs
• -CLR tag (closely related)
Grammar and Lexicon Extraction: Penn-II & LFG
• Applications:
• Porting to other languages
– German (TIGER)
– Spanish (CAST3LB)
– Chinese (CTB-I and II)
• LDD resolution in parsing new text (Cahill et al., 2004)
Grammar and Lexicon Extraction: Penn-II & LFG
Parsing-Based Subcat Frame Extraction (O’Donovan 2006):
• Treebank-based vs. parsing-based subcat frame extraction
• We parsed British National Corpus BNC (100 million words) with our automatically induced LFGs
• 19 days on single machine: ~5 million words per day
• Subcat frame extraction for ~10,000 verb lemmas
• Evaluation against COMLEX and OALD
• Evaluation against Korhonen (2002) gold standard
• Our method is statistically significantly better
Parsing: Penn-II and LFG
• Overview Parsing Architectures:
Pipeline & Integrated
• Long-Distance Dependency Resolution at F-Structure
• Evaluation
Parsing: Penn-II and LFG
• PCFG consists of CFG rules with associated probabilities
• A-PCFG treats strings consisting of CFG categories followed by 1 or more functional annotation(s) as monadic categories (e.g. NP[up-obj=down] )
• Probabilistic parsing technology (PCFGs, History-Based and Lexicalised Parsers) produces trees without LDDs
• Exceptions: (Collins 1999): wh-relclauses; (Johnson 2002) post-processing; …
• In our (standard) architecture new text is parsed into proto f-structures.
• LDD resolution at f-structure
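The A-PCFG idea, treating a category together with its functional annotation as one atomic symbol, means that an ordinary relative-frequency PCFG estimate can be used unchanged. A minimal sketch of that estimate on toy rules; the function name and the rule tokens are illustrative, not drawn from the treebank.

```python
from collections import Counter


def estimate_pcfg(rule_tokens):
    """Relative-frequency estimate: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(rule_tokens)
    lhs_counts = Counter(lhs for lhs, _ in rule_tokens)
    return {
        (lhs, rhs): count / lhs_counts[lhs]
        for (lhs, rhs), count in rule_counts.items()
    }


# Annotated categories such as "NP[up-obj=down]" are just opaque strings here.
transitive = ("VP[up=down]", ("VBZ[up=down]", "NP[up-obj=down]"))
intransitive = ("VP[up=down]", ("VBZ[up=down]",))
rule_probs = estimate_pcfg([transitive] * 3 + [intransitive])
```

Because "NP[up-obj=down]" and a bare "NP" are different symbols, the annotated grammar has more categories than the plain PCFG, and the parser's output trees carry their f-structure equations with them.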
Parsing: Penn-II and LFG
• Penn-II tree with traces and co-indexation for LDDs
“U.N. signs treaty, the paper said”
(S (S-1 (NP (NNP U.N.))
        (VP (VBZ signs)
            (NP (NN treaty))))
   (NP (DT the) (NN paper))
   (VP (VBD said)
       (S (-NONE- *T*-1))))
Parsing: Penn-II and LFG
• Trace and coindexation in tree translated into reentrancy at f-structure by annotation algorithm:
“U.N. signs treaty, the headline said”
Parsing: Penn-II and LFG
• Parse tree from PCFG and History-Based Parsers without traces:
“U.N. signs treaty, the paper said”
(S (S (NP (NNP U.N.))
      (VP (VBZ signs)
          (NP (NN treaty))))
   (NP (DT the) (NN paper))
   (VP (VBD said)))
Parsing: Penn-II and LFG
• Basic, but possibly incomplete, predicate-argument structures (proto-f-structures):
“U.N. signs treaty, the headline said”
Parsing: Penn-II and LFG
• Require:
– subcategorisation frames (O’Donovan et al., 2004, 2005; O’Donovan 2006)
– functional uncertainty equations
• Previous Example:
– say([subj,comp])
– topic = comp* comp (search along a path of 0 or more comps)
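The functional uncertainty equation topic = comp* comp describes a regular language over grammatical function names, so checking whether a concrete path is licensed is a regular-expression match. A small sketch; the colon-separated path encoding follows the up-xcomp:comp style used elsewhere in these slides, and the names FU_TOPIC and licensed are hypothetical.

```python
import re

# comp* comp: zero or more comp steps followed by a final comp.
FU_TOPIC = re.compile(r"(?:comp:)*comp")


def licensed(path):
    """True if the colon-separated GF path is an instance of comp* comp."""
    return FU_TOPIC.fullmatch(path) is not None


checks = {p: licensed(p) for p in ["comp", "comp:comp", "xcomp:comp", "subj"]}
```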
Parsing: Penn-II and LFG
Subcat Frames:
• Automatically acquired from the automatically f-structure-annotated Penn-II Treebank following (O’Donovan et al. 2004, 2005; O’Donovan 2006)
• Distinction between active and passive frames
• Associated with probabilities
• O’Donovan et al. evaluate against the COMLEX resource
• Extracted from sections 02-21
• 10960 active lemma-frame types (semantic forms/subcat frames), 2241 passive types
Parsing: Penn-II and LFG
Functional Uncertainty equations:
• Automatically acquire finite approximations of FU-equations
• Extract paths between co-indexed material in automatically generated f-structures from sections 02-21 of Penn-II
• 26 TOPIC, 60 TOPICREL, 13 FOCUS path types
• 99.69% coverage of paths in section 23
• Each path type associated with a probability
Parsing: Penn-II and LFG
Sample TOPICREL paths with frequencies:
up-subj               7894
up-obj                1167
up-xcomp              956
up-xcomp:obj          793
up-xcomp:xcomp        161
up-xcomp:xcomp:obj    135
up-comp:subj          119
up-xcomp:subj         92
Sample TOPIC paths with probabilities:
up-topic=up-comp          0.940
up-topic=up-xcomp:comp    0.006
up-topic=up-comp:comp     0.001
Parsing: Penn-II and LFG
LDD Resolution Algorithm: recursively traverse an f-structure and
– find TOPIC:T attribute-value pair
– retrieve TOPIC paths
– for each path p of the form GF1:…:GFn:GF, traverse the f-structure along the TOPIC path GF1:…:GFn to local sub-f-structure g
• at g retrieve local PRED:P
• add GF:T to g iff
– GF is not present at g
– g together with GF is locally complete and coherent with respect to a semantic form s for P
– multiply path and semantic form probabilities involved to rank resolution
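The steps above can be sketched for the TOPIC case as follows. This is a simplified illustration, not the authors' system: f-structures are plain dicts, only the top-level TOPIC is handled (no recursive traversal), completeness and coherence are approximated by requiring that the governable functions at g plus the new GF exactly match one known frame, and FRAMES, TOPIC_PATHS and GOVERNABLE are small hand-filled tables using the say probabilities quoted in these slides.

```python
GOVERNABLE = {"subj", "obj", "obj2", "obl", "comp", "xcomp"}

# Frames keyed by predicate, with arguments stored as sorted tuples.
FRAMES = {"say": {("subj",): 0.06, ("comp", "subj"): 0.87, ("subj", "xcomp"): 0.02}}

# Finite approximations of the TOPIC functional uncertainty equation.
TOPIC_PATHS = {("comp",): 0.940, ("xcomp", "comp"): 0.006, ("comp", "comp"): 0.001}


def resolve_topic(fstr):
    """Return ranked (probability, path) candidates for resolving TOPIC."""
    if "topic" not in fstr:
        return []
    candidates = []
    for path, path_prob in TOPIC_PATHS.items():
        *prefix, gf = path
        g = fstr
        for step in prefix:  # walk GF1:...:GFn to the local sub-f-structure g
            g = g.get(step) if isinstance(g, dict) else None
            if g is None:
                break
        if not isinstance(g, dict) or gf in g or "pred" not in g:
            continue  # path does not exist, or GF is already present at g
        local = tuple(sorted(a for a in list(g) + [gf] if a in GOVERNABLE))
        frame_prob = FRAMES.get(g["pred"], {}).get(local)
        if frame_prob is not None:  # g + GF matches a semantic form for PRED
            candidates.append((path_prob * frame_prob, path))
    return sorted(candidates, reverse=True)


# Proto-f-structure for "U.N. signs treaty, the paper said"
proto = {
    "topic": {"pred": "sign", "subj": {"pred": "U.N."}, "obj": {"pred": "treaty"}},
    "pred": "say",
    "subj": {"spec": "the", "pred": "paper"},
}
ranked = resolve_topic(proto)
if ranked:
    best_prob, best_path = ranked[0]
    target = proto
    for step in best_path[:-1]:
        target = target[step]
    target[best_path[-1]] = proto["topic"]  # reentrancy: same object as TOPIC
```

Sharing the same dict object between topic and comp is one simple way to model the reentrancy that the annotation algorithm creates from traces in the treebank.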
Parsing: Penn-II and LFG
Subcategorisation Frames:
say([subj])          0.06
say([comp,subj])     0.87
say([subj,xcomp])    0.02
...
FU-path approximations:
up-topic=up-comp          0.940
up-topic=up-xcomp:comp    0.006
up-topic=up-comp:comp     0.001
...
Proto-f-structure:
topic : pred : sign
        subj : pred : U.N.
        obj : pred : treaty
pred : say
subj : spec : the
       pred : paper
Resolution: path up-topic=up-comp (0.940) with frame say([comp,subj]) (0.87) adds
comp : pred : sign
       subj : pred : U.N.
       obj : pred : treaty
Parsing: Penn-II and LFG
• How do treebank-based constraint grammars compare to deep hand-crafted grammars like XLE and RASP?
• XLE (Riezler et al. 2002, Kaplan et al. 2004)
– hand-crafted, wide-coverage, deep, state-of-the-art English LFG and XLE parsing system with log-linear-based probability models for disambiguation
– PARC 700 Dependency Bank gold standard (King et al. 2003), Penn-II Section 23-based
• RASP (Carroll and Briscoe 2002)
– hand-crafted, wide-coverage, deep, state-of-the-art English probabilistic unification grammar and parsing system (RASP: Robust Accurate Statistical Parsing)
– CBS 500 Dependency Bank gold standard (Carroll, Briscoe and Sanfilippo 1999), Susanne-based
Parsing: Penn-II and LFG
• Choose best treebank-based LFG system to compare with XLE/RASP:
• C-structure engines (state-of-the-art history-based, lexicalised parsers):
– (Collins 1999)
– (Charniak 2000)
– (Bikel 2002)
• (Bikel 2002) retrained to retain Penn-II functional tags (-SBJ, -LOC, -TMP, -CLR, etc.)
• Pipeline architecture: tagged text → Bikel retrained + f-structure annotation algorithm + LDD resolution → f-structures → automatic conversion → evaluation against XLE/RASP gold standards PARC-700/CBS-500 dependency banks
• Systematic differences between our f-structures and PARC 700 and CBS 500 dependency representations
• Automatic conversion of our f-structures to PARC 700 / CBS 500 -like structures (Burke et al. 2004, Burke 2006, Cahill et al. under review)
• Best XLE and RASP resources with better results than those reported in literature to date
• (Crouch et al. 2002) and (Carroll and Briscoe 2002) evaluation software
• (Noreen 1989) Approximate Randomisation Test to test for statistical significance of results
Parsing: Penn-II and LFG
• Result dependency f-scores:
PARC 700, XLE vs. BKR-LFG:
– 80.55% XLE
– 83.08% BKR-LFG (+2.53%)
CBS 500, RASP vs. BKR-LFG:
– 76.57% RASP
– 80.23% BKR-LFG (+3.66%)
• Results statistically significant at 95% level, (Noreen 1989) Approximate Randomisation Test
• BKR-LFG = treebank-induced Lexical-Functional Grammar resources with Bikel retrained (BKR) as c-structure engine in pipeline architecture
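The approximate randomisation test asks how often a difference at least as large as the observed one arises when the two systems' outputs are randomly swapped sentence by sentence. The sketch below is a simplified variant that compares summed per-sentence scores rather than recomputing a corpus-level f-score after each shuffle; the function name, the default number of trials and the toy scores are all assumptions for illustration.

```python
import random


def approximate_randomisation(scores_a, scores_b, trials=10000, seed=0):
    """Estimate a p-value for the difference between two paired score lists."""
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b))
    at_least_as_large = 0
    for _ in range(trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:  # swap the two systems' outputs for this item
                a, b = b, a
            diff += a - b
        if abs(diff) >= observed:
            at_least_as_large += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (at_least_as_large + 1) / (trials + 1)


p_same = approximate_randomisation([0.8, 0.7, 0.9], [0.8, 0.7, 0.9])
p_diff = approximate_randomisation([1.0] * 20, [0.0] * 20)
```

A result is reported as significant at the 95% level when the estimated p-value is below 0.05.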
Parsing: Penn-II and LFG
PARC 700 Evaluation:
Probability Models: Penn-II & LFG
Probability Models:
• Our approach does not constitute a proper probability model (Abney, 1996)
• Why? Probability model leaks:
• Highest ranking parse tree may feature f-structure equations that cannot be resolved into an f-structure
• Probability associated with that parse tree is lost
• Doesn’t happen often in practice (coverage >99.5% on unseen data)
• Research on appropriate discriminative, log-linear or maximum entropy models is important (Miyao and Tsujii, 2002) (Riezler et al. 2002)
Generation: Penn-II & LFG
Cahill and van Genabith, 2006
Generation: the Good, the Bad and the Ugly
• Orig: Supporters of the legislation view the bill as an effort to add stability and certainty to the airline-acquisition process , and to preserve the safety and fitness of the industry .
• Gen: Supporters of the legislation view the bill as an effort to add stability and certainty to the airline-acquisition process , and to preserve the safety and fitness of the industry.
• Orig: The upshot of the downshoot is that the A 's go into San Francisco 's Candlestick Park tonight up two games to none in the best-of-seven fest .
• Gen: The upshot of the downshoot is that the A 's tonight go into San Francisco 's Candlestick Park up two games to none in the best-of-seven fest .
• Orig: By this time , it was 4:30 a.m. in New York , and Mr. Smith fielded a call from a New York customer wanting an opinion on the British stock market , which had been having troubles of its own even before Friday 's New York market break .
• Gen: Mr. Smith fielded a call from New a customer York wanting an opinion on the market British stock which had been having troubles of its own even before Friday 's New York market break by this time and in New York , it was 4:30 a.m. .
• Orig: Only half the usual lunchtime crowd gathered at the tony Corney & Barrow wine bar on Old Broad Street nearby .
• Gen: At wine tony Corney & Barrow the bar on Old Broad Street nearby gathered usual , lunchtime only half the crowd , .
Domain Variation, Multilingual LFG Resources, etc.
• Domain variation: ATIS (Judge et al 2005) and QuestionBank (Judge et al 2006)
• F-Str -> (Q)LF Quasi-Logical Forms (Cahill et al. 2003)
• Multilingual treebank-based LFG acquisition:
– German: TIGER treebank (Cahill et al 2003), (Cahill et al 2005)
– Chinese: Chinese Penn Treebank (Burke et al 2004)
– Spanish: Cast3LB (O’Donovan et al 2005), (Chrupala and van Genabith 2006)
• GramLab Project at DCU (2005-2008): Chinese, Japanese, Arabic, Spanish, French and German
Demo System
• http://lfg-demo.computing.dcu.ie/lfgparser.html
Publications
A. Cahill and J. Van Genabith, Robust PCFG-Based Generation using Automatically Acquired LFG-Approximations, COLING/ACL 2006, Sydney, Australia
J. Judge, A. Cahill and J. van Genabith, QuestionBank: Creating a Corpus of Parse-Annotated Questions, COLING/ACL 2006, Sydney, Australia
G. Chrupala and J. van Genabith, Using Machine-Learning to Assign Function Labels to Parser Output for Spanish, COLING/ACL 2006, Sydney, Australia
M. Burke, Automatic Treebank Annotation for the Acquisition of LFG Resources, Ph.D. Thesis, School of Computing, Dublin City University, Dublin 9, Ireland. 2005
R. O’Donovan, Automatic Extraction of Large-Scale Multilingual Lexical Resources, Ph.D. Thesis, School of Computing, Dublin City University, Dublin 9, Ireland. 2005
R. O'Donovan, M. Burke, A. Cahill, J. van Genabith and A. Way. Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II and Penn-III Treebanks, Computational Linguistics, 2005
A. Cahill, M. Forst, M. Burke, M. McCarthy, R. O'Donovan, C. Rohrer, J. van Genabith and A. Way. Treebank-Based Acquisition of Multilingual Unification Grammar Resources; Journal of Research on Language and Computation; Special Issue on "Shared Representations in Multilingual Grammar Engineering", (eds.) E. Bender, D. Flickinger, F. Fouvry and M. Siegel, Kluwer Academic Press, 2005
R. O'Donovan, A. Cahill, J. van Genabith, and A. Way. Automatic Acquisition of Spanish LFG Resources from the CAST3LB Treebank; In Proceedings of the Tenth International Conference on LFG, Bergen, Norway, 2005
J. Judge, M. Burke, A. Cahill, R. O'Donovan, J. van Genabith, and A. Way. Strong Domain Variation and Treebank-Induced LFG Resources; In Proceedings of the Tenth International Conference on LFG, Bergen, Norway,2005
M. Burke, A. Cahill, J. van Genabith, and A. Way. Evaluating Automatically Acquired F-Structures against PropBank; In Proceedings of the Tenth International Conference on LFG, Bergen, Norway, 2005
M. Burke, A. Cahill, M. McCarthy, R.O'Donovan, J. van Genabith and A. Way. Evaluating Automatic F-Structure Annotation for the Penn-II Treebank; Journal of Language and Computation; Special Issue on "Treebanks and Linguistic Theories", (eds.) E. Hinrichs and K.Simov, Kluwer Academic Press. 2005. pages 523-547
A. Cahill. Parsing with Automatically Acquired, Wide-Coverage, Robust, Probabilistic LFG Approximations. Ph.D. Thesis. School of Computing, Dublin City University, Dublin 9, Ireland. 2004
M. Burke, O. Lam, A. Cahill, R. Chan, R. O'Donovan, A. Bodomo, J. van Genabith and A. Way; Treebank-Based Acquisition of a Chinese Lexical-Functional Grammar; Proceedings of the PACLIC-18 Conference, Waseda University, Tokyo, Japan, pages 161-172, 2004
M. Burke, A. Cahill, R. O'Donovan, J. van Genabith, and A. Way. The Evaluation of an Automatic Annotation Algorithm against the PARC 700 Dependency Bank, In Proceedings of the Ninth International Conference on LFG, Christchurch, New Zealand, pages 101-121, 2004
A. Cahill, M. Burke, R. O'Donovan, J. van Genabith, and A. Way. Long-Distance Dependency Resolution in Automatically Acquired Wide-Coverage PCFG-Based LFG Approximations, In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), July 21-26 2004, pages 320-327, Barcelona, Spain, 2004
R. O'Donovan, M. Burke, A. Cahill, J. van Genabith, and A. Way. Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II Treebank, In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), July 21-26 2004, pages 368-375, Barcelona, Spain, 2004
M. Burke, Cahill A., R. O' Donovan, J. van Genabith and A. Way. Treebank-Based Acquisition of Wide-Coverage, Probabilistic LFG Resources: Project Overview, Results and Evaluation, The First International Joint Conference on Natural Language Processing (IJCNLP-04), Workshop "Beyond shallow analyses - Formalisms and statistical modeling for deep analyses"; March 22-24, 2004 Sanya City, Hainan Island, China, 2004
Cahill A., M. Forst, M. McCarthy, R. O' Donovan, C. Rohrer, J. van Genabith and A. Way. Treebank-Based Multilingual Unification-Grammar Development. In the Proceedings of the Workshop on Ideas and Strategies for Multilingual Grammar Development, at the 15th European Summer School in Logic Language and Information, Vienna, Austria, 18th - 29th August 2003
Cahill A, M. McCarthy, J. van Genabith and A. Way. Quasi-Logical Forms for the Penn Treebank; In (eds.) Harry Bunt, Ielka van der Sluis and Roser Morante; Proceedings of the Fifth International Workshop on Computational Semantics, IWCS-05, January 15-17, 2003, Tilburg, The Netherlands, ISBN: 90-74029-24-8, pp.55-71, 2003
Cahill A, M. McCarthy, J. van Genabith and A. Way. Evaluating Automatic F-Structure Annotation for the Penn-II Treebank. TLT 2002, Treebanks and Linguistic Theories 2002, 20th and 21st September 2002, Sozopol, Bulgaria, (eds.) E. Hinrichs and K. Simov, Proceedings of the First Workshop on Treebanks and Linguistic Theories (TLT 2002), pp. 42-60, 2002
Cahill A, M. McCarthy, J. van Genabith and A. Way. Parsing with PCFGs and Automatic F-Structure Annotation, In M. Butt and T. Holloway-King (eds.): Proceedings of the Seventh International Conference on LFG CSLI Publications, Stanford, CA., pp.76--95. 2002
Cahill A, and J. van Genabith. TTS - A Treebank Tool; in LREC 2002, The Third International Conference on Language Resources and Evaluation, Las Palmas de Grand Canaria, Spain, May 27th--June 2nd, 2002, Proceedings of the Conference, Volume V, (eds.) M.G.Rodriguez and C.P. Suarez Arnajo, ISBN 2-9517408-0-8, pp. 1712-1717, 2002
Cahill A, M. McCarthy, J. van Genabith and A. Way. Automatic Annotation of the Penn-Treebank with LFG F-Structure Information; LREC 2002 workshop on Linguistic Knowledge Acquisition and Representation - Bootstrapping Annotated Language Data, LREC 2002, Third International Conference on Language Resources and Evaluation, post-conference workshop, June 1st, 2002, proceedings of the workshop, (eds.) A. Lenci, S. Montemagni and V. Pirelli, ELRA - European Language Resources Association, Paris France, pp. 8-15, 2002
Penn-II-Based Acquisition of CCG Resources
Combinatory Categorial Grammar
This lecture
• Recap: CCG
• Translating the Penn Treebank to CCG
  – The translation algorithm
  – CCGbank: the acquired grammar and lexicon
• Wide-coverage parsing with CCG
CCG: the machinery
• Categories: specify subcat lists of words/constituents.
• Combinatory rules: specify how constituents can combine.
• The lexicon: specifies which categories a word can have.
• Derivations: spell out process of combining constituents.
CCG categories
• Simple categories: NP, S, PP
• Complex categories: functions which return a result when combined with an argument:
  VP or intransitive verb: S\NP
  Transitive verb: (S\NP)/NP
  Adverb: (S\NP)\(S\NP)
  PPs: ((S\NP)\(S\NP))/NP
       (NP\NP)/NP
The combinatory rules
• Function application: λx.f(x) a ⇒ f(a)
  X/Y Y ⇒ X (>)
  Y X\Y ⇒ X (<)
• Function composition: λx.f(x) λy.g(y) ⇒ λx.f(g(x))
  X/Y Y/Z ⇒ X/Z (>B)
  Y\Z X\Y ⇒ X\Z (<B)
  X/Y Y\Z ⇒ X\Z (>Bx)
  Y/Z X\Y ⇒ X/Z (<Bx)
• Type-raising: a ⇒ λf.f(a)
  X ⇒ T/(T\X) (>T)
  X ⇒ T\(T/X) (<T)
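The rules above can be sketched in a few lines of Python. This is only an illustration, not code from the course: the class names `Atom` and `Func` and the function names are assumptions, features such as [dcl] are ignored, and only application, forward composition and forward type-raising are shown.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Atom:
    """Simple category such as NP or S (hypothetical representation)."""
    name: str

    def __str__(self):
        return self.name


@dataclass(frozen=True)
class Func:
    """Complex category: result, slash direction, argument."""
    result: "Cat"
    slash: str  # "/" looks for its argument to the right, "\\" to the left
    arg: "Cat"

    def __str__(self):
        def wrap(c):
            return f"({c})" if isinstance(c, Func) else str(c)
        return f"{wrap(self.result)}{self.slash}{wrap(self.arg)}"


Cat = Union[Atom, Func]


def forward_apply(left, right):
    """X/Y  Y  =>  X   (>)"""
    if isinstance(left, Func) and left.slash == "/" and left.arg == right:
        return left.result
    return None


def backward_apply(left, right):
    """Y  X\\Y  =>  X   (<)"""
    if isinstance(right, Func) and right.slash == "\\" and right.arg == left:
        return right.result
    return None


def forward_compose(left, right):
    """X/Y  Y/Z  =>  X/Z   (>B)"""
    if (isinstance(left, Func) and isinstance(right, Func)
            and left.slash == "/" and right.slash == "/"
            and left.arg == right.result):
        return Func(left.result, "/", right.arg)
    return None


def forward_type_raise(x, t):
    """X  =>  T/(T\\X)   (>T)"""
    return Func(t, "/", Func(t, "\\", x))


S, NP = Atom("S"), Atom("NP")
TV = Func(Func(S, "\\", NP), "/", NP)  # transitive verb: (S\NP)/NP

# A subject followed by a transitive verb with a missing object:
# type-raise the subject, then compose it with the verb.
subj = forward_type_raise(NP, S)        # S/(S\NP)
s_slash_np = forward_compose(subj, TV)  # S/NP
print(subj, s_slash_np)
```

The `S/NP` constituent built in the last step is the kind of "incomplete" sentence that an object relative pronoun can take as its argument.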
CCG derivations
• Canonical “normal-form” derivations (mostly function application):
• Alternative derivations:
Type-raising and Composition
• Wh-movement:
• Right-node raising:
CCG: semantics
• Every syntactic category and rule has a semantic counterpart:
From the Penn Treebank to CCG
• The basic translation algorithm
• Dealing with null elements
• Type-changing rules in the grammar
• Preprocessing
• CCGbank: The extracted lexicon/grammar
Input: Penn Treebank tree
• Flat phrase-structure tree
• Traces/null elements and indices represent underlying dependencies
• Function tags
Output: CCG derivation
• Binary derivation tree with explicit “deep” dependency structures and subcategorization information.
• No null elements
I. Identify heads, arguments, adjuncts
II. Binarise the tree
III. Assign CCG categories
Morphosyntactic Features
• Features on verbal categories: declarative, infinitival, past participle, present participle, passive
• Sentential features: wh-questions, yes-no questions, embedded questions, embedded declaratives, fragments, etc.
• CCGbank has no case or number distinction!
III. Assign CCG categories: adjuncts
III. Assign CCG categories: arguments
IV. Assign predicate-argument structure
• We approximate predicate-argument structure by word-word dependencies
• These are defined by the argument slots of functor categories:
  just      (S\NP)/(S\NP)     opened
  opened    (S[dcl]\NP)/NP    doors
IV. Assign predicate-argument structure
• Non-local dependencies arise through:
  – Binding and control: “He may want you to listen”
  – Extraction: “the tapas that he told us she ate”
• Both are mediated by lexical categories:
  – Control verbs, auxiliaries/modals
  – Relative pronouns
• We represent this via coindexation: (NP\NPi)/(S[dcl]/NPi)
  In CCGbank: added automatically to certain category types
Lexical categories that mediate dependencies
• Auxiliaries/modals, raising verbs: will, might, seem
  (S[dcl]\NPi)/(S[b]\NPi)
• Control verbs: persuade you to go
  ((S[dcl]\NP)/(S[to]\NPi))/NPi
• Relative pronouns: which, who, that
  (NP\NPi)/(S[dcl]/NPi)
• Many more (listed in CCGbank manual)
Summary: The basic algorithm
1. Identify heads, complements and adjuncts.
2. Binarize the tree.
3. Assign CCG categories.
4. Add co-indexation to lexical categories.
5. Create predicate-argument structure.
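Step 3 can be illustrated with a small Python sketch. It assumes that steps 1 and 2 have already produced a binarized tree whose nodes are marked as head, complement or adjunct; the tuple encoding, the function names and the use of Treebank labels as argument categories are simplifications made up for this example. Features, co-indexation and dependencies are left out.

```python
# Node encoding (hypothetical): (label, role, body)
#   role is "head", "comp" or "adj" relative to the parent
#   body is a word (leaf) or a pair of daughter nodes (binarized tree)

def wrap(cat):
    """Parenthesize complex categories when they are embedded."""
    return f"({cat})" if ("/" in cat or "\\" in cat) else cat


def assign(node, cat, lexicon):
    """Top-down category assignment; collects (word, category) pairs."""
    label, role, body = node
    if isinstance(body, str):          # leaf: record the lexical category
        lexicon.append((body, cat))
        return
    left, right = body
    head_is_left = left[1] == "head"
    other = right if head_is_left else left
    if other[1] == "comp":
        # The head becomes a functor that takes the complement as argument.
        arg = other[0]                 # simplification: category = label
        slash = "/" if head_is_left else "\\"
        head_cat, other_cat = f"{wrap(cat)}{slash}{arg}", arg
    else:
        # Adjuncts are X/X (before the head) or X\X (after the head).
        slash = "\\" if head_is_left else "/"
        head_cat, other_cat = cat, f"{wrap(cat)}{slash}{wrap(cat)}"
    assign(left, head_cat if head_is_left else other_cat, lexicon)
    assign(right, other_cat if head_is_left else head_cat, lexicon)


tree = ("S", "head", (
    ("NP", "comp", "John"),
    ("VP", "head", (
        ("ADVP", "adj", "just"),
        ("VP", "head", (
            ("V", "head", "opened"),
            ("NP", "comp", "doors"),
        )),
    )),
))

lexicon = []
assign(tree, "S", lexicon)
for word, cat in lexicon:
    print(word, cat)
```

For this toy tree the sketch yields (S\NP)/(S\NP) for "just" and (S\NP)/NP for "opened", the same shapes as the categories shown for these words in the course, minus the [dcl] feature.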
Problems with basic algorithm
• Depends on Treebank markup:
  – Complement/adjunct distinction
  – The analyses don’t always correspond to CCG analysis
  – Errors in Treebank annotation
• Proliferation of categories:
The need for preprocessing
• Eliminating (some of) the noise:
  – POS-tagging errors
  – Bracketing errors (coordination!)
• Changing the Treebank analyses:
  – Small clauses
• Adding more structure:
  – Insert a noun level into NPs
  – Analyze QPs, fragments, parentheticals, multiword-expressions
Compacting the grammar: Type-changing rules
• Type-changing rules for adjuncts capture syntactic regularities:
Null elements, traces, and coindexation
• *-null elements: passive, PRO
• *T*-traces: wh-movement, tough movement
• *RNR*-traces: right-node raising
• Other null elements:
  – *EXP*: expletive
  – *ICH* (“interpret constituent here”): extraposition
  – *U* (units): $ 500 *U*
  – *PPA* (permanent predictable ambiguity)
• =-coindexation: argument cluster coordination and gapping
* null elements
• Used for passive or PRO (arbitrary or controlled):
• Only the passive * matters for translation: (S with null subject = VP = S\NP)
Unbounded long-range dependencies
• … arising through extraction (*T*):
  – Wh-movement (relative clauses and wh-questions):
    the articles that (you believed he saw that…) I filed
  – Tough-movement:
    Peter is easy to please
  – Parasitic gaps:
    the articles that I filed without reading
• … arising through coordination (*RNR* and =):
  – Right-node raising:
    [[Mary ordered] and [John ate]] the tapas.
  – Argument cluster coordination:
    Mary ordered [[tapas for herself] and [wine for John]].
  – Sentential gapping:
    [[Mary ordered tapas] and [John beer]].
Dealing with extraction
• Penn Treebank: *T* traces indicate extraction
Dealing with extraction
• Pass the extracted NP up to the relative clause.
• The relative pronoun subcategorizes for an ‘incomplete’ sentence:
  (NP\NP)/(S[dcl]\NP) for subject relatives
  (NP\NP)/(S[dcl]/NP) for object relatives
• The derivation uses type-raising and composition
Right node raising in the Penn Treebank
Right node raising in CCGbank
Argument-cluster coordination
• “Template gapping” annotation: Co-indexation between constituents in conjuncts
• The first conjunct contains the head
Argument-cluster coordination in CCGbank
• The shared constituents are coordinated (via type-raising and composition):
  X ⇒ T\(T/NP) (<T)
  NP ⇒ (S\NP)\((S\NP)/NP) (<T)
Sentential Gapping
• In the Treebank:
• CCG uses decomposition to obtain the types (interpretation is given extragrammatically)
Remaining problems: NP level
• Lists and appositives are indistinguishable:
• Compound nouns have no internal structure:
Remaining problems: other constructions
• Complement-adjunct distinction:
Putting it all together….
Funds that are or soon will be listed in New York or London
The CCG derivation
The relative clause:
that: (NPi\NPi)/(S[dcl]\NPi)    funds    are, will
The right-node-raising VP
CCGbank
• Coverage of the translation algorithm: 99.44% of all sentences in the Treebank (main problem: sentential gapping)
• The lexicon (sec. 02-21):
  – 74,669 entries for 44,210 word types
  – 1286 lexical category types (439 appear once, 556 appear 5 times or more)
• The grammar (sec. 02-21):
  – 3262 rule instantiations (1146 appear once)
The most ambiguous words
Frequency distribution of categories
Lexical coverage
• How well does our lexicon cover unseen data?
  “Training” data: sections 02-21
  Test data: section 00
• The lexicon contains the correct entries for 94.0% of the tokens in section 00.
• 3.8% of the tokens in section 00 do not appear in sections 02-21.
  35% of the unknown tokens are N
  29% of the unknown tokens are N/N
Statistical Parsing with CCG
• The data: CCGbank
• The algorithms: standard CKY chart parsing (and a supertagger)
• The models:
  – Generative: Hockenmaier and Steedman (2002)
  – Conditional: Clark and Curran (2004)
Parsing algorithms for CCG
• CCG derivations are binary trees.
• Standard chart parsing algorithms (e.g. CKY) can be used.
• Complexity: O(n^6) (or O(n^3) if the category set is fixed)
• Recovery of “deep” dependencies requires feature structures.
• Supertagging: assign most likely categories to words before parsing. Significantly speeds up parsing!
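A minimal CKY recognizer shows why standard chart parsing carries over to CCG. This is a sketch under strong assumptions: categories are encoded as nested tuples, only forward and backward application are used, and the toy lexicon stands in for the output of a supertagger. None of these names come from an actual CCG parser.

```python
from itertools import product

# Category encoding (hypothetical): an atomic category is a string,
# a complex category is a tuple (result, slash, argument).
VP = ("S", "\\", "NP")
LEXICON = {
    "Mary": {"NP"},
    "tapas": {"NP"},
    "ordered": {(VP, "/", "NP")},   # transitive verb (S\NP)/NP
    "slept": {VP},                  # intransitive verb S\NP
}


def combine(left, right):
    """Categories derivable from two adjacent categories by application."""
    out = set()
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        out.add(left[0])            # X/Y  Y  =>  X
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        out.add(right[0])           # Y  X\Y  =>  X
    return out


def cky(words, lexicon):
    """Fill a chart mapping spans (i, j) to the set of categories found."""
    n = len(words)
    chart = {(i, i + 1): set(lexicon[w]) for i, w in enumerate(words)}
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = set()
            for k in range(i + 1, j):   # every split point of the span
                for l, r in product(chart[(i, k)], chart[(k, j)]):
                    cell |= combine(l, r)
            chart[(i, j)] = cell
    return chart


chart = cky("Mary ordered tapas".split(), LEXICON)
print(chart[(0, 3)])
```

The three nested loops over span width, start position and split point give the familiar cubic behaviour when the category set is fixed. Restricting each word to a few likely categories, as a supertagger does, shrinks the sets that the inner `product` has to enumerate.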
Parsing models
• Generative models: P(s, τ)
  Model the process which generates the derivation τ
  – Advantage: easy to guarantee consistency
  – Disadvantage: requires good smoothing techniques, difficult to include complex features
  Good baseline
• Conditional models: P(τ | s)
  Given a sentence s, predict most likely derivation τ
  – Advantage: more natural for parsing
  – Disadvantage: large model size, difficult to estimate
Evaluation: recovery of dependency structures
               Labelled   Unlabelled
Generative:      83.3        90.3      (Hockenmaier and Steedman, 2002)
Conditional:     84.6        91.2      (Clark and Curran, 2004)
This includes long-range dependencies
ccg2sem: from CCG to DRT
• A Prolog package which translates CCGbank derivations into Discourse Representation Theory structures (Bos, 2005)
CCGbanks for other languages
• German (Hockenmaier, 2006):
  – Translation of German TIGER corpus into CCG.
  – Many crossing dependencies, etc.: context-free approximations are inappropriate
  – Current coverage: 92.4% of all graphs (excluding headlines, fragments etc.)
• Turkish (Cakici, 2005):
  – Extracts a CCG lexicon from the METU Sabanci Treebank.
A few references
General CCG references:
M. Steedman (2000). The Syntactic Process, MIT Press.
M. Steedman (1996). Surface Structure and Interpretation, MIT Press.
CCGbank(s) and wide-coverage CCG parsing:
J. Hockenmaier and M. Steedman (2005). CCGbank: User’s Manual, MS-CIS-05-09, Dept. of Computer and Information Science, University of Pennsylvania.
J. Hockenmaier and M. Steedman (2002). Acquiring Compact Lexicalized Grammars from a Cleaner Treebank, LREC, Las Palmas, Spain.
J. Hockenmaier (2003). Data and Models for Statistical Parsing with Combinatory Categorial Grammar. PhD thesis, Informatics, University of Edinburgh.
J. Hockenmaier and M. Steedman (2002). Generative Models for Statistical Parsing with Combinatory Categorial Grammar, ACL ’02, Philadelphia, PA, USA.
S. Clark and J. R. Curran (2004). Parsing the WSJ using CCG and Log-Linear Models, ACL ’04, Barcelona, Spain.
S. Clark and J. R. Curran (2004). The Importance of Supertagging for Wide-Coverage CCG Parsing. Coling ’04, Geneva, Switzerland.
J. Bos (2005). Towards Wide-Coverage Semantic Interpretation. IWCS-6.
R. Cakici (2005). Automatic Induction of a CCG Grammar for Turkish. ACL Student Research Workshop, Ann Arbor, MI, USA.
J. Hockenmaier (2006). Creating a CCGbank and a wide-coverage CCG lexicon for German. ACL/COLING ’06, Sydney, Australia.
More references
• The CCG website: http://groups.inf.ed.ac.uk/ccg with lots of general references about CCG (as well as CCGbank, CCG parsing, etc.)
• CCGbank is available from the Linguistic Data Consortium (LDC) at the University of Pennsylvania.
Penn-II-Based Acquisition of HPSG Resources
Head-Driven Phrase Structure Grammar
Penn-II-Based Acquisition of HPSG Resources
• Introduction
• Treebank conversion and HPSG annotation
• Lexicon extraction
• Probabilistic models
  – Feature forest model
  – Design of features
• Parsing
• Evaluation
• Advanced topics
Introduction
• If we had an HPSG version of Penn-II, we could obtain lexical entries and probabilistic models
• How do we get HPSG-annotated Penn-II?
• Converting Penn-II into an HPSG-conformant treebank
• How do we verify the conformity with the HPSG theory?
• Principles are exploited for the verification
  – Implementation of principles is relatively easy, while construction of the lexicon is extremely difficult
  – Principles are hand-coded, while lexical entries are acquired from a converted treebank
Introduction
• We develop a treebank rather than a lexicon
• A treebank provides more information than a lexicon
  – Verification of the consistency of the grammar
  – Statistics
[Diagram labels: Principles, Lexicon, Treebank]
Methodology
[Diagram labels: Treebank; Treebank conversion; HPSG treebank; Lexicon extraction; Lexicon; Principles; Grammar writer; example tokens pretty/JJ, database/NN]
Comparison with conventional grammar development
[Diagram contrasting “Treebank-based development” with “Manual development”. Labels: Principles, Treebank, Lexicon extractor, Lexicon; Grammar writer, Principles, Lexicon, Parser, Corpus, Treebank, with “edit” and “verify” steps]
Treebank conversion and HPSG annotation
• Convert Penn-style parse trees into HPSG-style parse trees
  – Correcting frequent errors in Penn Treebank
    • Ex. Confusion of VBD/VBN
  – Converting tree structures
    • Small clauses, passives, NP structures, auxiliary/control verbs, LDDs, etc.
  – Mapping into HPSG-style representations
    • Head/argument/modifier distinction, schema name assignment
    • Mapping into HPSG categories
  – Applying HPSG principles/schemas
    • Undetermined features are filled
    • Violations of feature constraints are detected
Overview
[Figure: the example “NL is officially making the offer” at three stages: (1) error correction & tree conversion of the Penn Treebank tree; (2) mapping into HPSG-style representation, with head/arg/mod marking, HEAD values (noun, verb, adv) and schema names (subject-head, head-comp, head-mod); (3) principle application, which fills in the HEAD, SUBJ, COMPS and MOD values with structure sharing]
Tree conversion
• Coordination, quotation, insertion, and apposition
• Small clauses, “than” phrases, quantifier phrases, complementizers, etc.
• Disambiguation of non-/pre-terminal symbols (TO, etc.)
• HEAD features (CASE, INV, VFORM, etc.)
• Noun phrase structures
• Auxiliary/control verbs
• Subject extraction
• Long distance dependencies
• Relative clauses, reduced relatives
Pattern-based tree conversion
tree_transform_rule("predicative", $Input, $Output) :-
    tree_match(TREE_NODE\$Node &
               TREE_DTRS\[tree_any & ANY_TREES\$LeftTrees,
                          (TREE_NODE\SYM\"S" &
                           TREE_DTRS\($PRDTrees &
                                      [tree_any, tree & TREE_NODE\FUNC\"PRD", tree_any])),
                          tree_any & ANY_TREES\$RightTrees],
               $Input),
    append_list([$LeftTrees, $PRDTrees, $RightTrees], $Dtrs),
    $Output = TREE_NODE\$Node & TREE_DTRS\$Dtrs.

Tree pattern:
  before: (S (NP He) (VP considered (S (NP himself) (ADJP superior))))
  after:  (S (NP He) (VP considered (NP himself) (ADJP superior)))
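The effect of the “predicative” rule can be imitated in Python. The sketch below is not a translation of the LiLFeS rule: the `Tree` class and the function names are invented for the example, and it splices every matching small clause in one recursive pass rather than matching a single pattern.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Tree:
    sym: str                          # node symbol, e.g. "S", "NP"
    func: Optional[str] = None        # function tag, e.g. "PRD"
    dtrs: List["Tree"] = field(default_factory=list)
    word: Optional[str] = None        # set on leaves only


def flatten_predicative(node):
    """Replace each daughter S that contains a PRD-tagged daughter
    by that S node's own daughters (small-clause flattening)."""
    new_dtrs = []
    for d in node.dtrs:
        d = flatten_predicative(d)
        if d.sym == "S" and any(g.func == "PRD" for g in d.dtrs):
            new_dtrs.extend(d.dtrs)   # splice the small clause away
        else:
            new_dtrs.append(d)
    return Tree(node.sym, node.func, new_dtrs, node.word)


def bracket(t):
    """Render a tree in bracketed notation (function tags omitted)."""
    if t.word is not None:
        return f"({t.sym} {t.word})"
    return f"({t.sym} " + " ".join(bracket(d) for d in t.dtrs) + ")"


small_clause = Tree("S", dtrs=[
    Tree("NP", word="himself"),
    Tree("ADJP", func="PRD", word="superior"),
])
tree = Tree("S", dtrs=[
    Tree("NP", word="He"),
    Tree("VP", dtrs=[Tree("VBD", word="considered"), small_clause]),
])

converted = flatten_predicative(tree)
print(bracket(converted))
```

The function returns a new tree and leaves its input untouched, which makes it easy to compare the trees before and after conversion.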
Passive
• “be + VBN” constructions are assigned “VFORM passive”
[Figure: Penn Treebank tree for “the details have n’t been worked out”, with NP-SBJ-2 coindexed with the object trace *-2 and worked/VBN marked “VFORM passive”]
Noun phrase structures
• Determiners are raised
• Possessive structures are explicitly represented
[Figure: tree for “Monsanto ’s director of plant sciences” before and after conversion; the converted tree introduces DP and N’ nodes]
Auxiliary/control verbs
• Auxiliary/control verbs are annotated as taking unsaturated constituents
[Figure: tree for “they did n’t have to choose this particular moment” before and after conversion; the embedded S with the null subject *-1 (coindexed with NP-1 “they”) becomes a VP, and the SUBJ values of the verbs are structure-shared]
Subject extraction
• HPSG does not allow subject extraction
• Relativizers are treated as ordinary subjects in relative clauses
[Figure: tree for “The company which has reported net losses” before and after conversion; the S containing the subject trace *T*-1 is removed, so that WHNP-1 “which” combines directly with the VP]
Subject relative
• Relativizers have a non-empty list in REL
• The element of REL is consumed in a head-relative construction and represents the relative-antecedent relation
[Figure: tree for “The company which has reported net losses” annotated with REL values]
LDDs: Object relative
• SLASH represents moved arguments
• REL represents relative-antecedent relations
[Figure: tree for “the energy and ambitions that reformers wanted to reward”, with traces *T*-3 and *-2; SLASH values are passed up from the trace and REL values link the relativizer to its antecedent]
Mapping into HPSG-style representations
• Convert nonterminal symbols into HPSG-style categories
• Assign schema names to internal nodes
  NN  →  HEAD: noun, AGR: 3sg
  VBD →  HEAD: verb, VFORM: finite, TENSE: past
Category mapping & schema name assignment
• Example: “NL is officially making the offer”
[Figure: the Penn-style tree with head/arg/mod marking, and the corresponding HPSG-style tree with HEAD values (noun, verb, adv) and schema names (subject-head, head-comp, head-mod)]
Principle application
inverse_schema_binary(subj_head_schema, $Mother, $Left, $Right) :-
    $Left = (SYNSEM\($LeftSynsem &
                     LOCAL\CAT\(HEAD\MOD\[] &
                                VAL\(SUBJ\[] & COMPS\[] & SPR\[])))),
    $Right = (SYNSEM\LOCAL\CAT\(HEAD\$Head &
                                VAL\(SUBJ\[$LeftSynsem] & COMPS\[] & SPR\[]))),
    $Mother = (SYNSEM\LOCAL\CAT\(HEAD\$Head &
                                 VAL\(SUBJ\[] & COMPS\[] & SPR\[]))).

[Figure: applying the schema to “He considered ...”: the sign of “He” (HEAD: noun) is structure-shared with the SUBJ element of “considered ...” (HEAD: verb, SUBJ: <HEAD: noun>); the mother is HEAD: verb, SUBJ: <>]
Principle application
[Figure: “NL is officially making the offer” before and after principle application; the schema names (subject-head, head-comp, head-mod) determine how the HEAD, SUBJ, COMPS and MOD values are filled in and structure-shared]
Complicated example
[Figure: “the prices we were charged”: the Penn Treebank tree with traces *-2 and *T*-1 and head/arg marking, and the resulting HPSG signs with HEAD, SUBJ, COMPS, SPR, SLASH and REL values]
Lexicon extraction
• Collecting leaf nodes of HPSG parse trees
• Generalizing leaf nodes into lexical entry templates
• Applying inverse lexical rules
• Assigning predicate argument structures
Overview
[Figure: lexicon extraction for “making” in “NL is officially making the offer”, in three steps:
  1. Collection of leaf nodes & generalization: making: HEAD verb, SUBJ <HEAD noun>, COMPS <HEAD noun>
  2. Application of inverse lexical rules: make: HEAD verb, SUBJ <HEAD noun>, COMPS <HEAD noun>
  3. Assignment of predicate argument structures: make: CONT make’ with ARG1 and ARG2 linked to the CONT values of the SUBJ and COMPS elements]
Collecting leaf nodes
• Leaf nodes of HPSG parse trees are instances of lexical entries
[Figure: HPSG parse tree for “NL is officially making the offer” with HEAD, SUBJ, COMPS and MOD values on every node]
Generalization into lexical entry templates
• Unnecessary constraints are removed (restriction)
  A leaf node of the HPSG treebank: HEAD: verb, SUBJ: <HEAD: noun, POSTHEAD: minus>
  Lexical entry template: HEAD: verb, SUBJ: <HEAD: noun>

lexical_entry_template($WordInfo, $Sign, $Template) :-
    copy($Sign, $Template),
    $Template = (SYNSEM\LOCAL\(CAT\HEAD\$Head &
                               VAL\(SUBJ\$Subj & COMPS\$Comps & SPR\$SPR))),
    ...
    restriction($SubjSynsem, [NONLOCAL\]),
    restriction($SubjSynsem, [LOCAL\, CAT\, HEAD\, POSTHEAD\]),
    restriction($SubjSynsem, [LOCAL\, CAT\, HEAD\, AUX\]),
    restriction($SubjSynsem, [LOCAL\, CAT\, HEAD\, TENSE\]),
    ...
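The idea of restriction can be shown with plain Python dictionaries standing in for feature structures. This is a loose analogy, not the LiLFeS semantics: the dictionary layout and both function names are assumptions chosen to mirror the example above.

```python
import copy


def restriction(fs, path):
    """Remove the value at `path` (a list of feature names) from a
    nested-dict feature structure, if that path exists."""
    for feat in path[:-1]:
        fs = fs.get(feat)
        if not isinstance(fs, dict):
            return
    fs.pop(path[-1], None)


def lexical_entry_template(sign):
    """Copy a leaf node and drop constraints that are too specific
    to belong in a general lexical entry."""
    template = copy.deepcopy(sign)
    for subj in template.get("SUBJ", []):
        restriction(subj, ["POSTHEAD"])
    return template


# Leaf node as found in the treebank (simplified, hypothetical layout)
leaf = {"HEAD": "verb", "SUBJ": [{"HEAD": "noun", "POSTHEAD": "minus"}]}

template = lexical_entry_template(leaf)
print(template)
```

Working on a deep copy matters here: the treebank node keeps its full annotation, while the template that goes into the lexicon only carries the constraints that should hold for every use of the word.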
Application of inverse lexical rules
• Converting lexical entries of inflected words into lexical entries of lexemes using inverse lexical rules
• Derivational rules: Ex. passive rule
  HEAD: verb, SUBJ: <HEAD: noun>, COMPS: <HEAD: prep_by>  →  HEAD: verb, SUBJ: <HEAD: noun>, COMPS: <HEAD: noun>
• Inflectional rules: Ex. past-tense rule
  HEAD: verb, VFORM: finite, TENSE: past  →  HEAD: verb, VFORM: base
Predicate argument structures
• Create mappings from syntactic arguments into semantic arguments
  Ex. lexical entry for “make”:
  CAT: HEAD verb
       VAL: SUBJ < CAT|HEAD noun, CONT [1] >
            COMPS < CAT|HEAD noun, CONT [2] >
  CONT: make’, ARG1 [1], ARG2 [2]
Probabilistic models
• Feature forest model
  – A solution to the problem of the probabilistic modeling of feature structures
• Design of features
  – How to represent preferences of HPSG parse trees
Example: PCFG
Training data (each tree has the form S → NP VP):
                    She dances   I dance   She danced   I danced
  Observed freq.       0.3         0.3        0.2          0.2
  Estimated prob.      0.15        0.15       0.2          0.2

CFG rule probabilities:
  S → NP VP      1.0
  NP → She       0.5
  NP → I         0.5
  VP → dances    0.3
  VP → dance     0.3
  VP → danced    0.4
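The numbers in this example can be reproduced with relative-frequency estimation. The sketch below hard-codes the toy training data and treats every tree as S → NP VP with a single word under each of NP and VP; the variable and function names are made up for illustration.

```python
from collections import Counter

# Observed relative frequencies of the four training trees
data = {
    ("She", "dances"): 0.3,
    ("I", "dance"): 0.3,
    ("She", "danced"): 0.2,
    ("I", "danced"): 0.2,
}

# Relative-frequency estimates for the rules NP -> w and VP -> w
np_count, vp_count = Counter(), Counter()
for (subj, verb), freq in data.items():
    np_count[subj] += freq
    vp_count[verb] += freq

p_np = {w: c / sum(np_count.values()) for w, c in np_count.items()}
p_vp = {w: c / sum(vp_count.values()) for w, c in vp_count.items()}


def tree_prob(subj, verb):
    """PCFG probability of (S (NP subj) (VP verb)); P(S -> NP VP) = 1."""
    return 1.0 * p_np.get(subj, 0.0) * p_vp.get(verb, 0.0)


for subj in ("She", "I"):
    for verb in ("dances", "dance", "danced"):
        print(subj, verb, round(tree_prob(subj, verb), 2))
```

Because the NP and VP rules are estimated independently, the model gives “She dance” and “I dances” the same probability (0.15) as their grammatical counterparts, and the observed frequency of 0.3 is underestimated.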
What is the problem?
• PCFG assigns probabilities to ungrammatical structures
  – “She dance” (0.15), “I dances” (0.15)

Training data:
                    She dances   I dance   She danced   I danced
  Observed freq.       0.3         0.3        0.2          0.2
  Estimated prob.      0.15        0.15       0.2          0.2
Feature structure constraints
• In HPSG, feature structures explain grammatical constraints
• “She dance” and “I dances” are never generated
• However, constraints of feature structures violate the “independence assumption” of probabilistic models (Abney 1997)
  S → NP[AGR 1] VP[AGR 1]
  NP[AGR: 3sg] → She
  NP[AGR: no3sg] → I
  VP[AGR: 3sg] → dances
  VP[AGR: no3sg] → dance
  VP → danced
How can we estimate probabilities in this situation?
Solution: ME model
• Probabilities of parse trees are estimated by maximum entropy models (Berger et al. 1996)
• Probability p(T) of parse tree T:
  $$p(T) = \frac{1}{Z} \exp\Big(\sum_i \lambda_i f_i(T)\Big)$$
  where $f_i$ is a feature function, $\lambda_i$ is a parameter (feature weight), and $Z$ is the normalization factor
• Optimal parameters are computed so as to maximize the likelihood of training data
ME model of parse trees
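• If feature functions correspond to CFG rules, this model is an extension of the PCFG model
• Probabilities of parse trees are estimated without the independence assumption
  $$p(T) = \frac{1}{Z}\exp\big(\lambda_1 f_1(T) + \lambda_2 f_2(T) + \lambda_3 f_3(T)\big) = \frac{1}{Z}\,(e^{\lambda_1})^{f_1(T)}\,(e^{\lambda_2})^{f_2(T)}\,(e^{\lambda_3})^{f_3(T)}$$
[Figure: tree for “She dances” with feature $f_1$ at S, $f_2$ at NP → She and $f_3$ at VP → dances]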
Estimation by a ME model
Training data (each tree has the form S → NP VP):
                                    She dances   I dance   She danced   I danced
  Observed freq.                       0.3         0.3        0.2          0.2
  Estimated prob.                      0.3         0.3        0.2          0.2
  $\prod_i \exp(\lambda_i f_i)$        1.145       1.145      0.763        0.763

ME parameters $\exp(\lambda_i)$:
  S → NP[AGR 1] VP[AGR 1]    1.0
  NP[AGR: 3sg] → She         1.0
  NP[AGR: no3sg] → I         1.0
  VP[AGR: 3sg] → dances      1.145
  VP[AGR: no3sg] → dance     1.145
  VP → danced                0.763
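A short check confirms that the parameters listed above reproduce the observed frequencies once the model is normalized over the grammatical trees only. The dictionaries below are a hand-built stand-in for the unification grammar, and the values are read as exp(λ_i) for each rule; all names are assumptions made for this sketch.

```python
# exp(lambda_i) for each rule, taken from the example
weights = {"S": 1.0, "She": 1.0, "I": 1.0,
           "dances": 1.145, "dance": 1.145, "danced": 0.763}

# Agreement values; None means the word is unspecified for AGR
agr = {"She": "3sg", "I": "no3sg",
       "dances": "3sg", "dance": "no3sg", "danced": None}


def grammatical(subj, verb):
    """The AGR constraint of S -> NP[AGR 1] VP[AGR 1]."""
    return agr[verb] is None or agr[verb] == agr[subj]


# Only trees licensed by the feature-structure constraints exist
trees = [(s, v) for s in ("She", "I")
         for v in ("dances", "dance", "danced") if grammatical(s, v)]

# Unnormalized product of exp(lambda_i) over the rules used in each tree
q = {t: weights["S"] * weights[t[0]] * weights[t[1]] for t in trees}
Z = sum(q.values())
p = {t: q[t] / Z for t in trees}

for t in trees:
    print(t, round(p[t], 2))
```

The ungrammatical trees never enter the normalization factor Z, so no probability mass is lost on them; this is the difference from the PCFG estimate of the same data.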
Combinatorial explosion of parse trees
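• Exponentially many parse trees are assigned to sentences (i.e., a set of T is exponential)
[Figure: a packed forest S → {NP1, NP2, …} {VP1, VP2, …} with n NP alternatives and m VP alternatives; by expanding it we get the trees S(NP1 VP1), S(NP2 VP1), S(NP1 VP2), S(NP2 VP2), …; size: n × m]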
Problems by combinatorial explosion
• Parameter estimation is intractable
  – Computation of $E(f_i) = \sum_T f_i(T)\,p(T)$
  – Computation of $Z = \sum_T \exp\big(\sum_i \lambda_i f_i(T)\big)$
• Searching for the most probable parse is intractable
  – Computation of $\hat{T} = \arg\max_T p(T)$
Solutions in HMM and PCFG
• Probabilistic models are divided into independent probabilities, and dynamic programming is applied
  – Forward-backward probability
  – Baum-Welch algorithm
  – Inside-outside probability
  – Viterbi search
• Inside/outside probabilities can be computed at a cost proportional to the number of nodes, assuming a forest structure of parse trees
Feature forest model
• Dynamic programming can also be applied to maximum entropy estimation
• Feature forest:
  – Forest structure isomorphic to CFG parse forest
  – Assign feature functions to nodes rather than symbols
• A ME model is estimated without unpacking feature forests
[Figure: feature forest with root f(S) and alternative daughters f(NP1), f(NP2), … and f(VP1), f(VP2), …; size: n + m]
Feature forest representation of a parse tree
• A feature forest represents exponentially many trees of features
[Figure: packed forest S → {NP1, NP2, …} {VP1, VP2, …} with n and m alternatives, and its feature forest representation with nodes f(S), f(NP1), f(NP2), f(VP1), f(VP2); size: n + m]
feature forest representation
Inside/outside trees of a feature forest
• Focus on a set of trees below/above the targeted node
• Inside trees TI(n): Trees below n
• Outside trees TO(n): Trees above n
[Figure: feature forest representation with nodes f(S), f(NP1), f(NP2), f(VP1), f(VP2); the inside trees TI(NP1) and the outside trees TO(NP1) are marked]
Estimation algorithms for ME models
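• Estimation of parameters requires computation of model expectations (Malouf 2002)
  Objective function:
  $$G(\lambda) = \frac{1}{|D|}\sum_{x_k \in D}\sum_i \lambda_i f_i(x_k) - \log Z$$
  Gradient:
  $$\frac{\partial G(\lambda)}{\partial \lambda_i} = \tilde{E}(f_i) - E(f_i) = \frac{1}{|D|}\sum_{x_k \in D} f_i(x_k) - \sum_x f_i(x)\,p(x)$$
  The first term is computed from training data; the second is recomputed at each iteration.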
Inside/outside products
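• Unnormalized product: $q(x) = \exp\big(\sum_i \lambda_i f_i(x)\big)$
• Inside product (written $\varphi$ here): $\varphi_{NP_1} = \sum_{T \in T_I(NP_1)} \exp\big(\sum_i \lambda_i f_i(T)\big)$
• Outside product (written $\psi$ here): $\psi_{NP_1} = \sum_{T \in T_O(NP_1)} \exp\big(\sum_i \lambda_i f_i(T)\big)$
[Figure: feature forest representation with nodes f(S), f(NP1), f(NP2), f(VP1), f(VP2)]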
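Computation of inside products
• The inside product of NP1 is a product of inside products of its daughters
$$\varphi_{\mathrm{NP}_1} = \big(\varphi_{\mathrm{N}_1}\varphi_{\mathrm{N}'_1} + \varphi_{\mathrm{N}_1}\varphi_{\mathrm{N}'_2} + \varphi_{\mathrm{N}_2}\varphi_{\mathrm{N}'_1} + \varphi_{\mathrm{N}_2}\varphi_{\mathrm{N}'_2} + \cdots\big) \exp\Big(\sum_i \lambda_i f_i(\mathrm{NP}_1)\Big)$$
$$\phantom{\varphi_{\mathrm{NP}_1}} = \exp\Big(\sum_i \lambda_i f_i(\mathrm{NP}_1)\Big) \sum_{n \in \{\mathrm{N}_1, \mathrm{N}_2, \dots\}} \varphi_n \sum_{n' \in \{\mathrm{N}'_1, \mathrm{N}'_2, \dots\}} \varphi_{n'}$$
[Figure: feature forest representation with f(NP1) and f(NP2) above the alternative daughters f(N1), f(N2) and f(N’1), f(N’2); the inside products of N1 and N’1 are marked]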
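The recursion for inside products can be sketched as follows. This is an illustration under assumptions, not the lecture's implementation: the forest is a dict from node to daughter slots, `weight[n]` stands for exp(Σ_i λ_i f_i(n)), and the numeric weights are made up.

```python
from functools import lru_cache
from math import prod

def inside_products(forest, weight):
    """Inside product of every node in `forest`.

    forest[n] lists the daughter slots of n; each slot lists alternative daughters.
    weight[n] stands for exp(sum_i lambda_i * f_i(n)).
    """
    @lru_cache(maxsize=None)
    def phi(node):
        # Own weight times, for each slot, the sum over that slot's alternatives.
        return weight[node] * prod(sum(phi(d) for d in slot) for slot in forest[node])
    return {n: phi(n) for n in forest}

# NP1 with two daughter slots, each with two alternatives (weights are made up).
forest = {
    "NP1": [["N1", "N2"], ["N'1", "N'2"]],
    "N1": [], "N2": [], "N'1": [], "N'2": [],
}
weight = {"NP1": 0.5, "N1": 1.0, "N2": 2.0, "N'1": 3.0, "N'2": 4.0}
phi = inside_products(forest, weight)
print(phi["NP1"])
```

Memoization makes each node's inside product computed once, so the cost is proportional to the size of the forest rather than to the number of trees.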
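Computation of outside products
• The outside product of NP1 is a product of the mother’s outside product and the sisters’ inside products
$$\psi_{\mathrm{NP}_1} = \big(\psi_{\mathrm{S}}\,\varphi_{\mathrm{VP}_1} + \psi_{\mathrm{S}}\,\varphi_{\mathrm{VP}_2} + \cdots\big) \exp\Big(\sum_i \lambda_i f_i(\mathrm{S})\Big)$$
$$\phantom{\psi_{\mathrm{NP}_1}} = \psi_{\mathrm{S}} \exp\Big(\sum_i \lambda_i f_i(\mathrm{S})\Big) \sum_{n \in \{\mathrm{VP}_1, \mathrm{VP}_2, \dots\}} \varphi_n$$
[Figure: feature forest representation with nodes f(S), f(NP1), f(NP2), f(VP1), f(VP2); the outside product of S and the inside product of VP1 are marked]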
Computation of model expectations
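• Sum of unnormalized products of trees including NP1
$$\sum_{T:\ T \text{ includes } \mathrm{NP}_1} q(T) = \varphi_{\mathrm{NP}_1}\,\psi_{\mathrm{NP}_1}$$
• Expectation of f_i at NP1
$$E_{\mathrm{NP}_1}(f_i) = \frac{1}{Z}\, f_i(\mathrm{NP}_1)\, \varphi_{\mathrm{NP}_1}\,\psi_{\mathrm{NP}_1}$$
– φ: inside product, ψ: outside product
[Figure: feature forest representation with nodes f(S), f(NP1), f(NP2), f(VP1), f(VP2); the inside and outside products of NP1 are marked]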
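Putting the pieces together, model expectations can be computed with one bottom-up pass (inside products) and one top-down pass (outside products). The sketch below is a hedged illustration: the dict encoding, the function names, the two toy features and the parameter values are assumptions, and the root's outside product is set to 1.

```python
from math import exp, log, prod

def topo_order(forest, root):
    """Nodes reachable from `root`, mothers before daughters."""
    seen, post = set(), []
    def visit(n):
        if n in seen:
            return
        seen.add(n)
        for slot in forest[n]:
            for d in slot:
                visit(d)
        post.append(n)
    visit(root)
    return post[::-1]

def inside_outside(forest, weight, root):
    order = topo_order(forest, root)
    phi = {}
    for n in reversed(order):  # daughters first
        phi[n] = weight[n] * prod(sum(phi[d] for d in slot) for slot in forest[n])
    psi = {n: 0.0 for n in order}
    psi[root] = 1.0
    for n in order:            # mothers first
        slot_sums = [sum(phi[d] for d in slot) for slot in forest[n]]
        for j, slot in enumerate(forest[n]):
            sisters = prod(s for k, s in enumerate(slot_sums) if k != j)
            for d in slot:
                # Mother's outside product, mother's own weight, sisters' inside products.
                psi[d] += psi[n] * weight[n] * sisters
    return phi, psi

def expectations(forest, feats, lam, root):
    weight = {n: exp(sum(l * f for l, f in zip(lam, feats[n]))) for n in forest}
    phi, psi = inside_outside(forest, weight, root)
    z = phi[root]
    return [sum(feats[n][i] * phi[n] * psi[n] / z for n in phi)
            for i in range(len(lam))]

forest = {"S": [["NP1", "NP2"], ["VP1", "VP2"]],
          "NP1": [], "NP2": [], "VP1": [], "VP2": []}
# Two made-up features: f_1 fires on NP2, f_2 fires on either VP.
feats = {"S": (0, 0), "NP1": (0, 0), "NP2": (1, 0), "VP1": (0, 1), "VP2": (0, 1)}
lam = (log(3.0), log(2.0))
print(expectations(forest, feats, lam, "S"))
```

In this toy forest NP2 has three times the weight of NP1, so f_1 is expected to fire in three quarters of the probability mass, while every tree contains exactly one VP, so f_2 has expectation 1.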
Viterbi search
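• Almost the same as the computation of inside products
– “max” rather than “sum”
$$\max q(\mathrm{NP}_1) = \max_{n \in \{\mathrm{N}_1, \mathrm{N}_2, \dots\}} q(n) \cdot \max_{n' \in \{\mathrm{N}'_1, \mathrm{N}'_2, \dots\}} q(n') \cdot \exp\Big(\sum_i \lambda_i f_i(\mathrm{NP}_1)\Big)$$
[Figure: feature forest representation with f(NP1) and f(NP2) above the alternative daughters f(N1), f(N2) and f(N’1), f(N’2); q(N1) and q(N’1) are marked]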
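A sketch of the same recursion with “max” in place of “sum”, which also records the best daughter chosen for each slot. The encoding and the name `viterbi` are illustrative assumptions; `weight[n]` again stands for exp(Σ_i λ_i f_i(n)), and the weights are made up.

```python
from functools import lru_cache

def viterbi(forest, weight, root):
    """Best unnormalized product below `root` and the tree that achieves it."""
    @lru_cache(maxsize=None)
    def best(node):
        score, subtrees = weight[node], []
        for slot in forest[node]:
            # Keep only the best alternative of each slot ("max" instead of "sum").
            s, t = max((best(d) for d in slot), key=lambda st: st[0])
            score *= s
            subtrees.append(t)
        return score, (node, tuple(subtrees))
    return best(root)

forest = {
    "NP1": [["N1", "N2"], ["N'1", "N'2"]],
    "N1": [], "N2": [], "N'1": [], "N'2": [],
}
weight = {"NP1": 0.5, "N1": 1.0, "N2": 2.0, "N'1": 3.0, "N'2": 4.0}
score, tree = viterbi(forest, weight, "NP1")
print(score, tree)
```

The returned tree is a nested `(node, subtrees)` tuple; here it selects N2 and N'2, the highest-weighted alternative in each slot.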
Design of features
• Feature engineering is important for higher accuracy
• Feature functions are designed for capturing syntactic/semantic preferences of HPSG parse trees
A chart for HPSG parsing
[Figure: chart for “he saw a girl with a telescope”; the cells hold signs such as [HEAD noun, SUBCAT <>], [HEAD verb, SUBCAT <NP,NP>], [HEAD verb, SUBCAT <NP>], [HEAD verb, SUBCAT <>], [HEAD prep, MOD NP, SUBCAT <NP>], [HEAD prep, MOD NP, SUBCAT <>], [HEAD prep, MOD VP, SUBCAT <NP>] and [HEAD prep, MOD VP, SUBCAT <>]]
• Equivalent signs are packed
Feature forest representation of a chart
• Node = each rule application
[Figure: feature forest built from the chart; each node groups the signs combined by one rule application, e.g. [HEAD verb, SUBCAT <NP,NP>] with [HEAD noun, SUBCAT <>], or [HEAD verb, SUBCAT <NP>] with [HEAD prep, MOD VP, SUBCAT <>]]
Feature forest representation of predicate argument structures
• Node = already-determined predicate argument relations
[Figure: feature forest of predicate argument structures for “She ignored the fact that I wanted to dispute”, with nodes for fact, want (ARG1: I; ARG2: dispute1 or dispute2) and the alternative argument structures of dispute; shared substructures are numbered 1 to 4]
Extraction of probabilistic events
extract_binary_event("hpsg-forest", "bin", $RuleName, $LDtr, $RDtr, _, _, $Event) :-
    $Event = [$RuleName, $Dist, $Depth|$HDtrFeatures],
    find_head($Rule, $LSign, $RSign, $Head, $NonHead),
    rule_name_mapping($Rule, $Head, $NonHead, $RuleName),
    encode_distance($LSign, $RSign, $Dist),
    encode_depth($LSign, $RSign, $Depth),
    encode_sign($Head, $HDtrFeatures, $NDtrFeatures),
    encode_sign($NonHead, $NDtrFeatures, []).
<subj-head, 2, 1, VP, ran, VBD, V_intrans-past, 2, NP, boys, NNS, N_plural, 2>
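[Figure: parse tree S → NP VP over the words “Cool boys never ran” (with ADVP “never”), with the fields of the event labeled: schema, distance, depth, span, NTS, word, POS, lexical entry]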
Atomic features
• RULE: name of applied rule
• DIST: distance between head words
• COMMA: whether the phrase includes commas
• SPAN: number of words the phrase dominates
• SYM: nonterminal symbol (e.g. S, VP, …)
• WORD: head word
• POS: part-of-speech
• LE: lexical entry
• ARG: argument label (ARG1, ARG2, ...)
Example: syntactic features
• Feature for the Head-Modifier construction for “saw a girl” and “with a telescope”
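$$f = \langle \mathrm{RULE}, \mathrm{DIST}, \mathrm{COMMA}, \mathrm{SPAN}_l, \mathrm{SYM}_l, \mathrm{WORD}_l, \mathrm{POS}_l, \mathrm{LE}_l, \mathrm{SPAN}_r, \mathrm{SYM}_r, \mathrm{WORD}_r, \mathrm{POS}_r, \mathrm{LE}_r \rangle$$
$$\phantom{f} = \langle \text{head-modifier}, 3, 0, 3, \mathrm{VP}, \text{saw}, \mathrm{VBD}, \text{transitive}, 3, \mathrm{PP}, \text{with}, \mathrm{IN}, \text{prep-mod-vp} \rangle$$
[Figure: chart for “he saw a girl with a telescope”, highlighting the signs for “saw a girl” ([HEAD verb, SUBCAT <NP>]) and “with a telescope” ([HEAD prep, MOD VP, SUBCAT <>])]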
Example: semantic features
• Feature for the predicate argument relation between “he” and “saw”
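$$f_{pa} = \langle \mathrm{ARG}, \mathrm{DIST}, \mathrm{WORD}_h, \mathrm{POS}_h, \mathrm{LE}_h, \mathrm{WORD}_n, \mathrm{POS}_n, \mathrm{LE}_n \rangle$$
$$\phantom{f_{pa}} = \langle \mathrm{ARG1}, 1, \text{saw}, \mathrm{VBD}, \text{transitive}, \text{he}, \mathrm{PRP}, \text{pronoun} \rangle$$
[Figure: predicate argument structure of “saw” with ARG1 “he” and ARG2 “girl”]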
Feature generation
• Features are generated by abstracting descriptions of probabilistic events
feature_mask("hpsg-forest", "bin", [1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0]).
feature_mask("hpsg-forest", "bin", [1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0]).
feature_mask("hpsg-forest", "bin", [1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]).
<subj-head, 2, _, _, ran, VBD, V_intrans-past, _, _, boys, NNS, N_plural, _>
<subj-head, 2, 1, VP, ran, VBD, V_intrans-past, 2, NP, boys, NNS, N_plural, 2>
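The abstraction step can be mimicked in a few lines. This is a sketch for illustration only, not the LiLFeS machinery shown on the slide: `apply_mask` is a hypothetical helper, and the event and masks are copied from the slide.

```python
EVENT = ("subj-head", "2", "1", "VP", "ran", "VBD", "V_intrans-past", "2",
         "NP", "boys", "NNS", "N_plural", "2")

MASKS = [
    [1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0],
    [1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
]

def apply_mask(event, mask):
    """Keep the fields where the mask is 1 and abstract the others to '_'."""
    return tuple(value if keep else "_" for value, keep in zip(event, mask))

# One probabilistic event yields one feature per mask.
features = [apply_mask(EVENT, m) for m in MASKS]
for f in features:
    print("<" + ", ".join(f) + ">")
```

Each mask turns the same event into a feature of different granularity, from fully lexicalized to POS-only.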
Parsing
• Efficient processing of feature structures (details omitted)
– Abstract machines, quick check, CFG filtering, etc.
• Efficient search with probabilistic HPSG
– Beam thresholding
– Iterative beam thresholding
Beam thresholding
• Edges in each cell of the chart are pruned by thresholding
– Thresholding by number: for each cell, keep only the best n edges
– Thresholding by width: keep only the edges whose figure of merit (FOM) is within w of the best FOM in the same cell
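The two thresholds can be sketched for a single chart cell as follows. This is an assumption-laden illustration: edges are `(label, fom)` pairs with a log-probability-like FOM, the name `threshold_cell` is hypothetical, and the width test is read as "FOM at least best FOM minus w".

```python
def threshold_cell(edges, n, w):
    """Prune one chart cell: keep at most n edges, all within w of the best FOM."""
    if not edges:
        return []
    ranked = sorted(edges, key=lambda e: e[1], reverse=True)
    best_fom = ranked[0][1]
    # Thresholding by number (ranked[:n]) combined with thresholding by width.
    return [e for e in ranked[:n] if e[1] >= best_fom - w]

cell = [("a", -1.0), ("b", -1.5), ("c", -4.0), ("d", -2.5)]
print(threshold_cell(cell, n=3, w=2.0))
print(threshold_cell(cell, n=3, w=1.0))
```

A smaller n or w makes parsing faster but risks discarding an edge that the best full parse needs.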
Effect of beam thresholding
• Precision and recall as the parameters of beam search are varied
• Recall drops, while precision is retained
Iterative beam thresholding
• Start with a narrow beam width
• Keep widening the beam until parsing succeeds
Iterative_parse(sentence) {
  w := beam_width_start;
  while (w < beam_width_end) {
    parse(sentence, w);
    if (parse succeeds) return;
    w := w + beam_width_step;
  }
}
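The same loop as a runnable sketch. The parser is passed in as a function; `toy_parse` is a stand-in invented for this example (it "succeeds" once the beam is at least as wide as the sentence is long) and does not model a real HPSG parser.

```python
def iterative_parse(sentence, parse, start, end, step):
    """Retry `parse` with a wider beam until it succeeds or the limit is reached."""
    w = start
    while w < end:
        result = parse(sentence, w)
        if result is not None:  # parsing succeeded with beam width w
            return result, w
        w += step
    return None, None           # no parse within the allowed beam widths

def toy_parse(sentence, beam_width):
    """Hypothetical parser: fails (returns None) when the beam is too narrow."""
    return "parse" if beam_width >= len(sentence) else None

print(iterative_parse("he saw a girl".split(), toy_parse, start=1, end=10, step=2))
print(iterative_parse(["w"] * 12, toy_parse, start=1, end=10, step=2))
```

Most sentences succeed with a narrow beam, so the average cost stays close to that of plain beam thresholding while wider beams recover the harder sentences.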
Efficacy of iterative beam thresholding
• Evaluated on Penn Treebank Section 24 (< 15 words)
| Method | Precision | Recall | F-score | Avg. time (ms) |
|---|---|---|---|---|
| Viterbi | 88.2% | 87.9% | 88.1% | 103923 |
| Beam | 89.0% | 82.4% | 85.5% | 88 |
| Iterative | 87.6% | 87.2% | 87.4% | 99 |
Distribution of parsing time
• Black: Viterbi, Red: iterative beam thresholding
[Figure: parsing time in ms (log scale, 1 to 100,000,000) plotted against sentence length (0 to 15 words)]
Evaluation
• Evaluation of the lexical entries extracted from Penn Treebank
– Investigation of obtained lexical entries
– Coverage
• Evaluation of the disambiguation model
– Parsing accuracy
Experimental settings
• Training data: Sections 2-21 of Penn Treebank II (39,832 sentences)
• Test data:
– Development set: Section 22 (1,700 sentences)
– Final test set: Section 23 (2,416 sentences)
Number of tree conversion rules
| Target of conversion | Number |
|---|---|
| Penn-II errors | 102 |
| Category mapping | 85 |
| Head annotation and binarization | 63 |
| Difference of phrase structures | 15 |
| Predicate argument structures | 13 |
| Long distance dependencies | 13 |
| Others | 52 |
| Total | 343 |
Result of treebank conversion & lexicon extraction
• Treebank conversion and HPSG annotation succeeded for 37,886 sentences
• Extracted lexicon:
# words 34,765
# lexical entries 1,942
Average # lexical entries/word 1.43
Sources of treebank conversion failures
• Classification of failures of treebank conversion in Section 02 (67 failures/1989 sentences)
| Source of failure | Number |
|---|---|
| Shortcomings of tree conversion rules | 18 |
| Errors in Penn Treebank | 16 |
| Constructions currently unsupported | 20 |
| Constructions unsupported by HPSG | 13 |
Breakdown of extracted lexical entries
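| Category | # words | # lexical entries | Avg. # lex. entries |
|---|---|---|---|
| noun | 21,925 | 186 | 1.14 |
| verb | 4,094 | 945 | 1.94 |
| adjective | 8,078 | 62 | 1.28 |
| adverb | 1,295 | 72 | 2.75 |
| preposition | 159 | 193 | 9.17 |
| particle | 58 | 10 | 1.69 |
| determiner | 36 | 33 | 3.86 |
| conjunction | 94 | 321 | 9.46 |
| punctuation | 15 | 120 | 22.00 |
| Total | 34,765 | 1,942 | 1.43 |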
Example lexical entries
• Common noun (ex. review/NN), appeared 140,805 times
[HEAD noun [MOD <>], VAL [SPR <HEAD det>, SUBJ <>, COMPS <>]]
• Transitive verb, appeared 12,244 times
[HEAD verb [MOD <>, VFORM base], VAL [SPR <>, SUBJ <HEAD noun>, COMPS <HEAD noun>]]
• Pre-head adjective, appeared 55,049 times
[HEAD adj [MOD <HEAD noun>, POSTHEAD -], VAL [SPR <>, SUBJ <>, COMPS <>]]
Evaluation of coverage
• The ratio of lexical entries in the test data covered by the grammar is measured
• A sentence is covered when all of the lexical entries in the sentence are covered (strong coverage)
| | Lexical entry | Sentence |
|---|---|---|
| w/o unknown word handling | 96.52% | 54.7% |
| w/ unknown word handling | 99.15% | 84.8% |
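The two coverage measures differ only in the unit that is counted. A small sketch, assuming the gold annotation is available as `(word, lexical entry)` pairs per sentence and the grammar's lexicon as a set of such pairs; the function name and the toy data are invented for illustration.

```python
def coverage(sentences, lexicon):
    """Return (lexical entry coverage, strong sentence coverage)."""
    tokens = [pair for sentence in sentences for pair in sentence]
    covered_tokens = sum(pair in lexicon for pair in tokens)
    # Strong coverage: every lexical entry in the sentence must be in the lexicon.
    covered_sentences = sum(all(pair in lexicon for pair in s) for s in sentences)
    return covered_tokens / len(tokens), covered_sentences / len(sentences)

lexicon = {("he", "pronoun"), ("saw", "transitive"), ("a", "det"), ("girl", "noun")}
sentences = [
    [("he", "pronoun"), ("saw", "transitive"), ("a", "det"), ("girl", "noun")],
    [("he", "pronoun"), ("ran", "intransitive")],
]
lexical, strong = coverage(sentences, lexicon)
print(lexical, strong)
```

A single uncovered entry makes the whole sentence uncovered, which is why sentence coverage is so much lower than lexical entry coverage.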
Treebank size vs. coverage
Sentence length vs. coverage
Error analysis
• Classification of randomly selected uncovered lexical entries
| Cause | Number |
|---|---|
| Errors of Penn Treebank | 10 |
| Errors of treebank conversion | 48 |
| Lack of lexical entries | 23 |
| Constructions currently unsupported | 9 |
| Idioms | 6 |
| Non-linguistic expressions (ex. list) | 4 |
Examples of uncovered lexical entries
• Lack of mappings from words into lexical entries because of data sparseness
– Post-noun adjectives (younger, crucial)
– Coordination conjunctions of NP and S’
– Verbs taking present-participle as a complement
• Unsupported constructions
– Free relatives, extrapositions
• Incorrect lexical entries obtained because of idiomatic expressions
– (ADVP in part) because …
Evaluation of parsing accuracy
• Empirical evaluation of the probabilistic models
– Overall accuracy
– Treebank size vs. accuracy
– Sentence length vs. accuracy
– Contribution of features
– Coverage and accuracy
– Error analysis
• Measure: precision/recall of <predicate word, argument position, argument word, predicate type>
– e.g. <saw, ARG1, he, transitive>
[Figure: predicate argument structure of “saw” with ARG1 “he” and ARG2 “girl”]
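The measure can be sketched as set overlap between predicted and gold tuples. The tuples below are made up around the slide's example and do not come from the evaluation data; `precision_recall` is a hypothetical helper.

```python
def precision_recall(predicted, gold):
    """Precision and recall over <predicate, argument position, argument, type> tuples."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    return correct / len(predicted), correct / len(gold)

gold = {
    ("saw", "ARG1", "he", "transitive"),
    ("saw", "ARG2", "girl", "transitive"),
    ("with", "ARG1", "saw", "prep-mod-vp"),
    ("with", "ARG2", "telescope", "prep-mod-vp"),
}
predicted = {
    ("saw", "ARG1", "he", "transitive"),
    ("saw", "ARG2", "girl", "transitive"),
    ("with", "ARG1", "girl", "prep-mod-np"),  # wrong attachment and wrong type
}
precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```

Because the predicate type is part of the tuple, choosing the wrong lexical entry counts as an error even when the words are linked correctly.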
Effect of feature forest models
• Accuracy for Section 23 (< 40 words)
| Model | Precision | Recall |
|---|---|---|
| baseline | 78.10 | 77.39 |
| with syntactic features | 86.92 | 86.28 |
| with semantic features | 84.29 | 83.74 |
| with all features | 86.54 | 86.02 |
Treebank size vs. accuracy
[Figure: precision/recall (%, 0 to 100) plotted against the number of sentences (0 to 40,000)]
Sentence length vs. accuracy
[Figure: plot against sentence length (0 to 60); the y-axis runs from 0 to 100 and is labeled “Coverage (%)”, with the series “Sentence coverage”]
Contribution of features (1/2)
| Features | Precision | Recall | # features |
|---|---|---|---|
| All | 87.12 | 85.45 | 623,173 |
| - RULE | 86.98 | 85.37 | 620,511 |
| - DIST | 86.74 | 85.09 | 603,748 |
| - COMMA | 86.55 | 84.77 | 608,117 |
| - SPAN | 86.53 | 84.98 | 583,638 |
| - SYM | 86.90 | 85.47 | 614,975 |
| - WORD | 86.67 | 84.98 | 116,044 |
| - POS | 86.36 | 84.71 | 430,876 |
| - LE | 87.03 | 85.37 | 412,290 |
| None | 78.22 | 76.46 | 24,847 |
Contribution of features (2/2)
| Features | Precision | Recall | # features |
|---|---|---|---|
| All | 87.12 | 85.45 | 623,173 |
| - DIST,SPAN | 85.54 | 84.02 | 294,971 |
| - DIST,SPAN,COMMA | 83.94 | 82.44 | 286,489 |
| - RULE,DIST,SPAN,COMMA | 83.61 | 81.98 | 283,897 |
| - WORD,LE | 86.48 | 84.91 | 50,258 |
| - WORD,POS | 85.56 | 83.94 | 64,915 |
| - WORD,POS,LE | 84.89 | 83.43 | 33,740 |
| - SYM,WORD,POS,LE | 82.81 | 81.48 | 26,761 |
| None | 78.22 | 76.46 | 24,847 |
Coverage and accuracy
• Accuracies for strongly covered/uncovered sentences
• We can expect accuracy improvements by improving grammar coverage
| | Precision | Recall | # sentences |
|---|---|---|---|
| Covered sentences | 89.36 | 88.96 | 1,825 |
| Uncovered sentences | 75.57 | 74.04 | 319 |
Error analysis
• Classification of errors in randomly selected sentences (100 sentences)
| Error type | Number |
|---|---|
| PP-attachment ambiguity | 76 |
| Distinction of arguments/modifiers | 49 |
| Ambiguity of lexical entries | 44 |
| Errors in test data | 22 |
| Ambiguity of commas | 32 |
| Others | 75 |
Examples of errors (1/2)
• Antecedent of a relative clause
– It's made only in years when the grapes ripen perfectly (the last was 1979) and comes from a single acre of [NP grapes [S' that yielded a mere 75 cases in 1987]].
• Argument/modifier distinction of to-phrases
– More than a few CEOs say the red-carpet treatment tempts them [VP-modifier to return to a heartland city for future meetings].
Examples of errors (2/2)
• Preposition or verb phrase?
– Mitsui Mining & Smelting Co. posted a 62 % rise in pretax profit to 5.276 billion yen ($ 36.9 million) in its fiscal first half ended Sept. 30 [VP compared with 3.253 billion yen a year earlier].
• Selection of subcategorization frames
– [NP-subject ``Nasty innuendoes,''] [VP says [NP-object John Siegal, Mr. Dinkins's issues director, ``designed to prosecute a case of political corruption that simply doesn't exist.'']]
Advanced topics
• Domain adaptation
– Adapting the grammar and/or the disambiguation model to a new domain using a small amount of training data
• Generation
– Using the grammar for sentence generation
• Semantics construction
– Obtaining representations of formal semantics from HPSG parsing
• Applications
Domain adaptation (1/2)
• Disambiguation models are adapted to a bio domain using small training data
– An original probabilistic model is incorporated into a new model as a reference distribution
– Parameters of the new model are estimated so as to maximize the likelihood of the new training data
$$p_{new}(x) = \frac{1}{Z}\, p_{orig}(x) \exp\Big(\sum_i \lambda_i g_i(x)\Big)$$
– p_orig(x): reference distribution
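How the reference distribution enters the new model can be shown on a toy candidate set. This is a sketch under assumptions: `adapt` is a hypothetical helper, candidates are listed explicitly, and the probabilities, features and weights are made up.

```python
import math

def adapt(p_orig, lam, feats):
    """p_new(x) proportional to p_orig(x) * exp(sum_i lam_i * g_i(x))."""
    scores = [p * math.exp(sum(l * g for l, g in zip(lam, gx)))
              for p, gx in zip(p_orig, feats)]
    z = sum(scores)
    return [s / z for s in scores]

p_orig = [0.5, 0.5]      # original model: two equally likely candidates
feats = [(1,), (0,)]     # one new-domain feature, firing on the first candidate

print(adapt(p_orig, [0.0], feats))            # zero weights reproduce the original model
print(adapt(p_orig, [math.log(2.0)], feats))  # the new feature doubles the first candidate's odds
```

With all new parameters at zero the adapted model equals the original one, so a small amount of new-domain data only has to correct the original model rather than relearn it.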
Domain adaptation (2/2)
• Evaluation with a bio-domain corpus
• Training data:
– Penn Treebank (News): 39,832 sentences
– GENIA Treebank (Bio): 3,524 sentences

| | Precision | Recall |
|---|---|---|
| News domain | 87.69% | 87.16% |
| Bio domain (w/o adaptation) | 85.50% | 83.91% |
| Bio domain | 87.19% | 85.58% |
Generation (1/2)
• The methods for HPSG parsing are applied to a chart generator of HPSG
– Feature forest model
– Iterative beam thresholding
[Figure: chart generation and chart parsing side by side. Chart generation starts from the input he(x) buy(e) the(y) book(z) past(e); its cells are indexed by sets of covered input elements, from {0}, {1}, {2}, {3} up to {0,1,2,3}. Chart parsing of “He bought the book.” indexes its cells by spans, from 0-1, 1-2, 2-3 up to 0-3.]
Generation (2/2)
• Evaluation on Penn Treebank Section 23
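| Method | Beam width | Coverage (%) | Avg. generation time (msec.) | BLEU |
|---|---|---|---|---|
| Beam thresholding | 4 | 44.76 | 621 | 0.8196 |
| Beam thresholding | 8 | 67.70 | 1776 | 0.8294 |
| Beam thresholding | 12 | 73.12 | 3074 | 0.8327 |
| Beam thresholding | 16 | 72.90 | 4287 | 0.8341 |
| Beam thresholding | 20 | 71.81 | 5273 | 0.8333 |
| Iterative beam thresholding | 8-20 | 82.47 | 1668 | 0.7982 |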
Semantics construction (1/2)
• Mapping from HPSG parse trees into semantic representations of typed dynamic logic (TDL)
– Typed dynamic logic: a variant of dynamic semantics that includes plural semantics, event semantics, and situation semantics (Bekki, 2005)
– Completely compositional semantics: lambda calculus composes semantic representations of phrases from lexical representations
• Example: Few boys fell. They died.
few(x)[boy’x][fall’x] ∧ ref(x)[die’x]
Semantics construction (2/2)
• Approach:
– Mapping HPSG lexical entries into lexical representations of TDL
– Semantic representations of phrases are composed along HPSG parse trees
• Coverage: around 90% of the sentences in Penn Treebank Section 23 are assigned well-formed semantic representations
[Figure: the lexical entry [PHON “loves”, HEAD verb, SUBJ <HEAD noun>, COMPS <HEAD noun>] mapped to a lambda term over obj and sbj built from love’e, agent’(e, x) and theme’(e, y)]
Applications: information extraction
• Extraction of protein-protein interactions from biomedical paper abstracts
– Patterns on predicate argument structures are learned from small annotated data
– Precision/recall: 71.8%/48.4%
[Figure: precision plotted against recall (both 0 to 1), comparing (Yakushiji 2005) and (Ramani et al., 2005)]
Applications: text retrieval
• Retrieval of relational concepts
– All sentences in MEDLINE are parsed into predicate argument structures
– Relational concepts, such as “what causes cancer”, are retrieved by matching with predicate argument structures
– Precision/recall: 60-96%/30-50%
Summary
• Conversion of Penn Treebank II into an HPSG treebank
– Pattern-based tree conversion and principle application
• Extraction of lexical entries from the HPSG treebank
– Generalization, application of inverse lexical rules, and assignment of predicate argument structures
• Probabilistic modeling of feature structures
– Feature forest model
• Techniques for efficient parsing with probabilistic HPSG
– Iterative beam thresholding
• Evaluation
– Coverage and parsing accuracy
• Advanced topics
– Domain adaptation, sentence generation, semantics construction, and practical applications
Publications
• Corpus-oriented development of HPSG
– Y. Miyao, T. Ninomiya, and J. Tsujii. (2003). Lexicalized Grammar Acquisition. In Proc. 10th EACL Companion Volume.
– Y. Miyao, T. Ninomiya, and J. Tsujii. (2004). Corpus-oriented grammar development for acquiring a Head-Driven Phrase Structure Grammar from the Penn Treebank. In Proc. IJCNLP 2004.
– H. Nakanishi, Y. Miyao, and J. Tsujii. (2004). Using Inverse Lexical Rules to Acquire a Wide-coverage Lexicalized Grammar. In the IJCNLP 2004 Workshop on “Beyond Shallow Analyses.”
– H. Nakanishi, Y. Miyao, and J. Tsujii. (2004). An Empirical Investigation of the Effect of Lexical Rules on Parsing with a Treebank Grammar. In Proc. TLT 2004.
– K. Yoshida. (2005). Corpus-Oriented Development of Japanese HPSG Parsers. In 43rd ACL Student Research Workshop.
• Feature forest model
– Y. Miyao and J. Tsujii. (2002). Maximum entropy estimation for feature forests. In Proc. HLT 2002.
• Probabilistic models for HPSG
– Y. Miyao and J. Tsujii. (2003). A model of syntactic disambiguation based on lexicalized grammars. In Proc. 7th CoNLL.
– Y. Miyao, T. Ninomiya, and J. Tsujii. (2003). Probabilistic modeling of argument structures including non-local dependencies. In Proc. RANLP 2003.
– Y. Miyao and J. Tsujii. (2005). Probabilistic disambiguation models for wide-coverage HPSG parsing. In Proc. ACL 2005.
– T. Ninomiya, T. Matsuzaki, Y. Tsuruoka, Y. Miyao, and J. Tsujii. (2006). Extremely Lexicalized Models for Accurate and Fast HPSG Parsing. In Proc. EMNLP 2006.
• Parsing strategies for probabilistic HPSG
– Y. Tsuruoka, Y. Miyao, and J. Tsujii. (2004). Towards efficient probabilistic HPSG parsing: integrating semantic and syntactic preference to guide the parsing. In the IJCNLP-04 Workshop on “Beyond shallow analyses.”
– T. Ninomiya, Y. Tsuruoka, Y. Miyao, and J. Tsujii. (2005). Efficacy of Beam Thresholding, Unification Filtering and Hybrid Parsing in Probabilistic HPSG Parsing. In Proc. IWPT 2005.
– T. Ninomiya, Y. Tsuruoka, Y. Miyao, K. Taura, and J. Tsujii. (2006). Fast and Scalable HPSG Parsing. Traitement automatique des langues (TAL). 46(2).
• Domain adaptation
– T. Hara, Y. Miyao, and J. Tsujii. (2005). Adapting a probabilistic disambiguation model of an HPSG parser to a new domain. In Proc. IJCNLP 2005.
• Generation
– H. Nakanishi, Y. Miyao, and J. Tsujii. (2005). Probabilistic models for disambiguation of an HPSG-based chart generator. In Proc. IWPT 2005.
• Semantics construction
– M. Sato, D. Bekki, Y. Miyao, and J. Tsujii. (2006). Translating HPSG-style Outputs of a Robust Parser into Typed Dynamic Logic. In Proc. COLING-ACL 2006 Poster Session.
• Applications
– Y. Miyao, T. Ohta, K. Masuda, Y. Tsuruoka, K. Yoshida, T. Ninomiya, and J. Tsujii. (2006). Semantic Retrieval for the Accurate Identification of Relational Concepts. In Proc. COLING-ACL 2006.
– A. Yakushiji, Y. Miyao, T. Ohta, Y. Tateisi, and J. Tsujii. (2006). Automatic Construction of Predicate-Argument Structure Patterns for Biomedical Information Extraction. In EMNLP 2006 Poster Session.
Comparing LFG, CCG, HPSG and TAG Acquisition
Demos
Future Work & Discussion