ESSLLI 2006 Advanced Course: Treebank-Based Acquisition of LFG, HPSG and CCG Resources

Josef van Genabith, Dublin City University
Yusuke Miyao, University of Tokyo
Julia Hockenmaier, University of Pennsylvania and University of Edinburgh

ESSLLI 2006, 18th European Summer School for Language, Logic and Information, University of Malaga, July – August 2006

Lecturer Contact Information

• Josef van Genabith, National Centre for Language Technology (NCLT), School of Computing, Dublin City University, Dublin 9, Ireland, [email protected]
• Julia Hockenmaier, [email protected]
• Yusuke Miyao, Department of Computer Science, The University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113-0033, Japan, [email protected]

Motivation

• What do grammars do?
  – Grammars define languages as sets of strings
  – Grammars define which strings are grammatical and which are not
  – Grammars tell us about the syntactic structure of (associated with) strings
• “Shallow” vs. “Deep” grammars
• Shallow grammars do all of the above
• Deep grammars (in addition) relate text to information/meaning representations
• Information: predicate-argument-adjunct structure, deep dependency relations, logical forms, …
• In natural languages, linguistic material is not always interpreted locally where you encounter it: long-distance dependencies (LDDs)
• Resolution of LDDs is crucial for constructing accurate and complete information/meaning representations.
• Deep grammars := (text <-> meaning) + (LDD resolution)

Motivation

• Unification (constraint-based) grammar formalisms (FU, GPSG, PATR-II, …)
  – Lexical-Functional Grammar (LFG)
  – Head-Driven Phrase Structure Grammar (HPSG)
  – Combinatory Categorial Grammar (CCG)
  – Tree-Adjoining Grammar (TAG)
• Traditionally, deep constraint-based grammars are hand-crafted
• LFG ParGram, HPSG LinGO ERG, Core Language Engine (CLE), Alvey Tools, RASP, ALPINO, …
• Wide-coverage, deep unification (constraint-based) grammar development is knowledge-intensive and expensive!
• Very hard to scale hand-crafted grammars to unrestricted text!
• English XLE (Riezler et al. 2002); German XLE (Forst and Rohrer 2006); Japanese XLE (Masuichi and Okuma 2003); RASP (Carroll and Briscoe 2002); ALPINO (Bouma, van Noord and Malouf, 2000)

Motivation

• Instance of the “knowledge acquisition bottleneck” familiar from classical “rationalist” rule/knowledge-based AI/NLP
• Alternative to classical “rationalist” rule/knowledge-based AI/NLP
• “Empiricist” research paradigm (AI/NLP):
  – Corpora, treebanks, …, machine-learning-based and statistical approaches, …
  – Treebank-based grammar acquisition, probabilistic parsing
  – Advantage: grammars can be induced (learned) automatically
  – Very low development cost, wide-coverage, robust, but …
• Most treebank-based grammar induction/parsing technology produces “shallow” grammars
• Shallow grammars don’t resolve LDDs (but see (Johnson 2002); …) and do not map strings to information/meaning representations …

Motivation

• Poses a research question:
• Can we address the knowledge acquisition bottleneck for deep grammar development by combining insights from rationalist and empiricist research paradigms?
• Specifically:
• Can we automatically acquire wide-coverage “deep”, probabilistic, constraint-based grammars from treebanks?
• How do we use them in parsing?
• Can we use them for generation?
• Can we acquire resources for different languages and treebank encodings?
• How do these resources compare with hand-crafted resources?
• …

Course Overview

Monday: Motivation, Course Overview, Introductions to TAG, LFG, CCG, HPSG and Penn-II Treebank, TAG Resources

Tuesday: Penn-II-Based Acquisition of LFG Resources

Wednesday: Penn-II-Based Acquisition of CCG Resources

Thursday: Penn-II-Based Acquisition of HPSG Resources

Friday: Multilingual Resources, Formal Semantics, Comparing LFG, CCG, HPSG and TAG-Based Approaches, Demos, Current and Future Work, Discussion

Course Overview

Tuesday/Wednesday/Thursday

Penn-II-Based Acquisition of XXG Resources:

• Treebank Preprocessing/Clean-Up

• Treebank Annotation/Conversion

• Grammar and Lexicon Extraction

• Parsing (Architectures, Probability Models, Evaluation)

• Generation (Architectures, Probability Models, Evaluation)

• Other (Semantics, Domain Variation, …)

Grammar Formalisms

Grammar formalisms and linguistic theories

• Linguistics aims to explain natural language:
  – What is universal grammar?
  – What are language-specific constraints?
• Formalisms are mathematical theories:
  – They provide a language in which linguistic theories can be expressed (like calculus for physics)
  – They define elementary objects (trees, strings, feature structures) and recursive operations which generate complex objects from simple objects.
  – They do impose linguistic constraints (e.g. on the kinds of dependencies they can capture)

Lexicalised Grammar Formalisms:

TAG, CCG, LFG and HPSG

Lexicalised formalisms (TAG, CCG, LFG and HPSG)

• The lexicon:
  – pairs words with elementary objects
  – specifies all language-specific information (number and location of arguments, control and binding theory)
• The grammatical operations:
  – are universal
  – define (and impose constraints on) recursion

TAG, CCG, LFG and HPSG

• They describe different kinds of linguistic objects:
  – TAG is a theory of trees
  – CCG is a theory of (syntactic and semantic) types
  – LFG is a multi-level theory based on a projection architecture relating different types of linguistic objects (trees, AVMs, linear logic–based semantics)
  – HPSG uses a single, uniform formalism (typed feature structures) to describe phonological, morphological, syntactic and semantic representations (signs)
• They differ in details:
  – treatment of wh-movement, coordination, etc.

TAG, CCG, LFG and HPSG

• TAG and CCG are weakly equivalent.
• Both are mildly context-sensitive:
  – can capture Dutch crossing dependencies
  – but are still efficiently parseable (in polynomial time)
• LFG is context-sensitive

Tree-Adjoining Grammar (TAG)

(Lexicalized) Tree-Adjoining Grammar

• TAG is a tree-rewriting formalism:
  – TAG defines operations (substitution and adjunction) on trees.
  – The elementary objects in TAG are trees (not strings)
• TAG is lexicalized:
  – Each elementary tree is anchored to a lexical item (word)
  – “Extended domain of locality”: the elementary tree contains all arguments of the anchor.
  – TAG requires a linguistic theory which specifies the shape of these elementary trees.
• TAG is mildly context-sensitive:
  – can capture Dutch crossing dependencies
  – but is still efficiently parseable

A. K. Joshi and Y. Schabes (1996) Tree Adjoining Grammars. In G. Rozenberg and A. Salomaa, Eds., Handbook of Formal Languages

TAG substitution (arguments)

[Figure: the substitution operation on two trees with nodes labelled X and Y, shown with the resulting derived tree and the derivation tree.]

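TAG adjunction (modifiers)

[Figure: the ADJOIN operation. An auxiliary tree with root X and foot node X* is adjoined at a node labelled X; the slide shows the resulting derived tree and the derivation tree.]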

A small TAG lexicon

eats:    (S NP (VP (VBZ eats) NP))
John:    (NP John)
always:  (VP (RB always) VP*)
tapas:   (NP tapas)

A TAG derivation

[Figure sequence over three slides, deriving “John always eats tapas” from the small TAG lexicon:]
1. The elementary trees for eats, John, tapas and always, with the NP nodes of the eats tree open for substitution.
2. John and tapas are substituted at the NP nodes; the auxiliary tree for always (VP over RB and VP*) is adjoined at the VP node.
3. The resulting derived tree: (S (NP John) (VP (RB always) (VP (VBZ eats) (NP tapas))))
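The two TAG operations can be tried out on the small lexicon with a few lines of Python. This is only a minimal sketch under simplifying assumptions: the `Node` class and the helper names `substitute`, `adjoin` and `_plug_foot` are hypothetical, substitution always targets the leftmost open leaf with a matching label, and adjunction at the root of a tree is not handled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str                                   # e.g. "S", "NP", "VP"
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None                   # set on lexical anchors
    foot: bool = False                           # True for the foot node (VP*)

    def leaves(self) -> List[str]:
        """Return the words of the tree from left to right."""
        if self.word is not None:
            return [self.word]
        return [w for child in self.children for w in child.leaves()]


def substitute(tree: Node, initial: Node) -> bool:
    """Put `initial` at the leftmost open leaf carrying the label of its root."""
    for i, child in enumerate(tree.children):
        is_open = not child.children and child.word is None and not child.foot
        if is_open and child.label == initial.label:
            tree.children[i] = initial
            return True
        if substitute(child, initial):
            return True
    return False


def _plug_foot(aux: Node, subtree: Node) -> bool:
    """Replace the foot node of the auxiliary tree with `subtree`."""
    for i, child in enumerate(aux.children):
        if child.foot:
            aux.children[i] = subtree
            return True
        if _plug_foot(child, subtree):
            return True
    return False


def adjoin(tree: Node, aux: Node, label: str) -> bool:
    """Adjoin `aux` at the first internal (non-root) node labelled `label`."""
    for i, child in enumerate(tree.children):
        if child.label == label and child.children:
            _plug_foot(aux, child)       # excised subtree goes under the foot
            tree.children[i] = aux       # auxiliary tree takes its place
            return True
        if adjoin(child, aux, label):
            return True
    return False


# Elementary trees of the small TAG lexicon.
eats = Node("S", [Node("NP"), Node("VP", [Node("VBZ", word="eats"), Node("NP")])])
john = Node("NP", word="John")
tapas = Node("NP", word="tapas")
always = Node("VP", [Node("RB", word="always"), Node("VP", foot=True)])

substitute(eats, john)       # fills the subject NP
substitute(eats, tapas)      # fills the object NP
adjoin(eats, always, "VP")   # wraps the VP with the modifier
sentence = " ".join(eats.leaves())
print(sentence)
```

Substitution only fills leaves, whereas adjunction splices a tree into the middle of another one, which is what allows modifiers to be added without changing the elementary tree of the verb.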

Combinatory Categorial Grammar (CCG)

Combinatory Categorial Grammar

• CCG is a lexicalized grammar formalism (the “rules” of the grammar are completely general, all language-specific information is given in the lexicon)
• CCG is nearly context-free (can capture Dutch crossing dependencies, but is still efficiently parseable)
• CCG has a flexible constituent structure
• CCG has a simple, unified treatment of extraction and coordination
• CCG has a transparent syntax-semantics interface (every syntactic category and operation has a semantic counterpart)
• CCG rules are monotonic (movement or traces don’t exist)
• CCG rules are type-driven, not structure-driven (this means e.g. that intransitive verbs and VPs are indistinguishable)

CCG: the machinery

• Categories: specify subcat lists of words/constituents.
• Combinatory rules: specify how constituents can combine.
• The lexicon: specifies which categories a word can have.
• Derivations: spell out the process of combining constituents.

CCG categories

• Simple categories: NP, S, PP
• Complex categories: functions which return a result when combined with an argument:
    VP or intransitive verb:  S\NP
    Transitive verb:          (S\NP)/NP
    Adverb:                   (S\NP)\(S\NP)
    PPs:                      ((S\NP)\(S\NP))/NP
                              (NP\NP)/NP
• Every category has a semantic interpretation

Function application

• Combines a function with its argument to yield a result:

    (S\NP)/NP   NP      ->   S\NP
    eats        tapas        eats tapas

    NP     S\NP         ->   S
    John   eats tapas        John eats tapas

• Used in all variants of categorial grammar
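Function application is easy to make concrete. The following is a minimal sketch, not a CCG parser: the class `Fn` and the function `apply_rule` are names chosen here for illustration, atomic categories are plain strings, and argument matching is simple equality.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Fn:
    """A complex category: result, slash direction, argument."""
    result: "Cat"
    slash: str          # "/" looks for its argument to the right, "\\" to the left
    arg: "Cat"

    def __str__(self) -> str:
        def wrap(cat):
            return f"({cat})" if isinstance(cat, Fn) else str(cat)
        return f"{wrap(self.result)}{self.slash}{wrap(self.arg)}"


Cat = Union[str, Fn]


def apply_rule(left: Cat, right: Cat) -> Optional[Cat]:
    """Return the result of function application, or None if it does not apply."""
    # Forward application:  X/Y  Y  ->  X
    if isinstance(left, Fn) and left.slash == "/" and left.arg == right:
        return left.result
    # Backward application:  Y  X\Y  ->  X
    if isinstance(right, Fn) and right.slash == "\\" and right.arg == left:
        return right.result
    return None


IV = Fn("S", "\\", "NP")      # S\NP       (VP or intransitive verb)
TV = Fn(IV, "/", "NP")        # (S\NP)/NP  (transitive verb)

vp = apply_rule(TV, "NP")     # eats + tapas
s = apply_rule("NP", vp)      # John + eats tapas
print(TV, "NP", "->", vp)
print("NP", vp, "->", s)
```

Because the rules only inspect categories, the same two cases cover every word in the lexicon; nothing in `apply_rule` is specific to English or to the verb eats.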

A (C)CG derivation

Type-raising and function composition

• Type-raising: turns an argument into a function. Corresponds to case:

    NP -> S/(S\NP)              (nominative)
    NP -> (S\NP)\((S\NP)/NP)    (accusative)

• Function composition: composes two functions (complex categories)

    (S\NP)/PP   PP/NP       ->   (S\NP)/NP
    S/(S\NP)    (S\NP)/NP   ->   S/NP
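The two composition examples and nominative type-raising can be checked mechanically. This sketch encodes a complex category as a `(result, slash, argument)` tuple and an atomic category as a string; the function names are hypothetical, and only the forward variants of both rules are implemented.

```python
def show(cat):
    """Render a category in the usual slash notation."""
    if isinstance(cat, str):
        return cat
    result, slash, arg = cat

    def wrap(c):
        return c if isinstance(c, str) else f"({show(c)})"

    return f"{wrap(result)}{slash}{wrap(arg)}"


def type_raise_forward(x, t="S"):
    """Forward type-raising:  X  ->  T/(T\\X)."""
    return (t, "/", (t, "\\", x))


def compose_forward(left, right):
    """Forward composition:  X/Y  Y/Z  ->  X/Z  (None if the rule does not apply)."""
    if (isinstance(left, tuple) and isinstance(right, tuple)
            and left[1] == "/" and right[1] == "/" and left[2] == right[0]):
        return (left[0], "/", right[2])
    return None


IV = ("S", "\\", "NP")            # S\NP
TV = (IV, "/", "NP")              # (S\NP)/NP

subject = type_raise_forward("NP")            # S/(S\NP), nominative
s_slash_np = compose_forward(subject, TV)     # S/NP

v_pp = (IV, "/", "PP")                        # (S\NP)/PP
prep = ("PP", "/", "NP")                      # PP/NP
v_np = compose_forward(v_pp, prep)            # (S\NP)/NP

print(show(subject), show(TV), "->", show(s_slash_np))
print(show(v_pp), show(prep), "->", show(v_np))
```

The category S/NP built for a subject plus transitive verb is what lets an incomplete string such as "John eats" act as a constituent in extraction and coordination.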

Type-raising and Composition

• Wh-movement:

• Right-node raising:

Another CCG derivation

• We will only be concerned with canonical “normal-form” derivations, which only use function composition and type-raising when syntactically necessary.

CCG: semantics

• Every syntactic category and rule has a semantic counterpart:

The CCG lexicon

• Pairs words with their syntactic categories (and semantic interpretation):

    eats   (S\NP)/NP   λxλy.eats’xy
           S\NP        λx.eats’x

• The main bottleneck for wide-coverage CCG parsing
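A lexicon of this kind can be pictured as a mapping from words to (category, interpretation) pairs. The sketch below is an illustration only: the dictionary `LEXICON`, the entries for John and tapas, and the use of Python lambdas and tuples as stand-ins for lambda terms are assumptions made here, not part of the course material.

```python
# Hypothetical mini-lexicon: each word maps to a list of
# (syntactic category, semantic interpretation) pairs.
LEXICON = {
    "eats": [
        # transitive: takes the object first, then the subject
        ("(S\\NP)/NP", lambda x: lambda y: ("eats", x, y)),
        # intransitive: takes only the subject
        ("S\\NP", lambda x: ("eats", x)),
    ],
    "John": [("NP", "john")],
    "tapas": [("NP", "tapas")],
}

# Every application step in the syntax is mirrored by a function
# application step in the semantics.
tv_cat, tv_sem = LEXICON["eats"][0]
vp_sem = tv_sem(LEXICON["tapas"][0][1])     # eats tapas : S\NP
s_sem = vp_sem(LEXICON["John"][0][1])       # John eats tapas : S

iv_cat, iv_sem = LEXICON["eats"][1]
s_sem_intransitive = iv_sem(LEXICON["John"][0][1])   # John eats : S

print(s_sem)
print(s_sem_intransitive)
```

The fact that eats needs two separate entries hints at why the lexicon is the bottleneck: coverage depends on listing every category a word can take.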

Why use CCG for statistical parsing?

• CCG derivations are binary trees: we can use standard chart parsing techniques.

• CCG derivations represent long-range dependencies and complement-adjunct distinctions directly:

A comparison with Penn Treebank parsers

• Standard Treebank parsers do not recover the null elements and function tags that are necessary for semantic interpretation:

Lexical-Functional Grammar (LFG)

Lexical-Functional Grammar LFG

Lexical-Functional Grammar (LFG) (Bresnan & Kaplan 1981, Bresnan 2001, Dalrymple 2001) is a unification- (or constraint-) based theory of grammar.

Two (basic) levels of representation:

• C-structure: represents surface grammatical configurations such as word order, annotated CFG data structures

• F-structure: represents abstract syntactic functions such as SUBJ(ject), OBJ(ect), OBL(ique), PRED(icate), COMP(lement), ADJ(unct) …, AVM attribute-value matrices/structures

F-structure approximates to basic predicate-argument structure, dependency representation, logical form (van Genabith and Crouch, 1996; 1997)

LFG Grammar Rules and Lexical Entries

LFG Parse Tree (with Equations/Constraints)

LFG Constraint Resolution (1/3)

LFG Constraint Resolution (2/3)

LFG Constraint Resolution (3/3)

LFG Subcategorisation & Long Distance Dependencies

• Subcategorisation:

– Semantic forms (subcat frames): sign< SUBJ, OBJ>

– Completeness: all GFs in semantic form present at local f-structure

– Coherence: only the GFs in semantic form present at local f-structure

• Long Distance Dependencies (LDDs): resolved at f-structure with Functional Uncertainty Equations (regular expressions specifying paths in f-structure).
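Completeness and coherence can be read as two set comparisons between the grammatical functions named in the semantic form and those present in the local f-structure. The sketch below assumes f-structures are nested Python dictionaries and that the semantic form is stored as a string under `PRED`; the set `GOVERNABLE`, the helper names and the example f-structure are illustrative assumptions, and only the outermost f-structure is checked.

```python
# Assumed inventory of governable (subcategorisable) grammatical functions.
GOVERNABLE = {"SUBJ", "OBJ", "OBJ2", "OBL", "COMP", "XCOMP"}


def subcat(fstructure):
    """Extract the GFs from a semantic form such as 'sign<SUBJ,OBJ>'."""
    pred = fstructure["PRED"]
    if "<" not in pred:
        return []
    inside = pred[pred.index("<") + 1:pred.rindex(">")]
    return [gf.strip() for gf in inside.split(",") if gf.strip()]


def is_complete(fstructure):
    """All GFs in the semantic form are present at the local f-structure."""
    return all(gf in fstructure for gf in subcat(fstructure))


def is_coherent(fstructure):
    """Only the GFs in the semantic form are present at the local f-structure."""
    listed = subcat(fstructure)
    return all(gf in listed for gf in fstructure if gf in GOVERNABLE)


# Hand-written f-structure for a sentence like "John signed the contract".
good = {
    "PRED": "sign<SUBJ,OBJ>",
    "SUBJ": {"PRED": "John"},
    "OBJ": {"PRED": "contract"},
    "TENSE": "past",
}
missing_obj = {k: v for k, v in good.items() if k != "OBJ"}
extra_obl = dict(good, OBL={"PRED": "pen"})

print(is_complete(good), is_coherent(good))
print(is_complete(missing_obj), is_coherent(extra_obl))
```

Attributes such as TENSE or adjuncts are ignored by the coherence check because they are not governable functions.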

LFG LDDs: Complement Relative Clause

Head-Driven Phrase Structure Grammar (HPSG)

Head-Driven Phrase Structure Grammar HPSG

• HPSG (Pollard and Sag 1994, Sag et al. 2003) is a unification-/constraint-based theory of grammar
• HPSG is a lexicalized grammar formalism
• HPSG aims to explain generic regularities that underlie phrase structures, lexicons, and semantics, as well as language-specific/-independent constraints
• Syntactic/semantic constraints are uniformly denoted by signs, which are represented with feature structures
• Two components of HPSG
  – Lexical entries represent word-specific constraints (corresponding to elementary objects)
  – Principles express generic grammatical regularities (corresponding to grammatical operations)

Sign

• A sign is a formal representation of combinations of phonological forms, syntactic and semantic constraints

sign
  PHON    string                  (phonological form)
  SYNSEM  synsem                  (syntactic/semantic constraints)
    LOCAL  local                  (local constraints)
      CAT  category               (syntactic category)
        HEAD  head                (syntactic head)
          MOD  synsem             (modifying constraints)
        VAL  valence              (subcategorization frames)
          SPR    list
          SUBJ   list
          COMPS  list
      CONT  content               (semantic representations)
    NONLOCAL  nonlocal            (non-local dependencies)
      QUE    list
      REL    list
      SLASH  list
  DTRS    dtrs                    (daughter structures)

Lexical entries

• Lexical entries express word-specific constraints

    PHON   “loves”
    HEAD   verb
    SUBJ   <HEAD noun>
    COMPS  <HEAD noun>

We use simplified notations in this lecture

Principles

• Principles describe generic regularities of grammar
  – Not corresponding to construction rules
• Head Feature Principle
  – The value of HEAD must be percolated from the head daughter
    [Figure: the HEAD value of the mother and the HEAD value of the head daughter share tag 1]
• Valence Principle
  – Subcats not consumed are percolated to the mother
• Immediate Dominance (ID) Principle
  – A mother and her immediate daughters must satisfy one of the ID schemas
• Many other principles: percolation of NONLOCAL features, semantics construction, etc.

ID schemas

• ID schemas correspond to construction rules in CFGs and other grammar formalisms (written below as mother → daughters; [1] and [2] are tags marking shared values)
  – For subject-head constructions (e.g. “John runs”):
      [SUBJ <>]  →  [1]  [SUBJ <[1]>]
  – For head-complement constructions (e.g. “loves Mary”):
      [COMPS [2]]  →  [COMPS <[1] | [2]>]  [1]
  – For filler-head constructions (e.g. “what he bought”):
      [SLASH [2]]  →  [1]  [SLASH <[1] | [2]>]

Example: HPSG parsing

• Lexical entries determine syntactic/semantic constraints of words

    John:  [HEAD noun, SUBJ <>, COMPS <>]
    saw:   [HEAD verb, SUBJ <HEAD noun>, COMPS <HEAD noun>]
    Mary:  [HEAD noun, SUBJ <>, COMPS <>]

• Principles determine generic constraints of grammar; the schema is unified with the signs for “saw” and “Mary”

    mother:               [HEAD [1], SUBJ [2], COMPS [4]]
    head daughter:        [HEAD [1], SUBJ [2], COMPS <[3] | [4]>]
    complement daughter:  [3]

• Principle application produces phrasal signs

    saw Mary:  [HEAD verb, SUBJ <HEAD noun>, COMPS <>]

• Recursive applications of principles produce syntactic/semantic structures of sentences

    John saw Mary:  [HEAD verb, SUBJ <>, COMPS <>]
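The parsing steps for “John saw Mary” can be replayed in a few lines of Python. This is a deliberately reduced sketch: signs are plain dictionaries with only HEAD, SUBJ and COMPS, full typed feature structure unification is replaced by a simple feature match (`matches`), and the requirement that the head daughter of a subject-head phrase has an empty COMPS list is an assumption added here.

```python
def matches(requirement, sign):
    """True if every feature required is present in `sign` with the same value."""
    return all(sign.get(feature) == value for feature, value in requirement.items())


def head_complement(head, complement):
    """Head-complement schema: COMPS <1 | 2> plus daughter 1 gives COMPS 2."""
    if head["COMPS"] and matches(head["COMPS"][0], complement):
        return {
            "HEAD": head["HEAD"],          # Head Feature Principle
            "SUBJ": head["SUBJ"],          # unconsumed subcats percolate
            "COMPS": head["COMPS"][1:],    # first complement is consumed
        }
    return None


def subject_head(subject, head):
    """Subject-head schema: daughter 1 plus SUBJ <1> gives SUBJ <>."""
    if (not head["COMPS"] and len(head["SUBJ"]) == 1
            and matches(head["SUBJ"][0], subject)):
        return {"HEAD": head["HEAD"], "SUBJ": [], "COMPS": []}
    return None


# Lexical entries in the simplified notation of the lecture.
john = {"HEAD": "noun", "SUBJ": [], "COMPS": []}
mary = {"HEAD": "noun", "SUBJ": [], "COMPS": []}
saw = {"HEAD": "verb", "SUBJ": [{"HEAD": "noun"}], "COMPS": [{"HEAD": "noun"}]}

vp = head_complement(saw, mary)     # "saw Mary"
s = subject_head(john, vp)          # "John saw Mary"
print(vp)
print(s)
```

The schemas themselves never mention verbs or nouns; which phrases can be built is decided entirely by the SUBJ and COMPS values in the lexical entries.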

Example: LDDs

• NONLOCAL features (SLASH, REL, etc.) explain long-distance dependencies
  – WH movements
  – Topicalization
  – Relative clauses, etc.

[Figure: HPSG analysis of “the prices we were charged”. Each node is a sign with HEAD, SUBJ and COMPS features and, where relevant, SPR, SLASH and REL; numbered tags mark values shared between signs.]

Brief Intro to Penn Treebank

The Penn Treebank

• The first large syntactically annotated corpus
• Contains text from different domains:
  – Wall Street Journal (50,000 sentences, 1 million words)
  – Switchboard
  – Brown corpus
  – ATIS
• The annotation:
  – POS-tagged (Ratnaparkhi’s MXPOST)
  – Manually annotated with phrase-structure trees
  – Traces and other null elements used to represent non-local dependencies (movement, PRO, etc.)
  – Designed to facilitate extraction of predicate-argument structure

A Treebank tree

• Relatively flat structures:
  – There is no noun level
  – VP arguments and adjuncts appear at the same level
• Co-indexed null elements indicate long-range dependencies
• Function tags indicate complement-adjunct distinction (?)
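To make the role of co-indexed null elements concrete, the sketch below reads one bracketed tree and links each null element to its antecedent. The example tree is hand-written in Penn Treebank style for illustration and is not taken from the corpus; `read_tree`, `null_elements` and `find_indexed` are hypothetical helper names, and the reader assumes one well-formed tree per string.

```python
import re


def read_tree(text):
    """Parse a bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse())
            else:
                children.append(tokens[pos])   # a word or a null element
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return parse()


def null_elements(tree):
    """Collect the leaves dominated by -NONE-, e.g. '*-1'."""
    label, children = tree
    found = []
    for child in children:
        if isinstance(child, str):
            if label == "-NONE-":
                found.append(child)
        else:
            found.extend(null_elements(child))
    return found


def find_indexed(tree, index):
    """Return the label of the first node carrying the given co-index."""
    label, children = tree
    if label != "-NONE-" and label.endswith("-" + index):
        return label
    for child in children:
        if not isinstance(child, str):
            hit = find_indexed(child, index)
            if hit:
                return hit
    return None


example = ("(S (NP-SBJ-1 (NNP John)) (VP (VBZ seems) "
           "(S (NP-SBJ (-NONE- *-1)) (VP (TO to) (VP (VB sleep))))))")
tree = read_tree(example)
traces = null_elements(tree)
antecedents = {t: find_indexed(tree, t.rsplit("-", 1)[1]) for t in traces}
print(antecedents)
```

Here the empty subject of the embedded clause is linked to the matrix subject, which is the kind of non-local information that the acquisition methods in this course aim to preserve.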

Penn-II Treebank

• Until Congress acts , the government hasn't any authority to issue new debt obligations of any kind , the Treasury said .


Penn-II Treebank


Penn-II Treebank


Penn-II Treebank


Penn-II Treebank

ADJP ADJP-ADV ADJP-CLR ADJP-HLN ADJP-LOC ADJP-MNR ADJP-PRD ADJP-SBJ ADJP-TMP ADJP-TPC ADJP-TTL ADVP ADVP-CLR ADVP-DIR ADVP-EXT ADVP-HLN ADVP-LOC ADVP-MNR ADVP-PRD ADVP-PRP ADVP-PUT ADVP-TMP ADVP-TPC ADVP|PRT CONJP FRAG FRAG-ADV FRAG-HLN FRAG-PRD FRAG-TPC FRAG-TTL

INTJ INTJ-CLR INTJ-HLN LST NAC NAC-LOC NAC-TMP NAC-TTL NP NP-ADV NP-BNF NP-CLR NP-DIR NP-EXT NP-HLN NP-LGS NP-LOC NP-MNR NP-PRD NP-SBJ NP-TMP NP-TPC NP-TTL NP-VOC NX NX-TTL PP PP-BNF PP-CLR PP-DIR PP-DTV

PP-EXT PP-HLN PP-LGS PP-LOC PP-MNR PP-NOM PP-PRD PP-PRP PP-PUT PP-SBJ PP-TMP PP-TPC PP-TTL PRN PRT PRT|ADVP QP RRC S S-ADV S-CLF S-CLR S-HLN S-LOC S-MNR S-NOM S-PRD S-PRP S-SBJ S-TMP S-TPC

S-TTL SBAR SBAR-ADV SBAR-CLR SBAR-DIR SBAR-HLN SBAR-LOC SBAR-MNR SBAR-NOM SBAR-PRD SBAR-PRP SBAR-PUT SBAR-SBJ SBAR-TMP SBAR-TPC SBAR-TTL SBARQ SBARQ-HLN SBARQ-NOM SBARQ-PRD SBARQ-TPC SBARQ-TTL SINV SINV-ADV SINV-HLN SINV-TPC SINV-TTL SQ SQ-PRD SQ-TPC SQ-TTL

UCP UCP-ADV UCP-CLR UCP-DIR UCP-EXT UCP-HLN UCP-LOC UCP-MNR UCP-PRD UCP-PRP UCP-TMP UCP-TPC VP VP-TPC VP-TTL WHADJP WHADVP WHADVP-TMP WHNP WHPP X X-ADV X-CLF X-DIR X-EXT X-HLN X-PUT X-TMP X-TTL X-TTL


Penn-II Treebank (Simple Transitive Verb)


Penn-II Treebank (Simple Coordination)


Penn-II Treebank (Passive)


Penn-II Treebank (Subject WH-Relative Clause)


Penn-II Treebank (WH-Less Complement Relative Cl.)


Penn-II Treebank (Control and WH-Compl. Rel. Cl.)


Penn-II Treebank (Adv. Relative Clause)


Penn-II Treebank (Coord. and Right Node Raising)


The Parseval measure

• Standard evaluation metric for Treebank parsers. Two components:
  – Precision: how many of the proposed NTs are correct?
  – Recall: how many of the correct NTs are proposed?

• Measures recovery of nonterminals (span + syntactic category)

• Ignores function tags and null elements

Has biased research towards parsers that produce linguistically shallow output (Collins, Charniak)
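A minimal sketch of the Parseval computation, assuming trees are given as nested tuples of the form (label, child, ...) with words as plain strings. The function names and the two example trees are illustrative only; skipping preterminals is an assumption of this sketch, not something the slide states.

```python
from collections import Counter

def spans(tree, start=0, out=None):
    """Collect (category, start, end) for every phrasal node of a tree.

    A tree is (label, child, child, ...); a leaf word is a plain string.
    Preterminals (a label over a single word) are skipped (assumption).
    """
    if out is None:
        out = Counter()
    label, children = tree[0], tree[1:]
    end = start
    for child in children:
        if isinstance(child, str):
            end += 1
        else:
            end, _ = spans(child, end, out)
    is_preterminal = len(children) == 1 and isinstance(children[0], str)
    if not is_preterminal:
        # Function tags such as -SBJ are stripped, since Parseval ignores them.
        out[(label.split("-")[0], start, end)] += 1
    return end, out

def parseval(gold, proposed):
    """Return (precision, recall) over labelled spans."""
    _, g = spans(gold)
    _, p = spans(proposed)
    correct = sum((g & p).values())
    return correct / sum(p.values()), correct / sum(g.values())

gold = ("S",
        ("NP-SBJ", ("NN", "Factory"), ("NNS", "payrolls")),
        ("VP", ("VBD", "fell"),
               ("PP-TMP", ("IN", "in"), ("NP", ("NNP", "September")))))
# Hypothetical parser output with a wrong PP span.
proposed = ("S",
            ("NP", ("NN", "Factory"), ("NNS", "payrolls")),
            ("VP", ("VBD", "fell"),
                   ("PP", ("IN", "in")),
                   ("NP", ("NNP", "September"))))

precision, recall = parseval(gold, proposed)
print(precision, recall)
```

Because function tags are stripped before comparison, a parser that never predicts -SBJ or -TMP loses nothing under this metric, which is the bias the slide points out.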


Treebank-Based Acquisition of TAG resources


Extracting a TAG from the Treebank

• Two different approaches:
  – F. Xia. Automatic Grammar Generation From Two Different Perspectives. PhD thesis, University of Pennsylvania, 2001.
  – J. Chen, S. Bangalore, K. Vijay-Shanker. Automated Extraction of Tree-Adjoining Grammars from Treebanks, Natural Language Engineering (forthcoming)

• This lecture: just the basic ideas!


Extracting a TAG from the Penn Treebank

• Input: a Treebank tree (= the TAG derived tree)

• Output: a set of elementary trees (= the TAG lexicon)


Extracting a TAG: the head

- Identify the head path (requires a head percolation table)
- Find the arguments of the head (requires an argument table)
- Ignore modifiers (requires an adjunct table)
- Merge unary productions (VP -> VP)

[Figure: Treebank tree; recoverable node labels: S, NP-SBJ, VP, VP, VBG "making", NP]
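The first step, identifying the head path, can be sketched as follows. The head percolation table below is a toy assumption made up for illustration (it is not the table used by Xia or Chen et al.), and the example sentence is hypothetical apart from the words that appear on the slides.

```python
# Toy head percolation table: parent category -> (search direction, priorities).
# These entries are illustrative assumptions only.
HEAD_TABLE = {
    "S": ("right", ["VP", "S"]),
    "VP": ("left", ["VP", "VBG", "VBZ", "VBD", "VB"]),
    "NP": ("right", ["NN", "NNS", "NNP", "NP"]),
}

def base(label):
    """Strip function tags, e.g. NP-SBJ -> NP."""
    return label.split("-")[0]

def head_child(tree):
    """Pick the head daughter of a (label, child, ...) node."""
    label, children = tree[0], tree[1:]
    direction, priorities = HEAD_TABLE[base(label)]
    ordered = children if direction == "left" else children[::-1]
    for wanted in priorities:
        for child in ordered:
            if base(child[0]) == wanted:
                return child
    return ordered[0]  # fallback when no priority category is present

def head_path(tree):
    """Follow head daughters from the root down to the lexical head."""
    path = [tree[0]]
    while not isinstance(tree[1], str):  # stop at the preterminal
        tree = head_child(tree)
        path.append(tree[0])
    path.append(tree[1])
    return path

# Hypothetical sentence; "Ford" and "offer" are invented for the example.
tree = ("S",
        ("NP-SBJ", ("NNP", "Ford")),
        ("VP", ("VBZ", "is"),
               ("ADVP-MNR", ("RB", "officially")),
               ("VP", ("VBG", "making"),
                      ("NP", ("DT", "the"), ("NN", "offer")))))

print(head_path(tree))
```

With this table the auxiliary "is" is not on the head path, so it would be treated as an adjunct; a different table would give a different grammar, which is why the extracted grammar depends on the head table.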


Extracting a TAG: the head

• This is the elementary tree for the head:


Extracting a TAG: arguments

• Arguments are combined via substitution
• Recurse on the arguments:


Extracting a TAG: adjuncts

• Adjuncts require auxiliary trees (use adjunction to be combined with the head)

• Auxiliary trees require a foot node (with the same label as the root)

[Figure: extracted trees; recoverable node labels: VP, VP, VBZ "is"; ADVP-MNR "officially"; NP, DT "the"]



Special cases

• Coordination

• Null elements (e.g. traces for wh-movement): The trace has to be part of the elementary tree of the main verb

• Punctuation marks


Wh-movement: relative clauses

(NP (NP a charge)
    (SBAR (WHNP-2 (-NONE- 0))
          (S (NP-SBJ Mr. Coleman)
             (VP (VBZ denies)
                 (NP (-NONE- *T*-2))))))

[Figure: extracted tree for "denies"; recoverable node labels: NP, NP, NP, SBAR, NP, S, VP, VBZ "denies", WHNP, -NONE- 0, -NONE- *T*-2]
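To see how the co-indexation ties the trace to its filler, the bracketing above can be read programmatically. This is a minimal sketch; `parse` and `link_traces` are hypothetical helper names, and only *T* traces with a numeric index are handled.

```python
import re

def parse(text):
    """Parse a Penn-style bracketing into (label, child, ...) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # consume ")"
        return (label, *children)

    return node()

def link_traces(tree):
    """Map each index to (filler label, trace token) where both occur."""
    fillers, traces = {}, {}

    def walk(t):
        label, children = t[0], t[1:]
        m = re.search(r"-(\d+)$", label)      # e.g. WHNP-2
        if m:
            fillers[m.group(1)] = label
        for child in children:
            if isinstance(child, str):
                tm = re.fullmatch(r"\*T\*-(\d+)", child)
                if tm:
                    traces[tm.group(1)] = child
            else:
                walk(child)

    walk(tree)
    return {i: (fillers[i], traces[i]) for i in traces if i in fillers}

TEXT = """(NP (NP a charge)
    (SBAR (WHNP-2 (-NONE- 0))
          (S (NP-SBJ Mr. Coleman)
             (VP (VBZ denies)
                 (NP (-NONE- *T*-2))))))"""

links = link_traces(parse(TEXT))
print(links)
```

The link between WHNP-2 and *T*-2 is what forces both nodes into the same elementary tree as "denies".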


Evaluating an extracted grammar/lexicon

• Grammar/lexicon size?
  – Depends on head table, argument/adjunct distinction, treatment of null elements, mapping of Treebank labels/POS tags to categories in extracted grammar etc.
  – For TAGs, between 3,000 and 8,500 elementary tree types, and between 100,000 and 130,000 lexical entries.

• Lexical coverage?
  – For TAGs, around 92-93%

• Distribution of tree types?

• Convergence?

• Quality?
  – Inspection, comparison with manual grammar


References: TAG extraction

TAG:
A.K. Joshi and Y. Schabes (1996) Tree Adjoining Grammars. In G. Rozenberg and A. Salomaa, Eds., Handbook of Formal Languages

TAG extraction:
F. Xia. Automatic Grammar Generation From Two Different Perspectives. PhD thesis, University of Pennsylvania, 2001.
J. Chen, S. Bangalore, K. Vijay-Shanker. Automated Extraction of Tree-Adjoining Grammars from Treebanks, Natural Language Engineering (forthcoming)
Also: L. Shen and A.K. Joshi, Building an LTAG Treebank, Technical Report MS-CIS-05-15, CIS Department, University of Pennsylvania, 2005

Parsing with extracted TAGs:
D. Chiang. Statistical parsing with an automatically extracted tree adjoining grammar. In Data Oriented Parsing, CSLI Publications, pages 299–316.
L. Shen and A.K. Joshi. Incremental LTAG parsing, HLT/EMNLP 2005


Penn-II-Based Acquisition of LFG Resources

Lexical-Functional Grammar


Penn-II-Based Acquisition of LFG Resources

• Introduction

• Treebank Preprocessing/Clean-Up

• Treebank Annotation/Conversion

• Grammar and Lexicon Extraction

• Parsing (Architectures, Probability Models, Evaluation)

• Generation (Architectures, Probability Models, Evaluation)

• Other (Semantics, Domain Variation, …)


Introduction: Penn-II & LFG

• If we had an f-structure-annotated version of Penn-II, we could use (standard) machine learning methods to extract probabilistic, wide-coverage LFG resources

• How do we get an f-structure-annotated Penn-II?

• Manually? No: 50,000 trees …!

• Automatically! Yes: F-Structure annotation algorithm … !

• Penn-II is a 2nd generation treebank – contains lots of annotations to support derivation of deep meaning representations: trees, Penn-II “functional” tags, traces & coindexation – f-structure annotation algorithm can exploit those.


Introduction: Penn-II & LFG

• What is the task?

• Given a Penn-II tree, the f-structure annotation algorithm has to traverse the tree and associate all tree nodes with f-structure equations (including lexical equations at the leaves of the tree).

• A simple example


Introduction: Penn-II & LFG

Factory payrolls fell in September.

S
  NP-SBJ  [↑subj=↓]
    NN  Factory  [↓∈↑adjunct]
    NNS  payrolls  [↑=↓]
  VP  [↑=↓]
    VBD  fell  [↑=↓]
    PP-TMP  [↓∈↑adjunct]
      IN  in  [↑=↓]
      NP  [↑obj=↓]
        NNP  September  [↑=↓]


Introduction: Penn-II & LFG

subj : pred : payroll
       num : pl
       pers : 3
       adjunct : 2 : pred : factory
                     num : sg
                     pers : 3
adjunct : 1 : pred : in
              obj : pred : september
                    num : sg
                    pers : 3
pred : fall
tense : past


Treebank Preprocessing/Clean-Up: Penn-II & LFG

• Penn-II treebank: often flat analyses (coordination, NPs …), a certain amount of noise: inconsistent annotations, errors …

• No treebank preprocessing or clean-up in the LFG approach

• Take Penn-II treebank as is, but

• Remove all trees with FRAG or X labelled constituents

• FRAG = fragments, X = not known how to annotate

• Total of 48,424 trees as they are.
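The one filtering step described above can be sketched as a simple test on the bracketed tree text. This is an illustrative sketch, not the authors' code; the regular expression and the second and third example trees are assumptions.

```python
import re

# A constituent labelled FRAG or X, optionally followed by function tags
# (e.g. FRAG-HLN, X-ADV), directly after an opening bracket.
EXCLUDED = re.compile(r"\((?:FRAG|X)(?:-[A-Z]+)*\s")

def keep(tree_text):
    """True if the tree contains no FRAG or X constituent."""
    return EXCLUDED.search(tree_text) is None

trees = [
    "(S (NP-SBJ (NN Factory) (NNS payrolls)) (VP (VBD fell)))",
    "(FRAG (NP (NN Profit)) (. .))",                         # hypothetical
    "(S (NP-SBJ (PRP It)) (VP (VBZ is) (X-ADV (RB so))))",   # hypothetical
]

kept = [t for t in trees if keep(t)]
print(len(kept), "of", len(trees), "trees kept")
```

Everything else in the treebank is passed on unchanged, in line with the "take Penn-II as is" policy.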


Treebank Annotation: Penn-II & LFG

• Annotation-based (rather than conversion-based)
• Automatic annotation of nodes in Penn-II treebank trees with f-structure equations
• F-structure Annotation Algorithm
• Annotation Algorithm exploits:
  – Head information
  – Categorial information
  – Configurational information
  – Penn-II functional tags
  – Trace information


Treebank Annotation: Penn-II & LFG

• Architecture of a modular algorithm to assign LFG f-structure equations to trees in the Penn-II treebank:

[Figure: annotation pipeline. Modules: Head-Lexicalisation [Magerman, 1994]; Left-Right Context Annotation Principles; Coordination Annotation Principles; Catch-All and Clean-Up; Traces. Outputs: Proto F-Structures; Proper F-Structures]


Treebank Annotation: Penn-II & LFG

• Head Lexicalisation: modified rules based on (Magerman, 1994)


Treebank Annotation: Penn-II & LFG

Left-Right Context Annotation Principles:

• Head of NP likely to be rightmost noun …
• Mother → Left Context Head Right Context

[Figure: rule schema with regions labelled Left Context, Head, Right Context]


Treebank Annotation: Penn-II & LFG

NP: Left-Right Annotation Matrix

Left Context:
  – DT: ↑spec:det=↓
  – QP: ↑spec:quant=↓
  – JJ, ADJP: ↓∈↑adjunct
Head:
  – NN, NNS: ↑=↓
Right Context:
  – NP: ↓∈↑app
  – PP: ↓∈↑adjunct
  – S, SBAR: ↓∈↑relmod

Example: NP → DT ADJP NN ("a very politicized deal")
  Before annotation: (NP (DT a) (ADJP (RB very) (JJ politicized)) (NN deal))
  After annotation:
    DT a: ↑spec:det=↓
    ADJP (RB very) (JJ politicized): ↓∈↑adjunct
    NN deal: ↑=↓
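Applying the NP matrix to a rule can be sketched as a table lookup keyed by the position of each daughter relative to the head. This is a minimal illustration under stated assumptions: the set-membership equations are written as ↓∈↑adjunct, the head position is passed in rather than computed, and `annotate_np` is a hypothetical name.

```python
# NP annotation matrix from the slide: category -> equation, by position
# relative to the head daughter.
NP_MATRIX = {
    "left": {"DT": "↑spec:det=↓", "QP": "↑spec:quant=↓",
             "JJ": "↓∈↑adjunct", "ADJP": "↓∈↑adjunct"},
    "head": {"NN": "↑=↓", "NNS": "↑=↓"},
    "right": {"NP": "↓∈↑app", "PP": "↓∈↑adjunct",
              "S": "↓∈↑relmod", "SBAR": "↓∈↑relmod"},
}

def annotate_np(daughters, head_index):
    """Pair each daughter category with its equation (None if not covered)."""
    annotated = []
    for i, cat in enumerate(daughters):
        if i < head_index:
            side = "left"
        elif i == head_index:
            side = "head"
        else:
            side = "right"
        base = cat.split("-")[0]  # ignore function tags such as -TMP
        annotated.append((cat, NP_MATRIX[side].get(base)))
    return annotated

# NP -> DT ADJP NN ("a very politicized deal"), head = NN at index 2.
result = annotate_np(["DT", "ADJP", "NN"], head_index=2)
for cat, equation in result:
    print(cat, equation)
```

Daughters that the matrix does not cover come back as None in this sketch; in the full algorithm such nodes are left for later modules.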


Treebank Annotation: Penn-II & LFG

ADJP ADJP-ADV ADJP-CLR ADJP-HLN ADJP-LOC ADJP-MNR ADJP-PRD ADJP-SBJ ADJP-TMP ADJP-TPC ADJP-TTL ADVP ADVP-CLR ADVP-DIR ADVP-EXT ADVP-HLN ADVP-LOC ADVP-MNR ADVP-PRD ADVP-PRP ADVP-PUT ADVP-TMP ADVP-TPC ADVP|PRT CONJP FRAG FRAG-ADV FRAG-HLN FRAG-PRD FRAG-TPC FRAG-TTL

INTJ INTJ-CLR INTJ-HLN LST NAC NAC-LOC NAC-TMP NAC-TTL NP NP-ADV NP-BNF NP-CLR NP-DIR NP-EXT NP-HLN NP-LGS NP-LOC NP-MNR NP-PRD NP-SBJ NP-TMP NP-TPC NP-TTL NP-VOC NX NX-TTL PP PP-BNF PP-CLR PP-DIR PP-DTV

PP-EXT PP-HLN PP-LGS PP-LOC PP-MNR PP-NOM PP-PRD PP-PRP PP-PUT PP-SBJ PP-TMP PP-TPC PP-TTL PRN PRT PRT|ADVP QP RRC S S-ADV S-CLF S-CLR S-HLN S-LOC S-MNR S-NOM S-PRD S-PRP S-SBJ S-TMP S-TPC

S-TTL SBAR SBAR-ADV SBAR-CLR SBAR-DIR SBAR-HLN SBAR-LOC SBAR-MNR SBAR-NOM SBAR-PRD SBAR-PRP SBAR-PUT SBAR-SBJ SBAR-TMP SBAR-TPC SBAR-TTL SBARQ SBARQ-HLN SBARQ-NOM SBARQ-PRD SBARQ-TPC SBARQ-TTL SINV SINV-ADV SINV-HLN SINV-TPC SINV-TTL SQ SQ-PRD SQ-TPC SQ-TTL

UCP UCP-ADV UCP-CLR UCP-DIR UCP-EXT UCP-HLN UCP-LOC UCP-MNR UCP-PRD UCP-PRP UCP-TMP UCP-TPC VP VP-TPC VP-TTL WHADJP WHADVP WHADVP-TMP WHNP WHPP X X-ADV X-CLF X-DIR X-EXT X-HLN X-PUT X-TMP X-TTL X-TTL


Treebank Annotation: Penn-II & LFG

• Do annotation matrix for each of the monadic categories (without –Fun tags) in Penn-II

• Based on analysing the most frequent rule types for each category, such that the sum total of token frequencies of these rule types is greater than 85% of the total number of rule tokens for that category

Number of rule types needed to cover 100% and 85% of the rule tokens:

Category    100%     85%
NP          6595     102
VP         10239     307
S           2602      20
ADVP         234       6

• Apply annotation matrix to all (i.e. also unseen) rules/sub-trees, i.e. also those NP-LOC, NP-TMP etc.
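The 85% selection criterion can be sketched as a cumulative-frequency cut-off over rule types. The rule counts below are made up for illustration; `frequent_rule_types` is a hypothetical helper, not part of the described system.

```python
from collections import Counter

def frequent_rule_types(rule_tokens, threshold=0.85):
    """Return the most frequent rule types whose summed token frequency
    is greater than `threshold` of all rule tokens."""
    counts = Counter(rule_tokens)
    total = sum(counts.values())
    selected, covered = [], 0
    for rule, freq in counts.most_common():
        if covered > threshold * total:
            break
        selected.append(rule)
        covered += freq
    return selected

# Invented token counts for NP rules (100 tokens, 5 types).
tokens = (["NP -> DT NN"] * 60 + ["NP -> NNP"] * 20 +
          ["NP -> DT JJ NN"] * 10 + ["NP -> NP PP"] * 6 + ["NP -> QP"] * 4)

selected = frequent_rule_types(tokens)
print(selected)
```

In the toy data three of five rule types already cover 90% of the tokens, which mirrors the effect in the table: a small set of frequent rule types accounts for most rule tokens.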


Treebank Annotation: Penn-II & LFG

• Co-ordination Annotation Principles
• Often flat Penn-II analysis of coordination:

[Figure: flat coordination tree with labels Co-ordinated Element, Object, Modifier]


Treebank Annotation: Penn-II & LFG

• Unlike constituents coordination:

Co-ordinated Element


Treebank Annotation: Penn-II & LFG

Traces Module:

• Long Distance Dependencies

• Topicalisation
• Wh- and wh-less questions
• Relative clauses
• Passivisation
• Control constructions
• ICH (interpret constituent here)
• RNR (right node raising)
• …

• Translate Penn-II traces and coindexation into corresponding reentrancy in f-structure


Treebank Annotation: WH-Relative Clauses


Treebank Annotation: Wh-Less Relative Clauses


Treebank Annotation: Control & Wh-Rel. LDD


Treebank Annotation: Adv. Relative Clause


Treebank Annotation: Right Node Raising


Treebank Annotation: Right Node Raising


Treebank Annotation: Penn-II & LFG

Catch-All and Clean-Up Module:

• Penn-II Functional Tags are used to identify potential errors
  – e.g. Nodes with the tag -SBJ should be annotated as the subject …

• Correction of Overgeneralisations
  – e.g. Change a second OBJ annotation to OBJ2 …
  – e.g. Change arguments of head nouns erroneously annotated as relative clauses to COMP arguments:
    • … signs [that managers expect declines]_RELCL …
    • … signs [that managers expect declines]_COMP …

• Unannotated Nodes
  – Defaults …


Treebank Annotation: Penn-II & LFG

[Figure: annotation pipeline. Modules: Head-Lexicalisation [Magerman, 1994]; Left-Right Context Annotation Principles; Coordination Annotation Principles; Catch-All and Clean-Up; Traces; Constraint Solver. Outputs: Proto F-Structures; Proper F-Structures]


Treebank Annotation: Penn-II & LFG

• Collect f-structure equations
• Send to constraint solver
• Generates f-structures

• F-structure annotation algorithm implemented in Java, constraint solver in Prolog

• ~3 min annotating approx. 50,000 Penn-II trees
• ~5 min producing approx. 50,000 f-structures


Treebank Annotation: Penn-II & LFG


Treebank Annotation: Penn-II & LFG


Treebank Annotation: Penn-II & LFG

Evaluation (Quantitative):

• Burke (2006)

• Coverage:

Over 99.8% of Penn-II sentences (without X and FRAG constituents) receive a single covering and connected f-structure:

# F-structures    # Sentences    Percentage
0                 45             0.093%
1                 48329          99.804%
2                 50             0.103%
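As a quick arithmetic check, the percentages follow directly from the sentence counts, and the counts sum to the 48,424 trees kept after removing FRAG and X trees. The variable names are illustrative.

```python
# Sentences by number of f-structure fragments produced (from the slide).
counts = {0: 45, 1: 48329, 2: 50}

total = sum(counts.values())
shares = {k: round(100 * v / total, 3) for k, v in counts.items()}

print(total)
for k, share in shares.items():
    print(k, f"{share}%")
```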


Treebank Annotation: Penn-II & LFG

Evaluation (Qualitative):

• Burke (2006)

• F-structure quality evaluation against DCU 105, a manually annotated dependency gold standard of 105 sentences randomly extracted from WSJ section 23.

• Triples are extracted from the gold standard and the automatically produced f-structures using the evaluation software from (Crouch et al. 2002) and (Riezler et al. 2002):

relation(predicate~0, argument~1)

• Results calculated in terms of Precision and Recall
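Triple-based scoring can be sketched with plain set operations. This is not the evaluation software cited above; the triples are hand-written for "Factory payrolls fell in September", the index numbering is an assumption, and the error in the test set is invented.

```python
def precision_recall(gold, test):
    """Precision and recall of a set of dependency triples against gold."""
    correct = len(gold & test)
    return correct / len(test), correct / len(gold)

# Triples in the form (relation, predicate~i, argument~j).
gold = {
    ("subj", "fall~0", "payroll~1"),
    ("adjunct", "payroll~1", "factory~2"),
    ("adjunct", "fall~0", "in~3"),
    ("obj", "in~3", "september~4"),
}
test = {
    ("subj", "fall~0", "payroll~1"),
    ("adjunct", "fall~0", "factory~2"),   # invented attachment error
    ("adjunct", "fall~0", "in~3"),
    ("obj", "in~3", "september~4"),
}

precision, recall = precision_recall(gold, test)
print(precision, recall)
```

A single wrong attachment costs one triple on each side, so both measures drop together here; missing or spurious triples would move them apart.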


Treebank Annotation: Penn-II & LFG

• Precision and Recall for DCU 105 Dependency Bank results are calculated for All Annotations and for Preds-Only

DCU 105 All Annotations Preds-Only

Precision 97.06% 94.28%

Recall 96.80% 94.28%


Treebank Annotation: Penn-II & LFG

DCU 105

Feature Precision Recall F-Score

adjunct     892/968 = 92     892/950 = 94     93
app         16/16 = 100      16/19 = 84       91
comp        88/92 = 96       88/102 = 86      91
coord       153/184 = 83     153/167 = 92     87
obj         442/459 = 96     442/461 = 96     96
obl         50/52 = 96       50/61 = 82       88
oblag       12/12 = 100      12/12 = 100      100
passive     76/79 = 96       76/80 = 95       96
poss        74/79 = 94       74/81 = 91       92
quant       40/64 = 62       40/52 = 77       69
relmod      46/48 = 96       46/50 = 92       94
subj        396/412 = 96     396/414 = 96     96
topic       13/13 = 100      13/13 = 100      100
topicrel    46/49 = 94       46/52 = 88       91
xcomp       145/153 = 95     145/146 = 99     97
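The F-Score column is consistent with the usual harmonic mean of precision and recall, which simplifies to 2c / (p + g) for c correct triples, p proposed triples and g gold triples. A small check on three rows of the table, assuming that this is how the column was computed:

```python
def f_score(correct, proposed, gold):
    """Harmonic mean of precision (correct/proposed) and recall (correct/gold)."""
    return 2 * correct / (proposed + gold)

# (correct, proposed, gold) counts taken from the table rows.
rows = {
    "adjunct": (892, 968, 950),
    "app": (16, 16, 19),
    "quant": (40, 64, 52),
}

scores = {name: round(100 * f_score(*counts)) for name, counts in rows.items()}
print(scores)
```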


Treebank Annotation: Penn-II & LFG

• Following (Kaplan et al. 2004) Precision and Recall for PARC 700 Dependency Bank calculated for:
  – all annotations
  – PARC features
  – preds-only

• Mapping required
• (Burke 2006)

PARC 700     PARC features
Precision    88.31%
Recall       86.38%


Grammar and Lexicon Extraction : Penn-II & LFG

Lexical Resources:

• Lexical information extremely important in modern lexicalised grammar formalisms
• LFG, HPSG, CCG, TAG, …
• Lexicon development is time consuming and extremely expensive
• Rarely if ever complete
• Familiar knowledge acquisition bottleneck …
• Subcategorisation frame induction (LFG semantic forms) from f-Structure annotated version of Penn-II and -III
• Evaluation against COMLEX


Grammar and Lexicon Extraction: Penn-II & LFG

• Lexicon Construction
  – Manual vs. Automated

Our Approach:

  – F-Structure Annotation of Penn-II and Penn-III
  – Frames not Predefined
  – Functional and Categorial Information
  – Parameterised for Prepositions and Particles
  – Active and Passive
  – Long Distance Dependencies
  – Conditional Probabilities


Grammar and Lexicon Extraction: Penn-II & LFG

• Extraction Methodology
  – Automatic F-Structure Annotation of Penn-II & III
  – Lexical Extraction Algorithm
  – Examples

• Evaluation
  – Gold Standards (COMLEX, OALD)
  – Experimental Architecture
  – Results


Grammar and Lexicon Extraction: Penn-II & LFG

sign<subj,obj>


Grammar and Lexicon Extraction: Penn-II & LFG

• Semantic Forms: PRED<GF1, GF2, …, GFn>

• Governable Grammatical Functions (Arguments)

– SUBJ, OBJ, OBJθ, OBL, OBLθ, COMP, XCOMP, PART…

• Non-Governable Grammatical Functions (Adjuncts)

– ADJ, XADJ, APP, RELMOD, …


Grammar and Lexicon Extraction: Penn-II & LFG

Penn-II Treebank

Automatic F-Structure Annotation Algorithm

LFG F-Structures

Extraction Algorithm

Semantic Forms


Grammar and Lexicon Extraction: Penn-II & LFG

Extraction Algorithm:

For each f-structure F
  For each level of embedding in F
    Determine the local predicate PRED
    Collect all subcategorisable grammatical functions GF1, …, GFn
    Return: PRED<GF1, GF2, …, GFn>


Grammar and Lexicon Extraction: Penn-II & LFG

subj : spec : det : pred : the
       pred : inquiry
       num : sg
       pers : 3
adjunct : 1 : pred : soon
pred : focus
tense : past
obl : pform : on
      obj : spec : det : pred : the
            pred : judge
            num : sg
            pers : 3

“The inquiry soon focused on the judge” (wsj_0267_72)

Prepositions and OBLs:

focus([subj,obl:on])

on([obj])
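The extraction algorithm can be sketched on this f-structure, encoded as nested Python dictionaries. This is a simplified illustration, not the authors' implementation: the list of governable functions is a subset, predicates without arguments are skipped, and using pform as the local predicate of an oblique that has no pred is an assumption made so that the sketch also yields on([obj]).

```python
# Simplified subset of the governable grammatical functions.
GOVERNABLE = ("subj", "obj", "obl", "comp", "xcomp", "part")

def semantic_forms(fs, forms=None):
    """Collect one semantic form per level of embedding that has arguments."""
    if forms is None:
        forms = []
    if isinstance(fs, dict):
        # Assumption: fall back to pform when a level has no pred.
        pred = fs.get("pred", fs.get("pform"))
        args = []
        for gf in GOVERNABLE:
            if gf in fs:
                value = fs[gf]
                if gf == "obl" and "pform" in value:
                    args.append(f"obl:{value['pform']}")  # parameterised by preposition
                else:
                    args.append(gf)
        if pred is not None and args:
            forms.append(f"{pred}([{','.join(args)}])")
        for value in fs.values():
            semantic_forms(value, forms)
    elif isinstance(fs, list):
        for item in fs:
            semantic_forms(item, forms)
    return forms

# "The inquiry soon focused on the judge" (wsj_0267_72).
fs = {
    "subj": {"spec": {"det": {"pred": "the"}},
             "pred": "inquiry", "num": "sg", "pers": "3"},
    "adjunct": [{"pred": "soon"}],
    "pred": "focus",
    "tense": "past",
    "obl": {"pform": "on",
            "obj": {"spec": {"det": {"pred": "the"}},
                    "pred": "judge", "num": "sg", "pers": "3"}},
}

forms = semantic_forms(fs)
print(forms)
```

Adjuncts such as "soon" never enter a frame because only governable functions are collected; this is where the argument/adjunct distinction of the f-structure annotation pays off.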


Grammar and Lexicon Extraction: Penn-II & LFG

topic : index : [1]
        subj : spec : det : pred : the
               num : sing
               pred : government
               pers : 3
        ……
        pred : have
        tense : pres
subj : spec : det : pred : the
       pers : 3
       pred : treasury
       num : sing
comp : index : [1]
       subj : spec : det : pred : the
              num : sing
              pred : government
              pers : 3
       ……
       pred : have
       tense : pres
pred : say
tense : past

LDDs:

say([subj,comp])

“Until Congress acts , the government hasn't any authority to issue new debt obligations of any kind, the Treasury said.” (wsj_0008_2)

Grammar and Lexicon Extraction: Penn-II & LFG

subj : pred : pro
       pron_form : it
passive : +
to_inf : +
pred : be
xcomp : subj : pred : pro
               pron_form : it
        passive : +
        pred : consider
        tense : past
        obl : pform : as
              obj : spec : det : pred : a
                    ………
                    pred : risk
                    num : sg
                    pers : 3

Passive:

consider([subj,obl:as],p)

“… to be considered as an additional risk for the investor…”(wsj_0018_14)

Grammar and Lexicon Extraction: Penn-II & LFG

subj : spec : det : pred : the
                    cat : dt
       pred : inquiry
       num : sg
       pers : 3
       cat : nn
adjunct : 1 : pred : soon
              cat : rb
pred : focus
tense : past
cat : vbd
obl : pform : on
      obj : spec : det : pred : the
                         cat : dt
            pred : judge
            num : sg
            pers : 3
            cat : nn

CFG categories:

focus(v,[subj,obl:on])

focus(v,[subj(n),obl:on])

“The inquiry soon focused on the judge.” (wsj_0267_72)

Grammar and Lexicon Extraction: Penn-II & LFG

Semantic Form                    Conditional Probability
accept([subj,obj])               0.813
accept([subj],p)                 0.060
accept([subj,comp])              0.033
accept([subj,obl:as],p)          0.020
accept([subj,obj,obl:as])        0.020
accept([subj,obj,obl:from])      0.020
accept([subj])                   0.013
Others                           0.021

Lexicon extracted from Penn-II (O’Donovan et al 2005):

                   Without Prep/Part    With Prep/Part
Lemmas             3586                 3586
Semantic Forms     10969                14348
Frame Types        38                   577

Grammar and Lexicon Extraction: Penn-II & LFG

• Evaluation for all active verbs (2992) extracted from Penn-II against COMLEX

• Largest evaluation for English subcat frame extraction system
– Carroll and Rooth (1998) – 200 verbs
– Schulte im Walde (2000) – over 3000 German verbs

• COMLEX entry:
(VERB :ORTH “reimburse” :SUBC ((NP-NP)
                               (NP-PP :PVAL (“for”))
                               (NP)))

• COMLEX frame definition:
(vp-frame np-np :cs ((np 2)(np 3))
                :gs (:subject 1 :obj 2 :obj2 3)
                :ex “she asked him his name”)

Grammar and Lexicon Extraction: Penn-II & LFG

• Following Schulte im Walde (2000):

• Experiment 1: Exclude prepositional phrases entirely (e.g. [subj,obl:on] is [subj])

• Experiment 2: Include prepositional phrase but not specific preposition (e.g. [subj,obl])
– 2a (+ Part value)

• Experiment 3: Include details of specific preposition (e.g. [subj,obl:on])
– 3a (+ Part value)

• Relative Thresholds of 1% and 5%

Grammar and Lexicon Extraction: Penn-II & LFG

           Threshold of 1%            Threshold of 5%
           P       R       F          P       R       F
Exp. 1     79.0%   59.6%   68.0%      83.5%   54.7%   66.1%
Exp. 2     77.1%   50.4%   61.0%      81.4%   44.8%   57.8%
Exp. 2a    76.4%   44.5%   56.3%      80.9%   39.0%   52.6%
Exp. 3     73.7%   22.1%   34.0%      78.0%   18.3%   29.6%
Exp. 3a    73.3%   19.9%   31.3%      77.6%   16.2%   26.8%
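
The table shows the usual trade-off: raising the relative threshold from 1% to 5% improves precision and lowers recall. A minimal sketch of how one verb's induced frames might be filtered and scored is given below. The function names, the `>=` comparison and the gold-standard set are assumptions made for illustration only; the conditional probabilities are the `accept` values from the earlier slide, and the actual evaluation setup is the one described in O’Donovan et al.

```python
def apply_relative_threshold(frame_probs, threshold):
    """Keep the frames whose conditional probability reaches the threshold."""
    return {frame for frame, p in frame_probs.items() if p >= threshold}


def precision_recall_f(induced, gold):
    """Precision, recall and balanced F-score of an induced frame set."""
    correct = len(induced & gold)
    precision = correct / len(induced) if induced else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Conditional probabilities for "accept" as listed on the lexicon slide.
accept = {
    "[subj,obj]": 0.813,
    "[subj],p": 0.060,
    "[subj,comp]": 0.033,
    "[subj,obl:as],p": 0.020,
    "[subj,obj,obl:as]": 0.020,
    "[subj,obj,obl:from]": 0.020,
    "[subj]": 0.013,
}
# Hypothetical gold-standard entry, invented for this example only.
gold = {"[subj,obj]", "[subj,comp]", "[subj]"}

kept_1 = apply_relative_threshold(accept, 0.01)
kept_5 = apply_relative_threshold(accept, 0.05)
scores_1 = precision_recall_f(kept_1, gold)
scores_5 = precision_recall_f(kept_5, gold)
```

On this toy entry the stricter threshold keeps two frames instead of seven, so precision goes up while recall drops.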

Grammar and Lexicon Extraction: Penn-II & LFG

• Directional Prepositions (about, across, along, around, behind, below, beneath, between, beyond, by, down, from…) included in COMLEX by “default” for verbs that have at least one p-dir …

            Exp. 3    Exp. 3a
Recall      40.8%     35.4%
Increase    18.7%     15.5%
F-Score     54.4%     49.7%
Increase    20.4%     18.4%

(VERB :ORTH "cycle" :SUBC ((PP :PVAL ("p-dir"))))

Grammar and Lexicon Extraction: Penn-II & LFG

• Penn-III = Penn-II + the parsed section of the Brown Corpus

– About 300,000 words of the 1-million-word Brown Corpus
– Balanced corpus (8 genres), e.g. Humour, Science Fiction etc.

• Subcategorisation variation across domains • More data, more verbs

• -CLR tag (closely related)

Grammar and Lexicon Extraction: Penn-II & LFG

• Applications:

• Porting to other languages
– German (TIGER)
– Spanish (CAST3LB)
– Chinese (CTB-I and II)

• LDD resolution in parsing new text (Cahill et al., 2004)

Grammar and Lexicon Extraction: Penn-II & LFG

Parsing-Based Subcat Frame Extraction (O’Donovan 2006):

• Treebank-based vs. parsing-based subcat frame extraction

• We parsed the British National Corpus (BNC, 100 million words) with our automatically induced LFGs

• 19 days on single machine: ~5 million words per day

• Subcat frame extraction for ~10,000 verb lemmas

• Evaluation against COMLEX and OALD

• Evaluation against Korhonen (2002) gold standard

• Our method is statistically significantly better

Parsing: Penn-II and LFG

• Overview Parsing Architectures: Pipeline & Integrated

• Long-Distance Dependency Resolution at F-Structure

• Evaluation

Parsing: Penn-II and LFG

• PCFG consists of CFG rules with associated probabilities

• A-PCFG treats strings consisting of CFG categories followed by 1 or more functional annotation(s) as monadic categories (e.g. NP[up-obj=down] )

• Probabilistic parsing technology (PCFGs, History-Based and Lexicalised Parsers) produces trees without LDDs

• Exceptions: (Collins 1999): wh-relative clauses; (Johnson 2002): post-processing; …

• In our (standard) architecture new text is parsed into proto f-structures.

• LDD resolution at f-structure
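
To make the A-PCFG bullet concrete: because a category together with its functional annotation is treated as one atomic symbol, rule probabilities can be estimated exactly as for a plain PCFG. The sketch below uses relative-frequency estimation, which is the standard treebank estimate but is an assumption here, since the slides do not say how probabilities are obtained. The rule instances and the name `estimate_pcfg` are invented for illustration.

```python
from collections import Counter


def estimate_pcfg(rule_occurrences):
    """Relative-frequency estimate: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(rule_occurrences)
    lhs_counts = Counter()
    for (lhs, _rhs), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}


# Invented rule instances as they might be read off annotated trees.
# "NP[up-obj=down]" is a single monadic symbol, distinct from "NP[up-subj=down]".
observed = [
    ("S", ("NP[up-subj=down]", "VP[up=down]")),
    ("S", ("NP[up-subj=down]", "VP[up=down]")),
    ("VP[up=down]", ("VBD[up=down]", "NP[up-obj=down]")),
    ("VP[up=down]", ("VBD[up=down]", "NP[up-obj=down]")),
    ("VP[up=down]", ("VBD[up=down]", "NP[up-obj=down]")),
    ("VP[up=down]", ("VBD[up=down]",)),
]
probs = estimate_pcfg(observed)
```

Since the annotations are part of the symbols, the parser's output trees already carry the f-structure equations needed to build proto-f-structures.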

Parsing: Penn-II and LFG

• Penn-II tree with traces and co-indexation for LDDs

“U.N. signs treaty, the paper said”

(S (S-1 (NP (NNP U.N.))
        (VP (VBZ signs)
            (NP (NN treaty))))
   (NP (DT the) (NN paper))
   (VP (VBD said)
       (S (-NONE- *T*-1))))

Parsing: Penn-II and LFG

• Trace and co-indexation in the tree are translated into reentrancy at f-structure by the annotation algorithm:

“U.N. signs treaty, the headline said”

Parsing: Penn-II and LFG

• Parse tree from PCFG and History-Based Parsers without traces:

“U.N. signs treaty, the paper said”

(S (S (NP (NNP U.N.))
      (VP (VBZ signs)
          (NP (NN treaty))))
   (NP (DT the) (NN paper))
   (VP (VBD said)))

Parsing: Penn-II and LFG

• Basic, but possibly incomplete, predicate-argument structures (proto-f-structures):

“U.N. signs treaty, the headline said”

Parsing: Penn-II and LFG

• Require:
– subcategorisation frames (O’Donovan et al., 2004, 2005; O’Donovan 2006)
– functional uncertainty equations

• Previous Example:
– say([subj,comp])
– topic = comp* comp (search along a path of 0 or more comps)

Parsing: Penn-II and LFG

Subcat Frames:

• Automatically acquired from automatically f-structure-annotated Penn-II Treebank following (O’Donovan et al. 2004, 2005; O’Donovan 2006)

• Distinction between active and passive frames

• Associated with probabilities

• O’Donovan et al. evaluate against COMLEX resource

• Extracted from sections 02-21

• 10960 active lemma-frame types (semantic forms/subcat frames), 2241 passive types

Parsing: Penn-II and LFG

Functional Uncertainty equations:

• Automatically acquire finite approximations of FU-equations

• Extract paths between co-indexed material in automatically generated f-structures from sections 02-21 of Penn-II

• 26 TOPIC, 60 TOPICREL, 13 FOCUS path types

• 99.69% coverage of paths in section 23

• Each path type associated with a probability
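
Path extraction between co-indexed material can be pictured as a search for the sub-f-structure that shares the discourse function's index. The following sketch assumes a nested-dict encoding in which re-entrancy is marked by an explicit `index` attribute, as in the LDD example slide; `find_paths` and `ldd_paths` are hypothetical names, not the authors' implementation.

```python
from collections import Counter


def find_paths(fstr, index, path=()):
    """Yield attribute paths from fstr to every sub-f-structure carrying `index`."""
    for attr, value in fstr.items():
        if not isinstance(value, dict):
            continue
        if value.get("index") == index:
            yield path + (attr,)
        yield from find_paths(value, index, path + (attr,))


def ldd_paths(fstr, df="topic"):
    """Paths linking a discourse function (e.g. TOPIC) to its co-indexed argument slot."""
    antecedent = fstr.get(df)
    if not isinstance(antecedent, dict) or "index" not in antecedent:
        return []
    return [p for p in find_paths(fstr, antecedent["index"]) if p != (df,)]


# Simplified version of "..., the Treasury said": TOPIC is co-indexed with COMP.
said = {
    "topic": {"index": 1, "pred": "have"},
    "pred": "say",
    "subj": {"pred": "treasury"},
    "comp": {"index": 1, "pred": "have"},
}
# Invented second example with a longer path through an XCOMP.
deeper = {
    "topic": {"index": 1, "pred": "sign"},
    "pred": "want",
    "xcomp": {"pred": "say", "comp": {"index": 1, "pred": "sign"}},
}

path_counts = Counter()
for fs in (said, deeper):
    for p in ldd_paths(fs):
        path_counts[":".join(p)] += 1
```

Dividing each count by the total for its discourse function would give path probabilities of the kind shown on the following slide.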

Parsing: Penn-II and LFG

Sample TOPICREL paths with frequencies:

up-subj                7894
up-obj                 1167
up-xcomp                956
up-xcomp:obj            793
up-xcomp:xcomp          161
up-xcomp:xcomp:obj      135
up-comp:subj            119
up-xcomp:subj            92

Sample TOPIC paths with probabilities:

up-topic=up-comp          0.940
up-topic=up-xcomp:comp    0.006
up-topic=up-comp:comp     0.001

Parsing: Penn-II and LFG

LDD Resolution Algorithm: recursively traverse an f-structure and

– find TOPIC:T attribute-value pair

– retrieve TOPIC paths

– for each path p of the form GF1:…:GFn:GF, traverse the f-structure along the TOPIC path GF1:…:GFn to local sub f-structure g

• at g retrieve local PRED:P

• add GF:T to g iff

– GF is not present at g

– g together with GF is locally complete and coherent with respect to a semantic form s for P

– multiply path and semantic form probabilities involved to rank resolution
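
The resolution steps above can be sketched as follows, for the TOPIC case at the top level of an f-structure only (the full algorithm traverses the f-structure recursively and also handles TOPICREL and FOCUS). The dict encoding, the names `resolve_topic` and `apply_resolution`, and the reading of "locally complete and coherent" as "the governable functions at g plus GF are exactly those of the semantic form" are simplifying assumptions. The probabilities are the sample values from the slides.

```python
GOVERNABLE = {"subj", "obj", "obj2", "obl", "comp", "xcomp", "part"}


def resolve_topic(fstr, topic_paths, sem_forms):
    """Return candidate resolutions (score, path, frame), best first."""
    if "topic" not in fstr:
        return []
    candidates = []
    for path, p_path in topic_paths.items():
        *prefix, gf = path
        g = fstr
        for step in prefix:  # traverse GF1:...:GFn to the local f-structure g
            g = g.get(step) if isinstance(g, dict) else None
            if g is None:
                break
        if not isinstance(g, dict) or gf in g:
            continue  # path does not exist here, or GF is already filled
        wanted = frozenset(k for k in g if k in GOVERNABLE) | {gf}
        for frame, p_frame in sem_forms.get(g.get("pred"), {}).items():
            if frozenset(frame) == wanted:  # complete and coherent w.r.t. frame
                candidates.append((p_path * p_frame, path, frame))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates


def apply_resolution(fstr, path):
    """Add GF:T at the end of the path; sharing the object models re-entrancy."""
    *prefix, gf = path
    g = fstr
    for step in prefix:
        g = g[step]
    g[gf] = fstr["topic"]


proto = {
    "topic": {"pred": "sign", "subj": {"pred": "U.N."}, "obj": {"pred": "treaty"}},
    "pred": "say",
    "subj": {"spec": "the", "pred": "paper"},
}
topic_paths = {("comp",): 0.940, ("xcomp", "comp"): 0.006, ("comp", "comp"): 0.001}
sem_forms = {"say": {("subj",): 0.06, ("comp", "subj"): 0.87, ("subj", "xcomp"): 0.02}}

ranked = resolve_topic(proto, topic_paths, sem_forms)
if ranked:
    apply_resolution(proto, ranked[0][1])
```

In this example only the path up-topic=up-comp combined with say([comp,subj]) survives the checks, so the TOPIC value is shared with COMP.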

Parsing: Penn-II and LFG

Subcategorisation Frames:

say([subj])          0.06
say([comp,subj])     0.87
say([subj,xcomp])    0.02
...                  ...

FU-path approximations:

up-topic=up-comp          0.940
up-topic=up-xcomp:comp    0.006
up-topic=up-comp:comp     0.001
...                       ...

Proto-f-structure:

topic : pred : sign
        subj : pred : U.N.
        obj : pred : treaty
pred : say
subj : spec : the
       pred : paper

Selected: up-topic=up-comp (0.940) with say([comp,subj]) (0.87), adding:

comp : pred : sign
       subj : pred : U.N.
       obj : pred : treaty

Parsing: Penn-II and LFG

• How do treebank-based constraint grammars compare to deep hand-crafted grammars like XLE and RASP?

• XLE (Riezler et al. 2002, Kaplan et al. 2004)
– hand-crafted, wide-coverage, deep, state-of-the-art English LFG and XLE parsing system with log-linear-based probability models for disambiguation
– PARC 700 Dependency Bank gold standard (King et al. 2003), Penn-II Section 23-based

• RASP (Carroll and Briscoe 2002)
– hand-crafted, wide-coverage, deep, state-of-the-art English probabilistic unification grammar and parsing system (RASP: Rapid Accurate Statistical Parsing)
– CBS 500 Dependency Bank gold standard (Carroll, Briscoe and Sanfilippo 1999), Susanne-based

Parsing: Penn-II and LFG

• Choose best treebank-based LFG system to compare with XLE/RASP:

• C-structure engines (state-of-the-art history-based, lexicalised parsers):
– (Collins 1999)
– (Charniak 2000)
– (Bikel 2002)

• (Bikel 2002) retrained to retain Penn-II functional tags (-SBJ, -LOC, -TMP, -CLR, etc.)

• Pipeline architecture: tagged text → Bikel retrained + f-structure annotation algorithm + LDD resolution → f-structures → automatic conversion → evaluation against XLE/RASP gold standards (PARC 700 / CBS 500 dependency banks)

• Systematic differences between our f-structures and PARC 700 and CBS 500 dependency representations

• Automatic conversion of our f-structures to PARC 700 / CBS 500 -like structures (Burke et al. 2004, Burke 2006, Cahill et al. under review)

• Best XLE and RASP resources with better results than those reported in literature to date

• (Crouch et al. 2002) and (Carroll and Briscoe 2002) evaluation software

• (Noreen 1989) Approximate Randomisation Test to test for statistical significance of results

Parsing: Penn-II and LFG

Parsing: Penn-II and LFG

• Result dependency f-scores:

PARC 700, XLE vs. BKR-LFG:
– 80.55% XLE
– 83.08% BKR-LFG (+2.53%)

CBS 500, RASP vs. BKR-LFG:
– 76.57% RASP
– 80.23% BKR-LFG (+3.66%)

• Results statistically significant at the 95% level ((Noreen 1989) Approximate Randomisation Test)

• BKR-LFG = treebank-induced Lexical-Functional Grammar resources with Bikel retrained (BKR) as c-structure engine in pipeline architecture
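
For readers unfamiliar with the approximate randomisation test, a minimal sketch follows. It is a simplification under stated assumptions: it compares mean per-sentence scores, whereas the evaluation above reports dependency f-scores computed over whole test sets; the function name, the trial count and the toy scores are all invented for illustration.

```python
import random


def approximate_randomisation(scores_a, scores_b, trials=2000, seed=0):
    """Two-sided p-value for the difference in mean score of two paired systems."""
    if len(scores_a) != len(scores_b):
        raise ValueError("systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    at_least_as_large = 0
    for _ in range(trials):
        total_a = total_b = 0.0
        for a, b in zip(scores_a, scores_b):
            # Under the null hypothesis the system labels are exchangeable,
            # so swap the two outputs for each sentence with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            total_a += a
            total_b += b
        if abs(total_a - total_b) / n >= observed:
            at_least_as_large += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (at_least_as_large + 1) / (trials + 1)


same = approximate_randomisation([0.8, 0.7, 0.9], [0.8, 0.7, 0.9])
different = approximate_randomisation([0.9] * 20, [0.1] * 20)
```

A p-value below 0.05 corresponds to significance at the 95% level quoted above.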

Parsing: Penn-II and LFG

PARC 700 Evaluation:

Probability Models: Penn-II & LFG

Probability Models:

• Our approach does not constitute a proper probability model (Abney, 1996)

• Why? Probability model leaks:

• Highest ranking parse tree may feature f-structure equations that cannot be resolved into f-structure

• Probability associated with that parse tree is lost

• Doesn’t happen often in practice (coverage >99.5% on unseen data)

• Research on appropriate discriminative, log-linear or maximum entropy models is important (Miyao and Tsujii, 2002) (Riezler et al. 2002)

Generation: Penn-II & LFG

Cahill and van Genabith, 2006

Generation: the Good, the Bad and the Ugly

• Orig: Supporters of the legislation view the bill as an effort to add stability and certainty to the airline-acquisition process , and to preserve the safety and fitness of the industry .

• Gen: Supporters of the legislation view the bill as an effort to add stability and certainty to the airline-acquisition process , and to preserve the safety and fitness of the industry.

• Orig: The upshot of the downshoot is that the A 's go into San Francisco 's Candlestick Park tonight up two games to none in the best-of-seven fest .

• Gen: The upshot of the downshoot is that the A 's tonight go into San Francisco 's Candlestick Park up two games to none in the best-of-seven fest .

• Orig: By this time , it was 4:30 a.m. in New York , and Mr. Smith fielded a call from a New York customer wanting an opinion on the British stock market , which had been having troubles of its own even before Friday 's New York market break .

• Gen: Mr. Smith fielded a call from New a customer York wanting an opinion on the market British stock which had been having troubles of its own even before Friday 's New York market break by this time and in New York , it was 4:30 a.m. .

• Orig: Only half the usual lunchtime crowd gathered at the tony Corney & Barrow wine bar on Old Broad Street nearby .

• Gen: At wine tony Corney & Barrow the bar on Old Broad Street nearby gathered usual , lunchtime only half the crowd , .

Domain Variation, Multilingual LFG Resources, etc.

• Domain variation: ATIS (Judge et al 2005) and QuestionBank (Judge et al 2006)

• F-Str -> (Q)LF Quasi-Logical Forms (Cahill et al. 2003)

• Multilingual treebank-based LFG acquisition:

– German: TIGER treebank (Cahill et al 2003), (Cahill et al 2005)

– Chinese: Chinese Penn Treebank (Burke et al 2004)

– Spanish: Cast3LB (O’Donovan et al 2005), (Chrupala and van Genabith 2006)

• GramLab Project at DCU (2005-2008): Chinese, Japanese, Arabic, Spanish, French and German

Demo System

• http://lfg-demo.computing.dcu.ie/lfgparser.html

Publications

A. Cahill and J. Van Genabith, Robust PCFG-Based Generation using Automatically Acquired LFG-Approximations, COLING/ACL 2006, Sydney, Australia

J. Judge, A. Cahill and J. van Genabith, QuestionBank: Creating a Corpus of Parse-Annotated Questions, COLING/ACL 2006, Sydney, Australia

G. Chrupala and J. van Genabith, Using Machine-Learning to Assign Function Labels to Parser Output for Spanish, COLING/ACL 2006, Sydney, Australia

M. Burke, Automatic Treebank Annotation for the Acquisition of LFG Resources, Ph.D. Thesis, School of Computing, Dublin City University, Dublin 9, Ireland. 2005

R. O’Donovan, Automatic Extraction of Large-Scale Multilingual Lexical Resources, Ph.D. Thesis, School of Computing, Dublin City University, Dublin 9, Ireland. 2005

R. O'Donovan, M. Burke, A. Cahill, J. van Genabith and A. Way. Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II and Penn-III Treebanks, Computational Linguistics, 2005

A. Cahill, M. Forst, M. Burke, M. McCarthy, R. O'Donovan, C. Rohrer, J. van Genabith and A. Way. Treebank-Based Acquisition of Multilingual Unification Grammar Resources; Journal of Research on Language and Computation; Special Issue on "Shared Representations in Multilingual Grammar Engineering", (eds.) E. Bender, D. Flickinger, F. Fouvry and M. Siegel, Kluwer Academic Press, 2005

Publications

R. O'Donovan, A. Cahill, J. van Genabith, and A. Way. Automatic Acquisition of Spanish LFG Resources from the CAST3LB Treebank; In Proceedings of the Tenth International Conference on LFG, Bergen, Norway, 2005

J. Judge, M. Burke, A. Cahill, R. O'Donovan, J. van Genabith, and A. Way. Strong Domain Variation and Treebank-Induced LFG Resources; In Proceedings of the Tenth International Conference on LFG, Bergen, Norway,2005

M. Burke, A. Cahill, J. van Genabith, and A. Way. Evaluating Automatically Acquired F-Structures against PropBank; In Proceedings of the Tenth International Conference on LFG, Bergen, Norway, 2005

M. Burke, A. Cahill, M. McCarthy, R.O'Donovan, J. van Genabith and A. Way. Evaluating Automatic F-Structure Annotation for the Penn-II Treebank; Journal of Language and Computation; Special Issue on "Treebanks and Linguistic Theories", (eds.) E. Hinrichs and K.Simov, Kluwer Academic Press. 2005. pages 523-547

A. Cahill. Parsing with Automatically Acquired, Wide-Coverage, Robust, Probabilistic LFG Approximations. Ph.D. Thesis. School of Computing, Dublin City University, Dublin 9, Ireland. 2004

M. Burke, O. Lam, A. Cahill, R. Chan, R. O'Donovan, A. Bodomo, J. van Genabith and A. Way; Treebank-Based Acquisition of a Chinese Lexical-Functional Grammar; Proceedings of the PACLIC-18 Conference, Waseda University, Tokyo, Japan, pages 161-172, 2004

Publications

M. Burke, A. Cahill, R. O'Donovan, J. van Genabith, and A. Way. The Evaluation of an Automatic Annotation Algorithm against the PARC 700 Dependency Bank, In Proceedings of the Ninth International Conference on LFG, Christchurch, New Zealand, pages 101-121, 2004

A. Cahill, M. Burke, R. O'Donovan, J. van Genabith, and A. Way. Long-Distance Dependency Resolution in Automatically Acquired Wide-Coverage PCFG-Based LFG Approximations, In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), July 21-26 2004, pages 320-327, Barcelona, Spain, 2004

R. O'Donovan, M. Burke, A. Cahill, J. van Genabith, and A. Way. Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II Treebank, In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), July 21-26 2004, pages 368-375, Barcelona, Spain, 2004

M. Burke, A. Cahill, R. O'Donovan, J. van Genabith and A. Way. Treebank-Based Acquisition of Wide-Coverage, Probabilistic LFG Resources: Project Overview, Results and Evaluation, The First International Joint Conference on Natural Language Processing (IJCNLP-04), Workshop "Beyond shallow analyses - Formalisms and statistical modeling for deep analyses"; March 22-24, 2004 Sanya City, Hainan Island, China, 2004

A. Cahill, M. Forst, M. McCarthy, R. O'Donovan, C. Rohrer, J. van Genabith and A. Way. Treebank-Based Multilingual Unification-Grammar Development. In the Proceedings of the Workshop on Ideas and Strategies for Multilingual Grammar Development, at the 15th European Summer School in Logic Language and Information, Vienna, Austria, 18th - 29th August 2003

Publications

A. Cahill, M. McCarthy, J. van Genabith and A. Way. Quasi-Logical Forms for the Penn Treebank. In Harry Bunt, Ielka van der Sluis and Roser Morante (eds.): Proceedings of the Fifth International Workshop on Computational Semantics, IWCS-05, January 15-17, 2003, Tilburg, The Netherlands, ISBN: 90-74029-24-8, pp. 55-71, 2003

A. Cahill, M. McCarthy, J. van Genabith and A. Way. Evaluating Automatic F-Structure Annotation for the Penn-II Treebank. In E. Hinrichs and K. Simov (eds.): Proceedings of the First Workshop on Treebanks and Linguistic Theories (TLT 2002), 20th and 21st September 2002, Sozopol, Bulgaria, pp. 42-60, 2002

A. Cahill, M. McCarthy, J. van Genabith and A. Way. Parsing with PCFGs and Automatic F-Structure Annotation. In M. Butt and T. Holloway-King (eds.): Proceedings of the Seventh International Conference on LFG, CSLI Publications, Stanford, CA, pp. 76-95, 2002

A. Cahill and J. van Genabith. TTS - A Treebank Tool. In M.G. Rodriguez and C.P. Suarez Araujo (eds.): LREC 2002, The Third International Conference on Language Resources and Evaluation, Las Palmas de Gran Canaria, Spain, May 27th - June 2nd, 2002, Proceedings of the Conference, Volume V, ISBN 2-9517408-0-8, pp. 1712-1717, 2002

A. Cahill, M. McCarthy, J. van Genabith and A. Way. Automatic Annotation of the Penn-Treebank with LFG F-Structure Information. In A. Lenci, S. Montemagni and V. Pirelli (eds.): Proceedings of the LREC 2002 Workshop on Linguistic Knowledge Acquisition and Representation - Bootstrapping Annotated Language Data, post-conference workshop of LREC 2002, Third International Conference on Language Resources and Evaluation, June 1st, 2002, ELRA - European Language Resources Association, Paris, France, pp. 8-15, 2002


Penn-II-Based Acquisition of CCG Resources

Combinatory Categorial Grammar


This lecture

• Recap: CCG

• Translating the Penn Treebank to CCG
  – The translation algorithm
  – CCGbank: the acquired grammar and lexicon

• Wide-coverage parsing with CCG


CCG: the machinery

• Categories: specify subcat lists of words/constituents.

• Combinatory rules: specify how constituents can combine.

• The lexicon: specifies which categories a word can have.

• Derivations: spell out process of combining constituents.


CCG categories

• Simple categories: NP, S, PP

• Complex categories: functions which return a result when combined with an argument:

    VP or intransitive verb: S\NP
    Transitive verb: (S\NP)/NP
    Adverb: (S\NP)\(S\NP)
    PPs: ((S\NP)\(S\NP))/NP
         (NP\NP)/NP


The combinatory rules

• Function application: λx.f(x) a ⇒ f(a)
    X/Y  Y  ⇒  X    (>)
    Y  X\Y  ⇒  X    (<)

• Function composition: λx.f(x) λy.g(y) ⇒ λx.f(g(x))
    X/Y  Y/Z  ⇒  X/Z    (>B)
    Y\Z  X\Y  ⇒  X\Z    (<B)
    X/Y  Y\Z  ⇒  X\Z    (>Bx)
    Y/Z  X\Y  ⇒  X/Z    (<Bx)

• Type-raising: a ⇒ λf.f(a)
    X  ⇒  T/(T\X)    (>T)
    X  ⇒  T\(T/X)    (<T)
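The combinatory rules can be tried out in a few lines of code. The following is a minimal Python sketch, not part of the course material: the names `Atom`, `Func`, `forward_apply`, `backward_apply`, `forward_compose` and `forward_type_raise` are assumptions made for this illustration, and only a subset of the rules (>, <, >B, >T) is covered.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Atom:
    """A simple category such as NP, S or PP."""
    name: str

    def __str__(self):
        return self.name


@dataclass(frozen=True)
class Func:
    """A complex category: result, slash ('/' or '\\'), argument."""
    result: "Cat"
    slash: str
    arg: "Cat"

    def __str__(self):
        def wrap(c):
            return f"({c})" if isinstance(c, Func) else str(c)
        return f"{wrap(self.result)}{self.slash}{wrap(self.arg)}"


Cat = Union[Atom, Func]


def forward_apply(left, right):
    """X/Y  Y  =>  X   (>)"""
    if isinstance(left, Func) and left.slash == "/" and left.arg == right:
        return left.result
    return None


def backward_apply(left, right):
    """Y  X\\Y  =>  X   (<)"""
    if isinstance(right, Func) and right.slash == "\\" and right.arg == left:
        return right.result
    return None


def forward_compose(left, right):
    """X/Y  Y/Z  =>  X/Z   (>B)"""
    if (isinstance(left, Func) and left.slash == "/"
            and isinstance(right, Func) and right.slash == "/"
            and left.arg == right.result):
        return Func(left.result, "/", right.arg)
    return None


def forward_type_raise(x, t):
    """X  =>  T/(T\\X)   (>T)"""
    return Func(t, "/", Func(t, "\\", x))


# "Mary ordered" as a constituent of category S/NP, as in right-node raising.
NP, S = Atom("NP"), Atom("S")
ordered = Func(Func(S, "\\", NP), "/", NP)       # (S\NP)/NP
mary = forward_type_raise(NP, S)                 # S/(S\NP)
mary_ordered = forward_compose(mary, ordered)    # S/NP
print(mary, mary_ordered, forward_apply(mary_ordered, NP))
```

Representing categories as immutable values makes the rule checks plain equality tests; a real parser would additionally handle features such as S[dcl] and the remaining rule variants.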


CCG derivations

• Canonical “normal-form” derivations (mostly function application):

• Alternative derivations:


Type-raising and Composition

• Wh-movement:

• Right-node raising:


CCG: semantics

• Every syntactic category and rule has a semantic counterpart:


From the Penn Treebank to CCG

• The basic translation algorithm
• Dealing with null elements
• Type-changing rules in the grammar
• Preprocessing
• CCGbank: The extracted lexicon/grammar


Input: Penn Treebank tree

• Flat phrase-structure tree
• Traces/null elements and indices represent underlying dependencies
• Function tags


Output: CCG derivation

• Binary derivation tree with explicit “deep” dependency structures and subcategorization information.

• No null elements


I. Identify heads, arguments, adjuncts


II. Binarise the tree


III. Assign CCG categories


Morphosyntactic Features

• Features on verbal categories: declarative, infinitival, past participle, present participle, passive

• Sentential features: wh-questions, yes-no questions, embedded questions, embedded declaratives, fragments, etc.

• CCGbank has no case or number distinction!


III. Assign CCG categories: adjuncts


III. Assign CCG categories: arguments


IV. Assign predicate-argument structure

• We approximate predicate-argument structure by word-word dependencies

• These are defined by the argument slots of functor categories:

    just (S\NP)/(S\NP) → opened
    opened (S[dcl]\NP)/NP → doors


IV. Assign predicate-argument structure

• Non-local dependencies arise through:
  – Binding and control: “He may want you to listen”
  – Extraction: “the tapas that he told us she ate”

• Both are mediated by lexical categories:
  – Control verbs, auxiliaries/modals
  – Relative pronouns

• We represent this via coindexation: (NP\NP_i)/(S[dcl]/NP_i)
  In CCGbank: added automatically to certain category types


Lexical categories that mediate dependencies

• Auxiliaries/modals, raising verbs: will, might, seem
    (S[dcl]\NP_i)/(S[b]\NP_i)

• Control verbs: persuade you to go
    ((S[dcl]\NP)/(S[to]\NP_i))/NP_i

• Relative pronouns: which, who, that
    (NP\NP_i)/(S[dcl]/NP_i)

• Many more (listed in CCGbank manual)


Summary: The basic algorithm

1. Identify heads, complements and adjuncts.
2. Binarize the tree.
3. Assign CCG categories.
4. Add co-indexation to lexical categories.
5. Create predicate-argument structure.
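Step 3 of the summary can be illustrated with a short Python sketch. It assumes steps 1 and 2 have already produced a binary tree whose nodes carry a head/complement/adjunct role; the `Node` class, the `ATOMIC` label map and the `assign` function are hypothetical names for this illustration, and features such as [dcl], the noun level inside NPs and type-changing rules are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str                      # Penn Treebank label, e.g. "S", "NP", "VBD"
    role: str = "head"              # "head", "complement" or "adjunct" (step 1)
    children: List["Node"] = field(default_factory=list)  # 0 or 2 (step 2)
    word: Optional[str] = None
    cat: Optional[str] = None       # CCG category, filled in by assign()


# Hypothetical mapping from complement labels to atomic categories.
ATOMIC = {"S": "S", "NP": "NP", "PP": "PP"}


def paren(cat):
    """Bracket complex categories when they are embedded in a larger one."""
    return f"({cat})" if ("/" in cat or "\\" in cat) else cat


def assign(node, cat):
    """Top-down category assignment on a binarized, role-annotated tree."""
    node.cat = cat
    if not node.children:
        return
    left, right = node.children
    head, other = (left, right) if left.role == "head" else (right, left)
    head_is_left = head is left
    if other.role == "complement":
        # The head is a function that looks for the complement.
        other_cat = ATOMIC.get(other.label, other.label)
        slash = "/" if head_is_left else "\\"
        head_cat = f"{paren(cat)}{slash}{paren(other_cat)}"
    else:
        # An adjunct maps the parent category to itself: X/X or X\X.
        slash = "\\" if head_is_left else "/"
        other_cat = f"{paren(cat)}{slash}{paren(cat)}"
        head_cat = cat
    assign(head, head_cat)
    assign(other, other_cat)


# "John just opened doors"
opened = Node("VBD", "head", word="opened")
doors = Node("NP", "complement", word="doors")
just = Node("ADVP", "adjunct", word="just")
john = Node("NP", "complement", word="John")
vp = Node("VP", "head", [just, Node("VP", "head", [opened, doors])])
root = Node("S", "head", [john, vp])
assign(root, "S")
print(just.cat, opened.cat)   # (S\NP)/(S\NP) (S\NP)/NP
```

The sketch makes the dependence on the complement/adjunct distinction visible: changing the role of a single node changes the categories of both that node and its head sister.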


Problems with basic algorithm

• Depends on Treebank markup:
  – Complement/adjunct distinction
  – The analyses don’t always correspond to CCG analysis
  – Errors in Treebank annotation

• Proliferation of categories:


The need for preprocessing

• Eliminating (some of) the noise:
  – POS-tagging errors
  – Bracketing errors (coordination!)

• Changing the Treebank analyses:
  – Small clauses

• Adding more structure:
  – Insert a noun level into NPs
  – Analyze QPs, fragments, parentheticals, multiword-expressions


Compacting the grammar: Type-changing rules

• Type-changing rules for adjuncts capture syntactic regularities:


Null elements, traces, and coindexation

• *-null elements: passive, PRO
• *T*-traces: wh-movement, tough movement
• *RNR*-traces: right-node raising
• Other null elements:
  – *EXP*: expletive
  – *ICH* (“insert constituent here”): extraposition
  – *U* (units): $ 500 *U*
  – *PPA* (permanent predictable ambiguity)

• =-coindexation: argument cluster coordination and gapping


* null elements

• Used for passive or PRO (arbitrary or controlled):

• Only the passive * matters for translation: (S with null subject = VP = S\NP)


Unbounded long-range dependencies

• … arising through extraction (*T*):
  – Wh-movement (relative clauses and wh-questions):
      the articles that (you believed he saw that…) I filed
  – Tough-movement:
      Peter is easy to please
  – Parasitic gaps:
      the articles that I filed without reading

• … arising through coordination (*RNR* and =):
  – Right-node raising:
      [[Mary ordered] and [John ate]] the tapas.
  – Argument cluster coordination:
      Mary ordered [[tapas for herself] and [wine for John]].
  – Sentential gapping:
      [[Mary ordered tapas] and [John beer]].


Dealing with extraction

• Penn Treebank: *T* traces indicate extraction


Dealing with extraction

• Pass the extracted NP up to relative clause.
• The relative pronoun subcategorizes for an ‘incomplete’ sentence:
    (NP\NP)/(S[dcl]\NP) for subject relatives
    (NP\NP)/(S[dcl]/NP) for object relatives

• The derivation uses type-raising and composition


Right node raising in the Penn Treebank


Right node raising in CCGbank


Argument-cluster coordination

• “Template gapping” annotation: Co-indexation between constituents in conjuncts

• The first conjunct contains the head


Argument-cluster coordination in CCGbank

• The shared constituents are coordinated (via type-raising and composition):

    X  ⇒  T\(T/X)    (<T)
    NP  ⇒  (S\NP)\((S\NP)/NP)    (<T)


Sentential Gapping

• In the Treebank:

• CCG uses decomposition to obtain the types (interpretation is given extragrammatically)


Remaining problems: NP level

• Lists and appositives are indistinguishable:

• Compound nouns have no internal structure:


Remaining problems: other constructions

• Complement-adjunct distinction:


Putting it all together….

Funds that are or soon will be listed in New York or London


The CCG derivation


The relative clause:

that: (NP_i\NP_i)/(S[dcl]\NP_i), linking “funds” to “are” and “will”


The right-node-raising VP


CCGbank

• Coverage of the translation algorithm: 99.44% of all sentences in the Treebank (main problem: sentential gapping)

• The lexicon (sec. 02-21):
  – 74,669 entries for 44,210 word types
  – 1286 lexical category types (439 appear once, 556 appear 5 times or more)

• The grammar (sec. 02-21):
  – 3262 rule instantiations (1146 appear once)


The most ambiguous words


Frequency distribution of categories


Lexical coverage

• How well does our lexicon cover unseen data?
    “Training” data: sections 02-21
    Test data: section 00

• The lexicon contains the correct entries for 94.0% of the tokens in section 00.

• 3.8% of the tokens in section 00 do not appear in sections 02-21.
    35% of the unknown tokens are N
    29% of the unknown tokens are N/N
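The two coverage figures can be computed along the following lines. This is a hedged Python sketch on toy data, not the script used to produce the numbers above: `build_lexicon` and `lexical_coverage` are hypothetical helper names, and reading CCGbank files is left out entirely.

```python
from collections import defaultdict


def build_lexicon(tagged_tokens):
    """Collect the set of lexical categories seen with each word."""
    lexicon = defaultdict(set)
    for word, cat in tagged_tokens:
        lexicon[word].add(cat)
    return lexicon


def lexical_coverage(lexicon, test_tokens):
    """Return (share of test tokens whose correct entry is in the lexicon,
    share of test tokens whose word never occurs in the lexicon)."""
    total = len(test_tokens)
    if total == 0:
        return 0.0, 0.0
    covered = sum(1 for w, c in test_tokens if c in lexicon.get(w, ()))
    unknown = sum(1 for w, _ in test_tokens if w not in lexicon)
    return covered / total, unknown / total


# Toy stand-ins for sections 02-21 (training) and section 00 (test).
train = [("the", "NP/N"), ("dog", "N"), ("barks", "S\\NP")]
test = [("the", "NP/N"), ("dog", "N"), ("cat", "N"), ("barks", "(S\\NP)/NP")]
print(lexical_coverage(build_lexicon(train), test))   # (0.5, 0.25)
```

In the toy test set, "cat" is an unknown word, while "barks" is a known word that lacks the required category; both count against coverage, but only the first counts as unknown.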


Statistical Parsing with CCG

• The data: CCGbank
• The algorithms: standard CKY chart parsing (and a supertagger)
• The models:
  – Generative: Hockenmaier and Steedman (2002)
  – Conditional: Clark and Curran (2004)


Parsing algorithms for CCG

• CCG derivations are binary trees.
• Standard chart parsing algorithms (e.g. CKY) can be used.
• Complexity: O(n^6) (or O(n^3) if the category set is fixed)
• Recovery of “deep” dependencies requires feature structures.
• Supertagging: assign most likely categories to words before parsing. Significantly speeds up parsing!
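A CKY recognizer over CCG categories fits in a few lines. The sketch below is an illustration under simplifying assumptions, not the parser of the cited systems: it uses only forward and backward application, represents a complex category as a `(result, slash, argument)` tuple, and takes a hand-written toy lexicon in place of a supertagger. The names `combine`, `cky` and `LEXICON` are hypothetical.

```python
from itertools import product


def combine(left, right):
    """Categories derivable from two adjacent categories by application."""
    out = []
    # Forward application: X/Y  Y  =>  X
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        out.append(left[0])
    # Backward application: Y  X\Y  =>  X
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        out.append(right[0])
    return out


def cky(words, lexicon):
    """Fill a chart mapping each span (start, end) to its set of categories."""
    n = len(words)
    chart = {}
    for i, word in enumerate(words):
        chart[i, i + 1] = set(lexicon.get(word, ()))
    for width in range(2, n + 1):                 # span length
        for start in range(0, n - width + 1):     # span start
            end = start + width
            cell = set()
            for mid in range(start + 1, end):     # split point
                for l, r in product(chart[start, mid], chart[mid, end]):
                    cell.update(combine(l, r))
            chart[start, end] = cell
    return chart


VP = ("S", "\\", "NP")                            # S\NP
LEXICON = {
    "John": {"NP"},
    "doors": {"NP"},
    "opened": {(VP, "/", "NP")},                  # (S\NP)/NP
    "just": {(VP, "/", VP)},                      # (S\NP)/(S\NP)
}
chart = cky("John just opened doors".split(), LEXICON)
print(chart[0, 4])   # {'S'}
```

The three nested loops over span length, start and split point give the cubic behaviour mentioned above when the category set is fixed; restricting each word to a few supertagger-proposed categories keeps the cells small.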


Parsing models

• Generative models: P(s, τ)
  Model the process which generates the derivation τ
  – Advantage: easy to guarantee consistency
  – Disadvantage: requires good smoothing techniques, difficult to include complex features
  Good baseline

• Conditional models: P(τ | s)
  Given a sentence s, predict most likely derivation τ
  – Advantage: more natural for parsing
  – Disadvantage: large model size, difficult to estimate
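The relation between the two model types can be written out as follows. The symbol for the derivation did not survive in this transcript, so τ is used here as an assumed name for it; s is the sentence, as above.

```latex
\hat{\tau}_{\mathrm{gen}} = \arg\max_{\tau} P(s, \tau),
\qquad
\hat{\tau}_{\mathrm{cond}} = \arg\max_{\tau} P(\tau \mid s),
\qquad
P(\tau \mid s) = \frac{P(s, \tau)}{\sum_{\tau'} P(s, \tau')}
```

For a fixed sentence the denominator is constant, so a generative model can also be used to pick the most likely derivation; the two approaches differ in what is estimated from the treebank, not in the decision rule.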


Evaluation: recovery of dependency structures

                                                Labelled   Unlabelled
Generative (Hockenmaier and Steedman, 2002):    83.3       90.3
Conditional (Clark and Curran, 2004):           84.6       91.2

This includes long-range dependencies
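The difference between labelled and unlabelled scores can be made concrete with a small sketch. It assumes each dependency is a `(head index, label, argument index)` triple, where the label stands for the functor category and argument slot; this encoding and the names `prf` and `unlabelled` are assumptions for illustration, and the exact evaluation protocol of the cited papers may differ.

```python
def prf(gold, predicted):
    """Precision, recall and F-score of predicted against gold dependencies."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def unlabelled(deps):
    """Drop the label, keeping only which word depends on which."""
    return {(head, arg) for head, _, arg in deps}


# "John(0) just(1) opened(2) doors(3)"
gold = {
    (2, "(S\\NP)/NP#1", 0),
    (2, "(S\\NP)/NP#2", 3),
    (1, "(S\\NP)/(S\\NP)#1", 2),
}
# The parser attaches "just" correctly but with the wrong category.
pred = {
    (2, "(S\\NP)/NP#1", 0),
    (2, "(S\\NP)/NP#2", 3),
    (1, "S/S#1", 2),
}
print(prf(gold, pred))                          # labelled scores
print(prf(unlabelled(gold), unlabelled(pred)))  # unlabelled scores
```

In this toy case the wrong category on "just" lowers the labelled scores to two out of three, while the unlabelled scores are perfect, which is the kind of gap visible in the table.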


ccg2sem: from CCG to DRT

• A Prolog package which translates CCGbank derivations into Discourse Representation Theory structures (Bos, 2005)


CCGbanks for other languages

• German (Hockenmaier, 2006):
  – Translation of German TIGER corpus into CCG.
  – Many crossing dependencies, etc.: context-free approximations are inappropriate
  – Current coverage: 92.4% of all graphs (excluding headlines, fragments etc.)

• Turkish (Cakici, 2005):
  – Extracts a CCG lexicon from the METU Sabanci Treebank.


A few references

General CCG references:
M. Steedman (2000). The Syntactic Process, MIT Press.
M. Steedman (1996). Surface Structure and Interpretation, MIT Press.

CCGbank(s) and wide-coverage CCG parsing:
J. Hockenmaier and M. Steedman (2005). CCGbank: User’s Manual, MS-CIS-05-09, Dept. of Computer and Information Science, University of Pennsylvania.
J. Hockenmaier and M. Steedman (2002). Acquiring Compact Lexicalized Grammars from a Cleaner Treebank, LREC, Las Palmas, Spain.
J. Hockenmaier (2003). Data and Models for Statistical Parsing with Combinatory Categorial Grammar. PhD thesis, Informatics, University of Edinburgh.
J. Hockenmaier and M. Steedman (2002). Generative Models for Statistical Parsing with Combinatory Categorial Grammar, ACL ’02, Philadelphia, PA, USA.
S. Clark and J. R. Curran (2004). Parsing the WSJ using CCG and Log-Linear Models, ACL ’04, Barcelona, Spain.
S. Clark and J. R. Curran (2004). The Importance of Supertagging for Wide-Coverage CCG Parsing. Coling ’04, Geneva, Switzerland.
J. Bos (2005). Towards Wide-Coverage Semantic Interpretation. IWCS-6.
R. Cakici (2005). Automatic Induction of a CCG Grammar for Turkish. ACL Student Research Workshop, Ann Arbor, MI, USA.
J. Hockenmaier (2006). Creating a CCGbank and a wide-coverage CCG lexicon for German. ACL/COLING ’06, Sydney, Australia.


More references

• The CCG website: http://groups.inf.ed.ac.uk/ccg with lots of general references about CCG (as well as CCGbank, CCG parsing, etc.)

• CCGbank is available from the Linguistic Data Consortium (LDC) at the University of Pennsylvania.


Penn-II-Based Acquisition of HPSG Resources

Head-Driven Phrase Structure Grammar


Penn-II-Based Acquisition of HPSG Resources

• Introduction
• Treebank conversion and HPSG annotation
• Lexicon extraction
• Probabilistic models
  – Feature forest model
  – Design of features
• Parsing
• Evaluation
• Advanced topics


Introduction

• If we had an HPSG version of Penn-II, we could obtain lexical entries and probabilistic models
• How do we get HPSG-annotated Penn-II?
• Converting Penn-II into an HPSG-conformant treebank
• How do we verify the conformity with the HPSG theory?
• Principles are exploited for the verification
  – Implementation of principles is relatively easy, while construction of the lexicon is extremely difficult
  – Principles are hand-coded, while lexical entries are acquired from a converted treebank


Introduction

• We develop a treebank rather than a lexicon
• A treebank provides more information than a lexicon
  – Verification of the consistency of the grammar
  – Statistics

[Figure: diagram relating Principles, Lexicon and Treebank]


Methodology

[Figure: workflow diagram. A grammar writer supplies the principles; treebank conversion turns the treebank (example tokens: pretty/JJ, database/NN) into an HPSG treebank; lexicon extraction derives the lexicon from the HPSG treebank.]


Comparison with conventional grammar development

[Figure: diagram contrasting manual development with treebank-based development. Node labels: grammar writer, principles, lexicon, treebank, corpus, parser, lexicon extractor. Edge labels: edit, verify.]


Treebank conversion and HPSG annotation

• Convert Penn-style parse trees into HPSG-style parse trees
  – Correcting frequent errors in Penn Treebank
    • Ex. Confusion of VBD/VBN
  – Converting tree structures
    • Small clauses, passives, NP structures, auxiliary/control verbs, LDDs, etc.
  – Mapping into HPSG-style representations
    • Head/argument/modifier distinction, schema name assignment
    • Mapping into HPSG categories
  – Applying HPSG principles/schemas
    • Undetermined features are filled
    • Violations of feature constraints are detected


Overview

[Figure: conversion of the Penn Treebank tree for “NL is officially making the offer” in three stages: (1) error correction and tree conversion; (2) mapping into HPSG-style representation, with head/arg/mod markings on the daughters and the schema names subject-head, head-comp and head-mod on internal nodes; (3) principle application, which fills in the HEAD, SUBJ, COMPS and MOD features and their structure sharing.]


Tree conversion

• Coordination, quotation, insertion, and apposition

• Small clauses, “than” phrases, quantifier phrases, complementizers, etc.

• Disambiguation of non-/pre-terminal symbols (TO, etc.)

• HEAD features (CASE, INV, VFORM, etc.)
• Noun phrase structures
• Auxiliary/control verbs
• Subject extraction
• Long distance dependencies
• Relative clauses, reduced relatives


Pattern-based tree conversion

    tree_transform_rule("predicative", $Input, $Output) :-
        tree_match(TREE_NODE\$Node &
                   TREE_DTRS\[tree_any & ANY_TREES\$LeftTrees,
                              (TREE_NODE\SYM\"S" &
                               TREE_DTRS\($PRDTrees &
                                          [tree_any,
                                           tree & TREE_NODE\FUNC\"PRD",
                                           tree_any])),
                              tree_any & ANY_TREES\$RightTrees],
                   $Input),
        append_list([$LeftTrees, $PRDTrees, $RightTrees], $Dtrs),
        $Output = TREE_NODE\$Node & TREE_DTRS\$Dtrs.

[Figure: the tree pattern applied to “He considered himself superior”: the embedded S dominating the NP “himself” and the ADJP “superior” is removed, so that both become daughters of the VP headed by “considered”.]
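The effect of the "predicative" rule can be imitated in Python for readers unfamiliar with the rule language used on the slide. This is a rough sketch, not a port of the original rule: the `Tree` class and the functions `has_prd` and `flatten_small_clauses` are hypothetical, function tags are assumed to be encoded in labels such as `ADJP-PRD`, and phrases are treated as leaves for brevity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Tree:
    label: str                                  # e.g. "S", "NP-SBJ", "ADJP-PRD"
    children: List["Tree"] = field(default_factory=list)
    word: str = ""


def has_prd(tree):
    """True if some daughter carries the PRD (predicate) function tag."""
    return any("PRD" in child.label.split("-")[1:] for child in tree.children)


def flatten_small_clauses(tree):
    """Return a copy in which every daughter S that has a PRD daughter
    is replaced by its own daughters (the small clause is spliced out)."""
    new_children = []
    for child in tree.children:
        child = flatten_small_clauses(child)
        if child.label.split("-")[0] == "S" and has_prd(child):
            new_children.extend(child.children)
        else:
            new_children.append(child)
    return Tree(tree.label, new_children, tree.word)


# "He considered himself superior"
small_clause = Tree("S", [Tree("NP-SBJ", word="himself"),
                          Tree("ADJP-PRD", word="superior")])
sentence = Tree("S", [Tree("NP-SBJ", word="He"),
                      Tree("VP", [Tree("VBD", word="considered"),
                                  small_clause])])
converted = flatten_small_clauses(sentence)
print([c.label for c in converted.children[1].children])
# ['VBD', 'NP-SBJ', 'ADJP-PRD']
```

As in the rule on the slide, the left and right sisters of the small clause are kept in place and only the S node itself disappears; the input tree is left unchanged.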


Passive

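• “be + VBN” constructions are assigned “VFORM passive”

[Figure: Penn Treebank tree for “the details have n't been worked out”: the subject NP-SBJ-2 is coindexed with the trace *-2 in the object NP of worked/VBN, and worked/VBN is assigned VFORM passive.]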
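The “be + VBN” test can be approximated on a POS-tagged token sequence. The sketch below is a simplified heuristic for illustration only: the actual conversion works on trees, and `BE_FORMS`, `SKIP_TAGS` and `mark_passives` are hypothetical names introduced here.

```python
BE_FORMS = {"be", "am", "is", "are", "was", "were", "been", "being"}
SKIP_TAGS = {"RB", "RBR", "RBS"}        # adverbs and negation are skipped


def mark_passives(tagged):
    """Map the index of each VBN whose nearest preceding verb is a form
    of 'be' to the value 'passive'."""
    vforms = {}
    prev_verb = None
    for i, (word, tag) in enumerate(tagged):
        if tag in SKIP_TAGS:
            continue
        if tag == "VBN" and prev_verb is not None and prev_verb.lower() in BE_FORMS:
            vforms[i] = "passive"
        # Remember the word only if it is a verb or modal.
        prev_verb = word if tag.startswith("VB") or tag == "MD" else None
    return vforms


sentence = [("the", "DT"), ("details", "NNS"), ("have", "VBP"), ("n't", "RB"),
            ("been", "VBN"), ("worked", "VBN"), ("out", "RP")]
print(mark_passives(sentence))   # {5: 'passive'}
```

Here "been" follows "have" and is therefore a perfect participle, while "worked" follows "been" and is marked passive, matching the annotation described above.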


Noun phrase structures

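• Determiners are raised
• Possessive structures are explicitly represented

[Figure: the tree for “Monsanto 's director of plant sciences” before and after conversion; in the converted tree the possessive “Monsanto 's” forms a DP and “director of plant sciences” forms an N’ inside the NP.]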


Auxiliary/control verbs

• Auxiliary/control verbs are annotated as taking unsaturated constituents

[Figure: the tree for “they did n't have to choose this particular moment” before and after conversion: the embedded S with the null subject *-1 (coindexed with NP-1 “they”) becomes a VP, and the SUBJ values of the auxiliary/control verbs are shared with the SUBJ values of their VP complements.]


Subject extraction

• HPSG does not allow subject extraction
• Relativizers are treated as ordinary subjects in relative clauses

[Figure: the tree for “The company which has reported net losses” before and after conversion: the S containing the subject trace *T*-1 is removed, so that WHNP-1 “which” combines directly with the VP “has reported net losses”.]

Page 254

Subject relative

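• Relativizers have a non-empty list in REL
• The element of REL is consumed in a head-relative construction and represents the relative-antecedent relation

[Figure: (NP (NP The company) (SBAR (WHNP-1 which) (VP has (VP reported (NP net losses))))), annotated with REL values. The relativizer and the SBAR carry a non-empty REL list whose element (tag 2) is shared with the antecedent NP; the top NP has an empty REL list.]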

Page 255

LDDs: Object relative

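• SLASH represents moved arguments
• REL represents relative-antecedent relations

[Figure: “the energy and ambitions that reformers wanted to reward”:
(NP (NP the energy and ambitions) (SBAR (WHNP-3 that) (S (NP-2 reformers) (VP wanted (S (NP *-2) (VP to (VP reward (NP *T*-3))))))))
The nodes on the path from the trace *T*-3 up to the relative clause carry SLASH <1>, where tag 1 is the moved object; the REL value (tag 2) of the relativizer links the relative clause to its antecedent NP, and the top NP has empty REL and SLASH lists.]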

Page 256

Mapping into HPSG-style representations

• Convert nonterminal symbols into HPSG-style categories

• Assign schema names to internal nodes

NN → [HEAD: noun, AGR: 3sg]

VBD → [HEAD: verb, VFORM: finite, TENSE: past]
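The category mapping above can be pictured as a lookup table from Penn Treebank tags to partial feature structures. The following is only a minimal sketch: the names `CATEGORY_MAP` and `map_category` are hypothetical, and only the NN and VBD rows are taken from the slide.

```python
# Hypothetical mapping table; only the NN and VBD rows come from the slide.
CATEGORY_MAP = {
    "NN": {"HEAD": "noun", "AGR": "3sg"},
    "VBD": {"HEAD": "verb", "VFORM": "finite", "TENSE": "past"},
}


def map_category(pos_tag):
    """Return a fresh HPSG-style feature dict for a Penn Treebank POS tag."""
    if pos_tag not in CATEGORY_MAP:
        raise KeyError(f"no category mapping for {pos_tag!r}")
    # Copy so that callers can add features without changing the table.
    return dict(CATEGORY_MAP[pos_tag])


print(map_category("VBD"))
```

Returning a copy keeps the table itself unchanged when later steps add more features to a node.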

Page 257

Category mapping & schema name assignment

• Example: “NL is officially making the offer”

[Figure: two views of the tree for “NL is officially making the offer”. First, the Penn Treebank tree (S, NP, VP nodes) with each daughter marked as head, arg, or mod. Second, the same tree with HPSG-style categories at the nodes (HEAD noun, HEAD verb, HEAD adv, with SUBJ and COMPS lists) and schema names at the internal nodes: subject-head at the root, then head-comp, head-mod, and head-comp.]

Page 258

Principle application

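inverse_schema_binary(subj_head_schema, $Mother, $Left, $Right) :-
    $Left = (SYNSEM\($LeftSynsem &
                     LOCAL\CAT\(HEAD\MOD\[] &
                                VAL\(SUBJ\[] & COMPS\[] & SPR\[])))),
    $Right = (SYNSEM\LOCAL\CAT\(HEAD\$Head &
                                VAL\(SUBJ\[$LeftSynsem] & COMPS\[] & SPR\[]))),
    $Mother = (SYNSEM\LOCAL\CAT\(HEAD\$Head &
                                 VAL\(SUBJ\[] & COMPS\[] & SPR\[]))).

[Figure: applying the subject-head schema to “He considered ...”. Before: the daughters are annotated only with HEAD: noun (“He”) and HEAD: verb (“considered ...”), and the mother with HEAD: verb. After: “He” is [HEAD: noun, SUBJ: <>], “considered ...” is [HEAD: verb, SUBJ: <HEAD: noun>], and the mother is [HEAD: verb, SUBJ: <>]; the subject daughter and the element of the head daughter's SUBJ list are linked by structure-sharing.]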
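As a rough illustration of what the inverse subject-head schema does, the sketch below fills in the three signs with plain Python dicts. It is an assumption-laden simplification: real feature structures are typed and unified, whereas here the values are simply set, and the function name `inverse_subj_head` is hypothetical.

```python
def inverse_subj_head(left, right):
    """Fill in mother and daughters for a node annotated as subject-head.

    `left` is the subject daughter, `right` the head daughter; both are
    partial signs represented as dicts. Values are overwritten rather than
    unified, so no consistency check is performed.
    """
    # The subject daughter must be saturated.
    left = {**left, "SUBJ": [], "COMPS": [], "SPR": []}
    # The head daughter still needs exactly this subject. Storing the same
    # dict object mimics structure-sharing.
    right = {**right, "SUBJ": [left], "COMPS": [], "SPR": []}
    # The mother inherits HEAD from the head daughter and is saturated.
    mother = {"HEAD": right["HEAD"], "SUBJ": [], "COMPS": [], "SPR": []}
    return mother, left, right


mother, subj, head = inverse_subj_head({"HEAD": "noun"}, {"HEAD": "verb"})
print(mother)
print(head["SUBJ"][0] is subj)
```

The identity check at the end shows the point of structure-sharing: the element on the head daughter's SUBJ list is the subject daughter itself, not a copy.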

Page 259

Principle application

[Figure: the tree for “NL is officially making the offer” before and after principle application. Before: nodes carry only HEAD values and the schema names subject-head, head-comp, head-mod, and head-comp. After: every node has HEAD, SUBJ, and COMPS values, “officially” has HEAD adv with a MOD value, and the values are linked by structure-sharing tags 1 to 4.]

Page 260

Complicated example

[Figure: “the prices we were charged”. First, the Penn Treebank tree (NP, SBAR, WHNP-1, S, VP) with the empty elements *-2 and *T*-1 and head/arg marks on the daughters. Second, the resulting HPSG tree, in which the signs for “the”, “prices”, “we”, “were”, and “charged” and the phrases above them carry HEAD, SUBJ, COMPS, SPR, SLASH, and REL values linked by structure-sharing tags 1 to 4.]

Page 261

Lexicon extraction

• Collecting leaf nodes of HPSG parse trees
• Generalizing leaf nodes into lexical entry templates
• Applying inverse lexical rules
• Assigning predicate argument structures

Page 262

Overview

• Collection of leaf nodes & generalization
• Application of inverse lexical rules
• Assignment of predicate argument structures

[Figure: the HPSG tree for “NL is officially making the offer”, from which the leaf node of “making” is processed in three steps:
making: [HEAD verb, SUBJ <HEAD noun>, COMPS <HEAD noun>]
→ make: [HEAD verb, SUBJ <HEAD noun>, COMPS <HEAD noun>]
→ make: [HEAD verb, SUBJ <[HEAD noun, CONT 1]>, COMPS <[HEAD noun, CONT 2]>, CONT make’(ARG1 1, ARG2 2)]]

Page 263

Collecting leaf nodes

• Leaf nodes of HPSG parse trees are instances of lexical entries

[Figure: the HPSG tree for “NL is officially making the offer” with fully specified HEAD, SUBJ, and COMPS values and structure-sharing tags 1 to 4; the leaf nodes (NL, is, officially, making, the offer) are the signs that are collected.]

Page 264

Generalization into lexical entry templates

• Unnecessary constraints are removed (restriction)

A leaf node of the HPSG treebank:
[HEAD: verb, SUBJ: <HEAD: noun [POSTHEAD: minus]>]

Lexical entry template:
[HEAD: verb, SUBJ: <HEAD: noun>]

lexical_entry_template($WordInfo, $Sign, $Template) :-
    copy($Sign, $Template),
    $Template = (SYNSEM\LOCAL\(CAT\HEAD\$Head & VAL\(SUBJ\$Subj & COMPS\$Comps & SPR\$SPR))),
    ...
    restriction($SubjSynsem, [NONLOCAL\]),
    restriction($SubjSynsem, [LOCAL\, CAT\, HEAD\, POSTHEAD\]),
    restriction($SubjSynsem, [LOCAL\, CAT\, HEAD\, AUX\]),
    restriction($SubjSynsem, [LOCAL\, CAT\, HEAD\, TENSE\]),
    ...
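To make the idea of restriction concrete, here is a small sketch over nested Python dicts. It is not the LiLFeS implementation: the dict encoding, the `TYPE` key, and the function bodies are assumptions for illustration; only the restricted paths (NONLOCAL, and POSTHEAD, AUX, TENSE under LOCAL, CAT, HEAD) follow the slide.

```python
import copy


def restriction(fs, path):
    """Delete the value at `path` (a list of feature names) from nested dict `fs`."""
    node = fs
    for feature in path[:-1]:
        node = node.get(feature)
        if not isinstance(node, dict):
            return  # path is absent, nothing to remove
    node.pop(path[-1], None)


def lexical_entry_template(sign):
    """Copy a leaf sign and strip constraints that are too specific for a template."""
    template = copy.deepcopy(sign)
    for subj in template.get("SUBJ", []):
        restriction(subj, ["NONLOCAL"])
        for feature in ("POSTHEAD", "AUX", "TENSE"):
            restriction(subj, ["LOCAL", "CAT", "HEAD", feature])
    return template


leaf = {
    "HEAD": "verb",
    "SUBJ": [{"LOCAL": {"CAT": {"HEAD": {"TYPE": "noun", "POSTHEAD": "minus"}}},
              "NONLOCAL": {}}],
}
print(lexical_entry_template(leaf))
```

Working on a deep copy mirrors the `copy($Sign, $Template)` step: the treebank leaf is left untouched while the template loses the POSTHEAD constraint.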

Page 265

Application of inverse lexical rules

• Converting lexical entries of inflected words into lexical entries of lexemes using inverse lexical rules

• Derivational rules: Ex. passive rule
  [HEAD: verb, SUBJ: <HEAD: noun>, COMPS: <HEAD: prep_by>] → [HEAD: verb, SUBJ: <HEAD: noun>, COMPS: <HEAD: noun>]

• Inflectional rules: Ex. past-tense rule
  [HEAD: verb, VFORM: finite, TENSE: past] → [HEAD: verb, VFORM: base]
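The two example rules can be sketched as functions that map an inflected or derived entry back to a lexeme entry. This is a simplified illustration under stated assumptions: entries are flat dicts, SUBJ and COMPS hold category names only, and the function names are hypothetical.

```python
def inverse_past_tense(entry):
    """Map a finite past-tense verb entry to its base-form entry, or None."""
    if (entry.get("HEAD") == "verb" and entry.get("VFORM") == "finite"
            and entry.get("TENSE") == "past"):
        lexeme = {k: v for k, v in entry.items() if k != "TENSE"}
        lexeme["VFORM"] = "base"
        return lexeme
    return None  # rule does not apply


def inverse_passive(entry):
    """Map a passive entry with a by-phrase complement to the active entry, or None."""
    if entry.get("HEAD") == "verb" and entry.get("COMPS") == ["prep_by"]:
        return {**entry, "COMPS": ["noun"]}
    return None  # rule does not apply


print(inverse_past_tense({"HEAD": "verb", "VFORM": "finite", "TENSE": "past"}))
print(inverse_passive({"HEAD": "verb", "SUBJ": ["noun"], "COMPS": ["prep_by"]}))
```

Returning `None` when a rule does not match lets a caller try each inverse rule in turn and keep the entry unchanged if none applies.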

Page 266

Predicate argument structures

• Create mappings from syntactic arguments into semantic arguments

Ex. lexical entry for “make”:

CAT: [HEAD verb,
      VAL: [SUBJ <[CAT|HEAD noun, CONT 1]>,
            COMPS <[CAT|HEAD noun, CONT 2]>]]
CONT: make’(ARG1 1, ARG2 2)

Page 270

Probabilistic models

• Feature forest model
  – A solution to the problem of the probabilistic modeling of feature structures

• Design of features
  – How to represent preferences of HPSG parse trees

Page 271

Example: PCFG

Training data and PCFG estimates:

Tree (S → NP VP)   Observed freq.   Estimated prob.
She dances         0.3              0.15
I dance            0.3              0.15
She danced         0.2              0.2
I danced           0.2              0.2

CFG rule probabilities:

S → NP VP      1.0
NP → She       0.5
NP → I         0.5
VP → dances    0.3
VP → dance     0.3
VP → danced    0.4

Page 272

What is the problem?

• PCFG assigns probabilities to ungrammatical structures
  – “She dance” (0.15), “I dances” (0.15)

Training data:

Tree (S → NP VP)   Observed freq.   Estimated prob.
She dances         0.3              0.15
I dance            0.3              0.15
She danced         0.2              0.2
I danced           0.2              0.2
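The numbers in this example can be reproduced with a few lines of Python. This is a sketch under the slide's setup: rule probabilities are relative frequencies read off the four training trees, and S → NP VP has probability 1.0. The variable and function names are illustrative only.

```python
from collections import Counter
from itertools import product

# Observed relative frequencies of the four training trees (from the slide).
training = {("She", "dances"): 0.3, ("I", "dance"): 0.3,
            ("She", "danced"): 0.2, ("I", "danced"): 0.2}

np_prob, vp_prob = Counter(), Counter()
for (subject, verb), freq in training.items():
    np_prob[subject] += freq  # relative frequency of NP -> subject
    vp_prob[verb] += freq     # relative frequency of VP -> verb


def pcfg_prob(subject, verb):
    """Probability of the tree S -> NP VP; the S rule has probability 1.0."""
    return 1.0 * np_prob[subject] * vp_prob[verb]


for subject, verb in product(np_prob, vp_prob):
    print(subject, verb, round(pcfg_prob(subject, verb), 2))
```

The loop enumerates all six NP/VP combinations, so the ungrammatical “She dance” and “I dances” show up with probability 0.15 each, which is exactly the mass missing from the grammatical present-tense trees.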

Page 273

Feature structure constraints

• In HPSG, feature structures explain grammatical constraints

S → NP[AGR 1] VP[AGR 1]
NP[AGR: 3sg] → She
NP[AGR: no3sg] → I
VP[AGR: 3sg] → dances
VP[AGR: no3sg] → dance
VP → danced

• “She dance” and “I dances” are never generated
• However, constraints of feature structures violate the “independence assumption” of probabilistic models (Abney 1997)

How can we estimate probabilities in this situation?

Page 274

Solution: ME model

• Probabilities of parse trees are estimated by maximum entropy models (Berger et al. 1996)

• Probability p(T) of parse tree T:

$$p(T) = \frac{1}{Z} \exp\Bigl(\sum_i \lambda_i f_i(T)\Bigr)$$

where $f_i$ is a feature function, $\lambda_i$ is a parameter (feature weight), and $Z$ is the normalization factor.

• Optimal parameters are computed so as to maximize the likelihood of training data

Page 275

ME model of parse trees

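• If feature functions correspond to CFG rules, this model is an extension of the PCFG model

• Probabilities of parse trees are estimated without the independence assumption

$$p(T) = \frac{1}{Z} \exp\bigl(\lambda_1 f_1(T) + \lambda_2 f_2(T) + \lambda_3 f_3(T)\bigr) = \frac{1}{Z}\, (e^{\lambda_1})^{f_1(T)} (e^{\lambda_2})^{f_2(T)} (e^{\lambda_3})^{f_3(T)}$$

[Figure: the tree for “She dances”, with $f_1$ attached to S → NP VP, $f_2$ to NP → She, and $f_3$ to VP → dances.]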

Page 276

Estimation by a ME model

Training data and ME estimates:

Tree (S → NP VP)   Observed freq.   Unnormalized product   Estimated prob.
She dances         0.3              1.145                  0.3
I dance            0.3              1.145                  0.3
She danced         0.2              0.763                  0.2
I danced           0.2              0.763                  0.2

The unnormalized product of a tree is $\prod_i \exp(\lambda_i f_i)$.

ME parameters $\exp(\lambda_i)$:

S → NP[AGR 1] VP[AGR 1]    1.0
NP[AGR: 3sg] → She         1.0
NP[AGR: no3sg] → I         1.0
VP[AGR: 3sg] → dances      1.145
VP[AGR: no3sg] → dance     1.145
VP → danced                0.763
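As a quick check of this example, the sketch below normalizes the unnormalized products over the four grammatical trees only. The rule labels used as dictionary keys are illustrative; the weights 1.145 and 0.763 are the exp(λ) values from the slide, and all other rules have weight 1.0.

```python
# exp(lambda_i) values from the slide; rules not listed here have weight 1.0.
weights = {"VP[3sg]->dances": 1.145, "VP[no3sg]->dance": 1.145, "VP->danced": 0.763}

# Only grammatical trees exist. Each tree is described by the VP rule it uses,
# because the S and NP rules all contribute a factor of 1.0.
trees = {"She dances": "VP[3sg]->dances", "I dance": "VP[no3sg]->dance",
         "She danced": "VP->danced", "I danced": "VP->danced"}

unnormalized = {tree: weights[rule] for tree, rule in trees.items()}
Z = sum(unnormalized.values())  # normalization factor
p = {tree: q / Z for tree, q in unnormalized.items()}

for tree, prob in p.items():
    print(tree, round(prob, 2))
```

Because Z sums only over trees the grammar licenses, no probability mass goes to “She dance” or “I dances”, and the estimates match the observed frequencies.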

Page 277

Combinatorial explosion of parse trees

• Exponentially many parse trees are assigned to sentences (i.e., a set of T is exponential)

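[Figure: a packed forest S → NP VP with n alternative NPs (NP1, NP2, ...) and m alternative VPs (VP1, VP2, ...). By expanding it, we obtain the trees S → NP1 VP1, S → NP2 VP1, S → NP1 VP2, S → NP2 VP2, ... Size: n × m]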

Page 278

Problems by combinatorial explosion

• Parameter estimation is intractable
  – Computation of $Z = \sum_T \exp\bigl(\sum_i \lambda_i f_i(T)\bigr)$
  – Computation of $E(f_i) = \sum_T f_i(T)\, p(T)$

• Searching for the most probable parse is intractable
  – Computation of $\hat{T} = \arg\max_T p(T)$

Page 279

Solutions in HMM and PCFG

• Probabilistic models are divided into independent probabilities, and dynamic programming is applied
  – Forward-backward probability
  – Baum-Welch algorithm
  – Inside-outside probability
  – Viterbi search

• Inside/outside probabilities can be computed at a cost proportional to the number of nodes, assuming a forest structure of parse trees

Page 280

Feature forest model

• Dynamic programming can also be applied to maximum entropy estimation

• Feature forest:
  – Forest structure isomorphic to CFG parse forest
  – Assign feature functions to nodes rather than symbols

• An ME model is estimated without unpacking feature forests

[Figure: feature forest with root f(S) and alternative daughters f(NP1), f(NP2), ... and f(VP1), f(VP2), ... Size: n + m]

Page 281

Feature forest representation of a parse tree

• A feature forest represents exponentially many trees of features

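[Figure: a packed forest S → NP VP with n alternative NPs (NP1, NP2, ...) and m alternative VPs (VP1, VP2, ...), and its feature forest representation with nodes f(S), f(NP1), f(NP2), ..., f(VP1), f(VP2), ... Size: n + m]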

Page 282

Inside/outside trees of a feature forest

• Focus on a set of trees below/above the targeted node

• Inside trees $T_I(n)$: trees below n

• Outside trees $T_O(n)$: trees above n

[Figure: feature forest representation with nodes f(S), f(NP1), f(NP2), f(VP1), f(VP2), marking the inside trees $T_I(\mathrm{NP}_1)$ and the outside trees $T_O(\mathrm{NP}_1)$.]

Page 283

Estimation algorithms for ME models

• Estimation of parameters requires computation of model expectations (Malouf 2002)

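Objective function:

$$G(\lambda) = \frac{1}{|D|} \sum_{x_k \in D} \sum_i \lambda_i f_i(x_k) - \log Z$$

Gradient:

$$\frac{\partial G(\lambda)}{\partial \lambda_i} = \tilde{E}(f_i) - E(f_i) = \frac{1}{|D|} \sum_{x_k \in D} f_i(x_k) - \sum_x f_i(x)\, p(x)$$

The first term, $\tilde{E}(f_i)$, is computed from training data; the second term, the model expectation $E(f_i)$, is recomputed at each iteration.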

Page 284

Inside/outside products

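• Unnormalized product

$$q(x) = \exp\Bigl(\sum_i \lambda_i f_i(x)\Bigr)$$

• Inside product

$$\varphi_{\mathrm{NP}_1} = \sum_{T_I \in T_I(\mathrm{NP}_1)} \exp\Bigl(\sum_i \lambda_i f_i(T_I)\Bigr)$$

• Outside product

$$\psi_{\mathrm{NP}_1} = \sum_{T_O \in T_O(\mathrm{NP}_1)} \exp\Bigl(\sum_i \lambda_i f_i(T_O)\Bigr)$$

[Figure: feature forest representation with nodes f(S), f(NP1), f(NP2), f(VP1), f(VP2).]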

Page 285

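Computation of inside products

• The inside product of NP1 is a product of inside products of its daughters

$$\varphi_{\mathrm{NP}_1} = \exp\Bigl(\sum_i \lambda_i f_i(\mathrm{NP}_1)\Bigr) \bigl(\varphi_{N_1}\varphi_{N'_1} + \varphi_{N_1}\varphi_{N'_2} + \varphi_{N_2}\varphi_{N'_1} + \varphi_{N_2}\varphi_{N'_2} + \cdots\bigr)$$

$$\phantom{\varphi_{\mathrm{NP}_1}} = \exp\Bigl(\sum_i \lambda_i f_i(\mathrm{NP}_1)\Bigr) \Bigl(\sum_{n \in \{N_1, N_2, \ldots\}} \varphi_n\Bigr) \Bigl(\sum_{n' \in \{N'_1, N'_2, \ldots\}} \varphi_{n'}\Bigr)$$

where $\varphi_n$ denotes the inside product of node n.

[Figure: feature forest representation with nodes f(NP1), f(NP2) and, below NP1, the alternatives f(N1), f(N2) and f(N’1), f(N’2).]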

Page 286

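Computation of outside products

• The outside product of NP1 is the product of its mother's outside product and its sisters' inside products

$$\psi_{\mathrm{NP}_1} = \exp\Bigl(\sum_i \lambda_i f_i(\mathrm{S})\Bigr) \bigl(\psi_{\mathrm{S}}\,\varphi_{\mathrm{VP}_1} + \psi_{\mathrm{S}}\,\varphi_{\mathrm{VP}_2} + \cdots\bigr) = \exp\Bigl(\sum_i \lambda_i f_i(\mathrm{S})\Bigr)\, \psi_{\mathrm{S}} \sum_{n \in \{\mathrm{VP}_1, \mathrm{VP}_2, \ldots\}} \varphi_n$$

where $\varphi_n$ denotes the inside product and $\psi_n$ the outside product of node n.

[Figure: feature forest representation with nodes f(S), f(NP1), f(NP2), f(VP1), f(VP2).]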

Page 287

Computation of model expectations

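• Sum of unnormalized products of trees including NP1

$$\varphi_{\mathrm{NP}_1} \psi_{\mathrm{NP}_1} = \sum_{T:\ T \text{ includes } \mathrm{NP}_1} q(T)$$

• Expectation of $f_i$ at NP1

$$E_{\mathrm{NP}_1}(f_i) = \frac{1}{Z}\, f_i(\mathrm{NP}_1)\, \varphi_{\mathrm{NP}_1} \psi_{\mathrm{NP}_1}$$

where $\varphi_{\mathrm{NP}_1}$ is the inside product and $\psi_{\mathrm{NP}_1}$ the outside product of NP1.

[Figure: feature forest representation with nodes f(S), f(NP1), f(NP2), f(VP1), f(VP2).]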
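The inside product, outside product, and node expectation can be tried out on a toy forest. The sketch below is not the authors' implementation: the `Node` class, the numeric scores, and the two-level forest are assumptions chosen so that the packed computation can be compared with brute-force unpacking.

```python
from itertools import product


class Node:
    """Conjunctive node: a local score exp(sum_i lambda_i f_i(node)) and a list
    of daughter slots, each slot being a list of alternative Nodes."""

    def __init__(self, name, score, daughters=()):
        self.name, self.score, self.daughters = name, score, list(daughters)


def inside(node):
    """Own score times, for each daughter slot, the sum over its alternatives."""
    result = node.score
    for alternatives in node.daughters:
        result *= sum(inside(alt) for alt in alternatives)
    return result


# Toy forest: S with n = 2 NP alternatives and m = 3 VP alternatives.
nps = [Node("NP1", 2.0), Node("NP2", 0.5)]
vps = [Node("VP1", 1.5), Node("VP2", 1.0), Node("VP3", 0.25)]
root = Node("S", 1.2, [nps, vps])

Z = inside(root)  # visits n + m leaves
# Brute force: unpack all n * m trees and sum their unnormalized products.
Z_unpacked = sum(root.score * a.score * b.score for a, b in product(nps, vps))

# Outside product of NP1: the mother's outside product (1.0 at the root) times
# the mother's own score times the summed inside products of the sister slot.
outside_np1 = 1.0 * root.score * sum(inside(vp) for vp in vps)
# Probability mass of trees that include NP1, i.e. the expectation of an
# indicator feature that fires at NP1.
expectation_np1 = inside(nps[0]) * outside_np1 / Z

print(Z, Z_unpacked, expectation_np1)
```

A forest with shared sub-nodes would need memoization in `inside`; it is omitted here because every node in the toy forest is visited exactly once.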

Page 288

Viterbi search

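• Almost the same as the computation of inside products
  – “max” rather than “sum”

$$\max q(\mathrm{NP}_1) = \exp\Bigl(\sum_i \lambda_i f_i(\mathrm{NP}_1)\Bigr) \Bigl(\max_{n \in \{N_1, N_2, \ldots\}} \max q(n)\Bigr) \Bigl(\max_{n' \in \{N'_1, N'_2, \ldots\}} \max q(n')\Bigr)$$

where $\max q(n)$ is the largest unnormalized product among the trees below node n.

[Figure: feature forest representation with nodes f(NP1), f(NP2) and, below NP1, the alternatives f(N1), f(N2) with products q(N1), ... and f(N’1), f(N’2) with products q(N’1), ...]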
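Replacing “sum” by “max” in the inside computation gives a Viterbi search over the forest. The following is a minimal sketch with an assumed tuple encoding `(name, score, daughters)` and made-up scores; it returns both the best unnormalized product and the nodes of the best tree.

```python
def viterbi(node):
    """Return (best unnormalized product, node names of the best tree).

    `node` is a tuple (name, score, daughters), where `daughters` is a list of
    slots and each slot is a list of alternative nodes.
    """
    name, score, daughters = node
    best_score, best_names = score, [name]
    for alternatives in daughters:
        # Pick the alternative whose best subtree has the largest product.
        sub_score, sub_names = max((viterbi(alt) for alt in alternatives),
                                   key=lambda pair: pair[0])
        best_score *= sub_score
        best_names = best_names + sub_names
    return best_score, best_names


n_alternatives = [("N1", 0.4, []), ("N2", 1.5, [])]
n_bar_alternatives = [("N'1", 2.0, []), ("N'2", 0.3, [])]
np1 = ("NP1", 1.1, [n_alternatives, n_bar_alternatives])

best_score, best_names = viterbi(np1)
print(best_score, best_names)
```

Because each slot is maximized independently, the cost stays proportional to the number of nodes, as with the inside products.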

Page 289

Design of features

• Feature engineering is important for higher accuracy

• Feature functions are designed for capturing syntactic/semantic preferences of HPSG parse trees

Page 290

A chart for HPSG parsing

he saw a girl with a telescope

[Figure: parse chart for the sentence above. The cells hold signs such as [HEAD noun, SUBCAT <>], [HEAD verb, SUBCAT <NP,NP>], [HEAD verb, SUBCAT <NP>], [HEAD verb, SUBCAT <>], [HEAD prep, MOD NP, SUBCAT <NP>], [HEAD prep, MOD VP, SUBCAT <NP>], [HEAD prep, MOD NP, SUBCAT <>], and [HEAD prep, MOD VP, SUBCAT <>].]

Equivalent signs are packed

Page 291

Feature forest representation of a chart

• Node = each rule application

[Figure: feature forest built from the HPSG parse chart. Each node is one rule application, drawn as the mother sign together with its daughter signs (combinations of HEAD noun, HEAD verb, and HEAD prep signs with their MOD and SUBCAT values), starting from the lexical sign for “he”.]

Page 292

Feature forest representation of predicate argument structures

• Node = already-determined predicate argument relations

She ignored the fact that I wanted to dispute

[Figure: feature forest of predicate argument structures for the sentence above, with nodes for fact, want, dispute1, dispute2, and I connected by ARG1 and ARG2 relations; the alternative analyses of “dispute” (dispute1, dispute2) are packed and linked by tags 1 to 4.]

Page 293

Extraction of probabilistic events

extract_binary_event("hpsg-forest", "bin", $RuleName, $LDtr, $RDtr, _, _, $Event) :-
    $Event = [$RuleName, $Dist, $Depth|$HDtrFeatures],
    find_head($Rule, $LSign, $RSign, $Head, $NonHead),
    rule_name_mapping($Rule, $Head, $NonHead, $RuleName),
    encode_distance($LSign, $RSign, $Dist),
    encode_depth($LSign, $RSign, $Depth),
    encode_sign($Head, $HDtrFeatures, $NDtrFeatures),
    encode_sign($NonHead, $NDtrFeatures, []).

Example event for the root node of “Cool boys never ran”, (S (NP Cool boys) (VP (ADVP never) ran)):

<subj-head, 2, 1, VP, ran, VBD, V_intrans-past, 2, NP, boys, NNS, N_plural, 2>

Fields: schema, distance, depth, followed by five fields for the head daughter and five for the non-head daughter: nonterminal symbol (NTS), word, POS, lexical entry, span.

Page 294

Atomic features

• RULE: name of applied rule
• DIST: distance between head words
• COMMA: whether the phrase includes commas
• SPAN: number of words the phrase dominates
• SYM: nonterminal symbol (e.g. S, VP, …)
• WORD: head word
• POS: part-of-speech
• LE: lexical entry
• ARG: argument label (ARG1, ARG2, ...)

Page 295

Example: syntactic features

• Feature for the Head-Modifier construction for “saw a girl” and “with a telescope”

f = <RULE, DIST, COMMA, SPAN_l, SYM_l, WORD_l, POS_l, LE_l, SPAN_r, SYM_r, WORD_r, POS_r, LE_r>
  = <head-modifier, 3, 0, 3, VP, saw, VBD, transitive, 3, PP, with, IN, prep-mod-vp>

[Figure: parse chart for “he saw a girl with a telescope”, highlighting the rule application that combines the sign of “saw a girl” ([HEAD verb, SUBCAT <NP>]) with the sign of “with a telescope” ([HEAD prep, MOD VP, SUBCAT <>]).]

Page 296

Example: semantic features

• Feature for the predicate argument relation between “he” and “saw”

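f_pa = <ARG, DIST, WORD_h, POS_h, LE_h, WORD_n, POS_n, LE_n>
     = <ARG1, 1, saw, VBD, transitive, he, PRP, pronoun>

[Figure: predicate argument structure of “saw”, with ARG1 pointing to “he” and ARG2 pointing to “girl”.]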

Page 297

Feature generation

• Features are generated by abstracting descriptions of probabilistic events

feature_mask("hpsg-forest", "bin", [1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0]).
feature_mask("hpsg-forest", "bin", [1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0]).
feature_mask("hpsg-forest", "bin", [1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]).

Probabilistic event:
<subj-head, 2, 1, VP, ran, VBD, V_intrans-past, 2, NP, boys, NNS, N_plural, 2>

Feature generated by the first mask:
<subj-head, 2, _, _, ran, VBD, V_intrans-past, _, _, boys, NNS, N_plural, _>
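Applying a feature mask amounts to keeping the event fields where the mask is 1 and abstracting away the rest. The sketch below uses a hypothetical helper `apply_mask`; the event and the three masks are the ones shown on the slide.

```python
def apply_mask(event, mask):
    """Keep event fields where the mask is 1; replace the rest with the wildcard '_'."""
    if len(event) != len(mask):
        raise ValueError("event and mask must have the same length")
    return tuple(field if keep else "_" for field, keep in zip(event, mask))


event = ("subj-head", 2, 1, "VP", "ran", "VBD", "V_intrans-past", 2,
         "NP", "boys", "NNS", "N_plural", 2)
masks = [
    [1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0],
    [1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
]
features = [apply_mask(event, mask) for mask in masks]
for feature in features:
    print(feature)
```

Each mask yields one feature per event, so a single rule application contributes several features at different levels of abstraction (with lexical entries, with words and POS only, with POS only).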

Page 298

Parsing

• Efficient processing of feature structures (details omitted)
  – Abstract machines, quick check, CFG filtering, etc.

• Efficient search with probabilistic HPSG
  – Beam thresholding
  – Iterative beam thresholding

Page 299

Beam thresholding

• Thresholding out edges in each cell of the chart
  – Thresholding by number: for each cell, keep only the best n edges
  – Thresholding by width: for each cell, keep only the edges whose FOM (figure of merit) is within w of the best FOM in the same cell
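Both thresholds can be applied to a chart cell in one pass. The sketch below is an illustration under assumptions: edges are `(fom, edge)` pairs, a higher FOM is better (for example a log probability), and an edge exactly w below the best one is kept. The function name `threshold_cell` is hypothetical.

```python
def threshold_cell(edges, n, w):
    """Prune one chart cell.

    `edges` is a list of (fom, edge) pairs. Keep at most the n best edges,
    and among those only the edges whose FOM is within w of the best FOM.
    """
    if not edges:
        return []
    ranked = sorted(edges, key=lambda pair: pair[0], reverse=True)[:n]
    best_fom = ranked[0][0]
    return [(fom, edge) for fom, edge in ranked if fom >= best_fom - w]


cell = [(-1.0, "a"), (-1.5, "b"), (-4.0, "c"), (-2.0, "d")]
print(threshold_cell(cell, n=3, w=0.8))
```

In the usage line, thresholding by number first drops “c”, and thresholding by width then drops “d”, leaving the two edges closest to the best FOM.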

Page 300

Effect of beam thresholding

• Precision and recall by changing parameters of beam search

• Recall drops, while precision is retained

Page 301

Iterative beam thresholding

• Start with a narrow beam width
• Continue widening the beam width until parsing succeeds

Iterative_parse(sentence) {
  w := beam_width_start;
  while (w < beam_width_end) {
    parse(sentence, w);
    if (parse succeeds) return;
    w := w + beam_width_step;
  }
}
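The pseudocode translates directly into Python. This is a sketch, not the parser's actual interface: `parse` is assumed to be a caller-supplied function that returns a result on success and `None` on failure, and `toy_parse` exists only to make the example runnable.

```python
def iterative_parse(sentence, parse, width_start, width_end, width_step):
    """Retry parse(sentence, width) with a growing beam width until it succeeds.

    Returns (result, width used), or (None, None) if every width below
    width_end fails.
    """
    width = width_start
    while width < width_end:
        result = parse(sentence, width)
        if result is not None:
            return result, width
        width += width_step
    return None, None


def toy_parse(sentence, width):
    """Stand-in parser: succeeds only once the beam is at least 3 wide."""
    return f"parsed: {sentence}" if width >= 3 else None


print(iterative_parse("he saw a girl", toy_parse, 1, 10, 1))
```

Most sentences succeed with the narrow starting beam, so the extra iterations are paid only for the sentences that need a wider search.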

Page 302

Efficacy of iterative beam thresholding

• Evaluated on Penn Treebank Section 24 (< 15 words)

            Precision   Recall   F-score   Avg. time (ms)
Viterbi     88.2%       87.9%    88.1%     103923
Beam        89.0%       82.4%    85.5%     88
Iterative   87.6%       87.2%    87.4%     99

Page 303

Distribution of parsing time

• Black: Viterbi, Red: iterative beam thresholding

[Figure: scatter plot of parsing time (ms, log scale from 1 to 100,000,000) against sentence length (words, 0 to 15).]

Page 304

Evaluation

• Evaluation of the lexical entries extracted from Penn Treebank
  – Investigation of obtained lexical entries
  – Coverage

• Evaluation of the disambiguation model
  – Parsing accuracy

Page 305

Experimental settings

• Training data: Sections 2-21 of Penn Treebank II (39,832 sentences)

• Test data:
  – Development set: Section 22 (1,700 sentences)
  – Final test set: Section 23 (2,416 sentences)

Page 306

Number of tree conversion rules

Target of conversion                 Number
Penn-II errors                       102
Category mapping                     85
Head annotation and binarization     63
Difference of phrase structures      15
Predicate argument structures        13
Long distance dependencies           13
Others                               52
Total                                343

Page 307

Result of treebank conversion & lexicon extraction

• Treebank conversion and HPSG annotation succeeded for 37,886 sentences

• Extracted lexicon:

# words 34,765

# lexical entries 1,942

Average # lexical entries/word 1.43

Page 308

Sources of treebank conversion failures

• Classification of failures of treebank conversion in Section 02 (67 failures/1989 sentences)

Shortcomings of tree conversion rules 18

Errors in Penn Treebank 16

Constructions currently unsupported 20

Constructions unsupported by HPSG 13

Page 309

Breakdown of extracted lexical entries

              # words   # lexical entries   Avg. # lex. entries
noun          21,925    186                 1.14
verb          4,094     945                 1.94
adjective     8,078     62                  1.28
adverb        1,295     72                  2.75
preposition   159       193                 9.17
particle      58        10                  1.69
determiner    36        33                  3.86
conjunction   94        321                 9.46
punctuation   15        120                 22.00
Total         34,765    1,942               1.43

Page 310

Example lexical entries

Common noun (Ex. review/NN), appeared 140,805 times:
[HEAD noun [MOD <>], VAL [SPR <HEAD det>, SUBJ <>, COMPS <>]]

Transitive verb, appeared 12,244 times:
[HEAD verb [MOD <>, VFORM base], VAL [SPR <>, SUBJ <HEAD noun>, COMPS <HEAD noun>]]

Pre-head adjective, appeared 55,049 times:
[HEAD adj [MOD <HEAD noun>, POSTHEAD -], VAL [SPR <>, SUBJ <>, COMPS <>]]

Page 311

Evaluation of coverage

• The ratio of lexical entries in the test data covered by the grammar is measured

• A sentence is covered when all of the lexical entries in the sentence are covered (strong coverage)

                             Lexical entry   Sentence
w/o unknown word handling    96.52%          54.7%
w/ unknown word handling     99.15%          84.8%

Page 312

Treebank size vs. coverage


Sentence length vs. coverage


Error analysis

• Classification of randomly selected uncovered lexical entries

Errors of Penn Treebank 10

Errors of treebank conversion 48

Lack of lexical entries 23

Constructions currently unsupported 9

Idioms 6

Non-linguistic expressions (ex. list) 4


Examples of uncovered lexical entries

• Lack of mappings from words into lexical entries because of data sparseness
  – Post-noun adjectives (younger, crucial)
  – Coordination conjunctions of NP and S’
  – Verbs taking a present participle as a complement

• Unsupported constructions
  – Free relatives, extrapositions

• Incorrect lexical entries obtained because of idiomatic expressions
  – (ADVP in part) because …


Evaluation of parsing accuracy

• Empirical evaluation of the probabilistic models
  – Overall accuracy
  – Treebank size vs. accuracy
  – Sentence length vs. accuracy
  – Contribution of features
  – Coverage and accuracy
  – Error analysis

• Measure: precision/recall of <predicate word, argument position, argument word, predicate type>
  – e.g. <saw, ARG1, he, transitive>

[Figure: predicate-argument structure of "saw", with ARG1 "he" and ARG2 "girl"]
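Precision and recall over such tuples can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation script behind the reported figures: each dependency is assumed to be a 4-tuple (predicate word, argument position, argument word, predicate type), and the function name `precision_recall` is hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

# Assumed representation of one predicate-argument dependency.
Dependency = Tuple[str, str, str, str]


def precision_recall(predicted: Iterable[Dependency],
                     gold: Iterable[Dependency]) -> Tuple[float, float]:
    """Precision and recall of predicted dependencies against gold dependencies."""
    predicted_counts = Counter(predicted)
    gold_counts = Counter(gold)
    # Multiset intersection: a tuple only counts as correct as often as it occurs in gold.
    matched = sum((predicted_counts & gold_counts).values())
    n_predicted = sum(predicted_counts.values())
    n_gold = sum(gold_counts.values())
    precision = matched / n_predicted if n_predicted else 0.0
    recall = matched / n_gold if n_gold else 0.0
    return precision, recall


gold = [("saw", "ARG1", "he", "transitive"), ("saw", "ARG2", "girl", "transitive")]
predicted = [("saw", "ARG1", "he", "transitive"), ("saw", "ARG2", "he", "transitive")]
precision, recall = precision_recall(predicted, gold)
```

Because the predicate type is part of the tuple, choosing the wrong lexical entry for a verb counts as an error even when the argument words are right.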


Effect of feature forest models

• Accuracy for Section 23 (< 40 words)

                          Precision   Recall
baseline                      78.10    77.39
with syntactic features       86.92    86.28
with semantic features        84.29    83.74
with all features             86.54    86.02


Treebank size vs. accuracy

[Figure: precision/recall (%) as a function of the number of training sentences (0 to 40,000)]


Sentence length vs. accuracy

[Figure: x-axis "Sentence length" (0 to 60), y-axis "Coverage (%)" (0 to 100); legend: sentence coverage]


Contribution of features (1/2)

          Precision   Recall   # features
All           87.12    85.45      623,173
- RULE        86.98    85.37      620,511
- DIST        86.74    85.09      603,748
- COMMA       86.55    84.77      608,117
- SPAN        86.53    84.98      583,638
- SYM         86.90    85.47      614,975
- WORD        86.67    84.98      116,044
- POS         86.36    84.71      430,876
- LE          87.03    85.37      412,290
None          78.22    76.46       24,847


Contribution of features (2/2)

                             Precision   Recall   # features
All                              87.12    85.45      623,173
- DIST, SPAN                     85.54    84.02      294,971
- DIST, SPAN, COMMA              83.94    82.44      286,489
- RULE, DIST, SPAN, COMMA        83.61    81.98      283,897
- WORD, LE                       86.48    84.91       50,258
- WORD, POS                      85.56    83.94       64,915
- WORD, POS, LE                  84.89    83.43       33,740
- SYM, WORD, POS, LE             82.81    81.48       26,761
None                             78.22    76.46       24,847


Coverage and accuracy

• Accuracies for strongly covered/uncovered sentences

• We can expect accuracy improvements by improving grammar coverage

                       Precision   Recall   # sentences
Covered sentences          89.36    88.96         1,825
Uncovered sentences        75.57    74.04           319


Error analysis

• Classification of errors in randomly selected sentences (100 sentences)

PP-attachment ambiguity 76

Distinction of arguments/modifiers 49

Ambiguity of lexical entries 44

Errors in test data 22

Ambiguity of commas 32

Others 75


Examples of errors (1/2)

• Antecedent of a relative clause
  – It's made only in years when the grapes ripen perfectly (the last was 1979) and comes from a single acre of [NP grapes [S' that yielded a mere 75 cases in 1987]].

• Argument/modifier distinction of to-phrases
  – More than a few CEOs say the red-carpet treatment tempts them [VP-modifier to return to a heartland city for future meetings].


Examples of errors (2/2)

• Preposition or verb phrase?
  – Mitsui Mining & Smelting Co. posted a 62 % rise in pretax profit to 5.276 billion yen ($ 36.9 million) in its fiscal first half ended Sept. 30 [VP compared with 3.253 billion yen a year earlier].

• Selection of subcategorization frames
  – [NP-subject ``Nasty innuendoes,''] [VP says [NP-object John Siegal, Mr. Dinkins's issues director, ``designed to prosecute a case of political corruption that simply doesn't exist.'']]


Advanced topics

• Domain adaptation
  – Adapting the grammar and/or the disambiguation model to a new domain using a small amount of training data

• Generation
  – Using the grammar for sentence generation

• Semantics construction
  – Obtaining representations of formal semantics from HPSG parsing

• Applications


Domain adaptation (1/2)

• Disambiguation models are adapted to a bio domain using small training data
  – An original probabilistic model is incorporated into a new model as a reference distribution
  – Parameters of the new model are estimated so as to maximize the likelihood of the new training data

$$p_{new}(x) = \frac{1}{Z}\, p_{orig}(x) \exp\Bigl(\sum_i \lambda_i g_i(x)\Bigr)$$

Reference distribution: $p_{orig}(x)$
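The role of the reference distribution can be illustrated with a small sketch. It is a toy under stated assumptions, not the adaptation code itself: the candidate analyses are assumed to form a small finite set, the weights are taken as given rather than estimated by maximum likelihood, and the names `adapted_distribution`, `p_orig`, `features` and `weights` are hypothetical.

```python
import math
from typing import Dict, List


def adapted_distribution(p_orig: Dict[str, float],
                         features: Dict[str, List[float]],
                         weights: List[float]) -> Dict[str, float]:
    """Reweight a reference distribution with a log-linear correction.

    Computes p_new(x) = p_orig(x) * exp(sum_i weights[i] * features[x][i]) / Z,
    where Z normalizes over the candidates listed in p_orig.
    """
    unnormalized = {
        x: p * math.exp(sum(w * g for w, g in zip(weights, features[x])))
        for x, p in p_orig.items()
    }
    z = sum(unnormalized.values())
    return {x: value / z for x, value in unnormalized.items()}


# Two candidate analyses that the original model cannot tell apart.
p_orig = {"parse_a": 0.5, "parse_b": 0.5}
# One new-domain feature that fires only on parse_a.
features = {"parse_a": [1.0], "parse_b": [0.0]}
p_new = adapted_distribution(p_orig, features, weights=[math.log(3.0)])
```

With all weights at zero the new model reduces to the original one, so the new-domain features only need to encode how the new domain differs from the old.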


Domain adaptation (2/2)

• Evaluation with a bio-domain corpus
• Training data:
  – Penn Treebank (News): 39,832 sentences
  – GENIA Treebank (Bio): 3,524 sentences

                               Precision   Recall
News domain                       87.69%   87.16%
Bio domain (w/o adaptation)       85.50%   83.91%
Bio domain (w/ adaptation)        87.19%   85.58%


Generation (1/2)

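• The methods for HPSG parsing are applied to a chart generator of HPSG
  – Feature forest model
  – Iterative beam thresholding

[Figure: chart parsing vs. chart generation. Chart parsing of "He bought the book." (positions 0 1 2 3) indexes chart cells by contiguous spans: 0, 1, 2, 3; 0-1, 1-2, 2-3; 0-2, 1-3; 0-3. Chart generation from the input he(x) buy(e) the(y) book(z) past(e) (indices 0 1 2 3) indexes chart cells by index sets: {0}, {1}, {2}, {3}; {0,1}, {0,2}, {0,3}, {1,2}, {1,3}, {2,3}; {0,1,2}, {0,1,3}, {0,2,3}, {1,2,3}; {0,1,2,3}.]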
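The difference between the two charts can be made concrete by enumerating their cells. The sketch below only lists cell indices. It assumes cells are identified by inclusive spans for parsing and by non-empty index sets for generation, and the function names are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, List, Tuple


def parsing_cells(n: int) -> List[Tuple[int, int]]:
    """Chart-parsing cells: contiguous spans (start, end), inclusive, over n positions."""
    return [(start, start + length - 1)
            for length in range(1, n + 1)
            for start in range(n - length + 1)]


def generation_cells(n: int) -> List[FrozenSet[int]]:
    """Chart-generation cells: every non-empty subset of the n input indices."""
    return [frozenset(subset)
            for size in range(1, n + 1)
            for subset in combinations(range(n), size)]


spans = parsing_cells(4)         # (0, 0), ..., (0, 1), ..., (0, 3)
subsets = generation_cells(4)    # {0}, ..., {0, 1}, ..., {0, 1, 2, 3}
```

For n items the parsing chart has n(n+1)/2 cells while the generation chart has 2^n - 1. Cells such as {0,2} have no counterpart among contiguous spans, which is one reason why pruning techniques such as beam thresholding matter for generation.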


Generation (2/2)

• Evaluation on Penn Treebank Section 23

                              Beam width   Coverage (%)   Avg. generation time (msec.)     BLEU
Beam thresholding                      4          44.76                            621   0.8196
                                       8          67.70                           1776   0.8294
                                      12          73.12                           3074   0.8327
                                      16          72.90                           4287   0.8341
                                      20          71.81                           5273   0.8333
Iterative beam thresholding         8-20          82.47                           1668   0.7982
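The control loop behind iterative beam thresholding can be sketched as follows: search first with a narrow beam and widen it only when no result is found. This is a schematic sketch under stated assumptions. The callback `search` stands in for a real chart generator or parser, the range 8 to 20 mirrors the "8-20" row, and the step size of 4 is an assumption that the table does not state.

```python
from typing import Callable, Optional, Tuple


def iterative_beam_search(search: Callable[[int], Optional[str]],
                          start_width: int = 8,
                          max_width: int = 20,
                          step: int = 4) -> Tuple[Optional[str], Optional[int]]:
    """Retry `search` with increasing beam widths until it returns a result.

    `search(width)` is a hypothetical callback that returns a result,
    or None if nothing survives pruning at that beam width.
    """
    width = start_width
    while width <= max_width:
        result = search(width)
        if result is not None:
            return result, width      # succeeded with this beam width
        width += step                 # failed: widen the beam and try again
    return None, None                 # no result even at the widest beam


# Toy stand-in for a generator that only succeeds with a beam of 16 or more.
tried = []

def toy_search(width: int) -> Optional[str]:
    tried.append(width)
    return "He bought the book." if width >= 16 else None

output, used_width = iterative_beam_search(toy_search)
```

Easy inputs finish quickly with the narrow beam and only hard inputs pay for the wide one, which is consistent with the iterative row combining higher coverage with a lower average time than the fixed widths of 12 and above.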


Semantics construction (1/2)

• Mapping from HPSG parse trees into semantic representations of typed dynamic logic (TDL)
  – Typed dynamic logic: a variant of dynamic semantics that includes plural semantics, event semantics, and situation semantics (Bekki, 2005)
  – Completely compositional semantics: lambda calculus composes semantic representations of phrases from lexical representations

Few boys fell. They died.

few(x)[boy’x][fall’x] ∧ ref(x)[die’x]


Semantics construction (2/2)

• Approach:
  – Mapping HPSG lexical entries into lexical representations of TDL
  – Semantic representations of phrases are composed along HPSG parse trees

• Coverage: around 90% of the sentences in Penn Treebank Section 23 are assigned well-formed semantic representations

HPSG lexical entry: [PHON “loves”, HEAD verb, SUBJ <HEAD noun>, COMPS <HEAD noun>]

TDL lexical representation: λobj.λsbj. sbj(λx. obj(λy. love’e ∧ agent’(e,x) ∧ theme’(e,y)))
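How a lexical representation of this kind composes with its arguments can be mimicked with ordinary closures. The sketch is a toy, not an implementation of TDL: it assumes the lexical representation of "loves" takes the object and then the subject as functions over properties, it builds a plain string instead of a logical form, and the helper `proper_name` and the names `john` and `mary` are hypothetical.

```python
from typing import Callable

# A property maps an individual (here a string) to a formula (here a string).
Property = Callable[[str], str]
# A noun phrase takes a property and returns a formula.
NounPhrase = Callable[[Property], str]


def proper_name(name: str) -> NounPhrase:
    """Assumed noun phrase meaning: apply the given property to the named individual."""
    return lambda prop: prop(name)


def loves(obj: NounPhrase) -> Callable[[NounPhrase], str]:
    """Transitive verb meaning: takes the object first, then the subject."""
    return lambda sbj: sbj(
        lambda x: obj(
            lambda y: f"love'(e) ∧ agent'(e,{x}) ∧ theme'(e,{y})"
        )
    )


john = proper_name("john")
mary = proper_name("mary")

# Composition follows the parse tree: the verb combines with its object,
# and the resulting verb phrase combines with the subject.
verb_phrase = loves(mary)
sentence = verb_phrase(john)
```

Each phrase is built only from the meanings of its daughters, which is what makes the mapping from parse trees to semantic representations compositional.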


Applications: information extraction

• Extraction of protein-protein interactions from biomedical paper abstracts
  – Patterns on predicate argument structures are learned from small annotated data
  – Precision/recall: 71.8%/48.4%

[Figure: precision vs. recall plot (both axes 0 to 1); legend: (Yakushiji 2005), (Ramani et al., 2005)]


Applications: text retrieval

• Retrieval of relational concepts
  – All sentences in MEDLINE are parsed into predicate argument structures
  – Relational concepts, such as “what causes cancer”, are retrieved by matching with predicate argument structures
  – Precision/recall: 60-96%/30-50%
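Matching a relational query against stored predicate argument structures can be sketched as below. This is a simplified illustration rather than the retrieval system described here: it assumes each structure is reduced to a flat (predicate, ARG1, ARG2) record with a sentence identifier, and the names `PAS` and `retrieve` as well as the sample data are hypothetical.

```python
from typing import List, NamedTuple, Optional


class PAS(NamedTuple):
    """Assumed flat record for one predicate argument structure."""
    predicate: str
    arg1: str
    arg2: str
    sentence_id: int


def retrieve(structures: List[PAS], predicate: str,
             arg1: Optional[str] = None,
             arg2: Optional[str] = None) -> List[PAS]:
    """Return structures matching the query; None acts as a wildcard argument."""
    return [
        s for s in structures
        if s.predicate == predicate
        and (arg1 is None or s.arg1 == arg1)
        and (arg2 is None or s.arg2 == arg2)
    ]


structures = [
    PAS("cause", "smoking", "cancer", 1),
    PAS("cause", "virus", "influenza", 2),
    PAS("treat", "drug", "cancer", 3),
]
# "What causes cancer?": the predicate and ARG2 are fixed, ARG1 is left open.
hits = retrieve(structures, "cause", arg2="cancer")
```

Unlike keyword search, the query distinguishes "X causes cancer" from sentences that merely mention both words, because the argument positions must match.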


Summary

• Conversion of Penn Treebank II into an HPSG treebank
  – Pattern-based tree conversion and principle application

• Extraction of lexical entries from the HPSG treebank
  – Generalization, application of inverse lexical rules, and assignment of predicate argument structures

• Probabilistic modeling of feature structures
  – Feature forest model

• Techniques for efficient parsing with probabilistic HPSG
  – Iterative beam thresholding

• Evaluation
  – Coverage and parsing accuracy

• Advanced topics
  – Domain adaptation, sentence generation, semantics construction, and practical applications


Publications

• Corpus-oriented development of HPSG
  – Y. Miyao, T. Ninomiya, and J. Tsujii. (2003). Lexicalized Grammar Acquisition. In Proc. 10th EACL Companion Volume.
  – Y. Miyao, T. Ninomiya, and J. Tsujii. (2004). Corpus-oriented grammar development for acquiring a Head-Driven Phrase Structure Grammar from the Penn Treebank. In Proc. IJCNLP 2004.
  – H. Nakanishi, Y. Miyao, and J. Tsujii. (2004). Using Inverse Lexical Rules to Acquire a Wide-coverage Lexicalized Grammar. In the IJCNLP 2004 Workshop on “Beyond Shallow Analyses.”
  – H. Nakanishi, Y. Miyao, and J. Tsujii. (2004). An Empirical Investigation of the Effect of Lexical Rules on Parsing with a Treebank Grammar. In Proc. TLT 2004.
  – K. Yoshida. (2005). Corpus-Oriented Development of Japanese HPSG Parsers. In 43rd ACL Student Research Workshop.


Publications

• Feature forest model
  – Y. Miyao and J. Tsujii. (2002). Maximum entropy estimation for feature forests. In Proc. HLT 2002.

• Probabilistic models for HPSG
  – Y. Miyao and J. Tsujii. (2003). A model of syntactic disambiguation based on lexicalized grammars. In Proc. 7th CoNLL.
  – Y. Miyao, T. Ninomiya, and J. Tsujii. (2003). Probabilistic modeling of argument structures including non-local dependencies. In Proc. RANLP 2003.
  – Y. Miyao and J. Tsujii. (2005). Probabilistic disambiguation models for wide-coverage HPSG parsing. In Proc. ACL 2005.
  – T. Ninomiya, T. Matsuzaki, Y. Tsuruoka, Y. Miyao, and J. Tsujii. (2006). Extremely Lexicalized Models for Accurate and Fast HPSG Parsing. In Proc. EMNLP 2006.


Publications

• Parsing strategies for probabilistic HPSG
  – Y. Tsuruoka, Y. Miyao, and J. Tsujii. (2004). Towards efficient probabilistic HPSG parsing: integrating semantic and syntactic preference to guide the parsing. In the IJCNLP-04 Workshop on “Beyond shallow analyses.”
  – T. Ninomiya, Y. Tsuruoka, Y. Miyao, and J. Tsujii. (2005). Efficacy of Beam Thresholding, Unification Filtering and Hybrid Parsing in Probabilistic HPSG Parsing. In Proc. IWPT 2005.
  – T. Ninomiya, Y. Tsuruoka, Y. Miyao, K. Taura, and J. Tsujii. (2006). Fast and Scalable HPSG Parsing. Traitement automatique des langues (TAL). 46(2).

• Domain adaptation
  – T. Hara, Y. Miyao, and J. Tsujii. (2005). Adapting a probabilistic disambiguation model of an HPSG parser to a new domain. In Proc. IJCNLP 2005.


Publications

• Generation
  – H. Nakanishi, Y. Miyao, and J. Tsujii. (2005). Probabilistic models for disambiguation of an HPSG-based chart generator. In Proc. IWPT 2005.

• Semantics construction
  – M. Sato, D. Bekki, Y. Miyao, and J. Tsujii. (2006). Translating HPSG-style Outputs of a Robust Parser into Typed Dynamic Logic. In Proc. COLING-ACL 2006 Poster Session.

• Applications
  – Y. Miyao, T. Ohta, K. Masuda, Y. Tsuruoka, K. Yoshida, T. Ninomiya, and J. Tsujii. (2006). Semantic Retrieval for the Accurate Identification of Relational Concepts. In Proc. COLING-ACL 2006.
  – A. Yakushiji, Y. Miyao, T. Ohta, Y. Tateisi, and J. Tsujii. (2006). Automatic Construction of Predicate-Argument Structure Patterns for Biomedical Information Extraction. In EMNLP 2006 Poster Session.


Comparing LFG, CCG, HPSG and TAG Acquisition



Demos



Future Work & Discussion
