Treebank-Based Acquisition of Multilingual LFG Resources for Parsing, Generation and Transfer

Josef van Genabith, National Centre for Language Technology (NCLT), Dublin City University, Ireland

Treebank Workshop NAACL 2007

Lexical-Functional Grammar (LFG)

• “Shallow” grammar: defines language (set of strings)

• “Deep” grammar: as above + maps strings to “meaning” representation: predicate-argument structure, dependencies, simple logical form …, usually involves some form of long-distance dependency (LDD) resolution

• Deep grammars (HPSG, LFG, CCG, TAG …) usually hand-crafted

• Very difficult & expensive to scale to unrestricted text

• Motivation for treebank-based deep grammar acquisition (LFG/CCG/HPSG/TAG/DepGr/…)!!

• LFG: [Kaplan and Bresnan, 82; Dalrymple, 2001; Bresnan, 2001]

• Constraint-based (“unification”), lexicalised

• c(onstituent)-str & f(unctional)-str

• c-str: surface configuration (CFG trees)

• f-str: abstract grammatical functions/relations (SUBJ, OBJ, OBL, COMP, XCOMP, ADJN, POSS, APP, …)

• f-str: AVM (feature-structure) encoding of dependencies/pred-arg structure
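As a rough illustration of an f-structure as an AVM, the sketch below encodes one as nested Python dictionaries and reads the dependencies off it. The sentence (“Mary saw him”), the feature names TENSE, NUM and CASE, and the helper `dependencies` are assumptions made for this example, not part of the slides.

```python
# Hypothetical sketch: an f-structure as a nested attribute-value matrix (AVM).
# Only SUBJ and OBJ come from the slide's list of grammatical functions;
# the other features and all values are illustrative.

def dependencies(fstr):
    """Return (head PRED, grammatical function, dependent PRED) triples."""
    triples = []
    for attr, value in fstr.items():
        # Set-valued functions (e.g. ADJN) would be lists of sub-f-structures.
        subs = value if isinstance(value, list) else [value]
        for sub in subs:
            if isinstance(sub, dict):  # atomic values such as TENSE are skipped
                triples.append((fstr["PRED"], attr, sub["PRED"]))
                triples.extend(dependencies(sub))
    return triples


# Assumed f-structure for "Mary saw him".
f_structure = {
    "PRED": "see",
    "TENSE": "past",
    "SUBJ": {"PRED": "Mary", "NUM": "sg"},
    "OBJ": {"PRED": "pro", "CASE": "acc", "NUM": "sg"},
}

for triple in dependencies(f_structure):
    print(triple)
```

The nested dictionaries play the role of the AVM; the extracted triples are the dependency view of the same information.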

Lexical-Functional Grammar LFG

• Treebank: trees

• How do we get from trees to f-structures?

• What’s missing is the equations!

• Automatic f-structure annotation algorithm

• Traverses tree and assigns LFG equations

• Principle-based c-str/f-str interface

F-Structure Annotation Algorithm

• Algorithm exploits:

– Categorial information (NP, VP, VBZ, …)

– Configurational information:

  • Local head, left/right of head

  • Leftmost NP sister to right of V(erbal) head: (↑ OBJ)=↓

– Morphological information:

  • Him: (↑ OBJ)=↓

– “Functional” tag information:

  • -LGS (↑ PASSIVE)=+, -SBJ, -CLR, …

– Trace/co-indexation information

  • Translate traces + co-indexation to corresponding re-entrancies at f-str.
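The kinds of information listed above can be combined in a single tree traversal. The following is a minimal sketch under stated assumptions, not the actual annotation algorithm: the `Node` class, the toy `HEADS` table, the three rules and the catch-all ADJN default are all hypothetical, and equations are written in standard LFG arrow notation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                  # e.g. "NP-SBJ", "VP", "VBD"
    children: list = field(default_factory=list)
    word: str = ""
    equation: str = ""                          # filled in by annotate()


# Toy head table (assumption): candidate head categories per mother category.
HEADS = {
    "S": ["VP"],
    "VP": ["VBZ", "VBD", "VB", "VP"],
    "NP": ["NN", "NNP", "PRP", "NP"],
}


def category(label):
    """Strip functional tags: 'NP-SBJ' -> 'NP'."""
    return label.split("-")[0]


def find_head(node):
    """Index of the head daughter; falls back to the rightmost daughter."""
    for cat in HEADS.get(category(node.label), []):
        for i, child in enumerate(node.children):
            if category(child.label) == cat:
                return i
    return len(node.children) - 1


def annotate(node):
    """Assign an LFG equation to every daughter, top-down."""
    if not node.children:
        return
    head = find_head(node)
    obj_assigned = False
    for i, child in enumerate(node.children):
        tags = child.label.split("-")[1:]
        if i == head:
            child.equation = "↑=↓"              # categorial: local head
        elif "SBJ" in tags:
            child.equation = "(↑ SUBJ)=↓"       # "functional" tag
        elif (category(node.label) == "VP" and category(child.label) == "NP"
              and i > head and not obj_assigned):
            child.equation = "(↑ OBJ)=↓"        # leftmost NP right of verbal head
            obj_assigned = True
        else:
            child.equation = "↓∈(↑ ADJN)"       # assumed catch-all default
        annotate(child)


# Usage: (S (NP-SBJ (NNP Mary)) (VP (VBD saw) (NP (PRP him))))
vp = Node("VP", [Node("VBD", word="saw"), Node("NP", [Node("PRP", word="him")])])
s = Node("S", [Node("NP-SBJ", [Node("NNP", word="Mary")]), vp])
annotate(s)
print(s.children[0].equation, vp.equation, vp.children[1].equation)
```

Morphological cues and trace/co-indexation handling are left out here; they would add further rules in the same style.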

F-Structure Annotation Algorithm

Left-Right Context Annotation Principles

Coordination Annotation Principles

Catch-All and Clean-Up

Traces

Proto F-Structures

Proper F-Structures

Head-Lexicalization [Magerman,1994]

Lemmatization + Macros Lexical Entries

Defaults – “Functional Tags”

Treebank Annotation: Control & Wh-Rel. LDD

Multilingual Treebank-Based LFG Resources

• English + Penn-II: parsers (+ LDD resolution), generators, subcat-frame extraction, bootstrapping of new TB-resources (QuestionBank), transfer

• Pilots/proof of concept: multilingual treebank-based LFG acquisition:

– German: TIGER (Cahill et al 2003, 2005)

– Chinese: CTB (Burke et al 2004)

– Spanish: Cast3LB (O’Donovan et al 2005), (Chrupala and van Genabith 2006)

• GramLab Project (2005-2008): Chinese, Japanese, Arabic, Spanish, French and German

Multilingual Treebank-Based LFG Resources

Language   Treebank    Size              Coding/Data
English    Penn-II     50,000            CFG+traces+FT
Chinese    CTB 5.1     18,000            CFG+traces+FT
Japanese   KTC 4.0     38,000            Dep (+traces)
German     TIGER 2.0   50,000            Graphs+CFG+Dep
German     TüBa-D/Z    22,000            CFG+Dep+f-traces
Spanish    Cast3LB     3,500             CFG+Dep+f-traces
Arabic     ATB         300,000 (words)
French     P7T         20,000            CFG+Dep+f-traces
                       --------
                       > 200,000

Q2

• What was missing in TB resource?

– F-structures, pred-argument structure, dependencies => f-structure annotation algorithm

– Limited domain in Penn-II (most treebanks …) => bootstrap grammar and QuestionBank (4000 questions from TREC and CCG)

– GFs, active/passive, decl/interrog/imp, control, raising, LDDs, pro-drop, zero-anaphora, tense/aspect, …

• What was done by hand?

– F-structure annotation algorithm (principle-based c-/f-str interface)

– No restructuring, no clean-up of TB (unlike CCG/HPSG/TAG – but see P7T)

– No manual additions (unlike CCG/HPSG/TAG)

– Future work …

Q3

• Methodological Issues - Quality Assurance:

• Evaluation against hand-crafted/corrected Gold Standard DepBanks

– PARC 700

– CBS 500

– PropBank

– Own Gold standard DepBanks for: English, Chinese, Japanese, German, Arabic, Spanish, French (200-500)

• CCG-style evaluation against automatically annotated Gold (Silver-) Standard DepBanks based on WSJ Sec. 23 trees (CCG, HPSG)

• Quality of annotation process and parsing resources: treebank-based LFG parsing statistically significantly outperforms XLE and RASP (PARC 700 & CBS 500)
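Evaluation against a gold-standard DepBank is commonly scored as precision, recall and f-score over dependency triples. The sketch below assumes that setup; the function name `triple_scores` and the example triples are invented for illustration and are not results from the slides.

```python
def triple_scores(gold, test):
    """Precision, recall and f-score over sets of dependency triples."""
    gold, test = set(gold), set(test)
    correct = len(gold & test)
    precision = correct / len(test) if test else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Invented example: (relation, head, dependent) triples for one sentence.
gold = [
    ("SUBJ", "see", "Mary"),
    ("OBJ", "see", "pro"),
    ("TENSE", "see", "past"),
    ("CASE", "pro", "acc"),
]
parsed = [
    ("SUBJ", "see", "Mary"),
    ("OBJ", "see", "pro"),
    ("TENSE", "see", "past"),
    ("CASE", "pro", "nom"),   # one wrong triple
]

p, r, f = triple_scores(gold, parsed)
print(f"precision={p:.2f} recall={r:.2f} f-score={f:.2f}")
```

Scoring triples rather than whole structures gives partial credit when a parse is mostly right.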

Q4

• Phrase Structure or Dependencies?

• Both!!! Why?:

• Phrase Structure good for parsing and generation => tap into lots of mature, efficient & well understood technology (but see dependency parsing)

• Dependencies close to f-structure/predicate-argument structures …

– Penn-II: CFG-trees + traces/co-indexation + “functional” labels/tags

– TIGER: graphs + CFG-categories + grammatical function labels + LDDs through crossing edges

– Cast3LB/P7T/TüBa-D/Z: CFG trees + grammatical function labels + LDDs through GF paths

Q5 & Q6

• Pros/Cons Formalism-Specific Treebank?

– Formalism-Specific Treebank? Bad! Limits usefulness/user group/…

– Better to have generic TB with CFG + Dep Label + LDDs + other feature labels (as required). And then extract LFG/HPSG/CCG/TAG/Dependency Grammars

• Grammar First vs. Treebank First?

– Depends on what you want to do …

– If you want high-quality, wide-coverage resources (that can parse unrestricted text) then it's definitely better to do treebanking first (or use bootstrapping)

– Problem: many traditionally trained linguists see TreeBanking as menial task

– Yet it is a highly skilled and interesting task: empirical linguistics, i.e. confront rather than invent data

– Sociological task: how to make treebanking/bootstrapping sexy?

Some Resources

• ESSLLI 2006 course material: Treebank-Based Acquisition of LFG, HPSG and CCG Resources. J. van Genabith, Y. Miyao and J. Hockenmaier

• http://www.computing.dcu.ie/~josef/Malaga06.ppt

• LFG parser demo:

• http://lfg-demo.computing.dcu.ie/lfgparser.html

• A. Cahill and J. van Genabith, Robust PCFG-Based Generation using Automatically Acquired LFG-Approximations, COLING/ACL 2006, Sydney, Australia

• J. Judge, A. Cahill and J. van Genabith, QuestionBank: Creating a Corpus of Parse-Annotated Questions, COLING/ACL 2006, Sydney, Australia

• R. O'Donovan, M. Burke, A. Cahill, J. van Genabith and A. Way. Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II and Penn-III Treebanks, Computational Linguistics, 2005

• A. Cahill, M. Forst, M. Burke, M. McCarthy, R. O'Donovan, C. Rohrer, J. van Genabith and A. Way. Treebank-Based Acquisition of Multilingual Unification Grammar Resources; Journal of Research on Language and Computation; Kluwer Academic Press, 2005

• R. O'Donovan, A. Cahill, J. van Genabith, and A. Way. Automatic Acquisition of Spanish LFG Resources from the CAST3LB Treebank; In Proceedings of the Tenth International Conference on LFG, Bergen, Norway, 2005

• M. Burke, O. Lam, A. Cahill, R. Chan, R. O'Donovan, A. Bodomo, J. van Genabith and A. Way; Treebank-Based Acquisition of a Chinese Lexical-Functional Grammar; Proceedings of the PACLING-18 Conference, Waseda University, Tokyo, Japan, pages 161-172, 2004

• A. Cahill, M. Burke, R. O'Donovan, J. van Genabith, and A. Way. Long-Distance Dependency Resolution in Automatically Acquired Wide-Coverage PCFG-Based LFG Approximations, In Proceedings of ACL-04, pp. 320-7, Barcelona, Spain, 2004

• A. Cahill, M. McCarthy, J. van Genabith and A. Way. Parsing with PCFGs and Automatic F-Structure Annotation, In M. Butt and T. Holloway-King (eds.): LFG’02, Athens, Greece, CSLI Publications, Stanford, CA, pp. 76-95, 2002