The Prague (Czech-)English Dependency Treebank
Jan Hajič
Charles University in Prague
Computer Science School
Institute of Formal and Applied Linguistics
Major contributions by:
E: Silvie Cinková, Jana Šindlerová, Josef Toman, (J. Semecký)
C: Marie Mikulová, Zdeňka Urešová, Jan Štěpánek
June 8, 2009 Dependency Workshop Boulder, CO Czech-English Dependency Treebank
Today...
• The family of Prague Dependency Treebanks
  – Incl. the Prague (Czech-)English Dependency Treebank
• English “Tectogrammatical Representation” (TR)
  – Annotation layers
  – From Penn Treebank (et al.) to PDT-style English tectogrammatics
  – TR annotation of 5 interesting English phenomena
• The annotation process
  – TrEd, EngVallex and the current status
• To take home + pointers
The Family of Prague Dependency Treebanks
• Prague Dependency Treebank (Czech)
  – 2001: version 1.0 (no deep syntax/semantics)
  – 2006: version 2.0 (w/ deep syntax, semantics)
• Prague Czech-English Dependency TB 1.0
  – 2004: automatic annotation
  – English: PTB, Czech: 1/3 of PTB translated
• Prague Arabic Dependency Treebank 1.0
  – 2004: ~ PDT 1.0 (no deep syntax)
The Prague Czech-English Dependency Treebank
• Penn Treebank
  + PropBank
  + BBN (co-reference and Named Entities)
  + NP structure (D. Vadas, J. R. Curran, ACL’07)
  + “Czech-like” tectogrammatics
• Translation to Czech
  – Manual annotation (with auto pre-annotation)
• Morphology, Syntax, Tectogrammatics (TR)
Example: English TR
• Words
• Dependencies
• Sem. function
• Valency (predicates)
• Coref (BBN)
• Named Entities (BBN)
Layers of Annotation
• t-layer
  – tectogrammatics
• a-layer
  – (surface) syntax
• m-layer
  – Morphology (POS)
• w-layer
  – words (tokens)
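A minimal sketch of how the four layers could be linked in code. All class and field names here (`WNode`, `w_ref`, `a_refs`, and so on) are illustrative assumptions and not the treebank's actual schema; the toy values (`AuxV`, `Pred`, `PRED`) are only placeholders for the kind of labels each layer carries. The example also assumes that an auxiliary such as "will" has no tectogrammatical node of its own and is instead referenced from the node of its main verb.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class WNode:
    """w-layer: one token of the original text."""
    id: str
    token: str


@dataclass
class MNode:
    """m-layer: lemma and POS tag, pointing at one w-layer token."""
    id: str
    w_ref: str
    lemma: str
    tag: str


@dataclass
class ANode:
    """a-layer: surface-syntax node, pointing at one m-layer node."""
    id: str
    m_ref: str
    afun: str
    parent: Optional[str] = None


@dataclass
class TNode:
    """t-layer: tectogrammatical node; may cover several a-layer nodes."""
    id: str
    t_lemma: str
    functor: str
    a_refs: List[str] = field(default_factory=list)
    parent: Optional[str] = None


def surface_tokens(t: TNode, a: Dict[str, ANode],
                   m: Dict[str, MNode], w: Dict[str, WNode]) -> List[str]:
    """Follow the links t -> a -> m -> w and return the covered tokens."""
    return [w[m[a[ref].m_ref].w_ref].token for ref in t.a_refs]


# Toy data for "will join" (hypothetical ids and labels).
w = {"w1": WNode("w1", "will"), "w2": WNode("w2", "join")}
m = {"m1": MNode("m1", "w1", "will", "MD"),
     "m2": MNode("m2", "w2", "join", "VB")}
a = {"a1": ANode("a1", "m1", "AuxV", parent="a2"),
     "a2": ANode("a2", "m2", "Pred")}
t_join = TNode("t1", "join", "PRED", a_refs=["a1", "a2"])

tokens = surface_tokens(t_join, a, m, w)
print(tokens)
```

Keeping each layer as its own table of nodes, joined only by id references, mirrors the idea that every layer carries independent information while staying navigable from the deepest layer down to the tokens.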
English Surface Syntax
• From PTB:
  – Form
  – POS Tag
  – Function label
  – (Structure)
• Added
  – Lemma
  – Heads
Head Determination Rules
• Exhaustive set of rules
  – By J. Eisner + M. Čmejrek/J. Cuřín
  – 4000 rules (non-terminal based)
    • Ex.: (S (NP-SBJ VP .)) → VP
  – Additional rules
    • Coordination, Apposition
    • Punctuation (end-of-sentence, internal)
• Original idea (possibility of conversion)
  – J. Robinson (1960s)
Example: Head Determination Rules
[Figure: phrase-structure tree with the lexical head marked at each node: (board), (board)(the), (join), (will) (join), (join), (join)]
Rules:
(NP (DT NN)) → NN
(VP (VB NP)) → VB
(VP (MD VP)) → VP
(S (… VP …)) → VP
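The four rules on this slide can be applied mechanically to turn a bracketed tree into dependency pairs. The sketch below is only an illustration under stated assumptions: the nested-tuple tree encoding, the `RULES` table layout and the function names are hypothetical, the input fragment is assembled from the words shown in the figure, and words are used as node identifiers, which would break for sentences with repeated words. It is not the actual rule set used for the treebank.

```python
# Each rule maps (parent label, child labels) to the index of the head child.
RULES = {
    ("NP", ("DT", "NN")): 1,   # (NP (DT NN)) → NN
    ("VP", ("VB", "NP")): 0,   # (VP (VB NP)) → VB
    ("VP", ("MD", "VP")): 1,   # (VP (MD VP)) → VP
}


def head_index(label, child_labels):
    """Return the position of the head child for one local tree."""
    if (label, child_labels) in RULES:
        return RULES[(label, child_labels)]
    if label == "S" and "VP" in child_labels:   # (S (… VP …)) → VP
        return child_labels.index("VP")
    raise KeyError(f"no head rule for ({label} {' '.join(child_labels)})")


def to_dependencies(tree, deps):
    """Return the lexical head of `tree`; append (dependent, head) pairs to `deps`."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return children[0]                      # preterminal: (POS word)
    heads = [to_dependencies(child, deps) for child in children]
    h = head_index(label, tuple(child[0] for child in children))
    for i, word in enumerate(heads):
        if i != h:
            deps.append((word, heads[h]))       # non-head children depend on the head
    return heads[h]


# Hypothetical input fragment built from the words in the figure.
TREE = ("S",
        ("VP", ("MD", "will"),
               ("VP", ("VB", "join"),
                      ("NP", ("DT", "the"), ("NN", "board")))))

deps = []
root = to_dependencies(TREE, deps)
print(root, deps)
```

Heads percolate bottom-up, so each non-head child ends up attached to the lexical head of its sibling, which is the same (word)(head) pairing the figure annotates on the tree nodes.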
Conversion: Analytic Structure, Functions
• Syntactic Function assignment (conversion)
• Rules
  – based on PTB functional tags:
      -SBJ → Sb    -PRD → Pnom  -BNF → Obj   -DTV → Obj
      -LGS → Obj   -ADV → Adv   -DIR → Adv   -EXT → Adv
      -LOC → Adv   -MNR → Adv   -PRP → Adv   -PUT → Adv
      -TMP → Adv
  – Ad-hoc rules (if functional tags missing)
  – Lemmatization (years → year)
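The tag-to-function table above amounts to a simple lookup. The following sketch assumes PTB node labels of the form `CAT-TAG-…` (for example `NP-SBJ-1`); the function name, the choice to take the first matching tag, and returning `None` when no functional tag is present are assumptions made for illustration, since the slide does not spell out the ad-hoc fallback rules.

```python
from typing import Optional

# PTB functional tag -> PDT-style analytic function, as listed on the slide.
AFUN_BY_TAG = {
    "SBJ": "Sb", "PRD": "Pnom",
    "BNF": "Obj", "DTV": "Obj", "LGS": "Obj",
    "ADV": "Adv", "DIR": "Adv", "EXT": "Adv", "LOC": "Adv",
    "MNR": "Adv", "PRP": "Adv", "PUT": "Adv", "TMP": "Adv",
}


def afun_from_label(label: str) -> Optional[str]:
    """Map a PTB node label such as 'NP-SBJ-1' to an analytic function.

    Returns None when the label carries no known functional tag; the
    ad-hoc rules that would apply in that case are not modeled here.
    """
    for part in label.split("-")[1:]:   # skip the syntactic category itself
        if part in AFUN_BY_TAG:
            return AFUN_BY_TAG[part]    # assumption: the first matching tag wins
    return None


for example in ("NP-SBJ", "PP-LOC", "NP-SBJ-1", "NP"):
    print(example, "->", afun_from_label(example))
```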
Syntactic Structure, Functions: PTB to P(E)DT
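[Figure: three trees for the same fragment, left to right: Penn Treebank structure (with heads added) → PDT-like Analytic Representation → PDT-like Tectogrammatic Representation (automatic pre-annotation). Heads marked in the Penn Treebank tree: (board), (board)(the), (join), (will) (join), (join), (join). Tectogrammatical labels shown: PRED.Fut, PAT.]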
English TR I: Predicative Complement
• Free (non-valency) modification (of both a noun and a verb)
• attribute compl.rf (green arrow to the noun)
English TR II: Which + Relative Clause
We have not answered your question completely, for which we apologize.
English TR III: Coordination
English TR IV: Comparison
English TR V: Restriction (“Exclusion”)
except, with the exception of, excluding, (all/none) but, beyond, apart from, unless, bar, barring, besides
English TR: (manual) annotation
• TrEd
  – Pre-annotated
  – Graphical
    • TR dep. tree is primary
  – Text + TR
  – Czech translation
• Valency (a.k.a. “propbanking”)
  – During TR annotation
  – Propbank origins and examples
    • Linked, displayed
EngVallex (give)
EngVallex Format (admit)
Interannotator Agreement
2007-2009:
- New annotators (lower numbers)
- Annotation “by phenomenon”
- Restarting now
Prague English Dependency Treebank
• Availability
  – Version 1.0 now (PTB license needed)
    • 250k words
  – Full version (parallel with Czech): late 2010
• Size
  – Full WSJ portion of PTB (2312 files)
  – 49208 sentences, 1253013 tokens
  – Now:
    • 17210 sentences (34.97%), 439983 tokens (35.11%)
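The coverage percentages follow directly from the counts on this slide. A short check, with variable names chosen here only for illustration:

```python
# Counts taken from the slide.
total_sentences, total_tokens = 49208, 1253013
done_sentences, done_tokens = 17210, 439983

# Share of the full WSJ portion annotated so far, in percent.
sentence_pct = round(100 * done_sentences / total_sentences, 2)
token_pct = round(100 * done_tokens / total_tokens, 2)

print(f"{sentence_pct:.2f}% of sentences, {token_pct:.2f}% of tokens")
```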
Czech PDT-style Annotation
• All layers
  – morphology, syntax, tectogrammatical
• So far…
  – Automatic (many tools by many authors)
• Manual annotation
  – In progress (28124 sentences/639326 words)
  – Top-down
    • Tectogrammatical first (lower layers automatically)
    • … then syntactic structure and morphology
Summary
• PDT is/has (a)…
  – (Family of) dependency-based treebanking project(s)
    • Czech (English, Arabic, ...)
  – ~1 million words
    • sufficient size for ML experiments
  – 4 interlinked layers of annotation
    • (token, morphology, syntax, deep syntax/semantics++)
    • independent and “full” information at all levels
    • interlinked (for the development of parsers/generators)
  – Parallel corpus Cze <-> Eng -> Machine Translation
Pointers, Acknowledgements
• http://ufal.mff.cuni.cz/pedt
• http://ufal.mff.cuni.cz/pdt2.0
• http://ufal.mff.cuni.cz/~pajas/tred
• Acknowledgements
  – FP6-IST “Euromatrix”, FP7-IST “Euromatrix+”
  – LC536 (Center for Computational Linguistics)
  – GAČR 405/06/0589 (Speech and deep syntax)
  – MŠMT: MSM0021620838, ME838, ME09008