Treebanks and MWEs (Part 1) Jan Hajič, Pavel Straňák, Jiří Mírovský Institute of Formal and...
-
Upload
alison-palmer -
Category
Documents
-
view
218 -
download
0
Transcript of Treebanks and MWEs (Part 1) Jan Hajič, Pavel Straňák, Jiří Mírovský Institute of Formal and...
Treebanks and MWEs(Part 1)
Jan Hajič, Pavel Straňák, Jiří MírovskýInstitute of Formal and Applied Linguistics & LINDAT/CLARIN
School of Computer ScienceFaculty of Mathematics and Physics
Charles University in PragueCzech Republic
19.1.2015 PARSEME Training School Prague 2
Outline
• Treebanks– Phrase-(Constituency-) based: The Penn Treebank– Dependency: The Prague Dependency Treebanks
• The Penn Treebank (basics)• The Prague Dependency Treebank
– Layers of Annotation• Morphology• Syntax• Semantics• Valency
19.1.2015 PARSEME Training School Prague 4
Phrase- vs. Dependency-Based Treebanks
• The original: The Penn Treebank– Phrase-based style; good for parsing by CFG grammars
• Followers– Almost all Penn-based treebanks
• Chinese, Arabic, Korean, …
– Negra (German), many others
• Now: dependency parsing prevails• Conversion from phrase-based treebanks
– Might lose information, heads added „ad hoc“
• “native” dependency treebanks: annotated as such– Considered “better”– Hindi/Urdu, TIGER (sort of); both styles manually annotated– PDT (of course) and similar ones
» PDT style treebanks: Danish, Croatian, Slovene, Greek, Latin
The Penn Treebank
• Published (first) in 1993, now LDC99T42 (www.ldc.upenn.edu)– First the Wall Street Journal part (1 mil. words, 2312 documents)
• Added other text types– ATIS corpus (dialogs, travel reservations)
– Brown corpus annotated for syntax
– Switchboard (spoken language, tel. conversations)
19.1.2015 PARSEME Training School Prague 5
19.1.2015 PARSEME Training School Prague 6
Penn Treebank Format
• ( (S • (NP-SBJ • (NP (NNP Pierre) (NNP Vinken) )• (, ,) • (ADJP • (NP (CD 61) (NNS years) )• (JJ old) )• (, ,) )• (VP (MD will) • (VP (VB join) • (NP (DT the) (NN board) )• (PP-CLR (IN as) • (NP (DT a) (JJ nonexecutive) (NN director) ))• (NP-TMP (NNP Nov.) (CD 29) )))• (. .) ))
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
POS tag (NNS)(noun, plural)
Phrase label (NP)
Noun Phrase
“Preterminal”
The Penn Treebank(s)
• Extensions– Annotation of named entities, co-reference (BBN)
– cf. also previous slides
– Function labels (SBJ, OBJ, TMP, ...)– PropBank
• Penn Treebank syntax + Predicate-argument relations, added “frame files” (predicate dictionary)
(S (NP-SBJ (PRP I Arg0) VP (VBD gave Pred) (NP-DOBJ (PRP him Arg1) (NP-IOBJ (DET the) (NN book Arg2))) ... )
– NomBank• Like PropBank, but for nouns and their “arguments”
• Other languages (Chinese, Arabic, ...)
19.1.2015 PARSEME Training School Prague 7
19.1.2015 PARSEME Training School Prague 9
The Prague Dependency Treebanks: the Basics
• Original Treebank: PDT 1.0, 2001 (morf., dep. syntax)• First full release: PDT 2.0
– http://ufal.mff.cuni.cz/pdt2.0 • LDC2006T01, see http://www.ldc.upenn.edu
– Now: PDT 3.0: http://ufal.mff.cuni.cz/pdt3.0 • Basic general features
– Multilayered annotation, interlinked layers– Dependency-based syntax (both surface and deep)– Information structure of the sentence (topic/focus)– Grammatical and basic textual coreference– New: discourse relations, MWEs
• Languages: Czech, English (also parallel), Arabic– Student work on “samples”: Indonesian, Urdu, Russian, …– Spoken: work started on Czech and English (non-parallel, dialogs)
19.1.2015 PARSEME Training School Prague 10
The Prague Dependency
Treebank
• Three basic layers of annotation– Morphemic layer– Surface syntax (“analytical”) layer– “Tectogrammatical” layer:
underlying syntax, semantic roles (valency), inf. structure, coreference
• Size– 830,000 words (tokens) = 50000 sentences in 3165 full
documents (texts)• Format
– Prague Markup Language (XML-based)
– Now also: .treex format• For smooth uise in the TreeX platform• http://ufal.mff.cuni.cz/treex
19.1.2015 PARSEME Training School Prague 11
PDT (Czech) Data
• 4 sources:– Lidové noviny (daily newspaper, incl. extra sections)– DNES (Mladá fronta Dnes) (daily newspaper)– Vesmír (popular science magazine, monthly)– Českomoravský Profit (economical journal, weekly)
• Full articles selected– article ~ DOCUMENT (basic corpus unit)
• Time period: 1990-1995
• 1.8 million tokens (~110,000 sentences total)
19.1.2015 PARSEME Training School Prague 12
PDT Annotation Layers
• L0 (w) Words (tokens)– automatic segmentation and markup only
• L1 (m) Morphology– Tag (full morphology, 13 categories), lemma
• L2 (a) Analytical layer (surface syntax)– Dependency, analytical dependency function
• L3 (t) Tectogrammatical layer (“deep” syntax)– Dependency, “functor”, grammatemes, ellipsis
solution, coreference, topic/focus (deep word order), valency lexicon; PDT 3.0PDT 3.0: mass, clauses, formemes, discourse, ...
PD
T 1
.0
(200
1)
PD
T 2
.0
(200
6)
19.1.2015 PARSEME Training School Prague 13
PDT Annotation Layers
• L0 (w) Words (tokens)– automatic segmentation and markup only
• L1 (m) Morphology– Tag (full morphology, 13 categories), lemma
• L2 (a) Analytical layer (surface syntax)– Dependency, analytical dependency function
• L3 (t) Tectogrammatical layer (“deep” syntax)– Dependency, functor (detailed), grammatemes,
ellipsis solution, coreference, topic/focus (deep word order), valency lexicon
19.1.2015 PARSEME Training School Prague 14
Morphological Attributes
Tag: 13 categories Example: AAFP3----3N---- Adjective no poss. Gender negated Regular no poss. Number no voice Feminine no person reserve1 Plural no tense reserve2 Dative superlative base var.
Lemma: POS-unique identifierBooks/verb -> book-1, went -> go, to/prep. -> to-1
Ex.: nejnezajímavějším“(to) the most uninteresting”
19.1.2015 PARSEME Training School Prague 15
PDT Annotation Layers
• L0 (w) Words (tokens)– automatic segmentation and markup only
• L1 (m) Morphology– Tag (full morphology, 13 categories), lemma
• L2 (a) Analytical layer (surface syntax)– Dependency, analytical dependency function
• L3 (t) Tectogrammatical layer (“deep” syntax)– Dependency, functor (detailed), grammatemes,
ellipsis solution, coreference, topic/focus (deep word order), valency lexicon
19.1.2015 PARSEME Training School Prague 16
Layer 2 (a-layer): Analytical Syntax
• Dependency + Analytical Function
dependent
governor
The influence of the Mexicancrisis on Central and EasternEurope has apparently been underestimated.
19.1.2015 PARSEME Training School Prague 17
Analytical Syntax: Functions
• Main (for [main] semantic lexemes):• Pred, Sb, Obj, Adv, Atr, Atv(V), AuxV, Pnom• “Double” dependency: AtrAdv, AtrObj, AtrAtr
• Special (function words, punctuation,...):• Reflexives, particles: AuxT, AuxR, AuxO, AuxZ, AuxY• Prepositions/Conjunctions: AuxP, AuxC• Punctuation, Graphics: AuxX, AuxS, AuxG, AuxK
• Structural• Elipsis: ExD, Coordination etc.: Coord, Apos
PDT Annotation Layers
• L0 (w) Words (tokens)– automatic segmentation and markup only
• L1 (m) Morphology– Tag (full morphology, 13 categories), lemma
• L2 (a) Analytical layer (surface syntax)– Dependency, analytical dependency function
• L3 (t) Tectogrammatical layer (“deep” syntax)– Dependency, functor (detailed), grammatemes,
ellipsis solution, coreference, topic/focus (deep word order), valency lexicon
19.1.2015 PARSEME Training School Prague 18
Tectogrammatical Annotation
• Underlying (deep) syntax
• 5 sublayers (integrated and/or standoff annotation):– dependency structure, (detailed) functors
• valency annotation
– topic/focus and deep word order
– coreference (mostly grammatical only)
– discourse
– all the rest (grammatemes): • detailed functors• underlying gender, number, mass nouns, ...
• Total: 39 attributes (vs. 5 at m-layer, 2 at a-layer)
19.1.2015 PARSEME Training School Prague 19
19.1.2015 PARSEME Training School Prague 20
Tectogrammatical vs. analytical syntax
TR: Nofunction words
AR: All words
Predicate verb
“Location”
In practice, that procedure will require making of certified copies.
Re-inserted elided actorof “making”
Dependency Structure
• Similar to the surface (Analytical) layer... ...but:– certain nodes deleted
• auxiliaries, non-autosemantic words, punctuation• (some) multiword expressions -> 1 node
– some nodes added• based on word (mostly verb, noun) valency• some ellipsis resolution
– detailed dependency relation labels (functors)
19.1.2015 PARSEME Training School Prague 21
Tectogrammatical Functors
• “Actants”: ACT, PAT, EFF, ADDR, ORIG
– modify: verbs, nouns, adjectives
– cannot repeat in a clause, usually obligatory
• Free modifications (~ 50), semantically defined– can repeat; optional, sometimes obligatory– Ex.: LOC, DIR1, ...; TWHEN, TTILL,...; RSTR; BEN, ATT, ACMP,
INTT, MANN; MAT, APP; ID, DPHR, CPHR, ...
• Special– Coordination, Rhematizers, Foreign phrases (#Forn),...
19.1.2015 PARSEME Training School Prague 22
“syntactic” semantic
MW
Es
Deep Word Order Topic/Focus
• Example:
• Baker bakes rolls. vs. BakerIC bakes rolls.
19.1.2015 PARSEME Training School Prague 23
Analyticaldep. tree:
Deep Word OrderTopic/Focus
• Deep word order:– from “old” information to the “new” one (left-to-
right) at every level (head included)– projectivity by definition (almost...)
• i.e., partial level-based order -> total d.w.o.
• Topic/focus/contrastive topic– attribute of every node (t, f, c)– restricted by d.w.o. and other constraints
19.1.2015 PARSEME Training School Prague 24
Coreference
• Grammatical (easy)– relative clauses
• which, who– Peter and Paul, who ...
– control• infinitival constructions
– John promised to go home
– reflexive pronouns• {him,her,thme}self(-ves)
– Mary saw herself in ...
19.1.2015 PARSEME Training School Prague 25
Johngo
he home
promisePRED
ACTPAT
ACT DIR3
Coreference
• Textual– Ex.: Peter moved to Iowa after he finished his PhD.
19.1.2015 PARSEME Training School Prague 26
Peter Iowafinish
he PhD
movePRED
ACT DIR1TWHEN
ACT PAT
heAPP
Grammatemes
• Detailed functors (“subfunctors”)– needed for some functors:
• TWHEN: before/after• LOC: next-to, behind, in-front-of, ...• also: ACMP, BEN, CPR, DIR1, DIR2, DIR3, EXT
• Lexical (underlying)– number (Sg/Pl), tense, modality, degree of
comparison, mass-noun?; is_person_name, is_dsp_root, ...
19.1.2015 PARSEME Training School Prague 27
MW
Es
Prague Dependency Treebank & Valency
• Valency in the PDT– Valency lexicon for PDT– General valency lexicon
• Valency in deep vs. surface syntax– Links between the layers w.r.t. valency
• Valency and word sense– Sense-disambiguated occurrences:
• Links from data to the lexicon
• Valency in translation, text generation
19.1.2015 PARSEME Training School Prague 31
Definition of Valency
• Ability (“desire”) of words (verbs, nouns, adjectives) to combine themselves with other units of meaning
• Properties of valency:– Specific for every word meaning (in general)
• leave: sb left sth for sb vs. sb left from somewhere• similar to PropBank leave.02 vs. leave.01
– Typically strongly correlates with surface form (Czech)• morphological case (~ ending), preposition+case, ...)
– Semantic constraints
19.1.2015 PARSEME Training School Prague 32
Structure of Valency
• word (lemma) – word sense group 1
• valency frame:– slot1 slot2 slot3
• surface expression
– word sense group 2 • ...
19.1.2015 PARSEME Training School Prague 33
vyměnit (to replace) vyměnit1
ACT PAT EFF
Nom. Acc.za+Acc.
vyměnit2
...
PDT-Vallex Entry
• dosáhnout: “to reach”, “to get [sb to do sth]”
• browser/user-formatted example:
19.1.2015 PARSEME Training School Prague 34
MWEs in PDT-Vallex
• Types included:– Reflexive particle (se, si)
• smát se – to laugh• všimnout si – to notice
– Idiomatic constructions• dosáhnout svého - to achieve one’s goals• běhá mi mráz po zádech – to give me the shivers
– Light verb constructions (and similar)• uzavřit dohodu – to agree [on sth], strike an
agreement, ...• vzbuzovat pochybnosti – to doubt, to raise doubts
19.1.2015 PARSEME Training School Prague 35
smát
_se
(t_l
emm
a)
DP
HR
(ar
gum
ent)
CP
HR
(ar
gum
ent)
Corpus ↔ Valency Lexicon
19.1.2015 PARSEME Training School Prague 36
• Corpus:
ENTRY: uzavřít (to close) vf1: ACT(.1) CPHR({smlouva}.4)
ex: u. dohodu (close a contract)vf2: ACT(.1) PAT(.4)
ex.: u. pokoj (close a room, house)
Lexicon:
Sentence 2035: Sentence 15345: Sentence 51042:
Valency & Text Generation
19.1.2015 PARSEME Training School Prague 37
• Using valency for...– ...getting the correct (lemma, tag) of verb arguments
• Example:
starat_se
PRED
Martin
ACT
tygr
PAT
Martin
....1..........
starat
V..............
o
...............
tygr
....4..........
VALLEX entry: starat (se) ACT(.1) PAT(o.[.4])
se
...............
Martin se stará o tygry.
“Martin takes care of tigers.”
“to take care of”
“tiger”
19.1.2015 PARSEME Training School Prague 39
Parallel Czech-English Annotation
• English text → Czech text (human translation)• Czech side (goal): all layers manual annotation• English side (goal):
– Morphology and surface syntax: technical conversion• Penn Treebank style -> PDT Analytic layer
– Tectogrammatical annotation: manual annotation• (Slightly) different rules needed for English
• Alignment– Natural, sentence level only (now)
19.1.2015 PARSEME Training School Prague 41
English Annotation POS and Syntax
• Automatic conversion from Penn Treebank– PDT morphological layer
• From POS tags
– PDT analytic layer• From:
– Penn Treebank Syntactic Structure– Non-terminal labels– Function tags (non-terminal “suffixes”)
• 2-step process– Head determination rules– Conversion to dependency + analytic function
Czech-English Example
19.1.2015 PARSEME Training School Prague 46
Dicku Darmane, zavolejte do své kanceláře! Dick Darman, call your office!
19.1.2015 PARSEME Training School Prague 48
PDT Treebanks at UFAL (written language)
• Czech– Prague Dependency Treebank
• Complex annotation, all levels, additional annotation
– Translation of Penn Treebank• Tectogrammatical layer only, no t/f
– Analytical, morphology: automatic tool
• English– Re-annotation of Penn Treebank
• Other languages– Arabic (own annotation)– Other: by conversion (HamleDT – 30 treebanks)
Prague Dependency Treebanks
• Annotation:– 4 layers:
• Words, lemmas/tags, surface dep. syntax, tectogrammatics
– Tectogrammatical layer:• No function words, semantic relations• Valency/verb arguments (some MWE features)
– Separate valency lexicon, fully linked from PDT nodes
• Coreference, Topic/focus, Discourse • Links back to analytical layer (parsing!)
19.1.2015 PARSEME Training School Prague 49
Now also in the Universal Dependency format!
https://github.com/UniversalDependencies
19.1.2015 PARSEME Training School Prague 50
Pointers• PDT 2.0 (the “Original”), newest version: PDT 3.0
– http://ufal.mff.cuni.cz/pdt2.0– http://ufal.mff.cuni.cz/pdt3.0
• PCEDT– http://ufal.mff.cuni.cz/pcedt2.0/
• PEDT– English side of PCEDT, additional: NE, coreference– http://ufal.mff.cuni.cz/pedt2.0/
• PADT (Arabic, morphology + surface syntax)– http://ufal.mff.cuni.cz/padt
• Other corpora, PDT-Vallex, EngVallex:– Search at http://lindat.cz
• LDC catalog numbers:– LDC2006T01 (PDT 2.0), LDC2004T23 (PADT 1.0), LDC2004T25 (PEDT 1.0)
• CoNLL 2009 shared task (7 languages, surface syntax + predicate arguments only)
– http://ufal.mff.cuni.cz/conll2009-st • HamleDT 2.0 (30 treebanks in unified format)
– http://ufal.mff.cuni.cz/hamledt