Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Prague Dependency Treebank(s)

Workshop at LSA2011, Part I

Jan Hajič, Zdeňka Urešová

Institute of Formal and Applied Linguistics

School of Computer Science

Faculty of Mathematics and Physics

Charles University, Prague

Czech Republic

July 30, 2011 LSA2011 Prague Dependency Treebanks I 2

Part I - Text to Syntax

The Prague Dependency Treebank projects The theory behind it The Corpora Morphology

Dictionaries, tools (incl. POS tagger) Dependency surface syntax

Czech, Arabic, English Parallel annotated corpus

The Prague Dependency Treebank(s)

The idea Apply the “old” Prague theory to real-word texts Provide enough data for ML experiments

?“Old” Prague theory Prague structuralism (1930s) Stratificational approach Centered on “deep syntax”

Separated from “surface form” Dependency based (how else )

PDT:The Methodology

Manual annotation is PRIMARY Some help from existing tools possible

“No information loss, no redundancy” Much formalization, but… … original form always retrievable

Dictionaries In theory: “secondary”, side effect of annotation (generalization) In reality: help consistency Links: data → dictionary(-ies)

Result: used for Machine Learning Ergonomy of annotation

Graphical (“linguistic”) presentation & editing

The Prague Dependency Treebank Project: Czech Treebank

1995 (Dublin) 1996-2006-... 1998 PDT v. 0.5 released (JHU workshop)

400k words manually annotated, unchecked 2001 PDT 1.0 released (LDC):

1.3MW annotated, morphology & surface syntax 2006 PDT 2.0 release

0.8MW annotated (50k sentences) + PDT 1.0 corrected the “tectogrammatical layer”

underlying (deep) syntax

Related Projects (Treebanks)

Prague Czech-English Dependency Treebank WSJ portion of PTB, translated to Czech (1.2 mil. words)

Annotated for basice set of attributes Penn Treebank / WSJ

Pre-converted, manually annotated for basic sets of attributes Named entity annotation, co-reference (BBN) merged in Detailed breakdown of NP

Prague Arabic Dependency Treebank apply same representation to annotation of Arabic surface syntax so far

Both published (in first version) 2004 (by LDC) PCEDT/PEDT version 2.0 being prepared (2011?) Preliminary version available for browsing – see the workshop web

PDT (Czech) Data

4 sources: Lidové noviny (daily newspaper, incl. extra sections) DNES (Mladá fronta Dnes) (daily newspaper) Vesmír (popular science magazine, monthly) Českomoravský Profit (economical journal, weekly)

Full articles selected article ~ DOCUMENT (basic corpus unit)

Time period: 1990-1995 1.8 million tokens (~110,000 sentences)

PDT Annotation Layers L0 (w) Words (tokens)

automatic segmentation and markup only L1 (m) Morphology

Tag (full morphology, 13 categories), lemma L2 (a) Analytical layer (surface syntax)

Dependency, analytical dependency function L3 (t) Tectogrammatical layer (“deep” syntax)

Dependency, functor (detailed), grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon

Morphological Attributes

Tag: 13 categoriesExample: AAFP3----3N----Adjective no poss. Gender negatedRegular no poss. Number no voiceFeminine no person reserve1Plural no tense reserve2Dative superlative base

var.Lemma: POS-unique identifier

Books/verb -> book-1, went -> go, to/prep. -> to-1

Ex.: nejnezajímavějším“(to) the most uninteresting”

Morphological Tagset 13 categories, 4452 plausible tags (combinations):

Category # of values Example(s)POS 10 N (noun), Z (punctuation)SUBPOS 75 P (personal pron.), U (possessive adj.)GENDER 8 I (masc. inanimate), X (any), - (N.A)NUMBER 4 P (plural), D (dual)CASE 9 1 (nominative), 6 (locative)POSSGENDER 4 M (masc. animate), F (feminine)POSSNUMBER 3 S (singular), P (plural)PERSON 5 1 (first), ...TENSE 4 P (present), M (past)GRADE 5 3 (superlative)NEGATION 3 A (affirmative), N (negative)VOICE 3 A (active), P (passive)VAR 11 1 (1st variant), 6 (colloq. style), 8 (abbrev.)

Morphological Analysis Formally: MA: A+ → Pow(L x T)

MA(f) = { [ l,t ] }; f A+ (the token), l L (lemma), t T (tag)

tokens taken in isolation no attempt to solve e.g. auxiliaries vs. full verbs Ex.: MA(“má“) = { [mít,VB-S---3P-AA---], lit. “to have”

lit. “has”,”my” [můj,PSFS1-S1------1], lit. “my” [můj,PSFS5-S1------1], [můj,PSNP1-S1------1], [můj,PSNP4-S1------1], [můj,PSNP5-S1------1] }

Morphological Disambiguation

Full morphological disambiguation more complex than (e.g. English) POS tagging

Several full morphological taggers: (Pure) HMM Feature-based (MaxEnt, NB)

used in the PDT distribution Voted Perceptron, (M. Collins, EMNLP’02)

All: ~ 94-96% accuracy (perceptron is best) rule & statistic combination: tiny improvement

(Hajič et al., ACL 2001, Spoustova et al., 2007: > 96%)

The Segmentation Problem:Arabic

Tokenization / segmentation not always trivial Arabic, German, Chinese, Japanese

The Segmentation Problem:Solution for Arabic

Find max. no. of segments, concatenate up to max. 4 (x10) for Arabic

F---------VIIA-3MS--S----3MP4----------- sa-+yu-hbir-u+-hum+0

F---------VIIA-3MS--S----3MP4-----------

P---------SD----MS----------------------

P---------------------------------------

N-------2R------------------------------

N-------2D------------------------------

A-----FS2D------------------------------

Resulting annotation:

Arabic Tagging Results

Maximum entropy, features ~ categories Experiments on Penn Arabic Treebanks

POS: From 95.25-97.37% Full morph. 88.17-89.31% Segmentation: 98.60-99.37%

Prague Arabic Dependency Treebank POS: 96.02% Full morph.: 89.24% Segmentation: 99.25%

Layer 2 (a-layer): Analytical Syntax

Dependency + Analytical Function

dependent

governor

The influence of the Mexicancrisis on Central and EasternEurope has apparently been underestimated.

Analytical Syntax: Functions

Main (for [main] semantic lexemes):Pred, Sb, Obj, Adv, Atr, Atv(V), AuxV, Pnom“Double” dependency: AtrAdv, AtrObj, AtrAtr

Special (function words, punctuation,...):Reflefives, particles: AuxT, AuxR, AuxO, AuxZ, AuxYPrepositions/Conjunctions: AuxP, AuxCPunctuation, Graphics: AuxX, AuxS, AuxG, AuxK

StructuralElipsis: ExD, Coordination etc.: Coord, Apos

Surface Syntax Example

Complete sentence: Sb, Pred, Obj The-baker bakes rolls. Pekař peče housky.

Incomplete phrases Peter works well , but Paul badly Petr pracuje dobře, ale Pavel špatně

Variants (equal meaning) (he) bought shoes for boy koupil boty pro kluka

PDT-styleArabic Surface Syntax

Only several differences (Sometimes) Separate nodes for individual

segments (cf. tagging/segmentation) Copula treatment (Czech: rare treated as

ellispsis; Arabic: systematic solution needed): Pred (Added) analytic functions:

AuxM (did-not)

Ante (what)

Work by Faculty of Arts, Charles University Arabic language students

Arabic Surface SyntaxExample

In the section on literature, the magazine presented the issue of the Arabic language and the dangers that threaten it.

English Analytic Layer

By conversion from PTB Extended analytic functions

Head rules Jason Eisner’s, added more for full conversion

Coordination, traces, etc.

Coordination handling Same as in Czech/Arabic PDT

Penn Treebank

University of Pennsylvania, 1993 Linguistic Data Consortium

Wall Street Journal texts, ca. 50,000 sentences 1989-1991 Financial (most), news, arts, sports 2499 (2312) documents in 25 sections

Annotation POS (Part-of-speech tags) Syntactic “bracketing” + bracket (syntactic) labels (Syntactic) Function tags, traces, co-indexing + Propbanking

Penn Treebank Example

( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken) ) (, ,) (ADJP (NP (CD 61) (NNS years) ) (JJ old) ) (, ,) ) (VP (MD will) (VP (VB join) (NP (DT the) (NN board) ) (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director) )) (NP-TMP (NNP Nov.) (CD 29) ))) (. .) ))

Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.

POS tag (NNS)(noun, plural)

Phrase label (NP)

Noun Phrase

“Preterminal”

Penn Treebank Example:Sentence Tree

Phrase-based tree representation:

Parallel Czech-English Annotation

English text -> Czech text (human translation) Czech side (goal): all layers manual annotation English side (goal):

Morphology and surface syntax: technical conversion Penn Treebank style -> PDT Analytic layer

Tectogrammatical annotation: manual annotation (Slightly) different rules needed for English

Alignment Natural, sentence level only (now)

Human Translation ofWSJ Texts

Hired translators / FCE level Specific rules for translation

Sentence per sentence only …to get simple 1:1 alignment

Fluent Czech at the target side If a choice, prefer “literal” translation

The numbers: English tokens: 1,173,766 Translated to Czech:

Revised/PCEDT 1.0: 487,929 Now finished (all 2312 documents)

English Annotation POS and Syntax

Automatic conversion from Penn Treebank PDT morphological layer

From POS tags PDT analytic layer

From: Penn Treebank Syntactic Structure Non-terminal labels Function tags (non-terminal “suffixes”)

2-step process Head determination rules Conversion to dependency + analytic function

Head Determination Rules

Exhaustive set of rules By J. Eisner + M. Cmejrek/J. Curin 4000 rules (non-terminal based)

Ex.: (S (NP-SBJ VP .)) → VP Additional rules

Coordination, Apposition Punctuation (end-of-sentence, internal)

Original idea (possibility of conversion) J. Robinson (1960s)

Example: Head Determination Rules (J.E.)

(board)

(board)(the)

(join)

(will) (join)

(join)

(NP (DT NN)) → NN

(VP (VB NP)) → VB

(VP (MD VP)) → VP

(S (… VP …)) → VP

Rules:

Conversion: Analytic Structure, Functions

Analytic Function assignment (conversion) Rules

based on functional tags:-SBJ Sb -PRD Pnom

-BNF Obj -DTV Obj-LGS Obj -ADV Adv-DIR Adv -EXT Adv-LOC Adv -MNR Adv-PRP Adv -PUT Adv-TMP Adv

Ad-hoc rules (if functional tags missing) Lemmatization (years → year)

Example: Analytical Structure, Functions

(board)

(board)(the)

(join)

(will) (join)

(join)

→→

Penn Treebank structure

(with heads added) PDT-like Analytic

Representation

Annotation tools

TrEd Visualization, annotation, processing, search Perl, Perl-Tk Customizable Multiple data formats, native: Prague ML (XML)

Demonstration I Surface-syntax annotation

Speech reconstruction

Spoken input ASR does not do capitalization, punctuation Disfluencies Repetitions Corrections Fillers Deviations from syntax ...

ungrammatical / dissimilar to text

Objective

Create gold-standard data for (Statistical) training Testing

Use in machine learning of automatic speech reconstruction (eventually) language understanding

Go beyond state-of-the-art ASR Post-Correction / disfluency removal (cf. e.g. Fitzgerald, 2008, or Lopez/Cozar &

Callejas, 2008)

we ‘re sitting on the step ofwe ‘re sitting on the step of

What is Speech Reconstruction

original transcript → edited transcript resembling an interview editing for print

... apart from disfluency removal?

I think it ‘sI think it ‘s my aunt Molly ‘s housemy aunt Molly ‘s houseandand

I thinkI think

We’re sitting on the step of my aunt Molly’s house, We’re sitting on the step of my aunt Molly’s house, ..

my aunt Molly ‘s housemy aunt Molly ‘s house

Disfluency removal

Speech reconstruction

Word-for-word transcription

using Transcriber 1.5.1 audio synchronization spoken text non-speech events (UH-

HUH, laughter, etc.) rough segmentation

(defined acoustically) speaker/turn identification

Annotation rules

• sentence segmentation• orthography• capitalization• punctuation• morphosyntax• word order• partial ellipsis restoration• no discourse-irrelevant non- speech events• filler and fragment deletion• but: ...meaning preservation

~ written-text standards

Segment splitting

(reconstruction) segment = sentence

(transcript) segment ~ non-silence span

Word order, deletions, insertions, ...

punctuation, capitalization, ...

Function word insertion

Annotation layers

Multilayered standoff annotation

Current status

English dialogues 151,000 words (14,5 h) 16,000 words double annotated Audio / manual transcript / reconstruction Auto tagging/parsing: syntax / semnatics

Annotation manual Baseline automatic SR systems

Conclusion

Language resources Beyond post-ASR corrections:

“Speech Reconstruction” Integrated annotation of

Audio, ASR, manual transcription, edited speech reconstruction

[morphology, syntax, semantics] The next step: machine learning

Czech Example

Original transcription:ale taky důvod byl ten že škodováci byli hrozně rádi bejvali kdybych tam byla mohla nastoupit k ni - k nim jako do zaměstnání

Recovering punctuation and capitalization:Ale taky důvod byl ten , že škodováci byli hrozně rádi bejvali , kdybych tam byla , mohla nastoupit k ni k nim jako do zaměstnání

Translation:Ale taky důvod byl ten , že škodováci bývali byli hrozně rádi , kdybych tam byla , mohla nastoupit k ní , k nim jako do zaměstnání

Reference reconstructions:(a) Ale důvod byl taky ten , že "škodováci" by bývali byli hrozně rádi , kdybych tam k nim mohla nastoupit do zaměstnání .(b) Důvod byl ale taky ten , že škodováci by bývali byli hrozně rádi , kdybych tam byla mohla nastoupit k nim jako do zaměstnání .

Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Documents

Transcript of Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Hindi Dependency Parsing and Treebank Validationweb2py.iiit.ac.in/research_centres/publications/download/masters... · Thanks to Phani Gadde, Meher, Pujitha, GSK, Aswarth and other

Prague Dependency Treebank: Introduction

Valency in the Prague Dependency Treebank: …ufal.mff.cuni.cz/DEV/pbml/79-80/lopatkova.pdfValency in the Prague Dependency Treebank: Building the Valency Lexicon1 Markéta Lopatková

Creating a treebank - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/create...• Human errors are inevitable • Good guidelines, well-trained annotators, easy-to-use

Issue s of Valency in Prague Dependency Treebank: C reating Valency Lexicon of Verbs

A Dependency Treebank of Classical Chinese Poems John Lee and Yin Hei Kong The Halliday Centre for Intelligent Applications of Language Studies Department.

Prague Dependency Treebank(s) Workshop at LSA2011, Part I Jan Hajič, Zdeňka Urešová Institute of Formal and Applied Linguistics School of Computer Science.

Manual for Morphological Annotationufal.mff.cuni.cz/pcedt2.0/publications/mmanual-2.pdf · Manual for Morphological Annotation Revision for the Prague Dependency Treebank 2.0 ÚFAL

Using a Chinese treebank to measure dependency distancedickhudson.com/wp-content/uploads/2013/07/dd.pdf · Using a Chinese treebank to measure dependency distance HAITAO LIU, RICHARD

PDT 2.0 Prague Dependency Treebank 2ufal.mff.cuni.cz/~zabokrtsky/vyuka/pfl076-pdt-intro.pdf · PDT 2.0 Outline of the talk Introduction Layers of annotation Data Software tools Documentation

Complex Corpus Annotation: The Prague Dependency Treebank · Complex Corpus Annotation: The Prague Dependency Treebank Jan Haji č ÚFAL MFF UK Charles University Malostranské nám.

Annotation Procedure in Building the Prague Czech-English Dependency Treebank Marie Mikulová and Jan Štěpánek Institute of Formal and Applied Linguistics.

Prague Dependency Treebank 1.0 Functional Generative Description.

Prague Dependency Treebank 1.0

Danish Dependency Treebank - Annotation guide: Verbs

Lecture 11: Penn Treebank Parsing; Dependency Grammars

The Prague (Czech-)English Dependency Treebank Jan Hajič Charles University in Prague Computer Science School Institute of Formal and Applied Linguistics.

Dependency Relations for Sanskrit Parsing and Treebank

Penn Treebank Tagset

Issues of Valency in Prague Dependency Treebank: Creating Valency Lexicon of Verbs Markéta Lopatková Center for Computational Linguistics MFF UK, Prague.