Prague Dependency Treebank(s) Workshop at LSA2011, Part I


Prague Dependency Treebank(s)

Workshop at LSA2011, Part I

Jan Hajič, Zdeňka Urešová

Institute of Formal and Applied Linguistics

School of Computer Science

Faculty of Mathematics and Physics

Charles University, Prague

Czech Republic

July 30, 2011 LSA2011 Prague Dependency Treebanks I

Part I - Text to Syntax

The Prague Dependency Treebank projects
The theory behind it
The corpora
Morphology
    Dictionaries, tools (incl. POS tagger)
Dependency surface syntax
    Czech, Arabic, English
Parallel annotated corpus


The Prague Dependency Treebank(s)

The idea
    Apply the “old” Prague theory to real-world texts
    Provide enough data for ML experiments
“Old” Prague theory
    Prague structuralism (1930s)
    Stratificational approach
    Centered on “deep syntax”
        Separated from “surface form”
    Dependency based (how else?)


PDT: The Methodology

Manual annotation is PRIMARY
    Some help from existing tools possible
“No information loss, no redundancy”
    Much formalization, but…
    … original form always retrievable
Dictionaries
    In theory: “secondary”, side effect of annotation (generalization)
    In reality: help consistency
    Links: data → dictionary(-ies)
Result: used for Machine Learning
Ergonomics of annotation
    Graphical (“linguistic”) presentation & editing


The Prague Dependency Treebank Project: Czech Treebank

1995 (Dublin)
1996-2006-...
1998 PDT v. 0.5 released (JHU workshop)
    400k words manually annotated, unchecked
2001 PDT 1.0 released (LDC):
    1.3MW annotated, morphology & surface syntax
2006 PDT 2.0 release
    0.8MW annotated (50k sentences)
    + PDT 1.0 corrected
    the “tectogrammatical layer”
        underlying (deep) syntax


Related Projects (Treebanks)

Prague Czech-English Dependency Treebank
    WSJ portion of PTB, translated to Czech (1.2 mil. words)
        Annotated for a basic set of attributes
    Penn Treebank / WSJ
        Pre-converted, manually annotated for basic sets of attributes
        Named entity annotation, co-reference (BBN) merged in
        Detailed breakdown of NP
Prague Arabic Dependency Treebank
    apply same representation to annotation of Arabic
    surface syntax so far
Both published (in first version) 2004 (by LDC)
PCEDT/PEDT version 2.0 being prepared (2011?)
Preliminary version available for browsing – see the workshop website


PDT (Czech) Data

4 sources:
    Lidové noviny (daily newspaper, incl. extra sections)
    DNES (Mladá fronta Dnes) (daily newspaper)
    Vesmír (popular science magazine, monthly)
    Českomoravský Profit (economic journal, weekly)
Full articles selected
    article ~ DOCUMENT (basic corpus unit)
Time period: 1990-1995
1.8 million tokens (~110,000 sentences)


PDT Annotation Layers

L0 (w) Words (tokens)
    automatic segmentation and markup only
L1 (m) Morphology
    Tag (full morphology, 13 categories), lemma
L2 (a) Analytical layer (surface syntax)
    Dependency, analytical dependency function
L3 (t) Tectogrammatical layer (“deep” syntax)
    Dependency, functor (detailed), grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon

[Side labels in the slide: PDT 1.0 (2001), PDT 2.0 (2006)]


Morphological Attributes

Tag: 13 categories
Example: AAFP3----3N----
    A  Adjective      -  no poss. gender    N  negated
    A  Regular        -  no poss. number    -  no voice
    F  Feminine       -  no person          -  reserve1
    P  Plural         -  no tense           -  reserve2
    3  Dative         3  superlative        -  base var.
Ex.: nejnezajímavějším “(to) the most uninteresting”

Lemma: POS-unique identifier
    Books/verb -> book-1, went -> go, to/prep. -> to-1


Morphological Tagset

13 categories, 4452 plausible tags (combinations):

Category     # of values  Example(s)
POS          10           N (noun), Z (punctuation)
SUBPOS       75           P (personal pron.), U (possessive adj.)
GENDER       8            I (masc. inanimate), X (any), - (N.A.)
NUMBER       4            P (plural), D (dual)
CASE         9            1 (nominative), 6 (locative)
POSSGENDER   4            M (masc. animate), F (feminine)
POSSNUMBER   3            S (singular), P (plural)
PERSON       5            1 (first), ...
TENSE        4            P (present), M (past)
GRADE        5            3 (superlative)
NEGATION     3            A (affirmative), N (negative)
VOICE        3            A (active), P (passive)
VAR          11           1 (1st variant), 6 (colloq. style), 8 (abbrev.)
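A positional tag can be read one character per category. The sketch below decodes the slide's example tag; the position order follows the tagset table, and the two reserve slots are taken from the 15-character example rather than from the table, so treat the exact layout as an assumption. The names `POSITIONS` and `decode_tag` are illustrative only.

```python
# Position names follow the tagset table; the two reserve slots come from
# the 15-character example tag AAFP3----3N---- (assumed layout).
POSITIONS = [
    "POS", "SUBPOS", "GENDER", "NUMBER", "CASE",
    "POSSGENDER", "POSSNUMBER", "PERSON", "TENSE",
    "GRADE", "NEGATION", "VOICE", "RESERVE1", "RESERVE2", "VAR",
]


def decode_tag(tag):
    """Map a positional tag to {category: value}, skipping '-' (not applicable)."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} characters, got {len(tag)}")
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}


decoded = decode_tag("AAFP3----3N----")
for category, value in decoded.items():
    print(category, value)
```

Only the filled positions are returned, which matches the slide's reading of the tag: adjective, regular, feminine, plural, dative, superlative, negated.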


Morphological Analysis

Formally: MA: A+ → Pow(L × T)
    MA(f) = { [l, t] }; f ∈ A+ (the token), l ∈ L (lemma), t ∈ T (tag)
tokens taken in isolation
no attempt to solve e.g. auxiliaries vs. full verbs
Ex.: MA(“má”) = {                 (“má”: lit. “has”, “my”)
    [mít, VB-S---3P-AA---],       lit. “to have”
    [můj, PSFS1-S1------1],       lit. “my”
    [můj, PSFS5-S1------1],
    [můj, PSNP1-S1------1],
    [můj, PSNP4-S1------1],
    [můj, PSNP5-S1------1] }
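The definition MA: A+ → Pow(L × T) says that analysis maps an isolated token to a set of (lemma, tag) pairs. A minimal sketch of that signature, using a toy lexicon that holds only the slide's “má” example; `LEXICON` and `ma` are hypothetical names, and a real analyzer would be backed by a full morphological dictionary.

```python
# Toy lexicon with the single example from the slide; not a real dictionary.
LEXICON = {
    "má": {
        ("mít", "VB-S---3P-AA---"),
        ("můj", "PSFS1-S1------1"),
        ("můj", "PSFS5-S1------1"),
        ("můj", "PSNP1-S1------1"),
        ("můj", "PSNP4-S1------1"),
        ("můj", "PSNP5-S1------1"),
    },
}


def ma(form):
    """MA(f): return the set of (lemma, tag) analyses of a token in isolation."""
    return set(LEXICON.get(form, set()))


analyses = ma("má")
lemmas = {lemma for lemma, _tag in analyses}
print(len(analyses), sorted(lemmas))
```

Because the token is analyzed without context, both the verb reading and the possessive readings are returned; choosing among them is the job of the disambiguation step.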


Morphological Disambiguation

Full morphological disambiguation
    more complex than (e.g. English) POS tagging
Several full morphological taggers:
    (Pure) HMM
    Feature-based (MaxEnt, NB)
        used in the PDT distribution
    Voted Perceptron (M. Collins, EMNLP’02)
All: ~ 94-96% accuracy (perceptron is best)
    rule & statistic combination: tiny improvement
    (Hajič et al., ACL 2001; Spoustová et al., 2007: > 96%)


The Segmentation Problem: Arabic

Tokenization / segmentation not always trivial
    Arabic, German, Chinese, Japanese


The Segmentation Problem: Solution for Arabic

Find max. no. of segments, concatenate up to max. 4 (x10) for Arabic

Resulting annotation:

F---------VIIA-3MS--S----3MP4----------- sa-+yu-hbir-u+-hum+0

P---------SD----MS----------------------

P---------------------------------------

N-------2R------------------------------

N-------2D------------------------------

A-----FS2D------------------------------
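One reading of “max. 4 (x10)” is that a token carries up to four segment tags of ten positions each, concatenated into a single string. Under that assumption (an interpretation of the slide, not a documented format), the concatenated tag can be cut back into per-segment tags as sketched below; `SEGMENT_WIDTH` and `split_segment_tags` are hypothetical names.

```python
SEGMENT_WIDTH = 10  # assumed number of tag positions per segment


def split_segment_tags(tag, width=SEGMENT_WIDTH):
    """Cut a concatenated tag into consecutive fixed-width segment tags."""
    return [tag[i:i + width] for i in range(0, len(tag), width)]


# Tag string copied from the slide's example for sa-+yu-hbir-u+-hum+0.
TAG = "F---------VIIA-3MS--S----3MP4-----------"
segments = split_segment_tags(TAG)
for segment in segments:
    print(segment)
```

A segment slot filled only with dashes would then mean that the token has fewer than four segments.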


Arabic Tagging Results

Maximum entropy, features ~ categories
Experiments on Penn Arabic Treebanks
    POS: 95.25-97.37%
    Full morph.: 88.17-89.31%
    Segmentation: 98.60-99.37%
Prague Arabic Dependency Treebank
    POS: 96.02%
    Full morph.: 89.24%
    Segmentation: 99.25%


Layer 2 (a-layer): Analytical Syntax

Dependency + Analytical Function

dependent

governor

The influence of the Mexican crisis on Central and Eastern Europe has apparently been underestimated.


Analytical Syntax: Functions

Main (for [main] semantic lexemes):
    Pred, Sb, Obj, Adv, Atr, Atv(V), AuxV, Pnom
    “Double” dependency: AtrAdv, AtrObj, AtrAtr
Special (function words, punctuation, ...):
    Reflexives, particles: AuxT, AuxR, AuxO, AuxZ, AuxY
    Prepositions/Conjunctions: AuxP, AuxC
    Punctuation, Graphics: AuxX, AuxS, AuxG, AuxK
Structural
    Ellipsis: ExD
    Coordination etc.: Coord, Apos


Surface Syntax Example

Complete sentence: Sb, Pred, Obj
    The-baker bakes rolls.
    Pekař peče housky.
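On the analytical layer each word is a node carrying a link to its governor and an analytical function. A minimal sketch of the baker sentence in that shape follows; the `Node` class and the 1-based `head` convention are illustrative assumptions, not the PDT file format, and the sentence-final period is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Node:
    form: str   # word form
    afun: str   # analytical function (Sb, Pred, Obj, ...)
    head: int   # 1-based position of the governor; 0 marks the sentence root


# "Pekař peče housky." without the final punctuation node.
sentence = [
    Node("Pekař", "Sb", 2),
    Node("peče", "Pred", 0),
    Node("housky", "Obj", 2),
]


def dependents(nodes, head_index):
    """Return the word forms governed by the node at position head_index."""
    return [node.form for node in nodes if node.head == head_index]


print(dependents(sentence, 2))
```

Both the subject and the object attach directly to the predicate, so no phrase-level nodes are needed.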


Surface Syntax Example

Incomplete phrases
    Peter works well, but Paul badly
    Petr pracuje dobře, ale Pavel špatně


Surface Syntax Example

Variants (equal meaning)
    (he) bought shoes for boy
    koupil boty pro kluka


PDT-style Arabic Surface Syntax

Only a few differences
    (Sometimes) separate nodes for individual segments (cf. tagging/segmentation)
    Copula treatment (Czech: rare, treated as ellipsis; Arabic: systematic solution needed): Pred
    (Added) analytic functions:
        AuxM (did-not)
        Ante (what)
Work by Faculty of Arts, Charles University Arabic language students


Arabic Surface Syntax Example

In the section on literature, the magazine presented the issue of the Arabic language and the dangers that threaten it.


English Analytic Layer

By conversion from PTB
Extended analytic functions
Head rules
    Jason Eisner’s, added more for full conversion
        Coordination, traces, etc.
Coordination handling
    Same as in Czech/Arabic PDT


Penn Treebank

University of Pennsylvania, 1993
    Linguistic Data Consortium
Wall Street Journal texts, ca. 50,000 sentences
    1989-1991
    Financial (most), news, arts, sports
    2499 (2312) documents in 25 sections
Annotation
    POS (Part-of-speech tags)
    Syntactic “bracketing” + bracket (syntactic) labels
    (Syntactic) Function tags, traces, co-indexing
    + Propbanking


Penn Treebank Example

( (S
    (NP-SBJ
      (NP (NNP Pierre) (NNP Vinken) )
      (, ,)
      (ADJP
        (NP (CD 61) (NNS years) )
        (JJ old) )
      (, ,) )
    (VP (MD will)
      (VP (VB join)
        (NP (DT the) (NN board) )
        (PP-CLR (IN as)
          (NP (DT a) (JJ nonexecutive) (NN director) ))
        (NP-TMP (NNP Nov.) (CD 29) )))
    (. .) ))

Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.

Labels in the figure:
    POS tag (NNS): noun, plural
    Phrase label (NP): Noun Phrase
    “Preterminal”
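The preterminals of the bracketed example are the innermost `(TAG word)` groups, so the POS-tagged token sequence can be recovered with a single regular expression. A small sketch, assuming the simple bracketing shown here (no escaped parentheses inside tokens); `preterminals` is a hypothetical helper name.

```python
import re

# Bracketed sentence copied from the slide.
PTB = (
    "( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken) ) (, ,) "
    "(ADJP (NP (CD 61) (NNS years) ) (JJ old) ) (, ,) ) "
    "(VP (MD will) (VP (VB join) (NP (DT the) (NN board) ) "
    "(PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director) )) "
    "(NP-TMP (NNP Nov.) (CD 29) ))) (. .) ))"
)


def preterminals(bracketed):
    """Return (POS tag, word) pairs: groups of the form '(TAG word)'."""
    return re.findall(r"\(([^\s()]+) ([^\s()]+)\)", bracketed)


pairs = preterminals(PTB)
for tag, word in pairs:
    print(f"{word}\t{tag}")
```

Phrase labels such as `NP` or `VP` are skipped automatically because they are followed by another opening bracket rather than by a word.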


Penn Treebank Example: Sentence Tree

Phrase-based tree representation:


Parallel Czech-English Annotation

English text -> Czech text (human translation)
Czech side (goal): all layers manual annotation
English side (goal):
    Morphology and surface syntax: technical conversion
        Penn Treebank style -> PDT Analytic layer
    Tectogrammatical annotation: manual annotation
        (Slightly) different rules needed for English
Alignment
    Natural, sentence level only (now)


Human Translation of WSJ Texts

Hired translators / FCE level
Specific rules for translation
    Sentence per sentence only
        …to get simple 1:1 alignment
    Fluent Czech at the target side
    If a choice, prefer “literal” translation
The numbers:
    English tokens: 1,173,766
    Translated to Czech:
        Revised/PCEDT 1.0: 487,929
        Now finished (all 2312 documents)


English Annotation POS and Syntax

Automatic conversion from Penn Treebank
PDT morphological layer
    From POS tags
PDT analytic layer
    From:
        Penn Treebank Syntactic Structure
        Non-terminal labels
        Function tags (non-terminal “suffixes”)
    2-step process
        Head determination rules
        Conversion to dependency + analytic function


Head Determination Rules

Exhaustive set of rules
    By J. Eisner + M. Čmejrek / J. Cuřín
    4000 rules (non-terminal based)
        Ex.: (S (NP-SBJ VP .)) → VP
Additional rules
    Coordination, Apposition
    Punctuation (end-of-sentence, internal)
Original idea (possibility of conversion)
    J. Robinson (1960s)


Example: Head Determination Rules (J.E.)

Rules:
    (NP (DT NN)) → NN
    (VP (VB NP)) → VB
    (VP (MD VP)) → VP
    (S (… VP …)) → VP

[Figure: phrase-structure tree with the lexical head of each node shown in parentheses: (the), (board), (will), (join)]
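The two-step idea (pick a head child for every phrase, then attach the other children's lexical heads to it) can be sketched on the “will join the board” fragment. This is a simplification under stated assumptions: the slide's rules are keyed on the full sequence of children, whereas the hypothetical `HEAD_PRIORITY` table below only lists preferred child labels per parent, which is enough to reproduce the four rules shown.

```python
# Tree fragment as nested tuples: (label, child, ...) or (POS, word).
TREE = ("S",
        ("VP", ("MD", "will"),
               ("VP", ("VB", "join"),
                      ("NP", ("DT", "the"), ("NN", "board")))))

# Preferred head child per parent label (illustrative, mirrors the four rules).
HEAD_PRIORITY = {"NP": ["NN"], "VP": ["VP", "VB"], "S": ["VP"]}


def is_leaf(node):
    return len(node) == 2 and isinstance(node[1], str)


def pick_head(label, children):
    """Return the index of the head child according to HEAD_PRIORITY."""
    for wanted in HEAD_PRIORITY[label]:
        for i, child in enumerate(children):
            if child[0] == wanted:
                return i
    raise ValueError(f"no head rule matches {label}")


def head_word(node, deps):
    """Return the lexical head of node; append (dependent, governor) pairs to deps."""
    if is_leaf(node):
        return node[1]
    label, children = node[0], node[1:]
    child_heads = [head_word(child, deps) for child in children]
    head_index = pick_head(label, children)
    for i, word in enumerate(child_heads):
        if i != head_index:
            deps.append((word, child_heads[head_index]))
    return child_heads[head_index]


deps = []
root = head_word(TREE, deps)
print(root, deps)
```

The lexical head percolates upward exactly as in the figure: “board” heads the NP, “join” heads both VPs and the sentence, and every non-head child becomes a dependent of its sibling head.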


Conversion: Analytic Structure, Functions

Analytic Function assignment (conversion)
    Rules based on functional tags:
        -SBJ → Sb     -PRD → Pnom
        -BNF → Obj    -DTV → Obj    -LGS → Obj
        -ADV → Adv    -DIR → Adv    -EXT → Adv    -LOC → Adv
        -MNR → Adv    -PRP → Adv    -PUT → Adv    -TMP → Adv
    Ad-hoc rules (if functional tags missing)
Lemmatization (years → year)
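The tag-based part of the assignment is a plain lookup. The sketch below encodes the mapping listed on the slide; the function name `afun_for` and the choice to return a caller-supplied default when no function tag matches are assumptions standing in for the ad-hoc rules, which the slide does not spell out.

```python
# Penn Treebank function tag -> PDT-style analytic function (from the slide).
FUNCTION_TAG_TO_AFUN = {
    "SBJ": "Sb", "PRD": "Pnom",
    "BNF": "Obj", "DTV": "Obj", "LGS": "Obj",
    "ADV": "Adv", "DIR": "Adv", "EXT": "Adv", "LOC": "Adv",
    "MNR": "Adv", "PRP": "Adv", "PUT": "Adv", "TMP": "Adv",
}


def afun_for(label, default=None):
    """Return the analytic function for a non-terminal label such as 'NP-SBJ'."""
    for suffix in label.split("-")[1:]:
        if suffix in FUNCTION_TAG_TO_AFUN:
            return FUNCTION_TAG_TO_AFUN[suffix]
    return default  # placeholder for the ad-hoc rules


for label in ["NP-SBJ", "NP-TMP", "PP-CLR", "NP"]:
    print(label, afun_for(label))
```

Labels whose suffix is not in the table (here `PP-CLR`) fall through to the default, which is where the ad-hoc rules would take over.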


Example: Analytical Structure, Functions

[Figure: Penn Treebank structure (with heads added) → PDT-like analytic representation]


Annotation tools

TrEd
    Visualization, annotation, processing, search
    Perl, Perl-Tk
    Customizable
    Multiple data formats, native: Prague ML (XML)
Demonstration I
    Surface-syntax annotation


Speech reconstruction

Spoken input
    ASR does not do capitalization, punctuation
    Disfluencies
        Repetitions
        Corrections
        Fillers
    Deviations from syntax
    ...
ungrammatical / dissimilar to text


Objective

Create gold-standard data for
    (Statistical) training
    Testing
Use in machine learning of
    automatic speech reconstruction
    (eventually) language understanding
Go beyond state-of-the-art
    ASR post-correction / disfluency removal (cf. e.g. Fitzgerald, 2008, or López-Cózar & Callejas, 2008)


What is Speech Reconstruction ... apart from disfluency removal?

original transcript → edited transcript
    resembling an interview editing for print

[Figure: the transcript fragments “we ’re sitting on the step of”, “I think it ’s”, “my aunt Molly ’s house”, “and”, “I think” are edited into “We’re sitting on the step of my aunt Molly’s house, ..”; figure labels: Disfluency removal, Speech reconstruction, ????]


Word-for-word transcription

using Transcriber 1.5.1
    audio synchronization
    spoken text
    non-speech events (UH-HUH, laughter, etc.)
    rough segmentation (defined acoustically)
    speaker/turn identification


Annotation rules

• sentence segmentation
• orthography
• capitalization
• punctuation
• morphosyntax
• word order
• partial ellipsis restoration
• no discourse-irrelevant non-speech events
• filler and fragment deletion
• but: ...meaning preservation

~ written-text standards


Segment splitting

(reconstruction) segment = sentence

(transcript) segment ~ non-silence span


Word order, deletions, insertions, ...

punctuation, capitalization, ...


Function word insertion


Annotation layers

Multilayered standoff annotation


Current status

English dialogues
    151,000 words (14.5 h)
    16,000 words double annotated
    Audio / manual transcript / reconstruction
    Auto tagging/parsing: syntax / semantics
Annotation manual
Baseline automatic SR systems


Conclusion

Language resources
    Beyond post-ASR corrections: “Speech Reconstruction”
    Integrated annotation of
        Audio, ASR, manual transcription, edited speech reconstruction [morphology, syntax, semantics]
The next step: machine learning


Czech Example

Original transcription:
ale taky důvod byl ten že škodováci byli hrozně rádi bejvali kdybych tam byla mohla nastoupit k ni - k nim jako do zaměstnání

Recovering punctuation and capitalization:
Ale taky důvod byl ten , že škodováci byli hrozně rádi bejvali , kdybych tam byla , mohla nastoupit k ni k nim jako do zaměstnání

Translation:
Ale taky důvod byl ten , že škodováci bývali byli hrozně rádi , kdybych tam byla , mohla nastoupit k ní , k nim jako do zaměstnání

Reference reconstructions:
(a) Ale důvod byl taky ten , že "škodováci" by bývali byli hrozně rádi , kdybych tam k nim mohla nastoupit do zaměstnání .
(b) Důvod byl ale taky ten , že škodováci by bývali byli hrozně rádi , kdybych tam byla mohla nastoupit k nim jako do zaměstnání .

(Approximate English gloss of (a): “But the reason was also that the Škoda people would have been very glad if I could have taken up a job with them there.”)