Prague Dependency Treebank(s) Workshop at LSA2011, Part II Jan Hajič, Zdeňka Urešová Institute of Formal and Applied Linguistics School of Computer Science Faculty of Mathematics and Physics Charles University, Prague Czech Republic

Prague Dependency Treebank(s)

Workshop at LSA2011, Part II Jan Hajič, Zdeňka Urešová

Institute of Formal and Applied Linguistics

School of Computer Science, Faculty of Mathematics and Physics

Charles University, Prague, Czech Republic

July 30, 2011 LSA 2011 Prague Dependency Treebanks II 2

Part II - Syntax and Semantics

Tectogrammatical representation
Valency lexicon
Languages: Czech, Arabic and English
Technical issues: annotation scheme and format, tools for annotation, applications
Summary, pointers, conclusion

PDT Annotation Layers

L0 (w) Words (tokens)
  automatic segmentation and markup only
L1 (m) Morphology
  tag (full morphology, 13 categories), lemma
L2 (a) Analytical layer (surface syntax)
  dependency, analytical dependency function
L3 (t) Tectogrammatical layer (“deep” syntax)
  dependency, functor (detailed), grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon

Layer 3 (t-layer): Tectogrammatical

Underlying (deep) syntax
4 sublayers (integrated):
  dependency structure, (detailed) functors, valency annotation
  topic/focus and deep word order
  coreference (mostly grammatical only)
  all the rest (grammatemes): detailed functors, underlying gender, number, ...
Total: 39 attributes (vs. 5 at m-layer, 2 at a-layer)

Analytical vs. Tectogrammatical

[Figure: analytical and tectogrammatical trees of one sentence (TR: sublayer 1 only shown). Callouts: underlying verb + tense; deep function; elided Actor in; prepositions out; another ellipsis...]

Layer 3: Tectogrammatical

Underlying (deep) syntax
4 sublayers:
  dependency structure, (detailed) functors
  topic/focus and deep word order
  coreference (mostly grammatical only)
  all the rest (grammatemes): detailed functors, underlying gender, number, ...

Tectogrammatical Functors

“Actants”: ACT, PAT, EFF, ADDR, ORIG
  modify: verbs, nouns, adjectives
  cannot repeat in a clause, usually obligatory
Free modifications (~ 50), semantically defined
  can repeat; optional, sometimes obligatory
  Ex.: LOC, DIR1, ...; TWHEN, TTILL, ...; RSTR; BEN, ATT, ACMP, INTT, MANN; MAT, APP; ID, DPHR, ...
Special
  Coordination, Rhematizers, Foreign phrases, ...
[Figure labels: syntactic ... semantic]

Tectogrammatical Example

Analytical verb form:
  (he) allowed would-be to-be enrolled
  směl by být zapsán
Additional attributes (grammatemes): conditional + “allow”
[Figure callout: Collapsed]

Tectogrammatical Example

Passive construction (action):
  (The) book has-been translated [by Mr. X]
  Kniha byla přeložena
[Figure callouts: Disappeared; Added]

Tectogrammatical Example

Object:
  (he) gave him a-book
  dal mu knihu
Obj goes into ACT, PAT, ADDR, EFF or ORIG based on governor’s valency frame

Tectogrammatical Example

Incomplete phrases:
  Peter works well, but Paul badly
  Petr pracuje dobře, ale Pavel špatně
[Figure callout: Added]

Deep Word Order, Topic/Focus

Example:
  Baker bakes rolls. vs. Baker(IC) bakes rolls.
  [Figure: analytical dependency tree]

Deep word order:
  from “old” information to the “new” one (left-to-right) at every level (head included)
  projectivity by definition (almost...)
  i.e., partial level-based order -> total d.w.o.
Topic/focus/contrastive topic
  attribute of every node (t, f, c)
  restricted by d.w.o. and other constraints

Coreference

Grammatical
  relative clauses: which, who
    Peter and Paul, who ...
  control, infinitival constructions
    John promised to go ...
  reflexive pronouns: {him,her,them}self(-ves)
    Mary saw herself in ...

[Figure: t-tree in which promise (PRED) governs John (ACT) and go (PAT); go governs he (ACT) and home (DIR3)]

Coreference

Textual
  Ex.: Peter moved to Iowa after he finished his PhD.

[Figure: t-tree in which move (PRED) governs Peter (ACT), Iowa (DIR1) and finish (TWHEN); finish governs he (ACT) and PhD (PAT); PhD governs he (APP)]

Grammatemes

Detailed functors (subfunctors)
  only for some functors:
    TWHEN: before/after
    LOC: next-to, behind, in-front-of, ...
    also: ACMP, BEN, CPR, DIR1, DIR2, DIR3, EXT
Lexical (underlying)
  number (SG/PL), tense, modality, degree of comparison, ...
  strictly only where necessary (agreement!)

Example - simplified view

Se zuby jsem měl v minulosti jen problémy.
With teeth I-have had in the-past only problems.

Fully Annotated Sentence

The boundaries of some problems seem to be clearer after they were revived by Havel’s speech.

Arabic Example: Tectogrammatics

In the section on literature, the magazine presented the issue of the Arabic language and the dangers that threaten it.

English PDT-style Annotation

Morphology and Syntax
  By conversion
Tectogrammatical annotation
  Guidelines (English TR: by S. Cinková)
  Pre-annotation: transformation from Penn Treebank & Propbank (Palmer, Kingsbury) by Z. Žabokrtský et al.
Valency
  From Propbank Frame Files (Cinková, Šindlerová, Nedolužko, Semecký)

Example - English TR

[Figure: annotated English tree. Legend: words; dependencies; sem. function; valency (predicates); coref (BBN); named entities (BBN)]

Valency in the PDT

Valency: specific ability of a word to combine itself with other units of meaning

[Figure: three verbs with example dependents and their functors]
  dát (give): Eva, matka (mother); functors ACT, ADDR
  pršet (rain): zítra (tomorrow); functor TWHEN
  plakat (cry): Adam, noc (night); functors ACT, TWHEN
  further nodes: dar (gift) PAT; neděle (Sunday) TWHEN; ---
  [Figure labels: Specific behavior; Modifies anything]

Valency - Basic Principles

inner participants vs. free modifications (arguments vs. adjuncts)

obligatory vs. optional modifications (the dialogue test)

Inner Participant … Free Modification

Inner participants: ACT(or), PAT(ient), ADDR(essee), EFF(ect), ORIG(in) (5)
  each occurs just with particular verbs
  each modifies the verb only once (in a clause)
Free modifications: Location (LOC, DIR1, …), Time (TWHEN, TTILL, …), Manner, Intention, … (70)
  can modify in principle any verb
  can be repeated (within the same clause)
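
The "only once per clause" contrast can be sketched as a small check. This is an illustration only: the function name is hypothetical, the input is assumed to be the list of functors of one verb's direct dependents, and cases such as coordination are ignored.

```python
from collections import Counter

# The five inner participants named on the slide.
INNER_PARTICIPANTS = {"ACT", "PAT", "ADDR", "EFF", "ORIG"}


def repeated_inner_participants(functors):
    """Return inner-participant functors that occur more than once
    among the dependents of a single verb (one clause)."""
    counts = Counter(f for f in functors if f in INNER_PARTICIPANTS)
    return sorted(f for f, n in counts.items() if n > 1)


# Free modifications (here TWHEN) may repeat; inner participants may not.
print(repeated_inner_participants(["ACT", "PAT", "TWHEN", "TWHEN", "LOC"]))
print(repeated_inner_participants(["ACT", "PAT", "PAT", "LOC"]))
```

The first call reports no violation, the second reports the doubled PAT.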

Inner Participants

syntactic criteria - Actor and Patient
semantic criteria for other inner participants (if a verb has more than two arguments)

Argument shifting: Actor, Patient, Addressee, Origin, Effect
  Petr has dug a hole.
    Semantic Effect (as a cognitive role) shifted to the position of Patient.
  The teacher asked a pupil.
    Semantic Addressee shifted to the position of Patient.

Obligatory … Optional

The Dialogue Test
Answering a question about a semantically obligatory modification, the speaker cannot say: I don't know.

A: John left.
B: From where?
A: *I don't know.
“from where”: obligatory modification

A: John left.
B: To where?
A: I don't know.
“to where”: optional modification

Valency frame

[Figure: grid of obligatory / optional against argument / adjunct]

Structure: one meaning of the word, one valency frame
Contents: functor, obligatoriness, surface form

word: leave
  meaning 1: sb left sth
    frame1: ACT PAT
  meaning 2: sb left from somewhere
    frame2: ACT DIR1
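
A minimal sketch of how such a frame could be held in memory, with the three content fields the slide names (functor, obligatoriness, surface form). The class and function names are hypothetical, and marking every listed slot as obligatory is an assumption made for the example, not something the slide states.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Slot:
    functor: str                        # e.g. "ACT"
    obligatory: bool                    # assumption: set by hand below
    surface_form: Optional[str] = None  # the slide gives none for "leave"


@dataclass(frozen=True)
class ValencyFrame:
    frame_id: str
    gloss: str
    slots: Tuple[Slot, ...]


# One frame per meaning of the word, as on the slide.
LEAVE = {
    "frame1": ValencyFrame("frame1", "sb left sth",
                           (Slot("ACT", True), Slot("PAT", True))),
    "frame2": ValencyFrame("frame2", "sb left from somewhere",
                           (Slot("ACT", True), Slot("DIR1", True))),
}


def missing_obligatory(frame, functors_present):
    """List obligatory functors of the frame that are absent from a clause."""
    present = set(functors_present)
    return [s.functor for s in frame.slots
            if s.obligatory and s.functor not in present]


print(missing_obligatory(LEAVE["frame2"], ["ACT"]))
print(missing_obligatory(LEAVE["frame1"], ["ACT", "PAT", "TWHEN"]))
```

A non-empty result marks a slot that is not expressed on the surface, for example an elided participant.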

Valency lexicon: PDT-VALLEX

8500 verb senses / valency frames
9000 noun senses / valency frames
some adjectives and adverbs

PDT-VALLEX Entry
  verb: dosáhnout
  meaning 1: to reach sth
  meaning 2: to get sb to do sth
  meaning 3: …
  meaning 4: …

The PDT-VALLEX editor

[Screenshot of the editor; senses shown: ‘lay down’, resign, win, ask]

Valency Lexicon and TrEd

to write sth (about sth)

Corpus <-> Valency Lexicon

Corpus: occurrences of “uzavřít” (to close):
  Sentence 2035, Sentence 15345, Sentence 51042 [shown as figures]

Lexicon:
  ENTRY: uzavřít
    vf1: ACT(.1) CPHR({smlouva}.4)
      ex.: u. dohodu (close a contract)
    vf2: ACT(.1) PAT(.4)
      ex.: u. pokoj (close a room, house)
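
The frame notation in the lexicon entry above is regular enough to split mechanically. The sketch below is an assumption-laden reading, not the PDT-VALLEX file format: it treats each slot as FUNCTOR(form), a digit after a dot as a morphological case number (the later translation slides pair .4 with accusative and .2 with genitive), and text in braces as a required lemma. The function name is hypothetical.

```python
import re

SLOT_RE = re.compile(r"([A-Z0-9]+)\(([^)]*)\)")  # FUNCTOR(form)
CASE_RE = re.compile(r"\.(\d)")                  # ".4" -> case 4
LEMMA_RE = re.compile(r"\{([^}]*)\}")            # "{smlouva}" -> lemma


def parse_frame(text):
    """Split a frame string into slots with functor, raw form, case, lemma."""
    slots = []
    for functor, form in SLOT_RE.findall(text):
        case = CASE_RE.search(form)
        lemma = LEMMA_RE.search(form)
        slots.append({
            "functor": functor,
            "form": form,
            "case": int(case.group(1)) if case else None,
            "lemma": lemma.group(1) if lemma else None,
        })
    return slots


for slot in parse_frame("ACT(.1) CPHR({smlouva}.4)"):
    print(slot)
```

Matching a corpus occurrence against vf1 or vf2 would then compare the functors and cases of the verb's dependents with the parsed slots.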

Valency and Text Generation

Tectogrammatical Representation has all the information to (re)generate the surface form of the sentence:
  in a “generalized” form
  non-redundant (almost... but for generation, it is o.k.)
...except the links to a-layer, however
  links used only for training [statistical models for] parsing/generation modules
  not present when e.g. doing text planning, translation, ...
valency dictionary: form of “learned” knowledge

Valency and Text Generation

Using valency for...
  ...getting the correct (lemma, tag) of verb arguments

Example:
  Tectogrammatical tree: starat_se (PRED, “to take care of”) governs Martin (ACT) and tygr (PAT, “tiger”)
  VALLEX entry: starat (se) ACT(.1) PAT(o.[.4])
  Resulting (lemma, tag) pairs:
    Martin  ....1..........
    se      ...............
    starat  V..............
    o       ...............
    tygr    ....4..........
  Martin se stará o tygry.
  “Martin takes care of tigers.”
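
The step from the VALLEX entry to the (lemma, tag) pairs can be sketched as follows. This is a simplified illustration under stated assumptions: only the two form patterns seen in the example (".1" and "o.[.4]") are handled, the tag is a 15-position pattern with just the case slot filled as on the slide, and the function names are hypothetical. Word order, the verb and the reflexive "se" are left out.

```python
import re

TAG_LENGTH = 15
# Optional "prep.[" prefix, then ".<case>", then an optional closing "]".
FORM_RE = re.compile(r"(?:(\w+)\.\[)?\.(\d)\]?")


def tag_with_case(case):
    """Underspecified tag: only the case position (5th) is set."""
    return "...." + str(case) + "." * (TAG_LENGTH - 5)


def realize_argument(lemma, form):
    """Expand a frame form such as '.1' or 'o.[.4]' into (lemma, tag) pairs."""
    m = FORM_RE.fullmatch(form)
    if m is None:
        raise ValueError("unsupported form: " + form)
    preposition, case = m.group(1), m.group(2)
    pairs = []
    if preposition:
        pairs.append((preposition, "." * TAG_LENGTH))
    pairs.append((lemma, tag_with_case(case)))
    return pairs


frame = {"ACT": ".1", "PAT": "o.[.4]"}          # starat (se)
arguments = {"ACT": "Martin", "PAT": "tygr"}    # from the t-tree
for functor, lemma in arguments.items():
    print(functor, realize_argument(lemma, frame[functor]))
```

A morphological generator would then turn each pair into a word form, for example tygr with case 4 into "tygry" in the slide's sentence.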

The Annotation Process

4 sublayers
  work on structure first, rest in parallel
Structure
  automatic preprocessing - programmed conversion from analytical layer annotation
Grammatemes
  mostly automatically (based on lower layers’ annotation), manual checking, corrections
Cross-sublayer/cross-layer checking
  partly automatic, then manual

The Annotation Process: Scheme

[Figure: scheme of the annotation process]

Tectogrammatical Annotation Tools

Manual annotation
  4 groups of annotators ~ 4 sublayers
  Special graphical tool (TrEd): customizable graphical tree editor
Preprocessing
  Data from analytical layer, preprocessed
  Online dependency function preassignment

The Annotation Scheme

XML + principles of linear- and tree-based standoff annotation
PML (Prague Markup Language)
Layer schemes (Relax NG)
  PDT/PADT: t(ecto), a(nalytic), m(orphology), …
  English: + phrase-based (p-layer)

PML/XML Annotation Layers

Strictly top-down links
w+m+a can be easily “knitted”
API for cross-layer access (programming)
PML Schema / Relax NG
[z and audio layers: used for spoken data (audio as layer “-1”)]
LFG analogy: f-struct, Φ, c-struct

[Figure: stack of layers including the z-layer and audio; sample tokens: BYL BYS ČELO LESA …]
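
The strictly top-down linking can be pictured with a toy in-memory model: a t-node points to a-nodes, an a-node to an m-node, an m-node to a w-token. Everything below is illustrative; the identifiers, field names and the tiny data set are made up for the sketch and are not the PML API or file layout.

```python
# Hypothetical miniature of the four layers for "Martin se stará o tygry."
W = {"w1": "Martin", "w2": "se", "w3": "stará", "w4": "o", "w5": "tygry"}
M = {
    "m1": {"w": "w1", "lemma": "Martin"},
    "m2": {"w": "w2", "lemma": "se"},
    "m3": {"w": "w3", "lemma": "starat"},
    "m4": {"w": "w4", "lemma": "o"},
    "m5": {"w": "w5", "lemma": "tygr"},
}
A = {"a1": {"m": "m1"}, "a2": {"m": "m2"}, "a3": {"m": "m3"},
     "a4": {"m": "m4"}, "a5": {"m": "m5"}}
# Each t-node links down to one lexical a-node and any auxiliary a-nodes.
T = {
    "t1": {"t_lemma": "starat_se", "lex": "a3", "aux": ["a2"]},
    "t2": {"t_lemma": "Martin", "lex": "a1", "aux": []},
    "t3": {"t_lemma": "tygr", "lex": "a5", "aux": ["a4"]},
}


def word_form(a_id):
    """Follow a -> m -> w and return the token."""
    return W[M[A[a_id]["m"]]["w"]]


def surface_forms(t_id):
    """Collect the tokens a t-node is linked to, lexical and auxiliary."""
    node = T[t_id]
    return {"lex": word_form(node["lex"]),
            "aux": [word_form(a) for a in node["aux"]]}


print(surface_forms("t3"))
```

Because links only point downwards, lower layers never need to know about the layers above them.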

The Prague Markup Language: Example

m-layer data, linked to w-layer:
<m id="m-tr/_12941_01_00013.fs-s1w4">
  <src.rf>manual</src.rf>
  <w>
    <dest.rf>w#w-tr/_12941_01_00013.fs-s1w4</dest.rf>
    <trans>basic</trans>
  </w>
  <form>pocházela</form>
  <lemma>pocházet_:T</lemma>
  <tag>VpQW---XR-AA---</tag>
</m>
<m id="m-tr/_12941_01_00013.fs-s1w5"> ...

Pointer to w-layer
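
As a rough illustration, the first m-layer element shown above can be read with Python's standard XML parser. The sketch assumes the snippet exactly as printed on the slide; real PML files carry a header and an XML namespace that are not handled here, and the dictionary keys are chosen for the example.

```python
import xml.etree.ElementTree as ET

SNIPPET = """<m id="m-tr/_12941_01_00013.fs-s1w4">
  <src.rf>manual</src.rf>
  <w>
    <dest.rf>w#w-tr/_12941_01_00013.fs-s1w4</dest.rf>
    <trans>basic</trans>
  </w>
  <form>pocházela</form>
  <lemma>pocházet_:T</lemma>
  <tag>VpQW---XR-AA---</tag>
</m>"""

m = ET.fromstring(SNIPPET)
record = {
    "id": m.get("id"),
    "form": m.findtext("form"),
    "lemma": m.findtext("lemma"),
    "tag": m.findtext("tag"),
    # The pointer to the w-layer sits inside the <w> element.
    "w_ref": m.find("w").findtext("dest.rf"),
}

# Split the pointer at '#' into the part before it and the target id.
prefix, w_id = record["w_ref"].split("#", 1)
print(record["form"], record["lemma"], record["tag"])
print(prefix, w_id)
```

The tag string has 15 positions, which is why the generation example uses 15-character tag patterns.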

Searching the Treebanks

TrEd extension: PML-TQ
  Backend: database server
  Frontend: TrEd or Web browser
Web access: http://euler.ms.mff.cuni.cz:8111
  Sample data (Czech, English [soon]): anonymous / anonymous
  Full access (LSA 2011 participants only, 2011): LSA2011 / UC.Boulder
  Full access: licence needed for the corpora
Available later this year at http://www.lindat.cz

Using the Results: Parsing

Several parsers of Czech
  Analytical layer dependency syntax
  Trained on PDT 1.0 data, 1.2 mil. words
  Collins(98), Charniak(00), Žabokrtský(02), Ribarov(04), Nivre(05), Zeman(05), McDonald(05), CoNLL’06 (19 parsers)
Best results
  accuracy: percent of correct dependencies: 84-85% for a single parser, > 86% for a combination
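
"Percent of correct dependencies" can be computed as below. This is a sketch of the plain unlabeled measure under the assumption that each word is scored by whether its predicted head matches the gold head; the head indices in the example are invented toy data, not PDT annotation, and the function name is hypothetical.

```python
def attachment_accuracy(gold_heads, predicted_heads):
    """Percentage of words whose predicted head equals the gold head."""
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("sequences must have the same length")
    if not gold_heads:
        raise ValueError("empty input")
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return 100.0 * correct / len(gold_heads)


# Toy five-word sentence; 0 marks the root, other numbers are head positions.
gold = [3, 3, 0, 3, 4]
predicted = [3, 3, 0, 3, 3]   # one wrong attachment
print(attachment_accuracy(gold, predicted))
```

Over a whole test set, the counts are summed across sentences before dividing.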

Tectogrammatical Parsing

Newest results:
  4 phases
  Transformation-based learning (FnTBL)
  Largely language independent
  Coreference: >90%

Results per attribute (m- and a-layer input: manual vs. auto):

Attribute       manual    auto
structure       89.3 %    76.4 %
functor         85.5 %    77.4 %
val_frame.rf    92.3 %    90.9 %
t_lemma         93.5 %    90.9 %
nodetype        94.5 %    92.6 %
gram/sempos     93.8 %    91.5 %
a/lex.rf        96.5 %    95.1 %
a/aux.rf        94.3 %    90.3 %
is_member       94.3 %    89.5 %
is_generated    96.6 %    95.2 %
deepord         68.0 %    66.7 %

Tectogrammatical Layer in Machine Translation

The Translation (“Vauquois”) triangle

[Figure labels: source (Cz), target (En); Morphology, Surface Syntax, Tectogrammatical Representation; transfer; Generation]

Dependency trees in MT

According to his opinion UAL's executives were misinformed about the financing of the original transaction.

Podle jeho názoru bylo vedení UAL o financování původní transakce nesprávně informováno.

Transfer:
- structure (~0)
- lexical
- functions
- grammatical

Analytical Layer Correspondence

Tectogrammatical Correspondence

The [Homestead’s] only remaining baker bakes the most famous rolls to the north of Long River.

‘al-xabaaz ‘al-’axiir ‘al-baaqii [fii Homestead] yaśmacu ‘ashhar ‘al-kruasaanaat ilaa shimaal min Long River.

Valency and Translation

leave:
  leave-1: to leave [from] somewhere
  leave-2: to leave sth for sb

Translating (from English into Czech):
  which equivalent to choose?
    nechat vs. odjet/opustit
  which prepositions, cases, ... to use?
    accusative vs. “z” (“from”) with genitive vs. ...?

Valency and Translation

English frame                   Czech frame
leave-1  ACT() PAT() LOC()      nechat-3  ACT(.1) PAT(.4) LOC()
leave-2  ACT() DIR1(from.)      odjet-1   ACT(.1) DIR1(z.[.2])
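
Read as a lookup table, the two frame pairs answer both questions from the previous slide: which Czech equivalent to pick and which surface form each argument takes. The sketch below hard-codes only the two pairs shown here; the table layout and function name are hypothetical, and choosing the right English frame in the first place is assumed to have happened already.

```python
# English frame id -> (Czech frame id, Czech surface form per functor),
# copied from the two frame pairs on this slide.
FRAME_MAP = {
    "leave-1": ("nechat-3", {"ACT": ".1", "PAT": ".4", "LOC": ""}),
    "leave-2": ("odjet-1", {"ACT": ".1", "DIR1": "z.[.2]"}),
}


def transfer(english_frame, functors):
    """Return the Czech frame and the surface form for each given functor."""
    czech_frame, forms = FRAME_MAP[english_frame]
    return czech_frame, [(f, forms[f]) for f in functors if f in forms]


print(transfer("leave-2", ["ACT", "DIR1"]))
print(transfer("leave-1", ["ACT", "PAT"]))
```

For leave-2 the DIR1 argument comes out with the preposition "z" and case 2 (genitive); for leave-1 the PAT argument comes out in case 4 (accusative).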

To summarize…

PDT is/has (a)…
Dependency-based treebanking project
  Czech (other languages: Eng, Ar)
  Ongoing projects (other inst.): Italian, Old Greek, Latin, …
~ 1 mil. words
  sufficient size for ML experiments
4 layers of annotation
  (token, morphology, syntax, deep syntax/semantics++)
  independent and full information at all levels, but...
  interlinked (for the development of parsers/generators)
Valency dictionary
  integrated (links from data)

Some pointers

Current version of PDT: v2.0, LDC2006T01
  all three levels, 1.9/1.5/0.8 Mwords
  http://ufal.mff.cuni.cz/pdt2.0
http://ufal.mff.cuni.cz
  Research -> Corpora (Treebank(s))
http://www.ldc.upenn.edu
  LDC2001T10 (PDT v1.0), LDC2004T23 (PADT 1.0), LDC2004T25 (PCEDT 1.0), LDC2006T01 (PDT 2.0)
http://www.clsp.jhu.edu: Workshop 2002
  Using TL for MT Generation
http://ufal.mff.cuni.cz/pedt
  1st version of English dep. Treebank
http://ufal.mff.cuni.cz/~hajic/lsa2011.html
  This workshop page, many links to resources, tools