Prague Dependency Treebank 1.0

Prague Dependency Treebank 1.0 CD-ROM PRESENTATION Dec 18, 2000

Transcript of Prague Dependency Treebank 1.0

Page 1: Prague Dependency Treebank 1.0

Prague Dependency Treebank 1.0

CD-ROM PRESENTATION Dec 18, 2000

Page 2: Prague Dependency Treebank 1.0

Functional Generative Description

Page 3: Prague Dependency Treebank 1.0

Functional Generative Description

theoretical framework based on the findings of European structural linguistics, esp. of the classical Prague School

methodological requirements of a formal description

levels:

tectogrammatical (underlying) representations (TRs) with dependency-based syntax

morphemics

phonemics and phonetics

TRs (see Sgall, Hajičová and Panevová 1986, formally specified by Petkevič, also in a declarative way)

Page 4: Prague Dependency Treebank 1.0

Dependency tree

My younger brother arrived there yesterday.

Linearized form, one-to-one relation:

((I)Appurt (younger)Rstr brother)Act arrive.Pret.Indic (Dir there) (Temp yesterday)
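The one-to-one relation between the tree and its linearized form can be illustrated with a minimal sketch. The `Node` class and the `linearize` function are hypothetical names introduced here, not part of the PDT tools. The sketch assumes that a dependent standing to the left of its head carries its functor after the closing parenthesis, and a dependent to the right carries it after the opening one, as in the example above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A node of a dependency tree (hypothetical structure for illustration)."""
    label: str                      # lexical label, possibly with grammatemes
    functor: str = ""               # dependency relation to the head
    left: List["Node"] = field(default_factory=list)   # dependents left of the node
    right: List["Node"] = field(default_factory=list)  # dependents right of the node


def linearize(node: Node, side: Optional[str] = None) -> str:
    """Return the bracketed notation; the functor is written on the side facing the head."""
    parts = [linearize(dep, "left") for dep in node.left]
    parts.append(node.label)
    parts += [linearize(dep, "right") for dep in node.right]
    body = " ".join(parts)
    if side == "left":       # head is to the right: functor follows the bracket
        return f"({body}){node.functor}"
    if side == "right":      # head is to the left: functor opens the bracket
        return f"({node.functor} {body})"
    return body              # the root has no functor


brother = Node("brother", "Act", left=[Node("I", "Appurt"), Node("younger", "Rstr")])
root = Node(
    "arrive.Pret.Indic",
    left=[brother],
    right=[Node("there", "Dir"), Node("yesterday", "Temp")],
)
print(linearize(root))
```

For the tree built above, the function reproduces the linearized form shown on the slide.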

Page 5: Prague Dependency Treebank 1.0

Dependency Tree

labels - lexical meanings (abstract symbols) with indices

functors - subscripts at parentheses, oriented towards the head

grammatemes - values of morphological categories: Tense, Modality, Number, Definiteness, etc.

projectivity

valency

arguments (inner participants) and adjuncts (circumstantials or 'free modifications')

obligatory and optional with a given head, deletable or not
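Projectivity, listed above, can be made concrete with a short check. This is a sketch under the usual definition (every word lying between a head and its dependent must be dominated by that head); the function name and the head-array encoding are assumptions made here for illustration, and the input is assumed to be a well-formed tree.

```python
from typing import List


def is_projective(heads: List[int]) -> bool:
    """heads[i] is the position of the head of word i; the root has head -1."""

    def dominated_by(node: int, ancestor: int) -> bool:
        # Walk up from node towards the root looking for ancestor.
        while node != -1:
            if node == ancestor:
                return True
            node = heads[node]
        return False

    for dep, head in enumerate(heads):
        if head == -1:
            continue
        lo, hi = sorted((dep, head))
        # Every word strictly between head and dependent must hang under head.
        for between in range(lo + 1, hi):
            if not dominated_by(between, head):
                return False
    return True


# "My younger brother arrived there yesterday."
#   0   1       2       3      4      5
print(is_projective([2, 2, 3, -1, 3, 3]))
```

A head array such as `[2, 3, -1, 2]` fails the check, because the word at position 2 lies between positions 1 and 3 without depending on position 3.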

Page 6: Prague Dependency Treebank 1.0

Dependency Tree

participants (arguments) of verbs

Actor/Bearer (underlying subject)

Objective (Patient, underlying direct object)

Addressee (underlying indirect object)

Effect ('second' object: to choose so. as sth.)

Origin (to make sth. out of sth.)

adjuncts

Locative, several Directional and Temporal modifications

Condition, Means, Manner, etc.

Page 7: Prague Dependency Treebank 1.0

Dependency Tree

Complementations dependent mainly on nouns

inner participants

Material (Partitive): two baskets of sth.

Identity: the river Danube; the notion of operator

free modifications

Possession (Appurtenance): my table; Jim's brother

Restrictive: rich man

Descriptive: the Swedes, who are a Scandinavian nation

Page 8: Prague Dependency Treebank 1.0

Dependency Tree

syntactic grammatemes

Loc, Dir - in, on, under, between...

Regard - with, without

operational (testable) criteria for distinguishing arguments from adjuncts, from each other

deletability (dialogue test)

Page 9: Prague Dependency Treebank 1.0

Simplified valency frames

read V Act Addr Obj

change V Act Obj Orig Eff

give V Act Addr Obj

brother N Appurt

man N

glass N Material

full A Material

(obligatory complementations in blue)
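The simplified frames above map naturally onto a small lexicon. The following is a sketch only: `VALENCY_FRAMES` and `in_frame` are hypothetical names, and the obligatory/optional distinction is not encoded because the colour marking did not survive in this transcript.

```python
from typing import Dict, Tuple

# lemma -> (part of speech, functors listed in the simplified frame)
VALENCY_FRAMES: Dict[str, Tuple[str, Tuple[str, ...]]] = {
    "read": ("V", ("Act", "Addr", "Obj")),
    "change": ("V", ("Act", "Obj", "Orig", "Eff")),
    "give": ("V", ("Act", "Addr", "Obj")),
    "brother": ("N", ("Appurt",)),
    "man": ("N", ()),
    "glass": ("N", ("Material",)),
    "full": ("A", ("Material",)),
}


def in_frame(lemma: str, functor: str) -> bool:
    """True if the functor is listed in the frame of the given lemma."""
    _pos, slots = VALENCY_FRAMES[lemma]
    return functor in slots


print(in_frame("give", "Addr"))   # Addr is listed for "give"
print(in_frame("man", "Appurt"))  # the frame of "man" is empty
```

A lookup of this kind only tells whether a complementation is listed in the frame; adjuncts such as Temp or Dir can still attach to the head.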

Page 10: Prague Dependency Treebank 1.0

Topic-focus articulation

contextual boundness: main verb CB/NB (T/F), dependents to the left/right

communicative dynamism: left-right (mother, sisters, transitive), partial ordering

underlying word order: left-right, linear ordering

the left-to-right order of nodes together with the index T or (prototypically) F indicates the TFA of the sentence (of the TR)

[Figure: dependency tree with nodes indexed T]

Page 11: Prague Dependency Treebank 1.0

Topic-focus articulation

TFA - one of the basic aspects of underlying structures

[Figure: dependency tree of the example sentence with nodes indexed T or F]

Page 12: Prague Dependency Treebank 1.0

Complex sentence

a subordinated (dependent) clause (i.e. its main verb) depends on a word contained in its governing clause

My brother, whom you know, arrived there yesterday.

Page 13: Prague Dependency Treebank 1.0

Complex sentence

function words (synsemantic) are viewed as function morphemes, syntactically fixed to certain lexical (autosemantic) words - prepositions and articles to nouns, conjunctions and auxiliaries to verbs

Martin came there late, since he had to accompany his sick mother.

Page 14: Prague Dependency Treebank 1.0

Complex sentence

Martin arrived late to the session, since he had to accompany his sick mother.

schematically (morphemes):

Martin arrive.ed late to the session since he have.ed to accompany he.s sick mother.

dot - close connection of morphemes ('semes')

Page 15: Prague Dependency Treebank 1.0

deleted items restored

order of items - difference between 'underlying' and surface (morphemic) word order

transductive components - Panevová, Oliva, Borota

coordination (multidimensional)

Jim and Mary, who have two children, went to Boston.

the linearized notation is adequate:

((Jim Mary)Conj ((who)Act have (Pat (two)Rstr children)))Act went (Dir Boston)

structures close to Boolean, i.e. no complex 'innate properties' specific for natural language are needed.

Page 16: Prague Dependency Treebank 1.0

Prague Dependency Treebank - corpus annotation

an intermediate level - 'analytical' representations

dependency trees, not always projective

nodes for all word tokens, even for punctuation marks

tectogrammatical tree: coordinating conjunction as the head

Page 18: Prague Dependency Treebank 1.0

Morphological Layer

Page 19: Prague Dependency Treebank 1.0

ACKNOWLEDGEMENTS

Page 20: Prague Dependency Treebank 1.0

ANNOTATED CORPORA

PDT version 1.0, 2000 (1996 - 2000)

Penn Treebank, release 3, 1999 (1989 - 1999)

Page 21: Prague Dependency Treebank 1.0

TAG SETs

Czech - ambiguous inflective language

nový, nového, novému, novém, novým, nová, nové, novou, nových, novým, novými, … novější, novějšího, novějšímu, novějším, … nejnovější, nejnovějšího, nejnovějšímu, nejnovějším, … nejnovějších, nejnovějším, …

English - language with poor inflection

work, works, worked, working

Page 23: Prague Dependency Treebank 1.0

TEXT SOURCES

PDT:

Lidové noviny

Mladá Fronta Dnes

Vesmír

Českomoravský Profit

...taken from the Czech National Corpus

Penn Treebank:

'88, '89 WSJ articles

Air Travel Information System transcripts

Brown Corpus

Switchboard transcripts

Page 24: Prague Dependency Treebank 1.0

ANNOTATION STRATEGY - Penn Treebank

TEXT

Ken Church's stochastic tagger, Eric Brill's transformation tagger

corrections by annotator (GNU Emacs Lisp based package)

Page 25: Prague Dependency Treebank 1.0

ANNOTATION STRATEGY - PDT

Automatic Morphological Analyzer (AMA)

two independent annotators; Linux, Win tools

differences resolved by third annotator

comparison with the current AMA; manual resolution; Win tools
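The step in which differences between the two independent annotators are passed to a third one can be sketched as follows. This is an illustration only, not the actual PDT tooling: `split_by_agreement` is a hypothetical helper, and the second annotator's tag for "o" is invented for the example.

```python
from typing import Dict, List, Tuple


def split_by_agreement(
    first: List[str], second: List[str]
) -> Tuple[Dict[int, str], List[Tuple[int, str, str]]]:
    """Compare two annotations of the same tokens.

    Returns the tags both annotators agree on (by token position) and the
    list of disagreements that would go to the third annotator.
    """
    if len(first) != len(second):
        raise ValueError("both annotators must tag the same tokens")
    agreed: Dict[int, str] = {}
    disputed: List[Tuple[int, str, str]] = []
    for position, (tag_a, tag_b) in enumerate(zip(first, second)):
        if tag_a == tag_b:
            agreed[position] = tag_a
        else:
            disputed.append((position, tag_a, tag_b))
    return agreed, disputed


# Tokens: Pokus / o / zázrak; the disagreement on "o" is made up for illustration.
annotator_1 = ["NNIS1-----A----", "RR--4----------", "NNIS4-----A----"]
annotator_2 = ["NNIS1-----A----", "RR--6----------", "NNIS4-----A----"]
agreed, disputed = split_by_agreement(annotator_1, annotator_2)
print(disputed)
```

Only the disputed positions need manual resolution; the agreed tags can be carried over directly.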

Page 26: Prague Dependency Treebank 1.0

INTERNAL FORMAT

PDT: SGML coding, csts dtd

Penn Treebank: word/tag(|tag)*

Page 27: Prague Dependency Treebank 1.0

SAMPLES

<s id="ln95040:020-p1s1">
<f>Pokus<l>pokus<t>NNIS1-----A----
<f>o<l>o<t>RR--4----------
<f>zázrak<l>zázrak<t>NNIS4-----A----
<d>.<l>.<t>Z:-------------

The/DT envelope/NN arrives/VBZ in/IN the/DT mail/NN ./.
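The Czech tags in the sample are positional: each of the 15 characters encodes one category, and "-" marks a category that does not apply. A minimal decoding sketch follows. The position names are not given on the slides; they follow the usual description of the 15-position PDT tag set and are an assumption that should be checked against the CD-ROM documentation. `decode_tag` is a hypothetical helper.

```python
from typing import Dict

# Assumed names of the 15 tag positions (not stated in the presentation).
POSITIONS = (
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Var",
)


def decode_tag(tag: str) -> Dict[str, str]:
    """Map a positional tag to {category: value}, skipping unused positions."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} characters, got {len(tag)}")
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}


print(decode_tag("NNIS1-----A----"))
print(decode_tag("Z:-------------"))
```

Compared with the flat Penn Treebank tags (DT, NN, VBZ), this layout keeps each morphological category separately addressable.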

Page 28: Prague Dependency Treebank 1.0

CONVERSION

SGML coding -> word/tag

SGML coding -> word/lemma/tag

pdt2wsj.pl

pdt2wsjFLT.pl
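The idea of the conversion can be sketched in a few lines. This is not a reimplementation of pdt2wsj.pl or pdt2wsjFLT.pl, whose actual behaviour is not described in the slides; `csts_to_wsj` is a hypothetical function that handles only the simple case of one `<l>` and one `<t>` per token, as in the sample.

```python
import re

# One token: <f> (word form) or <d> (punctuation), then lemma <l> and tag <t>.
TOKEN = re.compile(r"<[fd][^>]*>([^<]*)<l>([^<]*)<t>([^<\s]*)")


def csts_to_wsj(csts: str, with_lemma: bool = False) -> str:
    """Turn a CSTS-style sentence into word/tag or word/lemma/tag notation."""
    out = []
    for form, lemma, tag in TOKEN.findall(csts):
        fields = (form, lemma, tag) if with_lemma else (form, tag)
        out.append("/".join(fields))
    return " ".join(out)


sample = (
    '<s id="ln95040:020-p1s1">'
    "<f>Pokus<l>pokus<t>NNIS1-----A----"
    "<f>o<l>o<t>RR--4----------"
    "<f>zázrak<l>zázrak<t>NNIS4-----A----"
    "<d>.<l>.<t>Z:-------------"
)
print(csts_to_wsj(sample))
print(csts_to_wsj(sample, with_lemma=True))
```

The `with_lemma` switch corresponds to the two target notations named on the slide, word/tag and word/lemma/tag.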

Page 29: Prague Dependency Treebank 1.0

DATA SIZE

                          # word tokens    # sentences
PDT 1.0                   1 730K           112K
Penn Treebank release 3   4 600K           350K

Page 30: Prague Dependency Treebank 1.0

DATA SETs of MORPHOLOGICALLY ANNOTATED DATA

for tagging only (# tokens / # sentences)

training data            1 470K / 95K
development test data    130K / 8K
evaluation test data     127K / 8K

for parsing (preprocessing step)

training data            475K / 29K
development test data    130K / 8K
evaluation test data     127K / 8K
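As a quick consistency check, the tagging splits can be summed and compared with the overall PDT 1.0 size. The sketch below uses only the rounded figures from the slides (in thousands); the names `TAGGING_SPLITS`, `totals` and `token_share` are introduced here for illustration.

```python
from typing import Dict, Tuple

# (tokens, sentences) in thousands, as given for the tagging-only data sets
TAGGING_SPLITS: Dict[str, Tuple[int, int]] = {
    "training": (1470, 95),
    "development test": (130, 8),
    "evaluation test": (127, 8),
}


def totals(splits: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum tokens and sentences over all splits."""
    tokens = sum(t for t, _ in splits.values())
    sentences = sum(s for _, s in splits.values())
    return tokens, sentences


def token_share(splits: Dict[str, Tuple[int, int]], name: str) -> float:
    """Fraction of all tokens that falls into the named split."""
    tokens, _ = totals(splits)
    return splits[name][0] / tokens


print(totals(TAGGING_SPLITS))
print(round(token_share(TAGGING_SPLITS, "training"), 2))
```

The sums come to 1 727K tokens and 111K sentences, which agrees with the 1 730K / 112K reported for PDT 1.0 up to rounding; about 85% of the tokens are training data.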

Page 31: Prague Dependency Treebank 1.0

TOOLS

Automatic Morphological Analyser/Generator of Czech

HMAnalyze.pl, HMGenerate.pl

Dictionary: CZE_a

Remote Access

Czech Taggers

HMM

Exponential

Page 33: Prague Dependency Treebank 1.0

Analytical Layer in PDT

Page 34: Prague Dependency Treebank 1.0

Introduction

Input: morphologically tagged sentences

Graph Editor: “user-friendly” software

Output: ATS structure - 'surface' syntax tree structure, nodes labelled by the analytical functions

Page 35: Prague Dependency Treebank 1.0

Two stages (chronologically)

(A) manual 'analytic' annotation (ATS); training data for (B)(a)

(B) (a) semiautomatic procedure (Collins' parser)

(B) (b) manual correction of (B)(a)

Page 36: Prague Dependency Treebank 1.0

Constraints and limitations

any string has a node of its own

word-form, punctuation mark, etc.

AuxV, AuxP, AuxC, AuxX, AuxG…

reflecting the coordination and apposition relations

so-called third dimension of the graph in the plain tree (X_Co, X_Ap, X_Pa, where X is one of the analytic functions, such as Sb, Obj, Adv, etc.)
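The suffixed labels can be split back into the base analytic function and the marker of the 'third dimension'. A minimal sketch, assuming the label is always written as base, underscore, suffix; `split_afun` is a hypothetical helper, not part of the Graph Editor.

```python
from typing import Optional, Tuple

# Suffixes named on the slide: X_Co, X_Ap, X_Pa
SUFFIXES = ("Co", "Ap", "Pa")


def split_afun(label: str) -> Tuple[str, Optional[str]]:
    """Split e.g. 'Sb_Co' into ('Sb', 'Co'); labels without a known suffix are returned whole."""
    base, separator, suffix = label.rpartition("_")
    if separator and suffix in SUFFIXES:
        return base, suffix
    return label, None


print(split_afun("Sb_Co"))
print(split_afun("Adv"))
```

Labels whose underscore is part of the name itself, such as Ex_D, are left intact because "D" is not one of the listed suffixes.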

Page 37: Prague Dependency Treebank 1.0

Constraints and limitations

no missing nodes (on the surface) can be added

analytic function Ex_D is used

relations between semi-automatic and manual procedure

80% of edges are established correctly automatically
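A figure such as "80% of edges correct" is typically obtained by comparing, word by word, the head proposed by the parser with the head in the manually corrected tree. The sketch below shows that computation under this assumption; the slides do not say exactly how the 80% was measured, and `edge_accuracy` is a hypothetical name.

```python
from typing import List


def edge_accuracy(predicted_heads: List[int], gold_heads: List[int]) -> float:
    """Share of words whose automatically assigned head matches the manual one."""
    if len(predicted_heads) != len(gold_heads) or not gold_heads:
        raise ValueError("need two non-empty head lists of equal length")
    correct = sum(p == g for p, g in zip(predicted_heads, gold_heads))
    return correct / len(gold_heads)


# Toy five-word sentence; one of the five edges differs from the manual annotation.
print(edge_accuracy([2, 2, 3, -1, 3], [2, 2, 3, -1, 2]))
```

In the toy example four of five edges match, so the annotator would have to re-attach one node by hand.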

Page 38: Prague Dependency Treebank 1.0

Project organization

team consisting of 5-6 annotators

handbook for ATS structure annotation

1999: 100000 sentences on ATS

tectogrammatical annotation follows

Page 39: Prague Dependency Treebank 1.0

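První restituční zákon českého parlamentu se do sněmovních lavic může vrátit jako bumerang.

(The first restitution law of the Czech parliament may return to the parliamentary benches like a boomerang.)

[Figure: analytical tree of the sentence; surviving node labels: Adv, AuxT]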

Page 41: Prague Dependency Treebank 1.0

From the Analytical towards the Tectogrammatical layer

Page 42: Prague Dependency Treebank 1.0

Introduction

ATS annotation

nodes: word forms, punctuation, graphical symbols

edges: surface relations

TGTS annotation

nodes: autosemantic words, deletions

edges: deep layer functions

Page 43: Prague Dependency Treebank 1.0

Annotation process

Input Czech sentence

Tokenization

Morphological tagging and lexical disambiguation

Syntactic parsing and analytic function assignment

ATS (PDT 1.0)

Tree structure pruning

Attribute assignments

TGTS

Page 44: Prague Dependency Treebank 1.0

Transition procedure

deterministic procedure operating on trees

macro language for Graph Editor (C++ like)

automatic changes & tools for annotators

Requirements

new attributes for tectogrammatical layer

ATS is recoverable from TGTS

automatized to a maximally high degree

Page 45: Prague Dependency Treebank 1.0

New attributes

trlemma - lemma of the original node or lemma composed of joined nodes

morphological grammatemes - gender, number, degree of comparison, tense, aspect, iterativeness, verbal modality, deontic modality, sentence modality

position of the node

functor, topic-focus articulation, syntactic grammateme, type of relation (dependency, coordination, apposition), phraseme, deletion, quoted word, direct speech, coreference, antecedent

Page 46: Prague Dependency Treebank 1.0

Tree Structure Pruning

U toho, kdo začíná opravdu od nuly, není daňový výnos pro stát podstatný.

For those, who start actually at zero, the tax outcome for the state is not substantial.

Page 47: Prague Dependency Treebank 1.0

Tree Structure Pruning

U toho, kdo začíná opravdu od nuly, není daňový výnos pro stát podstatný.

For those, who start actually at zero, the tax outcome for the state is not substantial.

[Figure: pruned tree of the sentence; surviving node label: REG]

Page 48: Prague Dependency Treebank 1.0

Verbal Nodes

•… entrepreneurs should have (their) taxes …

•… podnikatelé by měli mít daně …

PRED

verbmod=CDN

deontmod=HRT

Page 49: Prague Dependency Treebank 1.0

Attribute Assignments

prepositions stored as fw attribute

quoted words

clause in quotes -> DSP

one pair of quotes in the sentence -> DSPP

string in quotes -> QUOT

gender, number, tense, degcmp, aspect - default values
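Tree structure pruning combined with the fw attribute can be sketched as follows: a preposition node is removed from the tree, its dependents are re-attached to its head, and the preposition is remembered in `fw`. This is an illustration in Python, not the Graph Editor macro language; `TNode` and `prune` are hypothetical names, and the tree fragment and its analytic functions are simplified assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TNode:
    form: str
    afun: str                                   # analytic function of the node
    children: List["TNode"] = field(default_factory=list)
    fw: Optional[str] = None                    # function word stored as attribute


def prune(node: TNode) -> TNode:
    """Remove AuxP (preposition) nodes, lifting their dependents and recording fw."""
    kept: List[TNode] = []
    for child in node.children:
        prune(child)                            # prune the subtree first
        if child.afun == "AuxP":
            for grandchild in child.children:
                grandchild.fw = child.form.lower()
                kept.append(grandchild)         # re-attach to the preposition's head
        else:
            kept.append(child)
    node.children = kept
    return node


# Simplified fragment of "U toho ... není ... výnos ..." (labels are illustrative).
root = TNode("není", "Pred", [
    TNode("U", "AuxP", [TNode("toho", "Adv")]),
    TNode("výnos", "Sb"),
])
prune(root)
print([(child.form, child.fw) for child in root.children])
```

Because the preposition survives as an attribute of the remaining node, the analytic structure stays recoverable, in line with the requirement that ATS be recoverable from TGTS.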

Page 50: Prague Dependency Treebank 1.0

Macros for Annotators

keyboard shortcuts (in Graph editor)

structure changes

hide/recover nodes

merge nodes

add new nodes

functor assignments

Page 51: Prague Dependency Treebank 1.0

Manual annotation

structure checking

functors

deletions of obligatory modifications

feedback for formulating the handbook for annotators

Page 53: Prague Dependency Treebank 1.0

Tectogrammatical Layer

Page 55: Prague Dependency Treebank 1.0

[Figure: tectogrammatical tree; nodes carry the topic-focus indices C, T and F]

Page 56: Prague Dependency Treebank 1.0

Jirka se včera opil do němoty a Honza dneska.

George himself yesterday drank to silence and Honza today.

Page 57: Prague Dependency Treebank 1.0

Attributes of Coreferential relations

only in MC

attribute values

coref - the lemma of the antecedent

corsnt - NIL - in the same sentence; PREV1 ... PREVi - position of the sentence which includes the antecedent

grammatical coreference

antec - the functor of the antecedent

Page 58: Prague Dependency Treebank 1.0

Example

Honza slíbil přijít včas.

Honza promised to come in time.

coref: Honza

corsnt: NIL

cornum: 1

antec: ACT
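The corsnt value can be turned into a sentence position with a few lines of code. The sketch assumes that PREVk means "k sentences before the current one", which is one reading of the slide rather than a documented rule; `antecedent_sentence_index` is a hypothetical helper.

```python
import re


def antecedent_sentence_index(current: int, corsnt: str) -> int:
    """Return the index of the sentence that contains the antecedent.

    NIL   -> the antecedent is in the current sentence
    PREVk -> assumed here to mean k sentences back
    """
    if corsnt == "NIL":
        return current
    match = re.fullmatch(r"PREV(\d+)", corsnt)
    if match is None:
        raise ValueError(f"unexpected corsnt value: {corsnt!r}")
    return current - int(match.group(1))


# The example above: antecedent "Honza" is in the same sentence.
print(antecedent_sentence_index(5, "NIL"))
print(antecedent_sentence_index(5, "PREV2"))
```

Together with coref (the lemma) and antec (the functor), the resolved sentence index narrows the antecedent down to a node in that sentence.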