LREC 2010, Malta, 17-23 May

Centre for Language Technology

The DAD corpora and their uses

Costanza Navarretta

Funded by Danish Research Councils – Sussi Olsen, CST

The DAD corpora

Parallel or comparable texts and dialogues in Danish and Italian, annotated with information on third-person singular neuter personal pronouns and singular demonstrative pronouns (3sn).

The focus is on abstract anaphoric uses (abstract pronouns, AA), i.e. cases where the antecedent is a copula predicate, a verbal phrase, a clause or a discourse segment.

Abstract Anaphora in English

it, this, that

• Strong preference for the use of demonstrative pronouns to refer to entities introduced in discourse by clauses: 83.7% of occurrences in a written corpus (Webber 1991).

Similar figures in English written and spoken data (i.a. Byron & Allen 1998, Gundel et al. 2003, 2005).

Abstract Anaphora in Danish

• written Danish:
  • det (it/this/that)
  • dette (this)

• spoken Danish:
  • unstressed det (it)
  • d'et (this/that)
  • d'et h'er (this)
  • d'et d'er (that)
  • dette (this), very seldom

Abstract Anaphora in Italian

• zero anaphora (subject pro-drop language);

• clitics -lo, -ne, -ci (it);

• personal pronouns lo, ne, ci;

• demonstrative pronouns: ciò (this/that), questo (this), quello (that).

The Annotations

Texts: structural information, PoS and lemma (various tagsets)

Spoken data: PoS, (lemma), stress, (prosody, phrases), speakers, interaction segments, utterances, timestamps

All data (Navarretta & Olsen, LREC-2008):
• 3sn: pronominal function (9), syntactic function
• anaphoric occurrences: referential links and their type, antecedents, syntactic type of antecedents, anaphoric distance
• when AA, also semantic type of reference
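
The annotation layers listed for all data could be held in a record along the following lines. This is only a sketch: the class name, field names and example values are hypothetical and are not the labels of the DAD annotation scheme, which is described in Navarretta & Olsen (LREC-2008). The unit of anaphoric distance is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PronounAnnotation:
    """One annotated 3sn pronoun occurrence (hypothetical field names)."""
    token: str                                   # surface form, e.g. "det"
    pronominal_function: str                     # function label (illustrative values)
    syntactic_function: str                      # e.g. "subject"
    # The fields below apply only to anaphoric occurrences.
    link_type: Optional[str] = None              # type of referential link
    antecedent: Optional[str] = None             # antecedent text span
    antecedent_syntactic_type: Optional[str] = None  # e.g. "clause", "VP"
    anaphoric_distance: Optional[int] = None     # assumed unit: clauses
    semantic_type: Optional[str] = None          # only for abstract anaphora (AA)

    def is_anaphoric(self) -> bool:
        """An occurrence counts as anaphoric here if it has an antecedent."""
        return self.antecedent is not None


# Illustrative values only; none of these labels come from the corpora.
example = PronounAnnotation(
    token="det",
    pronominal_function="abstract-anaphor",
    syntactic_function="subject",
    link_type="identity",
    antecedent="at han kommer i morgen",
    antecedent_syntactic_type="clause",
    anaphoric_distance=1,
    semantic_type="event",
)
```

Keeping the anaphora-only fields optional lets non-referential occurrences share the same record type.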

Spoken Corpora (da 100000, it 70000)

1. dialogues from AVIP, Italian map-task corpus, Pisa, Napoli, Bari (ftp://ftp.cirass.unina.it/cirass/avip);

2. dialogues and monologues from the Danish map-task corpus DanPASS (Grønnum, 2006);

3. multiparty spontaneous dialogues from the Danish LANCHART corpus (Gregersen, 2007);

4. transcriptions of TV interviews.

Written corpora (da 60000, it 50000)

1. Pirandello's (1922) stories and Danish translations;

2. parallel Danish and Italian EU texts;

3. articles from the Italian financial newspaper Il Sole 24 Ore;

4. Danish juridical texts;

5. extracts from the Danish general language PAROLE corpus (Keson and Norling-Christensen, 1998).

Pronoun        Non-ref     IA     AA       %   Other   Total

Danish Texts
det                345    152    130     65%      81     708
dette                0     23     71     35%       4      98
total              345    175    201    100%      85     806

Danish Monologues
unstressed          22    107     27     73%      54     210
stressed             1     74     10     27%      45     130
total               23    181     37    100%      99     340

Danish Dialogues
unstressed         158    483    299     59%     467    1407
stressed            10    185    204     41%     197     596
total              168    668    503    100%     664    2003
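
The % column in the Danish table appears to give each pronoun form's share of the AA occurrences within a corpus. A minimal sketch of that calculation, using AA counts copied from the table (the function and variable names are illustrative):

```python
# AA counts copied from the Danish table: corpus -> pronoun form -> AA occurrences
aa_counts = {
    "Danish texts": {"det": 130, "dette": 71},
    "Danish dialogues": {"unstressed": 299, "stressed": 204},
}


def aa_shares(counts):
    """Return each form's share of the AA occurrences, in whole percent."""
    total = sum(counts.values())
    return {form: round(100 * n / total) for form, n in counts.items()}


shares = {corpus: aa_shares(counts) for corpus, counts in aa_counts.items()}
```

For the texts this gives 65% for det and 35% for dette, and for the dialogues 59% for unstressed and 41% for stressed det, as in the table.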

Pronoun          Non-ref     IA    AA           Other   Total

Italian Texts
zero                  34    317    19 (48%)        22     392
clitic                 0    100     2 (5%)          4     106
personal               0    165    12 (30%)         4     181
demonstrative          0     16     7 (17%)         4      27
total                 34    598    40 (100%)       34     706

Italian Dialogues
zero                   1     26    42 (75%)         3      72
clitic                 0     19     0               2      21
personal               0    128    11 (20%)        56     195
demonstrative          0      7     3 (5%)         10      20
total                  1    180    56 (100%)       71     308

Corpus              Antecedent        Pronoun           Total   Pronoun         Total

Danish Texts        Clause            det                  72   dette              60
                    Discourse Seg.                          6                       7
                    C. predicate                           13                       4
                    VP                                     17                       3
                    abstract pron                           5                       3

Danish Monologues   Clause            unstressed det       22   stressed det        4
                    Discourse Seg.                          1                       0
                    VP                                      1                       2
                    C. predicate                           85                      62
                    abstract pron                          19                      20

Danish Dialogues    Clause            unstressed det      165   stressed det      122
                    Discourse Seg.                          8                       3
                    VP                                     52                      35
                    C. predicate                          208                      57
                    abstract pron                         149                      55

Corpus              Antecedent        Pronoun    Total   Pronoun    Total

Italian texts       Clause            zero          17   ciò            4
                    Discourse S                      1                  2
                    Clause            lo, ne        10   questo         -
                    Discourse S                      1                  1
                    C. Predicate                     1                  -

Italian dialogues   Clause            zero          41   questo         3
                    Clause            lo             3
                    VP                               5                  -
                    C. Predicate      ci             2                  -

Discussion

Many factors influence the use of pronouns; see i.a. Hajičová et al. (1990), Borthen et al. (1997), Kaiser (2000), Kaiser and Trueswell (2004), Gundel et al. (2003), Navarretta (2002, 2005).

Navarretta (WARII-2008): the differences in the use of AA pronouns in Danish and Italian with respect to English are systematic.

Language-specific characteristics can partly explain these differences.

Pronominal Systems

Pronouns for inanimate entities
• English: 1 gender
• Danish and Italian: 2 inanimate genders
  Danish: common and neuter; only the latter can be an abstract anaphor
  Italian: feminine and masculine; only the latter can be an abstract anaphor

In English it is more necessary to restrict the interpretation, which is done via the distinction between personal and demonstrative pronouns.

Syntax

• constructions such as clefts and left dislocations are much more frequent in Danish than in English and Italian, so in Danish the clause is often the entity which is in "focus" (Gundel et al. 1993); this partly explains the frequent use of personal pronouns (det and unstressed det) with clausal antecedents;

• word order is relatively free in Italian as opposed to Danish and English: the use of abstract substantives in Italian restricts the antecedent search space.

Machine Learning Experiments on Danish data (Navarretta, DAARC 2009)

Classifying the function of 3sn pronouns using the pronominal context (n-grams of various sizes) and the annotated function (only in training); see i.a. Evans (2000), Müller (2007), Hoste et al. (2007).

Several classifiers were run on the data, as proposed by Daelemans et al. (2005), but with more types of data and a more fine-grained classification.

Weka (Witten and Frank, 2005): results evaluated using 10-fold cross-validation.

Baseline: results of ZeroR, which always proposes the most frequent nominal category.
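
The setup can be illustrated with a small plain-Python sketch. It is not the Weka configuration used in the experiments: the window extraction, the fold assignment and the toy labels are all assumptions made for illustration, and accuracy is used instead of F-score to keep the example short.

```python
from collections import Counter


def context_window(tokens, index, size=2):
    """Return `size` tokens on each side of tokens[index], padded with '_'."""
    padded = ["_"] * size + list(tokens) + ["_"] * size
    centre = index + size
    return padded[centre - size:centre] + padded[centre + 1:centre + size + 1]


class ZeroR:
    """Baseline that always predicts the most frequent training label."""

    def fit(self, labels):
        self.majority = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, n):
        return [self.majority] * n


def cross_validate_zero_r(labels, folds=10):
    """Mean accuracy of ZeroR over `folds` interleaved folds."""
    scores = []
    for k in range(folds):
        test = labels[k::folds]
        train = [lab for i, lab in enumerate(labels) if i % folds != k]
        predicted = ZeroR().fit(train).predict(len(test))
        correct = sum(p == t for p, t in zip(predicted, test))
        scores.append(correct / len(test))
    return sum(scores) / folds


# Toy data: context of the pronoun in a three-word utterance, and 20 labels.
features = context_window(["det", "er", "rigtigt"], index=0)
toy_labels = ["non-ref"] * 12 + ["AA"] * 8
baseline_accuracy = cross_validate_zero_r(toy_labels)
```

Any classifier that learns from the context windows has to beat the ZeroR score to show that the pronominal context carries information about the pronoun's function.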

Results (F-score)

• texts: 62.4%, monologues: 64.7%, map-task dialogues: 55.4%, multiparty dialogues: 32.9% (improvement with respect to the baseline: 36.4%, 30.7%, 33.7% and 19.1%, respectively);

• results on texts and map-task data are in line with results obtained on more restricted tasks, e.g. recognition of het by Hoste et al. (2007);

• recognition of non-referential pronouns is slightly lower than in i.a. Boyd et al. (2005);

• adding PoS and lemma information to the data improves classification, but not significantly, the same result as in Hoste et al. (2007).

Classification experiments on Italian

Corpus            Algorithm      Precision   Recall   F-score

Texts             Baseline            18.5     43        25.8
                  SMO                 79.8     84.3      81.1
                  NBTree              78       83.6      80.4
                  Naive Bayes         71.6     76        73.5

AVIP dialogues    Baseline            25.3     50.3      33.7
                  Kstar               68.9     72.4      69.6
                  SMO                 63.5     69.2      65.7
                  NBTree              63       68.5      64.5
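
The baseline rows can be read as follows, under two assumptions that the slides do not state: that the scores are averages over classes weighted by class frequency, and that a class which is never predicted counts with precision, recall and F-score of zero. ZeroR predicts only the majority class, so if that class covers a share p of the data, its precision is p and its recall is 1. The sketch below (the function name is illustrative) computes the resulting weighted scores.

```python
def zero_r_weighted_scores(majority_share):
    """Weighted precision, recall and F-score of a majority-class baseline.

    Assumes class-frequency-weighted averaging, with all scores set to zero
    for classes that are never predicted.
    """
    p = majority_share
    precision = p * p                # majority class: precision p, weight p
    recall = p * 1.0                 # majority class: recall 1, weight p
    f_score = p * (2 * p / (1 + p))  # majority class: F = 2p / (1 + p), weight p
    return precision, recall, f_score


precision, recall, f_score = zero_r_weighted_scores(0.503)
```

With a majority share of 0.503 this gives 25.3, 50.3 and 33.7 in percent, the values in the AVIP baseline row.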

Results on Italian data

• Improvement of classification with respect to the baseline is 55.1% for texts and 35.9% for dialogues.

• There are more types of pronouns in Italian than in Danish, so the use of each type of pronoun is much more restricted in Italian than in Danish.

• Adding PoS and lemma information decreases the performance of the classifier, but not significantly.

Conclusion and future work

• annotations in the DAD corpora: characteristics of the use of 3sn in Danish and Italian;

• differences in the use of AA in Danish, English and Italian can be explained in terms of the languages' pronominal systems and syntax;

• annotations are useful for automatically distinguishing the function of 3sn;

• to do: look at the relation between pronouns, clausal types of antecedents and anaphoric distance; look at parallel data; investigate resolution; investigate the use of lexical resources...