Co-referential chains and discourse topic shifts in parallel and comparable corpora

CBA-08, Barcelona, November 13th-15th 2008

Centre for Language Technology

Costanza Navarretta [email protected]

Outline

• Motivation
• Preceding studies/projects
• Background
• The data
• The annotation
• Problems
• Some results

Motivation

1. to provide a corpus of parallel and comparable Danish and Italian texts annotated with (co)reference and with discourse topic shifts (for language studies, anaphora resolution, MT, generation)

2. to investigate whether there is a systematic relation between the use of various types of referring expression and different discourse transition states in the two languages

3. to identify similarities and differences in the use of various referring expressions in Danish and Italian

Previous work

• Study on the use and resolution of pronouns in Danish

• MULINCO project (Maegaard et al. 2006)

• DAD project (Navarretta & Olsen 2008)

• Annotation seminar at the University of Copenhagen (September 2008)

Questions to investigate

• Referring expressions are used differently in English, Danish and Italian (theoretical and practical problems)

• Differences in the way the three languages use various types of pronoun in abstract reference

• Impression that Danish and Italian use different strategies in reference especially in relation to topic shifts

Background: relation between reference and discourse structure

• Kuno 1972, Halliday and Hasan 1976
• Hobbs 1982 (coherence relations + reference resolution in an abductive framework)
• Givón 1983 (major and minor junctures in dialogue transcriptions)
• Cristea et al. 1998: Veins Theory inside Rhetorical Structure Theory (Mann and Thompson 1987)

Background - continued

Centering framework (Grosz et al. 1995):

• presupposes global coherence: Grosz and Sidner 1986

• is about local coherence
• mainly concerns pronouns
• is compatible with cognitive models of reference of nominal expressions, i.a. (Givón 1983, Gundel et al. 1993, Prince 1981): the use of referring expressions reflects the assumptions made by speakers about the addressees’ mental state at that point in discourse

Local transitions (Brennan et al. 1987, Fais 2004, Poesio et al. 2004)

                  | Cb(Un) = Cb(Un-1) or Cb(Un-1) = NIL | Cb(Un) ≠ Cb(Un-1)
Cb(Un) = Cp(Un)   | Continue                            | Smooth Shift
Cb(Un) ≠ Cp(Un)   | Retain                              | Rough Shift

• Presence/absence of backward-looking center (Cb)
• Nature of instantiation of discourse entities
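
The transition table can be sketched as a small decision function. This is a minimal illustration, assuming the standard definitions from Brennan et al. (1987): the column is chosen by whether Cb(Un) equals Cb(Un-1) (or Cb(Un-1) is NIL), the row by whether Cb(Un) equals Cp(Un). The function name, the use of `None` for NIL, and the "NULL" label for utterances without a Cb are assumptions made for this example, not part of the annotation scheme.

```python
from typing import Optional


def classify_transition(cb_prev: Optional[str],
                        cb_cur: Optional[str],
                        cp_cur: Optional[str]) -> str:
    """Classify a centering transition between utterances Un-1 and Un.

    cb_prev: backward-looking center of Un-1 (None stands for NIL)
    cb_cur:  backward-looking center of Un   (None stands for NIL)
    cp_cur:  preferred center of Un
    """
    if cb_cur is None:
        # Assumption: utterances without a Cb get a separate NULL label.
        return "NULL"
    # Column of the table: is the Cb kept (or was there no previous Cb)?
    same_cb = cb_prev is None or cb_cur == cb_prev
    # Row of the table: is the Cb also the preferred center?
    cb_is_cp = cb_cur == cp_cur
    if same_cb:
        return "Continue" if cb_is_cp else "Retain"
    return "Smooth Shift" if cb_is_cp else "Rough Shift"
```

The entity identifiers passed in are plain strings here; in an annotated corpus they would more likely be the identifiers of the co-referential chains.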

Background: Salience and nominal expressions

in focus (it) > activated (that, this, this N) > familiar (that N) > uniquely identifiable (the N) > type identifiable (a N)

Gundel et al. (1993)

zero pronouns < cliticized pronouns < unstressed pronouns < stressed pronouns < stressed pronouns + gestures < proximal demonstratives < distal demonstratives < first name or last name < definite description < full name

Ariel (1988, 1994)
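
As a rough illustration, Ariel's scale can be held as an ordered list so that two forms can be compared by position. The list below copies the order given on the slide; the constant and function names are hypothetical, and reading positions earlier in the list as marking higher accessibility of the referent follows Ariel's general account.

```python
# Order copied from the scale above: forms earlier in the list mark
# referents that are more accessible to the addressee.
ACCESSIBILITY_SCALE = [
    "zero pronoun",
    "cliticized pronoun",
    "unstressed pronoun",
    "stressed pronoun",
    "stressed pronoun + gesture",
    "proximal demonstrative",
    "distal demonstrative",
    "first name or last name",
    "definite description",
    "full name",
]

# Position of each form on the scale (0 = highest accessibility marker).
RANK = {form: position for position, form in enumerate(ACCESSIBILITY_SCALE)}


def marks_higher_accessibility(form_a: str, form_b: str) -> bool:
    """Return True if form_a precedes form_b on the scale."""
    return RANK[form_a] < RANK[form_b]
```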

The annotated corpora

Parallel corpora
• European law texts
• Short stories and translations (Pirandello)
• Short stories and translations (Villy Sørensen?)

Comparable corpora
• Financial newspapers (Il Sole 24 Ore, Børsen)
• Newspaper articles so far: approx. 24,000 words for Italian, approx. 19,000 words for Danish

The annotation

(Co)reference

Annotation of (co)reference by nominals (substantives) added to a small subset of the DAD corpus (which is annotated with pronominal abstract anaphora and 3rd person singular neuter pronouns)

Annotators, Italian: 6 on a first subset of the data, then divided into groups of 2

Annotators, Danish: 4, then 2

The annotation – continued

• builds upon the MATE/GNOME annotation (Poesio 2004)

• includes both reference to objects introduced in discourse by nominal phrases and reference to objects introduced by i.a. verbal phrases, clauses, discourse segments, predicates in copula constructions

The annotation – continued

• function of pronouns (pleonastic, cataphoric, deictic, anaphoric, individual, abstract, textual deictic, vague, abandoned)

• information about type of referring expression (type of NP, see also Poesio et al. 2004)

• type of relation between referring expression and antecedent (identity/non-identity/other?)

• syntactic type of antecedent (e.g. type of clause, discourse segment, other…)

• semantic type of abstract referents (Asher 1993, Gundel et al. 2003, Navarretta & Olsen 2008)

Some problems

• pronouns (referring expressions in general) can be multifunctional (anaphoric and cataphoric)

• definition of clauses in Danish and Italian
• reference relations: non-identity too general (antecedent and referring expression related, or context determining semantic difference of referents)
• possessives
• granularity of semantic types
• direct speech, deictic I, you…

Discourse topics

Global level (all files): paragraphs are considered to start a topic, then a subtopic and a subsubtopic (Rocha 1997)

Local level (only part of the data): continue/retain/smooth shift/rough shift

The annotation schemes

• two slightly different annotation schemes; the Italian scheme accounts for zero anaphora (Italian is a subject pro-drop language), clitic pronouns, reference to PP
• de, seg elements as in MATE/GNOME
• added explet, abandoned, chunk
• added seg1 for clitics and zero-anaphora (Italian)
• added a number of extra attributes
• tool PALinkA (Orasan 2003): anchor+ref substituted by link (attributes identity/non-identity and dislink to annotate discontinuous elements)

Interannotator agreement: Italian

6 annotators on the first 4000 words

Weighted kappa statistics (Cohen, 1968): PRAM (http://www.geocities.com/skymegsoftware/pram.html)

• Between 0.75 (abstract reference by NPs) and 0.95

On the rest of the data agreement varies (depending on annotators, data, etc.)

Humans are not machines: a number of referring expressions are “forgotten” by one or both annotators, and there are other distraction errors.
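
For readers unfamiliar with the statistic, the following is a minimal sketch of Cohen's (1968) weighted kappa for two annotators, written with linear disagreement weights over an ordered category list. It is not the PRAM implementation; the function name, the linear weighting and the category ordering are assumptions chosen for illustration, and the study's own weights are not stated on the slide.

```python
from collections import Counter
from typing import Hashable, Sequence


def weighted_kappa(labels_a: Sequence[Hashable],
                   labels_b: Sequence[Hashable],
                   categories: Sequence[Hashable]) -> float:
    """Weighted kappa for two annotators, linear disagreement weights.

    categories gives the (assumed) ordering used to weight disagreements:
    labels that are further apart in this list count as worse mismatches.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    k = len(categories)
    index = {category: i for i, category in enumerate(categories)}
    n = len(labels_a)

    def weight(i: int, j: int) -> float:
        # 0 for agreement, 1 for the most distant pair of categories.
        return abs(i - j) / (k - 1) if k > 1 else 0.0

    observed = Counter((index[a], index[b]) for a, b in zip(labels_a, labels_b))
    totals_a = Counter(index[a] for a in labels_a)
    totals_b = Counter(index[b] for b in labels_b)

    # Observed weighted disagreement.
    d_obs = sum(weight(i, j) * count for (i, j), count in observed.items()) / n
    # Weighted disagreement expected by chance from the marginal totals.
    d_exp = sum(weight(i, j) * totals_a[i] * totals_b[j]
                for i in range(k) for j in range(k)) / (n * n)
    if d_exp == 0:
        return 1.0  # both annotators used a single identical category
    return 1.0 - d_obs / d_exp
```

With more than two annotators, as in the Italian pilot, pairwise values would have to be combined in some way; how that was done in the study is not stated here.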

Annotation example (Italian)

<P id="p35" topic="t35.1">
  <S id="s35.1">
    <transition ttype="TNULL"/>
    <de ID="n173" syn-type="NPR">
      <link Ltype="ident" POINT-BACK="n172"/>
      <W id="w35.1.1">La</W><W id="w35.1.2">Acqua</W><W id="w35.1.3">Marcia</W>
    </de>
    <W id="w35.1.4">può</W><W id="w35.1.5">evitare</W>
    <de ID="n521" syn-type="DNP">
      <W id="w35.1.6">il</W><W id="w35.1.7">fallimento</W>
    </de>
    <W id="w35.1.8">.</W>
  </S>
  <S id="s35.2">
    <transition ttype="CONTINUE"/>
    <de ID="n174" syn-type="DNP+GP">
      <link Ltype="ident" POINT-BACK="n173"/>
      <W id="w35.2.1">La</W><W id="w35.2.2">finanziaria</W> <W id="w35.2.3">di</W>
      <de ID="n522" syn-type="NPR">
        <W id="w35.2.4">Vincenzo</W> <W id="w35.2.5">Romagnoli</W>
      </de>
    </de>....
  </S>...
</P>
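
To show how such an annotation might be read programmatically, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The element and attribute names (`S`, `transition`, `de`, `link`, `W`, `ttype`, `ID`, `syn-type`, `POINT-BACK`) are taken from the example; the `SAMPLE` string is that example with the elided material ("....") left out so that it parses, and the function name and output format are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# The slide's example with the elided parts removed so that it is well-formed.
SAMPLE = """<P id="p35" topic="t35.1">
<S id="s35.1"><transition ttype="TNULL"/>
<de ID="n173" syn-type="NPR"><link Ltype="ident" POINT-BACK="n172"/>
<W id="w35.1.1">La</W><W id="w35.1.2">Acqua</W><W id="w35.1.3">Marcia</W></de>
<W id="w35.1.4">può</W><W id="w35.1.5">evitare</W>
<de ID="n521" syn-type="DNP"><W id="w35.1.6">il</W><W id="w35.1.7">fallimento</W></de>
<W id="w35.1.8">.</W></S>
<S id="s35.2"><transition ttype="CONTINUE"/>
<de ID="n174" syn-type="DNP+GP"><link Ltype="ident" POINT-BACK="n173"/>
<W id="w35.2.1">La</W><W id="w35.2.2">finanziaria</W><W id="w35.2.3">di</W>
<de ID="n522" syn-type="NPR"><W id="w35.2.4">Vincenzo</W><W id="w35.2.5">Romagnoli</W></de></de>
</S></P>"""


def extract_mentions(xml_text: str) -> list:
    """List every discourse entity (de) with its text, antecedent and
    the transition type of the sentence it occurs in."""
    root = ET.fromstring(xml_text)
    mentions = []
    for sentence in root.iter("S"):
        transition = sentence.find("transition")
        ttype = transition.get("ttype") if transition is not None else None
        for de in sentence.iter("de"):  # includes nested de elements
            link = de.find("link")      # direct child only
            mentions.append({
                "id": de.get("ID"),
                "text": " ".join(w.text or "" for w in de.iter("W")),
                "syn_type": de.get("syn-type"),
                "antecedent": link.get("POINT-BACK") if link is not None else None,
                "transition": ttype,
            })
    return mentions
```

Following the `antecedent` values backwards from a mention reconstructs its co-referential chain, for instance n174 to n173 to n172 in this paragraph.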

First results – genre differences

• (co)referential chains in literary texts much longer than in the financial articles where coherence is often given by domain knowledge

• pronouns more frequently used in literary texts

• distance between referring expression and antecedent extremely high in literary texts (is there coreference when the distance is more than 50 clauses?)

Differences between the two languages

Inferable entities are more often anchored to known entities by genitives in Danish

• Fin dal primo giorno, Bartolino Fiorenzo s’era sentito dire dalla promessa sposa…

• Fra første dag havde Bartolino Fiorenzo hørt sin tilkommende sige…

(From the very first day Bartolino Fiorenzo had heard (his/the) fiancée say)

(Pirandello, La buona anima)

Differences in use of proximal/distal demonstrative + N'

Italian quel/quello/quella (that) + N' used if:
• there are other clauses or nominals in between referring expression and antecedent
• there is a temporal or spatial distance from the antecedent

Danish denne (this) + N': in the same contexts (there can be clauses and nominals in between, but no competing antecedents, i.e. antecedents of the same semantic type)

quella donna/denne kvinde (woman)
quella sciagura/denne ulykke (calamity)
quella gioia/denne glæde (joy)

questo ragionamento/dette argument (this argument/this reasoning) when the antecedent is the immediately preceding discourse segment

Transition states and referring expressions - Italian

Continue: Zero > Pronoun > the N

Retain: Pronoun > Proper Name > … > Zero

Smooth Shift: Proper Name > the N > Pronoun

Rough Shift: the N > genitive N > Proper Name (+ the N) > distal N > a N > Pronoun

NULL: Proper Name > the N

Conclusion

• A lot of work remains to be done
• Many aspects of which we have no “knowledge”
• Language differences which must be accounted for
• Important for theoretical studies of language as well as for rule-based systems and machine learning