Anabela Barreiro - Alinhamentos

Workshop led by Anabela Barreiro at the I Conferência Internacional de Tradução e Tecnologia (1st International Conference on Translation and Technology), 13 and 14 May, Faculdade de Letras do Porto.

Transcript of Anabela Barreiro - Alinhamentos

Page 1: Anabela Barreiro - Alinhamentos

Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa

technology from seed

L2F - Spoken Language Systems Laboratory

Cross-Language Alignments: Challenges, Guidelines and Gold Sets

Anabela Barreiro Luísa Coheur Tiago Luís

Ângela Costa Fernando Batista João Graça

Page 2: Anabela Barreiro - Alinhamentos

Outline – Part 1

• Word alignment

– Basic concepts

– Applications

– State of the art

– Limitations

• Paraphrase alignment

• Multiword, meaning and translation unit alignment: importance

• Our task

• Alignment tool: CLUE-Aligner

Page 3: Anabela Barreiro - Alinhamentos

Outline – Part 2

• General annotation guidelines

• Cross-linguistic major challenges to word alignment

• Annotation guidelines for multiword units and lexical and non-lexical realization phenomena

• Pro-dropping

• Articles and zero articles

• Examples: continuous multiword units

• Examples: continuous and discontinuous support verb constructions

Preposition-dependency (V, N and Adj)

Active vs passive

Choice of noun pre-modifiers

Different PoS with same semantics (V vs process N)

Noun adjuncts

Coordination

Anaphora: choice of co-referents

Impersonal constructions

Contractions

Style: idiomatic vs non-idiomatic

Antonyms and negation constructions

Romance languages double negation

Singular vs plural

Flexible/loose paraphrasing constructions

Idiosyncrasies of each language

Page 4: Anabela Barreiro - Alinhamentos

Outline – Part 3

• Our contribution

• Annotation process

• Preliminary results

• Discussion

• Future work

Page 5: Anabela Barreiro - Alinhamentos

Word Alignment: Basic Concepts

• Objects representing the mapping of words (or expressions) that are semantically equivalent in a source and a target sentence of a parallel corpus [Brown et al., 1990]

– Matrix of n × m entries, where n is a position in the source sentence and m is a position in the target sentence. An entry a_{n,m} of that matrix specifies whether the word at position n in the source sentence is part of a translation of the word at position m in the target sentence

• Task of word alignment - identifying translational equivalences (= semantic correspondences) in the aligned sentence pairs of a parallel text [Hearne & Way, 2011]

• Translational equivalences - graphically represented in a grid by the intersection of single segments (individual words) or blocks (semantico-syntactic units, phrases, expressions)
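The matrix view described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `alignment_matrix`, the 0-based positions and the two-word sentence pair are hypothetical and are not taken from the gold sets.

```python
from typing import List, Set, Tuple


def alignment_matrix(n_source: int, n_target: int,
                     links: Set[Tuple[int, int]]) -> List[List[int]]:
    """Build an n x m grid; cell [n][m] is 1 when source position n
    is aligned with target position m, and 0 otherwise."""
    grid = [[0] * n_target for _ in range(n_source)]
    for n, m in links:
        grid[n][m] = 1
    return grid


# Illustrative sentence pair (an assumption, not from the gold sets).
source = ["the", "system"]   # EN
target = ["o", "sistema"]    # PT
links = {(0, 0), (1, 1)}     # 0-based (source position, target position)

matrix = alignment_matrix(len(source), len(target), links)
for word, row in zip(source, matrix):
    print(f"{word:>8} {row}")
```

A block alignment would simply fill a rectangle of cells instead of a single one, which is why the same grid can hold both segments and blocks.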

Page 6: Anabela Barreiro - Alinhamentos

Word Alignment: Basic Concepts

• Sure alignment (S-alignment)

– Unambiguous and valid in all contexts

• EN system

• ES sistema

• FR système

• PT sistema

• Possible alignment (P-alignment)

– Ambiguous and invalid in some contexts

• EN be

• ES ser/estar/haber/existir

• FR être/avoir/exister

• PT ser/estar/haver/existir

Page 7: Anabela Barreiro - Alinhamentos

Word Alignment: Applications

• Statistical machine translation

– [Brown et al., 1990] – statistical machine translation

– [Och and Ney, 2004] – phrase-based machine translation

– [Galley et al., 2004] – syntax-based machine translation

• Annotation projection

• Extraction of bilingual lexica

• Evaluation of machine translation systems

Page 8: Anabela Barreiro - Alinhamentos

Word Alignment: State of the Art

• Workshops and evaluation tasks (multi-language)

– http://www.cse.unt.edu/~rada/wp/

– http://www.statmt.org/wpt05

– http://www.lpl.univ-aix.fr/projects/arcade

• Projects

– Blinker project (French-English): http://nlp.cs.nyu.edu/blinker/

• Guidelines

[Melamed, 1998] [Och and Ney, 2000]

[Lambert et al., 2005] [Kruijff-Korbayová et al., 2006]

[Graça et al., 2004]

Page 9: Anabela Barreiro - Alinhamentos

Word Alignment: Limitations

• Language does not operate on a word-for-word basis

• A large number of words cannot be dissociated from the units they belong to

– Multiword units (MWUs)

• [Gross and Senellart, 1998] – more than 40% of one year of Le Monde text consists of MWUs

• [Sag et al., 2002] – 50-70% of specialized lexica are MWUs

• [Ramisch et al., 2010] – 56.7% of the terms in the Genia corpus have two or more words (not counting general-purpose MWUs, e.g., generic compounds, lexical bundles, phrasal verbs and fixed expressions, which also occur in domain-specific texts)

– Translation units

– Meaning units

– Paraphrases

• Segment and block alignment (sure and possible)

Page 10: Anabela Barreiro - Alinhamentos

Example: Segment and Block Alignment (Sure and Possible)

Page 11: Anabela Barreiro - Alinhamentos

Paraphrase Alignment

• Monolingual

– [Callison-Burch et al., 2006]

• Annotation guidelines for paraphrase alignment

• Paraphrases - sentences that convey the same meaning but are

worded differently

• Alignment of words, phrases, expressions, within the same language

• Bilingual = (non-literal) translation

– Need to account for paraphrases across languages

Page 12: Anabela Barreiro - Alinhamentos

Multiword, Meaning and Translation Unit Alignment: Importance

• Publicly available manual word alignments are restricted

to a few language pairs

• Manual word alignments are a desired resource

– Evaluation of word alignment algorithms

– Training of supervised and semi-supervised algorithms

– Tuning of parameters for different types of model

• However, the “name”, “concept” and “techniques” of alignment need to be linguistically sophisticated to be more useful and to help improve machine translation
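The slides list the evaluation of word alignment algorithms as one use of gold alignments but do not name a metric. As a hedged illustration, the sketch below assumes the precision, recall and alignment error rate (AER) definitions commonly attributed to Och and Ney, which rely on the sure/possible distinction. The function name and the toy link sets are hypothetical.

```python
from typing import Set, Tuple

Link = Tuple[int, int]  # (source position, target position)


def alignment_scores(hypothesis: Set[Link], sure: Set[Link],
                     possible: Set[Link]) -> Tuple[float, float, float]:
    """Return (precision, recall, AER) of a hypothesis alignment
    against gold sure and possible links (both assumed non-empty)."""
    possible = possible | sure            # every sure link also counts as possible
    hits_sure = len(hypothesis & sure)
    hits_possible = len(hypothesis & possible)
    precision = hits_possible / len(hypothesis)
    recall = hits_sure / len(sure)
    aer = 1 - (hits_sure + hits_possible) / (len(hypothesis) + len(sure))
    return precision, recall, aer


# Toy gold standard and system output, invented for this example.
sure = {(0, 0), (1, 1)}
possible = {(2, 1)}
hypothesis = {(0, 0), (2, 1), (2, 2)}

precision, recall, aer = alignment_scores(hypothesis, sure, possible)
print(f"precision={precision:.3f} recall={recall:.3f} AER={aer:.3f}")
```

Under this definition a system is not penalised in precision for proposing a possible link, but only sure links count towards recall.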

Page 13: Anabela Barreiro - Alinhamentos

Our Task

• EuroParl corpus [Koehn, 2005]

• 6 gold alignment sets

– 400 sentence alignments per set (400 × 6 = 2,400)

• Languages: English, French, Portuguese and Spanish

– Language pairs: [en-es], [en-fr], [en-pt], [es-fr], [pt-es], [pt-fr]

• Guidelines for multi-language manual word annotations

(with inter-annotator agreement)

• Linguistically-informed (and linguistically-motivated) cross-language multiword unit and paraphrase alignment (translation unit alignment)
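The count of six gold sets follows from pairing the four languages without regard to direction. A minimal sketch, assuming unordered pairs (the slides write some pairs in the other order, e.g. [pt-es]) and the variable names chosen here:

```python
from itertools import combinations

languages = ["en", "es", "fr", "pt"]
SENTENCES_PER_SET = 400  # figure taken from the slide

# Unordered pairs: 4 languages give 4 * 3 / 2 = 6 pairs.
pairs = list(combinations(languages, 2))
total_alignments = len(pairs) * SENTENCES_PER_SET

for first, second in pairs:
    print(f"[{first}-{second}]")
print(total_alignments)
```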

Page 14: Anabela Barreiro - Alinhamentos

CLUE-Aligner Alignment Tool

CLUE-Aligner = Cross-Language Unit Elicitation Aligner

• Helps reduce ambiguity in the alignment process

• Facilitates the alignment of translation units

Page 15: Anabela Barreiro - Alinhamentos

Major Challenges (4 different classes)

• semantico-discursive

– emphatic linguistic constructions

• tautology

• pleonasm and repetition

• focus constructions

• lexical and semantico-syntactic

– multiword units

– compound verbs

– prepositional predicates

Page 16: Anabela Barreiro - Alinhamentos

Major Challenges (4 different classes)

• morphological

– contracted forms

– lexical versus non-lexical realization

• articles and zero articles

• pro-dropping

– subject pronoun drop

– empty relative pronoun

• morpho-syntactic

– free noun adjuncts

Page 17: Anabela Barreiro - Alinhamentos

General Annotation Guidelines

Linguistic phenomenon No alignment P-alignment

Incomplete or non-translation X

Incorrect translation and typo X*

Approximate correspondence (numeric) X

Non-obligatory linguistic structure

Pleonasm X

Repetition of words or expressions X

Redundancy or additional/extra information X

Mismatching pronoun, determiner, verbs, etc. X

Abbreviations versus full word X

Punctuation mark

Different but correct X

Incorrect / mismatch X

Missing X

* If a multiword unit is incorrectly translated or contains a typo, none of its internal segments are aligned

Page 18: Anabela Barreiro - Alinhamentos

Annotation Guidelines

Linguistic phenomenon No alignment Block-alignment (S-align, P-align)

• Multiword unit

– continuous: X X

– discontinuous: X*

• Lexical versus non-lexical realization

– article + N versus zero-article + N (EN Ø people = PT as pessoas): X

– pro-drop + V versus pronoun + V (EN I went = PT Ø fui): X

– empty relative pronoun versus realized relative pronoun (EN N that I met = N I met = PT que (eu) conheci): X

– relative versus participial adjective (EN that was written = written = PT escrito): X

* Some discontinuous multiword units are candidates for block-alignment (e.g., when the number of inserts is small or the multiword unit is “semi-frozen”)

Page 19: Anabela Barreiro - Alinhamentos

Annotation Guidelines

Continuous multiword units Block-S-alignment Block-P-alignment

Support verb construction X X

Compound X X

Phrasal verb X X

Named entity X X

Date and time expression X

Lexical bundle X

Idiomatic expression X

Domain term X

French negation (ne pas) X

English infinitive (to + V) X X

[Barreiro, 2008] presents a detailed description and examples of the different types of multiword unit

Page 20: Anabela Barreiro - Alinhamentos

Example: Continuous Support Verb Constructions (alignment)

ES aprueba plenamente

FR approuve pleinement

Page 21: Anabela Barreiro - Alinhamentos

Example: Discontinuous Support Verb Constructions (no alignment)

ES para que acelere la directiva sobre pensiones complementarias

FR pour faire avancer la directive sur les pensions complémentaires

Page 22: Anabela Barreiro - Alinhamentos

Cross-Linguistic Challenges

• Prepositional predicates

EN I too should like to congratulate [NE] on his excellent report

ES también yo quisiera felicitar a mi colega [NE] por su excelente informe

FR je voudrais féliciter moi aussi mon collègue [NE] pour son excellent rapport

PT também eu gostaria de felicitar o meu colega [NE] pelo seu excelente relatório

EN […] our Asian partners prefer to deal with questions which unite us

ES […] nuestros socios asiáticos prefieren dedicarse a las cuestiones que nos unen

FR […] nos partenaires asiatiques préfèrent s’attacher à ce qui nous unit

PT […] os nossos parceiros asiáticos preferem centrar-se unicamente nas questões comuns

Segment S-alignment

Impossible to annotate discontinuous preposition-dependency

Block P-alignment

Page 23: Anabela Barreiro - Alinhamentos

Cross-Linguistic Challenges

• Prepositional verbs

agree with, belong to, forgive s/o for, pay for, stand for
aim at/for, choose between, hope for, prepare for, thank s/o for
allow for, comment on, insist on, prevent s/o from, think of/about
apologise for, compare with, interfere with/in, provide s/o with, volunteer to
apply for, complain about, joke about, refer to, wait for
approve of, concentrate on, laugh at, rely on, warn s/o about
argue with/about, congratulate on, lend s/th to s/o, run for, worry about
ask for, consist of, listen to, smile at
attend to, deal with, long for, succeed in
believe in, decide on, object to, suffer from

Page 24: Anabela Barreiro - Alinhamentos

Cross-Linguistic Challenges

• Prepositional nouns

attack on, attitude towards, in agreement, on strike
cruelty towards, comparison between, on average, in trouble
difficulty in/with, decrease in, on condition, on behalf of
knowledge of, disadvantage of, delay in, connection between
reason for, increase in, in doubt, difference between/of
rise in, preference for, information about, under guarantee
solution to, reduction in, need for, in power
use of, at risk, protection from, reaction to
in a hurry, at stake, report on, result of
in practice, in theory, room for, trouble with

Page 25: Anabela Barreiro - Alinhamentos

Cross-Linguistic Challenges

• Prepositional adjectives

delighted at/about, frightened of, opposed to, similar to
different from, friendly with, pleased with, sorry for/about
dissatisfied with, good at, popular with, suspicious of
doubtful about, guilty of, proud of, sympathetic to(wards)
enthusiastic about, incapable of, puzzled by/about, tired of
envious of, interested in, safe from, typical of
excited about, jealous of, satisfied with, unaware of
famous for, keen on, sensitive to(wards), used to
fed up with, kind to, serious about
fond of, mad at/about, sick of

Page 26: Anabela Barreiro - Alinhamentos

Cross-Linguistic Challenges

• Noun Adjuncts

– Compounds

• EN European investment bank / PT banco europeu de investimento

[Adj N N] / [N Adj [de N]]

– Free noun phrases (not compounds)

• EN presidency communication / PT comunicação da presidência

[N N] / [N [de N]]

Block S-alignment

Segment S-alignment

Block-P-alignment

of [de N]

Page 27: Anabela Barreiro - Alinhamentos

Cross-Linguistic Challenges

• Contractions

– two or more words with different parts-of-speech overlap, which makes syntactic analysis and generation difficult

– in cross-language analysis, the contrast between languages that have contractions and languages that do not have them, or do not have them in the same contexts, presents additional difficulties

– the alignment of one segment that corresponds to a contracted form in one language with the corresponding segments where elements are not contracted in the other language of the parallel pair is pragmatically motivated

Page 28: Anabela Barreiro - Alinhamentos

Example: Contractions (block-P-alignment)

Interference with the support verb construction

EN to make a reference to

PT fazer uma referência a

Page 29: Anabela Barreiro - Alinhamentos

Example: Contractions (block-P-alignment)

Interference with the support verb construction

ES hacer una referencia a

FR faire référence à

Page 30: Anabela Barreiro - Alinhamentos

Cross-Linguistic Challenges

• Singular versus plural (related to determiner)

EN in every official language of the union

ES en todos los idiomas oficiales de la unión

FR dans toutes les langues officielles de l'union

PT em cada uma das línguas oficiais da união

• Active versus passive

EN before new member states are admitted

ES antes de la incorporación de nuevos miembros

FR avant l'admission de nouveaux membres

PT antes da entrada de novos membros

Block or segment P-alignment

Block-S-alignment if there is some fixedness (such as in this case)

Block P-alignment

Page 31: Anabela Barreiro - Alinhamentos

Cross-Linguistic Challenges

• Coordination

EN which we will send to the council and Ø parliament

ES que enviaremos al consejo y al parlamento

FR qui sera envoyée au conseil et au parlement

PT que remeterá ao conselho e ao parlamento

• Style: idiomatic versus non-idiomatic

EN which began four years ago

ES que empezó hace cuatro años

FR qui a vu le jour il y a quatre ans

PT que se iniciou há quatro anos

No alignment

Block P-alignment

Page 32: Anabela Barreiro - Alinhamentos

Cross-Linguistic Challenges

• Choice of noun pre-modifiers

EN we should use that public funding for those types of project which are most difficult to finance through the private sector

ES deberíamos utilizar esa financiación pública para aquel tipo de proyectos que tienen mayor dificultad para ser financiados por el sector privado

FR nous devrions recourir au financement public pour les projets que le secteur privé boude

PT o financiamento público deveria ser utilizado para os projectos que registam maiores dificuldades em serem financiados pelo sector privado

Block P-alignment

EN despite certain difficulties

PT apesar das dificuldades

Page 33: Anabela Barreiro - Alinhamentos

Cross-Linguistic Challenges

• Anaphora - choice of co-referents (noun versus pronoun)

EN it is not acceptable that we assisted Korea during the Asean crisis by means of IMF loans and suchlike, only for Korea still to be subsidising its shipyards

ES no resulta procedente que hayamos ayudado a Corea en la crisis de la Asean a través de préstamos del FMI, etc. y que Corea siga subvencionando sus astilleros

FR il n’est pas acceptable que nous ayons aidé la Corée dans la crise de l’Anase, avec des prêts du FMI, etc. et qu’elle continue à subventionner ses chantiers navals

PT é inadmissível que, depois de termos ajudado a Coreia, através de créditos do FMI, etc., na crise da Asean, este país continue a subvencionar agora os seus estaleiros navais

Segment or block P-alignment

Page 34: Anabela Barreiro - Alinhamentos

Cross-Linguistic Challenges

• Antonyms and negation constructions

EN the countries of Asia have not unfortunately been in favour of that proposal

ES los países de Asia desgraciadamente no han sido favorables a dicha propuesta

FR les pays d'Asie ont malheureusement rejeté cette proposition

PT os países da Ásia, infelizmente, não se mostraram favoráveis a esta proposta

Block S-alignment together with adverb (insert in EN and FR)

Page 35: Anabela Barreiro - Alinhamentos

Cross-Linguistic Challenges

• Flexible/loose paraphrasing constructions

EN and we shall vote against it

ES y merece nuestra condena

FR et dénonçons

PT e merece a nossa condenação

EN 1993 was a significant year

ES el año 1993 es una fecha notable

FR l’année 1993 est à marquer d’une pierre blanche

PT 1993 é uma data charneira

Block P-alignment

Page 36: Anabela Barreiro - Alinhamentos

Cross-Linguistic Challenges

• Different parts-of-speech with same semantics (verbs versus process nouns)

EN we must use all the financial instruments at our disposal to rapidly develop the market

ES es preciso utilizar todos los instrumentos financieros disponibles para un rápido desarrollo ulterior del mercado

FR il faut utiliser tous les instruments financiers disponibles pour développer rapidement le marché

PT todos os instrumentos financeiros disponíveis deverão ser aplicados para continuar a desenvolver rapidamente o mercado

Block S-alignment (with internal segment P-alignments)

EN and PT :

Segment S-alignment

No alignment of [continuar a]

Page 37: Anabela Barreiro - Alinhamentos

Cross-Linguistic Challenges

• Impersonal constructions

(+ “impersonal” relative versus participial adjective)

EN we must fully support the demands that have been made

ES hay que apoyar plenamente las exigencias que se han formulado

FR il faut par conséquent appuyer les requêtes formulées

PT as reivindicações formuladas deverão ser plenamente apoiadas

Block P-alignment

Internal P-alignment

EN we must

ES hay que

FR il faut

Internal segment S-alignment - adverb and verb (EN, ES, FR)

Internal segment P-alignment - verb (PT)

Page 38: Anabela Barreiro - Alinhamentos

Cross-Linguistic Challenges

• Romance languages double negation (+ coordination)

EN it is not, therefore, surprising that there is, in this context, no real integration or genuine political dialogue

ES no es nada sorprendente, entonces, que en ese contexto, no haya ni verdadera integración ni verdadero diálogo político

FR rien d’étonnant donc, qu'il n'y ait dans ce contexte, ni intégration véritable, ni dialogue politique véritable

PT assim, não é de espantar que, nesse contexto, não exista verdadeira integração nem verdadeiro diálogo político

Block P-alignment of the relative existential with adverbial (insert)

EN that there is, in this context, no

ES que en ese contexto, no haya

FR qu’il n’y ait dans ce contexte

PT que, nesse contexto, não exista

Segment P-alignment of negation and negation connector

EN no – or

ES ni – ni

FR n’ – ni

PT Ø - nem

Page 39: Anabela Barreiro - Alinhamentos

Cross-Linguistic Challenges

• Idiosyncrasies of languages

• Portuguese inflected infinitive (peculiar verb tense)

• English to+Infinitive

• French negation

• English apostrophe

• …

• Sociolinguistic differences

Page 40: Anabela Barreiro - Alinhamentos

Our Contribution

• Tool CLUE-Aligner

• Annotated corpora

• Cross-language resources – gold collection

Publicly available on the META-NET website:

http://metanet4u.l2f.inesc-id.pt/

• Guidelines

– http://www.inesc-id.pt/ficheiros/publicacoes/8204.pdf

Page 41: Anabela Barreiro - Alinhamentos

Annotation Process

• Annotation of 400 × 6 sentence pairs (2,400 sentence alignments) by a linguist

• Alignment of a subset by a second linguist (25 sentences of the English-Portuguese language pair)

• Inter-annotator agreement

Page 42: Anabela Barreiro - Alinhamentos

Preliminary Results

language   words   avg. words
en         11158   27.9
es         11664   29.2
fr         12464   31.2
pt         11649   29.1

pair    Sure   Possible   Total
en-pt   6684   418        7102
en-fr   7025   569        7594
en-es   7636   399        8035
es-fr   7477   767        8244
pt-es   7958   557        8515
pt-fr   7029   782        7811

pair    Sure   Possible   Total
en-pt   2588   602        3190
en-fr   3865   414        4279
en-es   3551   351        3902
es-fr   3516   495        4011
pt-es   3162   382        3544
pt-fr   3253   698        3951

Block (MWU) alignment Segment (word) alignment
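The figures above can be cross-checked with a short script. The sketch assumes that "avg. words" means words per sentence over the 400 sentences of each set, and it refers to the two alignment tables only by the order in which they appear; the variable and function names are hypothetical.

```python
SENTENCES_PER_SET = 400  # assumption: averages are per sentence

words = {"en": 11158, "es": 11664, "fr": 12464, "pt": 11649}
reported_avg = {"en": 27.9, "es": 29.2, "fr": 31.2, "pt": 29.1}

# (sure, possible, total) per language pair, in the order shown on the slide.
first_table = {
    "en-pt": (6684, 418, 7102), "en-fr": (7025, 569, 7594),
    "en-es": (7636, 399, 8035), "es-fr": (7477, 767, 8244),
    "pt-es": (7958, 557, 8515), "pt-fr": (7029, 782, 7811),
}
second_table = {
    "en-pt": (2588, 602, 3190), "en-fr": (3865, 414, 4279),
    "en-es": (3551, 351, 3902), "es-fr": (3516, 495, 4011),
    "pt-es": (3162, 382, 3544), "pt-fr": (3253, 698, 3951),
}


def totals_consistent(table):
    """True when sure + possible equals the reported total in every row."""
    return all(sure + possible == total
               for sure, possible, total in table.values())


def averages_consistent():
    """True when words / 400 matches the reported average to one decimal."""
    return all(abs(words[lang] / SENTENCES_PER_SET - reported_avg[lang]) < 0.05
               for lang in words)


print(totals_consistent(first_table), totals_consistent(second_table))
print(averages_consistent())
```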

Page 43: Anabela Barreiro - Alinhamentos

Inter-annotator Agreement

• Statistical significance for kappa is rarely reported. However, a number of magnitude guidelines have appeared in the literature.

– Landis & Koch (1977) consider

• kappas between .4 and .6 as moderate agreement

• kappas between .8 and 1 as almost perfect agreement

– Fleiss (1981) (equally arbitrary guidelines) characterizes

• kappas from .40 to .75 as fair to good

• kappas over .75 as excellent

• These guidelines are, however, by no means universally accepted

Cohen's kappa coefficient

Multi-word units (MWU)   0.541
Word alignments (WA)     0.984
Total                    0.871
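For readers unfamiliar with the statistic, Cohen's kappa compares the observed agreement p_o with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a generic implementation; the slides do not say how the annotated items were defined, so the per-item labels "S", "P" and "N" (sure, possible, no alignment) and the toy data are assumptions.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy decisions for six candidate links, invented for this example.
annotator_1 = ["S", "S", "P", "N", "S", "N"]
annotator_2 = ["S", "S", "P", "N", "P", "N"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

In this toy case the annotators agree on five of six items (p_o = 5/6) and chance agreement is 1/3, which gives kappa = 0.75.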

Page 44: Anabela Barreiro - Alinhamentos

Discussion

• Difficulties in analyzing fluency, stylistics (including word order), paraphrase, etc.

• Alignments do not always work bi-directionally (sometimes the source-target direction for a language pair matters)

• Levels of alignment and ranking systems (n-grams, morphology, semantico-syntactic level, phrase, paraphrase, etc.)

• Terminology imprecision is found in corpora (it leads to poor quality machine translation)

Page 45: Anabela Barreiro - Alinhamentos

Future Work

• Integration of lexica (multiword units, etc.) obtained via the use of local grammars – use multiword units as ONE (1) segment of alignment, whenever that is possible (contiguous, etc.)

• Pre-processing of contractions and post-processing of elements that need to be contracted is important if applied to machine translation or to create “more polished” lexica

• Evaluation of the current alignments in a statistical machine translation system to see if translation quality improves
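As a rough illustration of the contraction pre-processing mentioned above, the sketch below expands a few Portuguese preposition + article contractions before alignment. The lookup table is a small, hand-picked assumption rather than a resource from the project, and it deliberately omits ambiguous forms such as "nos" (which can also be a pronoun); the function name is hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical, incomplete table of Portuguese contractions.
CONTRACTIONS: Dict[str, Tuple[str, str]] = {
    "ao": ("a", "o"), "aos": ("a", "os"),
    "do": ("de", "o"), "da": ("de", "a"),
    "dos": ("de", "os"), "das": ("de", "as"),
    "no": ("em", "o"), "na": ("em", "a"), "nas": ("em", "as"),
    "pelo": ("por", "o"), "pela": ("por", "a"),
}


def split_contractions(tokens: List[str]) -> List[str]:
    """Replace each known contraction by its two component words."""
    expanded: List[str] = []
    for token in tokens:
        # Unknown tokens are kept unchanged (wrapped in a 1-tuple).
        expanded.extend(CONTRACTIONS.get(token.lower(), (token,)))
    return expanded


sentence = "que remeterá ao conselho e ao parlamento".split()
print(split_contractions(sentence))
```

After alignment, a post-processing step would have to re-contract "a o" into "ao" wherever the target language requires it, which is the second half of the point made on the slide.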

Page 46: Anabela Barreiro - Alinhamentos

Future Work

• Machine learning of recognition and alignment of multiword units

– based on segment alignments, i.e., individual words inside the multiword unit

– based on multiword units of a parallel sentence in another language or language pair alignment

• Use of local grammars that identify and process discontinuous multiword units and other complex linguistic phenomena to combine with word alignment techniques – how to combine?

Page 47: Anabela Barreiro - Alinhamentos

Main Conclusion

• Bringing linguistics into SMT at the start is the first, and inevitable, place where hybridization should be possible.

• We believe that it would be productive to convert the texts on both sides of a translation pair into a common semantico-syntactic representation before applying statistics to them. For this, each language would need a parser capable of producing homogeneous output.

• If this common representation were available, it would open up vast possibilities for multilingual SMT.

Page 48: Anabela Barreiro - Alinhamentos

Thank you!
