Ch. Boitet, GETA, CLIPS ACIDCA 2000, Monastir, 22-24/3/2000 1 ACIDCA 2000, Monastir, 21-24/3/2000...

Ch. Boitet, GETA, CLIPS ACIDCA ’2000, Monastir, 22-24/3/2000 ACIDCA ’2000, Monastir, 21-24/3/2000 Christian Boitet GETA, CLIPS, IMAG, Grenoble Handling texts and corpuses in Ariane-G5, a complete environment for multilingual MT


ACIDCA ’2000, Monastir, 21-24/3/2000

Christian Boitet

GETA, CLIPS, IMAG, Grenoble

Handling texts and corpuses in Ariane-G5,

a complete environment for multilingual MT


Outline

• Introduction
• Multilingual MT-R (for revisors): linguistic methodology & basic software
    Goals and linguistic methodology
    Ariane-G5, an MT shell for building multilingual MT-R systems
• What has been and is done with Ariane-G5: MT-R, MT-A (for authors), MT of speech
• Representation of input documents
• Structuration of corpuses
• Functionalities during processing


MULTILINGUAL MT-R: GOALS AND LINGUISTIC METHODOLOGY

Produce RAW translation GOOD ENOUGH to be revised

Specialize to SUBLANGUAGES and use MULTILEVEL TRANSFER (semantic + traces)

HEURISTIC PROGRAMMING


MULTILINGUAL MT-R: BASIC DIAGRAM

[Figure: basic diagram of multilingual MT-R. Labels: Source Language Text; Target Language 1 Text; Target Language 2 Text; umc-structures, uma-structures and gma-structures; processing steps Morphological Analysis, Structural Analysis, Abstraction, paraphrase choice, Structural Generation, Syntactic Generation, Morphological Generation.]


Ariane-G5 (1978-99) : structure


Ariane-G5: 2 specialized DB

DB of lingware components
• Declaration of variables (= typed attributes), templates…
• Dictionaries
• Grammars (rules = transitions of abstract automata)

DB of texts
• Corpuses
• Source texts
• Intermediate results
• Translations (± revisions)

relative to “variants” =>


What has been and is done with Ariane-G5:

MT-R (for revisors)
• Large, operational systems: RU->FR, FR->EN
• Prototypes: EN->MY, TH, FR
• Lots of mockups

MT-A (for authors)
• LIDIA mockups: FR->DE, EN, RU (adding CH)

MT of speech (for task-oriented dialogues)
• CSTAR demo system (EN, DE, KR, IT, FR, JP)


MT-R examples of translation (1): French-English in aeronautics (before human revision)


MT-R examples of translation (2)


MT-A example of a disambiguation dialogue

Le capitaine a rapporté des tasses et des assiettes bleues
-> The captain has brought back blue bowls and plates / bowls and blue plates

Question 1
OO des tasses bleues et des assiettes bleues
O des assiettes bleues et des tasses

Question 2
OO capitaine de marine
O capitaine d’aviation
O capitaine d’artillerie
O capitaine d’infanterie
O capitaine de cavalerie
O …


Interaction in source for the “quality MT for all”

Example scenario: multilingual e-mail (UNL)

[Figure: numbered flow (steps 1 to 10) between an e-mail tool (nicknames + language preferences), an e-mail server, an enconversion server, an analysis server, an interactive disambiguation server, decoding/deconversion servers, and the addressees’ e-mail servers.]


Other future possibility: production of multilingual “self-explaining documents”

[Figure: extension of the basic diagram. Labels (translated from French): source-language text; target-language texts 1 and 2; MMC, UMC, UMA and GMA structures; user; interactive disambiguation; ambiguity marks and dialogue; simulated "silent" disambiguation (DMS); paraphrase choice; back-translation; back-translation 1.]


Speech Translation: advantages of an Interchange Format

• N target languages for the cost of one analysis
• Translating into one’s language from N source languages with one generation
• Using the same generation to “backgenerate”

[Figure labels: Analysis into IF; IF; Backgeneration]


Interchange Format : example

la semaine du 12 nous avons des chambres simples et doubles disponibles
(the week of the 12th, we have single and double rooms available)

give-information+availability+room (room-type=(single ; double), time=(week, md12))

Dialogue act: give-information
Concepts: availability, room
Arguments: room-type=(single ; double), time=(week, md12)
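The IF expression on this slide, give-information+availability+room (room-type=(single ; double), time=(week, md12)), can be split mechanically into its dialogue act, concepts and arguments. The following is a minimal sketch that assumes the syntax is exactly what this one example shows; the names `IFExpression`, `split_top_level` and `parse_if` are illustrative and do not come from the CSTAR tools.

```python
import re
from dataclasses import dataclass


@dataclass
class IFExpression:
    dialogue_act: str   # e.g. "give-information"
    concepts: list      # e.g. ["availability", "room"]
    arguments: dict     # argument name -> list of values


def split_top_level(text, sep):
    """Split text on sep, ignoring separators nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return [p for p in parts if p]


def parse_if(expr):
    """Parse 'act+concept+... (name=(v ; v), name=(v, v))' (assumed syntax)."""
    head, _, tail = expr.partition("(")
    act, *concepts = [p.strip() for p in head.split("+")]
    body = tail.rstrip()
    if body.endswith(")"):
        body = body[:-1]  # drop the parenthesis that closes the argument list
    arguments = {}
    for item in split_top_level(body, ","):
        name, _, value = item.partition("=")
        value = value.strip()
        if value.startswith("(") and value.endswith(")"):
            value = value[1:-1]
        arguments[name.strip()] = [
            v.strip() for v in re.split(r"[;,]", value) if v.strip()
        ]
    return IFExpression(act, concepts, arguments)


example = ("give-information+availability+room "
           "(room-type=(single ; double), time=(week, md12))")
parsed = parse_if(example)
```

Because the same structure feeds every generator, one such analysis result is enough to generate into several target languages or to backgenerate into the source language.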


Interface of CLIPS++ CSTAR-II demonstrator

[Screenshot labels: Recognition; IF; Backgeneration (to check the “comprehension”); Generation]


Hardware architecture of the CLIPS++ CSTAR-II demonstrator

[Figure labels: Montpellier; Grenoble; RNIS (ISDN); Ethernet; Reco; FIF; Contrôle (control); IFF; Synthèse (synthesis); VC; IU]


Steps in translating a text

• Build its hierarchical structure
    Chapters, sections, paragraphs, [sentences]
• Segment into translation units
    According to current length parameter [min..max]
• Translate each segment
    Adding segment results to text results for desired phases
• Revise (manually) the whole translations, keep the revisions
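The segmentation step, which groups sentences into translation units according to a length parameter [min..max], can be sketched as follows. The grouping policy is an assumption made for illustration (the slides do not say how Ariane-G5 applies the bounds), and `segment_units`, `translate_text` and `translate_unit` are hypothetical names.

```python
def segment_units(sentences, min_len, max_len):
    """Greedily group consecutive sentences into translation units.

    Assumed policy (not taken from Ariane-G5): a unit is closed as soon as it
    reaches min_len characters, or earlier if adding the next sentence would
    exceed max_len. A single sentence longer than max_len forms its own unit.
    """
    units, current, current_len = [], [], 0
    for sentence in sentences:
        if current and current_len + len(sentence) > max_len:
            units.append(current)
            current, current_len = [], 0
        current.append(sentence)
        current_len += len(sentence)
        if current_len >= min_len:
            units.append(current)
            current, current_len = [], 0
    if current:
        units.append(current)
    return units


def translate_text(sentences, translate_unit, min_len=40, max_len=120):
    """Translate unit by unit, appending each unit's result to the text result."""
    results = []
    for unit in segment_units(sentences, min_len, max_len):
        results.append(translate_unit(" ".join(unit)))
    return results
```

Keeping the per-unit results in a list mirrors the idea of adding segment results to text results, so that processing can stop after any unit.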


Representations of input documents

3 main questions:
• how to represent the writing system,
• separate formatting tags from the text or not,
• how to handle non-textual elements (figures, icons, or formulas) contained in utterances

Covered in the following slides:
• Transliterations of textual elements
• Keeping formatting tags in the texts
• Non-textual elements


Transliterations of textual elements

• Facilitate string-matching operations
• Diminish the size of dictionaries
• Represent diacritics
• Make some processing easier for some tools
    kataba -> ktb$aaa, katub -> ktb$au- or ktb$-ua

Examples:
lisp Lisp LISP -> LISP *LISP **LISP
François va à ACIDCA’2000 -> *FRANC!4OIS VA A!2 **ACIDCA'2000
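The case and diacritic conventions visible in the two examples (lisp / Lisp / LISP and the François sentence) can be reproduced with a few lines of code. This is a sketch under stated assumptions: only the two diacritic codes shown on the slide (ç as C!4, à as A!2) are included, the treatment of one-letter capitalized words is a guess, and the function names are hypothetical.

```python
# Diacritic codes: only the two pairs shown on the slide are included.
DIACRITICS = {"ç": "C!4", "à": "A!2"}


def transliterate_word(word):
    """Encode case as a '*' or '**' prefix and diacritics as '!n' codes."""
    word = word.replace("’", "'")  # typographic apostrophe -> ASCII
    letters = [c for c in word if c.isalpha()]
    if len(letters) > 1 and all(c.isupper() for c in letters):
        prefix = "**"   # all capitals, e.g. LISP
    elif letters and letters[0].isupper():
        prefix = "*"    # initial capital, e.g. Lisp (assumed for 1-letter words too)
    else:
        prefix = ""     # lower case, e.g. lisp
    body = "".join(DIACRITICS.get(c.lower(), c.upper()) for c in word)
    return prefix + body


def transliterate(text):
    """Transliterate a whitespace-separated sequence of words."""
    return " ".join(transliterate_word(w) for w in text.split())
```

With this encoding the three spellings of "lisp" share one upper-case dictionary key, which is one way a transliteration can reduce dictionary size and simplify string matching.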


Transliterations of textual elements (2)

Represent writing systems using non-Roman characters

"мать" (mother) -> "MATQ" and not "MAT6"

а=A, я=YA, е=E, э=YE, и=I, ы=YI, о=O, ё=E!1, у=U, ю=YU, й=J
з=Z, ж=ZH, к=K, ч=KH, с=S, ш=SH, т=T, щ=TH, ь=Q, ъ=W

今日は京都へ行きます。 (Today theme Kyoto dest go.) —>

KYOU <kj k1=kon k2=nichi> WA <hg ha> KYOUTO <kj k1=higashi k2=toukyo-no-tou> E <hg he> IKI <kj k1=iku> MASU.


Keeping formatting tags in the texts

• If the translation units get larger, almost all tags become “inside tags”
• Tags often have a linguistic role

Rendered text:
For example, a sentence may contain
• a bullet list
• or a numbered list
which are normally linguistically homogeneous.

Tagged source:
<P>For example, a sentence may contain</P>
<UL> <LI>a bullet list <LI>or a numbered list</UL>
<P>which are normally linguistically homogeneous. </P>
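One way to keep formatting tags in the text is to treat them as tokens in the same sequence as the words. The sketch below splits the HTML fragment from this slide into ordered tag and text tokens; the regular expression and the token kinds are illustrative assumptions, not the representation used by Ariane-G5.

```python
import re

TAG_RE = re.compile(r"(<[^>]+>)")


def tokenize_with_tags(fragment):
    """Return (kind, value) pairs, keeping tags in sequence with the text."""
    tokens = []
    for piece in TAG_RE.split(fragment):
        piece = piece.strip()
        if not piece:
            continue
        kind = "tag" if TAG_RE.fullmatch(piece) else "text"
        tokens.append((kind, piece))
    return tokens


fragment = ("<P>For example, a sentence may contain</P><UL> <LI>a bullet list "
            "<LI>or a numbered list</UL><P>which are normally linguistically "
            "homogeneous. </P>")
tokens = tokenize_with_tags(fragment)

# The sentence as an analyzer would need to see it, with tags set aside.
text_only = " ".join(value for kind, value in tokens if kind == "text")
tags_only = [value for kind, value in tokens if kind == "tag"]
```

Here the single sentence spans three block-level elements, so every tag except the first and last falls inside the translation unit, which is the “inside tags” situation described above.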


Non-textual elements

Formulas, figures, icons, brand names, anchors, links… are often best replaced by tags or special occurrences

The situation may be recursive (text inside figures)

*IF x2+5y>3 , x+y IS CONVENIENT .

*IF <relation 1> , <entity 2> IS CONVENIENT .

*IF $$R-1 , $$E-2 IS CONVENIENT .
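The replacement of non-textual elements by special occurrences such as $$R-1 and $$E-2 can be sketched as a mask/unmask pair around translation. The sketch assumes the elements have already been located and typed (R for relation, E for entity); how they are detected is outside its scope, and `mask_elements` and `unmask_elements` are hypothetical names.

```python
import re


def mask_elements(text, elements):
    """Replace each (substring, kind) by a '$$KIND-n' special occurrence."""
    table = {}
    for index, (substring, kind) in enumerate(elements, start=1):
        placeholder = f"$${kind}-{index}"
        text = text.replace(substring, placeholder, 1)
        table[placeholder] = substring
    return text, table


def unmask_elements(text, table):
    """Put the original non-textual elements back, e.g. after translation."""
    return re.sub(
        r"\$\$[A-Z]+-\d+",
        lambda m: table.get(m.group(0), m.group(0)),
        text,
    )


source = "*IF x2+5y>3 , x+y IS CONVENIENT ."
masked, table = mask_elements(source, [("x2+5y>3", "R"), ("x+y", "E")])
restored = unmask_elements(masked, table)
```

Since an element may itself contain text (the recursive case mentioned above), the table values could be processed by the same pair of functions before being restored.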


Structuration of corpuses

• Motivations for corpuses
• Segmentation and structuration
• Representation of texts, intermediate results, translations and revisions


Motivations for corpuses

Corpus = collection of texts sharing some factual characteristics:

• natural language

• transliteration and method for handling formatting information and non-textual elements

• segmentation method

• structuration method

some management information:

• source (journal/volume, book/chapter…)

• usage destination (send back, postedit, tests…)


Segmentation and structuration

• "segmentation" = input texts -> words, sentences… & units of translation
    (words, sentences: best done by the morphological analyzer)
• "structuration" = segmentation -> higher level units (paragraphs, sections, etc.)
    + production of a corresponding tree structure
• In Ariane-G5, up to 7 hierarchical separators for a given corpus
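Structuration with hierarchical separators can be pictured as a recursive split that yields one tree level per separator. In this sketch only the limit of 7 separators per corpus comes from the slide; the separator strings in the example and the function names are hypothetical.

```python
MAX_SEPARATORS = 7  # upper bound stated for a given corpus


def build_tree(text, separators, level=0):
    """Split text recursively, one tree level per separator (highest first)."""
    if level == len(separators):
        return text.strip()
    parts = [p for p in text.split(separators[level]) if p.strip()]
    return [build_tree(p, separators, level + 1) for p in parts]


def structure(text, separators):
    """Build the tree for a text, enforcing the per-corpus separator limit."""
    if len(separators) > MAX_SEPARATORS:
        raise ValueError("at most 7 hierarchical separators per corpus")
    return build_tree(text, separators)


# Hypothetical separators: "##" between paragraphs, "." between sentences.
tree = structure("A1. A2.##B1.", ["##", "."])
```

The nested lists play the role of the tree structure; its leaves are the lowest-level units from which translation units are then assembled.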


Representation of texts, intermediate results, translations and revisions

• Corpus = list of text files + descriptor
• Text = (transliterated) text + descriptor
    (+ non-textual elements replaced by tags or spec. occs)
• Intermediate result = list of decorated trees + descriptor
    (lingware variant + interval processed)
• Translation = (transliterated) text + descriptor
    (transliterated form may reduce morph. gen. size)
• Revision = (transliterated) text + descriptor
    (usually another, more natural transliteration)
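The five definitions above map naturally onto a handful of record types. The following data model is only an illustration of those definitions; the class names, the field names and the descriptor keys are assumptions and do not reflect the actual file formats of Ariane-G5.

```python
from dataclasses import dataclass, field


@dataclass
class DecoratedTree:
    """A node carrying a decoration (attribute -> value) and ordered children."""
    decoration: dict
    children: list = field(default_factory=list)


@dataclass
class TextFile:
    """Used alike for a source text, a translation or a revision."""
    content: str       # text in some transliteration
    descriptor: dict   # e.g. language, transliteration used


@dataclass
class IntermediateResult:
    trees: list             # list of DecoratedTree
    lingware_variant: str   # variant of the lingware that produced the trees
    interval: tuple         # (first unit, last unit) processed so far


@dataclass
class Corpus:
    descriptor: dict
    texts: list = field(default_factory=list)   # list of TextFile


corpus = Corpus(descriptor={"language": "FR", "usage": "tests"})
corpus.texts.append(
    TextFile("*FRANC!4OIS VA A!2 **ACIDCA'2000", {"transliteration": "upper-case"})
)
result = IntermediateResult(
    trees=[DecoratedTree({"cat": "S"}, [DecoratedTree({"cat": "NP"})])],
    lingware_variant="v1",
    interval=(1, 1),
)
```

Recording the lingware variant and the processed interval in the intermediate result is what lets a later run decide whether stored trees can be reused.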


Functionalities during processing

• Ensuring coherence between lingware and results
• Stopping & restarting processing of a text
• Reusing intermediate results
    recovery from interruptions
    debugging
    multitarget translation (analysis ≈ 2/3 of translation time)
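Reusing intermediate results while keeping them coherent with the lingware can be sketched as a cache keyed by translation unit and lingware variant. This is a simplified illustration, not the mechanism of Ariane-G5; `ResultStore`, `translate_multitarget` and the analyser/generator callables are hypothetical.

```python
class ResultStore:
    """Cache of analysis results keyed by (unit, lingware variant)."""

    def __init__(self):
        self._results = {}
        self.analysis_runs = 0   # counts real analyses, for illustration

    def analyse(self, unit, variant, analyser):
        key = (unit, variant)
        # Results produced by another lingware variant are never reused.
        if key not in self._results:
            self._results[key] = analyser(unit)
            self.analysis_runs += 1
        return self._results[key]


def translate_multitarget(units, variant, analyser, generators, store):
    """generators maps a target language to a function(analysis) -> text."""
    output = {lang: [] for lang in generators}
    for unit in units:
        analysis = store.analyse(unit, variant, analyser)  # once per unit
        for lang, generate in generators.items():
            output[lang].append(generate(analysis))
    return output
```

If analysis takes about 2/3 of the time of one translation and the remaining 1/3 is spent per target, then k targets cost roughly 2/3 + k/3 of a single translation with reuse instead of k without it, for example 5/3 instead of 3 for three targets. Restarting after an interruption with the same store also skips the units already analysed.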


Conclusion and perspectives (1)

Text & corpus handling in complete MT systems is quite complex and interesting…
• handling texts and corpuses is not a straightforward problem,
• it suggests many interesting technological and scientific issues


Conclusion and perspectives (2)

but more is coming:

Synergy MT systems <-> TA (Translation Aids)

unification of the representations of texts in both worlds:
• MT: revised texts structured as input texts
    => the text data base will become a kind of multilevel translation memory (texts, translations/revisions, intermediate results)
• TA: translation memories from "bags" to structured translation memories (keeping the sequential context)

both: multiple-layer translation memories
• lemmatized forms
• "concrete" syntactic trees & "abstract" logico-semantic trees
• formatting tags


Conclusion and perspectives (3)

Structuration may be used to « distribute the work » to MT and TA by segmenting according to the « best engine »

some sublanguages are good for MT, bad for TA

• weather bulletins

others are good for TA, bad for MT

• weather related warnings, slightly modified versions of already translated documents

and others are best kept for specialists

• finely tuned legal sentences