AUTOMATIC TRANSLATION UTILITY Fostering language diversity and participation Juan Dolio, DR, 11-14 November 2008 Stéphane Bruno, AHTIC/CONSORTIUM CARISNET [email protected]

LANGUAGE STATS

FACTS

English is the dominant language in CIVIC discussions

Members who are not fluent in English (or do not speak it at all) are reluctant to contribute

Manual (human) translation of all email and forum communications is impractical and far too costly

Systematic human translation would also delay interactions

CIVIC APPROACH TO LANGUAGE DIVERSITY

Three official languages: English, French, Spanish

All documents and “official” communications are translated into all three languages (the original-language document being the legally binding one?)

Simultaneous translation is provided in face-to-face meetings for plenary sessions when the size of the language group and its needs justify the cost

Automatic translation of emails is provided to facilitate comprehension and contribution by all language groups

OBJECTIVES OF THE AUTOMATIC TRANSLATION

Provide the opportunity for all members to get the essence of all communications in all three official CIVIC languages

Make the translation non-disruptive, as seamless and as user-friendly as possible

Allow the translation to improve over time

Build a contextual terminology and linguistic environment for CIVIC in its field of intervention

HOW IT WORKS

THE TRANSLATION MECHANISMS

When a mail arrives, the software breaks the email into paragraphs

The software tries to guess the language of each paragraph

If it cannot guess the language, it assumes it is English

The software then preprocesses the paragraph through the knowledgebase

Each paragraph is then sent to the translation service (Babelfish), and the result is retrieved for each language pair

The resulting paragraph is post-processed

The email is then reconstructed and sent to the mailing list manager
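The steps above can be sketched in Python as follows. This is only an illustration, not the utility's actual code: the function names, the language codes, the blank-line paragraph split, and the choice to rebuild one body per target language are all assumptions. The language guesser and the translation call are passed in as parameters, so no particular translation service API is implied.

```python
from typing import Callable, Dict, List, Optional

LANGUAGES = ("en", "fr", "es")  # the three official CIVIC languages
DEFAULT_LANGUAGE = "en"         # fallback when the language cannot be guessed


def split_paragraphs(body: str) -> List[str]:
    """Split a plain-text email body on blank lines (assumed paragraph rule)."""
    return [p.strip() for p in body.split("\n\n") if p.strip()]


def translate_email(
    body: str,
    guess_language: Callable[[str], Optional[str]],
    translate: Callable[[str, str, str], str],
    preprocess: Callable[[str], str] = lambda p: p,
    postprocess: Callable[[str], str] = lambda p: p,
) -> Dict[str, str]:
    """Return one reconstructed body per official language (hypothetical layout)."""
    out: Dict[str, List[str]] = {lang: [] for lang in LANGUAGES}
    for paragraph in split_paragraphs(body):
        # Guess the language; assume English when the guess fails.
        source = guess_language(paragraph) or DEFAULT_LANGUAGE
        prepared = preprocess(paragraph)  # knowledgebase step
        for target in LANGUAGES:
            if target == source:
                out[target].append(paragraph)  # keep the original wording
            else:
                translated = translate(prepared, source, target)
                out[target].append(postprocess(translated))
    # Reconstruct each body; handing it to the list manager is out of scope here.
    return {lang: "\n\n".join(parts) for lang, parts in out.items()}
```

Passing the guesser and the translator as callables keeps the sketch testable with stand-in functions and independent of any one translation service.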

INPUT REQUIREMENTS

Use simple language constructs

Use complete sentences and correct grammar and syntax

Avoid abbreviations, metaphors and idiomatic expressions

Avoid proverbs and sayings

Do not mix languages in the same paragraph (as translation is done paragraph by paragraph, and the language is guessed)
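To see why the language of each paragraph has to be guessed from enough text, here is a minimal sketch of one common guessing technique: counting frequent function words per language. The slides do not say how the utility guesses the language, so the word lists, the `min_hits` threshold, and the function name are assumptions made for illustration.

```python
import re
from typing import Optional

# Small, hand-picked lists of frequent words (illustrative, not exhaustive).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "for", "with", "are"},
    "fr": {"le", "la", "les", "et", "est", "des", "que", "pour", "dans", "une"},
    "es": {"el", "los", "las", "y", "es", "que", "para", "con", "una", "del"},
}


def guess_language(paragraph: str, min_hits: int = 2) -> Optional[str]:
    """Return 'en', 'fr' or 'es', or None when the evidence is too weak."""
    words = re.findall(r"[^\W\d_]+", paragraph.lower())
    scores = {
        lang: sum(word in stopwords for word in words)
        for lang, stopwords in STOPWORDS.items()
    }
    ranked = sorted(scores.values(), reverse=True)
    # Too few hits, or a tie between the two best languages: give up.
    if ranked[0] < min_hits or ranked[0] == ranked[1]:
        return None
    return max(scores, key=scores.get)
```

With this approach a one-word greeting yields no hits at all, and a paragraph that mixes two languages splits its hits between them, so in both cases the guess is unreliable.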

OTHER FEATURES

If you want some words not to be translated, enclose them in “*”, like *CIVIC*

The knowledgebase is a database in which you can record how certain words must be translated, overriding the output of the translation service. For example, it can state that ICT is translated as TIC in French and Spanish, and vice versa

This makes it possible to build a lexicon or linguistic construct in the context of CIVIC and ICT4D
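One possible way to implement both features is to swap the protected words and the knowledgebase terms for placeholders before translation and to put the final text back afterwards. The sketch below assumes that approach; the placeholder format, the in-memory `KNOWLEDGEBASE` dictionary, the function names, and the decision to drop the asterisks from the output are all hypothetical. It also assumes that the translation service leaves the placeholders untouched.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical knowledgebase: (source, target) -> {term: forced translation}.
KNOWLEDGEBASE: Dict[Tuple[str, str], Dict[str, str]] = {
    ("en", "fr"): {"ICT": "TIC"},
    ("en", "es"): {"ICT": "TIC"},
    ("fr", "en"): {"TIC": "ICT"},
    ("es", "en"): {"TIC": "ICT"},
}


def protect(paragraph: str, source: str, target: str) -> Tuple[str, List[str]]:
    """Preprocess: replace protected text with numbered placeholders."""
    kept: List[str] = []

    def stash(text: str) -> str:
        kept.append(text)
        return f"XQ{len(kept) - 1}QX"

    # Words enclosed in asterisks are kept verbatim, e.g. *CIVIC*.
    paragraph = re.sub(r"\*([^*\n]+)\*", lambda m: stash(m.group(1)), paragraph)
    # Knowledgebase terms are stored with their forced translation.
    for term, forced in KNOWLEDGEBASE.get((source, target), {}).items():
        pattern = rf"\b{re.escape(term)}\b"
        paragraph = re.sub(pattern, lambda m, f=forced: stash(f), paragraph)
    return paragraph, kept


def restore(translated: str, kept: List[str]) -> str:
    """Post-process: put the stored text back in place of the placeholders."""
    return re.sub(r"XQ(\d+)QX", lambda m: kept[int(m.group(1))], translated)
```

For example, `protect("*CIVIC* promotes ICT in the Caribbean.", "en", "fr")` hides both CIVIC and ICT from the translation service, and `restore` then inserts CIVIC and TIC into the French text.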

LIMITATIONS

The shorter a paragraph is, the less accurate the guess of its language. So short introductory paragraphs such as greetings or openings, and single-word texts, will usually be translated wrongly or not at all

The current version works only with plain text email messages. The final version will try to convert HTML-formatted emails to plain text before processing them

The utility relies on Babelfish without a formal agreement (since it is free) and uses it in a way for which it was not designed. So it is vulnerable to the slightest change to the Babelfish web site
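For the plain-text limitation above, a conversion step could look like the following sketch, which uses only Python's standard `html.parser` module. It is not the planned implementation: the class and function names, the set of tags treated as paragraph breaks, and the whitespace rules are assumptions.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    BLOCK_TAGS = {"p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n\n")  # block elements start a new paragraph
        elif tag == "br":
            self.parts.append("\n")    # a single line break stays in the paragraph

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_data(self, data):
        if not self._skip_depth:
            # Whitespace in HTML source is not significant: collapse it.
            self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(html: str) -> str:
    """Convert an HTML email body to plain-text paragraphs split by blank lines."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    paragraphs = [" ".join(p.split()) for p in "".join(parser.parts).split("\n\n")]
    return "\n\n".join(p for p in paragraphs if p)
```

The output uses blank lines between paragraphs, so it can be fed to a paragraph-by-paragraph translation step without further changes.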

THINGS TO RESOLVE

The character encoding issues

Who will manage the knowledgebase?

How are words entered into the database?

How is it decided?