AUTOMATIC TRANSLATION UTILITY
Fostering language diversity and participation
Juan Dolio, DR, 11-14 November 2008
Stéphane Bruno, AHTIC/CONSORTIUM [email protected]
LANGUAGE STATS
FACTS
English is the dominant language in CIVIC discussions
Members who are not fluent in English (or do not speak it at all) are reluctant to contribute
Manual (human) translation of all email and forum communications is impractical and far too costly
Systematic human translation would also delay interactions
CIVIC APPROACH TO LANGUAGE DIVERSITY
Three official languages: English, French, Spanish
All documents and “official” communications are translated into all three languages (the original-language document being the legally binding one?)
Simultaneous translation is provided for plenary sessions in face-to-face meetings when the size of a language group and its needs justify the cost
Automatic translation of emails is provided to facilitate comprehension and contribution by all language groups
OBJECTIVES OF THE AUTOMATIC TRANSLATION
Provide the opportunity for all members to get the essence of all communications in all three official CIVIC languages
Make the translation non-disruptive, as seamless and as user-friendly as possible
Allow the translation to improve over time
Construct a contextual terminology and linguistic environment for CIVIC in its field of intervention
HOW IT WORKS
THE TRANSLATION MECHANISMS
When a mail arrives, the software breaks the email into paragraphs
The software tries to guess the language of each paragraph
If it cannot guess the language, it assumes it is English
Then the software preprocesses the paragraph through the knowledgebase
Then each paragraph is sent to the translation service (Babelfish) and the result is retrieved for each language pair
The resulting paragraph is post-processed
Then the email is reconstructed and sent to the mailing list manager
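The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the utility's actual code: the stopword-based `guess_language`, the `STOPWORDS` lists, and the `translate(text, src, target)` callable standing in for the Babelfish request are all hypothetical, and the knowledgebase steps are reduced to optional `preprocess` and `postprocess` hooks.

```python
import re
from typing import Callable, Dict, List

LANGS = ("en", "fr", "es")  # the three official CIVIC languages

# Tiny illustrative stopword lists (assumption); a real detector needs far more data.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it", "for", "with"},
    "fr": {"le", "la", "les", "et", "est", "des", "une", "dans", "pour", "que"},
    "es": {"el", "los", "las", "y", "es", "una", "para", "que", "con", "del"},
}

Translator = Callable[[str, str, str], str]  # (text, source, target) -> translated text


def split_paragraphs(body: str) -> List[str]:
    """Split a plain-text body into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]


def guess_language(paragraph: str, default: str = "en") -> str:
    """Pick the language with the most stopword hits; assume English if there are none."""
    words = re.findall(r"[^\W\d_]+", paragraph.lower())
    scores = {lang: sum(w in stop for w in words) for lang, stop in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


def translate_email(
    body: str,
    translate: Translator,
    preprocess: Callable[[str], str] = lambda p: p,
    postprocess: Callable[[str], str] = lambda p: p,
) -> Dict[str, str]:
    """Return one reconstructed message body per official language."""
    out: Dict[str, List[str]] = {lang: [] for lang in LANGS}
    for para in split_paragraphs(body):
        src = guess_language(para)
        prepared = preprocess(para)  # knowledgebase step (placeholder hook)
        for target in LANGS:
            if target == src:
                out[target].append(para)  # already in this language
            else:
                out[target].append(postprocess(translate(prepared, src, target)))
    return {lang: "\n\n".join(parts) for lang, parts in out.items()}
```

Passing the translator in as a function keeps the pipeline independent of any one translation service, so a stub can be used for testing and the service can be swapped without touching the rest.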
INPUT REQUIREMENTS
Use simple language constructs
Use complete sentences and correct grammar and syntax
Avoid abbreviations, metaphors and idiomatic expressions
Avoid proverbs and sayings
Do not mix languages in the same paragraph (as translation is done paragraph by paragraph, and the language is guessed)
OTHER FEATURES
If you want some words not to be translated, enclose them in asterisks, like *CIVIC*
The knowledgebase is a database that records how certain words must be translated, overriding the output of the translation service. For example, it can state that ICT is translated as TIC in French and Spanish, and vice versa
This makes it possible to build a lexicon and linguistic constructs specific to the context of CIVIC and ICT4D
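One possible way to implement both features is to swap protected words and knowledgebase terms for opaque placeholders before calling the translation service, then restore them afterwards. The sketch below is an assumption about how this could work, not a description of the utility itself: the `GLOSSARY` table, the `XQZ…QZX` placeholder format, and the `translate` callable are hypothetical, and it assumes the service passes the placeholders through unchanged. Dropping the asterisks in the output is also an assumption.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical knowledgebase: forced translations per (source, target) pair.
GLOSSARY: Dict[Tuple[str, str], Dict[str, str]] = {
    ("en", "fr"): {"ICT": "TIC"},
    ("en", "es"): {"ICT": "TIC"},
    ("fr", "en"): {"TIC": "ICT"},
    ("es", "en"): {"TIC": "ICT"},
}

PROTECTED = re.compile(r"\*([^*\n]+)\*")  # words enclosed in asterisks, like *CIVIC*


def translate_with_overrides(
    text: str, src: str, target: str, translate: Callable[[str, str, str], str]
) -> str:
    kept: List[str] = []

    def keep(value: str) -> str:
        """Store the final text for a span and return an opaque placeholder."""
        kept.append(value)
        return f"XQZ{len(kept) - 1}QZX"

    # Preprocessing 1: *word* spans are kept verbatim (asterisks dropped).
    masked = PROTECTED.sub(lambda m: keep(m.group(1)), text)

    # Preprocessing 2: knowledgebase terms are replaced by their forced translation.
    for term, forced in GLOSSARY.get((src, target), {}).items():
        pattern = rf"\b{re.escape(term)}\b"
        masked = re.sub(pattern, lambda m, f=forced: keep(f), masked)

    translated = translate(masked, src, target)

    # Post-processing: put the stored text back in place of each placeholder.
    return re.sub(r"XQZ(\d+)QZX", lambda m: kept[int(m.group(1))], translated)
```

With a stub translator that only turns "promotes" into "promeut", `translate_with_overrides("*CIVIC* promotes ICT", "en", "fr", stub)` yields "CIVIC promeut TIC".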
LIMITATIONS
The shorter a paragraph is, the less accurately its language is guessed. So short introductory paragraphs such as greetings or openings, and single-word texts, will usually be translated wrongly or not at all
The current version works only with plain-text email messages. The final version will try to convert HTML-formatted emails to plain text before processing them
The utility relies on Babelfish without a formal agreement (since it is free) and uses it in a way for which it was not designed. So it is vulnerable to the slightest change on the Babelfish web site
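For the planned HTML-to-plain-text step, a minimal sketch using only Python's standard `html.parser` module might look as follows. It is an illustration, not the utility's implementation: the choice of block-level tags and the rule that each one starts a new paragraph are assumptions made so that the output is already split on blank lines, as paragraph-by-paragraph translation requires.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    # Tags assumed to delimit paragraphs (illustrative choice).
    BLOCK_TAGS = {"p", "div", "li", "tr", "h1", "h2", "h3", "blockquote"}
    # Tags whose content is never shown to the reader.
    SKIP_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)  # decode &amp; and similar entities
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip += 1
        elif tag == "br":
            self.parts.append("\n")  # line break inside a paragraph
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML body as blank-line-separated paragraphs."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse whitespace inside each paragraph and drop empty paragraphs.
    paragraphs = [" ".join(p.split()) for p in re.split(r"\n\s*\n", text)]
    return "\n\n".join(p for p in paragraphs if p)
```

Inline markup such as bold or links is simply dropped, which loses formatting but keeps the text that needs translating.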
THINGS TO RESOLVE
The character encoding issues
Who will manage the knowledgebase?
How are words entered into the database?
How is it decided?