Real-time Direct Translation System for Sinhala and Tamil Languages

Authors: Rajpirathap S, Sheeyam S, Umasuthan K, Amalraj Chelvarajah
AITM’15 / FedCSIS’15


ITRU Symposium Presentation 2014


Machine Translation
- Automatic translation from one language to another using computing devices and algorithms.

Machine Translation Approaches
- Transfer based approach
- Interlingua approach
- Direct approach
- Example based approach
- Statistical based approach

Why Statistical Approach?
- Proved to be more efficient
- Shorter development time
- Lots of standard algorithms exist
- Supportive tools available
- Effective for large text translation
- Few linguistic assumptions

Goal
- To develop a real-time machine translation system that enables effective communication between Sinhala and Tamil speakers, addressing the language barrier in Sri Lanka.

Problem & Solution
Problem
- People face language barriers when communicating in their native languages, and translation systems are unavailable, especially for Tamil and Sinhala.
- No real-time communication system translates automatically between this language pair, especially in the informal domain.
Solution
- Build our own instant communication system that enables effective communication between Sinhala and Tamil speakers, addressing the language barrier in Sri Lanka.

Objective
- Develop a bi-directional translation system for Sinhala and Tamil that can be used for communication purposes.

Scope
- Translate Sinhala text to Tamil text and vice versa
- Translation output depends on the type of language corpora used to build the system
- Accessible to the public

Others' Work
- Sinhala Word Net Project
- The BEES project
- Example Based Machine Translation for English-Sinhala Translation system
- TransFire iPhone application
- UCSC projects on Statistical Machine Translation
- Translators supported by ICTA

Comparison of existing Machine Translation Approaches
Projects compared:
- Word Net Project
- Example Based MT
- TransFire Application
- UCSC SMT Research
- Our SMT Research
Features compared:
- Sinhala to Tamil Translation
- Tamil to Sinhala Translation
- Chat Feature
- Transliteration
- Handle large text translations

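Concept of SMT
Assumption: a Sinhala sentence s can be translated into any of the candidate Tamil sentences t1, t2, t3, ..., tn, and each candidate ti has a probability pi of being the translation of s.

Concept of SMT
- Select the Tamil sentence that has the maximum probability.
- If p3 is greater than p1, p2, ..., pn, then sentence t3 is taken as the translation of sentence s.
- The notation is:

$$\hat{t} = \arg\max_{t} \, p(t \mid s)$$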

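Concept of SMT
- Using Bayes' theorem:

$$p(t \mid s) = \frac{p(s \mid t)\, p(t)}{p(s)}$$

- As s is fixed, p(s) can be removed:

$$\hat{t} = \arg\max_{t} \, p(t)\, p(s \mid t)$$

Concept of SMT
- Language Model: p(t)
- Translation Model: p(s | t)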
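A minimal sketch of the selection rule above, assuming each candidate Tamil sentence already carries a language model probability p(t) and a translation model probability p(s | t). The candidate names and all numbers are made up for illustration and are not taken from the system.

```python
import math

def best_translation(candidates):
    """Return the candidate t that maximizes log p(t) + log p(s | t)."""
    return max(candidates, key=lambda c: math.log(c["lm"]) + math.log(c["tm"]))

# Hypothetical candidates for one Sinhala sentence s (illustrative values only).
candidates = [
    {"text": "t1", "lm": 0.020, "tm": 0.10},
    {"text": "t2", "lm": 0.010, "tm": 0.30},
    {"text": "t3", "lm": 0.015, "tm": 0.40},
]

best = best_translation(candidates)
print(best["text"])
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which is why decoders usually add log scores instead of multiplying probabilities.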

Components of an SMT system
- Parallel Corpus (Data Preparation)
- Language Model
- Translation Model
- Decoder

Data Preparation
- We used over 6,000 phrases from each language to train the system: more than 12,000 sentences and more than 120,000 words in total.

Data Preparation
- Discussions of various ministry affairs
- Discussions on road development affairs
- Discussions on financial developments and issues
- Discussions on general public issues
- Administrative data
- 1500+ parallel texts of informal language

Data Preparation

Data Sets    Training Set      Tuning Set        Testing Set
             Sin      Tam      Sin      Tam      Sin      Tam
Words        99k      78k      3425     3078     3110     3204
Phrases      5887     5887     200      200      200      200

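Language Model
- Standard n-gram language model
- A probability value is assigned to every sentence.
- A conditional distribution predicts the i-th word in a sequence, given the identities of all previous words.

Consider a sentence s as s = { w1, w2, ..., wn }:

$$p(s) = \prod_{i=1}^{n} p(w_i \mid w_1, \dots, w_{i-1})$$

N-gram approximation (each word depends only on the previous N-1 words):

$$p(s) \approx \prod_{i=1}^{n} p(w_i \mid w_{i-N+1}, \dots, w_{i-1})$$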
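The n-gram approximation can be sketched for the bigram case (N = 2) with plain maximum likelihood counts. This is an illustrative toy, not the language modeling tool used in the project; the two training sentences are made up from the romanized Tamil example words used in this deck.

```python
from collections import Counter

def train_bigram_counts(sentences):
    """Count bigrams and their one-word histories, with sentence boundary markers."""
    histories, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        histories.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return histories, bigrams

def bigram_prob(histories, bigrams, prev, word):
    """Maximum likelihood estimate p(word | prev) = count(prev, word) / count(prev)."""
    if histories[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / histories[prev]

def sentence_prob(histories, bigrams, sentence):
    """Product of bigram probabilities over the whole sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        prob *= bigram_prob(histories, bigrams, prev, word)
    return prob

# Toy corpus (assumption: illustrative sentences only).
corpus = ["naan paadasaalaikku pokiren", "naan pokiren"]
histories, bigrams = train_bigram_counts(corpus)

seen = sentence_prob(histories, bigrams, "naan pokiren")
unseen = sentence_prob(histories, bigrams, "pokiren naan")
print(seen, unseen)
```

The word order never observed in training receives probability zero, which is the problem that smoothing addresses.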

Smoothing
- Without smoothing, only the word sequences seen in the corpus are assigned non-zero probabilities, and all unseen word sequences are assigned zero.
- Smoothing allocates some probability mass to unseen word sequences by discounting the probabilities of seen word sequences.

Smoothing Algorithms
- Add smoothing
- Witten-Bell discounting
- Natural discounting
- Ney's absolute discounting
- Kneser-Ney discounting
- Good-Turing discounting
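As a sketch of the simplest technique in the list, additive (add-k) smoothing adds a constant k to every bigram count so that unseen bigrams receive a small non-zero probability. The function name, the toy sentence and k = 1 are assumptions for illustration; the other discounting methods redistribute mass differently.

```python
from collections import Counter

def add_k_bigram_prob(bigrams, histories, vocab_size, prev, word, k=1.0):
    """Add-k estimate: (count(prev, word) + k) / (count(prev) + k * |V|)."""
    return (bigrams[(prev, word)] + k) / (histories[prev] + k * vocab_size)

# Toy training data (illustrative only).
tokens = "<s> naan paadasaalaikku pokiren </s>".split()
bigrams = Counter(zip(tokens[:-1], tokens[1:]))
histories = Counter(tokens[:-1])
vocab = set(tokens[1:])  # every word that can be predicted, including </s>

p_seen = add_k_bigram_prob(bigrams, histories, len(vocab), "naan", "paadasaalaikku")
p_unseen = add_k_bigram_prob(bigrams, histories, len(vocab), "naan", "pokiren")
print(p_seen, p_unseen)
```

The seen bigram loses some probability compared with its unsmoothed estimate of 1.0, and that mass is shared among the unseen continuations, so the distribution over the vocabulary still sums to one.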

Translation Model
- Useful in checking whether a target language sentence is a proper translation of a source language sentence
- P(S | T): probability of the source sentence (s) given the target sentence (t)

Translation Models
- IBM Model 1
- IBM Model 2
- IBM Model 3
- IBM Model 4
- IBM Model 5
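To illustrate the first model in the list, here is a compact sketch of IBM Model 1 training with expectation-maximization, estimating word translation probabilities t(source word | target word). It omits the NULL word and is not the training tool used in the project; the sentence pairs reuse the romanized example words from this deck, and the two shorter pairs are invented for illustration.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=10):
    """Estimate t[(source_word, target_word)] = p(source_word | target_word) with EM."""
    src_vocab = {w for src, _ in pairs for w in src}
    t = defaultdict(lambda: 1.0 / len(src_vocab))  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: collect expected alignment counts.
        for src, tgt in pairs:
            for s in src:
                norm = sum(t[(s, w)] for w in tgt)
                for w in tgt:
                    frac = t[(s, w)] / norm
                    count[(s, w)] += frac
                    total[w] += frac
        # M-step: renormalize the counts per target word.
        for (s, w), c in count.items():
            t[(s, w)] = c / total[w]
    return t

# Toy parallel data: (Sinhala tokens, Tamil tokens).
pairs = [
    ("mama paasalata yanawa".split(), "naan paadasaalaikku pokiren".split()),
    ("mama yanawa".split(), "naan pokiren".split()),
    ("mama".split(), "naan".split()),
]
t = train_ibm_model1(pairs)
print(round(t[("mama", "naan")], 3))
```

Even on three sentence pairs, EM concentrates probability on the word pairs that consistently co-occur, which is the basis for the word alignments used in later training steps.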

Translation Modeling Process
- Calculate lexical translation probabilities
- Generate phrase extraction file
- Score extracted phrases
- Lexical weighting
- Word penalty
- Phrase penalty
- Build re-ordering model

Word Alignment
Sinhala: Mama Paasalata Yanawa
Tamil: Naan Paadasaalaikku Pokiren
Word alignments: 1-1, 2-2, 3-3

Word Alignment Algorithms
- Union
- Intersection
- Grow
- Grow-Diagonal
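The Union and Intersection heuristics combine two directional word alignments (source to target and target to source). A minimal sketch follows, assuming alignments are lists of 1-based position pairs as in the example above; the backward alignment is hypothetical and chosen to disagree on one link.

```python
def symmetrize(src_to_tgt, tgt_to_src):
    """Return (intersection, union) of two directional alignments.

    src_to_tgt holds (source_pos, target_pos) pairs;
    tgt_to_src holds (target_pos, source_pos) pairs and is flipped first.
    """
    forward = set(src_to_tgt)
    backward = {(s, t) for (t, s) in tgt_to_src}
    return forward & backward, forward | backward

# Forward alignment from the slide example: 1-1, 2-2, 3-3.
forward = [(1, 1), (2, 2), (3, 3)]
# Hypothetical backward alignment: target word 3 is linked to source word 2.
backward = [(1, 1), (2, 2), (3, 2)]

intersection, union = symmetrize(forward, backward)
print(sorted(intersection))
print(sorted(union))
```

Intersection keeps only the links both directions agree on (high precision), while Union keeps every link from either direction (high recall). The Grow and Grow-Diagonal heuristics are commonly described as starting from the intersection and adding neighbouring links taken from the union.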

Decoder
- Efficient searching
- Given the language and translation models, the decoder searches for the most probable target sentence for a given source sentence.

Decoder
- Beam Search
- Minimum Bayes Risk decoding
- Lattice MBR
- Consensus decoding
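Beam search can be sketched in a heavily simplified form: translate word by word, left to right, and keep only the best few partial hypotheses after each step. This toy version has no phrases and no re-ordering, so it is far simpler than a real SMT decoder. The lexicon, the bigram language model, the placeholder words x1, x2, x3 and every probability are assumptions for illustration.

```python
import math

def beam_decode(source, lexicon, lm, beam_size=2):
    """Monotone word-by-word beam search over a toy lexicon and bigram LM."""
    beam = [(0.0, ["<s>"])]  # (log-probability, partial target sentence)
    for word in source:
        expanded = []
        for score, hyp in beam:
            for target, tm_prob in lexicon[word].items():
                lm_prob = lm.get((hyp[-1], target), 1e-6)  # small floor for unseen bigrams
                new_score = score + math.log(tm_prob) + math.log(lm_prob)
                expanded.append((new_score, hyp + [target]))
        # Prune: keep only the best `beam_size` partial hypotheses.
        beam = sorted(expanded, key=lambda h: h[0], reverse=True)[:beam_size]
    return beam[0][1][1:]  # drop the <s> marker

# Toy models (illustrative values; x1, x2, x3 are placeholder wrong translations).
lexicon = {
    "mama": {"naan": 0.9, "x1": 0.1},
    "paasalata": {"paadasaalaikku": 0.8, "x2": 0.2},
    "yanawa": {"pokiren": 0.7, "x3": 0.3},
}
lm = {
    ("<s>", "naan"): 0.5,
    ("naan", "paadasaalaikku"): 0.4,
    ("paadasaalaikku", "pokiren"): 0.6,
}

output = beam_decode(["mama", "paasalata", "yanawa"], lexicon, lm)
print(" ".join(output))
```

The beam size trades speed for search accuracy: a larger beam explores more hypotheses but takes longer, which matters for a real-time chat setting.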

Our SMT Project Outline
[Diagram labels: Client 1 (Sinhala), Client 2 (Tamil), translation model in both ways, Sinhala text, Tamil text, training corpus, training, corrected Tamil output, corrected Sinhala output]

Design
- Architecture Design
- Implementation Design
- Evaluation Design

Architecture Design (SMT)
- Data preparation
- Language Modeling
- Translation Modeling
- MERT Tuning
- Decoding
- Evaluation (BLEU & NIST)

[Diagram labels: users, input, application interface, formatted input, output, corrections]

Implementation Design (SMT)
[Diagram labels: Client Application 1 (History File), Client Application 2 (History File), Training Corpus (Sinhala sentences, Tamil sentences), SMT System (LM, TM, Decoder, Tuner, Evaluator)]

Implementation (SMT)

- We developed a bi-directional translation system for Sinhala and Tamil.
- We developed a chat application that supports Sinhala and Tamil translation.
- Technologies used: Java, C++ and Perl.
- Language modeling and translation modeling algorithms are integrated.
- The decoder is built on integrated decoding algorithms.
- A parallel corpus for the chat domain was created (2000 parallel lines).
- Parallel corpora of Parliament order papers are used to build the LM and TM.

Our contributions as developers

- Trainer application interface implementation
- Chat application implementation
- Parameter optimizations (n-gram, discounting, word alignment, lexical re-ordering)
- Creation of corpus
- Decoder implementation
- Tokenizer improvements and implementation
- Transliteration feature implementation

Evaluation Design/Strategy

Language Model Evaluation Strategy
- Order adjustment (2, 3, 4)
- Smoothing/Discounting
  - Add smoothing
  - Ney's absolute
  - Kneser-Ney
  - Natural discounting
  - Witten-Bell
- Interpolation or not

Translation Model Evaluation Strategy
- Resulting language models
- Word Alignment
  - Intersect
  - Grow-Diagonal
  - Grow-Diag-Final
  - Union
- Re-Ordering
  - MSD-bidirectional
  - MSD
  - Monotonicity-bidirectional
  - Monotonicity

Selected Combinations of LM and TMs

Decoder Evaluation Strategy
- Decoding Algorithm
  - Beam Search
  - Minimum Bayes Risk
  - Lattice MBR
  - Consensus
- Limit on distortion (-1, 0, 6, 10, 20)
- BEST System Configurations

Evaluation
- Evaluation to find Optimal System Parameters
- User Evaluation
- Manual Evaluation
- Corpus Evaluation

Evaluation to find Optimal System Parameters for the translation system

- N-gram order (3 parameters)
- Smoothing (6 techniques)
- Word Alignment & Re-ordering (16 combinations)
- Decoding (4 algorithms)
- Language Modeling (18 experiments, select the best 2)
- Translation Modeling (32 experiments, select the best 2)
- Decoding (8 experiments, select the best 1)
- (~60 experiments per system) * 2 = 120 experiments
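The experiment counts can be reproduced with a short calculation. One reading is assumed here and is not stated explicitly on the slide: each stage is run once for every configuration kept from the previous stage, so 16 alignment and re-ordering combinations times the best 2 language models gives 32, and 4 decoding algorithms times the best 2 translation models gives 8.

```python
ngram_orders = 3
smoothing_techniques = 6
alignment_reordering_combinations = 16
decoding_algorithms = 4
best_kept = 2  # configurations carried forward from the previous stage (assumed)

lm_experiments = ngram_orders * smoothing_techniques            # 3 * 6
tm_experiments = alignment_reordering_combinations * best_kept  # 16 * 2
decoding_experiments = decoding_algorithms * best_kept          # 4 * 2

per_system = lm_experiments + tm_experiments + decoding_experiments
both_systems = per_system * 2  # Sinhala to Tamil and Tamil to Sinhala

print(lm_experiments, tm_experiments, decoding_experiments, per_system, both_systems)
```

Under this reading the exact totals are 58 experiments per system and 116 for both directions, which the slide rounds to about 60 and 120.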

Optimal System Evaluation Scores

Systems    Sinhala - Tamil    Tamil - Sinhala
BLEU       0.5957             0.6693
NIST       4.4182             4.8563
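For readers unfamiliar with the metric, BLEU can be sketched as the geometric mean of clipped n-gram precisions multiplied by a brevity penalty. The version below is a simplified single-sentence, single-reference variant, not the evaluation script used to obtain the scores above; the token names w1 to w5 are placeholders.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Single-reference BLEU with uniform weights and no smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    if len(candidate) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

reference = ["w1", "w2", "w3", "w4", "w5"]
exact = bleu(reference, reference)
shorter = bleu(["w1", "w2", "w3", "w4"], reference)
print(exact, shorter)
```

A perfect match scores 1.0, and a candidate that is correct but too short is penalized only through the brevity penalty. Reported scores are normally computed over the whole test set, not per sentence.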

Manual Evaluation (SMT)

Systems             Sinhala - Tamil    Tamil - Sinhala
Source Words        8                  12
Translated Words    6                  10
Missed Words        2                  2
Translated (%)      75                 83.33

Evaluated the systems with 15 translations each

User Evaluation (SMT)
- Questionnaires were distributed to evaluate the final system.
- Demo videos were prepared for use in the evaluation.
- IT and Engineering students were our participants.
- User feedback was considered and corrective actions were taken.
- Achieved an overall rating of 3.8 out of 5.

Corpus Evaluation (SMT)
- Evaluated the systems with different numbers of training phrases

                     BLEU score
Number of phrases    Sin - Tam    Tam - Sin
510                  0.136982     0.235689
1100                 0.269845     0.296534
1650                 0.264782     0.356484
2125                 0.386532     0.498975
3060                 0.402356     0.543265
4002                 0.453256     0.576325
5020                 0.492356     0.592356
5697                 0.523542     0.623564
6697                 0.549303     0.642535

MERT Tuning
- MERT: Minimum Error Rate Training
- Capable of adjusting many parameters
- Attempts to address the loose relation between maximum likelihood training and the final translation quality on unseen text
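The idea behind tuning can be illustrated with a much simpler stand-in for MERT: a grid search over a single language model weight that picks the value giving the best quality on a development set of n-best lists. Real MERT tunes many weights with an efficient line search, so this is only a sketch. The feature values, the quality scores and all function names are hypothetical.

```python
def rerank(nbest, weights):
    """Pick the hypothesis with the highest weighted sum of feature scores."""
    return max(nbest, key=lambda h: sum(w * f for w, f in zip(weights, h["features"])))

def tune_lm_weight(dev_set, grid):
    """Grid search: choose the LM weight whose top hypotheses have the best mean quality."""
    best_weight, best_score = None, -1.0
    for lm_weight in grid:
        weights = (lm_weight, 1.0 - lm_weight)  # (LM weight, TM weight)
        picked = [rerank(nbest, weights) for nbest in dev_set]
        score = sum(h["quality"] for h in picked) / len(dev_set)
        if score > best_score:
            best_weight, best_score = lm_weight, score
    return best_weight, best_score

# One development sentence with two hypotheses.
# features = (LM log-score, TM log-score); quality = a precomputed metric score (made up).
dev_set = [[
    {"features": (-1.0, -5.0), "quality": 0.9},
    {"features": (-4.0, -1.0), "quality": 0.3},
]]

best_weight, best_score = tune_lm_weight(dev_set, [0.0, 0.25, 0.5, 0.75, 1.0])
print(best_weight, best_score)
```

The weights are chosen by the evaluation metric directly, not by likelihood, which is how error rate training ties the model parameters to final translation quality.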

Tuned System Evaluation

Systems    Sinhala - Tamil    Tamil - Sinhala
BLEU       0.5957             0.6693
NIST       4.4182             4.8563

Final Conclusions
- Word Alignment [ BEST: Grow-Diag, Worst: Union ]
- Re-Ordering [ BEST: MSD ]
- Decoding Algorithm [ Beam Search, Lattice MBR (TAM to SIN) and Consensus (TAM to SIN) ]

Final outputs

Sinhala to Tamil

Final outputs

Tamil to Sinhala

Improvements over past/existing research
- Improved BLEU and NIST scores
- Auto training of the system
- Highly accurate translations
- Everyday conversations used in training to support the chat domain
- Improved performance and translation times
- Up-to-date tools and techniques
- Support for large data sets
- Improved data preparation techniques

Future Work
- Cloud hosting
- Develop an API so developers can use this system as a service
- Enable public contribution to data preparation
- Improve translation quality
- Adopt new techniques/algorithms
- Specialize in wider domains
- Use newly available evaluation metrics

Demonstration

Translation Enabled Chat Application

Thank you