Machine translation Context-based approach Lucia Otoyo.

Machine translation: Context-based approach

Lucia Otoyo

Machine translation

Computerized task of translating from one natural language to another

• Human vs. machine translation
• Difficulties of MT

Brief history of MT

• 17th century: Descartes & Leibniz
• 1930s: bilingual dictionary + rules
• After the war: Warren Weaver, translation as decoding a message
• 1954: first public demonstration of MT (IBM), which spawned research
• 1966: ALPAC report, MT less accurate and more costly than human translation
• 1980s: increasing demand, rule-based approach born
• 1990s: parallel corpora approach

MT approaches

Rule based

Parallel corpora based

Context based

Conclusion

Rule Based approach

• Dominant in the 1980s
• Resources: a set of rules and a bilingual dictionary
• Steps:
- Syntax: grammar
- Semantics: meaning
- Pragmatics: differences between languages
• Disadvantages:
- language experts are needed to write the rules
- each new language pair needs new rules
- not possible to include all the rules
- rules have exceptions

MT diagram

Parallel corpora based

• Example based (word frequency and combination)
• Statistical (phrase extraction and combination)
• Resources: parallel corpora (pre-translated), decoder, alignment software
• Steps: disassemble the text into phrases, search the corpora and match phrases, substitute, align phrases to form the text
• Advantages vs. disadvantages:
- easy to apply to a new language
- more readable, as it uses human pre-translated text
- general translation vs. specific domain
- lexical ambiguity

MT diagram

Context Based MT

Diagram labels (CBMT architecture):
• Input and output: Source Language, Target Language
• Processing: N-gram segmenter, N-gram builder, Flooder, Edge locker, Synonym generator, Overlap-based decoder, N-gram connector
• Data passed between components: N-gram candidates, Substitution request, Stored n-gram pairs, Approved n-gram pairs
• Databases: Cache database, Cross-language n-gram database
• Resources: Bilingual dictionary, Target corpora, Source corpora, Gazetteers

MT diagram

CBMT edge illustration

‘This context based machine translation approach looks very interesting’.

1. ‘This context based machine’

2. ‘context based machine translation’

3. ‘based machine translation approach’

4. ‘machine translation approach looks’

5. ‘translation approach looks very’

6. ‘approach looks very interesting’

edge locking

CBMT n-grams

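Break down the source text into n-grams (n = 4-8):
‘This context based machine translation approach looks very interesting’.

• If n = 4, the n-grams are as follows:
1. ‘This context based machine’
2. ‘context based machine translation’
3. ‘based machine translation approach’
4. ‘machine translation approach looks’
5. ‘translation approach looks very’
6. ‘approach looks very interesting’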

diagram
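The segmentation step can be sketched as a sliding window over the words of the source sentence. This is a minimal illustration, not the presenter's implementation; the function name `segment_ngrams` and the whitespace tokenization are assumptions.

```python
def segment_ngrams(text, n=4):
    """Return all contiguous n-word windows of `text`, left to right."""
    words = text.split()  # assumption: plain whitespace tokenization
    if len(words) < n:
        # Text shorter than n words: return it as a single (short) n-gram.
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "This context based machine translation approach looks very interesting"
for number, gram in enumerate(segment_ngrams(sentence, n=4), start=1):
    print(number, gram)
```

With n = 4 the nine-word example sentence yields six n-grams, each sharing three words with its neighbour.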

CBMT Flooding

• Search the monolingual corpora with the translated n-grams
• This produces a large number of n-grams, with different translations for each word
• Words can be in any order, to account for differences between languages
• Each n-gram yields 100-3000 high-density matches
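One way to picture flooding is a window scan over the target corpus that counts how many source words have at least one dictionary translation inside the window, ignoring word order. The sketch below is a toy under stated assumptions: the function `flood`, the `window` and `min_density` parameters, the density formula, and the sample data are all hypothetical and not taken from the slides.

```python
def flood(candidates, corpus_tokens, window=6, min_density=0.75):
    """Find corpus windows containing translations of many source words.

    candidates: list of sets, one set of possible target words per source word.
    Returns (density, window_text) pairs, highest density first.
    """
    if not candidates:
        return []
    hits = []
    for start in range(max(1, len(corpus_tokens) - window + 1)):
        span = corpus_tokens[start:start + window]
        span_set = set(span)
        # A source word counts as matched if any of its translations
        # occurs anywhere in the window (word order is ignored).
        matched = sum(1 for options in candidates if options & span_set)
        density = matched / len(candidates)
        if density >= min_density:
            hits.append((density, " ".join(span)))
    hits.sort(key=lambda hit: -hit[0])
    return hits


# Hypothetical dictionary output for a 4-word source n-gram.
candidates = [{"approach", "method"}, {"translation"},
              {"machine", "automatic"}, {"context"}]
corpus = "the context based machine translation approach looks promising".split()
for density, text in flood(candidates, corpus, window=5):
    print(density, text)
```

A real target corpus would need an index rather than a linear scan; the loop is kept only to make the density idea visible.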

diagram

CBMT Target language lattice overlap maximization

• Align all the n-grams with each other
• Choose the ones with the highest number of left- and right-side overlaps
• Eliminate non-overlapping or partially overlapping n-grams
• 1. n-gram ‘This approach for computer’
• 2. n-gram ‘This context based machine’
• 3. n-gram ‘based machine translation approach’
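The elimination step can be illustrated with the three example n-grams. The sketch below scores each n-gram by the total number of words it shares with other n-grams on its left and right sides and drops those with no overlap. Scoring by summed overlap length is an assumption made for this illustration, and the helper names are hypothetical.

```python
def overlap_len(left, right):
    """Length of the longest proper suffix of `left` that is a prefix of `right`."""
    for k in range(min(len(left), len(right)) - 1, 0, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def score_overlaps(ngrams):
    """Map each n-gram to its total left-side plus right-side overlap."""
    token_lists = [gram.split() for gram in ngrams]
    scores = {}
    for i, tokens in enumerate(token_lists):
        total = 0
        for j, other in enumerate(token_lists):
            if i != j:
                total += overlap_len(tokens, other)  # right-side overlap
                total += overlap_len(other, tokens)  # left-side overlap
        scores[ngrams[i]] = total
    return scores


def keep_overlapping(ngrams):
    """Drop n-grams that overlap with no other n-gram."""
    scores = score_overlaps(ngrams)
    return [gram for gram in ngrams if scores[gram] > 0]


examples = [
    "This approach for computer",
    "This context based machine",
    "based machine translation approach",
]
print(keep_overlapping(examples))
```

Here the second and third n-grams overlap on ‘based machine’, while the first overlaps with neither and is eliminated.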

diagram

CBMT Cross language database

• Stores cross-language n-gram correspondences for later use
• Speeds up the translation process
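The idea of storing correspondences for reuse can be sketched as a simple lookup cache in front of the expensive translation step. The class `NgramCache` and the stand-in `toy_translate` are hypothetical names for this illustration; the slides do not describe how the database is actually implemented.

```python
class NgramCache:
    """Remember source-to-target n-gram correspondences between lookups."""

    def __init__(self, translate_fn):
        self._translate = translate_fn  # the expensive step being cached
        self._store = {}
        self.misses = 0

    def lookup(self, source_ngram):
        if source_ngram not in self._store:
            # Not seen before: run the full translation and remember the result.
            self.misses += 1
            self._store[source_ngram] = self._translate(source_ngram)
        return self._store[source_ngram]


def toy_translate(ngram):
    # Hypothetical stand-in for the flooding and decoding pipeline.
    return ngram.upper()


cache = NgramCache(toy_translate)
first = cache.lookup("context based machine translation")
second = cache.lookup("context based machine translation")
print(first == second, cache.misses)
```

The second lookup is served from the store, so the expensive step runs only once per distinct n-gram.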

diagram

CBMT target language

• Find globally longest target language overlap with the highest match density






6. ‘approach looks very interesting’
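Once a chain of overlapping n-grams has been chosen, it can be merged into a single word sequence. The sketch below shows only that merging step, using the English example sentence for readability; selecting the globally longest, densest chain is not covered, and the function name `stitch` is an assumption.

```python
def stitch(ngrams):
    """Merge an ordered chain of overlapping n-grams into one sentence."""
    if not ngrams:
        return ""
    result = ngrams[0].split()
    for gram in ngrams[1:]:
        words = gram.split()
        # Find the longest suffix of the result that is a prefix of this n-gram.
        k = min(len(result), len(words))
        while k > 0 and result[-k:] != words[:k]:
            k -= 1
        if k == 0:
            raise ValueError(f"no overlap with: {gram!r}")
        result.extend(words[k:])
    return " ".join(result)


chain = [
    "This context based machine",
    "context based machine translation",
    "based machine translation approach",
    "machine translation approach looks",
    "translation approach looks very",
    "approach looks very interesting",
]
print(stitch(chain))
```

Each n-gram contributes only the words that extend past its overlap with the text built so far.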


diagram

CBMT – synonymy

• Word and phrasal synonymy
- increases accuracy if no overlaps or only partial overlaps are found
- dynamic synonyms, no predefined coded patterns

Stages:
1. Search for the word in the corpus (1,000-100,000 context-related phrases)
• 1. ‘This establishment was founded in the year’
• 2. ‘The number of people working in the establishment is far greater than’
• 3. ‘The establishment is the first hotel’, etc.

CBMT – synonymy cont.

2. Search the corpus with only the phrases (the word itself left blank)
• 1. ‘This ________ was founded in the year’
• 2. ‘The number of people working in the _______ is far greater than’
• 3. ‘The ________ is the first hotel’, etc.

3. This may return:
• 1. ‘This company was founded in the year’
• 2. ‘The number of people working in the business is far greater than’
• 3. ‘The institution is the first hotel’, etc.

4. Rank the synonyms according to various criteria and use them for flooding
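The four stages can be sketched on a toy corpus: collect the contexts of the word, blank the word out, find other words that fill the same gap, and rank them. Ranking by the number of shared contexts is an assumption (the slides only say "various criteria"), and the function name and sample sentences are hypothetical.

```python
from collections import Counter


def find_synonyms(word, sentences):
    """Rank words that occur in the same full-sentence contexts as `word`."""
    tokenized = [sentence.split() for sentence in sentences]
    # Stages 1-2: each sentence containing `word` becomes a pattern with a gap.
    patterns = set()
    for tokens in tokenized:
        for i, token in enumerate(tokens):
            if token == word:
                patterns.add((tuple(tokens[:i]), tuple(tokens[i + 1:])))
    # Stage 3: find other words that fill the same gap elsewhere in the corpus.
    counts = Counter()
    for tokens in tokenized:
        for left, right in patterns:
            if len(tokens) != len(left) + 1 + len(right):
                continue
            filler = tokens[len(left)]
            if (filler != word
                    and tuple(tokens[:len(left)]) == left
                    and tuple(tokens[len(left) + 1:]) == right):
                counts[filler] += 1
    # Stage 4: rank candidates by how many contexts they share with `word`.
    return counts.most_common()


toy_corpus = [
    "This establishment was founded in the year",
    "This company was founded in the year",
    "The establishment is the first hotel",
    "The institution is the first hotel",
    "The company is the first hotel",
]
print(find_synonyms("establishment", toy_corpus))
```

Matching whole sentences keeps the sketch short; a practical version would match shorter context windows on both sides of the gap.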

diagram

CBMT Edge locking

• The first and last words of an n-gram are confirmed by overlap only once or a few times

• Search for other source sentences where the first and last words of the original n-gram also appear in the middle of a newly found n-gram

• This confirms their suitability within a particular context
• Also used for words around interior punctuation

illustration

diagram

CBMT Target corpora

• Monolingual

• Very large (50 GB - 1 TB)

• The bigger the corpus, the more accurate the translation

• Easy to obtain from the web

diagram

CBMT Bilingual dictionary

• Very large
• The bigger the dictionary, the more accurate the translation
• Widely available for most languages
• Used to translate the n-grams
• Produces a large number of n-grams
• Different translations for each word
• Words can be in any order, taking into account differences between languages
• Each n-gram yields 100-3000 high-density matches

diagram

Conclusion

• Can we?

– Create a universal foundation for all languages

– Eliminate the need for human translators
– Solve the biggest obstacle in MT: ambiguity

It does not seem so in the foreseeable future
