Machine translation Context-based approach Lucia Otoyo.
Machine translation: Context-based approach
Lucia Otoyo
Machine translation
Computerized task of translating from one natural language to another
• Human vs. machine translation
• Difficulties of MT
Brief history of MT
• 17th century: Descartes & Leibniz
• 1930s: bilingual dictionary + rules
• After the war: Warren Weaver treats translation as decoding a message
• 1954: first public demonstration of MT by IBM (spawned research)
• 1966: ALPAC report finds MT less accurate and more costly than human translation
• 1980s: increasing demand, rule-based approach born
• 1990s: parallel corpora approach
MT approaches
Rule based
Parallel corpora based
Context based
Conclusion
Rule Based approach
• Dominant in the 1980s
• Resources: set of rules & bilingual dictionary
• Steps:
  - Syntax: grammar
  - Semantics: meaning
  - Pragmatics: differences between languages
• Disadvantages:
  - language experts needed to write the rules
  - new language pair means new rules
  - not possible to include all the rules
  - rules have exceptions
MT diagram
Parallel corpora based
• Example based (word frequency & combination)
• Statistical (phrase extraction & combination)
• Resources: parallel corpora (pre-translated), decoder, alignment software
• Steps: disassemble text into phrases, search the corpora and match phrases, substitute, align phrases to form text
• Advantages vs. disadvantages:
  - Easy to apply to a new language
  - More readable, as it uses human pre-translated text
  - General translation vs. specific domain
  - Lexical ambiguity
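The steps listed above (disassemble, search, match, substitute) can be sketched as a greedy longest-match lookup. This is a minimal illustration, not a real statistical decoder: the function name `translate_phrases`, the `max_len` parameter, and the tiny Spanish-to-English phrase table are all hypothetical, and a real system would learn its phrase pairs from aligned parallel corpora and also reorder the output.

```python
def translate_phrases(text, phrase_table, max_len=3):
    """Greedy longest-match phrase substitution (illustrative sketch only)."""
    words = text.split()
    out = []
    i = 0
    while i < len(words):
        # Try the longest phrase starting at position i first.
        for k in range(min(max_len, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + k])
            if phrase in phrase_table:
                out.append(phrase_table[phrase])
                i += k
                break
        else:
            # Assumption: unknown words are passed through unchanged.
            out.append(words[i])
            i += 1
    return " ".join(out)


# Hypothetical phrase pairs standing in for pre-translated corpus matches.
toy_table = {
    "muchas gracias": "thank you very much",
    "amigo": "friend",
}
result = translate_phrases("muchas gracias amigo", toy_table)
print(result)
```

Matching whole phrases before single words is what lets the human pre-translated text carry over and keeps the output readable.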
MT diagram
Context Based MT (diagram labels)
Pipeline: Source Language, N-gram segmenter, N-gram candidates, Substitution request, Overlap-based decoder, N-gram Connector, Target Language
Stores: Cache database, Cross-language n-gram database, Stored n-gram pairs, Approved n-gram pairs
Resources: Bilingual dictionary, Target corpora, Source corpora, Gazetteers
Components: N-gram builder, Flooder, Edge Locker, Synonym generator
CBMT edge illustration
‘This context based machine translation approach looks very interesting’.
1. ‘This context based machine’
2. ‘context based machine translation’
3. ‘based machine translation approach’
4. ‘machine translation approach looks’
5. ‘translation approach looks very’
6. ‘approach looks very interesting’
edge locking
CBMT n-grams
Break down the source text into n-grams (n = 4-8):
‘This context based machine translation approach looks very interesting’.
• If n = 4, the n-grams are as follows:
1. ‘This context based machine’
2. ‘context based machine translation’
3. ‘based machine translation approach’
4. ‘machine translation approach looks’
5. ‘translation approach looks very’
6. ‘approach looks very interesting’
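The segmentation above is a sliding window over the words of the sentence. A minimal sketch, assuming whitespace tokenisation (the function name `segment_ngrams` is illustrative and not taken from the slides):

```python
def segment_ngrams(text, n=4):
    """Return every run of n consecutive words in text, left to right."""
    words = text.split()
    if len(words) < n:
        # Assumption: a sentence shorter than n is kept as a single segment.
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "This context based machine translation approach looks very interesting"
ngrams = segment_ngrams(sentence, n=4)
for number, gram in enumerate(ngrams, 1):
    print(number, gram)
```

A sentence of 9 words with n = 4 gives 9 - 4 + 1 = 6 n-grams, matching the list above. Each n-gram shares three words with its neighbour, which is what the later overlap steps rely on.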
CBMT Flooding
• Search the monolingual corpora with the translated n-grams
• Produces a large number of n-grams with different translations for each word
• Words can be in any order, taking into account differences between languages
• Each n-gram yields 100-3000 high-density matches
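One way to picture flooding is as a window scan over the target corpus that ignores word order and scores each window by how many source words it covers. The sketch below assumes that reading; `flood`, `window`, `min_density`, and the toy data are hypothetical names and values, and "density" is taken here to mean the fraction of source words with at least one candidate translation in the window.

```python
def flood(candidates, corpus_tokens, window, min_density):
    """Return (density, window_text) pairs, densest first.

    candidates: one set of candidate target words per source word.
    """
    hits = []
    last_start = max(len(corpus_tokens) - window + 1, 1)
    for start in range(last_start):
        span = corpus_tokens[start:start + window]
        span_words = set(span)  # a set, so word order does not matter
        covered = sum(1 for options in candidates if options & span_words)
        density = covered / len(candidates)
        if density >= min_density:
            hits.append((density, " ".join(span)))
    hits.sort(key=lambda hit: -hit[0])
    return hits


# Hypothetical dictionary output: several translations per source word.
toy_candidates = [
    {"machine", "automatic"},
    {"translation", "version"},
    {"approach", "focus", "method"},
]
toy_corpus = ("the machine translation approach works while "
              "the automatic gearbox method fails").split()
matches = flood(toy_candidates, toy_corpus, window=3, min_density=0.6)
print(matches[0])
```

On a corpus of realistic size the same scan would need an index rather than a linear pass, which is one reason the number of matches per n-gram can reach the thousands.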
CBMT Target language lattice overlap maximization
• Align all the n-grams with each other
• Choose the ones with the highest number of left- and right-side overlaps
• Eliminate non-overlapping or partially overlapping n-grams
• 1. n-gram ‘This approach for computer’
• 2. n-gram ‘This context based machine’
• 3. n-gram ‘based machine translation approach’
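The three example n-grams can be scored with a simple word-level overlap measure. This is a sketch under stated assumptions: an overlap is counted when the last k words of one n-gram equal the first k words of another, the score is the best left overlap plus the best right overlap, and candidates scoring zero are dropped. The function names are illustrative, and the actual decoder is likely to weigh overlaps differently.

```python
def overlap(left, right):
    """Length of the longest word run that ends `left` and starts `right`."""
    a, b = left.split(), right.split()
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def score_overlaps(candidates):
    """Map each candidate to best left overlap + best right overlap."""
    scores = {}
    for cand in candidates:
        others = [other for other in candidates if other != cand]
        left = max((overlap(other, cand) for other in others), default=0)
        right = max((overlap(cand, other) for other in others), default=0)
        scores[cand] = left + right
    return scores


candidates = [
    "This approach for computer",
    "This context based machine",
    "based machine translation approach",
]
scores = score_overlaps(candidates)
kept = [cand for cand in candidates if scores[cand] > 0]
print(kept)
```

Here n-grams 2 and 3 share 'based machine' and survive, while n-gram 1 overlaps with nothing and is eliminated.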
CBMT Cross language database
• Stores cross-language n-gram correspondences for later use
• Speeds up the translation process
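In effect the cross-language database acts as a cache in front of the expensive flooding and overlap steps. A minimal in-memory sketch, assuming a plain dictionary as storage; `CrossLanguageStore`, `translate_ngram`, and the stand-in translator are hypothetical names, not part of the described system:

```python
class CrossLanguageStore:
    """Remembers source n-gram -> target n-gram pairs already worked out."""

    def __init__(self):
        self._pairs = {}

    def lookup(self, source_ngram):
        return self._pairs.get(source_ngram)

    def store(self, source_ngram, target_ngram):
        self._pairs[source_ngram] = target_ngram


def translate_ngram(source_ngram, store, slow_translate):
    """Use a stored pair if there is one, otherwise translate and remember it."""
    cached = store.lookup(source_ngram)
    if cached is not None:
        return cached
    target = slow_translate(source_ngram)
    store.store(source_ngram, target)
    return target


calls = []

def stand_in_translator(ngram):
    # Placeholder for the full pipeline; it only records that it was called.
    calls.append(ngram)
    return ngram.upper()

store = CrossLanguageStore()
first = translate_ngram("based machine translation approach", store, stand_in_translator)
second = translate_ngram("based machine translation approach", store, stand_in_translator)
print(len(calls))
```

The second request is answered from the store, so the slow path runs only once per distinct n-gram.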
CBMT target language
• Find globally longest target language overlap with the highest match density
1. ‘This context based machine’
2. ‘context based machine translation’
3. ‘based machine translation approach’
4. ‘machine translation approach looks’
5. ‘translation approach looks very’
6. ‘approach looks very interesting’
‘This context based machine translation approach looks very interesting’.
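Once the surviving n-grams are in order, joining them on their overlaps reproduces the full sentence. The sketch below is a simplified greedy chain over n-grams that are already ordered, not the global search for the longest, densest overlap described above; `connect_ngrams` is an illustrative name.

```python
def connect_ngrams(ordered_ngrams):
    """Join ordered n-grams by merging each one onto the text built so far."""
    words = ordered_ngrams[0].split()
    for gram in ordered_ngrams[1:]:
        nxt = gram.split()
        # Look for the longest suffix of the text that starts the next n-gram.
        for k in range(min(len(words), len(nxt)), 0, -1):
            if words[-k:] == nxt[:k]:
                words.extend(nxt[k:])
                break
        else:
            raise ValueError("no overlap with: " + gram)
    return " ".join(words)


target_ngrams = [
    "This context based machine",
    "context based machine translation",
    "based machine translation approach",
    "machine translation approach looks",
    "translation approach looks very",
    "approach looks very interesting",
]
sentence = connect_ngrams(target_ngrams)
print(sentence)
```

Each step contributes exactly one new word here because neighbouring 4-grams share three words.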
CBMT – synonymy
• Word and phrasal synonymy
  - increases accuracy if no or only partial overlaps are found
  - dynamic synonyms, no predefined coded patterns
Stages:
1. Search for the word in the corpus (1,000-100,000 context-related phrases)
• 1. ‘This establishment was founded in the year’
• 2. ‘The number of people working in the establishment is far greater than’
• 3. ‘The establishment is the first hotel’, etc
CBMT – synonymy cont.
2. Search the corpus with only the phrases, leaving the word out
• 1. ‘This ________ was founded in the year’
• 2. ‘The number of people working in the _______ is far greater than’
• 3. ‘The ________ is the first hotel’, etc
3. This may return:
• 1. ‘This company was founded in the year’
• 2. ‘The number of people working in the business is far greater than’
• 3. ‘The institution is the first hotel’, etc
4. Rank synonyms according to various criteria and flood
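The four stages can be sketched on a toy corpus. Assumptions in this sketch: the corpus is a list of short sentences, a context matches only if the whole sentence fits the pattern, and the single ranking criterion is the number of contexts a candidate fills (the slides mention "various criteria" without listing them). The name `find_synonyms` and the sample sentences are illustrative.

```python
import re
from collections import Counter


def find_synonyms(word, sentences):
    """Rank words that fill the same contexts as `word` (illustrative sketch)."""
    # Stage 1: collect the contexts in which the word occurs.
    contexts = [s for s in sentences if word in s.split()]
    # Stage 2: turn each context into a pattern with a slot where the word was.
    patterns = []
    for context in contexts:
        parts = [r"(\w+)" if token == word else re.escape(token)
                 for token in context.split()]
        patterns.append(re.compile("^" + " ".join(parts) + "$"))
    # Stage 3: search the corpus with the patterns and collect the slot fillers.
    counts = Counter()
    for pattern in patterns:
        for s in sentences:
            match = pattern.match(s)
            if match and match.group(1) != word:
                counts[match.group(1)] += 1
    # Stage 4: rank candidates by how many contexts they filled.
    return counts.most_common()


toy_sentences = [
    "This establishment was founded in the year",
    "The establishment is the first hotel",
    "This company was founded in the year",
    "The institution is the first hotel",
    "This business was founded in the year",
    "The company is the first hotel",
]
ranked = find_synonyms("establishment", toy_sentences)
print(ranked)
```

Because the candidates come from the corpus at query time, nothing has to be hand-coded in advance, which is the sense of "dynamic synonyms" above.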
CBMT Edge locking
• The first and last words of an n-gram are confirmed by overlap only once or a few times
• Search for other source sentences where the first and last words of the original n-gram also appear in the middle of a newly found n-gram
• This confirms suitability within a particular context
• Also used for words around interior punctuation
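The search step can be illustrated by looking for n-grams in other sentences where an edge word sits in an interior position. This is only a rough sketch of that one step: the name `interior_ngrams` and the sample sentences are hypothetical, and how the newly found n-grams are then translated and compared is not covered.

```python
def interior_ngrams(word, sentences, n=4):
    """Return n-grams in which `word` is neither the first nor the last word."""
    found = []
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if word in gram[1:-1]:
                found.append(" ".join(gram))
    return found


# 'machine' is the last word of the n-gram 'This context based machine'.
other_sentences = [
    "A statistical machine translation system",
    "machine learning is everywhere now",
]
supporting = interior_ngrams("machine", other_sentences)
print(supporting)
```

In the supporting n-grams the edge word has neighbours on both sides, so its translation can be checked against context to its left and its right.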
CBMT Target corpora
• Monolingual
• Very large (50 GB - 1 TB)
• The larger the corpus, the more accurate the translation
• Easy to obtain from the web
CBMT Bilingual dictionary
• Very large
• The larger the dictionary, the more accurate the translation
• Usually widely available for most languages
• Used to translate the n-grams
• Produces a large number of n-grams
• Different translations for each word
• Words can be in any order, taking into account differences between languages
• Each n-gram yields 100-3000 high-density matches
Conclusion
• Can we?
– Create a universal foundation for all languages
– Eliminate the need for human translators
– Solve the biggest obstacle in MT: ambiguity
It does not seem so in the foreseeable future