Machine Translation - Angel Xuan Chang | Angel Xuan Chang · 2020. 12. 2. · History • Started...

Machine Translation Spring 2020 CMPT 825: Natural Language Processing Adapted from slides from Chris Manning, Abigail See, Matthew Lamm, Danqi Chen and Karthik Narasimhan


Page 1

Machine Translation

Spring 2020

CMPT 825: Natural Language Processing

SFUNatLangLab

Adapted from slides from Chris Manning, Abigail See, Matthew Lamm, Danqi Chen and Karthik Narasimhan

Page 2

Translation

• One of the “holy grail” problems in artificial intelligence

• Practical use case: Facilitate communication between people in the world

• Extremely challenging (especially for low-resource languages)

Page 3

Easy and not so easy translations

• Easy:

• I like apples ↔ ich mag Äpfel (German)

• Not so easy:

• I like apples ↔ J'aime les pommes (French)

• I like red apples ↔ J'aime les pommes rouges (French)

• les ↔ the but les pommes ↔ apples

Page 4

MT basics

• Goal: Translate a sentence $w^{(s)}$ in a source language (input) to a sentence in the target language (output)

• Can be formulated as an optimization problem:

$$\hat{w}^{(t)} = \arg\max_{w^{(t)}} \psi(w^{(s)}, w^{(t)})$$

• where $\psi$ is a scoring function over source and target sentences

• Requires two components:

• Learning algorithm to compute parameters of $\psi$

• Decoding algorithm for computing the best translation $\hat{w}^{(t)}$

Page 5

Why is MT challenging?

• Single words may be replaced with multi-word phrases

• I like apples ↔ J'aime les pommes

• Reordering of phrases

• I like red apples ↔ J'aime les pommes rouges

• Contextual dependence

• les ↔ the but les pommes ↔ apples

Extremely large output space ⟹ Decoding is NP-hard

Page 6

Vauquois Pyramid

• Hierarchy of concepts and distances between them in different languages

• Lowest level: individual words/characters

• Higher levels: syntax, semantics

• Interlingua: Generic language-agnostic representation of meaning

Page 7

Evaluating translation quality

• Two main criteria:

• Adequacy: Translation $w^{(t)}$ should adequately reflect the linguistic content of $w^{(s)}$

• Fluency: Translation $w^{(t)}$ should be fluent text in the target language

Different translations of A Vinay le gusta Python

Page 8

Evaluation metrics

• Manual evaluation is most accurate, but expensive

• Automated evaluation metrics:

• Compare system hypothesis with reference translations

• BiLingual Evaluation Understudy (BLEU) (Papineni et al., 2002):

• Modified n-gram precision

Page 9

BLEU

$$\text{BLEU} = \exp\left(\frac{1}{N} \sum_{n=1}^{N} \log p_n\right)$$

Two modifications:

• To avoid $\log 0$, all precisions are smoothed

• Each n-gram in reference can be used at most once

• Ex. Hypothesis: "to to to to to" vs Reference: "to be or not to be" should not get a unigram precision of 1

Precision-based metrics favor short translations

• Solution: Multiply score with a brevity penalty for translations shorter than reference, $e^{1 - r/h}$ (with $r$ the reference length and $h$ the hypothesis length)
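The BLEU definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes a single reference translation and whitespace tokenization. The helper names (`ngrams`, `modified_precision`, `bleu`) and the smoothing constant `eps` are hypothetical, since the slide does not say which smoothing scheme is used.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(hyp, ref, n):
    """Clipped n-gram counts: a hypothesis n-gram is credited at most as many
    times as it occurs in the reference. Returns (matched, total)."""
    hyp_counts = ngrams(hyp, n)
    ref_counts = ngrams(ref, n)
    matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matched, total


def bleu(hyp, ref, max_n=4, eps=1e-9):
    """Sentence-level BLEU sketch with one reference.

    eps is an assumed smoothing constant added to numerator and denominator
    so that log 0 never occurs.
    """
    if not hyp:
        return 0.0
    log_sum = 0.0
    for n in range(1, max_n + 1):
        matched, total = modified_precision(hyp, ref, n)
        log_sum += math.log((matched + eps) / (total + eps))
    score = math.exp(log_sum / max_n)
    r, h = len(ref), len(hyp)
    if h < r:  # brevity penalty only for hypotheses shorter than the reference
        score *= math.exp(1 - r / h)
    return score
```

For the slide's example, `modified_precision("to to to to to".split(), "to be or not to be".split(), 1)` credits only two of the five hypothesis unigrams, because "to" occurs twice in the reference.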

Page 10

BLEU

• Correlates somewhat well with human judgements

(G. Doddington, NIST)

Page 11

BLEU scores

Sample BLEU scores for various system outputs

• Alternatives have been proposed:

• METEOR: weighted F-measure

• Translation Error Rate (TER): Edit distance between hypothesis and reference

Issues?
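As a rough illustration of the edit-distance idea behind TER, the sketch below computes a word-level Levenshtein distance and normalizes it by the reference length. It is a simplification: the full TER metric also counts block shifts as edits, which this sketch omits. The function names are hypothetical.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance over tokens (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h_tok in enumerate(hyp, start=1):
        curr = [i]
        for j, r_tok in enumerate(ref, start=1):
            cost = 0 if h_tok == r_tok else 1
            curr.append(min(prev[j] + 1,          # delete h_tok
                            curr[j - 1] + 1,      # insert r_tok
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def simple_error_rate(hyp, ref):
    """Edit distance normalized by reference length (no shift operations)."""
    return word_edit_distance(hyp, ref) / len(ref)
```

Unlike BLEU, lower is better here: a hypothesis identical to the reference scores 0.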

Page 12

Machine Translation (MT)

The task of translating a sentence from one language (the source language) to a sentence in another language (the target language)

Page 13

History

• Started in the 1950s: rule-based, tightly linked to formal linguistics theories

• 1980s: Statistical MT

• 2000s-2015: Statistical Phrase-Based MT

• 2015-Present: Neural Machine Translation


• Russian → English (motivated by the Cold War!)

• Systems were mostly rule-based, using a bilingual dictionary to map Russian words to their English counterparts

1 minute video showing 1954 MT: https://youtu.be/K-HfpsHPmvw

Page 14

History

• Started in the 1950s: rule-based, tightly linked to formal linguistics theories

• (late) 1980s to 2000s: Statistical MT

• 2000s-2014: Statistical Phrase-Based MT

• 2014-Present: Neural Machine Translation


Page 15

History

• Started in the 1950s: rule-based, tightly linked to formal linguistics theories

• 1980s: Statistical MT

• 2000s-2015: Statistical Phrase-Based MT

• 2015-Present: Neural Machine Translation


Page 16

Statistical MT

• Key Idea: Learn probabilistic model from data

• To find the best English sentence y, given French sentence x:

$$\hat{y} = \arg\max_{y} P(y \mid x)$$

• Decompose using Bayes Rule:

$$\hat{y} = \arg\max_{y} P(x \mid y)\, P(y)$$

• Translation/Alignment Model $P(x \mid y)$: Models how words and phrases should be translated (adequacy/fidelity). Learn from parallel data

• Language Model $P(y)$: Models how to write good English (fluency). Learn from monolingual data

Page 17

Noisy channel model

• Generative process for source sentence

• Use Bayes rule to recover $w^{(t)}$ that is maximally likely under the conditional distribution $p_{T|S}$ (which is what we want)

[Figure: Input Model $p_T$ generates $w^{(t)}$; Noisy Channel $p_{S|T}$ produces $w^{(s)}$; Decoder computes $\hat{w}^{(t)} = \arg\max_{w^{(t)}} \psi(w^{(s)}, w^{(t)})$]

$$\psi(w^{(s)}, w^{(t)}) = \psi_A(w^{(s)}, w^{(t)}) + \psi_F(w^{(t)})$$

$$\log p_{S,T}(w^{(s)}, w^{(t)}) = \log p_{S|T}(w^{(s)} \mid w^{(t)}) + \log p_T(w^{(t)})$$
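The decomposition of the score into an adequacy term and a fluency term can be illustrated with a toy rescoring example. Everything below is an assumption made for illustration: the probability tables are invented, the candidate list stands in for a real search procedure, and the names `psi` and `decode` are hypothetical.

```python
import math

# Hypothetical toy log-probabilities, invented purely for illustration.
LOG_P_T = {  # language model log p_T(w_t): fluency of the target sentence
    "I like apples": math.log(0.6),
    "I apples like": math.log(0.1),
    "I like pears": math.log(0.3),
}
LOG_P_S_GIVEN_T = {  # channel model log p_{S|T}(w_s | w_t): adequacy
    ("J'aime les pommes", "I like apples"): math.log(0.5),
    ("J'aime les pommes", "I apples like"): math.log(0.5),
    ("J'aime les pommes", "I like pears"): math.log(0.01),
}


def psi(source, target):
    """psi = psi_A + psi_F = log p_{S|T}(source | target) + log p_T(target)."""
    return LOG_P_S_GIVEN_T[(source, target)] + LOG_P_T[target]


def decode(source, candidates):
    """Return the candidate target sentence with the highest score."""
    return max(candidates, key=lambda t: psi(source, t))
```

In this toy setup the channel model cannot tell "I like apples" from "I apples like", so the language model term is what breaks the tie, while "I like pears" is fluent but penalized for adequacy.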

Page 18

Data

• Statistical MT requires lots of parallel corpora

(Europarl, Koehn, 2005)

• Not available for many low-resource languages in the world

Page 19

How to define the translation model?

Introduce a latent variable A modeling the alignment (word-level correspondence) between the source sentence x and the target sentence y

$$P(x \mid y) = \sum_{A} P(x, A \mid y)$$

Page 20

What is alignment?

Page 21

Alignment is complex

Alignment can be many-to-one

Page 22

Alignment is complex

Alignment can be one-to-many

Page 23

Alignment is complex

Some words are very fertile!

Page 24

Alignment is complex

Alignment can be many-to-many (phrase-level)

Page 25

How to define the translation model?

Given the alignment, how do we incorporate it in our model?

$$P(x \mid y) = \sum_{A} P(x, A \mid y)$$

Page 26

Incorporating alignments

• Joint probability of alignment and translation can be defined as:

• $M^{(s)}, M^{(t)}$ are the number of words in source and target sentences

• $a_m$ is the alignment of the $m$th word in the source sentence, i.e. it specifies that the $m$th word is aligned to the $a_m$th word in target

Is this sufficient?

Page 27

Incorporating alignments

$a_1 = 2, a_2 = 3, a_3 = 4, \ldots$

Multiple source words may align to the same target word!

[Figure: word alignment between a (source) and a (target) sentence]

Page 28

Reordering and word insertion

(Slide credit: Brendan O’Connor)

Assume extra NULL token

Page 29

Independence assumptions

• Two independence assumptions:

• Alignment probability factors across tokens:

• Translation probability factors across tokens:
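The formulas for these two assumptions did not survive in the transcript. One standard way to write them, using the symbols $a_m$, $M^{(s)}$ and $M^{(t)}$ defined earlier, is sketched below. It is offered as an assumption, not as the slide's exact wording.

```latex
\begin{align*}
p(A \mid M^{(s)}, M^{(t)}) &= \prod_{m=1}^{M^{(s)}} p(a_m \mid m, M^{(s)}, M^{(t)}) \\
p(w^{(s)} \mid w^{(t)}, A) &= \prod_{m=1}^{M^{(s)}} p(w^{(s)}_m \mid w^{(t)}_{a_m})
\end{align*}
```

Under these assumptions the joint probability of the source sentence and the alignment is the product over source positions $m$ of $p(a_m \mid m, M^{(s)}, M^{(t)}) \, p(w^{(s)}_m \mid w^{(t)}_{a_m})$.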

Page 30

How do we translate?

• We want:

$$\arg\max_{w^{(t)}} p(w^{(t)} \mid w^{(s)}) = \arg\max_{w^{(t)}} \frac{p(w^{(s)}, w^{(t)})}{p(w^{(s)})}$$

• Sum over all possible alignments:

• Alternatively, take the max over alignments

Page 31

Alignments

• Key question: How should we align words in source to words in target?

good

bad

Page 32

IBM Model 1

• Assume

$$p(a_m \mid m, M^{(s)}, M^{(t)}) = \frac{1}{M^{(t)}}$$

• Is this a good assumption?

Every alignment is equally likely!

Page 33

IBM Model 1

• Each source word is aligned to at most one target word

• Further, assume

$$p(a_m \mid m, M^{(s)}, M^{(t)}) = \frac{1}{M^{(t)}}$$

• We then have:

$$p(w^{(s)}, w^{(t)}) = p(w^{(t)}) \sum_{A} \left(\frac{1}{M^{(t)}}\right)^{M^{(s)}} p(w^{(s)} \mid w^{(t)})$$

• How do we estimate $p(w^{(s)} = v \mid w^{(t)} = u)$?
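A small numerical check can make the Model 1 sum over alignments concrete. The sketch below reads the last factor of the formula as conditioned on the alignment $A$ and as a product of word-level probabilities; both readings are assumptions. It then compares a brute-force sum over all alignments with the equivalent per-word factorization. The partial translation table `T` is invented for illustration and the function names are hypothetical. Both functions return $p(w^{(s)} \mid w^{(t)})$, the part that multiplies $p(w^{(t)})$.

```python
import itertools
import math

# Hypothetical partial translation table p(source word v | target word u).
T = {
    ("j'aime", "i"): 0.4, ("j'aime", "like"): 0.5, ("j'aime", "apples"): 0.1,
    ("pommes", "i"): 0.1, ("pommes", "like"): 0.1, ("pommes", "apples"): 0.8,
}


def likelihood_brute_force(source, target, t):
    """Sum over every alignment A of prod_m (1/M_t) * t(source_m | target_{a_m})."""
    m_t = len(target)
    total = 0.0
    for alignment in itertools.product(range(m_t), repeat=len(source)):
        total += math.prod(
            t[(s, target[a])] / m_t for s, a in zip(source, alignment)
        )
    return total


def likelihood_factored(source, target, t):
    """Same quantity: the sum over alignments factors per source word."""
    m_t = len(target)
    return math.prod(sum(t[(s, u)] for u in target) / m_t for s in source)
```

The brute-force version enumerates $(M^{(t)})^{M^{(s)}}$ alignments, while the factored version needs only $M^{(s)} \cdot M^{(t)}$ table lookups, which is what makes Model 1 tractable to train.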

Page 34

IBM Model 1

• If we had word-to-word alignments, we could compute the probabilities using the MLE:

$$p(v \mid u) = \frac{\text{count}(u, v)}{\text{count}(u)}$$

• where $\text{count}(u, v)$ = #instances where word $u$ was aligned to word $v$ in the training set

• However, word-to-word alignments are often hard to come by

What can we do?

Page 35

EM for Model 1

• (E-Step) If we had an accurate translation model, we can estimate likelihood of each alignment as:

• (M Step) Use expected count to re-estimate translation parameters:

$$p(v \mid u) = \frac{E_q[\text{count}(u, v)]}{\text{count}(u)}$$
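The two EM steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the slides' exact procedure. The E-step posterior for each source word is taken to be proportional to $p(v \mid u)$ over the target words in the same sentence pair, because the uniform alignment prior cancels. The M-step denominator is computed as the sum of expected counts for $u$. The function name `train_model1` and the uniform initialization are hypothetical choices, and no NULL token is modeled.

```python
from collections import defaultdict


def train_model1(corpus, iterations=10):
    """EM for IBM Model 1 translation probabilities t[(v, u)] = p(v | u).

    corpus: list of (source_tokens, target_tokens) pairs.
    """
    source_vocab = {v for src, _ in corpus for v in src}
    # Assumed initialization: uniform over the source vocabulary.
    t = defaultdict(lambda: 1.0 / len(source_vocab))

    for _ in range(iterations):
        count_uv = defaultdict(float)  # expected count(u, v)
        count_u = defaultdict(float)   # sum over v of expected count(u, v)
        for src, tgt in corpus:
            for v in src:
                # E-step: posterior over which target word v aligns to.
                norm = sum(t[(v, u)] for u in tgt)
                for u in tgt:
                    q = t[(v, u)] / norm
                    count_uv[(u, v)] += q
                    count_u[u] += q
        # M-step: re-estimate p(v | u) from the expected counts.
        t = defaultdict(
            float,
            {(v, u): c / count_u[u] for (u, v), c in count_uv.items()},
        )
    return t
```

On a tiny corpus such as `[(["la", "maison"], ["the", "house"]), (["la", "fleur"], ["the", "flower"])]`, the estimates move toward pairing "la" with "the", because "la" co-occurs with "the" in both sentence pairs.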

Page 36

Independence assumptions allow for Viterbi decoding

In general, use greedy or beam decoding

Page 37

Model 1: Decoding

• Pick target sentence length $M^{(t)}$

• Decode:

$$\arg\max_{w^{(t)}} p(w^{(t)} \mid w^{(s)}) = \arg\max_{w^{(t)}} p(w^{(s)}, w^{(t)})$$

$$p(w^{(s)}, w^{(t)}) = p(w^{(t)}) \sum_{A} \left(\frac{1}{M^{(t)}}\right)^{M^{(s)}} p(w^{(s)} \mid w^{(t)})$$

Page 38

Model 1: Decoding

[Figure: example (source) and (target) sentences]

At every step $m$, pick target word to maximize product of:

1. Language model: $p_{LM}(w^{(t)}_m \mid w^{(t)}_{<m})$

2. Translation model: $p(w^{(s)}_{b_m} \mid w^{(t)}_m)$

where $b_m$ is the inverse alignment from target to source
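The greedy step described above can be sketched with toy models. All of the following are assumptions for illustration: the probability tables are invented, a bigram model stands in for the full-history language model $p_{LM}(w^{(t)}_m \mid w^{(t)}_{<m})$, and the inverse alignment $b_m$ is taken as given (here simply monotone), which a real decoder would also have to search over.

```python
# Hypothetical toy models; all numbers are invented for illustration.
BIGRAM_LM = {  # p_LM(word | previous word), "<s>" marks sentence start
    ("<s>", "i"): 0.8, ("<s>", "apples"): 0.2,
    ("i", "like"): 0.9, ("i", "apples"): 0.1,
    ("like", "apples"): 0.7, ("like", "i"): 0.3,
}
TRANSLATION = {  # p(source word | target word)
    ("je", "i"): 0.9, ("aime", "like"): 0.8, ("pommes", "apples"): 0.9,
}
TARGET_VOCAB = ["i", "like", "apples"]


def greedy_decode(source, inverse_alignment):
    """At step m, pick the target word w maximizing
    p_LM(w | previous word) * p(source[b_m] | w)."""
    prev, output = "<s>", []
    for b_m in inverse_alignment:
        best = max(
            TARGET_VOCAB,
            key=lambda w: BIGRAM_LM.get((prev, w), 0.0)
            * TRANSLATION.get((source[b_m], w), 0.0),
        )
        output.append(best)
        prev = best
    return output
```

Greedy decoding commits to one word per step, so it can miss the globally best sentence. Beam decoding keeps several partial hypotheses instead of one.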

Page 39

IBM Model 2

• Slightly relaxed assumption:

• $p(a_m \mid m, M^{(s)}, M^{(t)})$ is also estimated, not set to constant

• Original independence assumptions still required:

• Alignment probability factors across tokens:

• Translation probability factors across tokens:

Page 40

Other IBM models

• Models 3 - 6 make successively weaker assumptions

• But get progressively harder to optimize

• Simpler models are often used to ‘initialize’ complex ones

• e.g. train Model 1 and use it to initialize Model 2 parameters

Page 41

Phrase-based MT

• Word-by-word translation is not sufficient in many cases

• Solution: build alignments and translation tables between multiword spans or “phrases”

(literal)(actual)

Page 42

Phrase-based MT

• Solution: build alignments and translation tables between multiword spans or “phrases”

• Translations condition on multi-word units and assign probabilities to multi-word units

• Alignments map from spans to spans

Page 43

Syntactic MT

(Slide credit: Greg Durrett)

Page 44

Syntactic MT

Page 45

Vauquois Pyramid

• Hierarchy of concepts and distances between them in different languages

• Lowest level: individual words/characters

• Higher levels: syntax, semantics

• Interlingua: Generic language-agnostic representation of meaning

Page 46

Statistical MT

• 1990s to 2010s: huge research area

• Extremely complex systems

• Many separately-designed subcomponents

• Lots of feature engineering

• Needed to compile/maintain extra resources (phrase tables)

• Lots of human effort to maintain

Page 47

History

• Started in the 1950s: rule-based, tightly linked to formal linguistics theories

• 1980s: Statistical MT

• 2000s-2015: Statistical Phrase-Based MT

• 2015-Present: Neural Machine Translation

• ~2018-Present: Neural Machine Translation + PBMT Hybrid
