Unsupervised Machine Translation - Stanford UniversityIntroduction WMT 2019 English-German System...

UnsupervisedMachine Translation

Mikel Artetxe

Facebook AI Research

Introduction

IntroductionWMT 2019 English-German

IntroductionWMT 2019 English-GermanSystem ScoreFacebook FAIR 0.347Microsoft sent-doc 0.311Microsoft doc-level 0.296HUMAN 0.240MSRA-MADL 0.214UCAM 0.213⋮ ⋮

Introduction

super-human performance

WMT 2019 English-GermanSystem ScoreFacebook FAIR 0.347Microsoft sent-doc 0.311Microsoft doc-level 0.296HUMAN 0.240MSRA-MADL 0.214UCAM 0.213⋮ ⋮

Introduction


human evaluators preferred MT to

human translation!

WMT 2019 English-GermanSystem ScoreFacebook FAIR 0.347Microsoft sent-doc 0.311Microsoft doc-level 0.296HUMAN 0.240MSRA-MADL 0.214UCAM 0.213⋮ ⋮

Introduction



human translation!

WMT 2019 English-German

NMT trained on 27.7M parallel sentence pairsSystem ScoreFacebook FAIR 0.347Microsoft sent-doc 0.311Microsoft doc-level 0.296HUMAN 0.240MSRA-MADL 0.214UCAM 0.213⋮ ⋮

Introduction



human translation!


In my opinion, this exposes a serious inconsistency in industrial… En mi opinión, esto representa una grave inconsistencia entre…

NMT trained on 27.7M parallel sentence pairsSystem ScoreFacebook FAIR 0.347Microsoft sent-doc 0.311Microsoft doc-level 0.296HUMAN 0.240MSRA-MADL 0.214UCAM 0.213⋮ ⋮

Introduction



human translation!



NMT trained on 27.7M parallel sentence pairs

In my opinion the issues mentioned previously in the Council's…

En mi opinión, estos aspectos han sido tenidos debidamente…

System ScoreFacebook FAIR 0.347Microsoft sent-doc 0.311Microsoft doc-level 0.296HUMAN 0.240MSRA-MADL 0.214UCAM 0.213⋮ ⋮

Introduction



human translation!






In my opinion, it is clearly correct and natural to discuss issues…

En mi opinión es perfectamente correcto y natural debatir los…


Introduction



human translation!







En mi opinión es perfectamente correcto y natural debatir los…In my opinion, however, we could also achieve adequate flood…

En mi opinión también podríamos conseguir una protección…


Introduction



human translation!









In my opinion, it is a duty of the Presidency of the European…

A mi juicio, es responsabilidad de la Presidencia de la Unión…


Introduction



human translation!











In my opinion, however, this report does not have acceptable…En mi opinión, no obstante, el presente informe no ofrece…


Introduction



human translation!












In my opinion, the new liquidity requirements will help to solve…

En mi opinión, los nuevos requisitos de liquidez contribuirán a…


Introduction



human translation!














In my opinion, it is really a very good thing that Mr Potočnik…

En mi opinión, que el señor Potočnik haya sido tan inequívoco…


Introduction



human translation!
















In my opinion, this social dimension should, in fact, be brought into…


Introduction



human translation!

















???


Introduction



human translation!

















En mi opinion, …


Introduction




human translation!



Introduction




human translation!



We would need 48 years to read that!

Introduction




human translation!



We would need 48 years to read that!- Not sample efficient

Introduction




human translation!



We would need 48 years to read that!- Not sample efficient- Not available for most language pairs

Introduction




human translation!




Can we train MT systems with zero

parallel data?

Introduction




human translation!





parallel data?

Scientific & practical interest!

Introduction




human translation!





parallel data?


Previously explored in statistical decipherment(Knight et al., ACL’06; Ravi & Knight, ACL’11; Dou et al., ACL’15; inter alia)

Introduction




human translation!





parallel data?



…but strong limitations…

Introduction




human translation!





parallel data?



…but strong limitations…

IMPRESSIVE PROGRESS IN THE LAST 2-3 YEARS!

Outline

L1 corpus

Outline

L1 corpus

Outline

L2 corpus

L1 corpus

non-parallel

Outline

L2 corpus

L1 corpus

non-parallel

Outline

L2 corpus

magic

Machine Translation

L1 corpus

non-parallel

Outline

L2 corpus

magic

Machine Translation

L1 corpus

non-parallel

Outline

L2 corpus

L1 corpus

non-parallel

Outline

L2 corpus

Machine Translation

L1 corpus

non-parallel

OutlineL1 embeddings

L2 corpus

L2 embeddings

Machine Translation

L1 corpus

non-parallel


Cross-lingual embeddingsL2 corpus

L2 embeddings

Machine Translation

Machine Translation

L1 corpus

non-parallel



L2 embeddings

Word embeddings

Word embeddingsDistributed representations of words

Moo

Meow

Bark

Dog

Apple

House

cat

Cow

Banana

Pear

Calendar


Moo

Meow

Bark

Dog

Apple

House

cat

Cow

Banana

Pear

Calendar


sim 𝑐𝑜𝑤, 𝑐𝑎𝑡 ≈ cos 𝑤!"# , 𝑤!$% =𝑤!"# ⋅ 𝑤!$%𝑤!"# 𝑤!$%

Moo

Meow

Bark

Dog

Apple

House

cat

Cow

Banana

Pear

Calendar

Word embeddingsDistributed representations of words Learned from co-occurrence patterns

in a monolingual corpus


Moo

Meow

Bark

Dog

Apple

House

cat

Cow

Banana

Pear

Calendar

Word embeddings

e.g. skip-gram with negative sampling(Mikolov et al., NIPS’13)

Distributed representations of words Learned from co-occurrence patternsin a monolingual corpus


Moo

Meow

Bark

Dog

Apple

House

cat

Cow

Banana

Pear

Calendar

Word embeddings


log 𝜎 𝑤 ⋅ 𝑐 +4&'(

)𝔼*!~," log 𝜎 −𝑤 ⋅ 𝑐-



Moo

Meow

Bark

Dog

Apple

House

cat

Cow

Banana

Pear

Calendar

Word embeddings


I will go to New York by plane .

log 𝜎 𝑤 ⋅ 𝑐 +4&'(

)𝔼*!~," log 𝜎 −𝑤 ⋅ 𝑐-



Moo

Meow

Bark

Dog

Apple

House

cat

Cow

Banana

Pear

Calendar

Word embeddings


𝑤I will go to New York by plane .

log 𝜎 𝑤 ⋅ 𝑐 +4&'(

)𝔼*!~," log 𝜎 −𝑤 ⋅ 𝑐-



Moo

Meow

Bark

Dog

Apple

House

cat

Cow

Banana

Pear

Calendar

Word embeddings



log 𝜎 𝑤 ⋅ 𝑐 +4&'(

)𝔼*!~," log 𝜎 −𝑤 ⋅ 𝑐-



𝑐

Moo

Meow

Bark

Dog

Apple

House

cat

Cow

Banana

Pear

Calendar

Word embeddings



log 𝜎 𝑤 ⋅ 𝑐 +4&'(

)𝔼*!~," log 𝜎 −𝑤 ⋅ 𝑐-

𝑐



Machine Translation

L1 corpus

non-parallel



L2 embeddings

Cross-lingual word embedding alignment

𝑋

Donostia

Bilbo

Gasteiz

Iruñea

MauleD. Garazi

Baiona


𝑋 𝑍

Donostia

Bilbo

Gasteiz

Iruñea

MauleD. Garazi

Baiona

S. SebastiánBilbao

Vitoria

Pamplona

Mauléon

S. J. Pie

de Puerto

Bayona


𝑋 𝑍

Donostia

Bilbo

Gasteiz

Iruñea

MauleD. Garazi

Baiona

S. SebastiánBilbao

Vitoria

Pamplona

Mauléon

S. J. Pie

de Puerto

Bayona

BilboBaionaIruñea

BilbaoBayonaPamplona


𝑊

𝑋 𝑍

Donostia

Bilbo

Gasteiz

Iruñea

MauleD. Garazi

Baiona

S. SebastiánBilbao

Vitoria

Pamplona

Mauléon

S. J. Pie

de Puerto

Bayona

BilboBaionaIruñea



BilboBaionaIruñea


𝑊

𝑋 𝑍 𝑋𝑊

Donostia

Bilbo

Gasteiz

Iruñea

MauleD. Garazi

Baiona

S. SebastiánBilbao

Vitoria

Pamplona

Mauléon

S. J. Pie

de Puerto

Bayona

Donostia

Bilbo

Gasteiz

Iruñea

Maule

D. Garazi

Baiona



Mikolov et al. (arXiv’13)

Zaunka

Miau

Behi

Marru

TxakurKatu

Banana

EgutegiEtxe

Sagar

Udare

𝑋

Basque



Moo

MeowBark

Dog

Apple

House

cat

CowBanana

Pear Calendar

Zaunka

Miau

Behi

Marru

TxakurKatu

Banana

EgutegiEtxe

Sagar

Udare

𝑋 𝑍

Basque English



Moo

MeowBark

Dog

Apple

House

cat

CowBanana

Pear Calendar

Zaunka

Miau

Behi

Marru

TxakurKatu

Banana

EgutegiEtxe

Sagar

Udare

TxakurSagar⋮

Egutegi

DogApple⋮

Calendar

𝑋 𝑍

Basque English



Moo

MeowBark

Dog

Apple

House

cat

CowBanana

Pear Calendar

Zaunka

Miau

Behi

Marru

TxakurKatu

Banana

EgutegiEtxe

Sagar

Udare

TxakurSagar⋮

Egutegi

DogApple⋮

Calendar

𝑊

𝑋 𝑍

Basque English



Moo

MeowBark

Dog

Apple

House

cat

CowBanana

Pear Calendar

Zaunka

Miau

Behi

Marru

Txakur

Katu

Banana

Egutegi

Etxe

SagarUdare

Zaunka

Miau

Behi

Marru

TxakurKatu

Banana

EgutegiEtxe

Sagar

Udare

TxakurSagar⋮

Egutegi

DogApple⋮

Calendar

𝑊

𝑋 𝑍 𝑋𝑊

Basque English



Moo

MeowBark

Dog

Apple

cat

CowBanana

Pear Calendar

Zaunka

Miau

Behi

Marru

Txakur

Katu

Banana

Egutegi

Etxe

SagarUdare

Zaunka

Miau

Behi

Marru

TxakurKatu

Banana

EgutegiEtxe

Sagar

Udare

TxakurSagar⋮

Egutegi

DogApple⋮

Calendar

𝑊

𝑋 𝑍 𝑋𝑊House

Basque English



Moo

MeowBark

Dog

Apple

House

cat

CowBanana

Pear Calendar

Zaunka

Miau

Behi

Marru

Txakur

Katu

Banana

Egutegi

Etxe

SagarUdare

Zaunka

Miau

Behi

Marru

TxakurKatu

Banana

EgutegiEtxe

Sagar

Udare

TxakurSagar⋮

Egutegi

DogApple⋮

Calendar

𝑊

𝑋 𝑍 𝑋𝑊

Basque English



Moo

MeowBark

Dog

Apple

House

cat

CowBanana

Pear Calendar

Zaunka

Miau

Behi

Marru

Txakur

Katu

Banana

Egutegi

Etxe

SagarUdare

Zaunka

Miau

Behi

Marru

TxakurKatu

Banana

EgutegiEtxe

Sagar

Udare

𝑋:,∗𝑋<,∗⋮

𝑋=,∗

𝑊 ≈

𝑍:,∗𝑍<,∗⋮𝑍=,∗

TxakurSagar⋮

Egutegi

DogApple⋮

Calendar

𝑊

𝑋 𝑍 𝑋𝑊

Basque English



Moo

MeowBark

Dog

Apple

House

cat

CowBanana

Pear Calendar

Zaunka

Miau

Behi

Marru

Txakur

Katu

Banana

Egutegi

Etxe

SagarUdare

Zaunka

Miau

Behi

Marru

TxakurKatu

Banana

EgutegiEtxe

Sagar

Udare

𝑋:,∗𝑋<,∗⋮

𝑋=,∗

𝑊 ≈

𝑍:,∗𝑍<,∗⋮𝑍=,∗

TxakurSagar⋮

Egutegi

DogApple⋮

Calendar

𝑊

𝑋 𝑍 𝑋𝑊

Mikolov et al. (arXiv’13)Basque English


Moo

MeowBark

Dog

Apple

House

cat

CowBanana

Pear Calendar

Zaunka

Miau

Behi

Marru

Txakur

Katu

Banana

Egutegi

Etxe

SagarUdare

Zaunka

Miau

Behi

Marru

TxakurKatu

Banana

EgutegiEtxe

Sagar

Udare

𝑋:,∗𝑋<,∗⋮

𝑋=,∗

𝑊 ≈

𝑍:,∗𝑍<,∗⋮𝑍=,∗

TxakurSagar⋮

Egutegi

DogApple⋮

Calendar

𝑊

𝑋 𝑍 𝑋𝑊

arg min!

'"

𝑋"∗𝑊 − 𝑍$∗%

Basque English


Zaunka

Miau

Behi

Marru

TxakurKatu

Banana

EgutegiEtxe

Sagar

Udare

𝑋 𝑍

Moo

MeowBark

Dog

Apple

House

cat

CowBanana

Pear Calendar

Basque English


𝑋 𝑍


𝑊

𝑋 𝑍


𝑊

𝑋 𝑍 𝑋𝑊


𝑊

𝑋 𝑍 𝑋𝑊

𝑊∗ = arg min/∈1(3)

4&

min5

𝑋&∗𝑊 − 𝑍5∗6


𝑊∗ = arg min/∈1(3)

4&

min5

𝑋&∗𝑊 − 𝑍5∗6


𝑊∗ = arg min/∈1(3)

4&

min5

𝑋&∗𝑊 − 𝑍5∗6

Cross-lingual word embedding alignmentSelf-learning

Dictionary

𝑊∗ = arg min/∈1(3)

4&

min5

𝑋&∗𝑊 − 𝑍5∗6


Mapping

Dictionary

𝑊∗ = arg min/∈1(3)

4&

min5

𝑋&∗𝑊 − 𝑍5∗6


Mapping

Dictionary

Dictionary

𝑊∗ = arg min/∈1(3)

4&

min5

𝑋&∗𝑊 − 𝑍5∗6


Mapping

Dictionary

Dictionary

𝑊∗ = arg min/∈1(3)

4&

min5

𝑋&∗𝑊 − 𝑍5∗6

- 25 word pairs


Mapping

Dictionary

Dictionary

𝑊∗ = arg min/∈1(3)

4&

min5

𝑋&∗𝑊 − 𝑍5∗6

- 25 word pairs- Numeral list


Mapping

Dictionary

Dictionary

𝑊∗ = arg min/∈1(3)

4&

min5

𝑋&∗𝑊 − 𝑍5∗6

- 25 word pairs- Numeral list

- Random dict.


Mapping

Dictionary

Dictionary


Intra-lingual similarity distribution

Mapping

Dictionary

Dictionary

English



Mapping

Dictionary

Dictionary

two

English



Mapping

Dictionary

Dictionary

two

English

for x in vocab:

sim(“two”, x)



Mapping

Dictionary

Dictionary

two

English



Mapping

Dictionary

Dictionary

two

ItalianEnglish



Mapping

Dictionary

Dictionary

two due (two)

ItalianEnglish



Mapping

Dictionary

Dictionary

two due (two) cane (dog)

ItalianEnglish



Mapping

Dictionary

Dictionary

Intra-lingual similarity distributiontwo due (two) cane (dog)

ItalianEnglish


Mapping

Dictionary

Dictionary


ItalianEnglish

𝑋7 = sorted 𝑋𝑋8


Mapping

Dictionary

Dictionary


ItalianEnglish

𝑋7 = sorted 𝑋𝑋8 𝑍7 = sorted 𝑍𝑍8


Mapping

Dictionary

Dictionary


ItalianEnglish



+

Mapping

Dictionary

Dictionary


ItalianEnglish



+Robust self-learning

Mapping

Dictionary

Dictionary


ItalianEnglish




- Stochastic dictionary induction

Mapping

Dictionary

Dictionary


ItalianEnglish




- Stochastic dictionary induction- Frequency-based vocabulary cutoff

Mapping

Dictionary

Dictionary


ItalianEnglish




- Stochastic dictionary induction- Frequency-based vocabulary cutoff- CSLS retrieval (Conneau et al., ICLR’18)

Mapping

Dictionary

Dictionary


ItalianEnglish




- Stochastic dictionary induction- Frequency-based vocabulary cutoff- CSLS retrieval (Conneau et al., ICLR’18)- Bidirectional dictionary induction

Mapping

Dictionary

Dictionary


ItalianEnglish




- Stochastic dictionary induction- Frequency-based vocabulary cutoff- CSLS retrieval (Conneau et al., ICLR’18)- Bidirectional dictionary induction- Final symmetric re-weighting

Cross-lingual word embedding alignmentSupervision Method


5k dict.

Mikolov et al. (2013)Faruqui and Dyer (2014)Shigeto et al. (2015)Dinu et al. (2015)Lazaridou et al. (2015)Xing et al. (2015)Zhang et al. (2016)Artetxe et al. (2016)Smith et al. (2017)Artetxe et al. (2018a)


5k dict.


25 dict. Artetxe et al. (2017)


5k dict.


25 dict. Artetxe et al. (2017)Init.

heurist.Smith et al. (2017), cognatesArtetxe et al. (2017), num.


5k dict.




None

Zhang et al. (2017), 𝜆 = 1Zhang et al. (2017), 𝜆 = 10Conneau et al. (2018), code‡

Conneau et al. (2018), paper‡

Artetxe et al. (2018b)

Cross-lingual word embedding alignmentSupervision Method EN-IT EN-DE EN-FI EN-ES

5k dict.




None

Zhang et al. (2017), 𝜆 = 1Zhang et al. (2017), 𝜆 = 10Conneau et al. (2018), code‡

Conneau et al. (2018), paper‡

Artetxe et al. (2018b)

Cross-lingual word embedding alignmentSupervision Method EN-IT EN-DE EN-FI EN-ES

5k dict.

Mikolov et al. (2013) 34.93† 35.00† 25.91† 27.73†

Faruqui and Dyer (2014) 38.40* 37.13* 27.60* 26.80*

Shigeto et al. (2015) 41.53† 43.07† 31.04† 33.73†

Dinu et al. (2015) 37.7 38.93* 29.14* 30.40*

Lazaridou et al. (2015) 40.2 - - -Xing et al. (2015) 36.87† 41.27† 28.23† 31.20†

Zhang et al. (2016) 36.73† 40.80† 28.16† 31.07†

Artetxe et al. (2016) 39.27 41.87* 30.62* 31.40*

Smith et al. (2017) 43.1 43.33† 29.42† 35.13†

Artetxe et al. (2018a) 45.27 44.13 32.94 36.6025 dict. Artetxe et al. (2017) 37.27 39.60 28.16 -

Init.heurist.

Smith et al. (2017), cognates 39.9 - - -Artetxe et al. (2017), num. 39.40 40.27 26.47 -

None

Zhang et al. (2017), 𝜆 = 1 0.00* 0.00* 0.00* 0.00*

Zhang et al. (2017), 𝜆 = 10 0.00* 0.00* 0.01* 0.01*

Conneau et al. (2018), code‡ 45.15* 46.83* 0.38* 35.38*

Conneau et al. (2018), paper‡ 45.1 0.01* 0.01* 35.44*

Artetxe et al. (2018b) 48.13 48.19 32.63 37.33

Machine Translation

L1 corpus

non-parallel



L2 embeddings

Machine Translation

L1 corpus

non-parallel



L2 embeddings

- Neural approach

Machine Translation

L1 corpus

non-parallel



L2 embeddings

- Neural approach

- Statistical approach

Unsupervised neural machine translation

Training


Training


Une fusillade a eu lieu à l’aéroportinternational de Los Angeles.

TrainingSupervised



TrainingSupervised


Une fusillade a eu lieu à l’aéroportinternational de Los Angeles. There was a

shooting in Los Angeles International Airport.

TrainingSupervised




TrainingSupervisedDenoising




TrainingSupervisedDenoising


Une lieu fusillade a eu à l’aéroport de Los international Angeles.


TrainingSupervisedDenoisingBacktranslation




A shooting has had place in airport international of Los Angeles.

TrainingSupervisedDenoisingBacktranslation



Machine Translation

L1 corpus

non-parallel



L2 embeddings

- Neural approach


L1 corpus L2 corpus

Phrase-based SMT

L1 corpus L2 corpus

Phrase-based SMT

Phrase-based SMT

L1 corpus L2 corpus

Phrase-based SMT

Phrase-based SMTLog-linear model combining

L1 corpus L2 corpus

Phrase-based SMT

Phrasetable


- Phrase table

L1 corpus L2 corpus

Phrase-based SMT

Phrasetable


- Phrase table

nire iritziz in my opinion

nire iritziz in my view

nire iritziz I think

opari bat a present

opari bat one present

opari bat a gift

⁞

L1 corpus L2 corpus

Phrase-based SMT

Phrasetable


- Phrase table- Direct/inverse translation probabilities

𝝓(I𝒇|I𝒆) 𝝓(I𝒆|I𝒇)nire iritziz in my opinion 0.54 0.63

nire iritziz in my view 0.32 0.68

nire iritziz I think 0.11 0.09

opari bat a present 0.32 0.56

opari bat one present 0.14 0.73

opari bat a gift 0.11 0.49

⁞

L1 corpus L2 corpus

Phrase-based SMT

Phrasetable


- Phrase table- Direct/inverse translation probabilities- Direct/inverse lexical weightings

𝝓(I𝒇|I𝒆) 𝝓(I𝒆|I𝒇) 𝐥𝐞𝐱(I𝒇|I𝒆) 𝐥𝐞𝐱(I𝒆|I𝒇)nire iritziz in my opinion 0.54 0.63 0.12 0.15

nire iritziz in my view 0.32 0.68 0.09 0.16

nire iritziz I think 0.11 0.09 0.04 0.02

opari bat a present 0.32 0.56 0.21 0.22

opari bat one present 0.14 0.73 0.18 0.32

opari bat a gift 0.11 0.49 0.11 0.13

⁞

L1 corpus L2 corpus

Phrase-based SMT

Phrasetable

Language model



- Language model

L1 corpus L2 corpus

Phrase-based SMT

Phrasetable

Language model



- Language model- N-gram frequency counts with back-off and smoothing

L1 corpus L2 corpus

Phrase-based SMT

Reordering model

Phrasetable

Language model




- Reordering model

L1 corpus L2 corpus

Phrase-based SMT

Reordering model

Phrasetable

Language model




- Reordering model- Distortion model (distance based)

L1 corpus L2 corpus

Phrase-based SMT

Reordering model

Phrasetable

Language model




- Reordering model- Distortion model (distance based)- Lexical reordering model

L1 corpus L2 corpus

Phrase-based SMT

Reordering model

Phrasetable

Language model

Word/phrase penalty





- Word/phrase penalty

L1 corpus L2 corpus

Phrase-based SMT

Reordering model

Phrasetable

Language model

Word/phrase penalty





- Word/phrase penalty- Fixed score to control the length of the output

L1 corpus L2 corpus

Phrase-based SMT

Reordering model

Phrasetable

Language model

Word/phrase penalty





L1 corpus L2 corpus

learn each component

Phrase-based SMT

Reordering model

Phrasetable

Language model

Word/phrase penalty





L1 corpus L2 corpus


Phrase-based SMT

Reordering model

Phrasetable

Language model

Word/phrase penalty

Unsupervised phrase-based SMT





L1 corpus L2 corpus


Phrase-based SMT

Reordering model

Phrasetable

Language model

Word/phrase penalty

Unsupervised phrase-based SMTLearn components from monolingual corpora





L1 corpus L2 corpus


Phrase-based SMT

Reordering model

Phrasetable

Language model

Word/phrase penalty





- Word/phrase penalty EASY!!!- Fixed score to control the length of the output

L1 corpus L2 corpus


Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty






L1 corpus L2 corpus


Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty




- Reordering model EASY!!!- Distortion model (distance based)- Lexical reordering model







Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty

L1 corpus L2 corpus




- Language model EASY!!!- N-gram frequency counts with back-off and smoothing



Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty

L1 corpus L2 corpus



- Phrase table TRICKY…- Direct/inverse translation probabilities- Direct/inverse lexical weightings




Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty

L1 corpus L2 corpus







Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty

L1 corpus L2 corpus






L2 n-gram embeddings

Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty


L1 corpus L2 corpus







Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty


Cross-lingual mapping

L1 corpus L2 corpus







Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus L2 corpus

I will go to New York by plane .







Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus L2 corpus








Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus L2 corpus


𝑐







Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus L2 corpus

I will go to New York by plane .𝑐𝑝







Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus L2 corpus

I will go to New York by plane .𝑝 𝑐







Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus L2 corpus

I will go to New York by plane .𝑐𝑝







Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus L2 corpus







Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus L2 corpus

For each �̅�, estimate 𝜙 ̅𝑓|�̅� for 100-NN:







Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus L2 corpus

𝜙 ̅𝑓|�̅� =𝑒 ⁄"#$ &̅,(̅ )

∑(* 𝑒⁄"#$ &̅,(* )








Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus L2 corpus

𝜙 ̅𝑓|�̅� =𝑒 ⁄"#$ &̅,(̅ )

∑(* 𝑒⁄"#$ &̅,(* )


min!$̅#

log𝜙 ̅𝑓|NN ̅$ ̅𝑓 +$̅$

log𝜙 �̅�|NN ̅# �̅�







Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus L2 corpus


- Phrase table TRICKY… EASY!!!- Direct/inverse translation probabilities- Direct/inverse lexical weightings





Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus L2 corpus

Unsupervised phrase-based SMT


Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus L2 corpus

Unsupervised phrase-based SMTL1 corpus

Phrase-based SMT

Word/phrase penalty

Distortion model

Language model



L2 corpus


Phrasetable

tuning


Phrase-based SMT

Word/phrase penalty

Distortion model

Language model



L2 corpus


Goal: Adjust the weights of the resulting log-linearmodel to optimize some evaluation metric (e.g. BLEU)

Phrasetable

tuning


Phrase-based SMT

Word/phrase penalty

Distortion model

Language model



L2 corpus



… but we don’t have a parallel development set!

Phrasetable

tuning


Phrase-based SMT

Word/phrase penalty

Distortion model

Language model



L2 corpus




Solutions:

Phrasetable

tuning


Phrase-based SMT

Word/phrase penalty

Distortion model

Language model



L2 corpus




Solutions:- Default weights without tuning (Lample et al., EMNLP’18)

Phrasetable

tuning


Phrase-based SMT

Word/phrase penalty

Distortion model

Language model



L2 corpus




Solutions:- Default weights without tuning (Lample et al., EMNLP’18)- Alternating optimization with back-translation (Artetxe et al., EMNLP’18)

Phrasetable

tuning


Phrase-based SMT

Word/phrase penalty

Distortion model

Language model



L2 corpus




Solutions:- Default weights without tuning (Lample et al., EMNLP’18)- Alternating optimization with back-translation (Artetxe et al., EMNLP’18)- Unsupervised optimization criterion (Artetxe et al., ACL’19)

Phrasetable

tuning


Phrase-based SMT

Word/phrase penalty

Distortion model

Language model



L2 corpus





Phrasetable

tuning

𝐿 = 𝐿!9!:; 𝐸 + 𝐿!9!:; 𝐹 + 𝐿:< 𝐸 + 𝐿:<(𝐹)


Phrase-based SMT

Word/phrase penalty

Distortion model

Language model



L2 corpus





Phrasetable

tuning

𝐿 = 𝐿!9!:; 𝐸 + 𝐿!9!:; 𝐹 + 𝐿:< 𝐸 + 𝐿:<(𝐹)

• 𝐿!9!:; 𝐸 = 1 − BLEU T=→? T?→= 𝐸 , 𝐸


Phrase-based SMT

Word/phrase penalty

Distortion model

Language model



L2 corpus





Phrasetable

tuning

𝐿 = 𝐿!9!:; 𝐸 + 𝐿!9!:; 𝐹 + 𝐿:< 𝐸 + 𝐿:<(𝐹)

LP = LP 𝐸 ⋅ LP 𝐹 , LP 𝐸 = max 1,&'( )%→' )'→% *

&'( *

• 𝐿!9!:; 𝐸 = 1 − BLEU T=→? T?→= 𝐸 , 𝐸

• 𝐿:< 𝐸 = max 0, H 𝐹 − H T?→= 𝐸6⋅ LP


Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus

tuning

L2 corpusUnsupervised phrase-based SMT


Phrase-based SMT

Word/phrase penalty

Distortion model

Language model



L2 corpus


Phrasetable

tuning refine

phra

se

tabl

e


Phrase-based SMT

Word/phrase penalty

Distortion model

Language model



L2 corpus


Phrasetable

tuning refine

L1 train

phra

se

tabl

e


Phrase-based SMT

Word/phrase penalty

Distortion model

Language model



L2 corpus


Phrasetable

tuning refine

L1 → L2 SMTL1 train

phra

se

tabl

e


Phrase-based SMT

Word/phrase penalty

Distortion model

Language model



L2 corpus


Phrasetable

tuning refine

L1 train

phra

se

tabl

e

L1 → L2 SMT


Phrase-based SMT

Word/phrase penalty

Distortion model

Language model



L2 corpus


Phrasetable

tuning refine

L1 train L2’ train

phra

se

tabl

e

L1 → L2 SMT


Phrase-based SMT

Word/phrase penalty

Distortion model

Language model



L2 corpus


Phrasetable

tuning refine


phra

se

tabl

eL2 → L1 SMT

L1 → L2 SMT


Phrase-based SMT

Word/phrase penalty

Distortion model

Language model



L2 corpus


Phrasetable

tuning refine


phra

se

tabl

e Let’s look inside!

L1 → L2 SMT

L2 → L1 SMT


Phrase-based SMT

Word/phrase penalty

Distortion model

Language model



L2 corpus


Phrasetable

tuning refine


phra

se

tabl

eph

rase

ta

ble

L1 → L2 SMT


Phrase-based SMT

Word/phrase penalty

Distortion model

Language model



L2 corpus


Phrasetable

tuning refine


phra

se

tabl

e 𝒑 ]𝒍𝟏| ]𝒍𝟐

phra

se

tabl

e

L1 → L2 SMT


Phrase-based SMT

Word/phrase penalty

Distortion model

Language model



L2 corpus


Phrasetable

tuning refine


phra

se

tabl

e 𝒑 ]𝒍𝟏| ]𝒍𝟐

𝒑 ]𝒍𝟐| ]𝒍𝟏phra

se

tabl

e

L1 → L2 SMT


Phrase-based SMT

Word/phrase penalty

Distortion model

Language model



L2 corpus


Phrasetable

tuning refine


phra

se

tabl

e

Spurious targets!

𝒑 ]𝒍𝟏| ]𝒍𝟐


se

tabl

e

L1 → L2 SMT


Phrase-based SMT

Word/phrase penalty

Distortion model

Language model



L2 corpus


Phrasetable

tuning refine


phra

se

tabl

e 𝒑 ]𝒍𝟏| ]𝒍𝟐

phra

se

tabl

e

L1 → L2 SMT


Phrase-based SMT

Word/phrase penalty

Distortion model

Language model



L2 corpus


Phrasetable

tuning refine


phra

se

tabl

e 𝒑 ]𝒍𝟏| ]𝒍𝟐

phra

se

tabl

e

L2 → L1 SMT

L1 → L2 SMT


Phrase-based SMT

Word/phrase penalty

Distortion model

Language model



L2 corpus


Phrasetable

tuning refine


phra

se

tabl

e 𝒑 ]𝒍𝟏| ]𝒍𝟐

phra

se

tabl

e

L2 train

L1 → L2 SMT

L2 → L1 SMT


Phrase-based SMT

Word/phrase penalty

Distortion model

Language model



L2 corpus


Phrasetable

tuning refine


phra

se

tabl

e 𝒑 ]𝒍𝟏| ]𝒍𝟐

phra

se

tabl

e


L1 → L2 SMT

L2 → L1 SMT


Phrase-based SMT

Word/phrase penalty

Distortion model

Language model



L2 corpus


Phrasetable

tuning refine


phra

se

tabl

e 𝒑 ]𝒍𝟏| ]𝒍𝟐


se

tabl

e


L1 → L2 SMT

L2 → L1 SMT


Phrase-based SMT

Word/phrase penalty

Distortion model

Language model



L2 corpus


Phrasetable

tuning refine


phra

se

tabl

e 𝒑 ]𝒍𝟏| ]𝒍𝟐


se

tabl

e

lexi

cal

reor

derin

g


L1 → L2 SMT

L2 → L1 SMT


Phrase-based SMT

Word/phrase penalty

Distortion model

Language model



L2 corpus


Phrasetable

tuning refine


phra

se

tabl

e 𝒑 ]𝒍𝟏| ]𝒍𝟐


se

tabl

e

lexi

cal

reor

derin

g


unsup. tuning

unsup. tuning

L1 → L2 SMT

L2 → L1 SMT


Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus

tuning

L2 corpus

refine


Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus

tuning

L2 corpus

refine

but…


Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus

tuning

L2 corpus

refine

but…

NMT >> SMT


Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus

tuning

L2 corpus

refine

but…

NMT >> SMT

(unsupervised) SMT has a hard ceiling!


Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus

tuning

L2 corpus

refine


Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus

tuning

L2 corpus

refine

NMT

NMT hybridization


Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus

tuning

L2 corpus

refine

NMT

NMT hybridization

L1 train

L2 train


Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus

tuning

L2 corpus

refine

NMT

NMT hybridization

L1 → L2 SMTL1 train

L2 train


Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus

tuning

L2 corpus

refine

NMT

NMT hybridization

L1 train

L2 train


Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus

tuning

L2 corpus

refine

NMT

L1 → L2 SMT

NMT hybridization


L2 train


Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus

tuning

L2 corpus

refine

NMT

L1 → L2 SMT

NMT hybridization


L2 → L1 NMTL2 train


Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus

tuning

L2 corpus

refine

NMT

L1 → L2 SMT

NMT hybridization


L2 trainL2 → L1 SMT


Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus

tuning

L2 corpus

refine

NMT

L1 → L2 SMT

L2 → L1 NMT

NMT hybridization


L2 train


Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus

tuning

L2 corpus

refine

NMT

L1 → L2 SMT

L2 → L1 NMT

L2 → L1 SMT

NMT hybridization




Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus

tuning

L2 corpus

refine

NMT

L1 → L2 SMT

L2 → L1 NMT

L2 → L1 SMT

NMT hybridization

L1 train L2’ trainL1 → L2 NMT



Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus

tuning

L2 corpus

refine

NMT

L1 → L2 SMT

L2 → L1 NMT

L2 → L1 SMT

NMT hybridization




Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus

tuning

L2 corpus

refine

NMT

L1 → L2 SMT

L2 → L1 NMT

L2 → L1 SMT

L1 → L2 NMT

NMT hybridization



𝑁BC8 = 𝑁 ⋅ max(0, 1 − 𝑡/𝑎)


Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus

tuning

L2 corpus

refine

NMT

L1 → L2 SMT

L2 → L1 NMT

L2 → L1 SMT

L1 → L2 NMT

NMT hybridization



𝑁BC8 = 𝑁 ⋅ max(0, 1 − 𝑡/𝑎)𝑁-C8 = 𝑁 − 𝑁BC8


Phrase-based SMT

Distortion model

Phrasetable

Language model

Word/phrase penalty



L1 corpus

tuning

L2 corpus

refine

NMT

L1 → L2 SMT

L2 → L1 NMT

L2 → L1 SMT

L1 → L2 NMT

Machine Translation

L1 corpus

non-parallel L1 embeddings


L2 embeddings

Outline

- Neural approach


Machine Translation

L1 corpus

non-parallel

General recipe

L1 embeddings


L2 embeddings

- Neural approach


Machine Translation

L1 corpus

non-parallel

General recipe

L1 embeddings


L2 embeddings

Initialization

- Neural approach


Machine Translation

L1 corpus

non-parallel

General recipe

L1 embeddings


L2 embeddings

Initialization Training

- Neural approach


Machine Translation

L1 corpus

non-parallel

General recipe

L1 embeddings


L2 embeddings


- Denoising

- Neural approach


Machine Translation

L1 corpus

non-parallel

General recipe

L1 embeddings


L2 embeddings


- Denoising

- Back-translation

- Neural approach


Machine Translation

L1 corpus

non-parallel

General recipe

L1 embeddings


L2 embeddings


- Cross-lingual word embeddings (Artetxe et al., ICLR’18; Lample et al., ICLR’18) - Denoising

- Back-translation

- Neural approach


Machine Translation

L1 corpus

non-parallel

General recipe

L1 embeddings


L2 embeddings


- Cross-lingual word embeddings (Artetxe et al., ICLR’18; Lample et al., ICLR’18)

- Deep pretraining (Conneau & Lample, NeurIPS’19; Song et al., ICML’19; Liu et al., arXiv’20)

- Denoising

- Back-translation

- Neural approach


Machine Translation

L1 corpus

non-parallel

General recipe

L1 embeddings


L2 embeddings




- Pretrain the full network through masked language modeling(e.g., multilingual BERT, XLM, MASS, mBART)

- Denoising

- Back-translation

- Neural approach


Machine Translation

L1 corpus

non-parallel

General recipe

L1 embeddings


L2 embeddings





- Works well, but very expensive to train

- Denoising

- Back-translation

- Neural approach


Machine Translation

L1 corpus

non-parallel

General recipe

L1 embeddings


L2 embeddings






- Masking can be seen as a form of noise!

- Denoising

- Back-translation

- Neural approach


Machine Translation

L1 corpus

non-parallel

General recipe

L1 embeddings


L2 embeddings







- Synthetic parallel corpus (Marie & Fujita, arXiv’18; Artetxe et al., ACL’19)

- Denoising

- Back-translation

- Neural approach


Machine Translation

L1 corpus

non-parallel

General recipe

L1 embeddings


L2 embeddings








- Basic: word-for-word translation using cross-lingual word embeddings

- Denoising

- Back-translation

- Neural approach


Machine Translation

L1 corpus

non-parallel

General recipe

L1 embeddings


L2 embeddings








- Basic: word-for-word translation using cross-lingual word embeddings

- Better: Unsupervised statistical machine translation

- Denoising

- Back-translation

- Neural approach


Results

Results

- Languages: French-English, German-English

Results


- Training: WMT-14 News Crawl

Results



- Test set: WMT-14 newstest (BLEU)

Results

FR-EN EN-FR DE-EN EN-DE




Results

FR-EN EN-FR DE-EN EN-DECross-lingual embs. (Artetxe et al., ICLR’18)* 15.6 15.1 10.2 6.6




*Tokenized BLEU (about 1-2 points higher)

Results


+ scaling up (Conneau & Lample, NeurIPS’19)* 29.4 29.4 - -





Results


+ scaling up (Conneau & Lample, NeurIPS’19)* 29.4 29.4 - -Deep pre-training (Conneau & Lample, NeurIPS’19)* 33.3 33.4 - -





Results


+ scaling up (Conneau & Lample, NeurIPS’19)* 29.4 29.4 - -Deep pre-training (Conneau & Lample, NeurIPS’19)* 33.3 33.4 - -Unsup SMT + NMT (Artetxe et al., ACL’19)* 33.5 36.2 27.0 22.5





Results


+ scaling up (Conneau & Lample, NeurIPS’19)* 29.4 29.4 - -Deep pre-training (Conneau & Lample, NeurIPS’19)* 33.3 33.4 - -Unsup SMT + NMT (Artetxe et al., ACL’19)* 33.5 36.2 27.0 22.5

detok. SacreBLEU 33.2 33.6 26.4 21.2





Results


Unsup.

Cross-lingual embs. (Artetxe et al., ICLR’18)* 15.6 15.1 10.2 6.6+ scaling up (Conneau & Lample, NeurIPS’19)* 29.4 29.4 - -

Deep pre-training (Conneau & Lample, NeurIPS’19)* 33.3 33.4 - -Unsup SMT + NMT (Artetxe et al., ACL’19)* 33.5 36.2 27.0 22.5

detok. SacreBLEU 33.2 33.6 26.4 21.2

Supervised





Results


Unsup.



detok. SacreBLEU 33.2 33.6 26.4 21.2

SupervisedWMT winner 35.0 35.8 29.0 20.6





Results


Unsup.



detok. SacreBLEU 33.2 33.6 26.4 21.2

SupervisedWMT winner 35.0 35.8 29.0 20.6Original transformer (Vaswani et al., NIPS’17)* - 41.0 - 28.4





Results


Unsup.



detok. SacreBLEU 33.2 33.6 26.4 21.2

SupervisedWMT winner 35.0 35.8 29.0 20.6Original transformer (Vaswani et al., NIPS’17)* - 41.0 - 28.4Large scale back-translation (Edunov et al., EMNLP’18) - 43.8 - 33.8





Conclusions

ConclusionsUnsupervised machine translation is a reality!


- General recipe: back-translation + smart initialization


- General recipe: back-translation + smart initialization- Align monolingual representations


- General recipe: back-translation + smart initialization- Align monolingual representations- Generalize from word level to sentence level translation



Strong results



Strong results- Competitive with supervised SOTA from 6 years ago



Strong results- Competitive with supervised SOTA from 6 years ago- …in just 2-3 years!



Strong results- Competitive with supervised SOTA from 6 years ago- …in just 2-3 years!

What’s next?

Thank you!

Twitter: @artetxemEmail: [email protected]

Unsupervised Machine Translation - Stanford UniversityIntroduction WMT 2019 English-German System...

Documents

Transcript of Unsupervised Machine Translation - Stanford UniversityIntroduction WMT 2019 English-German System...