Unsupervised Cross-Modal Alignment of Speech...

Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces Yu-An Chung Wei-Hung Weng Schrasing Tong James Glass Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Cambridge, Massachusetts, USA NeurIPS Montréal, Québec, Canada December 2018

Transcript of Unsupervised Cross-Modal Alignment of Speech...

Page 1: Unsupervised Cross-Modal Alignment of Speech …people.csail.mit.edu/andyyuan/docs/nips-18.unsupervised...Word2vec [Mikolov et al., 2013] Language 1 Language 2 Do not need to be parallel!

Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces

Yu-An Chung, Wei-Hung Weng, Schrasing Tong, James Glass

Computer Science and Artificial Intelligence Laboratory

Massachusetts Institute of Technology

Cambridge, Massachusetts, USA

NeurIPS

Montréal, Québec, Canada

December 2018

Page 2:

Machine Translation (MT): “the cat is black” → MT system → “le chat est noir”

Training data: (English text, French translation)

Automatic Speech Recognition (ASR): English audio → ASR system → “dogs are cute”

Training data: (English audio, English transcription)

Text-to-Speech Synthesis (TTS): “cats are adorable” → TTS system → English audio

Training data: (English text, English audio)

Page 3:

Parallel corpora for training → Expensive to collect!

Page 4:

Framework

Wikipedia is a multilingual, web-based, free encyclopedia based on a model of openly editable and viewable content, a wiki. It is the largest and most popular …

Language 1

Language 2

Page 5:

Do not need to be parallel!

Page 6:

Word2vec [Mikolov et al., 2013]

$Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}^{\top} \in \mathbb{R}^{d \times n}$

Page 7:

Speech2vec [Chung & Glass, 2018]

$X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}^{\top} \in \mathbb{R}^{d \times n}$

Page 8:

MUSE [Lample et al., 2018]

Learn a linear mapping W such that

$W^{*} = \operatorname*{argmin}_{W \in \mathbb{R}^{d \times d}} \lVert WX - Y \rVert^{2}$
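To make the objective concrete, here is a minimal sketch that fits a linear mapping W by gradient descent on the squared norm of WX − Y. It rests on an assumption the slides do not make: the columns of X and Y are already paired, which MUSE specifically avoids needing. The helper names (`matmul`, `fit_linear_map`) and the toy 2-dimensional data are hypothetical and serve only to illustrate what minimizing this objective means.

```python
# Sketch only: fits W on PAIRED columns of X and Y by plain gradient descent.
# MUSE learns W without such pairs; this just illustrates the objective.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def loss(w, x, y):
    """Squared Frobenius norm of WX - Y."""
    wx = matmul(w, x)
    return sum((wx[i][j] - y[i][j]) ** 2
               for i in range(len(y)) for j in range(len(y[0])))

def fit_linear_map(x, y, lr=0.1, steps=200):
    """Minimize ||WX - Y||^2 over W, starting from the identity matrix."""
    d, n = len(x), len(x[0])
    w = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    xt = transpose(x)
    for _ in range(steps):
        wx = matmul(w, x)
        residual = [[wx[i][j] - y[i][j] for j in range(n)] for i in range(d)]
        grad = matmul(residual, xt)  # the full gradient is 2 * (WX - Y) X^T
        w = [[w[i][j] - lr * 2.0 * grad[i][j] for j in range(d)]
             for i in range(d)]
    return w

# Hypothetical toy data: four 2-dimensional "speech" vectors as columns of X.
X = [[1.0, 0.0, 1.0, -1.0],
     [0.0, 1.0, 1.0, 1.0]]
W_TRUE = [[0.0, -1.0],
          [1.0, 0.0]]          # a 90-degree rotation used to generate Y
Y = matmul(W_TRUE, X)          # the matching "text" vectors

W = fit_linear_map(X, Y)
print(loss(W, X, Y))
```

On this toy data the fitted W recovers the rotation that generated Y. With real embeddings the fit would only be approximate, since the two spaces are assumed to be approximately, not exactly, isomorphic.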

Page 9:

Components

▪ Word2vec
  • Learns distributed representations of words from a text corpus that model word semantics in an unsupervised manner

▪ Speech2vec
  • A speech version of word2vec that learns semantic word representations from a speech corpus; also unsupervised

▪ MUSE
  • An unsupervised way to learn W with the assumption that the two embedding spaces are approximately isomorphic

Page 10:

Advantages

▪ Rely only on monolingual corpora of speech and text that:
  • Do not need to be parallel
  • Can be collected independently, greatly reducing human labeling efforts

▪ The framework is unsupervised:
  • Each component uses unsupervised learning
  • Applicable to low-resource language pairs that lack bilingual resources

Page 11:

Usage of the learned W

#1: Unsupervised spoken word recognition: when language 1 = language 2 (e.g., both English)

Input spoken word “dog” → nearest neighbor search → “dog”, “dogs”, “puppy”, “pet”, …

Foundation for Unsupervised Automatic Speech Recognition

#2: Unsupervised spoken word translation: when language 1 ≠ language 2 (e.g., English to French)

Input spoken word “dog” → nearest neighbor search → “chien”, “chiot”, …

Foundation for Unsupervised Speech-to-Text Translation

An interesting property of our approach: synonym retrieval → The list of nearest neighbors actually contains both synonyms and different lexical forms of the input spoken word.
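The nearest neighbor search above can be sketched as follows: map the spoken-word embedding with W, then rank the text embeddings by similarity to the mapped vector. This is an illustration under stated assumptions, not the authors' implementation. Cosine similarity is assumed as the ranking criterion, and the function names, the matrix `W`, and the 2-dimensional toy vectors are all hypothetical.

```python
import math

def apply_map(w, x):
    """Compute W x for a matrix stored as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_words(speech_vec, w, text_embeddings, k=3):
    """Return the k text words whose embeddings are closest to W @ speech_vec."""
    mapped = apply_map(w, speech_vec)
    ranked = sorted(text_embeddings,
                    key=lambda word: cosine(mapped, text_embeddings[word]),
                    reverse=True)
    return ranked[:k]

# Hypothetical toy values: a learned mapping and a tiny text vocabulary.
W = [[0.0, -1.0],
     [1.0, 0.0]]
TEXT_EMBEDDINGS = {
    "dog": [0.0, 1.0],
    "dogs": [-0.1, 1.0],
    "puppy": [-0.3, 1.0],
    "car": [1.0, 0.0],
}
spoken_dog = [1.0, 0.02]  # stand-in for a Speech2vec embedding of spoken "dog"

print(nearest_words(spoken_dog, W, TEXT_EMBEDDINGS))
```

Because the ranking returns several close words rather than a single match, a list like this naturally mixes different lexical forms and synonyms of the input word, which is the synonym retrieval property noted above.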

Page 12:

Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces

10:45 AM – 12:45 PM, Room 210 & 230 AB, #156