Unsupervised Cross-Modal Alignment of Speech...
Transcript of Unsupervised Cross-Modal Alignment of Speech...
Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces
Yu-An Chung Wei-Hung Weng Schrasing Tong James GlassComputer Science and Artificial Intelligence Laboratory
Massachusetts Institute of TechnologyCambridge, Massachusetts, USA
NeurIPSMontréal, Québec, Canada
December 2018
Machine Translation (MT)
Automatic Speech Recognition (ASR)
Text-to-Speech Synthesis (TTS)
“the cat is black” “le chat est noir”
“dogs are cute”
“cats are adorable”
MT system
ASR system
TTS system
Training data
( , )
( , )
( , )
Englishtranscription
Frenchtranslation
Englishaudio
Englishtext
Englishtext
Englishaudio
Machine Translation (MT)
Automatic Speech Recognition (ASR)
Text-to-Speech Synthesis (TTS)
“the cat is black” “le chat est noir”
“dogs are cute”
“cats are adorable”
MT system
ASR system
TTS system
Training data
Parallel corpora for training à Expensive to collect!
( , )
( , )
( , )
Englishtranscription
Frenchtranslation
Englishaudio
Englishtext
Englishtext
Englishaudio
Wikipedia is a multilingual, web-based, free encyclopedia based on a model of openlyeditable and viewable content, a wiki. It is
the largest and most popular …
Language 1 Language 2Framework
Wikipedia is a multilingual, web-based, free encyclopedia based on a model of openlyeditable and viewable content, a wiki. It is
the largest and most popular …
Language 1 Language 2Do not need to be parallel!Framework
Wikipedia is a multilingual, web-based, free encyclopedia based on a model of openlyeditable and viewable content, a wiki. It is
the largest and most popular …
Word2vec[Mikolov et al., 2013]
Language 1 Language 2Do not need to be parallel!
Y =y$y%⋮y'
(
∈ ℝ+×'
Framework
Wikipedia is a multilingual, web-based, free encyclopedia based on a model of openlyeditable and viewable content, a wiki. It is
the largest and most popular …
Speech2vec[Chung & Glass, 2018]
Word2vec[Mikolov et al., 2013]
Language 1 Language 2Do not need to be parallel!
X =x$x%⋮x'
(
∈ ℝ+×'
Y =y$y%⋮y'
(
∈ ℝ+×'
Framework
Wikipedia is a multilingual, web-based, free encyclopedia based on a model of openlyeditable and viewable content, a wiki. It is
the largest and most popular …
Speech2vec[Chung & Glass, 2018]
Word2vec[Mikolov et al., 2013]
Language 1 Language 2Do not need to be parallel!
MUSE[Lample et al., 2018]
X =x$x%⋮x'
(
∈ ℝ+×'
Y =y$y%⋮y'
(
∈ ℝ+×'
Learn a linear mapping W such that
W∗ = argminW∈ℝ7×7
WX − Y 9
Framework
Wikipedia is a multilingual, web-based, free encyclopedia based on a model of openlyeditable and viewable content, a wiki. It is
the largest and most popular …
Speech2vec[Chung & Glass, 2018]
Word2vec[Mikolov et al., 2013]
Language 1 Language 2Do not need to be parallel!
MUSE[Lample et al., 2018]
p Word2vec• Learns distributed representations of words from a text corpus that model
word semantics in an unsupervised mannerp Speech2vec
• A speech version of word2vec that learns semantic word representationsfrom a speech corpus; also unsupervised
p MUSE• An unsupervised way to learn W with the assumption that the two
embedding spaces are approximately isomorphic
X =x%x&⋮x(
)
∈ ℝ,×(
Y =y%y&⋮y(
)
∈ ℝ,×(
Learn a linear mapping W such that
W∗ = argminW∈ℝ7×7
WX − Y 9
Components
Framework
Wikipedia is a multilingual, web-based, free encyclopedia based on a model of openlyeditable and viewable content, a wiki. It is
the largest and most popular …
Speech2vec[Chung & Glass, 2018]
Word2vec[Mikolov et al., 2013]
Language 1 Language 2Do not need to be parallel!
MUSE[Lample et al., 2018]
p Word2vec• Learns distributed representations of words from a text corpus that model
word semantics in an unsupervised mannerp Speech2vec
• A speech version of word2vec that learns semantic word representationsfrom a speech corpus; also unsupervised
p MUSE• An unsupervised way to learn W with the assumption that the two
embedding spaces are approximately isomorphic
p Rely only on monolingual corpora of speech and text that:• Do not need to be parallel• Can be collected independently, greatly reducing human
labeling effortsp The framework is unsupervised:
• Each component uses unsupervised learning• Applicable to low-resource language pairs that lack bilingual
resources
X =x%x&⋮x(
)
∈ ℝ,×(
Y =y%y&⋮y(
)
∈ ℝ,×(
Learn a linear mapping W such that
W∗ = argminW∈ℝ7×7
WX − Y 9
Components
AdvantagesFramework
Wikipedia is a multilingual, web-based, free encyclopedia based on a model of openlyeditable and viewable content, a wiki. It is
the largest and most popular …
Speech2vec[Chung & Glass, 2018]
Word2vec[Mikolov et al., 2013]
Language 1 Language 2Do not need to be parallel!
MUSE[Lample et al., 2018]
p Word2vec• Learns distributed representations of words from a text corpus that model
word semantics in an unsupervised mannerp Speech2vec
• A speech version of word2vec that learns semantic word representationsfrom a speech corpus; also unsupervised
p MUSE• An unsupervised way to learn W with the assumption that the two
embedding spaces are approximately isomorphic
p Rely only on monolingual corpora of speech and text that:• Do not need to be parallel• Can be collected independently, greatly reducing human
labeling effortsp The framework is unsupervised:
• Each component uses unsupervised learning• Applicable to low-resource language pairs that lack bilingual
resources
X =x%x&⋮x(
)
∈ ℝ,×(
Y =y%y&⋮y(
)
∈ ℝ,×(
Learn a linear mapping W such that
W∗ = argminW∈ℝ7×7
WX − Y 9
Usage of the learned W#1: Unsupervised spoken word recognition: when language 1 = language 2
(e.g., both English)
#2: Unsupervised spoken word translation: when language 1 ≠ language 2(e.g., English to French)
Nearest neighborsearch
Input spokenword “dog”
+“dog”“dogs”
“puppy”“pet”⋮
Nearest neighborsearch
Input spokenword “dog”
+“chien”“chiot”⋮
Components
AdvantagesFramework
An interesting property of our approach: synonym retrievalà The list of nearest neighbors actually contain both synonyms and different lexical forms of the input spoken word.
Foundation for Unsupervised Automatic Speech Recognition
Foundation for Unsupervised Speech-to-Text Translation
Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces
10:45 AM – 12:45 PMRoom 210 & 230 AB #156