
Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces

Yu-An Chung, Wei-Hung Weng, Schrasing Tong, James Glass

Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Cambridge, Massachusetts, USA

NeurIPS
Montréal, Québec, Canada
December 2018

Machine Translation (MT)
“the cat is black” → MT system → “le chat est noir”
Training data: (English transcription, French translation)

Automatic Speech Recognition (ASR)
[audio] → ASR system → “dogs are cute”
Training data: (English audio, English text)

Text-to-Speech Synthesis (TTS)
“cats are adorable” → TTS system → [audio]
Training data: (English text, English audio)


Parallel corpora for training → Expensive to collect!


Wikipedia is a multilingual, web-based, free encyclopedia based on a model of openly editable and viewable content, a wiki. It is the largest and most popular …

Framework

Language 1, Language 2: do not need to be parallel!

Speech2vec [Chung & Glass, 2018]

$X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}^{\top} \in \mathbb{R}^{d \times n}$

Word2vec [Mikolov et al., 2013]

$Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}^{\top} \in \mathbb{R}^{d \times n}$

MUSE [Lample et al., 2018]

Learn a linear mapping W such that

$W^{*} = \operatorname*{argmin}_{W \in \mathbb{R}^{d \times d}} \lVert WX - Y \rVert_F$
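The objective above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the columns of X and Y are already paired; MUSE itself learns W without such pairs, so this is not the MUSE procedure. It minimizes the squared Frobenius norm by plain gradient descent on invented 2-D toy data. The names `fit_linear_map`, `lr`, and `steps` are hypothetical.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def residual(W, X, Y):
    """Return WX - Y."""
    WX = matmul(W, X)
    return [[p - q for p, q in zip(r1, r2)] for r1, r2 in zip(WX, Y)]


def frobenius_sq(A):
    """Squared Frobenius norm: sum of squared entries."""
    return sum(v * v for row in A for v in row)


def fit_linear_map(X, Y, lr=0.05, steps=500):
    """Gradient descent on ||WX - Y||_F^2 (assumes paired columns)."""
    d = len(X)
    # Start from the identity mapping.
    W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    Xt = transpose(X)
    for _ in range(steps):
        R = residual(W, X, Y)
        G = matmul(R, Xt)  # the gradient is 2 * (WX - Y) X^T
        W = [[w - lr * 2.0 * g for w, g in zip(w_row, g_row)]
             for w_row, g_row in zip(W, G)]
    return W


# Toy data (invented): Y is X rotated by 90 degrees.
X = [[1.0, 0.0, -1.0, 0.5],
     [0.0, 1.0, 0.5, -1.0]]
W_true = [[0.0, -1.0],
          [1.0, 0.0]]
Y = matmul(W_true, X)

W = fit_linear_map(X, Y)
loss = frobenius_sq(residual(W, X, Y))
```

On this toy problem the learned `W` approaches the rotation used to generate `Y`, which is the behavior one would hope for when the two spaces differ only by a linear map.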







Components

▪ Word2vec
  • Learns distributed representations of words from a text corpus that model word semantics in an unsupervised manner
▪ Speech2vec
  • A speech version of word2vec that learns semantic word representations from a speech corpus; also unsupervised
▪ MUSE
  • An unsupervised way to learn W with the assumption that the two embedding spaces are approximately isomorphic
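As a rough illustration of why such embedding models need no labels: the skip-gram variant of Word2vec trains on (center, context) word pairs read directly off raw text. The sketch below only generates those pairs and does not train any embeddings. The function name `skipgram_pairs` and the window size are illustrative assumptions, not details taken from the talk.

```python
def skipgram_pairs(tokens, window=2):
    """Collect (center, context) pairs within a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = "the cat is black".split()
pairs = skipgram_pairs(sentence, window=1)
for center, context in pairs:
    print(center, "->", context)
```

The training signal comes entirely from word co-occurrence, so an unannotated corpus is enough.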












Advantages

▪ Rely only on monolingual corpora of speech and text that:
  • Do not need to be parallel
  • Can be collected independently, greatly reducing human labeling efforts
▪ The framework is unsupervised:
  • Each component uses unsupervised learning
  • Applicable to low-resource language pairs that lack bilingual resources













Usage of the learned W

#1: Unsupervised spoken word recognition: when language 1 = language 2 (e.g., both English)
Input spoken word “dog” → nearest neighbor search → “dog”, “dogs”, “puppy”, “pet”, ⋮

#2: Unsupervised spoken word translation: when language 1 ≠ language 2 (e.g., English to French)
Input spoken word “dog” → nearest neighbor search → “chien”, “chiot”, ⋮
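The nearest neighbor search can be sketched as follows: map the spoken-word embedding with W, then rank the text embeddings by similarity to the mapped vector. This is a minimal sketch under stated assumptions. The 2-D vectors and the matrix `W` are invented toy values, cosine similarity is an assumed choice of metric, and the names `apply_map` and `nearest_neighbors` are hypothetical.

```python
import math


def apply_map(W, x):
    """Compute the matrix-vector product W x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(W, spoken_vec, text_embeddings, k=2):
    """Return the k text words closest to the mapped spoken-word vector."""
    mapped = apply_map(W, spoken_vec)
    ranked = sorted(
        text_embeddings,
        key=lambda word: cosine(mapped, text_embeddings[word]),
        reverse=True,
    )
    return ranked[:k]


# Toy values, invented for illustration only.
W = [[0.0, -1.0],
     [1.0, 0.0]]  # stands in for a learned mapping
text_embeddings = {
    "chien": [0.0, 1.0],
    "chiot": [-0.3, 0.9],
    "noir": [1.0, 0.1],
}
spoken_dog = [1.0, 0.0]  # stands in for the embedding of spoken "dog"

result = nearest_neighbors(W, spoken_dog, text_embeddings, k=2)
print(result)
```

The same routine covers both usages: only the language of the text embeddings changes between recognition and translation.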

An interesting property of our approach: synonym retrieval → The list of nearest neighbors actually contains both synonyms and different lexical forms of the input spoken word.

Foundation for Unsupervised Automatic Speech Recognition

Foundation for Unsupervised Speech-to-Text Translation

Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces

10:45 AM – 12:45 PM, Room 210 & 230 AB #156
