Multimodal Deep Learning
Transcript of Multimodal Deep Learning
Multimodal Deep Learning
Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee & Andrew Ng
Stanford University
McGurk Effect
Audio-Visual Speech Recognition
Feature Challenge
Classifier (e.g. SVM)
Representing Lips
• Can we learn better representations for audio/visual speech recognition?
• How can multimodal data (multiple sources of input) be used to find better features?
Unsupervised Feature Learning
[Diagram: raw input values mapped to a learned feature vector]
Multimodal Features
[Diagram: audio and video inputs combined into a single multimodal feature vector]
Cross-Modality Feature Learning
Feature Learning Models
Feature Learning with Autoencoders
[Diagram: separate autoencoders for audio and video, each reconstructing its own input]
Bimodal Autoencoder
[Diagram: audio and video inputs → shared hidden representation → audio and video reconstructions]
Shallow Learning
[Diagram: hidden units over video and audio inputs]
• Mostly unimodal features learned
Bimodal Autoencoder
[Diagram: video input only → hidden representation → audio and video reconstructions]
Cross-modality Learning: Learn better video features by using audio as a cue
Cross-modality Deep Autoencoder
[Diagram: video input → deep layers → learned representation → audio and video reconstructions]
Cross-modality Deep Autoencoder
[Diagram: audio input → deep layers → learned representation → audio and video reconstructions]
Bimodal Deep Autoencoders
[Diagram: audio input (“phonemes”) and video input (“visemes”, mouth shapes) → shared representation → audio and video reconstructions]
Training Bimodal Deep Autoencoder
[Diagram: three training configurations through the same shared representation — audio + video input, audio-only input, and video-only input — each reconstructing both modalities]
• Train a single model to perform all 3 tasks
• Similar in spirit to denoising autoencoders (Vincent et al., 2008)
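The three-task scheme can be sketched as a toy numpy example. The single sigmoid hidden layer, the dimensions, and the learning rate below are illustrative assumptions only; the actual models are deep and the slides do not specify these details:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative, not the paper's): audio, video, hidden.
da, dv, dh = 8, 8, 6
W  = rng.normal(scale=0.1, size=(da + dv, dh))  # shared encoder
Wa = rng.normal(scale=0.1, size=(dh, da))       # audio decoder
Wv = rng.normal(scale=0.1, size=(dh, dv))       # video decoder

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def step(a, v, mask_audio, mask_video, lr=0.1):
    """One gradient step: encode a possibly-masked input, but always
    reconstruct BOTH clean modalities (denoising-style)."""
    global W, Wa, Wv
    x = np.concatenate([np.zeros(da) if mask_audio else a,
                        np.zeros(dv) if mask_video else v])
    h = sigmoid(x @ W)
    ea, ev = h @ Wa - a, h @ Wv - v              # reconstruction errors
    g = (ea @ Wa.T + ev @ Wv.T) * h * (1 - h)    # backprop through sigmoid
    Wa -= lr * np.outer(h, ea)
    Wv -= lr * np.outer(h, ev)
    W  -= lr * np.outer(x, g)
    return 0.5 * (ea @ ea + ev @ ev)

# The three tasks: both inputs, audio-only input, video-only input.
a, v = rng.normal(size=da), rng.normal(size=dv)
history = []
for _ in range(200):
    history.append(sum(step(a, v, ma, mv)
                       for ma, mv in [(False, False), (True, False), (False, True)]))
```

Because the targets are always the clean audio and video, the shared layer is pushed to fill in the masked modality from the one that is present.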
Evaluations
Visualizations of Learned Features
[Figure: learned features shown at 0 ms, 33 ms, 67 ms, 100 ms]
Audio (spectrogram) and video features learned over 100 ms windows
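For the audio side, 100 ms windows like these can be formed by stacking short log-magnitude FFT frames. The frame length, sample rate, and stacking below are assumptions for illustration, not the paper's exact preprocessing:

```python
import numpy as np

def spectrogram_windows(signal, sr=16000, frame_ms=10, window_ms=100):
    """Split audio into short frames, take log-magnitude spectra,
    and stack consecutive frames into overlapping feature windows."""
    frame = int(sr * frame_ms / 1000)            # samples per frame
    n_frames = len(signal) // frame
    frames = signal[:n_frames * frame].reshape(n_frames, frame)
    spec = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-8)
    per_win = window_ms // frame_ms              # frames per window
    n_win = n_frames - per_win + 1
    return np.stack([spec[i:i + per_win].ravel() for i in range(n_win)])
```

Each output row is one 100 ms window, ready to be fed to the feature-learning model; the video stream would be windowed over the same time span.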
Lip-reading with AVLetters
• AVLetters:
  – 26-way letter classification
  – 10 speakers
  – 60×80-pixel lip regions
• Cross-modality learning
[Diagram: video input → learned representation → audio and video reconstructions]
| Feature Learning | Supervised Learning | Testing |
| --- | --- | --- |
| Audio + Video | Video | Video |
Lip-reading with AVLetters
| Feature Representation | Classification Accuracy |
| --- | --- |
| Multiscale Spatial Analysis (Matthews et al., 2002) | 44.6% |
| Local Binary Pattern (Zhao & Barnard, 2009) | 58.5% |
| Video-Only Learning (Single Modality Learning) | 54.2% |
| Our Features (Cross-Modality Learning) | 64.4% |
Lip-reading with CUAVE
• CUAVE:
  – 10-way digit classification
  – 36 speakers
• Cross-modality learning
[Diagram: video input → learned representation → audio and video reconstructions]
| Feature Learning | Supervised Learning | Testing |
| --- | --- | --- |
| Audio + Video | Video | Video |
Lip-reading with CUAVE
| Feature Representation | Classification Accuracy |
| --- | --- |
| Baseline Preprocessed Video | 58.5% |
| Video-Only Learning (Single Modality Learning) | 65.4% |
| Our Features (Cross-Modality Learning) | 68.7% |
| Discrete Cosine Transform (Gurban & Thiran, 2009) | 64.0% |
| Visemic AAM (Papandreou et al., 2009) | 83.0% |
Multimodal Recognition
• CUAVE:
  – 10-way digit classification
  – 36 speakers
• Evaluate in clean and noisy audio scenarios
  – In the clean audio scenario, audio alone performs extremely well
| Feature Learning | Supervised Learning | Testing |
| --- | --- | --- |
| Audio + Video | Audio + Video | Audio + Video |
[Diagram: audio and video inputs → shared representation → audio and video reconstructions]
Multimodal Recognition
| Feature Representation | Classification Accuracy (Noisy Audio at 0 dB SNR) |
| --- | --- |
| Audio Features (RBM) | 75.8% |
| Our Best Video Features | 68.7% |
| Bimodal Deep Autoencoder | 77.3% |
| Bimodal Deep Autoencoder + Audio Features (RBM) | 82.2% |
Shared Representation Evaluation
[Diagram: a linear classifier is trained on the shared representation computed from audio, then tested on the shared representation computed from video]
• Method: Learned Features + Canonical Correlation Analysis
| Feature Learning | Supervised Learning | Testing | Accuracy |
| --- | --- | --- | --- |
| Audio + Video | Audio | Video | 57.3% |
| Audio + Video | Video | Audio | 91.7% |
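The CCA step can be sketched with plain numpy. The whitening-plus-SVD formulation and the small ridge term are standard choices assumed here, not details taken from the slides:

```python
import numpy as np

def cca_projections(X, Y, k=2, reg=1e-6):
    """Canonical Correlation Analysis: find projections of two views
    X (n, dx) and Y (n, dy) whose images are maximally correlated."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])  # ridge for stability
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):
        # Inverse matrix square root via eigendecomposition.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    A = Wx @ U[:, :k]        # projection for the first view
    B = Wy @ Vt.T[:, :k]     # projection for the second view
    return A, B, s[:k]       # s[:k] are the canonical correlations
```

Here one view would hold audio-derived features and the other video-derived features, so a classifier trained in the projected space can be applied across modalities.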
McGurk Effect
A visual /ga/ combined with an audio /ba/ is often perceived as /da/.
| Audio Input | Video Input | Model predicts /ga/ | Model predicts /ba/ | Model predicts /da/ |
| --- | --- | --- | --- | --- |
| /ga/ | /ga/ | 82.6% | 2.2% | 15.2% |
| /ba/ | /ba/ | 4.4% | 89.1% | 6.5% |
| /ga/ | /ba/ | 28.3% | 13.0% | 58.7% |
Conclusion
• Applied deep autoencoders to discover features in multimodal data
• Cross-modality Learning: We obtained better video features (for lip-reading) using audio as a cue
• Multimodal Feature Learning: Learn representations that relate across audio and video data
[Diagrams: cross-modality deep autoencoder (video input) and bimodal deep autoencoder (audio + video inputs, shared representation)]
Bimodal Learning with RBMs
[Diagram: RBM with a layer of hidden units over concatenated audio and video inputs]