Multimodal Deep Learning
Transcript of Multimodal Deep Learning
Multimodal Deep Learning
Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee & Andrew Ng
Stanford University
McGurk Effect
Audio-Visual Speech Recognition
Feature Challenge
Classifier (e.g. SVM)
Representing Lips
• Can we learn better representations for audio/visual speech recognition?
• How can multimodal data (multiple sources of input) be used to find better features?
Unsupervised Feature Learning
[Diagram: raw input values mapped to a learned feature vector]
Multimodal Features
[Diagram: audio and video inputs combined into a single multimodal feature vector]
Cross-Modality Feature Learning
Feature Learning Models
Feature Learning with Autoencoders
[Diagram: separate autoencoders for audio and video, each reconstructing its own input]
Bimodal Autoencoder
[Diagram: audio and video inputs → shared hidden representation → audio and video reconstructions]
Shallow Learning
[Diagram: hidden units over video and audio inputs]
• Mostly unimodal features learned
Bimodal Autoencoder
[Diagram: video input only → hidden representation → audio and video reconstructions]
Cross-modality Learning: Learn better video features by using audio as a cue
Cross-modality Deep Autoencoder
[Diagram: video input → deep layers → learned representation → audio and video reconstructions]
Cross-modality Deep Autoencoder
[Diagram: audio input → deep layers → learned representation → audio and video reconstructions]
Bimodal Deep Autoencoders
[Diagram: audio input (“phonemes”) and video input (“visemes”, mouth shapes) → shared representation → audio and video reconstructions]
Training Bimodal Deep Autoencoder
[Diagram: three training configurations through the same shared representation — audio + video input, audio-only input, and video-only input — each reconstructing both modalities]
• Train a single model to perform all 3 tasks
• Similar in spirit to denoising autoencoders (Vincent et al., 2008)
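The three-task scheme can be sketched as a toy numpy example. The single sigmoid hidden layer, the dimensions, and the learning rate below are illustrative assumptions only; the actual models are deep and the slides do not specify these details:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative, not the paper's): audio, video, hidden.
da, dv, dh = 8, 8, 6
W  = rng.normal(scale=0.1, size=(da + dv, dh))  # shared encoder
Wa = rng.normal(scale=0.1, size=(dh, da))       # audio decoder
Wv = rng.normal(scale=0.1, size=(dh, dv))       # video decoder

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def step(a, v, mask_audio, mask_video, lr=0.1):
    """One gradient step: encode a possibly-masked input, but always
    reconstruct BOTH clean modalities (denoising-style)."""
    global W, Wa, Wv
    x = np.concatenate([np.zeros(da) if mask_audio else a,
                        np.zeros(dv) if mask_video else v])
    h = sigmoid(x @ W)
    ea, ev = h @ Wa - a, h @ Wv - v              # reconstruction errors
    g = (ea @ Wa.T + ev @ Wv.T) * h * (1 - h)    # backprop through sigmoid
    Wa -= lr * np.outer(h, ea)
    Wv -= lr * np.outer(h, ev)
    W  -= lr * np.outer(x, g)
    return 0.5 * (ea @ ea + ev @ ev)

# The three tasks: both inputs, audio-only input, video-only input.
a, v = rng.normal(size=da), rng.normal(size=dv)
history = []
for _ in range(200):
    history.append(sum(step(a, v, ma, mv)
                       for ma, mv in [(False, False), (True, False), (False, True)]))
```

Because the targets are always the clean audio and video, the shared layer is pushed to fill in the masked modality from the one that is present.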
Evaluations
Visualizations of Learned Features
[Figure: learned features shown at 0 ms, 33 ms, 67 ms, 100 ms]
Audio (spectrogram) and video features learned over 100 ms windows
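For the audio side, 100 ms windows like these can be formed by stacking short log-magnitude FFT frames. The frame length, sample rate, and stacking below are assumptions for illustration, not the paper's exact preprocessing:

```python
import numpy as np

def spectrogram_windows(signal, sr=16000, frame_ms=10, window_ms=100):
    """Split audio into short frames, take log-magnitude spectra,
    and stack consecutive frames into overlapping feature windows."""
    frame = int(sr * frame_ms / 1000)            # samples per frame
    n_frames = len(signal) // frame
    frames = signal[:n_frames * frame].reshape(n_frames, frame)
    spec = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-8)
    per_win = window_ms // frame_ms              # frames per window
    n_win = n_frames - per_win + 1
    return np.stack([spec[i:i + per_win].ravel() for i in range(n_win)])
```

Each output row is one 100 ms window, ready to be fed to the feature-learning model; the video stream would be windowed over the same time span.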
Lip-reading with AVLetters
• AVLetters:
  – 26-way letter classification
  – 10 speakers
  – 60×80-pixel lip regions
• Cross-modality learning
[Diagram: video input → learned representation → audio and video reconstructions]
| Feature Learning | Supervised Learning | Testing |
| --- | --- | --- |
| Audio + Video | Video | Video |
Lip-reading with AVLetters
| Feature Representation | Classification Accuracy |
| --- | --- |
| Multiscale Spatial Analysis (Matthews et al., 2002) | 44.6% |
| Local Binary Pattern (Zhao & Barnard, 2009) | 58.5% |
| Video-Only Learning (Single Modality Learning) | 54.2% |
| Our Features (Cross-Modality Learning) | 64.4% |
Lip-reading with CUAVE
• CUAVE:
  – 10-way digit classification
  – 36 speakers
• Cross-modality learning
[Diagram: video input → learned representation → audio and video reconstructions]
| Feature Learning | Supervised Learning | Testing |
| --- | --- | --- |
| Audio + Video | Video | Video |
Lip-reading with CUAVE
| Feature Representation | Classification Accuracy |
| --- | --- |
| Baseline Preprocessed Video | 58.5% |
| Video-Only Learning (Single Modality Learning) | 65.4% |
| Our Features (Cross-Modality Learning) | 68.7% |
| Discrete Cosine Transform (Gurban & Thiran, 2009) | 64.0% |
| Visemic AAM (Papandreou et al., 2009) | 83.0% |
Multimodal Recognition
• CUAVE:
  – 10-way digit classification
  – 36 speakers
• Evaluate in clean and noisy audio scenarios
  – In the clean audio scenario, audio alone performs extremely well
| Feature Learning | Supervised Learning | Testing |
| --- | --- | --- |
| Audio + Video | Audio + Video | Audio + Video |
[Diagram: audio and video inputs → shared representation → audio and video reconstructions]
Multimodal Recognition
| Feature Representation | Classification Accuracy (Noisy Audio at 0 dB SNR) |
| --- | --- |
| Audio Features (RBM) | 75.8% |
| Our Best Video Features | 68.7% |
| Bimodal Deep Autoencoder | 77.3% |
| Bimodal Deep Autoencoder + Audio Features (RBM) | 82.2% |
Shared Representation Evaluation
[Diagram: a linear classifier is trained on the shared representation computed from audio, then tested on the shared representation computed from video]
• Method: Learned Features + Canonical Correlation Analysis
| Feature Learning | Supervised Learning | Testing | Accuracy |
| --- | --- | --- | --- |
| Audio + Video | Audio | Video | 57.3% |
| Audio + Video | Video | Audio | 91.7% |
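The CCA step can be sketched with plain numpy. The whitening-plus-SVD formulation and the small ridge term are standard choices assumed here, not details taken from the slides:

```python
import numpy as np

def cca_projections(X, Y, k=2, reg=1e-6):
    """Canonical Correlation Analysis: find projections of two views
    X (n, dx) and Y (n, dy) whose images are maximally correlated."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])  # ridge for stability
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):
        # Inverse matrix square root via eigendecomposition.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    A = Wx @ U[:, :k]        # projection for the first view
    B = Wy @ Vt.T[:, :k]     # projection for the second view
    return A, B, s[:k]       # s[:k] are the canonical correlations
```

Here one view would hold audio-derived features and the other video-derived features, so a classifier trained in the projected space can be applied across modalities.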
McGurk Effect
A visual /ga/ combined with an audio /ba/ is often perceived as /da/.
| Audio Input | Video Input | Model predicts /ga/ | Model predicts /ba/ | Model predicts /da/ |
| --- | --- | --- | --- | --- |
| /ga/ | /ga/ | 82.6% | 2.2% | 15.2% |
| /ba/ | /ba/ | 4.4% | 89.1% | 6.5% |
| /ga/ | /ba/ | 28.3% | 13.0% | 58.7% |
Conclusion
• Applied deep autoencoders to discover features in multimodal data
• Cross-modality Learning: We obtained better video features (for lip-reading) using audio as a cue
• Multimodal Feature Learning: Learn representations that relate across audio and video data
[Diagrams: cross-modality deep autoencoder (video input) and bimodal deep autoencoder (audio + video inputs, shared representation)]
Bimodal Learning with RBMs
[Diagram: RBM with a layer of hidden units over concatenated audio and video inputs]