Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
Transcript of Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)
Day 4 Lecture 4
Multimodal Deep Learning
Xavier Giró-i-Nieto
[course site]
2
Multimedia
Text
Audio
Vision
… and ratings, geolocation, time stamps
3
Multimedia
Text
Audio
Vision
… and ratings, geolocation, time stamps
4
5
Language & Vision: Encoder-Decoder
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
Representation or Embedding
Lectures D3L4 & D4L2 by Marta Ruiz on Neural Machine Translation
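The encoder-decoder idea above can be sketched in a few lines of NumPy: an encoder squeezes a variable-length input (word embeddings, or CNN features for an image) into one fixed vector, and a decoder unrolls from it. All dimensions and weights here are toy stand-ins, not the models from the cited lectures.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, vocab = 8, 16, 10   # toy dimensions, chosen arbitrarily

W_enc = rng.standard_normal((d_in, d_h))
W_h = rng.standard_normal((d_h, d_h))
W_out = rng.standard_normal((d_h, vocab))

def encode(inputs):
    # Compress a variable-length sequence into one fixed-size
    # representation z (the "Representation or Embedding" box).
    return np.tanh(inputs.mean(axis=0) @ W_enc)

def decode(z, steps=3):
    # Unroll a minimal RNN from z, emitting one token id per step.
    h, out = z, []
    for _ in range(steps):
        h = np.tanh(h @ W_h)
        out.append(int(np.argmax(h @ W_out)))
    return out

tokens = decode(encode(rng.standard_normal((5, d_in))))
```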
6
Language & Vision: Encoder-Decoder
7
Captioning: DeepImageSent
(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.
8
Captioning: DeepImageSent
Multimodal Recurrent Neural Network: takes image features into account only in the first hidden state.
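A minimal sketch of that design choice, with made-up sizes: the CNN feature initializes the hidden state once, and every later step receives only word embeddings.

```python
import numpy as np

rng = np.random.default_rng(1)
d_img, d_emb, d_h = 12, 6, 8          # hypothetical sizes
W_img = rng.standard_normal((d_img, d_h))
W_x = rng.standard_normal((d_emb, d_h))
W_h = rng.standard_normal((d_h, d_h))

def caption_states(cnn_feat, word_embs):
    # The image conditions ONLY the initial hidden state; later
    # steps see word inputs alone.
    h = np.tanh(cnn_feat @ W_img)      # t = 0: image injected once
    states = [h]
    for x in word_embs:                # t >= 1: words only
        h = np.tanh(x @ W_x + h @ W_h)
        states.append(h)
    return states

states = caption_states(rng.standard_normal(d_img),
                        rng.standard_normal((4, d_emb)))
```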
9
Captioning: Show & Tell
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.
10
Captioning: Show & Tell
11
Captioning: Show, Attend & Tell
Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015.
12
Captioning: Show, Attend & Tell
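The soft attention of Show, Attend & Tell can be sketched as follows (a NumPy toy with assumed layer sizes, not the authors' code): each of the 14×14 spatial features is scored against the decoder state, the scores are softmaxed into weights, and the context vector is the weighted sum.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D, H, A = 196, 512, 64, 32          # 14x14 grid of 512-d features

W_f = rng.standard_normal((D, A)) * 0.05
W_h = rng.standard_normal((H, A)) * 0.05
v = rng.standard_normal(A)

def soft_attention(features, h):
    # Score every spatial location against the decoder state h,
    # softmax the scores, and return the weighted-sum context.
    scores = np.tanh(features @ W_f + h @ W_h) @ v     # (L,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                               # attention map
    return alpha @ features, alpha                     # context: (D,)

ctx, alpha = soft_attention(rng.standard_normal((L, D)),
                            rng.standard_normal(H))
```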
13
Captioning: LSTM with image & video
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015. [code]
14
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.
Captioning (+ Detection): DenseCap
15
Captioning (+ Detection): DenseCap
16
Captioning (+ Detection): DenseCap
XAVI: "man has short hair", "man with short hair"
AMAIA: "a woman wearing a black shirt"
BOTH: "two men wearing black glasses"
17
Captioning (+ Retrieval): DenseCap
18
Captioning: HRNE
(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, and Yueting Zhuang. "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016.
(Figure: a first-layer LSTM runs over the image frames from t = 1 to t = T; the hidden state at t = T of each chunk of data feeds an LSTM unit in the 2nd layer.)
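A rough NumPy sketch of the hierarchical encoder, with a plain tanh cell standing in for the LSTM and arbitrary sizes: the first layer encodes each short chunk of frames, and the second layer encodes the sequence of chunk-final hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # frame-feature and state size
W1 = rng.standard_normal((2 * d, d)) * 0.3
W2 = rng.standard_normal((2 * d, d)) * 0.3

def rnn(seq, W):
    # Plain recurrent cell standing in for an LSTM; returns the
    # hidden state at t = T.
    s = np.zeros(d)
    for x in seq:
        s = np.tanh(np.concatenate([x, s]) @ W)
    return s

def hrne_encode(video, chunk=4):
    # Layer 1: encode each short chunk of frames independently.
    # Layer 2: encode the sequence of chunk-final hidden states.
    chunk_states = [rnn(video[i:i + chunk], W1)
                    for i in range(0, len(video), chunk)]
    return rnn(chunk_states, W2)

z = hrne_encode(rng.standard_normal((16, d)))   # 16 frames -> 4 chunks
```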
19
Visual Question Answering
[z1, z2, …, zN] [y1, y2, …, yM]
"Is economic growth decreasing?"
"Yes"
Encode / Encode
Decode
20
Extract visual features
Embedding
Merge → Predict answer
Question: "What object is flying?"
Answer: "Kite"
Visual Question Answering
Slide credit: Issey Masuda
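The pipeline on this slide (extract visual features, embed the question, merge, predict an answer) can be sketched in toy NumPy with assumed sizes; the elementwise-product merge used here is one common choice, not necessarily the one in the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_q, d_joint, n_ans = 512, 300, 128, 10   # assumed sizes
W_i = rng.standard_normal((d_img, d_joint)) * 0.05
W_q = rng.standard_normal((d_q, d_joint)) * 0.05
W_a = rng.standard_normal((d_joint, n_ans)) * 0.05

def vqa_predict(img_feat, q_emb):
    # Merge by elementwise product of the projected modalities
    # (concatenation also works), then classify over a fixed
    # answer vocabulary.
    joint = np.tanh(img_feat @ W_i) * np.tanh(q_emb @ W_q)
    logits = joint @ W_a
    p = np.exp(logits - logits.max())
    return p / p.sum()

p = vqa_predict(rng.standard_normal(d_img), rng.standard_normal(d_q))
```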
21
Visual Question Answering
Noh, H., Seo, P. H., and Han, B. "Image question answering using convolutional neural network with dynamic parameter prediction." CVPR 2016.
Dynamic Parameter Prediction Network (DPPnet)
22
Visual Question Answering: Dynamic
(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." arXiv preprint arXiv:1603.01417 (2016).
23
Visual Question Answering: Dynamic
(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016.
Main idea: split the image into local regions and consider each region equivalent to a sentence.
Local Region Feature Extraction: CNN (VGG-19). (1) Rescale the input to 448×448. (2) Take the output of the last pooling layer → D = 512×14×14 → 196 local region vectors of 512 dimensions.
Visual feature embedding: a matrix W projects the image features to the "q" textual space.
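Those numbers can be checked with a small NumPy sketch (random values stand in for the real VGG-19 activations, and the 300-d text size is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

# Last VGG-19 pooling output for a 448x448 input: 512 maps of 14x14
# (stubbed with random values here).
pool5 = rng.standard_normal((512, 14, 14))

# Each of the 14*14 = 196 grid positions becomes one 512-d local
# region vector, which the DMN treats like a sentence.
regions = pool5.reshape(512, 196).T            # (196, 512)

# A learned matrix W projects the visual vectors into the textual
# embedding space of the question (300 is an assumed text size).
W = rng.standard_normal((512, 300)) * 0.05
regions_txt = np.tanh(regions @ W)             # (196, 300)
```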
24
Visual Question Answering: Grounded
(Slides and Screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016.
25
Challenges: Visual Question Answering
26
VQA challenge accuracy (%, upper bound 100.0):
Humans: 83.30
UC Berkeley & Sony: 66.47
Baseline LSTM & CNN: 54.06
I. Masuda-Mora, "Open-Ended Visual Question-Answering." BSc thesis, ETSETB, 2016 [Keras]: 53.62
Baseline Nearest neighbor: 42.85
Baseline Prior per question type: 37.47
Baseline All "yes": 29.88
Challenges: Visual Question Answering
27
Vision and language: Embeddings
Perronnin, F. CVPR Tutorial on LSVR, CVPR'14: output embedding for LSVR.
28
Vision and language: Embeddings
A. Krizhevsky, I. Sutskever, and G. E. Hinton. "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems 25 (NIPS 2012).
29
Vision and language: Embeddings. Lecture D2L4 by Toni Bonafonte on Word Embeddings.
Christopher Olah, "Visualizing Representations"
30
Vision and language: DeViSE
One-hot encoding vs. embedding space
31
Vision and language: DeViSE
32
Vision and language: DeViSE
Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. "DeViSE: A deep visual-semantic embedding model." NIPS 2013.
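A toy sketch of the DeViSE inference step, with a random projection and made-up labels in place of the trained model: project the image into the word-embedding space, then rank labels by cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_word = 512, 300
M = rng.standard_normal((d_img, d_word)) * 0.05   # learned linear map

# Hypothetical word-embedding table for three labels.
labels = ["dog", "cat", "car"]
E = rng.standard_normal((3, d_word))

def devise_rank(img_feat):
    # Map the image into the word-embedding space, then rank all
    # labels by cosine similarity to the projected vector.
    v = img_feat @ M
    sims = (E @ v) / (np.linalg.norm(E, axis=1) * np.linalg.norm(v))
    return [labels[i] for i in np.argsort(-sims)]

ranking = devise_rank(rng.standard_normal(d_img))
```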
33
Vision and language: Embedding
Víctor Campos, Dèlia Fernàndez, Jordi Torres, Xavier Giró-i-Nieto, Brendan Jou, and Shih-Fu Chang (work in progress).
34
Learn more: Julia Hockenmaier (UIUC), Vision to Language (Microsoft Research)
35
Multimedia
Text
Audio
Vision
… and ratings, geolocation, time stamps
36
Audio and Video: SoundNet
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016, pp. 892-900.
Object & scene recognition in videos by analysing the audio track (only).
37
Videos for training are unlabeled. Relies on CNNs trained on labeled images.
Audio and Video: SoundNet
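SoundNet's transfer setup can be sketched as teacher-student distillation; random posteriors stand in for the real networks here. The vision teacher, pretrained on labeled images, produces a class posterior for a video frame, and the audio student is trained to match it on the synchronized sound track by minimizing KL divergence, so no audio labels are needed.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Teacher: a vision CNN pretrained on labeled images yields a class
# posterior for a video frame. Student: an audio network's posterior
# for the synchronized sound track. Both stubbed with random logits.
teacher_p = softmax(rng.standard_normal(1000))   # e.g. object classes
student_p = softmax(rng.standard_normal(1000))

# Transfer loss: KL divergence from teacher to student, computed on
# unlabeled videos -- no audio labels are ever needed.
kl = float(np.sum(teacher_p * np.log(teacher_p / student_p)))
```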
38
Videos for training are unlabeled. Relies on CNNs trained on labeled images.
Audio and Video: SoundNet
39
Audio and Video: SoundNet
40
Hidden layers of SoundNet are used to train a standard SVM classifier that outperforms the state of the art.
Audio and Video: SoundNet
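That evaluation protocol can be sketched with synthetic features standing in for real SoundNet activations; a tiny hinge-loss SVM trained by sub-gradient descent stands in for a full SVM package.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for frozen SoundNet hidden-layer activations
# of 200 labeled clips; real features would come from e.g. conv5.
X = rng.standard_normal((200, 64))
y = np.where(X[:, 0] + 0.1 * rng.standard_normal(200) > 0, 1.0, -1.0)

# Linear SVM fit by full-batch sub-gradient descent on hinge loss.
w, b, lam = np.zeros(64), 0.0, 1e-3
for _ in range(300):
    viol = y * (X @ w + b) < 1                 # margin violations
    w -= 0.5 * (lam * w - (y[viol, None] * X[viol]).sum(0) / len(X))
    b -= 0.5 * (-y[viol].sum() / len(X))

acc = float(np.mean(np.sign(X @ w + b) == y))  # training accuracy
```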
41
Visualization of the 1D filters over raw audio in conv1.
Audio and Video: SoundNet
42
Visualization of the 1D filters over raw audio in conv1.
Audio and Video: SoundNet
43
Visualization of the 1D filters over raw audio in conv1.
Audio and Video: SoundNet
44
Visualization of the video frames associated with the sounds that activate some of the last hidden units (conv7).
Audio and Video: SoundNet
45
Audio and Video: Sonorization
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016.
Learns to synthesize sounds from videos of people hitting objects with a drumstick.
46
Audio and Video: Visual Sounds
Not end-to-end.
47
Audio and Video: Visual Sounds
48
Ephrat, Ariel, and Shmuel Peleg. "Vid2speech: Speech Reconstruction from Silent Video." ICASSP 2017.
Speech and Video: Vid2Speech
CNN (VGG)
Frame from a silent video → Audio feature
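A toy stand-in for that mapping, with a two-layer perceptron instead of the VGG-style CNN and arbitrary sizes: the network regresses an acoustic feature vector from a single frame, a regression head rather than a classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-layer perceptron standing in for the VGG-style CNN: it maps
# one frame of a silent talking face to a vector of acoustic
# features (e.g. spectral coefficients).
W1 = rng.standard_normal((64 * 64, 128)) * 0.01
W2 = rng.standard_normal((128, 8)) * 0.1       # 8 assumed coefficients

def frame_to_audio_features(frame):
    h = np.maximum(0.0, frame.ravel() @ W1)    # ReLU hidden layer
    return h @ W2                              # linear regression head

frame = rng.standard_normal((64, 64))          # grayscale mouth crop
feat = frame_to_audio_features(frame)
```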
49
Speech and Video: Vid2Speech
Not end-to-end.
50
Speech and Video: Vid2Speech
51
Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: Sentence-level Lipreading." arXiv preprint arXiv:1611.01599 (2016).
Speech and Video: LipNet
52
Speech and Video: LipNet
53
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." arXiv preprint arXiv:1611.05358 (2016).
Speech and Video: Watch, Listen, Attend & Spell
54
Speech and Video: Watch, Listen, Attend & Spell
55
Conclusions
56
Conclusions
57
Conclusions
58
Learn more: Ruslan Salakhutdinov, "Multimodal Machine Learning" (NIPS 2015 Workshop)
59
Thanks! Q&A. Follow me at:
https://imatge.upc.edu/web/people/xavier-giro
DocXavi / ProfessorXavi
2
Multimedia
Text
Audio
Vision
and ratings geolocation time stamps
3
Multimedia
Text
Audio
Vision
and ratings geolocation time stamps
4
5
Language amp Vision Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
Representation or Embedding
Lectures D3L4 amp D4L2 by Marta Ruiz on Neural Machine Translation
6
Language amp Vision Encoder-Decoder
7
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
8
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into account image features in the first hidden state
Multimodal Recurrent Neural Network
9
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
10
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
11
Captioning Show Attend amp Tell
Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015
12
Captioning Show Attend amp Tell
Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015
13
Captioning LSTM with image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
14
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
15
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
16
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
17
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
18
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
19
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
20
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
21
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
22
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
23
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
24
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
25
Challenges Visual Question Answering
Visual Question Answering
26
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]
Challenges Visual Question Answering
27
Vision and language Embeddings
Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR
28
Vision and language Embeddings
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
29
Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings
Christopher Olah
Visualizing Representations
30
Vision and language Devise
One-hot encoding Embedding space
31
Vision and language Devise
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
32
Vision and language Devise
Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013
33
Vision and language Embedding
Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)
34
Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)
35
Multimedia
Text
Audio
Vision
and ratings geolocation time stamps
36
Audio and Video Soundnet
Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Object amp Scenes recognition in videos by analysing the audio track (only)
37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Audio and Video Soundnet
40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art
Audio and Video Soundnet
41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)
Audio and Video Soundnet
45
Audio and Video Sonorizaton
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
Learn synthesized sounds from videos of people hitting objects with a drumstick
46
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
No end-to-end
47
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
CNN(VGG)
Frame from a slient video
Audio feature
49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
No end-to-end
50
Speech and Video Vid2Speech
Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
55
Conclusions
56
Conclusions
57
Conclusions
[course site]
58
Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)
59
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
3
Multimedia
Text
Audio
Vision
and ratings geolocation time stamps
4
5
Language amp Vision Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
Representation or Embedding
Lectures D3L4 amp D4L2 by Marta Ruiz on Neural Machine Translation
6
Language amp Vision Encoder-Decoder
7
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
8
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into account image features in the first hidden state
Multimodal Recurrent Neural Network
9
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
10
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
11
Captioning Show Attend amp Tell
Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015
12
Captioning Show Attend amp Tell
Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015
13
Captioning LSTM with image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
14
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
15
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
16
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
17
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
18
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
19
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
20
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
21
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
22
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
23
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
24
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
25
Challenges Visual Question Answering
Visual Question Answering
26
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]
Challenges Visual Question Answering
27
Vision and language Embeddings
Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR
28
Vision and language Embeddings
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
29
Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings
Christopher Olah
Visualizing Representations
30
Vision and language Devise
One-hot encoding Embedding space
31
Vision and language Devise
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
32
Vision and language Devise
Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013
33
Vision and language Embedding
Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)
34
Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)
35
Multimedia
Text
Audio
Vision
and ratings geolocation time stamps
36
Audio and Video Soundnet
Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Object amp Scenes recognition in videos by analysing the audio track (only)
37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Audio and Video Soundnet
40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art
Audio and Video Soundnet
41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)
Audio and Video Soundnet
45
Audio and Video Sonorizaton
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
Learn synthesized sounds from videos of people hitting objects with a drumstick
46
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
No end-to-end
47
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
CNN(VGG)
Frame from a slient video
Audio feature
49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
No end-to-end
50
Speech and Video Vid2Speech
Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
55
Conclusions
56
Conclusions
57
Conclusions
[course site]
58
Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)
59
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
7
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
8
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into account image features in the first hidden state
Multimodal Recurrent Neural Network
9
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
10
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
11
Captioning Show Attend amp Tell
Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015
12
Captioning Show Attend amp Tell
Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015
13
Captioning LSTM with image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
14
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
15
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
16
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
17
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
18
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
19
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
20
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
21
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
22
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
23
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
24
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
25
Challenges Visual Question Answering
Visual Question Answering
26
Accuracy on the VQA challenge (%):
Humans: 83.30
UC Berkeley & Sony: 66.47
Baseline LSTM&CNN: 54.06
Baseline Nearest neighbor: 42.85
Baseline Prior per question type: 37.47
Baseline All "yes": 29.88
Masuda-Mora's model: 53.62
I. Masuda-Mora. "Open-Ended Visual Question-Answering." BSc thesis, ETSETB 2016 [Keras]
Challenges: Visual Question Answering
27
Vision and language: Embeddings
Perronnin, F. "Output Embedding for LSVR." CVPR'14 Tutorial on LSVR
28
Vision and language: Embeddings
A. Krizhevsky, I. Sutskever, G. E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." NIPS 2012
29
Vision and language: Embeddings. Lecture D2L4 by Toni Bonafonte on Word Embeddings
Christopher Olah
Visualizing Representations
30
Vision and language: DeViSE
One-hot encoding vs. embedding space
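The contrast between the two representations fits in a few lines; the toy vocabulary and the random embedding table below are illustrative stand-ins for a learned word-embedding matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["cat", "dog", "truck"]

# One-hot encoding: a sparse vector with a single 1; all words are
# equidistant, so it carries no notion of semantic similarity.
one_hot = np.eye(len(vocab))[vocab.index("dog")]   # [0., 1., 0.]

# Embedding space: a dense vector looked up from a table E that is learned
# in practice (random here); similar words end up with nearby vectors.
E = rng.standard_normal((len(vocab), 4))
dense = one_hot @ E   # equals the row E[vocab.index("dog")]
```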
31
Vision and language: DeViSE
A. Krizhevsky, I. Sutskever, G. E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." NIPS 2012
32
Vision and language: DeViSE
Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. "DeViSE: A Deep Visual-Semantic Embedding Model." NIPS 2013
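DeViSE trains a projection of image features into the word-embedding space with a hinge rank loss: the projected image vector should be closer to the correct label's embedding than to any other label's, by a margin. A compact sketch in that spirit (the margin value and the 2-d vectors are made up for illustration):

```python
import numpy as np

def devise_rank_loss(v, t_true, t_others, margin=0.1):
    """Hinge rank loss in the spirit of DeViSE: the projected image vector v
    should score higher (dot product) with the correct label embedding
    t_true than with any other label embedding, by at least `margin`."""
    s_true = float(t_true @ v)
    return sum(max(0.0, margin - s_true + float(t @ v)) for t in t_others)

v = np.array([1.0, 0.0])      # projected image vector (illustrative)
t_cat = np.array([1.0, 0.0])  # embedding of the correct label
t_car = np.array([0.0, 1.0])  # embedding of a wrong label

loss = devise_rank_loss(v, t_cat, [t_car])   # 0.0: margin already satisfied
```

Minimizing this loss over many (image, label) pairs pulls visually similar images toward semantically related words, which is what lets DeViSE generalize to labels never seen during training.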
33
Vision and language: Embedding
Víctor Campos, Dèlia Fernàndez, Jordi Torres, Xavier Giró-i-Nieto, Brendan Jou, and Shih-Fu Chang (work in progress)
34
Learn more: Julia Hockenmaier (UIUC), "Vision to Language" (Microsoft Research)
35
Multimedia
Text
Audio
Vision
… and ratings, geolocation, time stamps
36
Audio and Video: SoundNet
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning Sound Representations from Unlabeled Video." NIPS 2016
Object & scene recognition in videos by analysing the audio track (only)
37
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning Sound Representations from Unlabeled Video." NIPS 2016
Videos for training are unlabeled. Relies on CNNs trained on labeled images.
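The transfer just described is a teacher-student setup: the audio network is trained so that its class posteriors on the sound track match those a pretrained vision CNN produces on the synchronized frames, typically via a KL-divergence loss. A minimal sketch (the probability values are made up):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete probability distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

# Teacher: posteriors from a vision CNN (e.g. trained on ImageNet/Places)
# applied to frames of an unlabeled video.
teacher = np.array([0.7, 0.2, 0.1])

# Student: posteriors from the audio CNN on the same video's sound track.
student = np.array([0.4, 0.4, 0.2])

loss = kl_divergence(teacher, student)   # driven toward 0 during training
```

No manual labels are needed: the vision network supplies the supervision, which is why SoundNet can train on large amounts of unlabeled video.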
Audio and Video: SoundNet
38
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning Sound Representations from Unlabeled Video." NIPS 2016
Videos for training are unlabeled. Relies on CNNs trained on labeled images.
Audio and Video: SoundNet
39
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning Sound Representations from Unlabeled Video." NIPS 2016
Audio and Video: SoundNet
40
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning Sound Representations from Unlabeled Video." NIPS 2016
Hidden layers of SoundNet are used to train a standard SVM classifier that outperforms the state of the art.
Audio and Video: SoundNet
41
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning Sound Representations from Unlabeled Video." NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video: SoundNet
42
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning Sound Representations from Unlabeled Video." NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video: SoundNet
43
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning Sound Representations from Unlabeled Video." NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video: SoundNet
44
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning Sound Representations from Unlabeled Video." NIPS 2016
Visualization of the video frames associated with the sounds that activate some of the last hidden units (conv7)
Audio and Video: SoundNet
45
Audio and Video: Sonorization
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually Indicated Sounds." CVPR 2016
Learns to synthesize sounds from videos of people hitting objects with a drumstick.
46
Audio and Video: Visual Sounds
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually Indicated Sounds." CVPR 2016
Not trained end-to-end
47
Audio and Video: Visual Sounds
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually Indicated Sounds." CVPR 2016
48
Ephrat, Ariel, and Shmuel Peleg. "Vid2Speech: Speech Reconstruction from Silent Video." ICASSP 2017
Speech and Video: Vid2Speech
[Figure: a CNN (VGG) maps a frame from a silent video to an audio feature]
49
Ephrat, Ariel, and Shmuel Peleg. "Vid2Speech: Speech Reconstruction from Silent Video." ICASSP 2017
Speech and Video: Vid2Speech
Not trained end-to-end
50
Speech and Video: Vid2Speech
Ephrat, Ariel, and Shmuel Peleg. "Vid2Speech: Speech Reconstruction from Silent Video." ICASSP 2017
51
Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: Sentence-level Lipreading." arXiv preprint arXiv:1611.01599 (2016)
Speech and Video: LipNet
52
Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: Sentence-level Lipreading." arXiv preprint arXiv:1611.01599 (2016)
Speech and Video: LipNet
53
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip Reading Sentences in the Wild." arXiv preprint arXiv:1611.05358 (2016)
Speech and Video: Watch, Listen, Attend & Spell
54
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip Reading Sentences in the Wild." arXiv preprint arXiv:1611.05358 (2016)
Speech and Video: Watch, Listen, Attend & Spell
55
Conclusions
56
Conclusions
57
Conclusions
58
Learn more: Ruslan Salakhutdinov, "Multimodal Machine Learning" (NIPS 2015 Workshop)
59
Thanks! Q&A. Follow me at:
https://imatge.upc.edu/web/people/xavier-giro
DocXavi / ProfessorXavi
8
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into account image features in the first hidden state
Multimodal Recurrent Neural Network
9
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
10
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
11
Captioning Show Attend amp Tell
Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015
12
Captioning Show Attend amp Tell
Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015
13
Captioning LSTM with image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
14
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
15
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
16
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
17
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
18
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
19
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
20
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
21
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
22
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
23
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
24
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
25
Challenges Visual Question Answering
Visual Question Answering
26
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]
Challenges Visual Question Answering
27
Vision and language Embeddings
Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR
28
Vision and language Embeddings
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
29
Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings
Christopher Olah
Visualizing Representations
30
Vision and language Devise
One-hot encoding Embedding space
31
Vision and language Devise
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
32
Vision and language Devise
Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013
33
Vision and language Embedding
Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)
34
Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)
35
Multimedia
Text
Audio
Vision
and ratings geolocation time stamps
36
Audio and Video Soundnet
Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Object amp Scenes recognition in videos by analysing the audio track (only)
37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Audio and Video Soundnet
40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art
Audio and Video Soundnet
41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)
Audio and Video Soundnet
45
Audio and Video Sonorizaton
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
Learn synthesized sounds from videos of people hitting objects with a drumstick
46
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
No end-to-end
47
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
CNN(VGG)
Frame from a slient video
Audio feature
49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
No end-to-end
50
Speech and Video Vid2Speech
Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
55
Conclusions
56
Conclusions
57
Conclusions
[course site]
58
Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)
59
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
9
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
10
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
11
Captioning Show Attend amp Tell
Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015
12
Captioning Show Attend amp Tell
Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015
13
Captioning LSTM with image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
14
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
15
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
16
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
17
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
18
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
19
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
20
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
21
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
22
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
23
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
24
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
25
Challenges Visual Question Answering
Visual Question Answering
26
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]
Challenges Visual Question Answering
27
Vision and language Embeddings
Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR
28
Vision and language Embeddings
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
29
Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings
Christopher Olah
Visualizing Representations
30
Vision and language Devise
One-hot encoding Embedding space
31
Vision and language Devise
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
32
Vision and language Devise
Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013
33
Vision and language Embedding
Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)
34
Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)
35
Multimedia
Text
Audio
Vision
and ratings geolocation time stamps
36
Audio and Video Soundnet
Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Object amp Scenes recognition in videos by analysing the audio track (only)
37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Audio and Video Soundnet
40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art
Audio and Video Soundnet
41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)
Audio and Video Soundnet
45
Audio and Video Sonorizaton
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
Learn synthesized sounds from videos of people hitting objects with a drumstick
46
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
No end-to-end
47
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
CNN(VGG)
Frame from a slient video
Audio feature
49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
No end-to-end
50
Speech and Video Vid2Speech
Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
55
Conclusions
56
Conclusions
57
Conclusions
[course site]
58
Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)
59
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
10
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
11
Captioning Show Attend amp Tell
Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015
12
Captioning Show Attend amp Tell
Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015
13
Captioning LSTM with image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
14
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
15
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
16
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
17
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
18
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
19
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
20
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
21
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
22
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
23
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
24
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
25
Challenges Visual Question Answering
Visual Question Answering
26
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]
Challenges Visual Question Answering
27
Vision and language Embeddings
Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR
28
Vision and language Embeddings
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
29
Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings
Christopher Olah
Visualizing Representations
30
Vision and language Devise
One-hot encoding Embedding space
31
Vision and language Devise
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
32
Vision and language Devise
Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013
33
Vision and language Embedding
Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)
34
Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)
35
Multimedia
Text
Audio
Vision
and ratings geolocation time stamps
36
Audio and Video Soundnet
Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Object amp Scenes recognition in videos by analysing the audio track (only)
37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Audio and Video Soundnet
40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art
Audio and Video Soundnet
41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)
Audio and Video Soundnet
45
Audio and Video Sonorizaton
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
Learn synthesized sounds from videos of people hitting objects with a drumstick
46
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
No end-to-end
47
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
CNN(VGG)
Frame from a slient video
Audio feature
49
Ephrat, Ariel, and Shmuel Peleg. "Vid2Speech: Speech reconstruction from silent video." ICASSP 2017.
Speech and Video: Vid2Speech
Not trained end-to-end.
50
Speech and Video: Vid2Speech
Ephrat, Ariel, and Shmuel Peleg. "Vid2Speech: Speech reconstruction from silent video." ICASSP 2017.
51
Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: Sentence-level lipreading." arXiv preprint arXiv:1611.01599 (2016).
Speech and Video: LipNet
52
Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: Sentence-level lipreading." arXiv preprint arXiv:1611.01599 (2016).
Speech and Video: LipNet
53
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." arXiv preprint arXiv:1611.05358 (2016).
Speech and Video: Watch, Listen, Attend & Spell
54
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." arXiv preprint arXiv:1611.05358 (2016).
Speech and Video: Watch, Listen, Attend & Spell
55
Conclusions
56
Conclusions
57
Conclusions
58
Learn more: Ruslan Salakhutdinov, "Multimodal Machine Learning" (NIPS 2015 Workshop)
59
Thanks! Q&A. Follow me at:
https://imatge.upc.edu/web/people/xavier-giro
@DocXavi / ProfessorXavi
Speech and Video Vid2Speech
CNN(VGG)
Frame from a slient video
Audio feature
49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
No end-to-end
50
Speech and Video Vid2Speech
Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
55
Conclusions
56
Conclusions
57
Conclusions
[course site]
58
Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)
59
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
15
Captioning (+ Detection) DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully Convolutional Localization Networks for Dense Captioning." CVPR 2016
16
Captioning (+ Detection) DenseCap
XAVI: "man has short hair", "man with short hair"
AMAIA: "a woman wearing a black shirt"
BOTH: "two men wearing black glasses"
17
Captioning (+ Retrieval) DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully Convolutional Localization Networks for Dense Captioning." CVPR 2016
18
Captioning HRNE
(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden state at t = T
first chunk of data
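The hierarchy sketched above — a first-layer LSTM that encodes each chunk of frames and a second layer that encodes the sequence of chunk summaries — can be illustrated in numpy. A plain tanh RNN cell stands in for the LSTM, and all sizes (12 frames, chunks of 4, 16-d hidden state) are illustrative, not the paper's settings:

```python
import numpy as np

def rnn_encode(W, U, xs):
    """Run a plain tanh RNN over a sequence; return the final hidden state."""
    h = np.zeros(U.shape[0])
    for x in xs:
        h = np.tanh(W @ x + U @ h)
    return h

rng = np.random.default_rng(0)
D, H = 8, 16                        # frame-feature and hidden sizes
frames = rng.normal(size=(12, D))   # 12 video frames, taken in chunks of 4

# First layer: encode each chunk of frames into one summary vector.
W1, U1 = rng.normal(size=(H, D)), rng.normal(size=(H, H))
chunk_codes = [rnn_encode(W1, U1, frames[i:i + 4]) for i in range(0, 12, 4)]

# Second layer: encode the sequence of chunk summaries into a video code.
W2, U2 = rng.normal(size=(H, H)), rng.normal(size=(H, H))
video_code = rnn_encode(W2, U2, chunk_codes)
```

The second layer sees only T/4 steps, which is the point of the hierarchy: it shortens the path gradients must travel across a long video.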
19
Visual Question Answering
[z1, z2, …, zN] [y1, y2, …, yM]
"Is economic growth decreasing?"
"Yes"
Encode / Encode
Decode
20
Extract visual features
Embedding
Merge
Predict answer
Question: What object is flying?
Answer: Kite
Visual Question Answering
Slide credit: Issey Masuda
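The pipeline on this slide (extract visual features, embed the question, merge the two, predict an answer) can be sketched with random stand-ins for the learned weights; every size, token id, and the bag-of-words question encoding here are illustrative, not the method of any particular paper:

```python
import numpy as np

rng = np.random.default_rng(1)
V, E, A = 512, 300, 1000       # visual-feature, embedding, answer-vocab sizes

img_feat = rng.normal(size=V)              # stand-in for CNN image features
q_tokens = [3, 17, 42]                     # token ids for the question words
emb = rng.normal(size=(100, E))            # word-embedding table (100 words)
q_feat = emb[q_tokens].mean(axis=0)        # bag-of-words question embedding

# Merge: project both modalities to a shared size, combine elementwise.
Wv, Wq = rng.normal(size=(E, V)), rng.normal(size=(E, E))
merged = np.tanh(Wv @ img_feat) * np.tanh(Wq @ q_feat)

# Predict: treat answering as classification over a fixed answer vocabulary.
Wa = rng.normal(size=(A, E))
answer = int(np.argmax(Wa @ merged))       # index into the answer vocabulary
```

Casting the decoder as a classifier over frequent answers, rather than a free-form text generator, is the common simplification in VQA systems.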
21
Visual Question Answering
Noh, H., Seo, P. H., and Han, B. "Image Question Answering Using Convolutional Neural Network with Dynamic Parameter Prediction." CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
22
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016
23
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016
Main idea: split the image into local regions and consider each region equivalent to a sentence.
Local region feature extraction: CNN (VGG-19). (1) Rescale the input to 448x448. (2) Take the output of the last pooling layer → D = 512x14x14 → 196 local region vectors of 512 dimensions.
Visual feature embedding: a matrix W projects the image features into the "q" textual space.
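The region extraction described above can be sketched directly; the 512x14x14 pooling output and the 196 region vectors follow the slide, while the 300-d textual space and the tanh nonlinearity are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
# Stand-in for the VGG-19 last pooling output on a 448x448 input: 512x14x14.
pool5 = rng.normal(size=(512, 14, 14))

# Flatten the 14x14 grid into 196 local-region vectors of 512 dims each:
# every spatial location becomes one "sentence-like" region descriptor.
regions = pool5.reshape(512, 196).T          # (196, 512)

# Project each region into the textual ("q") embedding space with W.
q_dim = 300
W = rng.normal(size=(512, q_dim))
regions_q = np.tanh(regions @ W)             # (196, 300)
```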
24
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016
25
Challenges Visual Question Answering
Visual Question Answering
26
Accuracy (%) on the VQA challenge (scale 0–100.0):
Humans: 83.30
UC Berkeley & Sony: 66.47
Baseline LSTM&CNN: 54.06
Baseline Nearest neighbor: 42.85
Baseline Prior per question type: 37.47
Baseline All yes: 29.88
Masuda-Mora: 53.62
I. Masuda-Mora, "Open-Ended Visual Question-Answering", BSc thesis, ETSETB 2016 [Keras]
Challenges Visual Question Answering
27
Vision and language Embeddings
Perronnin, F. "Output Embedding for LSVR." CVPR Tutorial on LSVR, CVPR'14
28
Vision and language Embeddings
A. Krizhevsky, I. Sutskever, G. E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." NIPS 2012
29
Vision and language Embeddings
Lecture D2L4 by Toni Bonafonte on Word Embeddings
Christopher Olah
Visualizing Representations
30
Vision and language DeViSE
One-hot encoding Embedding space
31
Vision and language DeViSE
A. Krizhevsky, I. Sutskever, G. E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." NIPS 2012
32
Vision and language DeViSE
Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. "DeViSE: A Deep Visual-Semantic Embedding Model." NIPS 2013
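DeViSE trains a linear projection from visual features into the word-embedding space with a ranking objective, so that an image scores higher against its own label's embedding than against wrong ones. A minimal sketch of such a hinge ranking loss; the margin, sizes, and single negative label are illustrative (the paper sums over many negatives):

```python
import numpy as np

def devise_rank_loss(M, v_img, t_correct, t_wrong, margin=0.1):
    """Hinge ranking loss: the image projected by M should score higher
    against the correct label embedding than against a wrong one."""
    s_pos = t_correct @ (M @ v_img)
    s_neg = t_wrong @ (M @ v_img)
    return max(0.0, margin - s_pos + s_neg)

rng = np.random.default_rng(3)
D_img, D_txt = 512, 300
M = rng.normal(scale=0.01, size=(D_txt, D_img))   # visual-to-text projection
v = rng.normal(size=D_img)                        # CNN feature of an image
t_pos = rng.normal(size=D_txt)                    # word2vec-style label vector
t_neg = rng.normal(size=D_txt)                    # embedding of a wrong label
loss = devise_rank_loss(M, v, t_pos, t_neg)
```

Because labels live in a semantic embedding space rather than one-hot coordinates, the model can rank labels it never saw during training (zero-shot recognition).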
33
Vision and language Embedding
Víctor Campos, Dèlia Fernàndez, Jordi Torres, Xavier Giró-i-Nieto, Brendan Jou, and Shih-Fu Chang (work in progress)
34
Learn more: Julia Hockenmaier (UIUC), Vision to Language (Microsoft Research)
35
Multimedia
Text
Audio
Vision
and ratings geolocation time stamps
36
Audio and Video Soundnet
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning Sound Representations from Unlabeled Video." NIPS 2016
Object & scene recognition in videos by analysing the audio track (only)
37
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
38
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
39
Audio and Video Soundnet
40
Hidden layers of SoundNet are used to train a standard SVM classifier that outperforms the state of the art
Audio and Video Soundnet
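The teacher-student training behind these slides — vision CNNs run on the video frames provide soft labels that supervise an audio CNN on the raw waveform, so no manual labels are needed — comes down to minimizing a KL divergence between the two output distributions. A minimal numpy sketch; the 10-class teacher and random logits are stand-ins for the real networks:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl_divergence(p, q):
    """KL(p || q) for two strictly positive probability vectors."""
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(4)
# Teacher: soft class distribution from a vision CNN on the video frames.
teacher = softmax(rng.normal(size=10))
# Student: logits from the audio CNN applied to the clip's raw waveform.
student_logits = rng.normal(size=10)
# Training minimizes this loss over millions of unlabeled clips.
loss = kl_divergence(teacher, softmax(student_logits))
```

Once trained this way, the audio network's hidden activations serve as the sound representation fed to the SVM mentioned above.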
41
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
42
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
43
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
44
Visualization of the video frames associated with the sounds that activate some of the last hidden units (conv7)
Audio and Video Soundnet
45
Audio and Video Sonorization
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually Indicated Sounds." CVPR 2016
Learns to synthesize sounds from videos of people hitting objects with a drumstick
46
Audio and Video Visual Sounds
Not end-to-end
47
Audio and Video Visual Sounds
48
Ephrat, Ariel, and Shmuel Peleg. "Vid2Speech: Speech Reconstruction from Silent Video." ICASSP 2017
Speech and Video Vid2Speech
CNN(VGG)
Frame from a silent video
Audio feature
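The Vid2Speech pipeline above maps a frame of silent video through a CNN to an audio-feature vector. A minimal numpy stand-in — one conv + pool + linear layer instead of the VGG-style network, a 64x64 crop, and a 9-coefficient output, all illustrative sizes rather than the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(5)
frame = rng.normal(size=(64, 64))        # grayscale mouth-region crop (stand-in)

# One 3x3 convolution with stride 2, as a stand-in for the VGG-style CNN.
kernel = rng.normal(size=(3, 3))
conv = np.array([[(frame[i:i + 3, j:j + 3] * kernel).sum()
                  for j in range(0, 62, 2)] for i in range(0, 62, 2)])
feat = np.maximum(conv, 0).ravel()       # ReLU + flatten

# Regress a short spectral-feature vector for this frame; a vocoder can
# then turn the per-frame features back into a waveform.
W = rng.normal(scale=0.01, size=(9, feat.size))
audio_feature = W @ feat
```

The key design choice is regressing compact acoustic features per frame rather than raw samples, which keeps the output dimensionality manageable.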
49
Speech and Video Vid2Speech
Not end-to-end
50
Speech and Video Vid2Speech
51
Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: Sentence-level Lipreading." arXiv:1611.01599 (2016)
Speech and Video LipNet
52
Speech and Video LipNet
53
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip Reading Sentences in the Wild." arXiv:1611.05358 (2016)
Speech and Video Watch Listen Attend amp Spell
54
Speech and Video Watch Listen Attend amp Spell
55
Conclusions
56
Conclusions
57
Conclusions
[course site]
58
Learn more: Ruslan Salakhutdinov, "Multimodal Machine Learning" (NIPS 2015 Workshop)
59
Thanks! Q&A. Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXavi / ProfessorXavi
16
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
17
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
18
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
19
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
20
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
21
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
22
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
23
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
24
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
25
Challenges Visual Question Answering
Visual Question Answering
26
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]
Challenges Visual Question Answering
27
Vision and language Embeddings
Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR
28
Vision and language Embeddings
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
29
Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings
Christopher Olah
Visualizing Representations
30
Vision and language Devise
One-hot encoding Embedding space
31
Vision and language Devise
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
32
Vision and language Devise
Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013
33
Vision and language Embedding
Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)
34
Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)
35
Multimedia
Text
Audio
Vision
and ratings geolocation time stamps
36
Audio and Video Soundnet
Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Object amp Scenes recognition in videos by analysing the audio track (only)
37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Audio and Video Soundnet
40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art
Audio and Video Soundnet
41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)
Audio and Video Soundnet
45
Audio and Video Sonorizaton
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
Learn synthesized sounds from videos of people hitting objects with a drumstick
46
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
No end-to-end
47
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
CNN(VGG)
Frame from a slient video
Audio feature
49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
No end-to-end
50
Speech and Video Vid2Speech
Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
55
Conclusions
56
Conclusions
57
Conclusions
[course site]
58
Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)
59
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
17
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
18
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
19
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
20
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
21
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
22
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
23
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
24
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
25
Challenges Visual Question Answering
Visual Question Answering
26
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]
Challenges Visual Question Answering
27
Vision and language Embeddings
Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR
28
Vision and language Embeddings
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
29
Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings
Christopher Olah
Visualizing Representations
30
Vision and language Devise
One-hot encoding Embedding space
31
Vision and language Devise
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
32
Vision and language Devise
Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013
33
Vision and language Embedding
Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)
34
Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)
35
Multimedia
Text
Audio
Vision
and ratings geolocation time stamps
36
Audio and Video Soundnet
Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Object amp Scenes recognition in videos by analysing the audio track (only)
37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Audio and Video Soundnet
40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art
Audio and Video Soundnet
41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)
Audio and Video Soundnet
45
Audio and Video Sonorizaton
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
Learn synthesized sounds from videos of people hitting objects with a drumstick
46
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
No end-to-end
47
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
CNN(VGG)
Frame from a slient video
Audio feature
49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
No end-to-end
50
Speech and Video Vid2Speech
Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
55
Conclusions
56
Conclusions
57
Conclusions
[course site]
58
Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)
59
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
18
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
19
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
20
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
21
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
22
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
23
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
24
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
25
Challenges Visual Question Answering
Visual Question Answering
26
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]
Challenges Visual Question Answering
27
Vision and language Embeddings
Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR
28
Vision and language Embeddings
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
29
Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings
Christopher Olah
Visualizing Representations
30
Vision and language Devise
One-hot encoding Embedding space
31
Vision and language Devise
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
32
Vision and language Devise
Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013
33
Vision and language Embedding
Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)
34
Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)
35
Multimedia
Text
Audio
Vision
and ratings geolocation time stamps
36
Audio and Video Soundnet
Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Object amp Scenes recognition in videos by analysing the audio track (only)
37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Audio and Video Soundnet
40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art
Audio and Video Soundnet
41
Visualization of the 1D filters over raw audio in conv1
Audio and Video: SoundNet
42
Visualization of the 1D filters over raw audio in conv1
Audio and Video: SoundNet
43
Visualization of the 1D filters over raw audio in conv1
Audio and Video: SoundNet
44
Visualization of the video frames associated with the sounds that activate some of the last hidden units (conv7)
Audio and Video: SoundNet
45
Audio and Video Sonorizaton
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
Learn synthesized sounds from videos of people hitting objects with a drumstick
46
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
No end-to-end
47
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
48
Ephrat, Ariel, and Shmuel Peleg. "Vid2Speech: Speech Reconstruction from Silent Video." ICASSP 2017.
Speech and Video: Vid2Speech
CNN (VGG)
Frame from a silent video
Audio feature
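The mapping on this slide (a CNN regresses an audio feature vector from a video frame of a speaker; the waveform is then reconstructed from those features in a separate step, hence "not end-to-end") can be sketched as follows. This is an illustrative stand-in, not the VGG-based model from the paper; all shapes and names are assumptions.

```python
# Hypothetical sketch: CNN regression from a face frame to audio features.
import torch
import torch.nn as nn

class FrameToAudioFeatures(nn.Module):
    def __init__(self, n_audio_features=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, n_audio_features)

    def forward(self, frame):              # frame: (batch, 3, H, W)
        return self.fc(self.conv(frame).flatten(1))

model = FrameToAudioFeatures()
frame = torch.randn(2, 3, 64, 64)          # two fake face crops
target = torch.randn(2, 8)                 # matching audio feature vectors
loss = nn.functional.mse_loss(model(frame), target)  # regression objective
loss.backward()
```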
49
Speech and Video: Vid2Speech
Not trained end-to-end
50
Speech and Video: Vid2Speech
51
Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: Sentence-level Lipreading." arXiv:1611.01599 (2016).
Speech and Video: LipNet
52
Speech and Video: LipNet
53
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip Reading Sentences in the Wild." arXiv:1611.05358 (2016).
Speech and Video: Watch, Listen, Attend & Spell
54
Speech and Video: Watch, Listen, Attend & Spell
55
Conclusions
56
Conclusions
57
Conclusions
[course site]
58
Learn more: Russ Salakhutdinov, "Multimodal Machine Learning" (NIPS 2015 Workshop)
59
Thanks! Q&A? Follow me at:
https://imatge.upc.edu/web/people/xavier-giro
DocXavi / ProfessorXavi
20
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
21
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
22
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
23
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
24
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
25
Challenges Visual Question Answering
Visual Question Answering
26
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]
Challenges Visual Question Answering
27
Vision and language Embeddings
Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR
28
Vision and language Embeddings
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
29
Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings
Christopher Olah
Visualizing Representations
30
Vision and language Devise
One-hot encoding Embedding space
31
Vision and language Devise
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
32
Vision and language Devise
Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013
33
Vision and language Embedding
Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)
34
Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)
35
Multimedia
Text
Audio
Vision
and ratings geolocation time stamps
36
Audio and Video Soundnet
Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Object amp Scenes recognition in videos by analysing the audio track (only)
37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Audio and Video Soundnet
40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art
Audio and Video Soundnet
41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)
Audio and Video Soundnet
45
Audio and Video Sonorizaton
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
Learn synthesized sounds from videos of people hitting objects with a drumstick
46
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
No end-to-end
47
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
CNN(VGG)
Frame from a slient video
Audio feature
49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
No end-to-end
50
Speech and Video Vid2Speech
Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
55
Conclusions
56
Conclusions
57
Conclusions
[course site]
58
Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)
59
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
21
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
22
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
23
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
24
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
25
Challenges Visual Question Answering
Visual Question Answering
26
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]
Challenges Visual Question Answering
27
Vision and language Embeddings
Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR
28
Vision and language Embeddings
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
29
Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings
Christopher Olah
Visualizing Representations
30
Vision and language Devise
One-hot encoding Embedding space
31
Vision and language Devise
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
32
Vision and language Devise
Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013
33
Vision and language Embedding
Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)
34
Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)
35
Multimedia
Text
Audio
Vision
and ratings geolocation time stamps
36
Audio and Video Soundnet
Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Object amp Scenes recognition in videos by analysing the audio track (only)
37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Audio and Video Soundnet
40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art
Audio and Video Soundnet
41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)
Audio and Video Soundnet
45
Audio and Video Sonorizaton
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
Learn synthesized sounds from videos of people hitting objects with a drumstick
46
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
No end-to-end
47
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
CNN(VGG)
Frame from a slient video
Audio feature
49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
No end-to-end
50
Speech and Video Vid2Speech
Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
55
Conclusions
56
Conclusions
57
Conclusions
[course site]
58
Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)
59
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
22
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
23
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
24
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
25
Challenges Visual Question Answering
Visual Question Answering
26
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]
Challenges Visual Question Answering
27
Vision and language Embeddings
Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR
28
Vision and language Embeddings
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
29
Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings
Christopher Olah
Visualizing Representations
30
Vision and language Devise
One-hot encoding Embedding space
31
Vision and language Devise
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
32
Vision and language Devise
Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013
33
Vision and language Embedding
Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)
34
Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)
35
Multimedia
Text
Audio
Vision
and ratings geolocation time stamps
36
Audio and Video Soundnet
Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Object amp Scenes recognition in videos by analysing the audio track (only)
37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Audio and Video Soundnet
40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art
Audio and Video Soundnet
41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)
Audio and Video Soundnet
45
Audio and Video Sonorizaton
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
Learn synthesized sounds from videos of people hitting objects with a drumstick
46
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
No end-to-end
47
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
CNN(VGG)
Frame from a slient video
Audio feature
49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
No end-to-end
50
Speech and Video Vid2Speech
Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
55
Conclusions
56
Conclusions
57
Conclusions
[course site]
58
Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)
59
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
23
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
24
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
25
Challenges Visual Question Answering
Visual Question Answering
26
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]
Challenges Visual Question Answering
27
Vision and language Embeddings
Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR
28
Vision and language Embeddings
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
29
Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings
Christopher Olah
Visualizing Representations
30
Vision and language Devise
One-hot encoding Embedding space
31
Vision and language Devise
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
32
Vision and language Devise
Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013
33
Vision and language Embedding
Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)
34
Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)
35
Multimedia
Text
Audio
Vision
and ratings geolocation time stamps
36
Audio and Video Soundnet
Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Object amp Scenes recognition in videos by analysing the audio track (only)
37
Videos for training are unlabeled. Relies on CNNs trained on labeled images.
Audio and Video: SoundNet
38
Audio and Video: SoundNet
39
Audio and Video: SoundNet
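SoundNet's training is teacher-student: vision CNNs trained on labeled images produce class posteriors for the frames of unlabeled videos, and the audio network learns to match those posteriors from the soundtrack alone via a KL-divergence loss. A minimal numpy sketch with made-up logits:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

# teacher: posteriors from a vision CNN on the video frames (assumed given)
teacher = softmax(np.array([2.0, 0.5, -1.0]))
# student: the audio network's current prediction for the same clip
student = softmax(np.array([1.5, 0.2, -0.5]))

loss = kl(teacher, student)  # minimized w.r.t. the audio network's weights
```

The labels thus come for free from the visual channel, which is why the video collection itself needs no annotation.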
40
Hidden layers of SoundNet are used to train a standard SVM classifier that outperforms the state of the art.
Audio and Video: SoundNet
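Using a hidden layer as a clip-level feature for a linear classifier can be sketched as follows; the pooling choice, the sizes, and the nearest-centroid classifier standing in for the paper's SVM are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def pool_features(activations):
    """Global average pooling over time of a hidden conv layer's
    activations (shape: time x channels) -> one clip-level vector."""
    return activations.mean(axis=0)

# hypothetical clips: two acoustic classes with separable pooled features
offset = np.array([2.0] + [0.0] * 7)
clips_a = [rng.standard_normal((20, 8)) + offset for _ in range(5)]
clips_b = [rng.standard_normal((20, 8)) - offset for _ in range(5)]
X = np.array([pool_features(c) for c in clips_a + clips_b])
y = np.array([0] * 5 + [1] * 5)

# nearest-centroid classifier as a stand-in for the paper's linear SVM
centroids = np.array([X[y == k].mean(axis=0) for k in (0, 1)])
pred = np.argmin(((X[:, None, :] - centroids) ** 2).sum(-1), axis=1)
accuracy = (pred == y).mean()
```

The point is that the sound representation is learned without labels; only the cheap linear probe on top is supervised.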
41
Visualization of the 1D filters over raw audio in conv1
Audio and Video: SoundNet
42
Audio and Video: SoundNet
43
Audio and Video: SoundNet
44
Visualization of the video frames associated with the sounds that activate some of the last hidden units (conv7)
Audio and Video: SoundNet
45
Audio and Video: Sonorization
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually Indicated Sounds." CVPR 2016
Learns to synthesize sounds from videos of people hitting objects with a drumstick
46
Audio and Video: Visual Sounds
Not trained end-to-end
47
Audio and Video: Visual Sounds
48
Ephrat, Ariel, and Shmuel Peleg. "Vid2Speech: Speech Reconstruction from Silent Video." ICASSP 2017
Speech and Video: Vid2Speech
CNN (VGG)
Frame from a silent video
Audio feature
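The slide's pipeline (silent frame in, audio feature out) is a per-frame regression. A toy sketch of that mapping, where a tiny two-layer net stands in for the VGG-style CNN and all sizes, including the audio-feature dimension, are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def frame_to_audio_feature(frame, W1, W2):
    """Toy stand-in for the VGG-style CNN: flatten the frame, apply one
    ReLU hidden layer, then regress a fixed-size audio feature vector
    (Vid2Speech predicts acoustic features per frame, later vocoded
    back into a waveform)."""
    h = np.maximum(0.0, frame.ravel() @ W1)   # ReLU hidden layer
    return h @ W2

H, W, hidden, n_audio = 8, 8, 16, 9           # assumed toy sizes
W1 = rng.standard_normal((H * W, hidden)) / np.sqrt(H * W)
W2 = rng.standard_normal((hidden, n_audio)) / np.sqrt(hidden)

frame = rng.standard_normal((H, W))           # one silent-video frame
feat = frame_to_audio_feature(frame, W1, W2)  # regressed audio feature
```

Training would minimize a regression loss (e.g. mean squared error) between `feat` and the acoustic features extracted from the real soundtrack.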
49
Speech and Video: Vid2Speech
Not trained end-to-end
50
Speech and Video: Vid2Speech
51
Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: Sentence-level Lipreading." arXiv preprint arXiv:1611.01599 (2016)
Speech and Video: LipNet
52
Speech and Video: LipNet
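LipNet is trained with the CTC loss, so its per-frame outputs are decoded by collapsing repeated symbols and removing the blank. A minimal greedy CTC decoder:

```python
def ctc_greedy_decode(frame_labels, blank="-"):
    """Collapse per-frame argmax labels the CTC way:
    merge consecutive repeats, then drop the blank symbol."""
    out, prev = [], None
    for c in frame_labels:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return "".join(out)

# per-frame predictions spelling the word "see"
decoded = ctc_greedy_decode(list("ss-ee-e"))  # -> "see"
```

The blank lets the model emit the same letter twice in a row ("e-e" decodes to "ee"), which is why CTC suits unsegmented sequences like lip-reading frames.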
53
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip Reading Sentences in the Wild." arXiv preprint arXiv:1611.05358 (2016)
Speech and Video: Watch, Listen, Attend & Spell
54
Speech and Video: Watch, Listen, Attend & Spell
55
Conclusions
56
Conclusions
57
Conclusions
[course site]
58
Learn more: Ruslan Salakhutdinov, "Multimodal Machine Learning" (NIPS 2015 Workshop)
59
Thanks! Q&A. Follow me at:
https://imatge.upc.edu/web/people/xavier-giro
@DocXavi / ProfessorXavi
24
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
25
Challenges Visual Question Answering
Visual Question Answering
26
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]
Challenges Visual Question Answering
27
Vision and language Embeddings
Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR
28
Vision and language Embeddings
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
29
Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings
Christopher Olah
Visualizing Representations
30
Vision and language Devise
One-hot encoding Embedding space
31
Vision and language Devise
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
32
Vision and language Devise
Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013
33
Vision and language Embedding
Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)
34
Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)
35
Multimedia
Text
Audio
Vision
and ratings geolocation time stamps
36
Audio and Video Soundnet
Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Object amp Scenes recognition in videos by analysing the audio track (only)
37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Audio and Video Soundnet
40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art
Audio and Video Soundnet
41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)
Audio and Video Soundnet
45
Audio and Video Sonorizaton
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
Learn synthesized sounds from videos of people hitting objects with a drumstick
46
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
No end-to-end
47
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
CNN(VGG)
Frame from a slient video
Audio feature
49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
No end-to-end
50
Speech and Video Vid2Speech
Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
55
Conclusions
56
Conclusions
57
Conclusions
[course site]
58
Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)
59
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
25
Challenges Visual Question Answering
Visual Question Answering
26
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]
Challenges Visual Question Answering
27
Vision and language Embeddings
Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR
28
Vision and language Embeddings
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
29
Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings
Christopher Olah
Visualizing Representations
30
Vision and language Devise
One-hot encoding Embedding space
31
Vision and language Devise
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
32
Vision and language Devise
Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013
33
Vision and language Embedding
Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)
34
Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)
35
Multimedia
Text
Audio
Vision
and ratings geolocation time stamps
36
Audio and Video Soundnet
Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Object amp Scenes recognition in videos by analysing the audio track (only)
37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Audio and Video Soundnet
40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art
Audio and Video Soundnet
41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)
Audio and Video Soundnet
45
Audio and Video Sonorizaton
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
Learn synthesized sounds from videos of people hitting objects with a drumstick
46
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
No end-to-end
47
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
CNN(VGG)
Frame from a slient video
Audio feature
49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
No end-to-end
50
Speech and Video Vid2Speech
Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
55
Conclusions
56
Conclusions
57
Conclusions
[course site]
58
Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)
59
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
26
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]
Challenges Visual Question Answering
27
Vision and language Embeddings
Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR
28
Vision and language Embeddings
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
29
Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings
Christopher Olah
Visualizing Representations
30
Vision and language Devise
One-hot encoding Embedding space
31
Vision and language Devise
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
32
Vision and language Devise
Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013
33
Vision and language Embedding
Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)
34
Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)
35
Multimedia
Text
Audio
Vision
and ratings geolocation time stamps
36
Audio and Video Soundnet
Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Object amp Scenes recognition in videos by analysing the audio track (only)
37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Audio and Video Soundnet
40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art
Audio and Video Soundnet
41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)
Audio and Video Soundnet
45
Audio and Video Sonorizaton
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
Learn synthesized sounds from videos of people hitting objects with a drumstick
46
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
No end-to-end
47
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
CNN(VGG)
Frame from a slient video
Audio feature
49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
No end-to-end
50
Speech and Video Vid2Speech
Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
55
Conclusions
56
Conclusions
57
Conclusions
[course site]
58
Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)
59
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
27
Vision and language Embeddings
Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR
28
Vision and language Embeddings
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
29
Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings
Christopher Olah
Visualizing Representations
30
Vision and language Devise
One-hot encoding Embedding space
31
Vision and language Devise
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
32
Vision and language Devise
Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013
33
Vision and language Embedding
Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)
34
Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)
35
Multimedia
Text
Audio
Vision
and ratings geolocation time stamps
36
Audio and Video Soundnet
Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Object amp Scenes recognition in videos by analysing the audio track (only)
37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Audio and Video Soundnet
40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art
Audio and Video Soundnet
41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)
Audio and Video Soundnet
45
Audio and Video Sonorizaton
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
Learn synthesized sounds from videos of people hitting objects with a drumstick
46
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
No end-to-end
47
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
CNN(VGG)
Frame from a slient video
Audio feature
49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
No end-to-end
50
Speech and Video Vid2Speech
Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
55
Conclusions
56
Conclusions
57
Conclusions
[course site]
58
Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)
59
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
28
Vision and language Embeddings
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
29
Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings
Christopher Olah
Visualizing Representations
30
Vision and language Devise
One-hot encoding Embedding space
31
Vision and language Devise
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
32
Vision and language Devise
Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013
33
Vision and language Embedding
Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)
34
Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)
35
Multimedia
Text
Audio
Vision
and ratings geolocation time stamps
36
Audio and Video Soundnet
Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Object amp Scenes recognition in videos by analysing the audio track (only)
37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Audio and Video Soundnet
40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art
Audio and Video Soundnet
41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)
Audio and Video Soundnet
45
Audio and Video Sonorizaton
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
Learn synthesized sounds from videos of people hitting objects with a drumstick
46
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
No end-to-end
47
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
CNN(VGG)
Frame from a slient video
Audio feature
49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
No end-to-end
50
Speech and Video Vid2Speech
Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
55
Conclusions
56
Conclusions
57
Conclusions
[course site]
58
Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)
59
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
30
Vision and language Devise
One-hot encoding Embedding space
31
Vision and language Devise
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
32
Vision and language Devise
Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013
33
Vision and language Embedding
Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)
34
Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)
35
Multimedia
Text
Audio
Vision
and ratings geolocation time stamps
36
Audio and Video Soundnet
Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Object amp Scenes recognition in videos by analysing the audio track (only)
37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Audio and Video Soundnet
40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art
Audio and Video Soundnet
41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)
Audio and Video Soundnet
45
Audio and Video Sonorizaton
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
Learn synthesized sounds from videos of people hitting objects with a drumstick
46
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
No end-to-end
47
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
CNN(VGG)
Frame from a slient video
Audio feature
49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
No end-to-end
50
Speech and Video Vid2Speech
Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
55
Conclusions
56
Conclusions
57
Conclusions
[course site]
58
Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)
59
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
31
Vision and language Devise
A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)
32
Vision and language Devise
Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013
33
Vision and language Embedding
Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)
34
Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)
35
Multimedia
Text
Audio
Vision
and ratings geolocation time stamps
36
Audio and Video Soundnet
Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Object amp Scenes recognition in videos by analysing the audio track (only)
37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Audio and Video Soundnet
40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art
Audio and Video Soundnet
41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)
Audio and Video Soundnet
45
Audio and Video Sonorizaton
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
Learn synthesized sounds from videos of people hitting objects with a drumstick
46
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
No end-to-end
47
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
CNN(VGG)
Frame from a slient video
Audio feature
49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
No end-to-end
50
Speech and Video Vid2Speech
Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
55
Conclusions
56
Conclusions
57
Conclusions
[course site]
58
Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)
59
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
32
Vision and language Devise
Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013
33
Vision and language Embedding
Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)
34
Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)
35
Multimedia
Text
Audio
Vision
and ratings geolocation time stamps
36
Audio and Video Soundnet
Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Object amp Scenes recognition in videos by analysing the audio track (only)
37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Audio and Video Soundnet
40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art
Audio and Video Soundnet
41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)
Audio and Video Soundnet
45
Audio and Video Sonorizaton
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
Learn synthesized sounds from videos of people hitting objects with a drumstick
46
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
No end-to-end
47
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
CNN(VGG)
Frame from a slient video
Audio feature
49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
No end-to-end
50
Speech and Video Vid2Speech
Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
55
Conclusions
56
Conclusions
57
Conclusions
[course site]
58
Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)
59
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
33
Vision and language Embedding
Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)
34
Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)
35
Multimedia
Text
Audio
Vision
and ratings geolocation time stamps
36
Audio and Video Soundnet
Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Object amp Scenes recognition in videos by analysing the audio track (only)
37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Audio and Video Soundnet
40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art
Audio and Video Soundnet
41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)
Audio and Video Soundnet
45
Audio and Video Sonorizaton
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
Learn synthesized sounds from videos of people hitting objects with a drumstick
46
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
No end-to-end
47
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
CNN(VGG)
Frame from a slient video
Audio feature
49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
No end-to-end
50
Speech and Video Vid2Speech
Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
55
Conclusions
56
Conclusions
57
Conclusions
[course site]
58
Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)
59
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
34
Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)
35
Multimedia
Text
Audio
Vision
and ratings geolocation time stamps
36
Audio and Video Soundnet
Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Object amp Scenes recognition in videos by analysing the audio track (only)
37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Audio and Video Soundnet
40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art
Audio and Video Soundnet
41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)
Audio and Video Soundnet
45
Audio and Video Sonorizaton
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
Learn synthesized sounds from videos of people hitting objects with a drumstick
46
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
No end-to-end
47
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
CNN(VGG)
Frame from a slient video
Audio feature
49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
No end-to-end
50
Speech and Video Vid2Speech
Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
55
Conclusions
56
Conclusions
57
Conclusions
[course site]
58
Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)
59
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
35
Multimedia
Text
Audio
Vision
and ratings geolocation time stamps
36
Audio and Video Soundnet
Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Object amp Scenes recognition in videos by analysing the audio track (only)
37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Audio and Video Soundnet
40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art
Audio and Video Soundnet
41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Visualization of the 1D filters over raw audio in conv1
Audio and Video Soundnet
44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)
Audio and Video Soundnet
45
Audio and Video Sonorizaton
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
Learn synthesized sounds from videos of people hitting objects with a drumstick
46
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
No end-to-end
47
Audio and Video Visual Sounds
Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016
48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
CNN(VGG)
Frame from a slient video
Audio feature
49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
Speech and Video Vid2Speech
No end-to-end
50
Speech and Video Vid2Speech
Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
55
Conclusions
56
Conclusions
57
Conclusions
[course site]
58
Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)
59
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
36
Audio and Video Soundnet
Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016
Object amp Scenes recognition in videos by analysing the audio track (only)
37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016
Videos for training are unlabeled Relies on CNNs trained on labeled images
Audio and Video Soundnet
38
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning Sound Representations from Unlabeled Video." NIPS 2016.
Training videos are unlabeled; SoundNet relies on CNNs trained on labeled images.
Audio and Video: SoundNet
39
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning Sound Representations from Unlabeled Video." NIPS 2016.
Audio and Video: SoundNet
40
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning Sound Representations from Unlabeled Video." NIPS 2016.
Hidden layers of SoundNet are used to train a standard SVM classifier that outperforms the state of the art.
Audio and Video: SoundNet
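A minimal sketch of that linear-classifier step: a hand-rolled hinge-loss (SVM-style) classifier trained by subgradient descent, with random vectors standing in for pooled SoundNet hidden-layer activations. The paper uses a standard SVM package; everything here is an illustrative stand-in:

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in for pooled SoundNet features: two well-separated classes in 16-D.
X = np.vstack([rng.normal(loc=-2.0, size=(50, 16)),
               rng.normal(loc=+2.0, size=(50, 16))])
y = np.array([-1] * 50 + [+1] * 50)

w, b = np.zeros(16), 0.0
lr, lam = 0.01, 1e-3
for _ in range(200):                      # subgradient descent on the hinge loss
    for xi, yi in zip(X, y):
        if yi * (xi @ w + b) < 1:         # margin violated: hinge term is active
            w += lr * (yi * xi - lam * w)
            b += lr * yi
        else:                             # only the L2 regularizer contributes
            w -= lr * lam * w

acc = np.mean(np.sign(X @ w + b) == y)    # training accuracy on the toy data
```

With real SoundNet features one would extract activations from an intermediate conv layer, pool them over time, and fit the same kind of linear model per class.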
41
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning Sound Representations from Unlabeled Video." NIPS 2016.
Visualization of the 1D filters over raw audio in conv1.
Audio and Video: SoundNet
42
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning Sound Representations from Unlabeled Video." NIPS 2016.
Visualization of the 1D filters over raw audio in conv1.
Audio and Video: SoundNet
43
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning Sound Representations from Unlabeled Video." NIPS 2016.
Visualization of the 1D filters over raw audio in conv1.
Audio and Video: SoundNet
44
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning Sound Representations from Unlabeled Video." NIPS 2016.
Visualization of the video frames associated with the sounds that activate some of the last hidden units (conv7).
Audio and Video: SoundNet
45
Audio and Video: Sonorization
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually Indicated Sounds." CVPR 2016.
Learns to synthesize sounds from videos of people hitting objects with a drumstick.
46
Audio and Video: Visual Sounds
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually Indicated Sounds." CVPR 2016.
Not trained end-to-end.
47
Audio and Video: Visual Sounds
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually Indicated Sounds." CVPR 2016.
48
Ephrat, Ariel, and Shmuel Peleg. "Vid2Speech: Speech Reconstruction from Silent Video." ICASSP 2017.
Speech and Video: Vid2Speech
[Figure: a CNN (VGG) maps a frame from a silent video to audio features.]
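The Vid2Speech setup is a regression: one video frame (or a short stack of frames) in, one audio feature vector out, trained with an MSE-style objective. That mapping can be illustrated with a linear least-squares model standing in for the VGG-based CNN — all data below is synthetic, and the linear map is only a stand-in for the network:

```python
import numpy as np

rng = np.random.default_rng(2)
# Stand-ins: 200 flattened "mouth-region frames" (64 pixels) and their audio features.
frames = rng.normal(size=(200, 64))
W_true = rng.normal(size=(64, 8))
audio_feats = frames @ W_true            # 8-D audio feature vector per frame (synthetic target)

# Fit a linear frame -> audio-features map by least squares (an MSE objective,
# like the regression loss a CNN would be trained with).
W_hat, *_ = np.linalg.lstsq(frames, audio_feats, rcond=None)
mse = float(np.mean((frames @ W_hat - audio_feats) ** 2))
```

The predicted audio features then still need a separate vocoder-style synthesis step to become a waveform, which is why the pipeline on the next slide is not end-to-end.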
49
Ephrat, Ariel, and Shmuel Peleg. "Vid2Speech: Speech Reconstruction from Silent Video." ICASSP 2017.
Speech and Video: Vid2Speech
Not trained end-to-end.
50
Speech and Video: Vid2Speech
Ephrat, Ariel, and Shmuel Peleg. "Vid2Speech: Speech Reconstruction from Silent Video." ICASSP 2017.
51
Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: Sentence-level Lipreading." arXiv preprint arXiv:1611.01599 (2016).
Speech and Video: LipNet
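LipNet is trained with the CTC loss, so at inference time the per-frame label distributions must be collapsed into a sentence. A sketch of the simplest decoding rule (best-path CTC: argmax per frame, collapse consecutive repeats, drop blanks), shown here on a toy three-letter alphabet rather than LipNet's real character set:

```python
import numpy as np

BLANK = 0  # index reserved for the CTC blank symbol

def ctc_greedy_decode(frame_probs, id_to_char):
    """Best-path CTC decoding: take the argmax label per frame,
    collapse consecutive repeats, then drop blanks."""
    path = np.argmax(frame_probs, axis=-1)
    out, prev = [], None
    for p in path:
        if p != prev and p != BLANK:
            out.append(id_to_char[int(p)])
        prev = p
    return "".join(out)

# Toy per-frame distributions over {blank, 'a', 'b', 'c'} whose best path is c,c,-,a,a,b.
frame_probs = np.eye(4)[[3, 3, 0, 1, 1, 2]]
decoded = ctc_greedy_decode(frame_probs, {1: "a", 2: "b", 3: "c"})  # -> "cab"
```

The blank symbol is what lets CTC align a short character sequence against many video frames without frame-level annotations.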
52
Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: Sentence-level Lipreading." arXiv preprint arXiv:1611.01599 (2016).
Speech and Video: LipNet
53
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip Reading Sentences in the Wild." arXiv preprint arXiv:1611.05358 (2016).
Speech and Video: Watch, Listen, Attend & Spell
54
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip Reading Sentences in the Wild." arXiv preprint arXiv:1611.05358 (2016).
Speech and Video: Watch, Listen, Attend & Spell
55
Conclusions
56
Conclusions
57
Conclusions
[course site]
58
Learn more: Ruslan Salakhutdinov, "Multimodal Machine Learning" (NIPS 2015 Workshop)
59
Thanks! Q&A. Follow me at:
https://imatge.upc.edu/web/people/xavier-giro
@DocXavi / ProfessorXavi
51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
55
Conclusions
56
Conclusions
57
Conclusions
[course site]
58
Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)
59
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
50
Speech and Video Vid2Speech
Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017
51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
55
Conclusions
56
Conclusions
57
Conclusions
[course site]
58
Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)
59
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
55
Conclusions
56
Conclusions
57
Conclusions
[course site]
58
Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)
59
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)
Speech and Video LipNet
53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
55
Conclusions
56
Conclusions
57
Conclusions
[course site]
58
Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)
59
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
55
Conclusions
56
Conclusions
57
Conclusions
[course site]
58
Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)
59
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)
Speech and Video Watch Listen Attend amp Spell
55
Conclusions
56
Conclusions
57
Conclusions
[course site]
58
Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)
59
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
55
Conclusions
56
Conclusions
57
Conclusions
[course site]
58
Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)
59
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
56
Conclusions
57
Conclusions
[course site]
58
Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)
59
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
57
Conclusions
[course site]
58
Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)
59
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
58
Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)
59
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
59
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi