Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)


Transcript of Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

Day 4 Lecture 4

Multimodal Deep Learning

Xavier Giró-i-Nieto

[course site]

2

Multimedia: Text, Audio, Vision... and ratings, geolocation, time stamps.

3

Multimedia: Text, Audio, Vision... and ratings, geolocation, time stamps.

4

5

Language & Vision: Encoder-Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

Representation or Embedding

Lectures D3L4 & D4L2 by Marta Ruiz on Neural Machine Translation

6

Language & Vision: Encoder-Decoder

7

Captioning: DeepImageSent

(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.

8

Captioning: DeepImageSent

(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.

Multimodal Recurrent Neural Network: image features are only taken into account in the first hidden state.
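As a rough illustration of this design (not the authors' code), the sketch below conditions a recurrent captioner on the CNN image feature only through the initial hidden state; every later step sees only the previously generated words. Layer names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class ImageInitCaptioner(nn.Module):
    """Sketch: the CNN image feature only initializes the first hidden state."""
    def __init__(self, vocab_size, img_dim=4096, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.img_to_h0 = nn.Linear(img_dim, hidden_dim)   # image enters here only
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.to_vocab = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat, caption_tokens):
        # img_feat: (B, img_dim) CNN features; caption_tokens: (B, T) word ids
        h0 = torch.tanh(self.img_to_h0(img_feat)).unsqueeze(0)  # (1, B, H)
        x = self.embed(caption_tokens)                          # (B, T, E)
        out, _ = self.rnn(x, h0)     # image is NOT re-injected at later steps
        return self.to_vocab(out)    # (B, T, vocab) word scores per step
```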

9

Captioning: Show & Tell

Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.

10

Captioning: Show & Tell

Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.

11

Captioning: Show, Attend & Tell

Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015.

12

Captioning: Show, Attend & Tell

Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015.

13

Captioning: LSTM with image & video

Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015. [code]

14

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.

Captioning (+ Detection): DenseCap

15

Captioning (+ Detection): DenseCap

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.

16

Captioning (+ Detection): DenseCap

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.

XAVI: "man has short hair", "man with short hair"

AMAIA: "a woman wearing a black shirt"

BOTH: "two men wearing black glasses"

17

Captioning (+ Retrieval): DenseCap

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.

18

Captioning: HRNE

(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016.

(Figure labels: LSTM unit (2nd layer); Time; Image; t = 1 ... t = T; hidden state at t = T; first chunk of data.)

19

Visual Question Answering

[z1, z2, ..., zN] [y1, y2, ..., yM]

"Is economic growth decreasing?"

"Yes"

Encode / Encode

Decode

20

Extract visual features

Embedding

Merge → Predict answer

Question: "What object is flying?"

Answer: "Kite"

Visual Question Answering

Slide credit: Issey Masuda
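A minimal sketch of this pipeline (an illustration, not the thesis implementation): CNN visual features and an embedded question are merged, and a classifier scores a fixed set of candidate answers. All names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    """Sketch of the slide's pipeline: visual features + question embedding
    are merged and a classifier picks one answer from a fixed answer set."""
    def __init__(self, vocab_size, num_answers, img_dim=4096,
                 embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.q_rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, img_feat, question_tokens):
        _, (h, _) = self.q_rnn(self.embed(question_tokens))  # encode question
        q = h[-1]                                            # (B, hidden_dim)
        v = torch.tanh(self.img_proj(img_feat))              # project image features
        merged = q * v               # element-wise "merge" of both modalities
        return self.classifier(merged)  # scores over candidate answers ("Kite", ...)
```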

21

Visual Question Answering

Noh, H., Seo, P. H., & Han, B. "Image question answering using convolutional neural network with dynamic parameter prediction." CVPR 2016.

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering: Dynamic

(Slides and slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." arXiv preprint arXiv:1603.01417 (2016).

23

Visual Question Answering: Dynamic

(Slides and slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016.

Main idea: split the image into local regions and consider each region equivalent to a sentence.

Local region feature extraction with a CNN (VGG-19): (1) rescale the input to 448x448; (2) take the output of the last pooling layer → D = 512x14x14 → 196 local region vectors of 512 dimensions each.

Visual feature embedding: a matrix W projects the image features into the ("q") textual space.
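A sketch of that extraction step, assuming a recent torchvision VGG-19 as the backbone: the 14x14 positions of the last pooling map give 196 region vectors of 512 dimensions, and a learned matrix W (trained jointly in the real model) projects each one into the textual space.

```python
import torch
import torch.nn as nn
from torchvision import models

# Assumptions: torchvision VGG-19 backbone, textual space of 512 dimensions.
vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.eval()
W = nn.Linear(512, 512)   # learned projection to the "q" textual space

def image_to_region_facts(img_448):
    # img_448: (B, 3, 448, 448) batch of normalized images
    with torch.no_grad():
        fmap = vgg(img_448)                    # (B, 512, 14, 14) last pooling output
    regions = fmap.flatten(2).transpose(1, 2)  # (B, 196, 512): 14*14 local regions
    return W(regions)                          # each region treated like a "sentence"
```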

24

Visual Question Answering: Grounded

(Slides and screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016.

25

Challenges: Visual Question Answering

Visual Question Answering

26

VQA challenge results (accuracy, %):

Humans: 83.30
UC Berkeley & Sony: 66.47
Baseline LSTM & CNN: 54.06
Baseline Nearest neighbor: 42.85
Baseline Prior per question type: 37.47
Baseline All "yes": 29.88

I. Masuda-Mora, "Open-Ended Visual Question-Answering", BSc thesis, ETSETB 2016 [Keras]: 53.62

Challenges: Visual Question Answering

27

Vision and language: Embeddings

Perronnin, F. "Output embedding for LSVR." CVPR'14 Tutorial on LSVR.

28

Vision and language: Embeddings

A. Krizhevsky, I. Sutskever, G. E. Hinton. "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems 25 (NIPS 2012).

29

Vision and language: Embeddings. Lecture D2L4 by Toni Bonafonte on Word Embeddings.

Christopher Olah, "Visualizing Representations"

30

Vision and language: DeViSE

One-hot encoding vs. embedding space

31

Vision and language: DeViSE

A. Krizhevsky, I. Sutskever, G. E. Hinton. "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems 25 (NIPS 2012).

32

Vision and language: DeViSE

Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. "DeViSE: A Deep Visual-Semantic Embedding Model." NIPS 2013.
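A sketch of the DeViSE idea under assumed dimensions: project CNN image features into a pretrained word-embedding space and train with a hinge ranking loss so the image lands closer to its label's word vector than to other labels' vectors. The similarity and margin below are illustrative, not the paper's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

proj = nn.Linear(4096, 300)   # CNN feature -> word-embedding space (e.g. 300-d vectors)

def devise_rank_loss(img_feat, label_vec, wrong_vecs, margin=0.1):
    """Hinge ranking loss: the projected image should be more similar to the
    correct label embedding than to each wrong label embedding, by a margin."""
    v = F.normalize(proj(img_feat), dim=-1)                            # (B, 300)
    pos = (v * F.normalize(label_vec, dim=-1)).sum(-1, keepdim=True)   # (B, 1)
    neg = v @ F.normalize(wrong_vecs, dim=-1).t()                      # (B, K)
    return F.relu(margin - pos + neg).sum(dim=1).mean()
```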

33

Vision and language: Embedding

Víctor Campos, Dèlia Fernàndez, Jordi Torres, Xavier Giró-i-Nieto, Brendan Jou, and Shih-Fu Chang (work in progress).

34

Learn more: Julia Hockenmaier (UIUC), Vision to Language (Microsoft Research).

35

Multimedia: Text, Audio, Vision... and ratings, geolocation, time stamps.

36

Audio and Video: SoundNet

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." In Advances in Neural Information Processing Systems, pp. 892-900, 2016.

Object & scene recognition in videos by analysing the audio track (only).

37

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.

The training videos are unlabeled; the model relies on CNNs trained on labeled images to provide supervision.
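A sketch of this student-teacher setup (the networks below are small stand-ins, not the SoundNet architecture): a vision CNN pretrained on labeled images produces soft object/scene predictions for the video frames, and the audio network is trained on the raw waveform to match them with a KL-divergence loss, so no video labels are needed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-in for the 1D audio ConvNet (SoundNet itself is much deeper).
audio_net = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=64, stride=8), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(16, 1000),          # predicts the vision teacher's class distribution
)

def distill_step(waveform, frames, vision_teacher, optimizer):
    """One training step: no labels, only the teacher's soft predictions."""
    with torch.no_grad():
        teacher_probs = F.softmax(vision_teacher(frames), dim=-1)  # labeled-image CNN
    student_logp = F.log_softmax(audio_net(waveform), dim=-1)      # from raw audio
    loss = F.kl_div(student_logp, teacher_probs, reduction="batchmean")
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```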

Audio and Video: SoundNet

38

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.

The training videos are unlabeled; the model relies on CNNs trained on labeled images to provide supervision.

Audio and Video: SoundNet

39

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." In Advances in Neural Information Processing Systems, pp. 892-900, 2016.

Audio and Video: SoundNet

40

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.

Hidden layers of SoundNet are used to train a standard SVM classifier that outperforms the state of the art.
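A sketch of that evaluation protocol, assuming SoundNet activations have already been extracted into one feature vector per clip (random placeholders below): fit a standard linear SVM on the deep audio features for the target recognition task.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

# Placeholder data: in practice X would hold SoundNet hidden-layer activations
# (e.g. from an intermediate conv layer) per audio clip, and y the scene labels.
X = np.random.randn(1000, 1024).astype(np.float32)
y = np.random.randint(0, 50, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LinearSVC(C=1.0).fit(X_tr, y_tr)      # standard linear SVM on deep features
print("accuracy:", clf.score(X_te, y_te))
```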

Audio and Video: SoundNet

41

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.

Visualization of the 1D filters over raw audio in conv1.

Audio and Video: SoundNet

42

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.

Visualization of the 1D filters over raw audio in conv1.

Audio and Video: SoundNet

43

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.

Visualization of the 1D filters over raw audio in conv1.

Audio and Video: SoundNet

44

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." In Advances in Neural Information Processing Systems, pp. 892-900, 2016.

Visualization of the video frames associated with the sounds that activate some of the last hidden units (conv7).

Audio and Video: SoundNet

45

Audio and Video: Sonorization

Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016.

Learn to synthesize sounds from videos of people hitting objects with a drumstick.

46

Audio and Video: Visual Sounds

Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016.

Not end-to-end.

47

Audio and Video: Visual Sounds

Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016.

48

Ephrat, Ariel, and Shmuel Peleg. "Vid2Speech: Speech Reconstruction from Silent Video." ICASSP 2017.

Speech and Video: Vid2Speech

CNN (VGG)

Frame from a silent video

Audio feature
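A sketch of the mapping this slide describes, with assumed layer sizes and feature dimension: a VGG-flavoured CNN takes a short stack of frames from the silent video and regresses the audio feature vector for the corresponding time step.

```python
import torch
import torch.nn as nn

class FramesToAudioFeatures(nn.Module):
    """Sketch: CNN over a short clip of grayscale frames -> one audio feature vector."""
    def __init__(self, n_frames=9, audio_feat_dim=32):
        super().__init__()
        self.conv = nn.Sequential(                  # VGG-flavoured stack (illustrative)
            nn.Conv2d(n_frames, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.regressor = nn.Linear(128, audio_feat_dim)  # regress audio features

    def forward(self, frames):       # frames: (B, n_frames, H, W), e.g. mouth crops
        return self.regressor(self.conv(frames))

# Training would minimize e.g. nn.MSELoss() between predicted and true audio features.
```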

49

Ephrat, Ariel, and Shmuel Peleg. "Vid2Speech: Speech Reconstruction from Silent Video." ICASSP 2017.

Speech and Video: Vid2Speech

Not end-to-end.

50

Speech and Video: Vid2Speech

Ephrat, Ariel, and Shmuel Peleg. "Vid2Speech: Speech Reconstruction from Silent Video." ICASSP 2017.

51

Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: Sentence-level Lipreading." arXiv preprint arXiv:1611.01599 (2016).

Speech and Video: LipNet

52

Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: Sentence-level Lipreading." arXiv preprint arXiv:1611.01599 (2016).

Speech and Video: LipNet

53

Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." arXiv preprint arXiv:1611.05358 (2016).

Speech and Video: Watch, Listen, Attend & Spell

54

Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." arXiv preprint arXiv:1611.05358 (2016).

Speech and Video: Watch, Listen, Attend & Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn more: Russ Salakhutdinov, "Multimodal Machine Learning" (NIPS 2015 Workshop).

59

Thanks! Q&A. Follow me at:

https://imatge.upc.edu/web/people/xavier-giro

DocXavi / ProfessorXavi

2

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

3

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

4

5

Language amp Vision Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Representation or Embedding

Lectures D3L4 amp D4L2 by Marta Ruiz on Neural Machine Translation

6

Language amp Vision Encoder-Decoder

7

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

8

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into account image features in the first hidden state

Multimodal Recurrent Neural Network

9

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

10

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

11

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

12

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

3

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

4

5

Language amp Vision Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Representation or Embedding

Lectures D3L4 amp D4L2 by Marta Ruiz on Neural Machine Translation

6

Language amp Vision Encoder-Decoder

7

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

8

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into account image features in the first hidden state

Multimodal Recurrent Neural Network

9

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

10

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

11

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

12

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

4

5

Language amp Vision Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Representation or Embedding

Lectures D3L4 amp D4L2 by Marta Ruiz on Neural Machine Translation

6

Language amp Vision Encoder-Decoder

7

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

8

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into account image features in the first hidden state

Multimodal Recurrent Neural Network

9

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

10

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

11

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

12

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

5

Language amp Vision Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Representation or Embedding

Lectures D3L4 amp D4L2 by Marta Ruiz on Neural Machine Translation

6

Language amp Vision Encoder-Decoder

7

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

8

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into account image features in the first hidden state

Multimodal Recurrent Neural Network

9

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

10

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

11

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

12

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

6

Language amp Vision Encoder-Decoder

7

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

8

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into account image features in the first hidden state

Multimodal Recurrent Neural Network

9

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

10

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

11

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

12

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object & scene recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet
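A toy sketch of this teacher-student setup (the real SoundNet is a much deeper 1D convolutional network and also distills a scene classifier; every size here is a placeholder): the audio network is trained, with a KL-divergence loss, to match the class posteriors that a pretrained vision CNN assigns to the accompanying frames, so the training videos themselves never need labels.

    import torch
    import torch.nn.functional as F

    audio_net = torch.nn.Sequential(                 # stand-in for the 1D conv stack over raw audio
        torch.nn.Conv1d(1, 16, kernel_size=64, stride=2), torch.nn.ReLU(),
        torch.nn.AdaptiveAvgPool1d(1), torch.nn.Flatten(),
        torch.nn.Linear(16, 1000),                   # ImageNet-style class scores
    )

    def soundnet_loss(raw_audio, teacher_logits):
        # raw_audio: (B, 1, T) waveform; teacher_logits: (B, 1000) from the vision CNN teacher
        student_logp = F.log_softmax(audio_net(raw_audio), dim=-1)
        teacher_p = F.softmax(teacher_logits, dim=-1)
        return F.kl_div(student_logp, teacher_p, reduction='batchmean')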

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of SoundNet are used to train a standard SVM classifier that outperforms the state of the art

Audio and Video Soundnet
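A sketch of that evaluation protocol with scikit-learn; the random arrays below are hypothetical stand-ins for the hidden-layer activations of each clip and its scene/object label.

    import numpy as np
    from sklearn.svm import LinearSVC

    X_train = np.random.randn(500, 1024)        # assumed: one SoundNet hidden-layer vector per clip
    y_train = np.random.randint(0, 10, 500)     # assumed: scene/object class per clip

    clf = LinearSVC(C=1.0)                      # standard linear SVM on top of the frozen features
    clf.fit(X_train, y_train)
    predictions = clf.predict(np.random.randn(5, 1024))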

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated with the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorization

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learning to synthesize sounds from videos of people hitting objects with a drumstick
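A very loose sketch of the idea (all sizes are assumed, and the paper's actual pipeline predicts sound features that a separate synthesis step turns into a waveform, which is why it is not end-to-end): an RNN reads per-frame visual features and regresses a sequence of sound features.

    import torch

    visual_dim, sound_dim, hid = 512, 42, 256            # assumed dimensions
    rnn = torch.nn.LSTM(visual_dim, hid, batch_first=True)
    to_sound = torch.nn.Linear(hid, sound_dim)

    frame_feats = torch.randn(1, 30, visual_dim)          # CNN features of 30 video frames
    states, _ = rnn(frame_feats)                          # one hidden state per frame
    sound_feats = to_sound(states)                        # (1, 30, sound_dim) predicted sound features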

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Not end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a silent video

Audio feature
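A rough sketch of this regression setup (layer sizes, input resolution and the dimensionality of the acoustic feature are assumptions, not the paper's configuration): a small VGG-style CNN maps a frame of the silent video to the acoustic feature of the time-aligned audio, trained with a regression loss.

    import torch

    cnn = torch.nn.Sequential(
        torch.nn.Conv2d(3, 32, 3, padding=1), torch.nn.ReLU(), torch.nn.MaxPool2d(2),
        torch.nn.Conv2d(32, 64, 3, padding=1), torch.nn.ReLU(), torch.nn.AdaptiveAvgPool2d(1),
        torch.nn.Flatten(),
        torch.nn.Linear(64, 18),                  # assumed 18-d acoustic feature per frame
    )

    frame = torch.randn(1, 3, 128, 128)           # a frame from a silent video (assumed size)
    audio_feature = cnn(frame)                    # regressed against the true audio feature (e.g. MSE)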

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

Not end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv:1611.01599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv:1611.01599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv:1611.05358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv:1611.05358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn more: Ruslan Salakhutdinov, "Multimodal Machine Learning" (NIPS 2015 Workshop)

59

Thanks! Q&A. Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

7

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

8

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into account image features in the first hidden state

Multimodal Recurrent Neural Network

9

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

10

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

11

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

12

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

8

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into account image features in the first hidden state

Multimodal Recurrent Neural Network

9

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

10

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

11

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

12

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

9

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

10

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

11

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

12

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

10

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

11

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

12

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

11

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

12

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

12

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of SoundNet are used to train a standard SVM classifier that outperforms the state of the art.

Audio and Video Soundnet
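The evaluation protocol stated above can be sketched as follows, assuming a helper extract_soundnet_features that runs the pretrained network and returns one internal-layer activation vector per clip (faked here with random features so the sketch stays runnable); the feature size and the C value are assumptions.

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def extract_soundnet_features(waveforms):
    # Placeholder for running the pretrained SoundNet and keeping an
    # internal layer's activations as a fixed feature vector per clip.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(waveforms), 1024))

clips = [None] * 200                     # stand-ins for raw audio clips
labels = np.repeat(np.arange(10), 20)    # 10 acoustic scene classes
features = extract_soundnet_features(clips)
svm = LinearSVC(C=0.01)                  # standard linear SVM on top
print(cross_val_score(svm, features, labels, cv=5).mean())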

41

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated with the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorization

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learns to synthesize sounds from videos of people hitting objects with a drumstick.

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Not end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN (VGG)

Frame from a silent video

Audio feature
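A minimal Keras sketch of the frame-to-audio-feature regression shown above; the input size, the number of stacked frames and the size of the acoustic feature vector are assumptions, and the small VGG-style trunk only stands in for the actual Vid2Speech architecture.

import tensorflow as tf

n_acoustic = 18                                    # assumed acoustic feature size
frames = tf.keras.Input(shape=(128, 128, 5))       # assumed: 5 stacked grayscale frames
x = frames
for filters in (32, 64, 128):                      # small VGG-style convolutional trunk
    x = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.MaxPooling2D()(x)
x = tf.keras.layers.Flatten()(x)
x = tf.keras.layers.Dense(512, activation="relu")(x)
audio_features = tf.keras.layers.Dense(n_acoustic)(x)  # regressed per-frame audio features
vid2speech = tf.keras.Model(frames, audio_features)
vid2speech.compile(optimizer="adam", loss="mse")   # trained as a regression to real features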

49

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

Not end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51

Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv:1611.01599 (2016)

Speech and Video LipNet

52

Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv:1611.01599 (2016)

Speech and Video LipNet

53

Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv:1611.05358 (2016)

Speech and Video Watch Listen Attend & Spell

54

Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv:1611.05358 (2016)

Speech and Video Watch Listen Attend & Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn more: Ruslan Salakhutdinov "Multimodal Machine Learning" (NIPS 2015 Workshop)

59

Thanks! Q&A. Follow me at:

https://imatge.upc.edu/web/people/xavier-giro

DocXavi / ProfessorXavi

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learns to synthesize sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Not end-to-end
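A minimal sketch of why the pipeline is not end-to-end, under the assumption that a video model has already regressed a sound-feature vector for the impact: the output waveform is retrieved from a library of real recordings by nearest neighbour (example-based synthesis); all arrays below are toy data.

```python
# Illustrative example-based synthesis step (toy data, not the authors' code).
import numpy as np

rng = np.random.default_rng(0)
library_feats = rng.standard_normal((500, 64))                   # features of recorded impact sounds
library_waves = [rng.standard_normal(2205) for _ in range(500)]  # their waveforms (toy length)

def synthesize(predicted_feat):
    # Play back the recorded sound whose features best match the prediction.
    idx = int(np.argmin(np.linalg.norm(library_feats - predicted_feat, axis=1)))
    return library_waves[idx]

predicted = rng.standard_normal(64)    # would come from the video model
print("retrieved waveform length:", len(synthesize(predicted)))
```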

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

[Diagram: frame from a silent video → CNN (VGG) → audio features]
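A minimal Keras sketch of that frame-to-audio-features mapping, not the authors' architecture: a small CNN (standing in for the VGG-style network) regresses a short vector of audio features from a single frame; the resolution, layer sizes and feature dimension are assumptions.

```python
# Illustrative regression from one video frame to an audio-feature vector.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_audio_feats = 8                          # assumed size of the audio feature vector
model = keras.Sequential([
    keras.Input(shape=(128, 128, 3)),      # one frame of the silent video (toy size)
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(n_audio_feats),           # linear output: regression, not classification
])
model.compile(optimizer="adam", loss="mse")

# Toy (frame, audio-feature) pairs standing in for data extracted from real videos.
frames = np.random.rand(16, 128, 128, 3).astype("float32")
audio = np.random.rand(16, n_audio_feats).astype("float32")
model.fit(frames, audio, epochs=1, batch_size=4, verbose=0)
print(model.predict(frames[:1]).shape)     # -> (1, 8)
```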

49

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

Not end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51

Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv:1611.01599 (2016)

Speech and Video LipNet

52

Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv:1611.01599 (2016)

Speech and Video LipNet

53

Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv:1611.05358 (2016)

Speech and Video Watch Listen Attend amp Spell

54

Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv:1611.05358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn more: Ruslan Salakhutdinov, "Multimodal Machine Learning" (NIPS 2015 Workshop)

59

Thanks Q&A Follow me at

https://imatge.upc.edu/web/people/xavier-giro

DocXavi / ProfessorXavi


Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

58

Learn more: Ruslan Salakhutdinov "Multimodal Machine Learning" (NIPS 2015 Workshop)

59

Thanks! Q&A Follow me at

https://imatge.upc.edu/web/people/xavier-giro

@DocXavi @ProfessorXavi


Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi