Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)


Transcript of Multimodal Deep Learning (D4L4 Deep Learning for Speech and Language UPC 2017)

Day 4 Lecture 4

Multimodal Deep Learning

Xavier Giró-i-Nieto

[course site]

2

Multimedia: Text, Audio, Vision... and ratings, geolocation, time stamps.

3

Multimedia: Text, Audio, Vision... and ratings, geolocation, time stamps.

4

5

Language & Vision: Encoder-Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

Representation or Embedding

Lectures D3L4 & D4L2 by Marta Ruiz on Neural Machine Translation

6

Language & Vision: Encoder-Decoder

7

Captioning: DeepImageSent

(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.

8

Captioning: DeepImageSent

(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.

Multimodal Recurrent Neural Network: image features are only taken into account in the first hidden state.
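As a rough illustration of this design (not the authors' code), the sketch below conditions a recurrent captioner on the CNN image feature only through the initial hidden state; every later step sees only the previously generated words. Layer names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class ImageInitCaptioner(nn.Module):
    """Sketch: the CNN image feature only initializes the first hidden state."""
    def __init__(self, vocab_size, img_dim=4096, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.img_to_h0 = nn.Linear(img_dim, hidden_dim)   # image enters here only
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.to_vocab = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat, caption_tokens):
        # img_feat: (B, img_dim) CNN features; caption_tokens: (B, T) word ids
        h0 = torch.tanh(self.img_to_h0(img_feat)).unsqueeze(0)  # (1, B, H)
        x = self.embed(caption_tokens)                          # (B, T, E)
        out, _ = self.rnn(x, h0)     # image is NOT re-injected at later steps
        return self.to_vocab(out)    # (B, T, vocab) word scores per step
```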

9

Captioning: Show & Tell

Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.

10

Captioning: Show & Tell

Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.

11

Captioning: Show, Attend & Tell

Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015.

12

Captioning: Show, Attend & Tell

Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015.

13

Captioning: LSTM with image & video

Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015. [code]

14

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.

Captioning (+ Detection): DenseCap

15

Captioning (+ Detection): DenseCap

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.

16

Captioning (+ Detection): DenseCap

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.

XAVI: "man has short hair", "man with short hair"

AMAIA: "a woman wearing a black shirt"

BOTH: "two men wearing black glasses"

17

Captioning (+ Retrieval): DenseCap

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.

18

Captioning: HRNE

(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016.

(Figure labels: LSTM unit (2nd layer); Time; Image; t = 1 ... t = T; hidden state at t = T; first chunk of data.)

19

Visual Question Answering

[z1, z2, ..., zN] [y1, y2, ..., yM]

"Is economic growth decreasing?"

"Yes"

Encode / Encode

Decode

20

Extract visual features

Embedding

Merge → Predict answer

Question: "What object is flying?"

Answer: "Kite"

Visual Question Answering

Slide credit: Issey Masuda
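A minimal sketch of this pipeline (an illustration, not the thesis implementation): CNN visual features and an embedded question are merged, and a classifier scores a fixed set of candidate answers. All names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    """Sketch of the slide's pipeline: visual features + question embedding
    are merged and a classifier picks one answer from a fixed answer set."""
    def __init__(self, vocab_size, num_answers, img_dim=4096,
                 embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.q_rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, img_feat, question_tokens):
        _, (h, _) = self.q_rnn(self.embed(question_tokens))  # encode question
        q = h[-1]                                            # (B, hidden_dim)
        v = torch.tanh(self.img_proj(img_feat))              # project image features
        merged = q * v               # element-wise "merge" of both modalities
        return self.classifier(merged)  # scores over candidate answers ("Kite", ...)
```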

21

Visual Question Answering

Noh, H., Seo, P. H., & Han, B. "Image question answering using convolutional neural network with dynamic parameter prediction." CVPR 2016.

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering: Dynamic

(Slides and slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." arXiv preprint arXiv:1603.01417 (2016).

23

Visual Question Answering: Dynamic

(Slides and slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016.

Main idea: split the image into local regions and consider each region equivalent to a sentence.

Local region feature extraction with a CNN (VGG-19): (1) rescale the input to 448x448; (2) take the output of the last pooling layer → D = 512x14x14 → 196 local region vectors of 512 dimensions each.

Visual feature embedding: a matrix W projects the image features into the ("q") textual space.
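A sketch of that extraction step, assuming a recent torchvision VGG-19 as the backbone: the 14x14 positions of the last pooling map give 196 region vectors of 512 dimensions, and a learned matrix W (trained jointly in the real model) projects each one into the textual space.

```python
import torch
import torch.nn as nn
from torchvision import models

# Assumptions: torchvision VGG-19 backbone, textual space of 512 dimensions.
vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.eval()
W = nn.Linear(512, 512)   # learned projection to the "q" textual space

def image_to_region_facts(img_448):
    # img_448: (B, 3, 448, 448) batch of normalized images
    with torch.no_grad():
        fmap = vgg(img_448)                    # (B, 512, 14, 14) last pooling output
    regions = fmap.flatten(2).transpose(1, 2)  # (B, 196, 512): 14*14 local regions
    return W(regions)                          # each region treated like a "sentence"
```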

24

Visual Question Answering: Grounded

(Slides and screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016.

25

Challenges: Visual Question Answering

Visual Question Answering

26

VQA challenge results (accuracy, %):

Humans: 83.30
UC Berkeley & Sony: 66.47
Baseline LSTM & CNN: 54.06
Baseline Nearest neighbor: 42.85
Baseline Prior per question type: 37.47
Baseline All "yes": 29.88

I. Masuda-Mora, "Open-Ended Visual Question-Answering", BSc thesis, ETSETB 2016 [Keras]: 53.62

Challenges: Visual Question Answering

27

Vision and language: Embeddings

Perronnin, F. "Output embedding for LSVR." CVPR'14 Tutorial on LSVR.

28

Vision and language: Embeddings

A. Krizhevsky, I. Sutskever, G. E. Hinton. "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems 25 (NIPS 2012).

29

Vision and language: Embeddings. Lecture D2L4 by Toni Bonafonte on Word Embeddings.

Christopher Olah, "Visualizing Representations"

30

Vision and language: DeViSE

One-hot encoding vs. embedding space

31

Vision and language: DeViSE

A. Krizhevsky, I. Sutskever, G. E. Hinton. "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems 25 (NIPS 2012).

32

Vision and language: DeViSE

Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. "DeViSE: A Deep Visual-Semantic Embedding Model." NIPS 2013.
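A sketch of the DeViSE idea under assumed dimensions: project CNN image features into a pretrained word-embedding space and train with a hinge ranking loss so the image lands closer to its label's word vector than to other labels' vectors. The similarity and margin below are illustrative, not the paper's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

proj = nn.Linear(4096, 300)   # CNN feature -> word-embedding space (e.g. 300-d vectors)

def devise_rank_loss(img_feat, label_vec, wrong_vecs, margin=0.1):
    """Hinge ranking loss: the projected image should be more similar to the
    correct label embedding than to each wrong label embedding, by a margin."""
    v = F.normalize(proj(img_feat), dim=-1)                            # (B, 300)
    pos = (v * F.normalize(label_vec, dim=-1)).sum(-1, keepdim=True)   # (B, 1)
    neg = v @ F.normalize(wrong_vecs, dim=-1).t()                      # (B, K)
    return F.relu(margin - pos + neg).sum(dim=1).mean()
```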

33

Vision and language: Embedding

Víctor Campos, Dèlia Fernàndez, Jordi Torres, Xavier Giró-i-Nieto, Brendan Jou, and Shih-Fu Chang (work in progress).

34

Learn more: Julia Hockenmaier (UIUC), Vision to Language (Microsoft Research).

35

Multimedia: Text, Audio, Vision... and ratings, geolocation, time stamps.

36

Audio and Video: SoundNet

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." In Advances in Neural Information Processing Systems, pp. 892-900, 2016.

Object & scene recognition in videos by analysing the audio track (only).

37

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.

The training videos are unlabeled; the model relies on CNNs trained on labeled images to provide supervision.
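A sketch of this student-teacher setup (the networks below are small stand-ins, not the SoundNet architecture): a vision CNN pretrained on labeled images produces soft object/scene predictions for the video frames, and the audio network is trained on the raw waveform to match them with a KL-divergence loss, so no video labels are needed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-in for the 1D audio ConvNet (SoundNet itself is much deeper).
audio_net = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=64, stride=8), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(16, 1000),          # predicts the vision teacher's class distribution
)

def distill_step(waveform, frames, vision_teacher, optimizer):
    """One training step: no labels, only the teacher's soft predictions."""
    with torch.no_grad():
        teacher_probs = F.softmax(vision_teacher(frames), dim=-1)  # labeled-image CNN
    student_logp = F.log_softmax(audio_net(waveform), dim=-1)      # from raw audio
    loss = F.kl_div(student_logp, teacher_probs, reduction="batchmean")
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```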

Audio and Video: SoundNet

38

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.

The training videos are unlabeled; the model relies on CNNs trained on labeled images to provide supervision.

Audio and Video: SoundNet

39

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." In Advances in Neural Information Processing Systems, pp. 892-900, 2016.

Audio and Video: SoundNet

40

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.

Hidden layers of SoundNet are used to train a standard SVM classifier that outperforms the state of the art.
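A sketch of that evaluation protocol, assuming SoundNet activations have already been extracted into one feature vector per clip (random placeholders below): fit a standard linear SVM on the deep audio features for the target recognition task.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

# Placeholder data: in practice X would hold SoundNet hidden-layer activations
# (e.g. from an intermediate conv layer) per audio clip, and y the scene labels.
X = np.random.randn(1000, 1024).astype(np.float32)
y = np.random.randint(0, 50, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LinearSVC(C=1.0).fit(X_tr, y_tr)      # standard linear SVM on deep features
print("accuracy:", clf.score(X_te, y_te))
```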

Audio and Video: SoundNet

41

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.

Visualization of the 1D filters over raw audio in conv1.

Audio and Video: SoundNet

42

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.

Visualization of the 1D filters over raw audio in conv1.

Audio and Video: SoundNet

43

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.

Visualization of the 1D filters over raw audio in conv1.

Audio and Video: SoundNet

44

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." In Advances in Neural Information Processing Systems, pp. 892-900, 2016.

Visualization of the video frames associated with the sounds that activate some of the last hidden units (conv7).

Audio and Video: SoundNet

45

Audio and Video: Sonorization

Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016.

Learn to synthesize sounds from videos of people hitting objects with a drumstick.

46

Audio and Video: Visual Sounds

Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016.

Not end-to-end.

47

Audio and Video: Visual Sounds

Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016.

48

Ephrat, Ariel, and Shmuel Peleg. "Vid2Speech: Speech Reconstruction from Silent Video." ICASSP 2017.

Speech and Video: Vid2Speech

CNN (VGG)

Frame from a silent video

Audio feature
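A sketch of the mapping this slide describes, with assumed layer sizes and feature dimension: a VGG-flavoured CNN takes a short stack of frames from the silent video and regresses the audio feature vector for the corresponding time step.

```python
import torch
import torch.nn as nn

class FramesToAudioFeatures(nn.Module):
    """Sketch: CNN over a short clip of grayscale frames -> one audio feature vector."""
    def __init__(self, n_frames=9, audio_feat_dim=32):
        super().__init__()
        self.conv = nn.Sequential(                  # VGG-flavoured stack (illustrative)
            nn.Conv2d(n_frames, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.regressor = nn.Linear(128, audio_feat_dim)  # regress audio features

    def forward(self, frames):       # frames: (B, n_frames, H, W), e.g. mouth crops
        return self.regressor(self.conv(frames))

# Training would minimize e.g. nn.MSELoss() between predicted and true audio features.
```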

49

Ephrat, Ariel, and Shmuel Peleg. "Vid2Speech: Speech Reconstruction from Silent Video." ICASSP 2017.

Speech and Video: Vid2Speech

Not end-to-end.

50

Speech and Video: Vid2Speech

Ephrat, Ariel, and Shmuel Peleg. "Vid2Speech: Speech Reconstruction from Silent Video." ICASSP 2017.

51

Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: Sentence-level Lipreading." arXiv preprint arXiv:1611.01599 (2016).

Speech and Video: LipNet

52

Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: Sentence-level Lipreading." arXiv preprint arXiv:1611.01599 (2016).

Speech and Video: LipNet

53

Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." arXiv preprint arXiv:1611.05358 (2016).

Speech and Video: Watch, Listen, Attend & Spell

54

Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." arXiv preprint arXiv:1611.05358 (2016).

Speech and Video: Watch, Listen, Attend & Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn more: Russ Salakhutdinov, "Multimodal Machine Learning" (NIPS 2015 Workshop).

59

Thanks! Q&A. Follow me at:

https://imatge.upc.edu/web/people/xavier-giro

DocXavi / ProfessorXavi

2

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

3

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

4

5

Language amp Vision Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Representation or Embedding

Lectures D3L4 amp D4L2 by Marta Ruiz on Neural Machine Translation

6

Language amp Vision Encoder-Decoder

7

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

8

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into account image features in the first hidden state

Multimodal Recurrent Neural Network

9

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

10

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

11

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

12

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

3

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

4

5

Language amp Vision Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Representation or Embedding

Lectures D3L4 amp D4L2 by Marta Ruiz on Neural Machine Translation

6

Language amp Vision Encoder-Decoder

7

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

8

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into account image features in the first hidden state

Multimodal Recurrent Neural Network

9

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

10

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

11

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

12

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

4

5

Language amp Vision Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Representation or Embedding

Lectures D3L4 amp D4L2 by Marta Ruiz on Neural Machine Translation

6

Language amp Vision Encoder-Decoder

7

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

8

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into account image features in the first hidden state

Multimodal Recurrent Neural Network

9

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

10

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

11

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

12

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

5

Language amp Vision Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Representation or Embedding

Lectures D3L4 amp D4L2 by Marta Ruiz on Neural Machine Translation

6

Language amp Vision Encoder-Decoder

7

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

8

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into account image features in the first hidden state

Multimodal Recurrent Neural Network

9

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

10

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

11

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

12

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

6

Language amp Vision Encoder-Decoder

7

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

8

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into account image features in the first hidden state

Multimodal Recurrent Neural Network

9

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

10

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

11

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

12

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object & scene recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet
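A toy sketch of this teacher-student setup (the real SoundNet is a much deeper 1D convolutional network and also distills a scene classifier; every size here is a placeholder): the audio network is trained, with a KL-divergence loss, to match the class posteriors that a pretrained vision CNN assigns to the accompanying frames, so the training videos themselves never need labels.

    import torch
    import torch.nn.functional as F

    audio_net = torch.nn.Sequential(                 # stand-in for the 1D conv stack over raw audio
        torch.nn.Conv1d(1, 16, kernel_size=64, stride=2), torch.nn.ReLU(),
        torch.nn.AdaptiveAvgPool1d(1), torch.nn.Flatten(),
        torch.nn.Linear(16, 1000),                   # ImageNet-style class scores
    )

    def soundnet_loss(raw_audio, teacher_logits):
        # raw_audio: (B, 1, T) waveform; teacher_logits: (B, 1000) from the vision CNN teacher
        student_logp = F.log_softmax(audio_net(raw_audio), dim=-1)
        teacher_p = F.softmax(teacher_logits, dim=-1)
        return F.kl_div(student_logp, teacher_p, reduction='batchmean')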

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of SoundNet are used to train a standard SVM classifier that outperforms the state of the art

Audio and Video Soundnet
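A sketch of that evaluation protocol with scikit-learn; the random arrays below are hypothetical stand-ins for the hidden-layer activations of each clip and its scene/object label.

    import numpy as np
    from sklearn.svm import LinearSVC

    X_train = np.random.randn(500, 1024)        # assumed: one SoundNet hidden-layer vector per clip
    y_train = np.random.randint(0, 10, 500)     # assumed: scene/object class per clip

    clf = LinearSVC(C=1.0)                      # standard linear SVM on top of the frozen features
    clf.fit(X_train, y_train)
    predictions = clf.predict(np.random.randn(5, 1024))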

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated with the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorization

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learning to synthesize sounds from videos of people hitting objects with a drumstick
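A very loose sketch of the idea (all sizes are assumed, and the paper's actual pipeline predicts sound features that a separate synthesis step turns into a waveform, which is why it is not end-to-end): an RNN reads per-frame visual features and regresses a sequence of sound features.

    import torch

    visual_dim, sound_dim, hid = 512, 42, 256            # assumed dimensions
    rnn = torch.nn.LSTM(visual_dim, hid, batch_first=True)
    to_sound = torch.nn.Linear(hid, sound_dim)

    frame_feats = torch.randn(1, 30, visual_dim)          # CNN features of 30 video frames
    states, _ = rnn(frame_feats)                          # one hidden state per frame
    sound_feats = to_sound(states)                        # (1, 30, sound_dim) predicted sound features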

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Not end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a silent video

Audio feature
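A rough sketch of this regression setup (layer sizes, input resolution and the dimensionality of the acoustic feature are assumptions, not the paper's configuration): a small VGG-style CNN maps a frame of the silent video to the acoustic feature of the time-aligned audio, trained with a regression loss.

    import torch

    cnn = torch.nn.Sequential(
        torch.nn.Conv2d(3, 32, 3, padding=1), torch.nn.ReLU(), torch.nn.MaxPool2d(2),
        torch.nn.Conv2d(32, 64, 3, padding=1), torch.nn.ReLU(), torch.nn.AdaptiveAvgPool2d(1),
        torch.nn.Flatten(),
        torch.nn.Linear(64, 18),                  # assumed 18-d acoustic feature per frame
    )

    frame = torch.randn(1, 3, 128, 128)           # a frame from a silent video (assumed size)
    audio_feature = cnn(frame)                    # regressed against the true audio feature (e.g. MSE)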

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

Not end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv:1611.01599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv:1611.01599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv:1611.05358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv:1611.05358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn more: Ruslan Salakhutdinov, "Multimodal Machine Learning" (NIPS 2015 Workshop)

59

Thanks! Q&A. Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

7

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

8

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into account image features in the first hidden state

Multimodal Recurrent Neural Network

9

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

10

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

11

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

12

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

8

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into account image features in the first hidden state

Multimodal Recurrent Neural Network

9

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

10

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

11

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

12

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

9

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

10

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

11

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

12

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

10

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

11

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

12

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

11

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

12

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

12

Captioning Show Attend amp Tell

Xu Kelvin Jimmy Ba Ryan Kiros Kyunghyun Cho Aaron C Courville Ruslan Salakhutdinov Richard S Zemel and Yoshua Bengio Show Attend and Tell Neural Image Caption Generation with Visual Attention ICML 2015

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of SoundNet are used to train a standard SVM classifier that outperforms the state of the art.

Audio and Video Soundnet
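The evaluation protocol stated above can be sketched as follows, assuming a helper extract_soundnet_features that runs the pretrained network and returns one internal-layer activation vector per clip (faked here with random features so the sketch stays runnable); the feature size and the C value are assumptions.

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def extract_soundnet_features(waveforms):
    # Placeholder for running the pretrained SoundNet and keeping an
    # internal layer's activations as a fixed feature vector per clip.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(waveforms), 1024))

clips = [None] * 200                     # stand-ins for raw audio clips
labels = np.repeat(np.arange(10), 20)    # 10 acoustic scene classes
features = extract_soundnet_features(clips)
svm = LinearSVC(C=0.01)                  # standard linear SVM on top
print(cross_val_score(svm, features, labels, cv=5).mean())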

41

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated with the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorization

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learns to synthesize sounds from videos of people hitting objects with a drumstick.

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Not end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN (VGG)

Frame from a silent video

Audio feature
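A minimal Keras sketch of the frame-to-audio-feature regression shown above; the input size, the number of stacked frames and the size of the acoustic feature vector are assumptions, and the small VGG-style trunk only stands in for the actual Vid2Speech architecture.

import tensorflow as tf

n_acoustic = 18                                    # assumed acoustic feature size
frames = tf.keras.Input(shape=(128, 128, 5))       # assumed: 5 stacked grayscale frames
x = frames
for filters in (32, 64, 128):                      # small VGG-style convolutional trunk
    x = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.MaxPooling2D()(x)
x = tf.keras.layers.Flatten()(x)
x = tf.keras.layers.Dense(512, activation="relu")(x)
audio_features = tf.keras.layers.Dense(n_acoustic)(x)  # regressed per-frame audio features
vid2speech = tf.keras.Model(frames, audio_features)
vid2speech.compile(optimizer="adam", loss="mse")   # trained as a regression to real features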

49

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

Not end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51

Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv:1611.01599 (2016)

Speech and Video LipNet

52

Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv:1611.01599 (2016)

Speech and Video LipNet

53

Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv:1611.05358 (2016)

Speech and Video Watch Listen Attend & Spell

54

Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv:1611.05358 (2016)

Speech and Video Watch Listen Attend & Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn more: Ruslan Salakhutdinov "Multimodal Machine Learning" (NIPS 2015 Workshop)

59

Thanks! Q&A. Follow me at:

https://imatge.upc.edu/web/people/xavier-giro

DocXavi / ProfessorXavi

13

Captioning LSTM with image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

14

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

15

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

16

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

17

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

18

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

19

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

20

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learns to synthesize sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Not end-to-end
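A minimal sketch of why the pipeline is not end-to-end, under the assumption that a video model has already regressed a sound-feature vector for the impact: the output waveform is retrieved from a library of real recordings by nearest neighbour (example-based synthesis); all arrays below are toy data.

```python
# Illustrative example-based synthesis step (toy data, not the authors' code).
import numpy as np

rng = np.random.default_rng(0)
library_feats = rng.standard_normal((500, 64))                   # features of recorded impact sounds
library_waves = [rng.standard_normal(2205) for _ in range(500)]  # their waveforms (toy length)

def synthesize(predicted_feat):
    # Play back the recorded sound whose features best match the prediction.
    idx = int(np.argmin(np.linalg.norm(library_feats - predicted_feat, axis=1)))
    return library_waves[idx]

predicted = rng.standard_normal(64)    # would come from the video model
print("retrieved waveform length:", len(synthesize(predicted)))
```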

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

[Diagram: frame from a silent video → CNN (VGG) → audio features]
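A minimal Keras sketch of that frame-to-audio-features mapping, not the authors' architecture: a small CNN (standing in for the VGG-style network) regresses a short vector of audio features from a single frame; the resolution, layer sizes and feature dimension are assumptions.

```python
# Illustrative regression from one video frame to an audio-feature vector.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_audio_feats = 8                          # assumed size of the audio feature vector
model = keras.Sequential([
    keras.Input(shape=(128, 128, 3)),      # one frame of the silent video (toy size)
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(n_audio_feats),           # linear output: regression, not classification
])
model.compile(optimizer="adam", loss="mse")

# Toy (frame, audio-feature) pairs standing in for data extracted from real videos.
frames = np.random.rand(16, 128, 128, 3).astype("float32")
audio = np.random.rand(16, n_audio_feats).astype("float32")
model.fit(frames, audio, epochs=1, batch_size=4, verbose=0)
print(model.predict(frames[:1]).shape)     # -> (1, 8)
```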

49

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

Not end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51

Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv:1611.01599 (2016)

Speech and Video LipNet

52

Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv:1611.01599 (2016)

Speech and Video LipNet

53

Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv:1611.05358 (2016)

Speech and Video Watch Listen Attend amp Spell

54

Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv:1611.05358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn more: Ruslan Salakhutdinov, "Multimodal Machine Learning" (NIPS 2015 Workshop)

59

Thanks Q&A Follow me at

https://imatge.upc.edu/web/people/xavier-giro

DocXavi / ProfessorXavi


Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

21

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

22

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

23

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

24

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

25

Challenges Visual Question Answering

Visual Question Answering

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

26

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo BSc ETSETB 2016 [Keras]

Challenges Visual Question Answering

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

27

Vision and language Embeddings

Perronin F CVPR Tutorial on LSVR CVPRrsquo14 Output embedding for LSVR

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

28

Vision and language Embeddings

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

29

Vision and language EmbeddingsLecture D2L4 by Toni Bonafonteon Word Embeddings

Christopher Olah

Visualizing Representations

30

Vision and language Devise

One-hot encoding Embedding space

31

Vision and language Devise

A Krizhevsky I Sutskever GE Hinton ldquoImagenet classification with deep convolutional neural networksrdquo Part of Advances in Neural Information Processing Systems 25 (NIPS 2012)

32

Vision and language Devise

Frome Andrea Greg S Corrado Jon Shlens Samy Bengio Jeff Dean and Tomas Mikolov Devise A deep visual-semantic embedding model NIPS 2013

33

Vision and language Embedding

Viacutector Campos Degravelia Fernagravendez Jordi Torres Xavier Giroacute-i-Nieto Brendan Jou and Shih-Fu Chang (work under progress)

34

Learn moreJulia Hockenmeirer (UIUC) Vision to Language ( Microsoft Research)

35

Multimedia

Text

Audio

Vision

and ratings geolocation time stamps

36

Audio and Video Soundnet

Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Object amp Scenes recognition in videos by analysing the audio track (only)

37Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

38Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Videos for training are unlabeled Relies on CNNs trained on labeled images

Audio and Video Soundnet

39Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Audio and Video Soundnet

40Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Hidden layers of Soundnet are used to train a standard SVM classifier that outperforms state of the art

Audio and Video Soundnet

41Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

42Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

43Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video NIPS 2016

Visualization of the 1D filters over raw audio in conv1

Audio and Video Soundnet

44Aytar Yusuf Carl Vondrick and Antonio Torralba Soundnet Learning sound representations from unlabeled video In Advances in Neural Information Processing Systems pp 892-900 2016

Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7)

Audio and Video Soundnet

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

58

Learn more: Ruslan Salakhutdinov "Multimodal Machine Learning" (NIPS 2015 Workshop)

59

Thanks! Q&A Follow me at

https://imatge.upc.edu/web/people/xavier-giro

@DocXavi @ProfessorXavi


Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

45

Audio and Video Sonorizaton

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

Learn synthesized sounds from videos of people hitting objects with a drumstick

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

46

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

No end-to-end

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

47

Audio and Video Visual Sounds

Owens Andrew Phillip Isola Josh McDermott Antonio Torralba Edward H Adelson and William T Freeman Visually indicated sounds CVPR 2016

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

48Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

CNN(VGG)

Frame from a slient video

Audio feature

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

49Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

Speech and Video Vid2Speech

No end-to-end

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

50

Speech and Video Vid2Speech

Ephrat Ariel and Shmuel Peleg Vid2speech Speech Reconstruction from Silent Video ICASSP 2017

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

51Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

52Assael Yannis M Brendan Shillingford Shimon Whiteson and Nando de Freitas LipNet Sentence-level Lipreading arXiv preprint arXiv161101599 (2016)

Speech and Video LipNet

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

53Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

54Chung Joon Son Andrew Senior Oriol Vinyals and Andrew Zisserman Lip reading sentences in the wild arXiv preprint arXiv161105358 (2016)

Speech and Video Watch Listen Attend amp Spell

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

55

Conclusions

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

56

Conclusions

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

57

Conclusions

[course site]

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

58

Learn moreRuus Salakhutdinov ldquoMultimodal Machine Learningrdquo (NIPS 2015 Workshop)

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

59

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi