Learning from Text, Speech, and Vision...
Transcript of Learning from Text, Speech, and Vision...
![Page 1: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/1.jpg)
MultimodalityLearning from Text, Speech, and Vision
CMU 11-4/611 Natural Language Processing
Lecture 28April 14, 2020 Shruti Palaskar
![Page 2: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/2.jpg)
Outline
I. What is multimodality?
II. Types of modalities
III. Commonly used Models
IV. Multimodal Fusion and Representation Learning
V. Multimodal Tasks: Use Cases
2
![Page 3: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/3.jpg)
I. What is Multimodality?
3
![Page 4: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/4.jpg)
Human Interaction is Inherently Multimodal
4
![Page 5: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/5.jpg)
How We Perceive
5
![Page 6: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/6.jpg)
How We Perceive
6
![Page 7: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/7.jpg)
The Dream: Sci-Fi Movies
7
JARVIS The Matrix
![Page 8: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/8.jpg)
Reality?
8
![Page 9: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/9.jpg)
Give a caption.
9
![Page 10: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/10.jpg)
Give a caption.
10
Human: A Small Dogs Ears Stick Up As It Runs In The Grass.
Model: A Black And White Dog Is Running On Grass With A Frisbee In Its Mouth
![Page 11: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/11.jpg)
Single sentence image description -> Captioning
11
![Page 12: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/12.jpg)
Give a caption.
12
![Page 13: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/13.jpg)
Give a caption.
13
Human: A Young Girl In A White Dress Standing In Front Of A Fence And Fountain.
Model: Two Men Are Standing In Front Of A Fountain
![Page 14: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/14.jpg)
Reality?
14
![Page 15: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/15.jpg)
Watch the video and answer questions.
15
Q. is there only one person ?
Q. does she walk in with a towel around her neck ?
Q. does she interact with the dog ?
Q. does she drop the towel on the floor ?
QUESTIONS
![Page 16: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/16.jpg)
Watch the video and answer questions.
16
Q. is there only one person ?A. there is only one person and a dog .
Q. does she walk in with a towel around her neck ?A. she walks in from outside with the towel around her
neck .
Q. does she interact with the dog ?A. she does not interact with the dog
Q. does she drop the towel on the floor ?A. she dropped the towel on the floor at the end of the
video .
QUESTIONS
![Page 17: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/17.jpg)
Simple questions, simple answers -> Video Question Answering
17
![Page 18: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/18.jpg)
Reality? Baby Steps. Still a long way to go.
18
![Page 19: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/19.jpg)
...Challenges
Common challenges based on the tasks we just saw
- Training Dataset bias- Very complicated tasks- Lack of common sense reasoning within models- No world knowledge available like humans do
- Physics, Nature, Memory, Experience
How do we teach machines to perceive?
19
![Page 20: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/20.jpg)
Outline
I. What is multimodality?
II. Types of modalities
III. Commonly used Models
IV. Multimodal Fusion and Representation Learning
V. Multimodal Tasks: Use Cases
20
![Page 21: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/21.jpg)
II. Types of modalities
21
![Page 22: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/22.jpg)
Types of Modalities
22
IMAGE/VIDEO
SPEECH/AUDIO
TEXT
EMOTION/AFFECT/SENTIMENT
![Page 23: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/23.jpg)
Example Dataset: ImageNet
● Object Recognition● Image Tagging/Categorization● ~14M images● Knowledge Ontology● Hierarchical Tags
○ Mammal -> Placental -> Carnivore -> Canine -> Dog -> Working Dog -> Husky
Deng et al. 2009 23
![Page 24: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/24.jpg)
Example Dataset: How2 Dataset
● Speech● Video● English Transcript
● Portuguese Transcript● Summary
Sanabria et al. 2018 24
![Page 25: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/25.jpg)
Example Dataset: Open Pose
● Action Recognition● Pose Estimation● Human Dynamic● Body Dynamics
Wei et al. 2016 25
![Page 26: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/26.jpg)
III. Commonly Used Models
26
![Page 27: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/27.jpg)
Multilayer Perceptrons
27
Single Perceptron
![Page 28: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/28.jpg)
Multilayer Perceptrons
28Single Perceptron
![Page 29: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/29.jpg)
Multilayer Perceptrons: Uses in Multimedia
29
![Page 30: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/30.jpg)
Multilayer Perceptrons: Limitations
30
Limitation #1
Very large amount of input data samples (xi), which requires a gigantic amount of model parameters.
![Page 31: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/31.jpg)
Convolutional Neural Networks (CNNs)
31
Translation invariance: we can use same parameters to capture a specific “feature” in any area of the image. We can use different sets of parameters to capture different features.
These operations are equivalent to perform convolutions with different filters.
![Page 32: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/32.jpg)
Convolutional Neural Networks (CNNs)
32LeCun et al. 1998
![Page 33: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/33.jpg)
Convolutional Neural Networks (CNNs) for Image Encoding
33Krizhevsky et al. 2012
![Page 34: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/34.jpg)
Multilayer Perceptrons: Limitations
34
Limitation #1
Very large amount of input data samples (xi), which requires a gigantic amount of model parameters.
Limitation #2
Does not naturally handle input data of variable dimension
(eg. audio/video/word sequences)
![Page 35: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/35.jpg)
Recurrent Neural Networks
35
Build specific connections capturing the temporal evolution
→ Shared weights in time
![Page 36: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/36.jpg)
Recurrent Neural Networks
36
![Page 37: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/37.jpg)
Recurrent Neural Networks for Video Encoding
37
Combination is commonly implemented as a small NN on top of a pooling operation (e.g. max, sum, average).
Recurrent Neural Networks are well suited for processing sequences.
Donahue et al. 2015
![Page 38: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/38.jpg)
Attention Mechanism
38Olah and Cate 2016
![Page 39: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/39.jpg)
Loss Function: Softmax
39Slide by LP Morency
![Page 40: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/40.jpg)
IV. Multimodal Fusion & Representation Learning
40
![Page 41: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/41.jpg)
Fusion: Model Agnostic
41Slide by LP Morency
![Page 42: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/42.jpg)
Fusion: Model Based
42Slide by LP Morency
![Page 43: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/43.jpg)
Representation Learning: Encoder-Decoder
43
![Page 44: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/44.jpg)
Representation Learning
44
Word2Vec
Mikolov et al. 2013
![Page 45: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/45.jpg)
Representation Learning: RNNs
45Cho et al. 2014
![Page 46: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/46.jpg)
Representation Learning: Self-Supervised
46
![Page 47: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/47.jpg)
Representation Learning: Transfer Learning
47
![Page 48: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/48.jpg)
Representation Learning: Joint Learning
48
![Page 49: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/49.jpg)
Representation Learning: Joint Learning (Similarity)
49
![Page 50: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/50.jpg)
V. Common Tasks, Use Cases
50
![Page 51: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/51.jpg)
V. Common Tasks
1. Vision and Language2. Speech, Vision and Language3. Multimedia4. Emotion and Affect
51
● Image/Video Captioning● Visual Question Answering● Visual Dialog ● Video Summarization● Lip Reading● Audio Visual Speech Recognition● Visual Speech Synthesis● …
![Page 52: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/52.jpg)
1. Vision and Language Common Tasks
52
![Page 53: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/53.jpg)
Image Captioning
53Vinyals et al. 2015
![Page 54: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/54.jpg)
Image Captioning
54
Karpathy et al. 2015
Slides by Marc Bolaños
![Page 55: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/55.jpg)
Image Captioning: Show, Attend and Tell
55Xu et al. 2015
![Page 56: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/56.jpg)
Image Captioning and Detection
56Johnson et al. 2016
![Page 57: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/57.jpg)
Video Captioning
57Donahue et al. 2015
![Page 58: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/58.jpg)
Video Captioning
58
Pan et al. 2016
Slides by Marc Bolaños
![Page 59: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/59.jpg)
Visual Question Answering
59
![Page 60: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/60.jpg)
Visual Question Answering
60
![Page 61: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/61.jpg)
Visual Question Answering
61
![Page 62: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/62.jpg)
Visual Question Answering
62
![Page 63: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/63.jpg)
Visual Question Answering
63
![Page 64: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/64.jpg)
Video Summarization
on behalf of expert village my name is lizbeth muller and today we are going to show you how to make spanish omelet . i 'm going to dice a little bit of peppers here . i 'm not going to use a lot , i 'm going to use very very little . a little bit more then this maybe . you can use red peppers if you like to get a little bit color in your omelet . some people do and some people do n't . but i find that some of the people that are mexicans who are friends of mine that have a mexican she like to put red peppers and green peppers and yellow peppers in hers and with a lot of onions . that is the way they make there spanish omelets that is what she says . i loved it , it actually tasted really good . you are going to take the onion also and dice it really small . you do n't want big chunks of onion in there cause it is just pops out of the omelet . so we are going to dice the up also very very small . so we have small pieces of onions and peppers ready to go .
how to cut peppers to make a spanish omelette ; get expert tips and advice on making cuban breakfast recipes in this free cooking video .
Transcript (290 words on avg)
“Teaser” (33 words on avg)
~1.5 minutes of audio and video
![Page 65: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/65.jpg)
Video Summarization: Hierarchical Model
Palaskar et al. 2019
![Page 66: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/66.jpg)
Action Recognition
66
![Page 67: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/67.jpg)
2. Speech, Vision and Language Common Tasks
67
![Page 68: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/68.jpg)
Audio Visual Speech Recognition: Lip Reading
68
Assael et al. 2016
![Page 69: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/69.jpg)
Lip Reading: Watch, Listen, Attend and Spell
69
Chung et al. 2017
![Page 70: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/70.jpg)
3. Multimedia Common Tasks
70
![Page 71: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/71.jpg)
Multimedia Retrieval
71
![Page 72: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/72.jpg)
Multimedia Retrieval
72
![Page 73: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/73.jpg)
Multimedia Retrieval: Shared Multimodal Representation
73
![Page 74: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/74.jpg)
Multimedia Retrieval
74
![Page 75: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/75.jpg)
4. Emotion and Affect
75
![Page 76: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/76.jpg)
Affect Recognition: Emotion, Sentiment, Persuasion, Personality
76
![Page 77: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/77.jpg)
Outline
I. What is multimodality?
II. Types of modalities
III. Commonly used Models
IV. Multimodal Fusion and Representation Learning
V. Multimodal Tasks: Use Cases
77
![Page 78: Learning from Text, Speech, and Vision Multimodalitydemo.clab.cs.cmu.edu/NLP/S20/files/slides/28-multimodality.pdfLearning from Text, Speech, and Vision CMU 11-4/611 Natural Language](https://reader030.fdocuments.in/reader030/viewer/2022040616/5f14b30a042121689d10a586/html5/thumbnails/78.jpg)
Takeaways
● Lots of multimodal data generated everyday● Need automatic ways to understand it
○ Privacy○ Security○ Regulation○ Storage
● Different models used for different downstream tasks○ Highly open-ended research!
● Try it out for fun on Kaggle!
Thank [email protected] 78