Multimodal Machine Learning - piazza.com
Louis-Philippe Morency
Multimodal Machine Learning
Lecture 2.1: Basic Concepts – Neural Networks
* Original version co-developed with Tadas Baltrusaitis
Lecture Objectives
▪ Unimodal basic representations
  ▪ Visual, language and acoustic modalities
▪ Data-driven machine learning
  ▪ Training, validation and testing
  ▪ Example: K-nearest neighbor
▪ Linear classification
  ▪ Score function
  ▪ Two loss functions (cross-entropy and hinge loss)
▪ Neural networks
▪ Course project team formation
Multimodal Machine Learning
Verbal – Vocal – Visual
These challenges are non-exclusive.
Unimodal Basic Representations
Unimodal Classification – Visual Modality

Input observation x_i: a color image, where each pixel is represented in ℝ^d (d is the number of color channels; d = 3 for RGB).

Dog? Binary classification problem: label y_i ∈ Y = {0, 1}
Unimodal Classification – Visual Modality

Input observation x_i: a color image, where each pixel is represented in ℝ^d (d = 3 for RGB).

Dog -or- Cat -or- Duck -or- Pig -or- Bird? Multi-class classification problem: label y_i ∈ Y = {0, 1, 2, 3, …}
Unimodal Classification – Visual Modality

Input observation x_i: a color image, where each pixel is represented in ℝ^d (d = 3 for RGB).

Dog? Cat? Duck? Pig? Bird? Puppy? Multi-label (or multi-task) classification problem: label vector y_i ∈ Y^k = {0, 1}^k
Unimodal Classification – Visual Modality

Input observation x_i: a color image, where each pixel is represented in ℝ^d (d = 3 for RGB).

Weight? Height? Age? Distance? Happy? Multi-label (or multi-task) regression problem: label vector y_i ∈ Y^k = ℝ^k
Unimodal Classification β Language Modality
Wri
tte
n la
ng
ua
ge
Sp
ok
en
la
ng
ua
ge
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
Input observ
ation π
π
β¦
βone-hotβ vector
ππ = number of words in dictionary
Word-level
classification
Sentiment ?(positive or negative)
Part-of-speech ?(noun, verb,β¦)
Named entity ?(names of person,β¦)
10
Unimodal Classification β Language Modality
Wri
tte
n la
ng
ua
ge
Sp
ok
en
la
ng
ua
ge
0
1
0
0
1
0
1
0
0
0
0
1
0
0
0
0
1
0
0
0
Input observ
ation π
π
β¦
βbag-of-wordβ vector
ππ = number of words in dictionary
Document-level
classification
Sentiment ?(positive or negative)
11
Unimodal Classification β Language Modality
Wri
tte
n la
ng
ua
ge
Sp
ok
en
la
ng
ua
ge
0
1
0
0
1
0
1
0
0
0
0
1
0
0
0
0
1
0
0
0
Input observ
ation π
π
β¦
βbag-of-wordβ vector
ππ = number of words in dictionary
Utterance-level
classification
Sentiment ?(positive or negative)
12
Unimodal Classification – Acoustic Modality

Input observation x_i: a digitized acoustic signal, represented as a spectrogram.
• Sampling rates: 8–96 kHz
• Bit depth: 8, 16 or 24 bits
• Time window size: 20 ms
• Offset: 10 ms

Prediction task: Spoken word?
Unimodal Classification – Acoustic Modality

Input observation x_i: a sequence of spectrogram frames from the digitized acoustic signal.
• Sampling rates: 8–96 kHz
• Bit depth: 8, 16 or 24 bits
• Time window size: 20 ms
• Offset: 10 ms

Prediction tasks: Spoken word? Voice quality? Emotion?
Data-Driven Machine Learning

1. Dataset: Collection of labeled samples D = {x_i, y_i}
2. Training: Learn classifier on training set
3. Testing: Evaluate classifier on hold-out test set

Dataset = Training set + Test set
Simple Classifier?

Dataset: images, each labeled Basket -or- Dog -or- Kayak -or- Traffic light
Simple Classifier: Nearest Neighbor

Training: store the labeled examples (Basket -or- Dog -or- Kayak -or- Traffic light).
Nearest Neighbor Classifier
▪ Non-parametric approaches – key ideas:
  ▪ "Let the data speak for themselves"
  ▪ "Predict new cases based on similar cases"
  ▪ "Use multiple local models instead of a single global model"
▪ What is the complexity of the NN classifier w.r.t. a training set of N images and a test set of M images?
  ▪ At training time? O(1)
  ▪ At test time? O(N)
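The complexities above can be seen in a minimal pure-Python sketch (toy 2-D data, hypothetical labels): training just memorizes the set, while each prediction scans all N stored examples.

```python
# Minimal nearest-neighbor classifier (illustrative sketch, toy data).

def nn_train(examples):
    """Training is O(1) per example: just memorize the labeled data."""
    return list(examples)  # list of (feature_vector, label) pairs

def nn_predict(model, x):
    """Prediction is O(N): compare x against every stored example."""
    def l2_sq(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    _, label = min(model, key=lambda ex: l2_sq(ex[0], x))
    return label

# Toy 2-D points with two labels (hypothetical data).
model = nn_train([((0.0, 0.0), "dog"), ((1.0, 1.0), "cat")])
print(nn_predict(model, (0.2, 0.1)))  # nearest to (0, 0) -> "dog"
```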
Simple Classifier: Nearest Neighbor

Distance metrics:

L1 (Manhattan) distance: d1(x1, x2) = Σ_p |x1^p − x2^p|

L2 (Euclidean) distance: d2(x1, x2) = sqrt( Σ_p (x1^p − x2^p)^2 )

Which distance metric to use? First hyper-parameter!
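The two distance metrics can be written in a few lines of pure Python (vectors as plain lists; a toy example rather than the lecture's image vectors):

```python
import math

def d1(x1, x2):
    """L1 (Manhattan) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x1, x2))

def d2(x1, x2):
    """L2 (Euclidean) distance: square root of summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

print(d1([0, 0], [3, 4]))  # 7
print(d2([0, 0], [3, 4]))  # 5.0
```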
Definition of K-Nearest Neighbor

What value should we set K to? Second hyper-parameter!

[Figure: classification of a test point X under (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor]
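A K-nearest-neighbor prediction can be sketched as a majority vote among the K closest stored examples (pure Python; the data and the default squared-L2 distance are illustrative assumptions):

```python
from collections import Counter

def knn_predict(examples, x, k=3, dist=None):
    """Label x by majority vote among its k nearest stored examples."""
    if dist is None:  # default: squared Euclidean (L2) distance
        dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    neighbors = sorted(examples, key=lambda ex: dist(ex[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy data: two "dog" points near the origin, two "cat" points far away.
examples = [((0, 0), "dog"), ((0, 1), "dog"), ((5, 5), "cat"), ((5, 6), "cat")]
print(knn_predict(examples, (1, 1), k=3))  # two "dog" votes beat one "cat"
```

Both k and the distance function are the hyper-parameters that validation data is meant to select.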
Data-Driven Approach

1. Dataset: Collection of labeled samples D = {x_i, y_i}
2. Training: Learn classifier on training set
3. Validation: Select optimal hyper-parameters
4. Testing: Evaluate classifier on hold-out test set

Dataset = Training data + Validation data + Test data
Evaluation methods (for validation and testing)
▪ Holdout set: The available data set D is divided into two disjoint subsets:
  ▪ the training set Dtrain (for learning a model)
  ▪ the test set Dtest (for testing the model)
▪ Important: the training set should not be used in testing, and the test set should not be used in learning.
  ▪ An unseen test set provides an unbiased estimate of accuracy.
▪ The test set is also called the holdout set. (The examples in the original data set D are all labeled with classes.)
▪ This method is mainly used when the data set D is large.
▪ Holdout methods can also be used for validation.
Evaluation methods (for validation and testing)
▪ n-fold cross-validation: The available data is partitioned into n equal-size disjoint subsets.
▪ Use each subset as the test set and combine the remaining n−1 subsets as the training set to learn a classifier.
▪ The procedure is run n times, which gives n accuracies.
▪ The final estimated accuracy of learning is the average of the n accuracies.
▪ 10-fold and 5-fold cross-validation are commonly used.
▪ This method is used when the available data is not large.
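The n-fold procedure can be sketched as follows (pure Python; `train_fn` and `accuracy_fn` are hypothetical placeholders for any learner and metric):

```python
def n_fold_splits(data, n):
    """Partition data into n near-equal disjoint folds and yield
    (train, test) pairs: each fold serves as the test set exactly once."""
    folds = [data[i::n] for i in range(n)]
    for i in range(n):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

def cross_validate(data, n, train_fn, accuracy_fn):
    """Run train/evaluate on each split and average the n accuracies."""
    accs = [accuracy_fn(train_fn(tr), te) for tr, te in n_fold_splits(data, n)]
    return sum(accs) / n
```

Leave-one-out cross-validation (next slide) is simply `n_fold_splits(data, len(data))`.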
Evaluation methods (for validation and testing)
▪ Leave-one-out cross-validation: This method is used when the data set is very small.
▪ Each fold of the cross-validation has only a single test example, and all the rest of the data is used in training.
▪ If the original data has m examples, this is m-fold cross-validation.
▪ It is a special case of cross-validation.
Linear Classification: Scores and Loss

Linear classification (e.g., neural network):
1. Define a (linear) score function
2. Define the loss function (possibly nonlinear)
3. Optimization

Input: image (size: 32×32×3)
1) Score Function

What should be the prediction score for each label class (Dog? Cat? Duck? Pig? Bird?)

For a linear classifier: f(x_i; W, b) = W x_i + b

▪ x_i: input observation (ith element of the dataset), an image of size 32×32×3 flattened to [3072×1]
▪ W: weights [10×3072]
▪ b: bias vector [10×1]
▪ f: class scores [10×1]
▪ Parameters (W and b combined): [10×3073]
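The score function f(x_i; W, b) = W x_i + b is a matrix-vector product plus a bias, one score per class. A pure-Python sketch with toy sizes (3 classes, 4 features, standing in for the lecture's 10 classes × 3072 features; the weights are made-up numbers):

```python
def score(W, x, b):
    """Linear score function f(x; W, b) = W x + b.
    W: list of per-class weight rows; x: input vector; b: bias vector.
    Returns one score per class."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bj
            for row, bj in zip(W, b)]

# Hypothetical toy parameters: 3 classes, 4 input features.
W = [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 1]]
b = [0.5, -0.5, 0.0]
print(score(W, [1.0, 2.0, 3.0, 4.0], b))  # [1.5, 1.5, 7.0]
```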
Interpreting a Linear Classifier

The planar decision surface in data-space for the simple linear discriminant function: W x_i + b > 0

[Figure: decision hyperplane f(x) = 0, with the weight vector w normal to the surface and the offset determined by the bias b]
Some Notation Tricks – Multi-Label Classification

f(x_i; W, b) = W x_i + b  can be rewritten as  f(x_i; W) = W x_i

▪ Before: Input [3072×1], Weights [10×3072], Bias [10×1]
▪ After: Input [3073×1], Weights [10×3073]
▪ Add a "1" at the end of the input observation vector.
▪ The bias vector becomes the last column of the weight matrix.

W = [w_1 w_2 … w_k], one weight vector w_j per label.
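This "bias trick" can be checked numerically: folding each bias into the last column of its weight row and appending a 1 to the input leaves the scores unchanged (pure Python, hypothetical toy numbers; `score` is the linear score function from the slide):

```python
def score(W, x, b):
    """Linear score function f(x; W, b) = W x + b."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bj
            for row, bj in zip(W, b)]

def fold_bias(W, b):
    """Append each bias entry as the last column of its weight row."""
    return [row + [bj] for row, bj in zip(W, b)]

W = [[1.0, 2.0], [3.0, 4.0]]
b = [0.5, -0.5]
x = [1.0, 1.0]
Wb = fold_bias(W, b)                          # [2x2] -> [2x3]
folded = score(Wb, x + [1.0], [0.0, 0.0])     # input gains a trailing 1
print(folded == score(W, x, b))               # True
```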
Some Notation Tricks

General formulation of linear classifier: f(x_i; W, b)

"dog" linear classifier: f(x_i; W, b)_dog  or  f(x_i; w_dog, b_dog)

Linear classifier for label j: f(x_i; W, b)_j  or  f(x_i; w_j, b_j)
Interpreting Multiple Linear Classifiers

f(x_i; w_j, b_j) = w_j x_i + b_j

[Figure: the learned weight vectors w_j visualized as one image template per class of the CIFAR-10 object recognition dataset]
Linear Classification: 2) Loss Function
(or cost function or objective)

Example: for an input image x_i (size 32×32×3) with label y_i = 2 (dog), the score function f(x_i; W) produces one score per class:

Class      Score
2 (dog)     98.7
1 (cat)     45.6
0 (duck)   −12.3
3 (pig)     12.2
4 (bird)   −45.3

Multi-class problem: how to assign only one number L_i representing how "unhappy" we are about these scores?

The loss function quantifies the amount by which the prediction scores deviate from the actual values.

A first challenge: how to normalize the scores?
First Loss Function: Cross-Entropy Loss
(or logistic loss)

Logistic function: σ(a) = 1 / (1 + e^(−a))

σ(a) ranges from 0 to 1, with σ(0) = 0.5; a is the score function.
First Loss Function: Cross-Entropy Loss
(or logistic loss)

Logistic function: σ(a) = 1 / (1 + e^(−a))

Logistic regression (two classes):
P(y_i = "dog" | x_i; w) = σ(w^T x_i), where "dog" = true for the two-class problem.
First Loss Function: Cross-Entropy Loss
(or logistic loss)

Logistic regression (two classes): P(y_i = "dog" | x_i; w) = σ(w^T x_i)

Softmax function (multiple classes): P(y_i | x_i; W) = e^(s_{y_i}) / Σ_j e^(s_j)
First Loss Function: Cross-Entropy Loss
(or logistic loss)

Cross-entropy loss: L_i = −log( e^(s_{y_i}) / Σ_j e^(s_j) )

This minimizes the negative log likelihood of the correct class under the softmax function.
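The softmax-plus-negative-log-likelihood pipeline can be sketched in a few lines (pure Python; the max-shift is a standard numerical-stability trick, and the score values are made up):

```python
import math

def softmax(scores):
    """Normalize raw scores into probabilities that sum to 1.
    Shifting by the max score leaves the result unchanged but avoids
    overflow in exp()."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_loss(scores, y):
    """L_i = -log( e^{s_y} / sum_j e^{s_j} ) for correct class index y."""
    return -math.log(softmax(scores)[y])

scores = [2.0, 1.0, 0.1]                 # hypothetical class scores
print(cross_entropy_loss(scores, 0))     # small: class 0 already scores highest
print(cross_entropy_loss(scores, 2))     # larger loss for a low-scoring class
```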
Second Loss Function: Hinge Loss
(or max-margin loss or multi-class SVM loss)

L_i = Σ_{j ≠ y_i} max(0, s_j − s_{y_i} + Δ)

▪ L_i: loss due to example i
▪ The sum runs over all incorrect labels j.
▪ s_j − s_{y_i}: difference between an incorrect class score and the correct class score; the loss is zero once the correct score exceeds every incorrect score by the margin Δ.
Second Loss Function: Hinge Loss
(or max-margin loss or multi-class SVM loss)

Example: the margin Δ is a design choice, e.g. Δ = 10.
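A direct transcription of the hinge loss (pure Python; the example scores are hypothetical, with the correct class at index 0 and a margin of 10):

```python
def hinge_loss(scores, y, margin=1.0):
    """Multi-class SVM loss:
    L_i = sum over incorrect classes j of max(0, s_j - s_y + margin)."""
    return sum(max(0.0, s - scores[y] + margin)
               for j, s in enumerate(scores) if j != y)

# Class 0 (score 13) is correct; class 2 (score 11) violates the margin of 10.
print(hinge_loss([13.0, -7.0, 11.0], 0, margin=10.0))  # 8.0
```

Only the classes whose scores come within the margin of the correct class contribute to the loss; here class 1 (score −7) contributes nothing.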
Two Loss Functions
Basic Concepts: Neural Networks

Neural Networks – inspiration
▪ Made up of artificial neurons
Neural Networks – score function
▪ Made up of artificial neurons
▪ Linear function (dot product) followed by a nonlinear activation function
▪ Example: a Multi-Layer Perceptron

Basic NN building block
▪ Weighted sum followed by an activation function
▪ Weighted sum of the input: Wx + b
▪ Output: y = f(Wx + b), where f is the activation function
Neural Networks – activation function
▪ tanh: f(x) = tanh(x)
▪ Sigmoid: f(x) = (1 + e^(−x))^(−1)
▪ Linear: f(x) = ax + b
▪ ReLU (Rectified Linear Units): f(x) = max(0, x), smoothly approximated by log(1 + exp(x))
  ▪ Faster training – no gradient vanishing
  ▪ Induces sparsity
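The activation functions listed above are one-liners in Python (`math.tanh` is built in; the softplus line is the smooth ReLU approximation from the slide):

```python
import math

# Common activation functions from the slide.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def softplus(x):  # smooth approximation of ReLU
    return math.log(1.0 + math.exp(x))

print(sigmoid(0.0))            # 0.5
print(math.tanh(0.0))          # 0.0
print(relu(-3.0), relu(3.0))   # 0.0 3.0
```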
Multi-Layer Feedforward Network

Activation functions (individual layers):
f_{W1;b1}(x) = f(W1 x + b1)
f_{W2;b2}(x) = f(W2 x + b2)
f_{W3;b3}(x) = f(W3 x + b3)

Score function:
ŷ_i = g(x_i) = f_{W3;b3}( f_{W2;b2}( f_{W1;b1}(x_i) ) )

Loss function (e.g., Euclidean loss):
L_i = (g(x_i) − y_i)^2 = ( f_{W3;b3}( f_{W2;b2}( f_{W1;b1}(x_i) ) ) − y_i )^2
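The layer composition above can be sketched as a forward pass (pure Python; two layers instead of three for brevity, and all weights are hypothetical toy values):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def layer(W, b, x, act):
    """One layer f_{W;b}(x): activation applied elementwise to W x + b."""
    return [act(sum(wi * xi for wi, xi in zip(row, x)) + bj)
            for row, bj in zip(W, b)]

def mlp(x, params):
    """Compose the layers: g(x) = f_{W3;b3}(f_{W2;b2}(f_{W1;b1}(x))) etc."""
    h = x
    for W, b, act in params:
        h = layer(W, b, h, act)
    return h

def euclidean_loss(pred, target):
    """L_i = (g(x_i) - y_i)^2, summed over output dimensions."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

# Toy 2-2-1 network with made-up weights.
params = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], sigmoid),    # hidden layer
    ([[1.0, 1.0]],              [0.0],      lambda a: a) # linear output layer
]
pred = mlp([1.0, 2.0], params)
print(pred, euclidean_loss(pred, [1.0]))
```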
Neural Networks – inference and learning
▪ Inference (Testing)
  ▪ Use the score function: y = f(x; θ)
  ▪ Requires a trained model (parameters θ)
▪ Learning model parameters (Training)
  ▪ Loss function (L)
  ▪ Gradient
  ▪ Optimization