Multimodal Machine Learning - piazza.com


Transcript of Multimodal Machine Learning - piazza.com

Page 1:

Louis-Philippe Morency

Multimodal Machine Learning

Lecture 2.1: Basic Concepts –

Neural Networks

* Original version co-developed with Tadas Baltrusaitis

Page 2:

Lecture Objectives

β–ͺ Unimodal basic representations

β–ͺ Visual, language and acoustic modalities

β–ͺ Data-driven machine learning

β–ͺ Training, validation and testing

β–ͺ Example: K-nearest neighbor

β–ͺ Linear Classification

β–ͺ Score function

β–ͺ Two loss functions (cross-entropy and hinge loss)

β–ͺ Neural networks

β–ͺ Course project team formation

Page 3:

Multimodal Machine Learning

[Figure: the verbal, vocal and visual modalities and the associated multimodal challenges]

These challenges are non-exclusive.

Page 4:

Unimodal Basic Representations

Page 5:

Unimodal Classification – Visual Modality

Dog?

Binary classification problem

Color image: each pixel is represented in ℝ^d, where d is the number of colors (d = 3 for RGB)

Input observation π’™π’Š
Label 𝑦𝑖 ∈ 𝒴 = {0, 1}

Page 6:

Unimodal Classification – Visual Modality

Dog -or- Cat -or- Duck -or- Pig -or- Bird?

Multi-class classification problem

Each pixel is represented in ℝ^d, where d is the number of colors (d = 3 for RGB)

Input observation π’™π’Š
Label 𝑦𝑖 ∈ 𝒴 = {0, 1, 2, 3, …}

Page 7:

Unimodal Classification – Visual Modality

Dog? Cat? Duck? Pig? Bird? Puppy?

Multi-label (or multi-task) classification problem

Each pixel is represented in ℝ^d, where d is the number of colors (d = 3 for RGB)

Input observation π’™π’Š
Label vector π’šπ’Š ∈ 𝒴^π‘š = {0, 1}^π‘š

Page 8:

Unimodal Classification – Visual Modality

Weight? Height? Age? Distance? Happy?

Multi-label (or multi-task) regression problem

Each pixel is represented in ℝ^d, where d is the number of colors (d = 3 for RGB)

Input observation π’™π’Š
Label vector π’šπ’Š ∈ 𝒴^π‘š = ℝ^π‘š

Page 9:

Unimodal Classification – Language Modality

Written language / Spoken language

Input observation π’™π’Š: a β€œone-hot” vector, e.g. [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]; its dimension is the number of words in the dictionary

Word-level classification:
Sentiment? (positive or negative)
Part-of-speech? (noun, verb, …)
Named entity? (names of persons, …)
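A one-hot encoding can be sketched in a few lines of Python; the tiny dictionary below is a made-up example, not from the lecture:

```python
# Word-level one-hot encoding sketch (hypothetical 5-word dictionary).
dictionary = ["cat", "dog", "duck", "pig", "bird"]
word_to_index = {w: i for i, w in enumerate(dictionary)}

def one_hot(word):
    """Encode a word as a one-hot vector of dictionary size."""
    x = [0] * len(dictionary)
    x[word_to_index[word]] = 1
    return x

x_i = one_hot("duck")  # [0, 0, 1, 0, 0]
```

In practice the dictionary has tens of thousands of entries, so the vector is extremely sparse.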

Page 10:

Unimodal Classification – Language Modality

Written language / Spoken language

Input observation π’™π’Š: a β€œbag-of-words” vector, e.g. [0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0]; its dimension is the number of words in the dictionary

Document-level classification:
Sentiment? (positive or negative)
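A bag-of-words vector marks which dictionary words occur in a document. A minimal sketch (binary variant, matching the 0/1 vector above; the dictionary is hypothetical):

```python
# Document-level bag-of-words sketch (binary; word counts are also common).
dictionary = ["i", "love", "this", "movie", "hate", "boring"]
word_to_index = {w: i for i, w in enumerate(dictionary)}

def bag_of_words(document):
    """Mark which dictionary words occur in the document."""
    x = [0] * len(dictionary)
    for word in document.lower().split():
        if word in word_to_index:
            x[word_to_index[word]] = 1
    return x

x_i = bag_of_words("I love this movie")  # [1, 1, 1, 1, 0, 0]
```

Unlike the one-hot vector, several entries can be active at once, so word order is lost but word presence is kept.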

Page 11:

Unimodal Classification – Language Modality

Written language / Spoken language

Input observation π’™π’Š: a β€œbag-of-words” vector, e.g. [0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0]; its dimension is the number of words in the dictionary

Utterance-level classification:
Sentiment? (positive or negative)

Page 12:

Unimodal Classification – Acoustic Modality

Digitalized acoustic signal

β€’ Sampling rates: 8~96kHz

β€’ Bit depth: 8, 16 or 24 bits

β€’ Time window size: 20ms

β€’ Offset: 10ms

Spectrogram

Input observation π’™π’Š: a frame-level feature vector, e.g. [0.21 0.14 0.56 0.45 0.9 0.98 0.75 0.34 0.24 0.11 0.02]

Spoken word?

Page 13:

Unimodal Classification – Acoustic Modality

Digitalized acoustic signal

β€’ Sampling rates: 8~96kHz

β€’ Bit depth: 8, 16 or 24 bits

β€’ Time window size: 20ms

β€’ Offset: 10ms

Spectrogram

Input observation π’™π’Š: a sequence of frame-level feature vectors, e.g. [0.21 0.14 0.56 0.45 0.9 0.98 0.75 0.34 0.24 0.11 0.02], [0.24 0.26 0.58 0.9 0.99 0.79 0.45 0.34 0.24], …

Spoken word? Voice quality? Emotion?
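The framing described above (20 ms windows with a 10 ms offset) can be sketched as follows; the 16 kHz sampling rate is an assumed value within the 8~96 kHz range mentioned:

```python
# Split a digitized acoustic signal into overlapping frames:
# 20 ms window, 10 ms offset, assuming a 16 kHz sampling rate.
def frame_signal(signal, sample_rate=16000, window_ms=20, offset_ms=10):
    window = int(sample_rate * window_ms / 1000)   # 320 samples per frame
    hop = int(sample_rate * offset_ms / 1000)      # 160 samples between frames
    frames = []
    for start in range(0, len(signal) - window + 1, hop):
        frames.append(signal[start:start + window])
    return frames

signal = [0.0] * 16000          # 1 second of silence as toy input
frames = frame_signal(signal)   # 99 frames of 320 samples each
```

Each frame would then be turned into a feature vector (e.g. a spectrogram column) before classification.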

Page 14:

Data-Driven Machine Learning

Page 15:

Data-Driven Machine Learning

1. Dataset: Collection of labeled samples D: {π‘₯𝑖 , 𝑦𝑖}

2. Training: Learn classifier on training set

3. Testing: Evaluate classifier on hold-out test set

Dataset: split into a training set and a test set

Page 16:

Simple Classifier?

Dataset

Basket -or- Dog -or- Kayak -or- Traffic light?

Page 17:

Simple Classifier: Nearest Neighbor

Training

Basket -or- Dog -or- Kayak -or- Traffic light?

Page 18:

Nearest Neighbor Classifier

β–ͺ Non-parametric approachesβ€”key ideas:

β–ͺ β€œLet the data speak for themselves”

β–ͺ β€œPredict new cases based on similar cases”

β–ͺ β€œUse multiple local models instead of a single global model”

β–ͺ What is the complexity of the NN classifier w.r.t. a training set of N images and a test set of M images?

β–ͺ At training time? O(1)

β–ͺ At test time? O(N) per test image

Page 19:

Simple Classifier: Nearest Neighbor

Distance metrics

L1 (Manhattan) distance: 𝑑₁(π‘₯₁, π‘₯β‚‚) = Σⱼ |π‘₯₁ⱼ βˆ’ π‘₯β‚‚β±Ό|

L2 (Euclidean) distance: 𝑑₂(π‘₯₁, π‘₯β‚‚) = βˆšΞ£β±Ό (π‘₯₁ⱼ βˆ’ π‘₯β‚‚β±Ό)Β²

Which distance metric to use?

First hyper-parameter!
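A nearest-neighbor classifier using these two distance metrics can be sketched in plain Python (the 2-D toy data is illustrative):

```python
# Nearest-neighbor classifier with the two distance metrics from the slide.
import math

def d1(x1, x2):  # L1 (Manhattan) distance
    return sum(abs(a - b) for a, b in zip(x1, x2))

def d2(x1, x2):  # L2 (Euclidean) distance
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def nearest_neighbor(x, train_x, train_y, distance=d1):
    # O(1) training (just store the data); O(N) per test query.
    best = min(range(len(train_x)), key=lambda i: distance(x, train_x[i]))
    return train_y[best]

train_x = [[0, 0], [10, 10]]
train_y = ["cat", "dog"]
print(nearest_neighbor([1, 2], train_x, train_y, distance=d2))  # cat
```

Swapping `distance=d1` for `distance=d2` is exactly the first hyper-parameter choice above.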

Page 20:

Definition of K-Nearest Neighbor

What value should we set K?

Second hyper-parameter!

[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor]

Page 21:

Data-Driven Approach

1. Dataset: Collection of labeled samples D: {π‘₯𝑖 , 𝑦𝑖}

2. Training: Learn classifier on training set

3. Validation: Select optimal hyper-parameters

4. Testing: Evaluate classifier on hold-out test set

Training Data | Validation Data | Test Data
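The three-way split above can be sketched as follows; the 80/10/10 proportions and the shuffling seed are assumptions, not from the slides:

```python
# Shuffle the dataset, then carve out training / validation / test subsets.
import random

def split_dataset(dataset, train_frac=0.8, val_frac=0.1, seed=0):
    data = dataset[:]
    random.Random(seed).shuffle(data)   # fixed seed for reproducibility
    n_train = int(len(data) * train_frac)
    n_val = int(len(data) * val_frac)
    train = data[:n_train]
    val = data[n_train:n_train + n_val]
    test = data[n_train + n_val:]
    return train, val, test

dataset = [(f"x{i}", i % 2) for i in range(100)]  # toy (x_i, y_i) pairs
train, val, test = split_dataset(dataset)         # 80 / 10 / 10 samples
```

The validation set is used to pick hyper-parameters (step 3 above); the test set is touched only once, for the final evaluation.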

Page 22:

Evaluation methods (for validation and testing)

β–ͺ Holdout set: The available data set D is divided into two disjoint subsets:

β–ͺ the training set Dtrain (for learning a model)

β–ͺ the test set Dtest (for testing the model)

β–ͺ Important: the training set should not be used in testing and the test set should not be used in learning.

β–ͺ An unseen test set provides an unbiased estimate of accuracy.

β–ͺ The test set is also called the holdout set. (the examples in the original data set D are all labeled with classes.)

β–ͺ This method is mainly used when the data set D is large.

β–ͺ Holdout methods can also be used for validation

Page 23:

Evaluation methods (for validation and testing)

β–ͺ n-fold cross-validation: The available data is partitioned into n equal-size disjoint subsets.

β–ͺ Use each subset as the test set and combine the remaining n-1 subsets as the training set to learn a classifier.

β–ͺ The procedure is run n times, which gives n accuracies.

β–ͺ The final estimated accuracy of learning is the average of the n accuracies.

β–ͺ 10-fold and 5-fold cross-validations are commonly used.

β–ͺ This method is used when the available data is not large.
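The procedure above can be sketched as follows; `train_fn` and `eval_fn` are hypothetical callables standing in for any classifier's training and evaluation routines:

```python
# n-fold cross-validation: each fold serves once as the test set;
# the final estimate is the average of the n accuracies.
def n_fold_cross_validation(dataset, n, train_fn, eval_fn):
    folds = [dataset[i::n] for i in range(n)]  # n disjoint subsets
    accuracies = []
    for i in range(n):
        test_set = folds[i]
        train_set = [x for j, f in enumerate(folds) if j != i for x in f]
        model = train_fn(train_set)
        accuracies.append(eval_fn(model, test_set))
    return sum(accuracies) / n

# Toy usage: a "model" that always predicts the majority training label.
data = [(i, 0) for i in range(15)] + [(i, 1) for i in range(15, 20)]
train_fn = lambda d: max(set(y for _, y in d), key=[y for _, y in d].count)
eval_fn = lambda m, d: sum(1 for _, y in d if y == m) / len(d)
acc = n_fold_cross_validation(data, 5, train_fn, eval_fn)
```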

Page 24:

Evaluation methods (for validation and testing)

β–ͺ Leave-one-out cross-validation: This method is used when the data set is very small.

β–ͺ Each fold of the cross-validation has only a single test example and all the rest of the data is used in training.

β–ͺ If the original data has m examples, this is m-fold cross-validation.

β–ͺ It is a special case of cross-validation.

Page 25:

Linear Classification:

Scores and Loss

Page 26:

Linear Classification (e.g., neural network)

1. Define a (linear) score function
2. Define the loss function (possibly nonlinear)
3. Optimization

Image (size: 32Γ—32Γ—3)

Page 27:

1) Score Function

Dog? Cat? Duck? Pig? Bird?

What should be the prediction score for each label class?

For a linear classifier: 𝑓(π‘₯𝑖; π‘Š, 𝑏) = π‘Šπ‘₯𝑖 + 𝑏

β–ͺ Input observation π‘₯𝑖 (ith element of the dataset): an image of size 32Γ—32Γ—3, flattened to [3072Γ—1]
β–ͺ Weights π‘Š: [10Γ—3072]; bias vector 𝑏: [10Γ—1]
β–ͺ Class scores: [10Γ—1]
β–ͺ Parameters combined: [10Γ—3073]
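The score function can be sketched in plain Python; the 3-class, 2-feature shapes below are illustrative stand-ins for the 10Γ—3072 shapes on the slide:

```python
# Linear score function f(x_i; W, b) = W x_i + b, one score per class.
def score(W, x, b):
    """Return the class-score vector W x + b."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

W = [[0.2, -0.5], [1.5, 1.3], [0.0, 0.25]]  # 3 classes, 2 features
b = [1.1, -0.3, 0.4]
x = [2.0, 1.0]
print(score(W, x, b))  # β‰ˆ [1.0, 4.0, 0.65]
```

Each row of W is one class's classifier; the highest score wins.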

Page 28:

Interpreting a Linear Classifier

The planar decision surface in data-space for the simple linear discriminant function π‘Šπ‘₯𝑖 + 𝑏 > 0.

[Figure: decision hyperplane 𝑓(π‘₯) = 0, with normal vector 𝑀 and offset βˆ’π‘/β€–π‘€β€–]

Page 29:

Some Notation Tricks – Multi-Label Classification

𝑓 π‘₯𝑖;π‘Š, 𝑏 = π‘Šπ‘₯𝑖 + 𝑏 𝑓 π‘₯𝑖;π‘Š = π‘Šπ‘₯𝑖

InputWeights Bias

[3072x1][10x3072] [10x1]

x + InputWeights

[3073x1][10x3073]

x

Add a β€œ1” at the

end of the input

observation vector

The bias vector will

be the last column of

the weight matrix

π‘Š = π‘Š1 π‘Š2 … π‘Šπ‘

Page 30:

Some Notation Tricks

𝑓 π‘₯𝑖;π‘Š, 𝑏General formulation of linear classifier:

β€œdog” linear classifier:

𝑓 π‘₯𝑖;π‘Š, 𝑏 π‘‘π‘œπ‘” π‘“π‘‘π‘œπ‘”

𝑓 π‘₯𝑖;π‘Šπ‘‘π‘œπ‘”, π‘π‘‘π‘œπ‘” or

or

Linear classifier for label j:

𝑓 π‘₯𝑖;π‘Š, 𝑏 𝑗 𝑓𝑗

𝑓 π‘₯𝑖;π‘Šπ‘— , 𝑏𝑗 or

or

Page 31:

Interpreting Multiple Linear Classifiers

𝑓 π‘₯𝑖;π‘Šπ‘— , 𝑏𝑗 = π‘Šπ‘—π‘₯𝑖 + 𝑏𝑗

CIFAR-10 object

recognition datasetπ‘“π‘π‘Žπ‘Ÿ

π‘“π‘‘π‘’π‘’π‘Ÿ

π‘“π‘Žπ‘–π‘Ÿπ‘π‘™π‘Žπ‘›π‘’

Page 32:

Linear Classification: 2) Loss Function

(or cost function or objective)

𝑓 π‘₯𝑖;π‘Š

2 (dog) ?

1 (cat) ?

0 (duck) ?

3 (pig) ?

4 (bird) ?(Size: 32*32*3)

Image

98.7

45.6

-12.3

12.2

-45.3

Scores

π‘₯𝑖

Label

𝑦𝑖 = 2 (π‘‘π‘œπ‘”)

Loss

𝐿𝑖 = ?

Multi-class problem

How to assign

only one number

representing

how β€œunhappy”

we are about

these scores?

The loss function quantifies the amount by which

the prediction scores deviate from the actual values.

A first challenge: how to normalize the scores?

Page 33:

First Loss Function: Cross-Entropy Loss

(or logistic loss)

Logistic function: 𝜎(𝑓) = 1 / (1 + 𝑒^βˆ’π‘“)

[Figure: sigmoid curve rising from 0 to 1, with 𝜎(𝑓) = 0.5 at 𝑓 = 0; 𝑓 is the score function]

Page 34:

First Loss Function: Cross-Entropy Loss

(or logistic loss)

Logistic function: 𝜎(𝑓) = 1 / (1 + 𝑒^βˆ’π‘“)

Logistic regression (two classes):
𝑝(𝑦𝑖 = β€œdog” | π‘₯𝑖; 𝑀) = 𝜎(𝑀ᡀπ‘₯𝑖)   (β€œdog” = true for the two-class problem)

[Figure: sigmoid curve rising from 0 to 1, with 𝜎(𝑓) = 0.5 at 𝑓 = 0]

Page 35:

First Loss Function: Cross-Entropy Loss

(or logistic loss)

Logistic function: 𝜎(𝑓) = 1 / (1 + 𝑒^βˆ’π‘“)

Logistic regression (two classes): 𝑝(𝑦𝑖 = β€œdog” | π‘₯𝑖; 𝑀) = 𝜎(𝑀ᡀπ‘₯𝑖)

Softmax function (multiple classes): 𝑝(𝑦𝑖 | π‘₯𝑖; π‘Š) = 𝑒^(𝑓_𝑦𝑖) / Σ𝑗 𝑒^(𝑓𝑗)

Page 36:

First Loss Function: Cross-Entropy Loss

(or logistic loss)

Cross-entropy loss: 𝐿𝑖 = βˆ’log( 𝑒^(𝑓_𝑦𝑖) / Σ𝑗 𝑒^(𝑓𝑗) )

The term inside the log is the softmax function; minimizing this loss is minimizing the negative log-likelihood.
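A sketch of the softmax function and the cross-entropy loss, reusing the toy scores from the earlier slide (label 𝑦𝑖 = 2, β€œdog”):

```python
# Softmax turns raw class scores into probabilities; cross-entropy
# penalizes a low probability for the correct class.
import math

def softmax(scores):
    shifted = [s - max(scores) for s in scores]  # for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_loss(scores, y):
    """L_i = -log( e^{f_y} / sum_j e^{f_j} )."""
    return -math.log(softmax(scores)[y])

scores = [-12.3, 45.6, 98.7, 12.2, -45.3]  # classes 0..4, dog = 2
loss = cross_entropy_loss(scores, 2)       # near 0: "dog" dominates
```

Subtracting the max score before exponentiating does not change the probabilities but avoids overflow, which answers the normalization challenge raised above.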

Page 37:

Second Loss Function: Hinge Loss

𝐿𝑖 = Σ𝑗≠𝑦𝑖 max(0, 𝑓𝑗 βˆ’ 𝑓_𝑦𝑖 + Ξ”)

β–ͺ 𝐿𝑖: loss due to example i
β–ͺ Σ𝑗≠𝑦𝑖: sum over all incorrect labels
β–ͺ 𝑓𝑗 βˆ’ 𝑓_𝑦𝑖: difference between the correct class score and incorrect class score, compared against a margin Ξ”

(or max-margin loss or Multi-class SVM loss)

Page 38:

Second Loss Function: Hinge Loss

Example: the margin Ξ” is a hyper-parameter, e.g. 10

(or max-margin loss or Multi-class SVM loss)
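A sketch of the multi-class hinge loss; the default margin Ξ” = 1 is a common choice, not a value fixed by the slides:

```python
# Multi-class hinge (max-margin / SVM) loss:
# L_i = sum over incorrect labels j of max(0, f_j - f_y + delta).
def hinge_loss(scores, y, delta=1.0):
    f_y = scores[y]  # correct class score
    return sum(max(0.0, f_j - f_y + delta)
               for j, f_j in enumerate(scores) if j != y)

scores = [13.0, -7.0, 11.0]
loss = hinge_loss(scores, y=0)  # max(0,-7-13+1) + max(0,11-13+1) = 0
```

Unlike cross-entropy, the loss is exactly zero once every incorrect score trails the correct one by at least the margin.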

Page 39:

Two Loss Functions

Page 40:

Basic Concepts:

Neural Networks

Page 41:

Neural Networks – inspiration

β–ͺ Made up of artificial neurons

Page 42:

Neural Networks – score function

β–ͺ Made up of artificial neurons

β–ͺ Linear function (dot product) followed by a nonlinear activation function

β–ͺ Example: a Multi-Layer Perceptron

Page 43:

Basic NN building block

β–ͺ Weighted sum followed by an activation function

Input β†’ Weighted sum (π‘Šπ‘₯ + 𝑏) β†’ Activation function β†’ Output

𝑦 = 𝑓(π‘Šπ‘₯ + 𝑏)

Page 44:

Neural Networks – activation function

β–ͺ 𝑓 π‘₯ = tanh π‘₯

β–ͺ Sigmoid - 𝑓 π‘₯ = (1 + π‘’βˆ’π‘₯)βˆ’1

β–ͺ Linear – 𝑓 π‘₯ = π‘Žπ‘₯ + 𝑏

β–ͺ ReLUβ–ͺ Rectifier Linear Units

β–ͺ Faster training - no gradient vanishing

β–ͺ Induces sparsity

𝑓 π‘₯ = max 0, π‘₯ ~log(1 + exp(π‘₯) )

Page 45:

Multi-Layer Feedforward Network

π‘Š3

π‘Š2π‘Š1

𝑦𝑖π‘₯𝑖𝑓2;π‘Š2

π‘₯ = 𝜎(π‘Š2π‘₯ + 𝑏2)

𝑦𝑖 = 𝑓 π‘₯𝑖 = 𝑓3;π‘Š3(𝑓2;π‘Š2

(𝑓1;π‘Š1π‘₯𝑖))

𝑓3;π‘Š3π‘₯ = 𝜎(π‘Š3π‘₯ + 𝑏3)

Score function

Activation functions (individual layers)

𝐿𝑖 = (𝑓 π‘₯𝑖 βˆ’ 𝑦𝑖)2 = (𝑓3;π‘Š3

(𝑓2;π‘Š2(𝑓1;π‘Š1

π‘₯𝑖)) )2

Loss function (e.g., Euclidean loss)

𝑓1;π‘Š1π‘₯ = 𝜎(π‘Š1π‘₯ + 𝑏1)

Page 46:

Neural Networks inference and learning

β–ͺ Inference (Testing)

β–ͺ Use the score function (𝑦 = 𝑓(𝒙; π‘Š))

β–ͺ Have a trained model (parameters π‘Š)

β–ͺ Learning model parameters (Training)

β–ͺ Loss function (𝐿)

β–ͺ Gradient

β–ͺ Optimization