Advanced Multimodal Machine Learning
Transcript of Advanced Multimodal Machine Learning
Louis-Philippe Morency
Lecture 4.1: Recurrent Networks
* Original version co-developed with Tadas Baltrusaitis
Administrative Stuff
Upcoming Schedule
▪ First project assignment:
  ▪ Proposal presentation (10/2 and 10/4)
  ▪ First project report (Sunday 10/7)
▪ Second project assignment:
  ▪ Midterm presentations (11/6 and 11/8)
  ▪ Midterm report (Sunday 11/11)
▪ Final project assignment:
  ▪ Final presentation (12/4 and 12/6)
  ▪ Final report (Sunday 12/9)
Proposal Presentation (10/2 and 10/4)
▪ 5 minutes (about 5-10 slides)
▪ All team members should be involved in the presentation
▪ Will receive feedback from instructors and other students
▪ 1-2 minutes between presentations reserved for written feedback
▪ Main presentation points:
  ▪ General research problem and motivation
  ▪ Dataset and input modalities
  ▪ Multimodal challenges and prior work
▪ You need to submit a copy of your slides (PDF or PPT)
  ▪ Deadline: Friday 10/5 (on Gradescope)
Project Proposal Report
▪ Part 1 (updated version of your pre-proposal)
  ▪ Research problem:
    ▪ Describe and motivate the research problem
    ▪ Define in generic terms the main computational challenges
  ▪ Dataset and Input Modalities:
    ▪ Describe the dataset(s) you are planning to use for this project.
    ▪ Describe the input modalities and annotations available in this dataset.
Project Proposal Report
▪ Part 2
  ▪ Related Work:
    ▪ Include 12-15 paper citations which give an overview of the prior work
    ▪ Present in more detail the 3-4 research papers most related to your work
  ▪ Research Challenges and Hypotheses:
    ▪ Describe your specific challenges and/or research hypotheses
    ▪ Highlight the novel aspects of your proposed research
Project Proposal Report
▪ Part 3
  ▪ Language Modality Exploration:
    ▪ Explore neural language models on your dataset (e.g., using Keras)
    ▪ Train at least two different language models (e.g., using SimpleRNN, GRU or LSTM) on your dataset and compare their perplexity.
    ▪ Include qualitative examples of successes and failure cases.
  ▪ Visual Modality Exploration:
    ▪ Explore pre-trained Convolutional Neural Networks (CNNs) on your dataset
    ▪ Load a pre-existing CNN model trained for object recognition (e.g., VGG-Net) and process your test images.
    ▪ Extract features at different layers of the network and visualize them (using t-SNE) with class labels overlaid in different colors.
Lecture Objectives
▪ Word representations & distributional hypothesis
▪ Learning neural representations (e.g., Word2vec)
▪ Language models and sequence modeling tasks
▪ Recurrent neural networks
▪ Backpropagation through time
▪ Gated recurrent neural networks
▪ Long Short-Term Memory (LSTM) model
Representing Words:
Distributed Semantics

Possible ways of representing words, given a text corpus containing 100,000 unique words:

Classic binary word representation: [0; 0; 0; 0; …; 0; 0; 1; 0; …; 0; 0]
  100,000d vector; only non-zero at the index of the word

Classic word feature representation: [5; 1; 0; 0; …; 0; 20; 1; 0; …; 3; 0]
  300d vector; manually define 300 “good” features (e.g., ends in “-ing”)

Learned word representation: [0.1; 0.0003; 0; …; 0.02; 0.08; 0.05]
  300d vector; this 300-dimensional vector should approximate the “meaning” of the word
The Distributional Hypothesis
▪ Distributional Hypothesis (DH) [Lenci 2008]
  ▪ At least certain aspects of the meaning of lexical expressions depend on their distributional properties in the linguistic contexts
  ▪ The degree of semantic similarity between two linguistic expressions is a function of the similarity of the linguistic contexts in which they can appear
▪ Weak and strong DH
  ▪ Weak view: a quantitative method for semantic analysis and lexical resource induction
  ▪ Strong view: a cognitive hypothesis about the form and origin of semantic representations, assuming that word distributions in context play a specific causal role in forming meaning representations.
What is the meaning of “bardiwac”?
▪ He handed her a glass of bardiwac.
▪ Beef dishes are made to complement the bardiwacs.
▪ Nigel staggered to his feet, face flushed from too much bardiwac.
▪ Malbec, one of the lesser-known bardiwac grapes, responds well to Australia’s sunshine.
▪ I dined off bread and cheese and this excellent bardiwac.
▪ The drinks were delicious: blood-red bardiwac as well as light, sweet Rhenish.
→ bardiwac is a heavy red alcoholic beverage made from grapes
Geometric interpretation
▪ The row vector x_dog describes the usage of the word dog in the corpus
▪ It can be seen as the coordinates of a point in n-dimensional Euclidean space R^n
[Stefan Evert 2010]
Distance and similarity
▪ Illustrated for two dimensions, get and use: x_dog = (115, 10)
▪ Similarity = spatial proximity (Euclidean distance)
▪ Location depends on frequency of noun (f_dog ≈ 2.7 · f_cat)
[Stefan Evert 2010]
Angle and similarity
▪ Direction is more important than location
▪ Normalise the “length” ||x_dog|| of the vector
▪ Or use the angle as a distance measure
[Stefan Evert 2010]
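The contrast between distance and angle can be checked numerically. A minimal sketch, with toy co-occurrence counts assumed for illustration (only x_dog = (115, 10) comes from the slide):

```python
import math

def euclidean(u, v):
    """Euclidean distance between two count vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    """Cosine similarity: depends only on the angle, not the length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda w: math.sqrt(sum(a * a for a in w))
    return dot / (norm(u) * norm(v))

x_dog = (115, 10)         # co-occurrence counts with "get" and "use"
x_cat = (43, 4)           # roughly x_dog scaled down (cat is ~2.7x rarer)
x_other = (10, 115)       # a word used in very different contexts

# Euclidean distance separates dog and cat (a pure frequency effect) ...
print(euclidean(x_dog, x_cat))
# ... while cosine similarity sees they point in nearly the same direction.
print(cosine(x_dog, x_cat))      # close to 1
print(cosine(x_dog, x_other))    # much lower
```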
Semantic maps
Learning Neural
Word Representations
How to learn neural word representations?
Distributional hypothesis: approximate the word meaning by its surrounding words.
Words used in a similar context will lie close together:
  He was walking away because …
  He was running away because …
Instead of capturing co-occurrence counts directly, predict the surrounding words of every word.
Example: the input x is the one-hot vector for “walking”; the network x → W1 → W2 → y is trained to predict the one-hot vectors of its context words (“He”, “was”, “away”, “because”).
No activation function → very fast
(Figure: 100,000d one-hot input x → W1 → 300d hidden layer → W2 → 100,000d output y.)
Word2vec algorithm: https://code.google.com/p/word2vec/
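The predict-the-context idea can be sketched as a tiny skip-gram trainer. This is an illustrative toy (the corpus, dimensions, learning rate, and plain softmax are all made up for the sketch), not the actual word2vec implementation:

```python
import numpy as np

# Toy corpus illustrating the skip-gram idea: predict surrounding words.
corpus = "he was walking away because he was running away because".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8              # vocabulary size, embedding dimension

rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.1, (V, D))   # input embeddings (the "word vectors")
W2 = rng.normal(0, 0.1, (D, V))   # output projection

# (center, context) pairs with a window of 1 word on each side
pairs = [(idx[corpus[i]], idx[corpus[j]])
         for i in range(len(corpus))
         for j in (i - 1, i + 1) if 0 <= j < len(corpus)]

lr = 0.1
for epoch in range(200):
    for c, o in pairs:
        h = W1[c]                     # no activation function -> very fast
        logits = h @ W2
        p = np.exp(logits - logits.max())
        p /= p.sum()                  # softmax over the vocabulary
        grad = p.copy()
        grad[o] -= 1.0                # d(cross-entropy)/d(logits)
        W2 -= lr * np.outer(h, grad)
        W1[c] -= lr * (W2 @ grad)

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# "walking" and "running" share contexts, so their vectors end up close.
print(cos(W1[idx["walking"]], W1[idx["running"]]))
```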
How to use these word representations
Classic NLP (with a vocabulary of 100,000 words: a 100,000-dimensional one-hot vector):
  Walking: [0; 0; 0; 0; …; 0; 0; 1; 0; …; 0; 0]
  Running: [0; 0; 0; 0; …; 0; 0; 0; 0; …; 1; 0]
  Similarity = 0.0
Goal (300-dimensional learned vectors):
  Walking: [0.1; 0.0003; 0; …; 0.02; 0.08; 0.05]
  Running: [0.1; 0.0004; 0; …; 0.01; 0.09; 0.05]
  Similarity = 0.9
Transform: x′ = x·W1 (W1 maps the 100,000d one-hot vector to 300d)
Vector space models of words
While learning these word representations, we are actually building a vector space in which all words reside, with certain relationships between them.
This vector space allows for algebraic operations:
  vec(king) − vec(man) + vec(woman) ≈ vec(queen)
It encodes both syntactic and semantic relationships.
Why does linear algebra work here?

Vector space models of words: semantic relationships
Trained on the Google news corpus with over 300 billion words
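The analogy arithmetic can be demonstrated with hand-made vectors. The values below are hypothetical, chosen so one axis encodes "royalty" and the other "gender"; real embeddings are learned, not constructed:

```python
import numpy as np

# Hand-made toy embeddings (hypothetical values, for illustration only).
vecs = {
    "king":  np.array([0.9, 0.9]),   # royal, male
    "queen": np.array([0.9, 0.1]),   # royal, female
    "man":   np.array([0.1, 0.9]),   # common, male
    "woman": np.array([0.1, 0.1]),   # common, female
    "apple": np.array([0.0, 0.5]),   # unrelated distractor
}

target = vecs["king"] - vecs["man"] + vecs["woman"]

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Nearest neighbor by cosine similarity, excluding the query words.
best = max((w for w in vecs if w not in ("king", "man", "woman")),
           key=lambda w: cos(vecs[w], target))
print(best)  # queen
```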
Language Sequence Modeling Tasks

Sequence Modeling: Sequence Label Prediction
  Sentence: Ideal for anyone with an interest in disguises
  Prediction: sentiment label? (positive or negative)
Sequence Modeling: Sequence Prediction
  Sentence: Ideal for anyone with an interest in disguises
  Prediction: a part-of-speech tag (noun, verb, …) for each word
Sequence Modeling: Sequence Representation
  Sentence: Ideal for anyone with an interest in disguises
  Representation learning: map the sequence to a vector, e.g., [0.1; 0.0004; 0; …; 0.01; 0.09; 0.05]
Sequence Modeling: Language Model
  Prefix: Ideal for anyone with
  Prediction: next word? (… an interest in disguises)
Application: Speech Recognition

$\widehat{words} = \arg\max_{words} P(words \mid acoustics)$
$= \arg\max_{words} \dfrac{P(acoustics \mid words)\, P(words)}{P(acoustics)}$
$= \arg\max_{words} P(acoustics \mid words)\, \underbrace{P(words)}_{\text{language model}}$
Application: Language Generation
An embedding vector (e.g., [0.1; 0.0004; …; 0.09; 0.05]) is decoded into a word sequence.
Example: image captioning
N-Gram Language Model Formulations
▪ Word sequences: $w_1^n = w_1 \ldots w_n$
▪ Chain rule of probability:
  $P(w_1^n) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1^2) \ldots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$
▪ Bigram approximation:
  $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$
▪ N-gram approximation:
  $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-N+1}^{k-1})$
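A minimal sketch of the bigram approximation, with counts estimated from a made-up two-sentence corpus:

```python
from collections import Counter

# Maximum-likelihood bigram model on a tiny toy corpus (illustrative only).
sentences = [["<s>", "dog", "on", "the", "beach", "</s>"],
             ["<s>", "dog", "on", "the", "grass", "</s>"]]

unigrams, bigrams = Counter(), Counter()
for sent in sentences:
    unigrams.update(sent[:-1])          # every word that can precede another
    bigrams.update(zip(sent, sent[1:]))  # adjacent word pairs

def p_bigram(w, prev):
    """P(w | prev) estimated from counts."""
    return bigrams[(prev, w)] / unigrams[prev]

def p_sentence(words):
    """Bigram approximation of the chain rule."""
    p, prev = 1.0, "<s>"
    for w in words:
        p *= p_bigram(w, prev)
        prev = w
    return p

# P(dog|<s>) P(on|dog) P(the|on) P(beach|the) P(</s>|beach)
print(p_sentence(["dog", "on", "the", "beach", "</s>"]))  # 0.5
```

Only P(beach|the) = 1/2 is below one here, since "the" is followed by "beach" in one sentence and "grass" in the other.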
Evaluating Language Models: Perplexity
The best language model is one that best predicts an unseen test set:
• It gives the highest P(sentence)
Perplexity is the inverse probability of the test set, normalized by the number of words:
  $PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N}$
Chain rule:
  $PP(W) = \sqrt[N]{\prod_{k=1}^{N} \frac{1}{P(w_k \mid w_1^{k-1})}}$
For bigrams:
  $PP(W) = \sqrt[N]{\prod_{k=1}^{N} \frac{1}{P(w_k \mid w_{k-1})}}$
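Perplexity can be computed directly from per-word probabilities; a small sketch (the probabilities below are made up for illustration), working in log space so long sequences do not underflow:

```python
import math

# P(w_k | w_{k-1}) for each word of a hypothetical 4-word test sequence.
word_probs = [0.2, 0.1, 0.25, 0.2]
N = len(word_probs)

# PP = (prod 1/p)^(1/N), computed as exp of the average negative log prob.
log_inv = -sum(math.log(p) for p in word_probs)
perplexity = math.exp(log_inv / N)
print(perplexity)
```

A lower perplexity means the model assigned higher probability to the test words.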
Challenges in Sequence Modeling
Tasks:
▪ Language model
▪ Sequence representation
▪ Sentiment (positive or negative)
▪ Part-of-speech (noun, verb, …)
Main challenges:
▪ Sequences of variable lengths (e.g., sentences)
▪ Keep the number of parameters at a minimum
▪ Take advantage of possible redundancy
Time-Delay Neural Network
Main challenges:
▪ Sequences of variable lengths (e.g., sentences)
▪ Keep the number of parameters at a minimum
▪ Take advantage of possible redundancy
A 1D convolution is applied over the sentence: Ideal for anyone with an interest in disguises
Neural-based Unigram Language Model (LM)
P(b|a) comes not from counts, but from a neural network that predicts the next word:
  P(“dog on the beach”) = P(dog|START) P(on|dog) P(the|on) P(beach|the)
Each step feeds the 1-of-N encoding of the current word (“START”, “dog”, “on”, “the”) into the same neural network, which outputs the probability of the next word (“dog”, “on”, “the”, “beach”).
Neural-based Unigram Language Model (LM)
Limitation: it does not model sequential information between predictions.
Recurrent Neural Networks!
Recurrent Neural
Networks
Sequence Prediction (or Unigram Language Model)
  Input data: x1, x2, x3, … (the xi are vectors)
  Output data: y1, y2, y3, … (the yi are vectors)
  Each input xi is mapped through Wi to a hidden state hi, and each hi through Wo to an output yi (no connections across time steps).
How can we include temporal dynamics?
Elman Network for Sequence Prediction (or Unigram Language Model)
  Input data: x1, x2, x3, … (the xi are vectors)
  Output data: y1, y2, y3, … (the yi are vectors)
  Starting from h0 = 0, a copy of the hidden state feeds the next step: h_t is computed from Wi x_t and Wh h_{t−1}, and y_t from Wo h_t.
  The same model parameters are used again and again.
Can be trained using backpropagation.
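A minimal forward pass of an Elman network, as a sketch (dimensions and weights below are arbitrary, and tanh/softmax are assumed as the activations):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out = 4, 5, 3
Wi = rng.normal(0, 0.1, (d_h, d_in))    # input -> hidden
Wh = rng.normal(0, 0.1, (d_h, d_h))     # hidden -> hidden (the recurrence)
Wo = rng.normal(0, 0.1, (d_out, d_h))   # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def elman_forward(xs):
    """Run the sequence through the network; the same Wi, Wh, Wo
    are reused at every time step."""
    h = np.zeros(d_h)                    # h0 = 0
    ys = []
    for x in xs:
        h = np.tanh(Wi @ x + Wh @ h)     # copy of h feeds the next step
        ys.append(softmax(Wo @ h))
    return ys

xs = [rng.normal(size=d_in) for _ in range(6)]
ys = elman_forward(xs)
print(len(ys), ys[0].shape)
```

The parameter count is independent of the sequence length, which is exactly the point of sharing Wi, Wh, Wo across time.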
Feedforward Neural Network (shown for comparison)
  $h^{(t)} = \tanh(U x^{(t)})$
  $\hat{y}^{(t)} = \mathrm{output}(h^{(t)}, V)$
  $L^{(t)} = -\log P(y = y^{(t)} \mid h^{(t)})$
Recurrent Neural Networks
  $h^{(t)} = \tanh(U x^{(t)} + W h^{(t-1)})$
  $\hat{y}^{(t)} = \mathrm{output}(h^{(t)}, V)$
  $L^{(t)} = -\log P(y = y^{(t)} \mid h^{(t)})$
  $L = \sum_t L^{(t)}$
Recurrent Neural Networks - Unrolling
The network is unrolled over time: each step t takes $x^{(t)}$ and produces $h^{(t)}$, $\hat{y}^{(t)}$ and loss $L^{(t)}$. The same model parameters ($U$, $V$, $W$) are used for all time steps.
  $h^{(t)} = \tanh(U x^{(t)} + W h^{(t-1)})$
  $\hat{y}^{(t)} = \mathrm{output}(h^{(t)}, V)$
  $L^{(t)} = -\log P(y = y^{(t)} \mid h^{(t)})$
  $L = \sum_t L^{(t)}$
RNN-based Language Model
The 1-of-N encodings of “START”, “dog”, “on”, “the” are fed in sequence; at each step the RNN outputs P(next word is “dog”), P(next word is “on”), P(next word is “the”), P(next word is “beach”).
• Models long-term information
RNN-based Sentence Generation (Decoder)
Same architecture, but the initial hidden state is set from a context vector; the model then generates “dog”, “on”, “the”, “beach” word by word, feeding each predicted word back as the next input.
• Models long-term information
Sequence Modeling: Sequence Label Prediction
  Sentence: Ideal for anyone with an interest in disguises
  Prediction: sentiment label? (positive or negative)
RNN for Sequence Prediction
Each word of “Ideal for anyone … disguises” gets its own prediction P(word is positive), and the loss averages over time:
  $L = \frac{1}{T} \sum_t L^{(t)} = \frac{1}{T} \sum_t -\log P(y = y^{(t)} \mid h^{(t)})$
RNN for Sequence Prediction
Alternatively, only the final hidden state is used to predict P(sequence is positive):
  $L = L^{(T)} = -\log P(y = y^{(T)} \mid h^{(T)})$
Sequence Modeling: Sequence Representation
  Sentence: Ideal for anyone with an interest in disguises
  Representation learning: map the sequence to a vector, e.g., [0.1; 0.0004; 0; …; 0.01; 0.09; 0.05]
RNN for Sequence Representation
(The language-model RNN from before: 1-of-N encodings in, next-word probabilities out.)
RNN for Sequence Representation (Encoder)
The 1-of-N encodings of the words are fed in sequence; the final hidden state is used as the sequence representation.
RNN-based Machine Translation
The 1-of-N encodings of “le”, “chien”, “sur”, “la”, “plage” are encoded, then decoded into the target sentence:
  Le chien sur la plage → The dog on the beach
Encoder-Decoder Architecture
The encoder reads the source words (“le”, “chien”, “sur”, “la”, “plage”) into a context vector, which initializes the decoder.
What is the loss function?
Related Topics
▪ Character-level “language models”
  ▪ Xiang Zhang, Junbo Zhao and Yann LeCun, Character-level Convolutional Networks for Text Classification, NIPS 2015. http://arxiv.org/pdf/1509.01626v2.pdf
▪ Skip-thought: embedding at the sentence level
  ▪ Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, Sanja Fidler. Skip-Thought Vectors, NIPS 2015. http://arxiv.org/pdf/1506.06726v1.pdf
Backpropagation
Through Time
Optimization: Gradient Computation
Vector representation: for $y = f(\mathbf{x})$ with $\mathbf{x} \in \mathbb{R}^3$:
  $\nabla_{\mathbf{x}} y = \left[ \frac{\partial y}{\partial x_1}, \frac{\partial y}{\partial x_2}, \frac{\partial y}{\partial x_3} \right]$
For a composition $y = g(\mathbf{h})$, $\mathbf{h} = f(\mathbf{x})$:
  $\nabla_{\mathbf{x}} y = \left(\frac{\partial \mathbf{h}}{\partial \mathbf{x}}\right)^{\top} \nabla_{\mathbf{h}} y$
  Gradient = “local” Jacobian × “backprop” gradient
  (the Jacobian $\partial \mathbf{h} / \partial \mathbf{x}$ is a matrix of size $h \times x$ computed using partial derivatives)
Backpropagation Algorithm
Forward pass
▪ Following the graph topology, compute the value of each unit
Backpropagation pass
▪ Initialize the output gradient to 1
▪ Compute the “local” Jacobian matrix using values from the forward pass
▪ Use the chain rule: gradient = “local” Jacobian × “backprop” gradient
Example network: $h_1 = f(x; W_1)$, $h_k = f(h_{k-1}; W_k)$, $\hat{y} = \mathrm{output}(h_K, W_o)$, with cross-entropy loss $L = -\log P(y \mid x)$.
Recurrent Neural Networks (recap)
  $h^{(t)} = \tanh(U x^{(t)} + W h^{(t-1)})$
  $\hat{y}^{(t)} = \mathrm{output}(h^{(t)}, V)$
  $L^{(t)} = -\log P(y = y^{(t)} \mid h^{(t)})$
  $L = \sum_t L^{(t)}$
Backpropagation Through Time
  $L = \sum_t L^{(t)} = -\sum_t \log P(y = y^{(t)} \mid h^{(t)})$
Gradient = “local” Jacobian × “backprop” gradient. With $z^{(t)} = V h^{(t)}$ the pre-softmax output:
  $\frac{\partial L}{\partial L^{(t)}} = 1$
  $\nabla_{z^{(t)}} L = \mathrm{softmax}(z^{(t)}) - \mathbf{1}_{y^{(t)}}$ (softmax minus the one-hot target)
  $\nabla_{h^{(T)}} L = V^{\top} \nabla_{z^{(T)}} L$ (at the last time step)
For earlier time steps, the gradient combines the path through the output and the path through the next hidden state:
  $\nabla_{h^{(t)}} L = \left(\frac{\partial z^{(t)}}{\partial h^{(t)}}\right)^{\top} \nabla_{z^{(t)}} L + \left(\frac{\partial h^{(t+1)}}{\partial h^{(t)}}\right)^{\top} \nabla_{h^{(t+1)}} L$
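The backward recursion through time can be sanity-checked on a toy model. A sketch using a scalar linear RNN (a deliberate simplification of the tanh network, with a made-up quadratic loss), verified against a finite-difference gradient:

```python
# BPTT for a scalar linear RNN h(t) = w*h(t-1) + x(t) with loss
# L = 0.5 * sum_t h(t)^2, checked against a numerical gradient.

def loss_and_grad(w, xs):
    hs, h = [], 0.0
    for x in xs:                  # forward pass, storing every h(t)
        h = w * h + x
        hs.append(h)
    L = 0.5 * sum(h * h for h in hs)

    dw, dh_next = 0.0, 0.0
    for t in reversed(range(len(xs))):
        # backprop gradient: path through the loss + path through h(t+1),
        # where d h(t+1) / d h(t) = w
        dh = hs[t] + w * dh_next
        h_prev = hs[t - 1] if t > 0 else 0.0
        dw += dh * h_prev         # accumulate over time (shared parameter)
        dh_next = dh
    return L, dw

xs = [1.0, -0.5, 2.0, 0.3]
w = 0.7
L, dw = loss_and_grad(w, xs)

eps = 1e-6                        # finite-difference check
num = (loss_and_grad(w + eps, xs)[0] - loss_and_grad(w - eps, xs)[0]) / (2 * eps)
print(dw, num)                    # the two values agree
```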
Backpropagation Through Time
Because the parameters $U$, $V$, $W$ are shared across time, their gradients sum the per-step contributions:
  $\nabla_{V} L = \sum_t \left(\frac{\partial z^{(t)}}{\partial V}\right)^{\top} \nabla_{z^{(t)}} L$
  $\nabla_{W} L = \sum_t \left(\frac{\partial h^{(t)}}{\partial W}\right)^{\top} \nabla_{h^{(t)}} L$
  $\nabla_{U} L = \sum_t \left(\frac{\partial h^{(t)}}{\partial U}\right)^{\top} \nabla_{h^{(t)}} L$
Gated Recurrent
Neural Networks
RNN for Sequence Prediction (recap)
Per-word prediction P(word is positive) with summed loss:
  $L = \sum_t L^{(t)} = \sum_t -\log P(y = y^{(t)} \mid h^{(t)})$
Sequence-level prediction P(sequence is positive) from the final state:
  $L = L^{(T)} = -\log P(y = y^{(T)} \mid h^{(T)})$
Recurrent Neural Networks (recap)
  $h^{(t)} = \tanh(U x^{(t)} + W h^{(t-1)})$
  $\hat{y}^{(t)} = \mathrm{output}(h^{(t)}, V)$
  $L^{(t)} = -\log P(y = y^{(t)} \mid h^{(t)})$
  $L = \sum_t L^{(t)}$
Long-term Dependencies
Vanishing gradient problem for RNNs:
• The influence of a given input on the hidden layer, and therefore on the network output, either decays or blows up exponentially as it cycles around the network’s recurrent connections.
  $h^{(t)} \sim \tanh(W h^{(t-1)})$
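A small numerical illustration of the decay case (the dimension and weight scale below are made up, chosen small enough that the gradient vanishes rather than explodes):

```python
import numpy as np

# For h(t) ~ tanh(W h(t-1)), backprop multiplies the gradient at each step
# by the local Jacobian diag(1 - h^2) W; its repeated product shrinks.
rng = np.random.default_rng(0)
d = 10
W = rng.normal(0, 0.1, (d, d))     # small weights -> gradients vanish

h = rng.normal(size=d)
grad = np.ones(d)                  # gradient arriving from the loss
norms = []
for t in range(20):
    h = np.tanh(W @ h)
    J = (1.0 - h ** 2)[:, None] * W    # local Jacobian of one step
    grad = J.T @ grad                  # backprop through one step
    norms.append(float(np.linalg.norm(grad)))

print(norms[0], norms[-1])         # the norm collapses toward zero
```

With larger weights the same product can instead blow up exponentially, which is the other half of the problem.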
Recurrent Neural Networks
In the standard RNN unit, $U x^{(t)} + W h^{(t-1)}$ passes through a tanh (squashing values to the range −1 to +1), and $h^{(t)}$ is sent both to the output and to the next time step.
LSTM Ideas: (1) “Memory” Cell and Self-Loop
Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber, 1997]
A memory cell $c^{(t)}$ with a self-loop is added next to the tanh unit, letting information persist across time steps.
LSTM Ideas: (2) Input and Output Gates [Hochreiter and Schmidhuber, 1997]
Sigmoid gates (values between 0 and +1) multiply the signals going into and out of the cell:
▪ Input gate: controls how much of the new (tanh) input is summed into the cell.
▪ Output gate: controls how much of the cell state is exposed as the hidden output.
LSTM Ideas: (3) Forget Gate [Gers et al., 2000]
A third sigmoid gate, the forget gate, scales the cell’s self-loop, letting the network reset its memory.
The candidate $g$ and the three gates $i$, $f$, $o$ are computed jointly from $x^{(t)}$ and $h^{(t-1)}$:
  $\begin{pmatrix} g \\ i \\ f \\ o \end{pmatrix} = \begin{pmatrix} \tanh \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \end{pmatrix} W \begin{pmatrix} x^{(t)} \\ h^{(t-1)} \end{pmatrix}$
  $c^{(t)} = f \odot c^{(t-1)} + i \odot g$
  $h^{(t)} = o \odot \tanh(c^{(t)})$
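The three equations above can be turned into a single LSTM step; a sketch with made-up dimensions and a random joint weight matrix W (bias terms are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 5
W = rng.normal(0, 0.1, (4 * d_h, d_in + d_h))   # computes g, i, f, o jointly

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev):
    z = W @ np.concatenate([x, h_prev])
    g = np.tanh(z[0 * d_h:1 * d_h])      # candidate (tanh, in [-1, 1])
    i = sigmoid(z[1 * d_h:2 * d_h])      # input gate  (in [0, 1])
    f = sigmoid(z[2 * d_h:3 * d_h])      # forget gate (in [0, 1])
    o = sigmoid(z[3 * d_h:4 * d_h])      # output gate (in [0, 1])
    c = f * c_prev + i * g               # gated self-loop + gated input
    h = o * np.tanh(c)                   # gated output
    return h, c

h, c = np.zeros(d_h), np.zeros(d_h)
for x in [rng.normal(size=d_in) for _ in range(3)]:
    h, c = lstm_step(x, h, c)
print(h.shape, c.shape)
```

Because the forget gate multiplies the cell state instead of squashing it through a nonlinearity, gradients along the cell path can survive many time steps.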
Recurrent Neural Network using LSTM Units
Each time step t = 1, …, T uses an LSTM unit (with shared parameters) in place of the simple tanh unit, producing $\hat{y}^{(t)}$ and loss $L^{(t)}$ as before.
The gradient can still be computed using backpropagation!
Bi-directional LSTM Network
Two LSTM chains process the sequence: a forward chain running t = 1, …, T and a backward chain running t = T, …, 1, each with its own parameters; their hidden states are combined to produce each output $\hat{y}^{(t)}$.
Deep LSTM Network
LSTM layers are stacked: the hidden states of the first LSTM layer serve as inputs to a second LSTM layer, whose hidden states produce the outputs $\hat{y}^{(t)}$.