Language Modelling · Variational Auto-encoder for Sentences · References
Deep Generative Language Models (DGM4NLP)
Miguel Rios, University of Amsterdam
May 2, 2019
Rios DGM4NLP 1 / 46
Outline
1 Language Modelling
2 Variational Auto-encoder for Sentences
Recap Generative Models of Word Representation
Discriminative embedding models: word2vec
In the event of a chemical spill, most children know they should evacuate as advised by people in charge.
Place words in Rd so as to answer questions like
“Have I seen this word in this context?”
Fit a binary classifier:
positive examples
negative examples
Recap Generative Models of Word Representation
The model processes a sentence and outputs a word representation:
Recap Generative Models of Word Representation
[Figure: graphical model for the generative word-representation model over the sentence pair “quickly evacuate the area / deje el lugar rapidamente”, with variables x, y, z, a, parameters θ, and plates of sizes m, n, and |B|.]
Recap Generative Models of Word Representation
[Figure: architecture. Inference model: x → 128-d embedding → 100-d BiRNN states h → u and s (100-d) → sample z (100-d). Generative model: f(z) → x and g(z, aj) → y.]
Introduction
you know nothing, jon x
ground control to major x
the x
the quick brown x
the quick brown fox x
the quick brown fox jumps x
the quick brown fox jumps over x
the quick brown fox jumps over the x
the quick brown fox jumps over the lazy x
the quick brown fox jumps over the lazy dog
Definition
Language models give us the probability of a sentence;
At each time step, they assign a probability to the next word.
Applications
Very useful in different tasks:
Speech recognition;
Spelling correction;
Machine translation;
LMs are useful in almost any task that deals with generating language.
Language Models
N-gram based LMs;
Log-linear LMs;
Neural LMs.
N-gram LM
x is a sequence of words:
x = x1, x2, x3, x4, x5 = you, know, nothing, jon, snow
N-gram LM
To compute the probability of a sentence:
p(x) = p(x1, x2, . . . , xn)   (1)
We apply the chain rule:
p(x) = ∏i p(xi | x1, . . . , xi−1)   (2)
We limit the history with a Markov order:
p(xi | x1, . . . , xi−1) ≈ p(xi | xi−4, xi−3, xi−2, xi−1)
[Jelinek and Mercer, 1980, Goodman, 2001]
N-gram LM
Chain rule:
p(x) = ∏i p(xi | x1, . . . , xi−1)   (3)
N-gram LM
We make a Markov assumption of conditional independence:
p(xi | x1, . . . , xi−1) ≈ p(xi | xi−1)   (4)
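The chain rule and the bigram Markov truncation above can be sketched as follows; the probabilities are made-up illustrative values, not estimates from any corpus:

```python
import math

# Hypothetical bigram probabilities p(x_i | x_{i-1}); "<s>" marks the sentence start.
# An unseen bigram would raise a KeyError here; smoothing addresses that (next slides).
bigram = {
    ("<s>", "you"): 0.1,
    ("you", "know"): 0.3,
    ("know", "nothing"): 0.05,
}

def sentence_logprob(words):
    """log p(x) = sum_i log p(x_i | x_{i-1}) under a first-order Markov model."""
    logp = 0.0
    for prev, cur in zip(["<s>"] + words, words):
        logp += math.log(bigram[(prev, cur)])
    return logp

lp = sentence_logprob(["you", "know", "nothing"])
# p(x) = 0.1 * 0.3 * 0.05 = 0.0015
```

Working in log space avoids underflow when the product runs over long sentences.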
N-gram LM
We make the Markov assumption: p(xi | x1, . . . , xi−1) ≈ p(xi | xi−1)
If we do not observe the bigram, p(xi | xi−1) is 0 and the probability of the sentence will be 0.
MLE:
pMLE(xi | xi−1) = count(xi−1, xi) / count(xi−1)   (5)
Laplace smoothing:
padd1(xi | xi−1) = (count(xi−1, xi) + 1) / (count(xi−1) + V)   (6)
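Equations (5) and (6) can be sketched directly from corpus counts. A toy illustration with an assumed three-sentence corpus:

```python
from collections import Counter

corpus = [["the", "dog", "barks"], ["the", "cat", "meows"], ["the", "dog", "sleeps"]]

unigram_counts = Counter()  # count(x_{i-1}): how often a word occurs as history
bigram_counts = Counter()   # count(x_{i-1}, x_i)
for sent in corpus:
    for prev, cur in zip(sent, sent[1:]):
        bigram_counts[(prev, cur)] += 1
        unigram_counts[prev] += 1

V = len({w for sent in corpus for w in sent})  # vocabulary size, here 6

def p_mle(cur, prev):
    # Eq. (5): relative frequency; zero for unseen bigrams.
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, cur)] / unigram_counts[prev]

def p_add1(cur, prev):
    # Eq. (6): Laplace smoothing moves some mass to unseen bigrams.
    return (bigram_counts[(prev, cur)] + 1) / (unigram_counts[prev] + V)

p_mle("dog", "the")   # 2/3: "the dog" occurs twice among three "the" contexts
p_mle("bird", "the")  # 0.0 under MLE
p_add1("bird", "the") # 1/9: nonzero after smoothing
```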
Log-linear LM
p(y|x) = exp(w · φ(x, y)) / Σy′∈Vy exp(w · φ(x, y′))   (7)
y is the next word and Vy is the vocabulary;
x is the history;
φ is a feature function that returns an n-dimensional vector;
w are the model parameters.
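Equation (7) is a softmax over weighted feature scores. A minimal sketch; the feature templates, weights, and vocabulary below are assumptions for illustration:

```python
import math

vocab = ["dog", "puppy", "car"]

def phi(history, y):
    # Hypothetical indicator features over (history, next-word) pairs.
    return [
        1.0 if history[-1] == "the" and y == "puppy" else 0.0,  # n-gram feature
        1.0 if y == "car" else 0.0,                             # word-identity feature
    ]

w = [2.0, -1.0]  # assumed weights, e.g. as learned by SGD

def p(y, history):
    """Eq. (7): exponentiated score, normalised over the whole vocabulary."""
    def score(cand):
        return sum(wi * fi for wi, fi in zip(w, phi(history, cand)))
    denom = sum(math.exp(score(cand)) for cand in vocab)
    return math.exp(score(y)) / denom

probs = {y: p(y, ["the"]) for y in vocab}
# probabilities sum to 1; "puppy" gets the most mass after "the"
```

Note the denominator sums over the full vocabulary, which is the expensive part of training log-linear and neural LMs.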
Log-linear LM
n-gram features: xj−1 = the and xj = puppy;
gappy n-gram features: xj−2 = the and xj = puppy;
class features: xj belongs to class ABC;
gazetteer features: xj is a place name.
Log-linear LM
With features of words and histories we can share statistical weight;
With n-grams, there is no sharing at all;
We also get smoothing for free;
We can add arbitrary features;
We use Stochastic Gradient Descent (SGD).
Neural LM
n-gram language models have proven to be effective in various tasks;
log-linear models allow us to share weights through features;
maybe our history is still too limited, e.g. n−1 words;
we need to find useful features.
Feed-forward NLM
With NNs we can exploit distributed representations to allow for statistical weight sharing.
How does it work:
1 each word is mapped to an embedding: an m-dimensional feature vector;
2 a probability function over word sequences is expressed in terms of these vectors;
3 we jointly learn the feature vectors and the parameters of the probability function.
[Bengio et al., 2003]
Feed-forward NLM
Similar words are expected to have similar feature vectors: (dog, cat), (running, walking), (bedroom, room)
With this, probability mass is naturally transferred from (1) to (2):
(1) The cat is walking in the bedroom.
(2) The dog is running in the room.
Take-away message: the presence of only one sentence in the training data will increase the probability of a combinatorial number of neighbours in sentence space.
Feed-forward NLM
FF-LM
Eyou, Eknow, Enothing ∈ R100
x = [Eyou; Eknow; Enothing] ∈ R300
y = W3 tanh(W1x + b1) + W2x + b2   (8)
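Equation (8) as a forward pass; the weights are randomly initialised stand-ins for trained parameters, and the hidden size (64) is an assumption (the slide fixes only the 100-d embeddings and 300-d input):

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_hid, vocab_size = 100, 64, 1000  # d_hid and vocab_size are assumed

E = rng.normal(0, 0.1, (vocab_size, d_emb))                  # embedding matrix
W1 = rng.normal(0, 0.1, (d_hid, 3 * d_emb)); b1 = np.zeros(d_hid)
W2 = rng.normal(0, 0.1, (vocab_size, 3 * d_emb)); b2 = np.zeros(vocab_size)
W3 = rng.normal(0, 0.1, (vocab_size, d_hid))

def ff_lm(history_ids):
    # x = [E_you; E_know; E_nothing]: concatenate the three history embeddings.
    x = np.concatenate([E[i] for i in history_ids])
    # y = W3 tanh(W1 x + b1) + W2 x + b2  (Eq. 8: hidden layer plus direct connection)
    y = W3 @ np.tanh(W1 @ x + b1) + W2 @ x + b2
    e = np.exp(y - y.max())          # softmax turns scores into next-word probabilities
    return e / e.sum()

probs = ff_lm([1, 2, 3])  # distribution over the next word given a 3-word history
```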
Feed-forward NLM
FF-LM
The non-linear activation functions perform feature combinations that a linear model cannot do;
End-to-end training on next-word prediction.
Feed-forward NLM
FF-LM
We now have much better generalisation, but still a limited history/context.
Recurrent neural networks have unlimited history.
RNN NLM
RNN-LM
Start: predict the second word from the first.
[Mikolov et al., 2010]
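A minimal sketch of the recurrence behind an RNN-LM, with an Elman-style tanh cell and random stand-in weights; all sizes are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, d_emb, d_hid = 50, 16, 32  # assumed sizes

E = rng.normal(0, 0.1, (vocab_size, d_emb))
W_hh = rng.normal(0, 0.1, (d_hid, d_hid))
W_xh = rng.normal(0, 0.1, (d_hid, d_emb))
W_out = rng.normal(0, 0.1, (vocab_size, d_hid))

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def rnn_lm(sentence_ids):
    """Return p(x_i | x_<i) at each position: the hidden state carries the full prefix."""
    h = np.zeros(d_hid)
    stepwise = []
    for word_id in sentence_ids:
        stepwise.append(softmax(W_out @ h))          # predict current word from prefix
        h = np.tanh(W_hh @ h + W_xh @ E[word_id])    # then consume it into the state
    return stepwise

dists = rnn_lm([4, 7, 9])
```

Unlike the feed-forward model, nothing here truncates the history: h summarises every word seen so far.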
Outline
1 Language Modelling
2 Variational Auto-encoder for Sentences
Sen VAE
Model observations as draws from the marginal of a DGM.
An NN maps from a latent sentence embedding z ∈ Rdz to a distribution p(x|z, θ) over sentences:
p(x|θ) = ∫ p(z) p(x|z, θ) dz = ∫ N(z|0, I) ∏_{i=1}^{|x|} Cat(xi | f(z, x<i; θ)) dz   (9)
[Bowman et al., 2015]
Sen VAE
[Figure: graphical model with latent z, observed sentence x1:m, generative parameters θ, variational parameters φ, plate of size N.]
Generative model:
Z ∼ N(0, I)
Xi | z, x<i ∼ Cat(fθ(z, x<i))
Inference model:
Z ∼ N(μφ(x1:m), σφ(x1:m)²)
Sen VAE
Generation is one word at a time without Markov assumptions, but f() conditions on z in addition to the observed prefix.
The conditional p(x|z, θ) is the decoder.
p(x|θ) is the marginal likelihood.
We train the model to assign high (marginal) probability to observations, like a LM.
However, the model uses the latent space to exploit neighbourhood and smoothness, capturing regularities in data space. For example, it may group sentences according to lexical choices, syntactic complexity, lexical semantics, etc.
Approximate Inference
The model has a diagonal Gaussian distribution as variational posterior:
qφ(z|x) = N(z | μφ(x), diag(σφ(x)))
With reparametrisation:
z = hφ(ε, x) = μφ(x) + σφ(x) ⊙ ε, where ε ∼ N(0, I)
Analytical KL:
KL[qφ(z|x) ‖ pθ(z)] = (1/2) Σ_{d=1}^{Dz} (−log σφ(x)d² − 1 + σφ(x)d² + μφ(x)d²)
Language ModellingVariational Auto-encoder for Sentences
ReferencesReferences
Approximate Inference
We jointly estimate the parameters of both the generative and inference models by maximising a lower bound on the log-likelihood function (ELBO):

L(θ, φ | x) = E_{q(z|x,φ)}[ log p(x|z, θ) ] − KL( q(z|x, φ) ‖ p(z) )    (10)
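As a toy illustration of the objective, a single-sample Monte Carlo estimate of the ELBO might look as follows (the Gaussian "decoder" here is a stand-in for the sentence decoder; all names and values are illustrative):

```python
# Toy single-sample Monte Carlo estimate of the ELBO, with a Gaussian
# stand-in for the sentence decoder p(x|z); all names/values illustrative.
import numpy as np

rng = np.random.default_rng(0)

def elbo_estimate(x, mu, sigma):
    # Reparametrised sample from q(z|x) = N(mu, diag(sigma^2))
    z = mu + sigma * rng.standard_normal(mu.shape)
    # Toy decoder: log p(x|z) for p(x|z) = N(x | z, I)
    log_px_z = -0.5 * np.sum((x - z) ** 2) - 0.5 * x.size * np.log(2 * np.pi)
    # Analytical KL[q(z|x) || N(0, I)]
    kl = 0.5 * np.sum(-np.log(sigma ** 2) - 1.0 + sigma ** 2 + mu ** 2)
    return log_px_z - kl

x = np.array([0.2, -0.1])
elbo = elbo_estimate(x, mu=np.zeros(2), sigma=np.ones(2))
```

The reconstruction term is estimated by sampling; the KL term is computed in closed form, as in equation (10).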
Architecture
The Gaussian Sen VAE parametrises a categorical distribution over the vocabulary for each given prefix, and conditions on a latent embedding:

Z ∼ N(0, I)
Xi | z, x<i ∼ Cat( f(z, x<i; θ) )

f(z, x<i; θ) = softmax(si)
ei = emb(xi; θemb)
h0 = tanh( affine(z; θinit) )
hi = GRU(hi−1, ei−1; θgru)
si = affine(hi; θout)    (11)
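A minimal sketch of one decoding step in NumPy, with randomly initialised placeholder weights (shapes and names are illustrative, not the lecture's implementation):

```python
# Minimal sketch of one decoding step of the Gaussian SenVAE decoder.
# All weights are random placeholders; shapes and names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
V, E, H, Dz = 10, 4, 6, 3                 # vocab, embedding, hidden, latent sizes

emb_table = rng.standard_normal((V, E)) * 0.1
W_init = rng.standard_normal((Dz, H)) * 0.1   # affine(z; theta_init)
W_out = rng.standard_normal((H, V)) * 0.1     # affine(h_i; theta_out)
# GRU parameters: update (u), reset (r), candidate (n) gates
Wu, Uu, Wr, Ur, Wn, Un = (rng.standard_normal(s) * 0.1
                          for s in [(E, H), (H, H)] * 3)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(h, e):
    u = sigmoid(e @ Wu + h @ Uu)          # update gate
    r = sigmoid(e @ Wr + h @ Ur)          # reset gate
    n = np.tanh(e @ Wn + (r * h) @ Un)    # candidate state
    return (1 - u) * n + u * h

def softmax(s):
    s = s - s.max()
    p = np.exp(s)
    return p / p.sum()

z = rng.standard_normal(Dz)               # latent sample
h = np.tanh(z @ W_init)                   # h0 = tanh(affine(z))
x_prev = 3                                # previous token id
h = gru_step(h, emb_table[x_prev])        # h_i = GRU(h_{i-1}, e_{i-1})
p = softmax(h @ W_out)                    # Cat(f(z, x_<i; theta))
```

Note how z only enters through the initial state h0, exactly as in equation (11); the per-step recurrence is otherwise a standard GRU language model.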
The Strong Decoder Problem
The VAE may ignore the latent variable, given the interaction between the prior and posterior in the KL divergence.

This problem appears when we have strong decoders: conditional likelihoods p(x|z) parametrised by high-capacity models.

The model might achieve a high ELBO without using information from z.

An RNN LM is a strong decoder because it conditions on all previous context when generating a word.
What to do?
Weakening the decoder: the model then has to rely on the latent variables for the reconstruction of the observed data.

KL annealing: weigh the KL term in the ELBO with a factor that is annealed from 0 to 1 over a fixed number of steps of size γ ∈ (0, 1).

Word dropout: by dropping a percentage of the input at random, the decoder has to rely on the latent variable to fill in the missing gaps.

Free bits: allows encoding the first r nats of information for free: max(r, KL(qφ(z|x) ‖ p(z))).
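The annealing schedule and the free-bits clamp can be sketched as follows (the step size γ and threshold r are illustrative choices, not the lecture's settings):

```python
# Sketch of KL annealing and the free-bits clamp; the step size gamma and
# the threshold r are illustrative choices, not the lecture's settings.

def kl_anneal_weight(step, gamma=1e-3):
    """Linearly increase the KL weight from 0 to 1 in increments of gamma."""
    return min(1.0, step * gamma)

def free_bits_kl(kl_value, r=2.0):
    """Do not reward pushing the KL below r nats: max(r, KL)."""
    return max(r, kl_value)

def neg_elbo(rec, kl_value, step):
    """Negative ELBO with annealed, free-bits KL (rec = E[log p(x|z)])."""
    return -rec + kl_anneal_weight(step) * free_bits_kl(kl_value)
```

Both tricks act only on the KL term: annealing delays its influence, while free bits removes the incentive to collapse it all the way to zero.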
Metrics
For VAE models the negative log-likelihood (NLL) is estimated, since we do not have access to the exact marginal likelihood.

We use an importance sampling (IS) estimate:

p(x|θ) = ∫ p(z, x|θ) dz
       = ∫ q(z|x) · p(z, x|θ) / q(z|x) dz
       ≈ (1/S) Σ_{s=1}^{S} p(z(s), x|θ) / q(z(s)|x),  where z(s) ∼ q(z|x)    (12)

Perplexity (PPL): the exponent of the average per-word entropy, given N i.i.d. sequences:

PPL = exp( − (Σ_{i=1}^{N} log p(xi)) / (Σ_{i=1}^{N} |xi|) )    (13)

The perplexity is based on the importance-sampled NLL.
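A toy NumPy sketch of the IS estimate and the resulting perplexity, replacing the model densities with simple Gaussians so the true marginal is known (all values illustrative):

```python
# Toy sketch of the importance-sampling estimate of log p(x|theta) and the
# resulting perplexity; model densities replaced by simple Gaussians and
# log p(x|z) fixed to -1.5, so the true log-marginal is exactly -1.5.
import numpy as np

rng = np.random.default_rng(0)

def log_gauss(z, mu, sigma):
    """Log-density of a diagonal Gaussian, summed over the last axis."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - 0.5 * ((z - mu) / sigma) ** 2, axis=-1)

S, Dz = 1000, 2
mu_q, sigma_q = np.array([0.3, -0.2]), np.array([0.9, 1.1])

z = mu_q + sigma_q * rng.standard_normal((S, Dz))          # z(s) ~ q(z|x)
log_q = log_gauss(z, mu_q, sigma_q)
log_joint = log_gauss(z, np.zeros(Dz), np.ones(Dz)) - 1.5  # log p(z) + log p(x|z)

w = log_joint - log_q                                      # log importance weights
log_px = np.log(np.mean(np.exp(w - w.max()))) + w.max()    # stable log-mean-exp

n_words = 5                                                # toy sentence length
ppl = np.exp(-log_px / n_words)                            # PPL from the IS estimate
```

Working in log space with a log-mean-exp keeps the estimate numerically stable even when the importance weights span many orders of magnitude.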
Baseline
RNNLM (Dyer et al., 2016)

At each step, an RNNLM parameterises a categorical distribution over the vocabulary, i.e. Xi | x<i ∼ Cat( f(x<i; θ) ), where

f(x<i; θ) = softmax(si)
ei = emb(xi; θemb)
hi = GRU(hi−1, ei−1; θgru)
si = affine(hi; θout)    (14)

Embedding layer (emb), one (or more) GRU cell(s) (h0 ∈ θ is a parameter of the model), and an affine layer to map from the dimensionality of the GRU to the vocabulary size.
Data
Wall Street Journal part of the Penn Treebank corpus

Preprocessing and train–validation–test split as in [Dyer et al., 2016]:

WSJ sections 1–21 for training
Section 23 as test corpus
Section 24 as validation
Results
            NLL↓           PPL↓
RNN-LM      118.7 ± 0.12   107.1 ± 0.46
VAE         118.4 ± 0.09   105.7 ± 0.36
Annealing   117.9 ± 0.08   103.7 ± 0.31
Free-bits   117.5 ± 0.18   101.9 ± 0.77
Samples
Decode greedily from a prior sample; the variability is due to the generator's reliance on the latent sample.

If the VAE ignores z, greedy generation from a prior sample is essentially deterministic.
Samples
Homotopy: decode greedily from points lying between a posterior sample conditioned on the first sentence and a posterior sample conditioned on the last sentence:

zα = α · z1 + (1 − α) · z2,  α ∈ [0, 1]
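The interpolation itself is a one-liner; a sketch (the greedy decoder is not shown):

```python
# Sketch of the latent-space homotopy: linear interpolation between two
# posterior samples z1 and z2 (the greedy decoder itself is not shown).
import numpy as np

def homotopy(z1, z2, n_points=5):
    """Return z_alpha = alpha*z1 + (1 - alpha)*z2 for alpha in [0, 1]."""
    alphas = np.linspace(0.0, 1.0, n_points)
    return [a * z1 + (1 - a) * z2 for a in alphas]

z1 = np.array([1.0, 0.0])
z2 = np.array([0.0, 1.0])
points = homotopy(z1, z2)  # points[0] == z2 (alpha=0), points[-1] == z1 (alpha=1)
```

Decoding each zα greedily shows whether nearby points in latent space produce related sentences, i.e. whether the latent space is smooth.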
References I
Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003. URL http://dblp.uni-trier.de/db/journals/jmlr/jmlr3.html#BengioDVJ03.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. CoRR, abs/1511.06349, 2015. URL http://arxiv.org/abs/1511.06349.
References II
Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 199–209, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/N16-1024. URL https://www.aclweb.org/anthology/N16-1024.

Joshua T. Goodman. A bit of progress in language modeling. Comput. Speech Lang., 15(4):403–434, October 2001. ISSN 0885-2308. doi: 10.1006/csla.2001.0174. URL http://dx.doi.org/10.1006/csla.2001.0174.
References III
Fred Jelinek and Robert L. Mercer. Interpolated estimation of Markov source parameters from sparse data. In Edzard S. Gelsema and Laveen N. Kanal, editors, Proceedings, Workshop on Pattern Recognition in Practice, pages 381–397. North Holland, Amsterdam, 1980.

Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. Recurrent neural network based language model. In Takao Kobayashi, Keikichi Hirose, and Satoshi Nakamura, editors, INTERSPEECH, pages 1045–1048. ISCA, 2010. URL http://dblp.uni-trier.de/db/conf/interspeech/interspeech2010.html#MikolovKBCK10.