Language Modelling · Variational Auto-encoder for Sentences · References
Deep Generative Language Models (DGM4NLP)
Miguel Rios, University of Amsterdam
May 2, 2019
Rios DGM4NLP 1 / 46
Outline
1 Language Modelling
2 Variational Auto-encoder for Sentences
Recap Generative Models of Word Representation
Discriminative embedding models: word2vec
In the event of a chemical spill, most children know they should evacuate as advised by people in charge.
Place words in Rd so as to answer questions like
“Have I seen this word in this context?”
Fit a binary classifier:
positive examples
negative examples
Recap Generative Models of Word Representation
The model processes a sentence and outputs a word representation:
Recap Generative Models of Word Representation
[Figure: graphical model for the generative word-representation model over the sentence pair “quickly evacuate the area / deje el lugar rapidamente”, with variables x, y, z, a, parameters θ, and plates of sizes m, n, and |B|.]
Recap Generative Models of Word Representation
[Figure: architecture. Inference model: x → 128-d embedding → 100-d BiRNN states h → u and s (100-d) → sample z (100-d). Generative model: f(z) → x and g(z, aj) → y.]
Introduction
you know nothing, jon x
ground control to major x
the x
the quick brown x
the quick brown fox x
the quick brown fox jumps x
the quick brown fox jumps over x
the quick brown fox jumps over the x
the quick brown fox jumps over the lazy x
the quick brown fox jumps over the lazy dog
Definition
Language models give us the probability of a sentence;
At each time step, they assign a probability to the next word.
Applications
Very useful in different tasks:
Speech recognition;
Spelling correction;
Machine translation;
LMs are useful in almost any task that deals with generating language.
Language Models
N-gram based LMs;
Log-linear LMs;
Neural LMs.
N-gram LM
x is a sequence of words:
x = x1, x2, x3, x4, x5 = you, know, nothing, jon, snow
N-gram LM
To compute the probability of a sentence:
p(x) = p(x1, x2, . . . , xn)   (1)
We apply the chain rule:
p(x) = ∏i p(xi | x1, . . . , xi−1)   (2)
We limit the history with a Markov order:
p(xi | x1, . . . , xi−1) ≈ p(xi | xi−4, xi−3, xi−2, xi−1)
[Jelinek and Mercer, 1980, Goodman, 2001]
N-gram LM
Chain rule:
p(x) = ∏i p(xi | x1, . . . , xi−1)   (3)
N-gram LM
We make a Markov assumption of conditional independence:
p(xi | x1, . . . , xi−1) ≈ p(xi | xi−1)   (4)
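The chain rule and the bigram Markov truncation above can be sketched as follows; the probabilities are made-up illustrative values, not estimates from any corpus:

```python
import math

# Hypothetical bigram probabilities p(x_i | x_{i-1}); "<s>" marks the sentence start.
# An unseen bigram would raise a KeyError here; smoothing addresses that (next slides).
bigram = {
    ("<s>", "you"): 0.1,
    ("you", "know"): 0.3,
    ("know", "nothing"): 0.05,
}

def sentence_logprob(words):
    """log p(x) = sum_i log p(x_i | x_{i-1}) under a first-order Markov model."""
    logp = 0.0
    for prev, cur in zip(["<s>"] + words, words):
        logp += math.log(bigram[(prev, cur)])
    return logp

lp = sentence_logprob(["you", "know", "nothing"])
# p(x) = 0.1 * 0.3 * 0.05 = 0.0015
```

Working in log space avoids underflow when the product runs over long sentences.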
N-gram LM
We make the Markov assumption: p(xi | x1, . . . , xi−1) ≈ p(xi | xi−1)
If we do not observe the bigram, p(xi | xi−1) is 0 and the probability of the sentence will be 0.
MLE:
pMLE(xi | xi−1) = count(xi−1, xi) / count(xi−1)   (5)
Laplace smoothing:
padd1(xi | xi−1) = (count(xi−1, xi) + 1) / (count(xi−1) + V)   (6)
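Equations (5) and (6) can be sketched directly from corpus counts. A toy illustration with an assumed three-sentence corpus:

```python
from collections import Counter

corpus = [["the", "dog", "barks"], ["the", "cat", "meows"], ["the", "dog", "sleeps"]]

unigram_counts = Counter()  # count(x_{i-1}): how often a word occurs as history
bigram_counts = Counter()   # count(x_{i-1}, x_i)
for sent in corpus:
    for prev, cur in zip(sent, sent[1:]):
        bigram_counts[(prev, cur)] += 1
        unigram_counts[prev] += 1

V = len({w for sent in corpus for w in sent})  # vocabulary size, here 6

def p_mle(cur, prev):
    # Eq. (5): relative frequency; zero for unseen bigrams.
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, cur)] / unigram_counts[prev]

def p_add1(cur, prev):
    # Eq. (6): Laplace smoothing moves some mass to unseen bigrams.
    return (bigram_counts[(prev, cur)] + 1) / (unigram_counts[prev] + V)

p_mle("dog", "the")   # 2/3: "the dog" occurs twice among three "the" contexts
p_mle("bird", "the")  # 0.0 under MLE
p_add1("bird", "the") # 1/9: nonzero after smoothing
```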
Log-linear LM
p(y|x) = exp(w · φ(x, y)) / Σy′∈Vy exp(w · φ(x, y′))   (7)
y is the next word and Vy is the vocabulary;
x is the history;
φ is a feature function that returns an n-dimensional vector;
w are the model parameters.
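Equation (7) is a softmax over weighted feature scores. A minimal sketch; the feature templates, weights, and vocabulary below are assumptions for illustration:

```python
import math

vocab = ["dog", "puppy", "car"]

def phi(history, y):
    # Hypothetical indicator features over (history, next-word) pairs.
    return [
        1.0 if history[-1] == "the" and y == "puppy" else 0.0,  # n-gram feature
        1.0 if y == "car" else 0.0,                             # word-identity feature
    ]

w = [2.0, -1.0]  # assumed weights, e.g. as learned by SGD

def p(y, history):
    """Eq. (7): exponentiated score, normalised over the whole vocabulary."""
    def score(cand):
        return sum(wi * fi for wi, fi in zip(w, phi(history, cand)))
    denom = sum(math.exp(score(cand)) for cand in vocab)
    return math.exp(score(y)) / denom

probs = {y: p(y, ["the"]) for y in vocab}
# probabilities sum to 1; "puppy" gets the most mass after "the"
```

Note the denominator sums over the full vocabulary, which is the expensive part of training log-linear and neural LMs.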
Log-linear LM
n-gram features: xj−1 = the and xj = puppy;
gappy n-gram features: xj−2 = the and xj = puppy;
class features: xj belongs to class ABC;
gazetteer features: xj is a place name.
Log-linear LM
With features of words and histories we can share statistical weight;
With n-grams, there is no sharing at all;
We also get smoothing for free;
We can add arbitrary features;
We use Stochastic Gradient Descent (SGD).
Neural LM
n-gram language models have proven to be effective in various tasks;
log-linear models allow us to share weights through features;
maybe our history is still too limited, e.g. n−1 words;
we need to find useful features.
Feed-forward NLM
With NNs we can exploit distributed representations to allow for statistical weight sharing.
How does it work:
1 each word is mapped to an embedding: an m-dimensional feature vector;
2 a probability function over word sequences is expressed in terms of these vectors;
3 we jointly learn the feature vectors and the parameters of the probability function.
[Bengio et al., 2003]
Feed-forward NLM
Similar words are expected to have similar feature vectors: (dog, cat), (running, walking), (bedroom, room)
With this, probability mass is naturally transferred from (1) to (2):
(1) The cat is walking in the bedroom.
(2) The dog is running in the room.
Take-away message: the presence of only one sentence in the training data will increase the probability of a combinatorial number of neighbours in sentence space.
Feed-forward NLM
FF-LM
Eyou, Eknow, Enothing ∈ R100
x = [Eyou; Eknow; Enothing] ∈ R300
y = W3 tanh(W1x + b1) + W2x + b2   (8)
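Equation (8) as a forward pass; the weights are randomly initialised stand-ins for trained parameters, and the hidden size (64) is an assumption (the slide fixes only the 100-d embeddings and 300-d input):

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_hid, vocab_size = 100, 64, 1000  # d_hid and vocab_size are assumed

E = rng.normal(0, 0.1, (vocab_size, d_emb))                  # embedding matrix
W1 = rng.normal(0, 0.1, (d_hid, 3 * d_emb)); b1 = np.zeros(d_hid)
W2 = rng.normal(0, 0.1, (vocab_size, 3 * d_emb)); b2 = np.zeros(vocab_size)
W3 = rng.normal(0, 0.1, (vocab_size, d_hid))

def ff_lm(history_ids):
    # x = [E_you; E_know; E_nothing]: concatenate the three history embeddings.
    x = np.concatenate([E[i] for i in history_ids])
    # y = W3 tanh(W1 x + b1) + W2 x + b2  (Eq. 8: hidden layer plus direct connection)
    y = W3 @ np.tanh(W1 @ x + b1) + W2 @ x + b2
    e = np.exp(y - y.max())          # softmax turns scores into next-word probabilities
    return e / e.sum()

probs = ff_lm([1, 2, 3])  # distribution over the next word given a 3-word history
```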
Feed-forward NLM
FF-LM
The non-linear activation functions perform feature combinations that a linear model cannot do;
End-to-end training on next-word prediction.
Feed-forward NLM
FF-LM
We now have much better generalisation, but still a limited history/context.
Recurrent neural networks have unlimited history.
RNN NLM
RNN-LM
Start: predict the second word from the first.
[Mikolov et al., 2010]
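A minimal sketch of the recurrence behind an RNN-LM, with an Elman-style tanh cell and random stand-in weights; all sizes are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, d_emb, d_hid = 50, 16, 32  # assumed sizes

E = rng.normal(0, 0.1, (vocab_size, d_emb))
W_hh = rng.normal(0, 0.1, (d_hid, d_hid))
W_xh = rng.normal(0, 0.1, (d_hid, d_emb))
W_out = rng.normal(0, 0.1, (vocab_size, d_hid))

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def rnn_lm(sentence_ids):
    """Return p(x_i | x_<i) at each position: the hidden state carries the full prefix."""
    h = np.zeros(d_hid)
    stepwise = []
    for word_id in sentence_ids:
        stepwise.append(softmax(W_out @ h))          # predict current word from prefix
        h = np.tanh(W_hh @ h + W_xh @ E[word_id])    # then consume it into the state
    return stepwise

dists = rnn_lm([4, 7, 9])
```

Unlike the feed-forward model, nothing here truncates the history: h summarises every word seen so far.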
Outline
1 Language Modelling
2 Variational Auto-encoder for Sentences
Sen VAE
Model observations as draws from the marginal of a DGM.
An NN maps from a latent sentence embedding z ∈ Rdz to a distribution p(x|z, θ) over sentences:
p(x|θ) = ∫ p(z) p(x|z, θ) dz = ∫ N(z|0, I) ∏_{i=1}^{|x|} Cat(xi | f(z, x<i; θ)) dz   (9)
[Bowman et al., 2015]
Sen VAE
[Figure: graphical model with latent z, observed sentence x1:m, generative parameters θ, variational parameters φ, plate of size N.]
Generative model:
Z ∼ N(0, I)
Xi | z, x<i ∼ Cat(fθ(z, x<i))
Inference model:
Z ∼ N(μφ(x1:m), σφ(x1:m)²)
Sen VAE
Generation is one word at a time without Markov assumptions, but f() conditions on z in addition to the observed prefix.
The conditional p(x|z, θ) is the decoder.
p(x|θ) is the marginal likelihood.
We train the model to assign high (marginal) probability to observations, like a LM.
However, the model uses the latent space to exploit neighbourhood and smoothness, capturing regularities in data space. For example, it may group sentences according to lexical choices, syntactic complexity, lexical semantics, etc.
Approximate Inference
The model has a diagonal Gaussian distribution as variational posterior:
qφ(z|x) = N(z | μφ(x), diag(σφ(x)))
With reparametrisation:
z = hφ(ε, x) = μφ(x) + σφ(x) ⊙ ε, where ε ∼ N(0, I)
Analytical KL:
KL[qφ(z|x) ‖ pθ(z)] = (1/2) Σ_{d=1}^{Dz} (−log σφ(x)d² − 1 + σφ(x)d² + μφ(x)d²)
Language ModellingVariational Auto-encoder for Sentences
ReferencesReferences
Approximate Inference
We jointly estimate the parameters of both the generative and inference models by maximising a lower bound on the log-likelihood function (ELBO):

L(θ, φ | x) = E_{q(z|x,φ)}[ log p(x|z, θ) ] − KL( q(z|x, φ) ‖ p(z) )    (10)
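As a toy illustration of the objective, a single-sample Monte Carlo estimate of the ELBO might look as follows (the Gaussian "decoder" here is a stand-in for the sentence decoder; all names and values are illustrative):

```python
# Toy single-sample Monte Carlo estimate of the ELBO, with a Gaussian
# stand-in for the sentence decoder p(x|z); all names/values illustrative.
import numpy as np

rng = np.random.default_rng(0)

def elbo_estimate(x, mu, sigma):
    # Reparametrised sample from q(z|x) = N(mu, diag(sigma^2))
    z = mu + sigma * rng.standard_normal(mu.shape)
    # Toy decoder: log p(x|z) for p(x|z) = N(x | z, I)
    log_px_z = -0.5 * np.sum((x - z) ** 2) - 0.5 * x.size * np.log(2 * np.pi)
    # Analytical KL[q(z|x) || N(0, I)]
    kl = 0.5 * np.sum(-np.log(sigma ** 2) - 1.0 + sigma ** 2 + mu ** 2)
    return log_px_z - kl

x = np.array([0.2, -0.1])
elbo = elbo_estimate(x, mu=np.zeros(2), sigma=np.ones(2))
```

The reconstruction term is estimated by sampling; the KL term is computed in closed form, as in equation (10).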
Architecture
The Gaussian Sen VAE parametrises a categorical distribution over the vocabulary for each given prefix, and conditions on a latent embedding:

Z ∼ N(0, I)
Xi | z, x<i ∼ Cat( f(z, x<i; θ) )

f(z, x<i; θ) = softmax(si)
ei = emb(xi; θemb)
h0 = tanh( affine(z; θinit) )
hi = GRU(hi−1, ei−1; θgru)
si = affine(hi; θout)    (11)
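A minimal sketch of one decoding step in NumPy, with randomly initialised placeholder weights (shapes and names are illustrative, not the lecture's implementation):

```python
# Minimal sketch of one decoding step of the Gaussian SenVAE decoder.
# All weights are random placeholders; shapes and names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
V, E, H, Dz = 10, 4, 6, 3                 # vocab, embedding, hidden, latent sizes

emb_table = rng.standard_normal((V, E)) * 0.1
W_init = rng.standard_normal((Dz, H)) * 0.1   # affine(z; theta_init)
W_out = rng.standard_normal((H, V)) * 0.1     # affine(h_i; theta_out)
# GRU parameters: update (u), reset (r), candidate (n) gates
Wu, Uu, Wr, Ur, Wn, Un = (rng.standard_normal(s) * 0.1
                          for s in [(E, H), (H, H)] * 3)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(h, e):
    u = sigmoid(e @ Wu + h @ Uu)          # update gate
    r = sigmoid(e @ Wr + h @ Ur)          # reset gate
    n = np.tanh(e @ Wn + (r * h) @ Un)    # candidate state
    return (1 - u) * n + u * h

def softmax(s):
    s = s - s.max()
    p = np.exp(s)
    return p / p.sum()

z = rng.standard_normal(Dz)               # latent sample
h = np.tanh(z @ W_init)                   # h0 = tanh(affine(z))
x_prev = 3                                # previous token id
h = gru_step(h, emb_table[x_prev])        # h_i = GRU(h_{i-1}, e_{i-1})
p = softmax(h @ W_out)                    # Cat(f(z, x_<i; theta))
```

Note how z only enters through the initial state h0, exactly as in equation (11); the per-step recurrence is otherwise a standard GRU language model.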
The Strong Decoder Problem
The VAE may ignore the latent variable, given the interaction between the prior and posterior in the KL divergence.

This problem appears when we have strong decoders: conditional likelihoods p(x|z) parametrised by high-capacity models.

The model might achieve a high ELBO without using information from z.

An RNN LM is a strong decoder because it conditions on all previous context when generating a word.
What to do?
Weakening the decoder: the model then has to rely on the latent variables for the reconstruction of the observed data.

KL annealing: weigh the KL term in the ELBO with a factor that is annealed from 0 to 1 over a fixed number of steps of size γ ∈ (0, 1).

Word dropout: by dropping a percentage of the input at random, the decoder has to rely on the latent variable to fill in the missing gaps.

Free bits: allows encoding the first r nats of information for free: max(r, KL(qφ(z|x) ‖ p(z))).
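The annealing schedule and the free-bits clamp can be sketched as follows (the step size γ and threshold r are illustrative choices, not the lecture's settings):

```python
# Sketch of KL annealing and the free-bits clamp; the step size gamma and
# the threshold r are illustrative choices, not the lecture's settings.

def kl_anneal_weight(step, gamma=1e-3):
    """Linearly increase the KL weight from 0 to 1 in increments of gamma."""
    return min(1.0, step * gamma)

def free_bits_kl(kl_value, r=2.0):
    """Do not reward pushing the KL below r nats: max(r, KL)."""
    return max(r, kl_value)

def neg_elbo(rec, kl_value, step):
    """Negative ELBO with annealed, free-bits KL (rec = E[log p(x|z)])."""
    return -rec + kl_anneal_weight(step) * free_bits_kl(kl_value)
```

Both tricks act only on the KL term: annealing delays its influence, while free bits removes the incentive to collapse it all the way to zero.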
Metrics
For VAE models the negative log-likelihood (NLL) is estimated, since we do not have access to the exact marginal likelihood.

We use an importance sampling (IS) estimate:

p(x|θ) = ∫ p(z, x|θ) dz
       = ∫ q(z|x) · p(z, x|θ) / q(z|x) dz
       ≈ (1/S) Σ_{s=1}^{S} p(z(s), x|θ) / q(z(s)|x),  where z(s) ∼ q(z|x)    (12)

Perplexity (PPL): the exponent of the average per-word entropy, given N i.i.d. sequences:

PPL = exp( − (Σ_{i=1}^{N} log p(xi)) / (Σ_{i=1}^{N} |xi|) )    (13)

The perplexity is based on the importance-sampled NLL.
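A toy NumPy sketch of the IS estimate and the resulting perplexity, replacing the model densities with simple Gaussians so the true marginal is known (all values illustrative):

```python
# Toy sketch of the importance-sampling estimate of log p(x|theta) and the
# resulting perplexity; model densities replaced by simple Gaussians and
# log p(x|z) fixed to -1.5, so the true log-marginal is exactly -1.5.
import numpy as np

rng = np.random.default_rng(0)

def log_gauss(z, mu, sigma):
    """Log-density of a diagonal Gaussian, summed over the last axis."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - 0.5 * ((z - mu) / sigma) ** 2, axis=-1)

S, Dz = 1000, 2
mu_q, sigma_q = np.array([0.3, -0.2]), np.array([0.9, 1.1])

z = mu_q + sigma_q * rng.standard_normal((S, Dz))          # z(s) ~ q(z|x)
log_q = log_gauss(z, mu_q, sigma_q)
log_joint = log_gauss(z, np.zeros(Dz), np.ones(Dz)) - 1.5  # log p(z) + log p(x|z)

w = log_joint - log_q                                      # log importance weights
log_px = np.log(np.mean(np.exp(w - w.max()))) + w.max()    # stable log-mean-exp

n_words = 5                                                # toy sentence length
ppl = np.exp(-log_px / n_words)                            # PPL from the IS estimate
```

Working in log space with a log-mean-exp keeps the estimate numerically stable even when the importance weights span many orders of magnitude.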
Baseline
RNNLM (Dyer et al., 2016)

At each step, an RNNLM parameterises a categorical distribution over the vocabulary, i.e. Xi | x<i ∼ Cat( f(x<i; θ) ), where

f(x<i; θ) = softmax(si)
ei = emb(xi; θemb)
hi = GRU(hi−1, ei−1; θgru)
si = affine(hi; θout)    (14)

Embedding layer (emb), one (or more) GRU cell(s) (h0 ∈ θ is a parameter of the model), and an affine layer to map from the dimensionality of the GRU to the vocabulary size.
Data
Wall Street Journal part of the Penn Treebank corpus

Preprocessing and train–validation–test split as in [Dyer et al., 2016]:

WSJ sections 1–21 for training
Section 23 as test corpus
Section 24 as validation
Results
            NLL↓           PPL↓
RNN-LM      118.7 ± 0.12   107.1 ± 0.46
VAE         118.4 ± 0.09   105.7 ± 0.36
Annealing   117.9 ± 0.08   103.7 ± 0.31
Free-bits   117.5 ± 0.18   101.9 ± 0.77
Samples
Decode greedily from a prior sample; the variability is due to the generator's reliance on the latent sample.

If the VAE ignores z, greedy generation from a prior sample is essentially deterministic.
Samples
Homotopy: decode greedily from points lying between a posterior sample conditioned on the first sentence and a posterior sample conditioned on the last sentence:

zα = α · z1 + (1 − α) · z2,  α ∈ [0, 1]
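The interpolation itself is a one-liner; a sketch (the greedy decoder is not shown):

```python
# Sketch of the latent-space homotopy: linear interpolation between two
# posterior samples z1 and z2 (the greedy decoder itself is not shown).
import numpy as np

def homotopy(z1, z2, n_points=5):
    """Return z_alpha = alpha*z1 + (1 - alpha)*z2 for alpha in [0, 1]."""
    alphas = np.linspace(0.0, 1.0, n_points)
    return [a * z1 + (1 - a) * z2 for a in alphas]

z1 = np.array([1.0, 0.0])
z2 = np.array([0.0, 1.0])
points = homotopy(z1, z2)  # points[0] == z2 (alpha=0), points[-1] == z1 (alpha=1)
```

Decoding each zα greedily shows whether nearby points in latent space produce related sentences, i.e. whether the latent space is smooth.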
References I
Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003. URL http://dblp.uni-trier.de/db/journals/jmlr/jmlr3.html#BengioDVJ03.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. CoRR, abs/1511.06349, 2015. URL http://arxiv.org/abs/1511.06349.
References II
Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 199–209, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/N16-1024. URL https://www.aclweb.org/anthology/N16-1024.

Joshua T. Goodman. A bit of progress in language modeling. Comput. Speech Lang., 15(4):403–434, October 2001. ISSN 0885-2308. doi: 10.1006/csla.2001.0174. URL http://dx.doi.org/10.1006/csla.2001.0174.
References III
Fred Jelinek and Robert L. Mercer. Interpolated estimation of Markov source parameters from sparse data. In Edzard S. Gelsema and Laveen N. Kanal, editors, Proceedings, Workshop on Pattern Recognition in Practice, pages 381–397. North Holland, Amsterdam, 1980.

Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. Recurrent neural network based language model. In Takao Kobayashi, Keikichi Hirose, and Satoshi Nakamura, editors, INTERSPEECH, pages 1045–1048. ISCA, 2010. URL http://dblp.uni-trier.de/db/conf/interspeech/interspeech2010.html#MikolovKBCK10.