Neural Variational Inference for Text ProcessingNeural Variational Inference for Text Processing...

Neural Variational Inference for Text Processing

Yishu Miao1 [email protected] Yu1 [email protected] Blunsom12 [email protected] of Oxford, 2Google Deepmind

AbstractRecent advances in neural variational inferencehave spawned a renaissance in deep latent vari-able models. In this paper we introduce a genericvariational inference framework for generativeand conditional models of text. While traditionalvariational methods derive an analytic approxi-mation for the intractable distributions over latentvariables, here we construct an inference networkconditioned on the discrete text input to pro-vide the variational distribution. We validate thisframework on two very different text modellingapplications, generative document modelling andsupervised question answering. Our neural vari-ational document model combines a continuousstochastic document representation with a bag-of-words generative model and achieves the low-est reported perplexities on two standard test cor-pora. The neural answer selection model em-ploys a stochastic representation layer within anattention mechanism to extract the semantics be-tween a question and answer pair. On two ques-tion answering benchmarks this model exceedsall previous published benchmarks.

1. IntroductionProbabilistic generative models underpin many successfulapplications within the field of natural language process-ing (NLP). Their popularity stems from their ability to useunlabelled data effectively, to incorporate abundant linguis-tic features, and to learn interpretable dependencies amongdata. However these successes are tempered by the fact thatas the structure of such generative models becomes deeperand more complex, true Bayesian inference becomes in-tractable due to the high dimensional integrals required.Markov chain Monte Carlo (MCMC) (Neal, 1993; Andrieu

Proceedings of the 33 rd International Conference on MachineLearning, New York, NY, USA, 2016. JMLR: W&CP volume48. Copyright 2016 by the author(s).

et al., 2003) and variational inference (Jordan et al., 1999;Attias, 2000; Beal, 2003) are the standard approaches forapproximating these integrals. However the computationalcost of the former results in impractical training for thelarge and deep neural networks which are now fashion-able, and the latter is conventionally confined due to theunderestimation of posterior variance. The lack of effec-tive and efficient inference methods hinders our ability tocreate highly expressive models of text, especially in thesituation where the model is non-conjugate.

This paper introduces a neural variational framework forgenerative models of text, inspired by the variational auto-encoder (Rezende et al., 2014; Kingma & Welling, 2014).The principle idea is to build an inference network, imple-mented by a deep neural network conditioned on text, to ap-proximate the intractable distributions over the latent vari-ables. Instead of providing an analytic approximation, as intraditional variational Bayes, neural variational inferencelearns to model the posterior probability, thus endowingthe model with strong generalisation abilities. Due to theflexibility of deep neural networks, the inference networkis capable of learning complicated non-linear distributionsand processing structured inputs such as word sequences.Inference networks can be designed as, but not restrictedto, multilayer perceptrons (MLP), convolutional neural net-works (CNN), and recurrent neural networks (RNN), ap-proaches which are rarely used in conventional generativemodels. By using the reparameterisation method (Rezendeet al., 2014; Kingma & Welling, 2014), the inference net-work is trained through back-propagating unbiased and lowvariance gradients w.r.t. the latent variables. Within thisframework, we propose a Neural Variational DocumentModel (NVDM) for document modelling and a Neural An-swer Selection Model (NASM) for question answering, atask that selects the sentences that correctly answer a fac-toid question from a set of candidate sentences.

The NVDM (Figure 1) is an unsupervised generative modelof text which aims to extract a continuous semantic latentvariable for each document. This model can be interpretedas a variational auto-encoder: an MLP encoder (inference

arX

iv:1

511.

0603

8v4

[cs

.CL

] 4

Jun

201

6


q (h∣X) (Inference Network)

X

p(X∣h)

h

X

Figure 1. NVDM for document modelling.

a

q

y

z q

c(a,h)

z a

h

p(h∣q)p( y∣zq , z a)

α

Figure 2. NASM for question answer selection.

network) compresses the bag-of-words document represen-tation into a continuous latent distribution, and a softmaxdecoder (generative model) reconstructs the document bygenerating the words independently. A primary featureof NVDM is that each word is generated directly from adense continuous document representation instead of themore common binary semantic vector (Hinton & Salakhut-dinov, 2009; Larochelle & Lauly, 2012; Srivastava et al.,2013; Mnih & Gregor, 2014). Our experiments demon-strate that our neural document model achieves the state-of-the-art perplexities on the 20NewsGroups and RCV1-v2.

The NASM (Figure 2) is a supervised conditional modelwhich imbues LSTMs (Hochreiter & Schmidhuber, 1997)with a latent stochastic attention mechanism to model thesemantics of question-answer pairs and predict their relat-edness. The attention model is designed to focus on thephrases of an answer that are strongly connected to thequestion semantics and is modelled by a latent distribu-tion. This mechanism allows the model to deal with theambiguity inherent in the task and learns pair-specific rep-resentations that are more effective at predicting answermatches, rather than independent embeddings of questionand answer sentences. Bayesian inference provides a nat-ural safeguard against overfitting, especially as the trainingsets available for this task are small. The experiments showthat the LSTM with a latent stochastic attention mecha-nism learns an effective attention model and outperformsboth previously published results, and our own strong non-stochastic attention baselines.

In summary, we demonstrate the effectiveness of neuralvariational inference for text processing on two diversetasks. These models are simple, expressive and can betrained efficiently with the highly scalable stochastic gra-dient back-propagation. Our neural variational frameworkis suitable for both unsupervised and supervised learningtasks, and can be generalised to incorporate any type ofneural networks.

2. Neural Variational Inference FrameworkLatent variable modelling is popular in many NLP prob-lems, but it is non-trivial to carry out effective and efficient

inference for models with complex and deep structure. Inthis section we introduce a generic neural variational in-ference framework that we apply to both the unsupervisedNVDM and supervised NASM in the follow sections.

We define a generative model with a latent variable h,which can be considered as the stochastic units in deepneural networks. We designate the observed parent andchild nodes of h as x and y respectively. Hence, thejoint distribution of the generative model is pθ(x,y) =∑h pθ(y|h)pθ(h|x)p(x), and the variational lower bound

L is derived as:

L = Eq(h)[log pθ(y|h)pθ(h|x)p(x)− log q(h)] (1)

6 log

∫q(h)

q(h)pθ(y|h)pθ(h|x)p(x)dh = log pθ(x,y)

where θ parameterises the generative distributions pθ(y|h)and pθ(h|x). In order to have a tight lower bound, the vari-ational distribution q(h) should approach the true poste-rior p(h|x,y). Here, we employ a parameterised diagonalGaussian N (h|µ(x,y),diag(σ2(x,y))) as qφ(h|x,y).The three steps to construct the inference network are:

1. Construct vector representations of the observed vari-ables: u = fx(x), v = fy(y).

2. Assemble a joint representation: π = g(u,v).

3. Parameterise the variational distribution over the latentvariable: µ = l1(π), logσ = l2(π).

fx(·) and fy(·) can be any type of deep neural networksthat are suitable for the observed data; g(·) is an MLP thatconcatenates the vector representations of the conditioningvariables; l(·) is a linear transformation which outputs theparameters of the Gaussian distribution. By sampling fromthe variational distribution, h ∼ qφ(h|x,y), we are able tocarry out stochastic back-propagation to optimise the lowerbound (Eq. 1).

During training, the model parameters θ together with theinference network parameters φ are updated by stochas-tic back-propagation based on the samples h drawn fromqφ(h|x,y). For the gradients w.r.t. θ, we have the form:

∇θL ' 1L

∑Ll=1∇θ log pθ(y|h

(l))pθ(h(l)|x) (2)


For the gradients w.r.t. φ we reparameterise h = µ+ σ · εand sample ε(l) ∼ N (0, I) to reduce the variance instochastic estimation (Rezende et al., 2014; Kingma &Welling, 2014). The update of φ can be carried out by back-propagating the gradients w.r.t. µ and σ:

s(h) = log pθ(y|h)pθ(h|x)− log qφ(h|x,y)∇µL ' 1

L

∑Ll=1∇h(l) [s(h(l))] (3)

∇σL ' 12L

∑Ll=1 ε

(l)∇h(l) [s(h(l))] (4)

It is worth mentioning that unsupervised learning is a spe-cial case of the neural variational framework where h hasno parent node x. In that case h is directly drawn from theprior p(h) instead of the conditional distribution pθ(h|x),and s(h) = log pθ(y|h)pθ(h)− log qφ(h|y).

Here we only discuss the scenario where the latent vari-ables are continuous and the parameterised diagonal Gaus-sian is employed as the variational distribution. Howeverthe framework is also suitable for discrete units, and theonly modification needed is to replace the Gaussian witha multinomial parameterised by the outputs of a softmaxfunction. Though the reparameterisation trick for continu-ous variables is not applicable for this case, a policy gra-dient approach (Mnih & Gregor, 2014) can help to alle-viate the high variance problem during stochastic estima-tion. (Kingma et al., 2014) proposed a variational infer-ence framework for semi-supervised learning, but the priordistribution over the hidden variable p(h) remains as thestandard Gaussian prior, while we apply a conditional pa-rameterised Gaussian distribution, which is jointly learnedwith the variational distribution.

3. Neural Variational Document ModelThe Neural Variational Document Model (Figure 1) is asimple instance of unsupervised learning where a continu-ous hidden variable h ∈ RK , which generates all the wordsin a document independently, is introduced to represent itssemantic content. Let X ∈ R|V | be the bag-of-words rep-resentation of a document and xi ∈ R|V | be the one-hotrepresentation of the word at position i.

As an unsupervised generative model, we could interpretNVDM as a variational autoencoder: an MLP encoderq(h|X) compresses document representations into con-tinuous hidden vectors (X → h); a softmax decoderp(X|h) =

∏Ni=1 p(xi|h) reconstructs the documents by

independently generating the words (h → {xi}). Tomaximise the log-likelihood log

∑h p(X|h)p(h) of doc-

uments, we derive the lower bound:

L=Eqφ(h|X)

[N∑i=1

log pθ(xi|h)

]−DKL[qφ(h|X)‖p(h)] (5)

where N is the number of words in the document and p(h)

is a Gaussian prior for h. Here, we consider N is ob-served for all the documents. The conditional probabilityover words pθ(xi|h) (decoder) is modelled by multinomiallogistic regression and shared across documents:

pθ(xi|h) =exp{−E(xi;h, θ))}∑|V |j=1 exp{−E(xj ;h, θ)}

(6)

E(xi;h, θ) = −hTRxi − bxi (7)

where R ∈ RK×|V | learns the semantic word embeddingsand bxi represents the bias term.

As there is no supervision information for the latent seman-tics, h, the posterior approximation qφ(h|X) is only condi-tioned on the current document X . The inference networkqφ(h|X) = N (h|µ(X), diag(σ2(X))) is modelled as:

π = g(fMLPX (X)) (8)

µ = l1(π), logσ = l2(π) (9)

For each document X , the neural network generates itsown parameters µ and σ that parameterise the latent distri-bution over document semantics h. Based on the samplesh ∼ qφ(h|X), the lower bound (Eq. 5) can be optimisedby back-propagating the stochastic gradients w.r.t. θ and φ.

Since p(h) is a standard Gaussian prior, the Gaussian KL-Divergence DKL[qφ(h|X)‖p(h)] can be computed analyt-ically to further lower the variance of the gradients. More-over, it also acts as a regulariser for updating the parametersof the inference network qφ(h|X).

4. Neural Answer Selection ModelAnswer sentence selection is a question answeringparadigm where a model must identify the correct sen-tences answering a factual question from a set of candi-date sentences. Assume a question q is associated witha set of answer sentences {a1,a2, ...,an}, together withtheir judgements {y1,y2, ...,yn}, where ym = 1 if theanswer am is correct and ym = 0 otherwise. This is aclassification task where we treat each training data pointas a triple (q,a,y) while predicting y for the unlabelledquestion-answer pair (q,a).

The Neural Answer Selection Model (Figure 2) is a super-vised model that learns the question and answer represen-tations and predicts their relatedness. It employs two dif-ferent LSTMs to embed raw question inputs q and answerinputs a. Let sq(j) and sa(i) be the state outputs of thetwo LSTMs, and i, j be the positions of the states. Con-ventionally, the last state outputs sq(|q|) and sa(|a|), asthe independent question and answer representations, canbe used for relatedness prediction. In NASM, however, weaim to learn pair-specific representations through a latentattention mechanism, which is more effective for pair re-latedness prediction.


NASM applies an attention model to focus on the wordsin the answer sentence that are prominent for predictingthe answer matched to the current question. Instead of us-ing a deterministic question vector, such as sq(|q|), NASMemploys a latent distribution pθ(h|q) to model the ques-tion semantics, which is a parameterised diagonal Gaus-sian N (h|µ(q),diag(σ2(q))). Therefore, the attentionmodel extracts a context vector c(a,h) by iteratively at-tending to the answer tokens based on the stochastic vec-tor h ∼ pθ(h|q). In doing so the model is able to adaptto the ambiguity inherent in questions and obtain salientinformation through attention. Compared to its determin-istic counterpart (applying sq(|q|) as the question seman-tics), the stochastic units incorporated into NASM allowmulti-modal attention distributions. Further, by marginalis-ing over the latent variables, NASM is more robust againstoverfitting, which is important for small question answer-ing training sets.

In this model, the conditional distribution pθ(h|q) is:

πθ = gθ(fLSTMq (q)) = gθ(sq(|q|)) (10)

µθ = l1(πθ), logσθ = l2(πθ) (11)

For each question q, the neural network generates the cor-responding parameters µ and σ that parameterise the latentdistribution over question semantics h. Following Bah-danau et al. (2015), the attention model is defined as:

α(i) ∝ exp(W Tα tanh(W hh+W ssa(i))) (12)

c(a,h) =∑

isa(i)α(i) (13)

za(a,h) = tanh (W ac(a,h) +W nsa(|a|)) (14)

where α(i) is the normalised attention score at answer to-ken i, and the context vector c(a,h) is the weighted sum ofall the state outputs sa(i). We adopt zq(q), za(a,h) as thequestion and answer representations for predicting their re-latedness y. zq(q) is a deterministic vector that is equal tosq(|q|), while za(a,h) is a combination of the sequenceoutput sa(|a|) and the context vector c(a,h) (Eq. 14).For the prediction of pair relatedness y, we model the con-ditional probability distribution pθ(y|zq, za) by sigmoidfunction:

pθ(y = 1|zq, za) = σ(zTq Mza + b

)(15)

To maximise the log-likelihood log p(y|q,a) we use thevariational lower bound:

L=Eqφ(h)[log pθ(y|zq(q), za(a,h))]−DKL(qφ(h)||pθ(h|q))

6 log

∫pθ(y|zq(q), za(a,h))pθ(h|q)dh

= log p(y|q,a) (16)

Following the neural variational inference framework, weconstruct a deep neural network as the inference network

qφ(h|q,a,y) = N (h|µφ(q,a,y),diag(σ2φ(q,a,y))):

πφ = gφ(fLSTMq (q), f LSTM

a (a), fy(y))

= gφ(sq(|q|), sa(|a|), sy) (17)µφ = l3(πφ), logσφ = l4(πφ) (18)

where q and a are also modelled by LSTMs1, and therelatedness label y is modelled by a simple linear trans-formation into the vector sy . According to the joint rep-resentation πφ, we then generate the parameters µφ andσφ, which parameterise the variational distribution over thequestion semantics h. To emphasise, though both pθ(h|q)and qφ(h|q,a,y) are modelled as parameterised Gaussiandistributions, qφ(h|q,a,y) as an approximation only func-tions during inference by producing samples to computethe stochastic gradients, while pθ(h|q) is the generativedistribution that generates the samples for predicting thequestion-answer relatedness y.

Based on the samples h ∼ qφ(h|q,a,y), we use SGVBto optimise the lower bound (Eq.16). The model pa-rameters θ and the inference network parameters φ areupdated jointly using their stochastic gradients. In thiscase, similar to the NVDM, the Gaussian KL divergenceDKL[qφ(h|q,a,y))‖pθ(h|q)] can be analytically com-puted during training process.

5. Experiments5.1. Dataset & Setup for Document Modelling

We experiment with NVDM on two standard news cor-pora: the 20NewsGroups2 and the Reuters RCV1-v23. Theformer is a collection of newsgroup documents, consist-ing of 11,314 training and 7,531 test articles. The latteris a large collection from Reuters newswire stories with794,414 training and 10,000 test cases. The vocabulary sizeof these two datasets are set as 2,000 and 10,000.

To make a direct comparison with the prior work we fol-low the same preprocessing procedure and setup as Hinton& Salakhutdinov (2009), Larochelle & Lauly (2012), Sri-vastava et al. (2013), and Mnih & Gregor (2014). We trainNVDM models with 50 and 200 dimensional documentrepresentations respectively. For the inference network, weuse an MLP (Eq. 8) with 2 layers and 500 dimension recti-fier linear units, which converts document representationsinto embeddings. During training we carry out stochasticestimation by taking one sample for estimating the stochas-tic gradients, while in prediction we use 20 samples forpredicting document perplexity. The model is trained by

1In this case, the LSTMs for q and a are shared by the infer-ence network and the generative model, but there is no restrictionon using different LSTMs in the inference network.

2http://qwone.com/ jason/20Newsgroups3http://trec.nist.gov/data/reuters/reuters.html


Model Dim 20News RCV1LDA 50 1091 1437LDA 200 1058 1142RSM 50 953 988docNADE 50 896 742SBN 50 909 784fDARN 50 917 724fDARN 200 —- 598NVDM 50 836 563NVDM 200 852 550

(a) Perplexity on test dataset.

Word weapons medical companies define israel bookguns medicine expensive defined israeli books

weapon health industry definition arab referenceNVDM gun treatment company printf arabs guide

militia disease market int lebanon writingarmed patients buy sufficient lebanese pages

weapon treatment demand defined israeli readingshooting medecine commercial definition israelis read

NADE firearms patients agency refer arab booksassault process company make palestinian releventarmed studies credit examples arabs collection

(b) The five nearest words in the semantic space.

Table 1. For the experimental results in (a), LDA (Blei et al., 2003) is a traditional topic model that models documents by mixturesof topics, RSM (Hinton & Salakhutdinov, 2009) is an undirected topic model implemented by restricted Boltzmann machines, anddocNADE (Larochelle & Lauly, 2012) is a neural topic model based on autoregressive assumption. The models based on SigmoidBelief Networks (SBN) and Deep AutoRegressive Neural Network (DARN) structures are implemented by Mnih & Gregor (2014),which employs an MLP to build a Monte Carlo control variate estimator for stochastic estimation.

Adam (Kingma & Ba, 2015) and tuned by hold-out vali-dation perplexity. We alternately optimise the generativemodel and the inference network by fixing the parametersof one while updating the parameters of the other.

5.2. Experiments on Document Modelling

Table 1a presents the test document perplexity. The firstcolumn lists the models, and the second column showsthe dimension of latent variables used in the experiments.The final two columns present the perplexity achievedby each topic model on the 20NewsGroups and RCV1-v2datasets. In document modelling, perplexity is computedby exp(− 1

D

∑Ndn

1Nd

log p(Xd)), where D is the numberof documents,Nd represents the length of the dth documentand log p(X) = log

∫p(X|h)p(h)dh is the log probabil-

ity of the words in the document. Since log p(X) is in-tractable in the NVDM, we use the variational lower bound(which is an upper bound on perplexity) to compute theperplexity following Mnih & Gregor (2014).

While all the baseline models listed in Table 1a apply dis-crete latent variables, here NVDM employs a continuousstochastic document representation. The experimental re-sults indicate that NVDM achieves the best performanceon both datasets. For the experiments on RCV1-v2 dataset,the NVDM with latent variable of 50 dimension performseven better than the fDARN with 200 dimension. It demon-strates that our document model with continuous latentvariables has higher expressiveness and better generalisa-tion ability. Table 1b compares the 5 nearest words selectedaccording to the semantic vector learned from NVDM anddocNADE.

In addition to the perplexities, we also qualitatively eval-uate the semantic information learned by NVDM on the

Space Religion Encryption Sport Policyorbit muslims rsa goals bushlunar worship cryptography pts resourcessolar belief crypto teams charles

shuttle genocide keys league austinmoon jews pgp team billlaunch islam license players resolution

fuel christianity secure nhl mrnasa atheists key stats misc

satellite muslim escrow min piecejapanese religious trust buf marc

Table 2. The topics learned by NVDM on 20News.

20NewsGroups dataset with latent variables of 50 dimen-sion. We assume each dimension in the latent space repre-sents a topic that corresponds to a specific semantic mean-ing. Table 2 presents 5 randomly selected topics with 10words that have the strongest positive connection with thetopic. Based on the words in each column, we can deducetheir corresponding topics as: Space, Religion, Encryption,Sport and Policy. Although the model does not impose in-dependent interpretability on the latent representation di-mensions, we still see that the NVDM learns locally inter-pretable structure.

5.3. Dataset & Setup for Answer Sentence Selection

We experiment on two answer selection datasets, theQASent and the WikiQA datasets. QASent (Wang et al.,2007) is created from the TREC QA track, and the WikiQA(Yang et al., 2015) is constructed from Wikipedia, which isless noisy and less biased towards lexical overlap4. Table 3summarises the statistics of the two datasets.

4Yang et al. (2015) provide detailed explanation of the differ-ences between the two datasets.


Source Set Questions QA Pairs JudgementTrain 1,229 53,417 automatic

QASent Dev 82 1,148 manualTest 100 1,517 manualTrain 2,118 20,360 manual

WikiQA Dev 296 2,733 manualTest 633 6,165 manual

Table 3. Statistics of QASent and WikiQA. Judgement denoteswhether correctness was determined automatically or by humanannotators.

1 10 20 50 100Samples

0.2

0.4

0.6

0.8

1.0

1.2

1.4

Sta

ndard

Devia

tion

1e−31.28

0.77

0.460.39

0.32

Figure 3. The standard deviations of MAP scores computed byrunning 10 NASM models on WikiQA with different numbers ofsamples.

Model QASent WikiQA

MAP MRR MAP MRR

Published ModelsPV 0.5213 0.6023 0.5110 0.5160Bigram-CNN 0.5693 0.6613 0.6190 0.6281Deep CNN 0.5719 0.6621 — —PV + Cnt 0.6762 0.7514 0.5976 0.6058WA 0.7063 0.7740 — —LCLR 0.7092 0.7700 0.5993 0.6068Bigram-CNN + Cnt 0.7113 0.7846 0.6520 0.6652Deep CNN + Cnt 0.7186 0.7826 — —

Our ModelsLSTM 0.6436 0.7235 0.6552 0.6747LSTM + Att 0.6451 0.7316 0.6639 0.6828NASM 0.6501 0.7324 0.6705 0.6914LSTM + Cnt 0.7228 0.7986 0.6820 0.6988LSTM + Att + Cnt 0.7289 0.8072 0.6855 0.7041NASM + Cnt 0.7339 0.8117 0.6886 0.7069

Table 4. Results of our models (LSTM, LSTM + Att, NASM)in comparison with other state of the art models on the QASentand WikiQA dataset. PV is the paragraph vector (Le & Mikolov,2014). Bigram-CNN is the simple convolutional model reportedin (Yu et al., 2014). Deep CNN is the deep convolutional modelfrom (Severyn, 2015). WA is a model based on word alignment(Wang & Ittycheriah, 2015). LCLR is the SVM-based classifiertrained using a set of features. Model + Cnt means that the resultis obtained from a combination of a lexical overlap feature andthe output from the distributional model.

In order to investigate the effectiveness of our NASMmodel we also implemented two strong baseline models —a vanilla LSTM model (LSTM) and an LSTM model with adeterministic attention mechanism (LSTM+Att). The for-mer directly applies the QA matching function (Eq. 15) onthe independent question and answer representations whichare the last state outputs sq(|q|) and sa(|a|) from the ques-tion and answer LSTM models. The latter adds an attentionmodel to learn pair-specific representation for prediction onthe basis of the vanilla LSTM. Moreover, LSTM+Att is thedeterministic counterpart of NASM, which has the sameneural network architecture as NASM. The only differenceis that it replaces the stochastic units h with determinis-tic ones, and no inference network is required to carry outstochastic estimation. Following previous work, for each ofour models we also add a lexical overlap feature by com-bining a co-occurrence word count feature with the proba-bility generated from the neural model. MAP and MRR areadopted as the evaluation metrics for this task.

To facilitate direct comparison with previous work we fol-low the same experimental setup as Yu et al. (2014) andSeveryn (2015). The word embeddings (K = 50) areobtained by running the word2vec tool (Mikolov et al.,2013) on the English Wikipedia dump and the AQUAINT5

corpus. We use LSTMs with 3 layers and 50 hidden units,

5https://catalog.ldc.upenn.edu/LDC2002T31

and apply 40% dropout after the embedding layer. For theconstruction of the inference network, we use an MLP (Eq.10) with 2 layers and tanh units of 50 dimension, and anMLP (Eq. 17) with 2 layers and tanh units of 150 dimen-sion for modelling the joint representation. During trainingwe carry out stochastic estimation by taking one samplefor computing the gradients, while in prediction we use 20samples to calculate the expectation of the lower bound.Figure 3 presents the standard deviation of NASM’s MAPscores while using different numbers of samples. Consid-ering the trade-off between computational cost and vari-ance, we chose 20 samples for prediction in all the exper-iments. The models are trained using Adam (Kingma &Ba, 2015), with hyperparameters selected by optimising theMAP score on the development set.

5.4. Experiments on Answer Sentence Selection

Table 4 compares the results of our models with currentstate-of-the-art models on both answer selection datasets.On the QASent dataset, our vanilla LSTM model outper-forms the deep CNN 6 model by approximately 7% on

6As stated in (Yih et al., 2013) that the evaluation scripts usedby previous work are noisy — 4 out of 72 questions in the test setare treated answered incorrectly. This makes the MAP and MRRscores ∼ 4% lower than the true scores. Since Severyn (2015) andWang & Ittycheriah (2015) use a cleaned-up evaluation scripts,we apply the original noisy scripts to re-evaluate their outputs in

https://catalog.ldc.upenn.edu/LDC2002T31


ALSTM the blue color of liquid oxygen in a dewar flask

ANASM the blue color of liquid oxygen in a dewar flask

Q3 what does a liquid oxygen plant look like

ALSTM the peso is subdivided into 100 centavos , represented by " _UNK_ "

ANASM the peso is subdivided into 100 centavos , represented by " _UNK_ "

Q2 how much is centavos in mexico

ALSTM the actress who played lolita , sue lyon , was fourteen at the time of filming .

ANASM the actress who played lolita , sue lyon , was fourteen at the time of filming .

Q1 how old was sue lyon when she made lolita

Figure 4. A visualisation of attention scores on answer sentences.

0 10 20 30 40Dimension

0

10

20

30

40

Sentence

'how'

'what'

'who'

'when'

'where'

Figure 5. Hinton diagrams of the log standard deviations.

MAP and 6% on MRR. The LSTM+Att performs slightlybetter than the vanilla LSTM model, and our NASM im-proves the results further. Since the QASent dataset is bi-ased towards lexical overlapping features, after combiningwith a co-occurrence word count feature, our best modelNASM outperforms all the previous models, including bothneural network based models and classifiers with a set ofhand-crafted features (e.g. LCLR). Similarly, on the Wik-iQA dataset, all of our models outperform the previous dis-tributional models by a large margin. By including a wordcount feature, our models improve further and achieve thestate-of-the-art. Notably, on both datasets, our two LSTM-based models have set strong baselines and NASM workseven better, which demonstrates the effectiveness of intro-ducing stochastic units to model question semantics in thisanswer sentence selection task.

In Figure 4, we compare the effectiveness of the latent at-tention mechanism (NASM) and its deterministic counter-part (LSTM+Att) by visualising the attention scores on theanswer sentences. For most of the negative answer sen-tences, neither of the two attention models can attend toreasonable words that are beneficial for predicting relat-edness. But for the correct answer sentences, such as theones in Figure 4, both attention models are able to capturecrucial information by attending to different parts of thesentence based on the question semantics. Interestingly,compared to the deterministic counterpart LSTM+Att, ourNASM assigns higher attention scores on the prominentwords that are relevant to the question, which forms a morepeaked distribution and in turn helps the model achieve bet-ter performance.

In order to have an intuitive observation on the latent dis-tributions, we present Hinton diagrams of their log stan-dard deviation parameters (Figure 5). In a Hinton diagram,the size of a square is proportional to a value’s magni-tude, and the colour (black/white) indicates its sign (pos-itive/negative). In this case, we visualise the parameters

order to make the results directly comparable with previous work.

of 50 conditional distributions pθ(h|q) with the questionsselected from 5 different groups, which start with ‘how’,‘what’, ‘who’, ‘when’ and ‘where’. All the log standard de-viations are initialised as zero before training. According toFigure 5, we can see that the questions starting with ‘how’have more white areas, which indicates higher variances ormore uncertainties are in these dimensions. By contrast, thequestions starting with ‘what’ have black squares in almostevery dimension. Intuitively, it is more difficult to under-stand and answer the questions starting with ‘how’ than theothers, while the ‘what’ questions commonly have explicitwords indicating the possible answers. To validate this, wecompute the stratified MAP scores based on different ques-tion type. The MAP of ’how’ questions is 0.524 which isthe lowest among the five groups. Hence empirically, ’how’questions are harder to ’understand and answer’.

6. DiscussionAs shown in the experiments, neural variational inferencebrings consistent improvements on the performance of bothNLP tasks. The basic intuition is that the latent distribu-tions grant the ability to sum over all the possibilities interms of semantics. From the perspective of optimisation,one of the most important reasons is that Bayesian learningguards against overfitting.

According to Eq. 5 in NVDM, since we adopt p(h)as a standard Gaussian prior, the KL divergence termDKL[qφ(h|X)‖p(h)] can be analytically computed as12 (K − ‖µ‖

2 − ‖σ‖2 + log |diag(σ2)|). It is not difficultto find that it actually acts as L2 regulariser when we up-date the µ. Similarly, in NASM (Eq. 16), we also have theKL divergence termDKL[qφ(h|q,a,y))‖pθ(h|q)]. Differ-ent from NVDM, it attempts to minimise the distance be-tween qφ(h|q,a,y)) and pθ(h|q) that are both conditionaldistributions. Because pθ(h|q) as well as qφ(h|q,a,y))are learned during training, the two distributions are mu-tually restrained while being updated. Therefore, NVDMsimply penalises the large µ and encourages qφ(h|X) to


approach the prior p(h) for every document X , but inNASM, pθ(h|q) acts like a moving baseline distributionwhich regularises the update of qφ(h|q,a,y)) for everydifferent conditions. In practice, we carry out early stop-ping by observing the prediction performance on develop-ment dataset for the question answer selection task. Us-ing the same learning rate and neural network structure,LSTM+Att reaches optimal performance and starts to over-fit on training dataset generally at the 20th iteration, whileNASM starts to overfit around the 35th iteration.

More interestingly, in the question answer selection exper-iments, NASM learns more peaked attention scores than itsdeterministic counterpart LSTM+Att. For the update pro-cess of LSTM+Att, we find there exists a relatively big vari-ance in the gradients w.r.t. question semantics (LSTM+Attapplies deterministic sq(|q|) while NASM applies stochas-tic h). This is because the training dataset is small and con-tains many negative answer sentences that brings no benefitbut noise to the learning of the attention model. In con-trast, for the update process of NASM, we observe morestable gradients w.r.t. the parameters of latent distributions.The optimisation of the lower bound on one hand max-imises the conditional log-likelihood (that the deterministiccounterpart cares about) and on the other hand minimisesthe KL-divergence (that regularises the gradients). Hence,each update of the lower bound actually keeps the gradientsw.r.t. µ from swinging heavily. Besides, since the valuesof σ are not very significant in this case, the distributionof attention scores mainly depends on µ. Therefore, thelearning of the attention model benefits from the regular-isation as well, and it explains the fact that NASM learnsmore peaked attention scores which in turn helps achieve abetter prediction performance.

Since the computations of NVDM and NASM can be par-allelised on GPU and only one sample is required dur-ing training process, it is very efficient to carry out theneural variational inference. Moreover, for both NVDMand NASM, all the parameters are updated by back-propagation. Thus, the increased computation time for thestochastic units only comes from the added parameters ofthe inference network.

7. Related WorkTraining an inference network to approximate the vari-ational distribution was first proposed in the context ofHelmholtz machines (Hinton & Zemel, 1994; Hinton et al.,1995; Dayan & Hinton, 1996), but applications of these di-rected generative models come up against the problem ofestablishing low variance gradient estimators. Recent ad-vances in neural variational inference mitigate this prob-lem by reparameterising the continuous random variables(Rezende et al., 2014; Kingma & Welling, 2014), using

control variates (Mnih & Gregor, 2014) or approximat-ing the posterior with importance sampling (Bornschein &Bengio, 2015). The instantiations of these ideas (Gregoret al., 2015; Kingma et al., 2014; Ba et al., 2015) havedemonstrated strong performance on the tasks of imageprocessing. The recent variants of generative auto-encoder(Louizos et al., 2015; Makhzani et al., 2015) are also verycompetitive. Tang & Salakhutdinov (2013) applies the sim-ilar idea of introducing stochastic units for expression clas-sification, but its inference is carried out by Monte CarloEM algorithm with the reliance on importance sampling,which is less efficient and lack of scalability.

Another class of neural generative models make use of theautoregressive assumption (Larochelle & Murray, 2011;Uria et al., 2014; Germain et al., 2015; Gregor et al.,2014). Applications of these models on document mod-elling achieve significant improvements on generating doc-uments, compared to conventional probabilistic topic mod-els (Hofmann, 1999; Blei et al., 2003) and also the RBMs(Hinton & Salakhutdinov, 2009; Srivastava et al., 2013).While these models that use binary semantic vectors, ourNVDM employs dense continuous document representa-tions which are both expressive and easy to train. The se-mantic word vector model (Maas et al., 2011) also employsa continuous semantic vector to generate words, but themodel is trained by MAP inference which does not permitthe calculation of the posterior distribution. A very similaridea to NVDM is Bowman et al. (2015), which employsVAE to generate sentences from a continuous space.

Apart from the work mentioned above, there is other in-teresting work on question answering with deep neural net-works. One of the popular streams is mapping factoid ques-tions with answer triples in the knowledge base (Bordeset al., 2014a;b; Yih et al., 2014). Moreover, Weston et al.(2015); Sukhbaatar et al. (2015); Kumar et al. (2015) fur-ther exploit memory networks, where long-term memoriesact as dynamic knowledge bases. Another attention-basedmodel (Hermann et al., 2015) applies the attentive networkto help read and comprehend for long articles.

8. ConclusionThis paper introduced a deep neural variational inferenceframework for generative models of text. We experimentedon two diverse tasks, document modelling and question an-swer selection tasks to demonstrate the effectiveness of thisframework, where in both cases our models achieve stateof the art performance. Apart from the promising results,our model also has the advantages of (1) simple, expres-sive, and efficient when training with the SGVB algorithm;(2) suitable for both unsupervised and supervised learningtasks; and (3) capable of generalising to incorporate anytype of neural network.


ReferencesAndrieu, Christophe, De Freitas, Nando, Doucet, Arnaud,

and Jordan, Michael I. An introduction to mcmc for ma-chine learning. Machine learning, 50(1-2):5–43, 2003.

Attias, Hagai. A variational bayesian framework for graph-ical models. In Proceedings of NIPS, 2000.

Ba, Jimmy, Grosse, Roger, Salakhutdinov, Ruslan, andFrey, Brendan. Learning wake-sleep recurrent attentionmodels. In Proceedings of NIPS, 2015.

Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio,Yoshua. Neural machine translation by jointly learningto align and translate. In Proceedings of ICLR, 2015.

Beal, Matthew James. Variational algorithms for approxi-mate Bayesian inference. University of London, 2003.

Blei, David M, Ng, Andrew Y, and Jordan, Michael I. La-tent dirichlet allocation. The Journal of Machine Learn-ing Research, 3:993–1022, 2003.

Bordes, Antoine, Chopra, Sumit, and Weston, Jason. Ques-tion answering with subgraph embeddings. In Proceed-ings of EMNLP, 2014a.

Bordes, Antoine, Weston, Jason, and Usunier, Nicolas.Open question answering with weakly supervised em-bedding models. In Proceedings of ECML, 2014b.

Bornschein, Jorg and Bengio, Yoshua. Reweighted wake-sleep. In Proceedings of ICLR, 2015.

Bowman, Samuel R., Vilnis, Luke, Vinyals, Oriol, Dai, An-drew M., Jozefowicz, Rafal, and Bengio, Samy. Gen-erating sentences from a continuous space. CoRR,abs/1511.06349, 2015. URL http://arxiv.org/abs/1511.06349.

Dayan, Peter and Hinton, Geoffrey E. Varieties ofhelmholtz machine. Neural Networks, 9(8):1385–1403,1996.

Germain, Mathieu, Gregor, Karol, Murray, Iain, andLarochelle, Hugo. Made: Masked autoencoder for dis-tribution estimation. In Proceedings of ICML, 2015.

Gregor, Karol, Mnih, Andriy, and Wierstra, Daan. Deepautoregressive networks. In Proceedings of ICML, 2014.

Gregor, Karol, Danihelka, Ivo, Graves, Alex, and Wierstra,Daan. Draw: A recurrent neural network for image gen-eration. In Proceedings of ICML, 2015.

Hermann, Karl Moritz, Kocisky, Tomas, Grefenstette, Ed-ward, Espeholt, Lasse, Kay, Will, Suleyman, Mustafa,and Blunsom, Phil. Teaching machines to read and com-prehend. In Proceedings of NIPS, 2015.

Hinton, Geoffrey E and Salakhutdinov, Ruslan. Replicatedsoftmax: an undirected topic model. In Proceedings ofNIPS, 2009.

Hinton, Geoffrey E and Zemel, Richard S. Autoencoders,minimum description length, and helmholtz free energy.In Proceedings of NIPS, 1994.

Hinton, Geoffrey E, Dayan, Peter, Frey, Brendan J, andNeal, Radford M. The wake-sleep algorithm for un-supervised neural networks. Science, 268(5214):1158–1161, 1995.

Hochreiter, Sepp and Schmidhuber, Jurgen. Long short-term memory. Neural computation, 9(8):1735–1780,1997.

Hofmann, Thomas. Probabilistic latent semantic indexing.In Proceedings of SIGIR, 1999.

Jordan, Michael I, Ghahramani, Zoubin, Jaakkola,Tommi S, and Saul, Lawrence K. An introductionto variational methods for graphical models. Machinelearning, 37(2):183–233, 1999.

Kingma, Diederik P. and Ba, Jimmy. Adam: A method forstochastic optimization. In Proceedings of ICLR, 2015.

Kingma, Diederik P and Welling, Max. Auto-encodingvariational bayes. In Proceedings of ICLR, 2014.

Kingma, Diederik P, Mohamed, Shakir, Rezende,Danilo Jimenez, and Welling, Max. Semi-supervisedlearning with deep generative models. In Proceedingsof NIPS, 2014.

Kumar, Ankit, Irsoy, Ozan, Su, Jonathan, Bradbury, James,English, Robert, Pierce, Brian, Ondruska, Peter, Gulra-jani, Ishaan, and Socher, Richard. Ask me anything: Dy-namic memory networks for natural language process-ing. arXiv preprint arXiv:1506.07285, 2015.

Larochelle, Hugo and Lauly, Stanislas. A neural autore-gressive topic model. In Proceedings of NIPS, 2012.

Larochelle, Hugo and Murray, Iain. The neural autoregres-sive distribution estimator. In Proceedings of AISTATS,2011.

Le, Quoc V. and Mikolov, Tomas. Distributed represen-tations of sentences and documents. In Proceedings ofICML, 2014.

Louizos, Christos, Swersky, Kevin, Li, Yujia, Welling,Max, and Zemel, Richard. The variational fair auto en-coder. arXiv preprint arXiv:1511.00830, 2015.

http://arxiv.org/abs/1511.06349



Maas, Andrew L, Daly, Raymond E, Pham, Peter T, Huang,Dan, Ng, Andrew Y, and Potts, Christopher. Learningword vectors for sentiment analysis. In Proceedings ofACL, 2011.

Makhzani, Alireza, Shlens, Jonathon, Jaitly, Navdeep, andGoodfellow, Ian J. Adversarial autoencoders. CoRR,abs/1511.05644, 2015. URL http://arxiv.org/abs/1511.05644.

Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Gre-gory S., and Dean, Jeffrey. Distributed representationsof words and phrases and their compositionality. In Pro-ceedings of NIPS, 2013.

Mnih, Andriy and Gregor, Karol. Neural variational infer-ence and learning in belief networks. In Proceedings ofICML, 2014.

Neal, Radford M. Probabilistic inference using markovchain monte carlo methods. Technical report: CRG-TR-93-1, 1993.

Rezende, Danilo J, Mohamed, Shakir, and Wierstra, Daan.Stochastic backpropagation and approximate inferencein deep generative models. In Proceedings of ICML,2014.

Severyn, Aliaksei. Modelling input texts: from Tree Ker-nels to Deep Learning. PhD thesis, University of Trento,2015.

Srivastava, Nitish, Salakhutdinov, RR, and Hinton, Geof-frey. Modeling documents with deep boltzmann ma-chines. In Proceedings of UAI, 2013.

Sukhbaatar, Sainbayar, Szlam, Arthur, Weston, Jason, andFergus, Rob. End-to-end memory networks. In Proceed-ings of NIPS, 2015.

Tang, Yichuan and Salakhutdinov, Ruslan R. Learningstochastic feedforward neural networks. In ProceedingsNIPS, 2013.

Uria, Benigno, Murray, Iain, and Larochelle, Hugo. A deepand tractable density estimator. In Proceedings of ICML,2014.

Wang, Mengqiu, Smith, Noah A, and Mitamura, Teruko.What is the jeopardy model? a quasi-synchronous gram-mar for qa. In Proceedings of EMNLP-CoNLL, 2007.

Wang, Zhiguo and Ittycheriah, Abraham. Faq-based ques-tion answering via word alignment. arXiv preprintarXiv:1507.02628, 2015.

Weston, Jason, Chopra, Sumit, and Bordes, Antoine. Mem-ory networks. In Proceedings of ICLR, 2015.

Yang, Yi, Yih, Wen-tau, and Meek, Christopher. Wikiqa: Achallenge dataset for open-domain question answering.In Proceedings of EMNLP, 2015.

Yih, Wen-tau, Chang, Ming-Wei, Meek, Christopher, andPastusiak, Andrzej. Question answering using enhancedlexical semantic models. In Proceedings of ACL, 2013.

Yih, Wen-tau, He, Xiaodong, and Meek, Christopher. Se-mantic parsing for single-relation question answering. InProceedings of ACL, 2014.

Yu, Lei, Hermann, Karl Moritz, Blunsom, Phil, and Pul-man, Stephen. Deep Learning for Answer Sentence Se-lection. In NIPS Deep Learning Workshop, 2014.




A. t-SNE Visualisation of DocumentRepresentations

10 5 0 5 10 1510

5

0

5

10

(a) Neural Variational Document Model

10 5 0 5 1010

5

0

5

10

(b) Semantic Word Vector

Figure 6. t-SNE visualisation of the document representationsachieved by (a) NVDM and (b) SWV (Maas et al., 2011) on theheld-out test dataset of 20NewsGroups. The documents are col-lected from 20 different news groups, which correspond to thepoints with different colour in the figure.

B. Details of the Deep Neural NetworkStructures

B.1. Neural Variational Document Model

(1) Inference Network qφ(h|X):

λ = ReLU(W 1X + b1) (19)π = ReLU(W 2λ + b2) (20)

µ =W 3π + b3 (21)logσ =W 4π + b4 (22)

h ∼ N (µ(X),diag(σ2(X))) (23)

(2) Generative Model pθ(X|h):

ei = exp(−hTRxi + bxi) (24)pθ(xi|h) = ei∑|V |

j ej(25)

pθ(X|h) =∏Ni pθ(xi|h) (26)

(3) KL Divergence DKL[qφ(h|X)||p(h)]:

DKL = − 12 (K − ‖µ‖

2 − ‖σ‖2 + log |diag(σ2)|)(27)

The variational lower bound to be optimised:

L =Eqφ(h|X)

[∑N

i=1log pθ(xi|h)

]−DKL[qφ(h|X)||p(h)] (28)

≈∑L

l=1

∑N

i=1log pθ(xi|h(l))

+1

2(K − ‖µ‖2 − ‖σ‖2 + log |diag(σ2)|) (29)

B.2. Neural Answer Selection Model

(1) Inference Network qφ(h|q,a,y):

sq(|q|) = f LSTMq (q) (30)

sa(|a|) = f LSTMa (a) (31)

sy =W 5y + b5 (32)γ = sq(|q|)||sa(|a|)||sy (33)λφ = tanh(W 6γ + b6) (34)πφ = tanh(W 7λφ + b7) (35)µφ =W 8πφ + b8 (36)

logσφ =W 9πφ + b9 (37)h ∼ N (µφ(q,a,y),diag(σ

2φ(q,a,y))) (38)

(2) Generative Model

pθ(h|q):

λθ = tanh(W 1sq(|q|) + b1) (39)πθ = tanh(W 2λθ + b2) (40)µθ =W 3πθ + b3 (41)

logσθ =W 4πθ + b4 (42)

pθ(y|q,a,h):

e(i) =W Tα tanh(W hh+W ssa(i)) (43)

α(i) = e(i)∑j e(j)

(44)

c(a,h) =∑i sa(i)α(i) (45)

za(a,h) = tanh(W ac(a,h) +W nsa(|a|)) (46)zq(q) = sq(|q|) (47)

pθ(y = 1|q,a,h) = σ(zTqMza + b) (48)


(3) KL Divergence DKL[qφ(h|q,a,y)||pθ(h|q)]:

DKL =− 1

2(K + log |diag(σ2

φ)| − log |diag(σ2θ)|

− Tr(diag(σ2φ) diag

−1(σ2θ))

− (µφ − µθ)T diag−1(σ2θ)(µφ − µθ)) (49)

The variational lower bound to be optimised:

L = Eqφ(h|q,a,y)[log pθ(y|q,a,h)]−DKL[qφ(h|q,a,y)||pθ(h|q)] (50)

≈L∑l=1

[y log σ(zTqMz(l)a + b)

+ (1− y) log(1− σ(zTqMz(l)a + b))]

+1

2(K + log |diag(σ2

φ)| − log |diag(σ2θ)|

− Tr(diag(σ2φ) diag

−1(σ2θ))

− (µφ − µθ)T diag−1(σ2θ)(µφ − µθ)) (51)

C. Computational ComplexityThe computational complexity of NVDM for a trainingdocument is Cφ + Cθ = O(LK2 + KSV ). Here, Cφ =O(LK2) represents the cost for the inference network togenerate a sample, where L is the number of the layers inthe inference network and K is the average dimension ofthese layers. Besides, Cθ = O(KSV ) is the cost of re-constructing the document from a sample, where S is theaverage length of the documents and V represents the vol-ume of words applied in this document model, which isconventionally much lager than K.

The computational complexity of NASM for a trainingquestion-answer pair is Cφ+Cθ = O((L+S)K2+SW ).The inference network needs Cφ = 2SW + 2K +LK2 =O(LK2 + SW ). It takes 2SW + 2K to produce thejoint representation for a question-answer pair and its label,whereW is the total number of parameters of an LSTM andS is the average length of the sentences. Based on the jointrepresentation, an MLP spends LK2 to generate a sam-ple, where L is the number of layers and K represents theaverage dimension. The generative model requires Cθ =2SW+LK2+SK2+5K2+2K2 = O((L+S)K2+SW ).Similarly, it costs 2SW + LK2 to construct the genera-tive latent distribution , where 2SW can be saved if theLSTMs are shared by the inference network and the gener-ative model. Besides, the attention model takes SK2+5K2

and the relatedness prediction takes the last 2K2.

Since the computations of NVDM and NASM can be par-allelised in GPU and only one sample is required duringtraining process, it is very efficient to carry out the neu-ral variational inference. As NVDM is an instantiation ofvariational auto-encoder, its computational complexity isthe same as the deterministic auto-encoder. In addition, thecomputational complexity of LSTM+Att, the deterministiccounterpart of NASM, is alsoO((L+S)K2+SW ). Thereis only O(LK2) time increase by introducing an inferencenetwork for NASM when compared to LSTM+Att.

Neural Variational Inference for Text ProcessingNeural Variational Inference for Text Processing...

Documents

Transcript of Neural Variational Inference for Text ProcessingNeural Variational Inference for Text Processing...