Bayesian Learning and Variational Inference · for bothgenerative models and deep learning Bayesian...

IntroductionBayesian Learning

Applications

Bayesian Learning and Variational Inference

Davide Bacciu

Dipartimento di InformaticaUniversità di Pisabacciu@di.unipi.it

Intelligent Systems for Pattern Recognition (ISPR)

Applications

Tractability ProblemDefinitionsVariational Bound

Outline and Motivations

Introduce the basic concepts of variational learning usefulfor both generative models and deep learningBayesian latent variable models

A class of generative models for which variational orapproximated methods are needed

Latent Dirichlet AllocationPossibly the simplest Bayesian latent variable modelMany applications in unsupervised text analytics, machinevision, ...

A very quick intro to variational EM

Applications

Problem Setup

Latent Variable Models

Latent variablesUnobserved RV that define an hiddengenerative process of observed dataExplain complex relation between a largenumber of observable variablesE.g. hidden states in HMM/CRF

Latent variable models likelihood

P(x) =

N∏i=1

P(xi |z)P(z)dz

Applications

Latent Space

Define a latent space where high-dimensional data can berepresented

AssumptionLatent variablesconditional and marginaldistributions are moretractable than the jointdistribution P(X ) (e.gK � N)

Applications

Tractability

Introducing hidden variables can produce couplingsbetween the distributions (i.e. one depending on the other)which can make their posterior intractableBayesian learning introduces priors which introduceintegrals in the posterior computations which are notalways analytically or computationally tractable

This lecture is about how we can approximate such intractableproblems

Variational view of EM (used in variational DL)

Applications

Kullback-Leibler (KL) Divergence

An information theoretic measure of closeness of twodistributions p and q

KL(q||p) = Eq

p(z|x)

]= 〈log q(z)〉q − 〈log p(z|x)〉q

Note:A specialized definition for our latent variable setting

If q high and p high⇒ happyIf q high and p low⇒ unhappyIf q low⇒ don’t care (due to expectation)

Its a divergence⇒ it is not symmetric

Applications

Jensen Inequality

Property of linear operators on convex/concave functions

Generalizes to∑i ai f (xi)∑

ai≥ f∑

i aixi∑ai

Applied in probabilitytheory

f (E[X ]) ≥ E[f (X )]

λf (x) + (1− λ)f (x) ≥ f (λx + (1− λ)x)

Applications

Bounding Log-Likelihood with Jensen

The log-likelihood for a model with a single hidden variable Zand parameters θ (assume single sample for simplicity) is

log P(x |θ) = log∫

zP(x , z|θ)dz = log

Q(z|φ)

Q(z|φ)P(x , z|θ)dz

which holds for Q(z|φ) 6= 0 with parameters φGiven the definition of expectation this rewrites as (Jensen)

log P(x |θ) = log EQ

[P(x , z)

]≥ EQ

P(x , z)

]= EQ [log P(x , z)]︸︷︷︸

Expectation of Joint Distribution

−EQ [log Q(z)]︸︷︷︸Entropy

= L(x , θ, φ)

Applications

How Good is this Lower Bound?

log P(x |θ)− L(x , θ, φ) =?

Inserting the definition of L(x , θ, φ)

log P(x)−∫

zQ(z) log

P(x , z)

Introducing Q(z) by marginalization (∫

zQ(z) = 1)

∫zQ(z) log P(x)−

Q(z) logP(x , z)

= KL(Q(z|φ)||P(z|x , θ))

Applications

Defining and Interpreting the Bound

We can assume the existence of a probability Q(z|φ) whichallows to bound the likelihood P(x |θ) from below usingL(x , θ, φ)

The term L(x , θ, φ) is called variational bound or evidencelower bound (ELBO)

The optimal bound is obtained for KL(Q(z|φ)||P(z|x , θ)) = 0,that is if we choose Q(z|φ) = P(z|x , θ)

Minimizing KL is equivalent to maximize the ELBO⇒ change asampling problem with an optimization problem

Applications

Variational View of Expectation Maximization

EM Learning ReformulatedMaximum likelihood learning with hidden variables can beapproached by maximization of the ELBO

maxθ,φ

N∑n=1

L(xn, θ, φ)

where θ are the model parameters and φ serve in Q(z|φ)

If P(z|x , θ) is tractable⇒ use it as Q(z|φ) (optimal ELBO)O.w. choose Q(z|φ) as a tractable family of distributions

find φ that minimize KL(Q(z|φ)||P(z|x , θ)), orfind φ that maximize L(·, φ)

Applications

Latent Dirichlet AllocationApproximated Inference

A Generative Model for Multinomial Data

A Bag of Words (BOW) representation of a document is theclassical example of multinomial data (for text, images,graphs,...)

A BOW dataset (corpora) is the N ×M term-document matrix

x11 . . . x1i . . . x1M. . . . . . . . . . . .xj1 . . . xji . . . xjM. . . . . . . . . . . .xN1 . . . xNi . . . xNM

N: number of vocabularyitems wj

M: number of documents di

xij = n(wj ,di): number ofoccurrences of wj in di

Applications

Documents as Mixtures of Latent Variables

Latent topic models consider documents (i.e. item containers)as a mixture of topics

A topic identifies a pattern in the co-occurrence ofmultinomial items wj within the documentsMixture of topics⇒ Associate an interpretation (topic) toeach item in a document, whose interpretation is then amixture of the items’ topics

x11 . . . x1i . . . x1M. . . . . . . . . . . . . . .xj1 . . . xji . . . xjM. . . . . . . . . . . . . . .xN1 . . . xNi . . . xNM

Applications

Latent Dirichlet Allocation (LDA)

LDA models a document as amixture of topics z

Assigning one topic z to each itemw with probability P(w |z, β)Pick one topic for the the wholedocument with probability P(z|θ)

Key point - Each document has itspersonal topic proportion θ sampledfrom a distribution

θ defines a multinomial distributionbut it is a random variable as well

Applications

LDA Distributions

P(w |z, β) multinomial item-topicdistributionP(z|θ) multinomial topic distributionwith document-specific parameter θP(θ|α) Dirichlet distribution withhyperparameter α

A distribution for vectors that sumto 1 (simplex)The elements of a multinomial arevector that sum to 1!

Applications

Dirichlet Distribution

Why a Dirichlet distribution?Conjugate prior to multinomial distributionIf the likelihood is multinomial with a Dirichlet prior thenposterior is Dirichlet

Dirichlet distribution

P(θ|α) =Γ(∑K

k=1 αk

k=1 Γ(αk )

K∏k=1

θαk−1k

Dirichlet parameter αk is a prior count of the k -th topicIt controls the mean shape and sparsity of multinomialparameters θ

Applications

Geometric Interpretation

LDA finds a set of K projection functions on the K -dimensionaltopic simplex

Applications

Effect of the α parameter

α = 0.001

Slide Credit - Blei at KDD 2011 Tutorial

Applications

LDA and Text Analysis

Applications

LDA Generative Process

For each of the M documentsChoose θ ∼ Dirichlet(α)

For each of the N itemsChoose a topic z ∼ Multinomial(θ)Pick an item wj with multinomialprobability P(wj |z, β)

Multinomial topic-item parameter matrix[β]K×V

βkj = P(wj = 1|zk = 1)

or P(wj = 1|z = k)

P(θ, z,w|α, β) = P(θ|α)N∏

P(zj |θ)P(wj |zj , β)

Applications

Learning in LDA

Marginal distribution (a.k.a. likelihood) of a document d = w

P(w|α, β) =

∫ ∑z

P(θ, z,w|α, β)dθ

∫P(θ|α)

N∏j=1

k∑zj=1

P(zj |θ)P(wj |zj , β)dθ

Given {w1, . . . ,wM}, find (α, β) maximizing

L(α, β) = logM∏

P(wi |α, β)

Learning with hidden variables⇒ Expectation-MaximizationKey problem is inferring latent variables posterior

P(θ, z|w, α, β) =P(θ, z,w|α, β)

P(w|α, β)

Applications

Posterior Inference

Optimal ELBO is achieved when Q(z) is equal to the latentvariable posterior

P(θ, z|w, α, β) =P(θ, z,w|α, β)

P(w|α, β)

Key problem is that computation of the posterior is nottractableComputation of the denominator is intractable due to thecouplings between β and θ in the summation over topics

P(w|α, β) =Γ(∑K

k=1 αk

k=1 Γ(αk )

∫ K∏k=1

θαk−1k

N∏j=1

K∑k=1

V∏v=1

(θkβkv )wvj

Applications

Approximating Parameter Inference in LDA

Variational InferenceMaximize the variational bound without using the optimalposterior solution

Write a Q(z|φ) function that is sufficiently similar to theposterior but tractableQ(z|φ) should be such that β and θ are no longer coupledFit φ parameter so that Q(z|φ) is close to P(w|α, β)according to KL

Variational LDA: Blei, Ng and Jordan, 2003Takes hours to converge (but it is an approximation)

Sampling ApproachConstruct a Markov chain on the hidden variables whoselimiting distribution is the posteriorSampling LDA: Griffiths and Steyvers, 2004Takes days to converge (but it is accurate)

Applications

Variational Inference

Key Idea

Assume that our distribution Q(z|φ) factorizes (it is tractable)→mean-field assumption

Q(z|φ) = Q(z1, . . . , zK |φ) =K∏

Q(zk |φk )

Can be made more general by factorizing on groups oflatent variablesDoes not contain the true posterior because hiddenvariables are dependentVariational inference

Optimize ELBO using Q(z|φ) factorized distributionCoordinate ascent inference - Iteratively optimize eachvariational distribution holding the others fixed

Applications

Variational LDA Distribution

Given Φ = {γ, φ, λ} as variational approximation parameters

Q(θ, z, β|Φ) = Q(θ|γ)N∏

Q(zn|φn)K∏

Q(βk |λk )

Then we have the model parameters Ψ = α, β of sampledistribution P(θ, z,w|α, β) = P(θ, z,w|Ψ)

Applications

Variational Expectation-Maximization

Find the Φ,Ψ that maximize the ELBO

L(w,Φ,Ψ) = EQ [log P(θ, z,w|Ψ)]− EQ [log Q(θ, z,Ψ|Φ)]

by alternate maximization1 repeat2 Fix Ψ: update variational parameters Φ∗ (E-STEP)3 Fix Φ = Φ∗: update model parameters Ψ∗ (M-STEP)4 until little likelihood improvement

Unlike EM, variational EM has no guarantee to reach a localmaximizer of L

Applications

LDA in Pattern RecognitionSoftwareConclusions

LDA Applications

Why using latent topic models?Organize large collections of documents by identifyingshared topicsUnderstanding the documents semantics (unsupervised)Documents are of different nature

TextImagesVideoRelational data (graphs, time-series, etc..)

In short: a model for collections of high-dimensionalvectors whose attributes are multinomial distributions

Applications

Understanding Image Collections

How can we apply latent topic analysis to visual documents?

We need a way to representvisual content as in text

Text ≡ collection of discreteitems⇒ wordsImage ≡ collection ofdiscrete items⇒ ?

Visual patchesFeature detectors to identifyrelevant image parts(MSER)Feature descriptors torepresent content (SIFT)How can I obtain a discretevocabulary for visual terms?

Applications

Building a Visterm Vocabulary

Given a dataset of images1 For each image I

Identify interesting points (MSER/SIFT/grid)Extract the corresponding descriptors (SIFT)

2 Concatenate the image descriptors in a 128× N matrix,where N is the total number of descriptors extracted

3 Cluster the descriptors in C groups to obtain a vocabularyof C visterms (k-means)

You know all the necessary techniques to build this system!

Applications

Representing Image as a Bag of Items

Each image I is a document and each visual patch inside itis an itemAssociate each patch to the nearest cluster/visterm cCount the occurrences of each dictionary visterm c in yourimageRepresent the image as a vector of visterm counts

Codebook

CollectionImage

Image Collection

Vector

k−means

Input Image

Local patches

Quantization

Applications

LDA Image Understanding

Assigning a topic to each visual patch

Applications

Unsupervised Semantic Segmentation

Combine latent topics with Markov random fieldsUse LDA to identify topics of some pixel patchesUse MRF to diffuse LDA topics and enforce coherentpixel-level semantic segmentation

Zhao, Fei-Fei and Xing, Image Segmentation with Topic Random Field, ECCV 2010

Applications

Dynamical Topic Models

LDA assumes that the document order does not countWhat if we want to track topic evolution over time?Tracking how language changes over timeVideos are image documents over time

Blei and Lafferty. Dynamic topic models, ICML 2006

Applications

Topic Evolution over Time

https://github.com/blei-lab/dtm

Applications

Topic Trends

https://github.com/blei-lab/dtm

Applications

Relational Topic Models

Using topic models with relational data (graphs)Community discovery and connectivity pattern profiles(Kemp, Griffiths, Tenenbaum, 2004)Joint content-connectivity analysis (Blei,Chang, 2010)

Applications

Variational Learning in Code

PyMC3 - Python library with particular focus on variationalalgorithms (not PyMC!)Edward - Python library with lots of variational inferencefrom the father of LDABayespy - Variational Bayesian inference forconjugate-exponential family onlyAutograd - Variational and deep learning with differentiationas native Python operator (no strange backend)Matlab does not have official support for variationallearning but standalone implementation of various models(check Variational-Bayes.org)LDA is implemented in many Python libraries: scikit-learn,pypi, gensim (efficient topic models)

Applications

Take Home Messages

Bayesian learning amounts to treating distributions asrandom variables sampled from another distribution

Add priors to ML distributionsLearn functions instead of point estimates

Latent Dirichlet AllocationBayesian model to organize collections of multinomial dataUnsupervised latent representation learning

Variational lower boundMaximizing a lower bound of an intractable likelihoodAlternatively estimate variational parameters and maximizew.r.t model parametersA fundamental concept to understand variational deeplearning

Applications

Next Lecture

Sampling MethodsIntroduction to sampling methodsAncestral samplingGibbs SamplingMCMC family and advanced methods

Bayesian Learning and Variational Inference · for bothgenerative models and deep learning Bayesian...

Documents

Transcript of Bayesian Learning and Variational Inference · for bothgenerative models and deep learning Bayesian...

Bayesian Model Comparisonwpenny/bdb/bmc.pdf · 2011-06-02 · Bayes rule for models Bayes factors Nonlinear Models Variational Laplace Free Energy Complexity Decompositions AIC and

Parallelization of Variational Inference for Bayesian ...

Variational Bayesian inference for stochastic …ensslin/Bayes_Forum/...2017/10/13 · Variational Bayesian inference for stochastic processes Manfred Opper, AI group, TU Berlin October

Variational Regularized Bayesian Estimation for Joint Blur … · 2008-10-27 · Variational Regularized Bayesian Estimation for Joint Blur Identiﬁcation and Edge-Driven Image Restoration

Regularized Variational Bayesian Learning of Echo State ...kulkarni/Papers/Journals/j083... · Regularized Variational Bayesian Learning of Echo State Networks with Delay&Sum Readout

Efﬁcient Variational Bayesian Approximation Method Based ... · PDF fileEfﬁcient Variational Bayesian Approximation Method Based on Subspace Optimization ... trodet@satie.ens-cachan.fr

Building Blocks for Variational Bayesian Learning of ...

Skew-Normal Variational Approximations for Bayesian Inference · Skew-Normal Variational Approximations for Bayesian Inference ... Raudenbush, Yang & Yosef, 2000; Rue, Martino ...

A Geometric Variational Approach to Bayesian Inference · A Geometric Variational Approach to Bayesian Inference Abhijoy Saha 1, Karthik Bharath2, Sebastian Kurtek 1Department of

Walsh-Hadamard Variational Inference for Bayesian Deep ......Bayesian neural networks [35, 38] represent a good example of models for which inference is intractable, and for which

Variational Bayesian inference - Kay Brodersen · Variational Bayesian inference is based on variational calculus. Variational calculus Standard calculus Newton, Leibniz, and others

VARIATIONAL BAYESIAN PHYLOGENETIC INFERENCE

Variational Bayesian learning of generative models - · PDF fileVariational Bayesian learning of generative models 71 distribution. We use a particular variational method known as

Collapsed Variational Bayesian Inference of the Author ...people.csail.mit.edu › ythomas › publications › 2016AuthorTopicCVB-… · Collapsed Variational Bayesian Inference

Variational Bayesian learning of generative models · 70 Variational Bayesian learning of generative models 3.1 Bayesian modeling and variational learning Unsupervised learning methods

Fast Variational Bayesian Linear State-Space Modelusers.ics.aalto.fi/jluttine/ecml2013/ecml2013_slides.pdf · Fast Variational Bayesian Linear State-Space Model Jaakko Luttinen Aalto

Variational Bayesian analysis for hidden Markov models

VINE: A VARIATIONAL INFERENCE-BASED BAYESIAN NEURAL ... · ComputingBased Realization of Spiking Neural Networks” also known as “VINE: A Variational Inference- Based-Bayesian

Variational Bayesian Logistic Regressionsrihari/CSE574/Chap4/4.5.2-VarBayesLogistic.pdfLinear Models for Classification • Overview 1.Discriminant Functions 2.Probabilistic Generative

Grid-less Variational Bayesian Inference of Line Spectral ...