
Bayesian Learning and Variational Inference

Davide Bacciu

Dipartimento di Informatica, Università di Pisa
[email protected]

Intelligent Systems for Pattern Recognition (ISPR)


Outline and Motivations

- Introduce the basic concepts of variational learning, useful for both generative models and deep learning
- Bayesian latent variable models: a class of generative models for which variational or approximated methods are needed
- Latent Dirichlet Allocation: possibly the simplest Bayesian latent variable model, with many applications in unsupervised text analytics, machine vision, ...
- A very quick intro to variational EM


Problem Setup

Latent Variable Models

Latent variables:
- Unobserved RVs that define a hidden generative process of the observed data
- Explain complex relations between a large number of observable variables
- E.g. hidden states in HMM/CRF

Latent variable model likelihood:

$$P(\mathbf{x}) = \int_z \prod_{i=1}^{N} P(x_i \mid z)\, P(z)\, dz$$


Latent Space

Define a latent space where high-dimensional data can be represented.

Assumption: the conditional and marginal distributions of the latent variables are more tractable than the joint distribution P(X) (e.g. K ≪ N).


Tractability

- Introducing hidden variables can produce couplings between the distributions (i.e. one depending on the other), which can make their posterior intractable
- Bayesian learning introduces priors, which bring integrals into the posterior computations that are not always analytically or computationally tractable

This lecture is about how we can approximate such intractable problems:
- Variational view of EM (used in variational DL)


Kullback-Leibler (KL) Divergence

An information-theoretic measure of the closeness of two distributions p and q

$$\mathrm{KL}(q\,\|\,p) = \mathbb{E}_q\!\left[\log \frac{q(z)}{p(z\mid x)}\right] = \langle \log q(z)\rangle_q - \langle \log p(z\mid x)\rangle_q$$

Note: this is a specialized definition for our latent variable setting.

- If q is high and p is high ⇒ happy
- If q is high and p is low ⇒ unhappy
- If q is low ⇒ we don't care (due to the expectation)

It is a divergence, not a distance ⇒ it is not symmetric.
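Since KL is central to everything that follows, here is a minimal NumPy sketch (toy numbers of our own, not from the slides) making the asymmetry concrete:

```python
import numpy as np

def kl(q, p):
    """Discrete KL(q||p) = E_q[log q - log p]."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    return float(np.sum(q * np.log(q / p)))

q = np.array([0.5, 0.4, 0.1])
p = np.array([0.3, 0.3, 0.4])

print(kl(q, p))  # ~0.23
print(kl(p, q))  # ~0.31 -- a divergence, not a distance: KL(q||p) != KL(p||q)
```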


Jensen's Inequality

A property of expectations (convex combinations) of convex/concave functions. For a concave f it generalizes to

$$f\!\left(\frac{\sum_i a_i x_i}{\sum_i a_i}\right) \ge \frac{\sum_i a_i f(x_i)}{\sum_i a_i}$$

Applied in probability theory:

$$f(\mathbb{E}[X]) \ge \mathbb{E}[f(X)]$$

following from concavity:

$$f(\lambda x_1 + (1-\lambda)x_2) \ge \lambda f(x_1) + (1-\lambda)f(x_2), \quad \lambda \in [0,1]$$
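A quick Monte Carlo check of the concave case used on the next slide (log is concave); the sampling choices are arbitrary and our own:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)  # any positive random variable

# For concave f (here f = log): f(E[X]) >= E[f(X)]
print(np.log(x.mean()))   # log E[X]  ~ log 2 ~ 0.69
print(np.log(x).mean())   # E[log X] ~ 0.12 for this distribution, smaller
```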


Bounding Log-Likelihood with Jensen

The log-likelihood for a model with a single hidden variable Z and parameters θ (assume a single sample for simplicity) is

$$\log P(x\mid\theta) = \log \int_z P(x,z\mid\theta)\, dz = \log \int_z \frac{Q(z\mid\phi)}{Q(z\mid\phi)}\, P(x,z\mid\theta)\, dz$$

which holds for Q(z|φ) ≠ 0, with variational parameters φ. Given the definition of expectation, this rewrites as (by Jensen)

$$\log P(x\mid\theta) = \log \mathbb{E}_Q\!\left[\frac{P(x,z)}{Q(z)}\right] \ge \mathbb{E}_Q\!\left[\log \frac{P(x,z)}{Q(z)}\right] = \underbrace{\mathbb{E}_Q[\log P(x,z)]}_{\text{expectation of joint distribution}} - \underbrace{\mathbb{E}_Q[\log Q(z)]}_{\text{entropy}} = \mathcal{L}(x,\theta,\phi)$$
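The bound is easy to verify numerically when z is discrete, so the integral becomes a sum. A toy sketch (the joint values and the name `elbo` are ours):

```python
import numpy as np

# Single observation x, discrete latent z in {0,1,2}: the joint P(x, z)
# is then just a vector over z (arbitrary illustrative values).
p_xz = np.array([0.10, 0.25, 0.05])
log_px = np.log(p_xz.sum())              # exact log P(x) by marginalizing z

def elbo(q):
    """L(x) = E_Q[log P(x,z)] - E_Q[log Q(z)] for a discrete Q."""
    q = np.asarray(q, float)
    return float(np.sum(q * np.log(p_xz)) - np.sum(q * np.log(q)))

posterior = p_xz / p_xz.sum()            # P(z|x)
for q in ([1/3, 1/3, 1/3], [0.2, 0.7, 0.1], posterior):
    print(round(elbo(q), 4), "<=", round(log_px, 4))
# The bound is tight (equality) exactly when Q(z) = P(z|x).
```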


How Good is this Lower Bound?

$$\log P(x\mid\theta) - \mathcal{L}(x,\theta,\phi) = \,?$$

Inserting the definition of $\mathcal{L}(x,\theta,\phi)$:

$$\log P(x) - \int_z Q(z) \log \frac{P(x,z)}{Q(z)}\, dz$$

Introducing $Q(z)$ by marginalization ($\int_z Q(z)\,dz = 1$):

$$\int_z Q(z) \log P(x)\, dz - \int_z Q(z) \log \frac{P(x,z)}{Q(z)}\, dz = \int_z Q(z) \log \frac{Q(z)\,P(x)}{P(x,z)}\, dz = \mathrm{KL}(Q(z\mid\phi)\,\|\,P(z\mid x,\theta))$$

where the last step uses $P(x,z) = P(z\mid x)\,P(x)$.


Defining and Interpreting the Bound

We can assume the existence of a distribution Q(z|φ) which allows us to bound the likelihood P(x|θ) from below using L(x, θ, φ).

The term L(x, θ, φ) is called the variational bound or evidence lower bound (ELBO).

The optimal bound is obtained for KL(Q(z|φ)||P(z|x, θ)) = 0, that is, if we choose Q(z|φ) = P(z|x, θ).

Minimizing the KL is equivalent to maximizing the ELBO ⇒ we trade a sampling problem for an optimization problem.


Variational View of Expectation Maximization

EM Learning Reformulated
Maximum likelihood learning with hidden variables can be approached by maximizing the ELBO

$$\max_{\theta,\phi} \sum_{n=1}^{N} \mathcal{L}(x_n, \theta, \phi)$$

where θ are the model parameters and φ are the variational parameters of Q(z|φ).

- If P(z|x, θ) is tractable ⇒ use it as Q(z|φ) (optimal ELBO)
- Otherwise, choose Q(z|φ) from a tractable family of distributions:
  - find φ that minimizes KL(Q(z|φ)||P(z|x, θ)), or equivalently
  - find φ that maximizes L(·, φ)


A Generative Model for Multinomial Data

A Bag of Words (BOW) representation of a document is the classical example of multinomial data (for text, images, graphs, ...)

A BOW dataset (corpus) is the N × M term-document matrix

$$X = \begin{bmatrix} x_{11} & \dots & x_{1i} & \dots & x_{1M} \\ \vdots & & \vdots & & \vdots \\ x_{j1} & \dots & x_{ji} & \dots & x_{jM} \\ \vdots & & \vdots & & \vdots \\ x_{N1} & \dots & x_{Ni} & \dots & x_{NM} \end{bmatrix}$$

- N: number of vocabulary items w_j
- M: number of documents d_i
- x_ji = n(w_j, d_i): number of occurrences of w_j in d_i
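As a concrete sketch, scikit-learn's CountVectorizer (one of the libraries cited at the end of the lecture) builds exactly this count matrix; the toy corpus is ours:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "dogs and cats make good pets"]   # M = 3 toy documents

vec = CountVectorizer()
X = vec.fit_transform(docs)               # sparse M x N document-term counts
print(vec.get_feature_names_out())        # the N vocabulary items w_j
print(X.toarray().T)                      # transpose to the N x M layout above
```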


Documents as Mixtures of Latent Variables

Latent topic models consider documents (i.e. item containers) as a mixture of topics

- A topic identifies a pattern in the co-occurrence of multinomial items w_j within the documents
- Mixture of topics ⇒ associate an interpretation (topic) to each item in a document; the document's interpretation is then a mixture of its items' topics



Latent Dirichlet Allocation (LDA)

LDA models a document as a mixture of topics z:

- each item w in the document is assigned its own topic z, drawn with probability P(z|θ)
- the item is then generated from its topic with probability P(w|z, β)

Key point: each document has its own topic proportion θ, sampled from a distribution.

θ defines a multinomial distribution, but it is a random variable as well.


LDA Distributions

- P(w|z, β): multinomial item-topic distribution
- P(z|θ): multinomial topic distribution with document-specific parameter θ
- P(θ|α): Dirichlet distribution with hyperparameter α
  - a distribution over vectors that sum to 1 (the simplex)
  - the parameters of a multinomial are exactly such a vector!


Dirichlet Distribution

Why a Dirichlet distribution?
- It is the conjugate prior to the multinomial distribution
- If the likelihood is multinomial with a Dirichlet prior, then the posterior is Dirichlet

Dirichlet distribution

$$P(\theta\mid\alpha) = \frac{\Gamma\!\left(\sum_{k=1}^{K}\alpha_k\right)}{\prod_{k=1}^{K}\Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}$$

- The Dirichlet parameter α_k acts as a prior count of the k-th topic
- It controls the mean shape and sparsity of the multinomial parameters θ
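A short sampling sketch (plain NumPy) showing how α controls sparsity, anticipating the α = 0.001 example on the next slide:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5
for alpha in (0.1, 1.0, 10.0):
    theta = rng.dirichlet(np.full(K, alpha), size=3)   # 3 draws from Dir(alpha)
    print(f"alpha={alpha}:")
    print(np.round(theta, 2))
# Small alpha -> sparse theta (mass concentrated on few topics);
# large alpha -> near-uniform topic proportions.
```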


Geometric Interpretation

LDA finds a set of K projection functions onto the K-dimensional topic simplex


Effect of the α parameter

[Figure: Dirichlet samples with α = 0.001]

Slide Credit - Blei at KDD 2011 Tutorial


LDA and Text Analysis


LDA Generative Process

For each of the M documents:
- Choose θ ∼ Dirichlet(α)
- For each of the N items:
  - Choose a topic z ∼ Multinomial(θ)
  - Pick an item w_j with multinomial probability P(w_j|z, β)

The multinomial topic-item parameters form a matrix $[\beta]_{K\times V}$ with

$$\beta_{kj} = P(w_j = 1 \mid z_k = 1), \quad \text{i.e. } P(w_j = 1 \mid z = k)$$

The joint distribution is

$$P(\theta, \mathbf{z}, \mathbf{w}\mid\alpha, \beta) = P(\theta\mid\alpha) \prod_{j=1}^{N} P(z_j\mid\theta)\, P(w_j\mid z_j, \beta)$$


Learning in LDA

Marginal distribution (a.k.a. likelihood) of a document d = w

$$P(\mathbf{w}\mid\alpha,\beta) = \int \sum_{\mathbf{z}} P(\theta,\mathbf{z},\mathbf{w}\mid\alpha,\beta)\, d\theta = \int P(\theta\mid\alpha) \prod_{j=1}^{N} \sum_{z_j=1}^{K} P(z_j\mid\theta)\, P(w_j\mid z_j,\beta)\, d\theta$$

Given {w1, . . . ,wM}, find (α, β) maximizing

$$\mathcal{L}(\alpha,\beta) = \log \prod_{i=1}^{M} P(\mathbf{w}_i\mid\alpha,\beta)$$

Learning with hidden variables ⇒ Expectation-Maximization. The key problem is inferring the latent variable posterior

$$P(\theta,\mathbf{z}\mid\mathbf{w},\alpha,\beta) = \frac{P(\theta,\mathbf{z},\mathbf{w}\mid\alpha,\beta)}{P(\mathbf{w}\mid\alpha,\beta)}$$


Posterior Inference

The optimal ELBO is achieved when Q(z) is equal to the latent variable posterior

$$P(\theta,\mathbf{z}\mid\mathbf{w},\alpha,\beta) = \frac{P(\theta,\mathbf{z},\mathbf{w}\mid\alpha,\beta)}{P(\mathbf{w}\mid\alpha,\beta)}$$

The key problem is that computation of this posterior is not tractable: the denominator is intractable due to the coupling between β and θ in the summation over topics

$$P(\mathbf{w}\mid\alpha,\beta) = \frac{\Gamma\!\left(\sum_{k=1}^{K}\alpha_k\right)}{\prod_{k=1}^{K}\Gamma(\alpha_k)} \int \prod_{k=1}^{K}\theta_k^{\alpha_k - 1} \prod_{j=1}^{N} \sum_{k=1}^{K} \prod_{v=1}^{V} (\theta_k \beta_{kv})^{w_j^v}\, d\theta$$

where $w_j^v = 1$ iff the j-th item is the v-th vocabulary term.


Approximating Parameter Inference in LDA

Variational Inference
- Maximize the variational bound without using the optimal posterior solution
- Write a Q(z|φ) distribution that is sufficiently similar to the posterior but tractable
- Q(z|φ) should be such that β and θ are no longer coupled
- Fit the φ parameters so that Q(z|φ) is close to the posterior P(θ, z|w, α, β) according to KL
- Variational LDA: Blei, Ng and Jordan, 2003. Takes hours to converge (but it is an approximation).

Sampling Approach
- Construct a Markov chain on the hidden variables whose limiting distribution is the posterior
- Sampling LDA: Griffiths and Steyvers, 2004. Takes days to converge (but it is accurate).


Variational Inference

Key Idea

Assume that our distribution Q(z|φ) factorizes (making it tractable) → the mean-field assumption

$$Q(\mathbf{z}\mid\phi) = Q(z_1,\dots,z_K\mid\phi) = \prod_{k=1}^{K} Q(z_k\mid\phi_k)$$

- Can be made more general by factorizing over groups of latent variables
- The factorized family does not contain the true posterior, because the hidden variables are actually dependent

Variational inference:
- Optimize the ELBO using the factorized distribution Q(z|φ)
- Coordinate ascent inference: iteratively optimize each variational distribution, holding the others fixed


Variational LDA Distribution

Given Φ = {γ, φ, λ} as variational approximation parameters

$$Q(\theta,\mathbf{z},\beta\mid\Phi) = Q(\theta\mid\gamma) \prod_{n=1}^{N} Q(z_n\mid\phi_n) \prod_{k=1}^{K} Q(\beta_k\mid\lambda_k)$$

Then we have the model parameters Ψ = {α, β} of the sample distribution P(θ, z, w|α, β) = P(θ, z, w|Ψ).


Variational Expectation-Maximization

Find the Φ,Ψ that maximize the ELBO

$$\mathcal{L}(\mathbf{w},\Phi,\Psi) = \mathbb{E}_Q[\log P(\theta,\mathbf{z},\mathbf{w}\mid\Psi)] - \mathbb{E}_Q[\log Q(\theta,\mathbf{z},\beta\mid\Phi)]$$

by alternating maximization:
1. repeat
2. Fix Ψ: update the variational parameters Φ* (E-STEP)
3. Fix Φ = Φ*: update the model parameters Ψ* (M-STEP)
4. until little likelihood improvement

Unlike EM, variational EM has no guarantee of reaching a local maximizer of L.
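Structurally, the alternation is just coordinate ascent on the ELBO. A generic skeleton (the `e_step`, `m_step` and `elbo` callables are model-specific and assumed given; nothing here is LDA-specific):

```python
def variational_em(data, psi, phi, e_step, m_step, elbo,
                   tol=1e-4, max_iters=100):
    """Alternating maximization of the ELBO L(data, psi, phi)."""
    prev = float("-inf")
    for _ in range(max_iters):
        phi = e_step(data, psi, phi)   # fix model params Psi, update Phi (E-STEP)
        psi = m_step(data, psi, phi)   # fix Phi, update model params Psi (M-STEP)
        cur = elbo(data, psi, phi)
        if cur - prev < tol:           # stop on little bound improvement
            break
        prev = cur
    return psi, phi
```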


LDA Applications

Why use latent topic models?
- Organize large collections of documents by identifying shared topics
- Understand document semantics (unsupervised)
- Documents can be of a different nature:
  - Text
  - Images
  - Video
  - Relational data (graphs, time-series, etc.)

In short: a model for collections of high-dimensional vectors whose attributes follow multinomial distributions.


Understanding Image Collections

How can we apply latent topic analysis to visual documents?

We need a way to represent visual content as in text:
- Text ≡ collection of discrete items ⇒ words
- Image ≡ collection of discrete items ⇒ ?

Visual patches:
- Feature detectors to identify relevant image parts (MSER)
- Feature descriptors to represent content (SIFT)

How can I obtain a discrete vocabulary for visual terms?


Building a Visterm Vocabulary

Given a dataset of images:
1. For each image I:
   - Identify interesting points (MSER/SIFT/grid)
   - Extract the corresponding descriptors (SIFT)
2. Concatenate the image descriptors into a 128 × N matrix, where N is the total number of descriptors extracted
3. Cluster the descriptors into C groups to obtain a vocabulary of C visterms (k-means)

You know all the necessary techniques to build this system!
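A sketch of the codebook construction with scikit-learn's KMeans; the descriptor file and sizes are hypothetical placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans

# All SIFT descriptors from the image collection, stacked row-wise:
# shape (N, 128), i.e. the slide's 128 x N matrix transposed to the
# sklearn convention. The file name is a hypothetical placeholder.
descriptors = np.load("all_sift_descriptors.npy")

C = 500                                             # vocabulary size (arbitrary)
codebook = KMeans(n_clusters=C, n_init=10).fit(descriptors)

def bow_vector(image_descriptors):
    """Bag-of-visterms: histogram of codebook assignments for one image."""
    visterms = codebook.predict(image_descriptors)  # nearest visterm per patch
    return np.bincount(visterms, minlength=C)
```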


Representing an Image as a Bag of Items

- Each image I is a document, and each visual patch inside it is an item
- Associate each patch to the nearest cluster/visterm c
- Count the occurrences of each dictionary visterm c in your image
- Represent the image as a vector of visterm counts

[Pipeline figure: input image → local patches → quantization against the k-means codebook → BOW count vector, repeated over the image collection]


LDA Image Understanding

Assigning a topic to each visual patch


Unsupervised Semantic Segmentation

- Combine latent topics with Markov random fields
- Use LDA to identify topics of some pixel patches
- Use the MRF to diffuse LDA topics and enforce a coherent pixel-level semantic segmentation

Zhao, Fei-Fei and Xing, Image Segmentation with Topic Random Field, ECCV 2010


Dynamic Topic Models

- LDA assumes that document order does not matter
- What if we want to track topic evolution over time?
  - Tracking how language changes over time
  - Videos are image documents over time

Blei and Lafferty. Dynamic topic models, ICML 2006


Topic Evolution over Time

https://github.com/blei-lab/dtm


Topic Trends

https://github.com/blei-lab/dtm


Relational Topic Models

- Using topic models with relational data (graphs)
- Community discovery and connectivity pattern profiles (Kemp, Griffiths, Tenenbaum, 2004)
- Joint content-connectivity analysis (Blei, Chang, 2010)


Variational Learning in Code

- PyMC3: Python library with a particular focus on variational algorithms (not PyMC!)
- Edward: Python library with lots of variational inference, from the father of LDA
- BayesPy: variational Bayesian inference for the conjugate-exponential family only
- Autograd: variational and deep learning with differentiation as a native Python operator (no strange backend)
- Matlab has no official support for variational learning, but there are standalone implementations of various models (check Variational-Bayes.org)
- LDA is implemented in many Python libraries: scikit-learn, pypi, gensim (efficient topic models)
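For instance, a minimal scikit-learn LDA run (the toy corpus and topic count are ours; scikit-learn fits LDA with variational EM under the hood):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat sat on the mat", "dogs chase cats in the yard",
        "stocks fell sharply on monday", "markets rallied after the report"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)                        # BOW count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)                       # per-document topic mixtures

vocab = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):        # topic-item pseudo-counts
    top = topic.argsort()[-3:][::-1]
    print(f"topic {k}:", [vocab[i] for i in top])
```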


Take Home Messages

Bayesian learning amounts to treating distributions as random variables sampled from another distribution:
- Add priors to ML distributions
- Learn functions instead of point estimates

Latent Dirichlet Allocation:
- A Bayesian model to organize collections of multinomial data
- Unsupervised latent representation learning

Variational lower bound:
- Maximizing a lower bound of an intractable likelihood
- Alternately estimate the variational parameters and maximize w.r.t. the model parameters
- A fundamental concept for understanding variational deep learning


Next Lecture

Sampling Methods
- Introduction to sampling methods
- Ancestral sampling
- Gibbs sampling
- The MCMC family and advanced methods