Page 1: Parametric Bayesian Models: Part I

Mingyuan Zhou and Lizhen Lin

Department of Information, Risk, and Operations Management
Department of Statistics and Data Sciences

The University of Texas at Austin

Machine Learning Summer School, Austin, TX
January 07, 2015

1 / 63

Page 2: Outline for Part I

• Bayes’ rule, likelihood, prior, posterior

• Hierarchical Bayesian models

• Gibbs sampling

• Sparse factor analysis

• Dictionary learning and sparse coding
• Sparse priors on the factor scores
  • Spike-and-slab sparse prior
  • Bayesian Lasso shrinkage prior
• Bayesian dictionary learning
  • Image denoising and inpainting
  • Introduce covariate dependence
  • Matrix completion

[Figure: a P×N count matrix X (words × documents) factorized as X = ΦΘ, with Φ a P×K topics matrix (words × topics) and Θ a K×N matrix (topics × documents); similarly, a P×N matrix X of images factorized as X = ΦΘ, with Φ a P×K dictionary and Θ a K×N matrix of sparse codes.]
2 / 63

Page 3: Outline for Part II

• Bayesian modeling of count data
  • Poisson, gamma, and negative binomial distributions
  • Bayesian inference for the negative binomial distribution
  • Regression analysis for counts
• Latent variable models for discrete data
  • Latent Dirichlet allocation
  • Poisson factor analysis
• Relational network analysis

[Figure: the same word-document and image-patch matrix factorizations X = ΦΘ as on the previous slide.]

3 / 63

Page 4: Topics that will not be covered

• Mixture models (except for topic models and stochastic blockmodels)
• Hidden Markov models
• Classification, naive Bayes
• Markov chain Monte Carlo (MCMC) inference beyond Gibbs sampling
  • Metropolis-Hastings, rejection sampling, slice sampling, etc.
• Variational Bayes inference
• Model selection
• Bayesian nonparametrics
  • Gaussian processes
  • Completely random measures, gamma process, beta process
  • Normalized random measures, Dirichlet process
  • Chinese restaurant process, Indian buffet process, negative binomial process
  • Hierarchical Dirichlet process, gamma-negative binomial process, beta-negative binomial process

4 / 63

Page 5: Bayes' rule

• In equation:

  P(θ|X) = P(X|θ)P(θ) / P(X) = P(X|θ)P(θ) / ∫ P(X|θ)P(θ) dθ

  If θ is discrete, then ∫ f(θ) dθ is replaced with Σ f(θ).

• In words:

  Posterior of θ given X = (Conditional Likelihood × Prior) / Marginal Likelihood

5 / 63

Page 6: The i.i.d. assumption

• Usually X = {x1, . . . , xn} represents the data and θ represents the model parameters.
• One usually assumes that the {xi}i are independent and identically distributed (i.i.d.) conditioning on θ.
• Under the conditional i.i.d. assumption:
  • P(X|θ) = ∏_{i=1}^n P(xi|θ).
  • The data in X are exchangeable, which means that P(x1, . . . , xn) = P(xσ(1), . . . , xσ(n)) for any permutation σ of the data indices 1, 2, . . . , n.

6 / 63

Page 7: Marginal likelihood and predictive distribution

• Marginal likelihood:

  P(X) = ∫ P(X, θ) dθ = ∫ P(X|θ)P(θ) dθ

• Predictive distribution of a new data point x_{n+1} (under the i.i.d. assumption):

  P(x_{n+1}|X) = ∫ P(x_{n+1}|θ) P(θ|X) dθ

• The integrals are usually difficult to calculate. A popular approach is Monte Carlo integration:
  • Construct a Markov chain to draw S random samples {θ^(s)}_{s=1}^S from P(θ|X).
  • Approximate the integral as

    P(x_{n+1}|X) ≈ (1/S) Σ_{s=1}^S P(x_{n+1}|θ^(s))
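For concreteness, here is a minimal Python sketch of this Monte Carlo approximation, assuming a Poisson likelihood with a conjugate gamma prior so the posterior can be sampled directly (the model and all values are illustrative, not from the slides):

```python
import numpy as np
from scipy.stats import poisson, gamma

# Hypothetical conjugate setup: x_i ~ Pois(lambda), lambda ~ Gamma(alpha, beta) (rate beta),
# so the posterior is Gamma(alpha + sum(x), beta + n) and can be sampled directly.
rng = np.random.default_rng(0)
x = rng.poisson(3.0, size=50)                 # observed data
alpha, beta = 1.0, 1.0                        # prior hyper-parameters
S = 10000
lam = gamma.rvs(alpha + x.sum(), scale=1.0 / (beta + len(x)), size=S, random_state=rng)

# Monte Carlo predictive: P(x_{n+1} = k | X) ~ (1/S) * sum_s Pois(k; lambda^(s))
k = np.arange(15)
pred = poisson.pmf(k[:, None], lam[None, :]).mean(axis=1)
print(np.round(pred, 4))
```

Because this toy model is conjugate, the exact predictive is a negative binomial distribution, which makes a convenient check on the Monte Carlo estimate.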

7 / 63

Page 8: Selecting an appropriate data likelihood P(X|θ)

Select an appropriate conditional likelihood P(X|θ) to describe your data. Some common choices:

• Real-valued: normal distribution x ~ N(μ, σ²)

  P(x|μ, σ²) = (1/√(2πσ²)) exp[−(x − μ)²/(2σ²)]

• Real-valued vector: multivariate normal distribution x ~ N(μ, Σ)

• Gaussian maximum likelihood and least squares: finding a μ that minimizes the least squares objective function

  Σ_{i=1}^n (xi − μ)²

  is the same as finding a μ that maximizes the Gaussian likelihood

  ∏_{i=1}^n (1/√(2πσ²)) exp[−(xi − μ)²/(2σ²)]

8 / 63

Page 9

• Binary data: Bernoulli distribution x ~ Bernoulli(p)

  P(x|p) = p^x (1 − p)^(1−x), x ∈ {0, 1}

• Count data: non-negative integers
  • Poisson distribution x ~ Pois(λ)

    P(x|λ) = λ^x e^(−λ) / x!, x ∈ {0, 1, . . .}

  • Negative binomial distribution x ~ NB(r, p)

    P(x|r, p) = (Γ(x + r)/(x! Γ(r))) p^x (1 − p)^r, x ∈ {0, 1, . . .}

9 / 63

Page 10

• Positive real-valued:
  • Gamma distribution
    • x ~ Gamma(k, θ), where k is the shape parameter and θ is the scale parameter:

      P(x|k, θ) = (θ^(−k)/Γ(k)) x^(k−1) e^(−x/θ), x ∈ (0, ∞)

    • Or x ~ Gamma(α, β), where α = k is the shape parameter and β = θ^(−1) is the rate parameter:

      P(x|α, β) = (β^α/Γ(α)) x^(α−1) e^(−βx), x ∈ (0, ∞)

  • Truncated normal distribution

10 / 63

Page 11

• Categorical: (x1, . . . , xk) ~ Multinomial(n, p1, . . . , pk)

  P(x1, . . . , xk | n, p1, . . . , pk) = (n!/∏_{i=1}^k xi!) p1^(x1) · · · pk^(xk)

  where xi ∈ {0, . . . , n} and Σ_{i=1}^k xi = n.

• Ordinal, ranking

• Vector, matrix, tensor

• Time series

• Tree, graph, network, etc.

11 / 63

Page 12: Constructing an appropriate prior P(θ)

• Construct an appropriate prior P(θ) to impose prior information, regularize the joint likelihood, and help derive efficient inference.

• Informative and non-informative priors: one may set the hyper-parameters of the prior distribution to reflect different levels of prior beliefs.

• Conjugate priors

• Hierarchical priors

12 / 63

Page 13: Conjugate priors

If the prior P(θ) is conjugate to the likelihood P(X|θ), then the posterior P(θ|X) and the prior P(θ) are in the same family.

• Conjugate priors are widely used to construct hierarchical Bayesian models.

• Although conjugacy is not required for MCMC inference, it helps develop closed-form Gibbs sampling update equations.

13 / 63

Page 14

• Example (i): beta is conjugate to Bernoulli.

  xi|p ~ Bernoulli(p), p ~ Beta(β0, β1)

• Conditional likelihood:

  P(x1, . . . , xn|p) = ∏_{i=1}^n p^(xi) (1 − p)^(1−xi)

• Prior:

  P(p|β0, β1) = (Γ(β0 + β1)/(Γ(β0)Γ(β1))) p^(β0−1) (1 − p)^(β1−1)

• Posterior:

  P(p|X, β0, β1) ∝ {∏_{i=1}^n p^(xi) (1 − p)^(1−xi)} {p^(β0−1) (1 − p)^(β1−1)}

  (p|x1, . . . , xn, β0, β1) ~ Beta(β0 + Σ_{i=1}^n xi, β1 + n − Σ_{i=1}^n xi)

• Both the prior and posterior of p are beta distributed.
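A minimal numerical sketch of this conjugate update, using the coin-flip data from the next slide (scipy's beta distribution provides the posterior summaries):

```python
from scipy.stats import beta

# Beta-Bernoulli conjugate update: prior Beta(b0, b1), data with `heads` ones out of n flips
b0, b1 = 2.0, 2.0
heads, n = 8, 10
posterior = beta(b0 + heads, b1 + n - heads)   # Beta(10, 4), as on the next slide

print(posterior.mean())          # posterior mean of p: 10/14, about 0.714
print(posterior.interval(0.95))  # central 95% posterior credible interval for p
```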

14 / 63

Page 15

Flip a coin 10 times, observe 8 heads and 2 tails. Is this a fair coin?

• Model 1: xi|p ~ Bernoulli(p), p ~ Beta(2, 2)
• Black is the prior probability density function: p ~ Beta(2, 2)
• Red is the posterior probability density function: (p|x1, . . . , x10) ~ Beta(10, 4)

[Plot: the prior and posterior probability density functions of p.]

15 / 63

Page 16

Flip a coin 10 times, observe 8 heads and 2 tails. Is this a fair coin?

• Model 2: xi|p ~ Bernoulli(p), p ~ Beta(50, 50)
• Black is the prior probability density function: p ~ Beta(50, 50)
• Red is the posterior probability density function: (p|x1, . . . , x10) ~ Beta(58, 52)

[Plot: the prior and posterior probability density functions of p.]

16 / 63

Page 17

Flip a coin 100 times, observe 80 heads and 20 tails. Is this a fair coin?

• Model 2: xi|p ~ Bernoulli(p), p ~ Beta(50, 50)
• Black is the prior probability density function: p ~ Beta(50, 50)
• Red is the posterior probability density function: (p|x1, . . . , x100) ~ Beta(130, 70)

[Plot: the prior and posterior probability density functions of p.]

17 / 63

Page 18: Data, prior, and posterior

• When the data are the same: the data have a stronger influence on the posterior if the prior is weaker.

• When the prior is the same: more observations usually reduce the uncertainty in the posterior.

18 / 63

Page 19

• Example (ii): the gamma distribution is the conjugate prior for the precision parameter of the normal distribution.

  xi|μ, φ ~ N(μ, φ^(−1)), φ ~ Gamma(α, β)

• Conditional likelihood:

  P(x1, . . . , xn|μ, φ) ∝ φ^(n/2) exp[−φ Σ_{i=1}^n (xi − μ)²/2]

• Prior:

  P(φ|α, β) ∝ φ^(α−1) e^(−βφ)

• Posterior:

  P(φ|−) ∝ {φ^(n/2) e^(−φ Σ_{i=1}^n (xi − μ)²/2)} {φ^(α−1) e^(−βφ)}

  (φ|−) ~ Gamma(α + n/2, β + Σ_{i=1}^n (xi − μ)²/2)

• Both the prior and posterior of φ are gamma distributed.
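A small sketch of this update on synthetic data, with the mean μ assumed known (all values are illustrative):

```python
import numpy as np
from scipy.stats import gamma

# Gamma-normal conjugacy for the precision phi, conditioning on a known mean mu
rng = np.random.default_rng(1)
mu, true_phi = 0.0, 4.0
x = rng.normal(mu, 1.0 / np.sqrt(true_phi), size=200)

alpha, beta = 1.0, 1.0    # prior Gamma(alpha, beta), rate parameterization
post = gamma(alpha + len(x) / 2, scale=1.0 / (beta + np.sum((x - mu) ** 2) / 2))
print(post.mean())        # posterior mean of phi, close to the true value 4
```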

19 / 63

Page 20

• Example (iii): xi ~ N(μ, φ^(−1)), μ ~ N(μ0, φ0^(−1))

• Example (iv): xi ~ Poisson(λ), λ ~ Gamma(α, β)

• Example (v): xi ~ NegBino(r, p), p ~ Beta(α0, α1)

• Example (vi): xi ~ Gamma(α, β), β ~ Gamma(α0, β0)

• Example (vii):

  (xi1, . . . , xik) ~ Multinomial(ni, p1, . . . , pk), (p1, . . . , pk) ~ Dirichlet(α1, . . . , αk),

  where the Dirichlet density is (Γ(Σ_{j=1}^k αj)/∏_{j=1}^k Γ(αj)) ∏_{j=1}^k pj^(αj−1).

20 / 63

Page 21: Hierarchical priors

• One may construct a complex prior distribution using a hierarchy of simple distributions as

  P(θ) = ∫ · · · ∫ P(θ|αt) P(αt|αt−1) · · · P(α1) dα1 · · · dαt

• Draw θ from P(θ) using a hierarchical model:

  θ|αt, . . . , α1 ~ P(θ|αt)
  αt|αt−1, . . . , α1 ~ P(αt|αt−1)
  · · ·
  α1 ~ P(α1)

21 / 63

Page 22

• Example (i): beta-negative binomial distribution¹

  n|λ ~ Pois(λ), λ|r, p ~ Gamma(r, p/(1 − p)), p ~ Beta(α, β)

  P(n|r, α, β) = ∫∫ Pois(n; λ) Gamma(λ; r, p/(1 − p)) Beta(p; α, β) dλ dp

  P(n|r, α, β) = (Γ(r + n)/(n! Γ(r))) · (Γ(β + r) Γ(α + n) Γ(α + β))/(Γ(α + β + r + n) Γ(α) Γ(β)), n ∈ {0, 1, . . .}

• A complicated probability mass function for a discrete random variable arises from a simple beta-gamma-Poisson mixture.

¹ Here p/(1 − p) represents the scale parameter of the gamma distribution.
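The hierarchy is easy to verify by simulation; the sketch below (illustrative parameter values) draws n through the beta-gamma-Poisson chain and compares the empirical frequencies with the closed-form pmf above:

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(2)
r, a, b = 2.0, 3.0, 4.0
S = 200000

# Forward simulation: p ~ Beta(a, b), lambda ~ Gamma(r, scale p/(1-p)), n ~ Pois(lambda)
p = rng.beta(a, b, size=S)
lam = rng.gamma(r, p / (1 - p))
n = rng.poisson(lam)

def bnb_logpmf(k, r, a, b):
    """Closed-form beta-negative binomial log pmf, as given above."""
    return (gammaln(r + k) - gammaln(k + 1) - gammaln(r)
            + gammaln(b + r) + gammaln(a + k) + gammaln(a + b)
            - gammaln(a + b + r + k) - gammaln(a) - gammaln(b))

for k in range(5):
    print(k, round((n == k).mean(), 4), round(float(np.exp(bnb_logpmf(k, r, a, b))), 4))
```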

22 / 63

Page 23

• Example (ii): Student's t-distribution

  x|φ ~ N(0, φ^(−1)), φ ~ Gamma(α, β)

  P(x) = ∫ N(x; 0, φ^(−1)) Gamma(φ; α, β) dφ = (Γ(α + 1/2)/(√(2βπ) Γ(α))) (1 + x²/(2β))^(−α−1/2)

  If α = β = ν/2, then P(x) = t_ν(x) is the Student's t-distribution with ν degrees of freedom.

23 / 63

Page 24

• Example (iii): Laplace distribution (e.g., Park and Casella, JASA 2008)

  x|η ~ N(0, η), η ~ Exp(γ²/2), γ > 0

  P(x) = ∫ N(x; 0, η) Exp(η; γ²/2) dη = (γ/2) e^(−γ|x|)

  P(x) is the probability density function of the Laplace distribution, and hence

  x ~ Laplace(0, γ^(−1))

• The Student's t and Laplace distributions are two widely used sparsity-promoting priors.
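The same identity can be checked by simulation; a minimal sketch with an illustrative value of γ:

```python
import numpy as np

rng = np.random.default_rng(3)
gam, S = 2.0, 500000

# Scale mixture: eta ~ Exp(rate gamma^2/2), then x | eta ~ N(0, eta)
eta = rng.exponential(scale=2.0 / gam**2, size=S)   # numpy uses scale = 1/rate
x = rng.normal(0.0, np.sqrt(eta))

# Compare a crude empirical density with the Laplace density (gamma/2) exp(-gamma |x|)
for t in [0.0, 0.5, 1.0]:
    h = 0.05
    emp = (np.abs(x - t) < h).mean() / (2 * h)
    print(t, round(emp, 3), round(gam / 2 * np.exp(-gam * abs(t)), 3))
```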

24 / 63

Page 25

Black: x ~ N[0, (√2)²]; Red: x ~ t_0.5; Blue: x ~ Laplace(0, 2)

[Plot: the three probability density functions over x ∈ [−10, 10].]

25 / 63

Page 26

Black: x ~ N[0, (√2)²]; Red: x ~ t_0.5; Blue: x ~ Laplace(0, 2)

[Plot: the corresponding log probability density functions over x ∈ [2, 10].]

26 / 63

Page 27: Priors and regularizations

• Different priors can be matched to different regularizations as

  − ln P(θ|X) = − ln P(X|θ) − ln P(θ) + C,

  where C is a term that is not related to θ.

• Assume that the data are generated as xi ~ N(μ, 1) and the goal is to find a maximum a posteriori probability (MAP) estimate of μ.
  • If μ ~ N(0, φ^(−1)), then the MAP estimate is the same as

    argmin_μ Σ_{i=1}^n (xi − μ)² + φμ²

  • If μ ~ t_ν, then the MAP estimate is the same as

    argmin_μ Σ_{i=1}^n (xi − μ)² + (ν + 1) ln(1 + ν^(−1)μ²)

  • If μ ~ Laplace(0, γ^(−1)), then the MAP estimate is the same as

    argmin_μ Σ_{i=1}^n (xi − μ)² + 2γ|μ|
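A small numerical sketch of the Laplace-prior case (hypothetical data; in this one-dimensional problem the MAP estimate is the soft-thresholded sample mean, which the numerical minimizer should reproduce):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
x = rng.normal(0.3, 1.0, size=20)
gam = 5.0

# MAP objective for mu under the Laplace(0, 1/gamma) prior and N(mu, 1) likelihood
obj = lambda mu: np.sum((x - mu) ** 2) + 2 * gam * np.abs(mu)
map_mu = minimize_scalar(obj).x

# Closed form: soft-threshold the sample mean at gamma / n
soft = np.sign(x.mean()) * max(abs(x.mean()) - gam / len(x), 0.0)
print(x.mean(), map_mu, soft)   # the L1 penalty shrinks the estimate toward zero
```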

27 / 63

Page 28

A typical advantage of solving a hierarchical Bayesian model over solving a related regularized objective function:

• The regularization parameters, such as φ, ν and γ in the last slide, often have to be cross-validated.

• In a hierarchical Bayesian model, we usually impose (possibly conjugate) priors on these parameters and infer their posteriors given the data.

• If we impose non-informative priors, then we let the data speak for themselves.

28 / 63

Page 29: Inference via Gibbs sampling

• Gibbs sampling:
  • The simplest Markov chain Monte Carlo (MCMC) algorithm.
  • A special case of the Metropolis-Hastings algorithm.
  • Widely used for statistical inference.

• For a multivariate distribution P(x1, . . . , xn) that is difficult to sample from, if it is simpler to sample each of its variables conditioning on all the others, then we may use Gibbs sampling to obtain samples from this distribution as follows:
  • Initialize (x1, . . . , xn) at some values.
  • For s = 1 : S
      For i = 1 : n
        Sample xi conditioning on the others from P(xi|x1, . . . , xi−1, xi+1, . . . , xn)
      End
    End
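A compact example of this loop for a target whose full conditionals are known in closed form, a bivariate normal with correlation ρ (a standard illustration, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(5)
rho, S = 0.9, 5000

# Gibbs sampler for (x1, x2) ~ N(0, [[1, rho], [rho, 1]]):
# each full conditional is x_i | x_j ~ N(rho * x_j, 1 - rho^2)
x1, x2 = 0.0, 0.0
samples = np.empty((S, 2))
for s in range(S):
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
    samples[s] = (x1, x2)

print(np.corrcoef(samples[1000:].T)[0, 1])   # close to 0.9 after discarding burn-in
```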

29 / 63

Page 30

• A complicated multivariate distribution (Zhou and Walker, 2014):

  p(z1, . . . , zn | n, γ0, a, p) = (γ0^l p^(−al) / Σ_{ℓ=0}^n γ0^ℓ p^(−aℓ) S_a(n, ℓ)) ∏_{k=1}^l (Γ(nk − a)/Γ(1 − a)),

  where the zi are categorical random variables, l is the number of distinct values in {z1, . . . , zn}, nk = Σ_{i=1}^n δ(zi = k), and S_a(n, ℓ) are generalized Stirling numbers of the first kind.

• Gibbs sampling is easy:
  • Initialize (z1, . . . , zn) at some values.
  • For s = 1 : S
      For i = 1 : n
        Sample zi from
        P(zi = k | z1, . . . , zi−1, zi+1, . . . , zn, n, γ0, a, p) ∝ { nk^(−i) − a, for k = 1, . . . , l^(−i);  γ0 p^(−a), if k = l^(−i) + 1 }
      End
    End

30 / 63

Page 31: Gibbs sampling in a hierarchical Bayesian model

• Full joint likelihood of the hierarchical Bayesian model:

  P(X, θ, αt, . . . , α1) = P(X|θ) P(θ|αt) P(αt|αt−1) · · · P(α1)

• Exact posterior inference is often intractable. We use Gibbs sampling for approximate inference.

• Assume in the hierarchical Bayesian model that:
  • P(θ|αt) is conjugate to P(X|θ);
  • P(αt|αt−1) is conjugate to P(θ|αt);
  • P(αj|αj−1) is conjugate to P(αj+1|αj) for j ∈ {1, . . . , t − 1}.

31 / 63

Page 32

• In each MCMC iteration, Gibbs sampling proceeds as:
  • Sample θ from P(θ|X, αt) ∝ P(X|θ) P(θ|αt);
  • For j ∈ {1, . . . , t − 1}, sample αj from P(αj|αj+1, αj−1) ∝ P(αj+1|αj) P(αj|αj−1).

• If θ = (θ1, . . . , θV) is a vector and P(θ|X, αt) is difficult to sample from, then one may further consider sampling θ as:
  • For v ∈ {1, . . . , V}, sample θv from P(θv|θ−v, X, αt) ∝ P(X|θ−v, θv) P(θv|θ−v, αt)

32 / 63

Page 33: Data augmentation and marginalization

What if P(αj|αj−1) is not conjugate to P(αj+1|αj)?

• Use other MCMC algorithms such as the Metropolis-Hastings algorithm.

• Marginalization: suppose P(αj|αj−1) is conjugate to P(αj+2|αj); then one may sample αj in closed form conditioning on αj+2 and αj−1.

• Augmentation: suppose ℓ is an auxiliary variable such that

  P(ℓ, αj+1|αj) = P(ℓ|αj+1, αj) P(αj+1|αj) = P(αj+1|ℓ, αj) P(ℓ|αj),

  and P(αj|αj−1) is conjugate to P(ℓ|αj); then one can sample ℓ from P(ℓ|αj+1, αj) and then sample αj in closed form conditioning on ℓ and αj−1.

• We will provide an example of how to use marginalization and augmentation to derive closed-form Gibbs sampling update equations in Part II of this lecture.

33 / 63

Page 34: Posterior representation with MCMC samples

• In MCMC algorithms, the posteriors of model parameters are represented using collected posterior samples.

• To collect S posterior samples, one often considers (S_burnin + g · S) Gibbs sampling iterations:
  • Discard the first S_burnin samples;
  • Collect one sample every g ≥ 1 iterations after the burn-in period.
  One may also consider multiple independent Markov chains.

• MCMC diagnostics:
  • Inspecting the trace plots of important model parameters
  • Convergence
  • Mixing
  • Autocorrelation
  • Effective sample size
  • ...
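A sketch of this collection scheme wrapped around a generic single-step sampler; `gibbs_step` is a hypothetical, model-specific update function:

```python
import numpy as np

def collect_samples(gibbs_step, theta0, S=1000, burnin=500, thin=5, seed=0):
    """Run burnin + thin*S Gibbs iterations and keep every `thin`-th draw after burn-in.

    `gibbs_step(theta, rng)` is assumed to return the next state of the Markov chain;
    `theta0` is the initial state.
    """
    rng = np.random.default_rng(seed)
    theta, kept = theta0, []
    for it in range(burnin + thin * S):
        theta = gibbs_step(theta, rng)
        if it >= burnin and (it - burnin) % thin == 0:
            kept.append(theta)
    return np.asarray(kept)
```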

34 / 63

Page 35

• With S posterior samples of θ, one can approximately:
  • calculate the posterior mean of θ using (1/S) Σ_{s=1}^S θ^(s)
  • calculate ∫ f(θ) P(θ|X) dθ using (1/S) Σ_{s=1}^S f(θ^(s))
  • calculate P(x_{n+1}|X) = ∫ P(x_{n+1}|θ) P(θ|X) dθ using (1/S) Σ_{s=1}^S P(x_{n+1}|θ^(s))

35 / 63

Page 36: Introduction to dictionary learning and sparse coding

• The input is a data matrix X ∈ R^(P×N) = {x1, . . . , xN}, each column of which is a P-dimensional data vector.

• Typical examples:
  • A movie rating matrix, with P movies and N users.
  • A matrix constructed from 8 × 8 image patches, with P = 64 pixels and N patches.

• The data matrix is usually incomplete and corrupted by noise.

• A common task is to recover the original complete and noise-free data matrix.

36 / 63

Page 37

• A powerful approach is to learn a dictionary D ∈ R^(P×K) from the corrupted X, with the constraint that a data vector is sparsely represented under the dictionary.

• The number of columns K of the dictionary could be larger than P, which means that the dictionary could be over-complete.

• A learned dictionary could provide much better performance than an "off-the-shelf" or handcrafted dictionary.

• The original complete and noise-free data matrix is recovered with the product of the learned dictionary and the sparse representations.

[Figure: the word-document and image-patch matrix factorizations X = ΦΘ shown earlier.]

37 / 63

Page 38: Optimization based methods

• X ∈ R^(P×N) is the data matrix, D ∈ R^(P×K) is the dictionary, and W ∈ R^(K×N) is the sparse-code matrix.

• Objective function:

  min_{D,W} ||X − DW||_F subject to ∀i, ||w_i||_0 ≤ T0

• A common approach to solving this objective function:
  • Sparse coding stage: update the sparse codes W while fixing the dictionary D;
  • Dictionary learning stage: update the dictionary D while fixing the sparse codes W;
  • Iterate until convergence.

38 / 63

Page 39

• Sparse coding stage: fix the dictionary D, update the sparse codes W.
  • min_{w_i} ||w_i||_0 subject to ||x_i − Dw_i||²_2 ≤ Cσ²
  • or min_{w_i} ||x_i − Dw_i||²_2 subject to ||w_i||_0 ≤ T0

• Dictionary update stage: fix the sparse codes W (or the sparsity patterns), update the dictionary D.
  • Method of optimal directions (MOD) (fix the sparse codes):

    D = XW^T (WW^T)^(−1)

  • K-SVD (fix the sparsity pattern, rank-1 approximation):

    d_k w_{k:} ≈ X − Σ_{m≠k} d_m w_{m:}
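A minimal sketch of the two alternating stages, pairing the MOD dictionary update with a greedy OMP-style sparse coding step (an illustration of the scheme on synthetic data, not the K-SVD algorithm itself):

```python
import numpy as np

def sparse_code_greedy(D, x, T0):
    """Pick up to T0 atoms greedily, refitting the coefficients by least squares."""
    support, r = [], x.copy()
    for _ in range(T0):
        k = int(np.argmax(np.abs(D.T @ r)))          # atom most correlated with residual
        if k not in support:
            support.append(k)
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        r = x - D[:, support] @ coef
    w = np.zeros(D.shape[1])
    w[support] = coef
    return w

def mod_update(X, W):
    """Method of optimal directions: D = X W^T (W W^T)^{-1}, atoms renormalized."""
    D = X @ W.T @ np.linalg.pinv(W @ W.T)
    return D / np.maximum(np.linalg.norm(D, axis=0), 1e-12)

rng = np.random.default_rng(6)
P, N, K, T0 = 16, 200, 32, 3                         # illustrative sizes
X = rng.normal(size=(P, N))
D = rng.normal(size=(P, K)); D /= np.linalg.norm(D, axis=0)
for _ in range(10):                                  # alternate the two stages
    W = np.column_stack([sparse_code_greedy(D, X[:, i], T0) for i in range(N)])
    D = mod_update(X, W)
print(np.linalg.norm(X - D @ W, 'fro'))              # fitting error decreases over rounds
```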

39 / 63

Page 40

• Restrictions of optimization based dictionary learning algorithms:
  • Have to assume prior knowledge of the noise variance, sparsity level, or regularization parameters;
  • Nontrivial to handle data anomalies such as missing data;
  • May require sufficient noise-free training data to pretrain the dictionary;
  • Only point estimates are provided;
  • Have to tune the number of dictionary atoms.

• We will address all of these restrictions except the last one using a parametric Bayesian model.

• The last restriction could be removed by making the model nonparametric, which will be briefly discussed.

40 / 63

Page 41: Sparse factor analysis (spike-and-slab sparse prior)

• Hierarchical Bayesian model (Zhou et al., 2009, 2012):

  x_i = D(z_i ⊙ s_i) + ε_i, ε_i ~ N(0, γ_ε^(−1) I_P)
  d_k ~ N(0, P^(−1) I_P), s_i ~ N(0, γ_s^(−1) I_K)
  z_ik ~ Bernoulli(π_k), π_k ~ Beta(c/K, c(1 − 1/K))
  γ_s ~ Gamma(c0, d0), γ_ε ~ Gamma(e0, f0)

  where z_i ⊙ s_i = (z_i1 s_i1, . . . , z_iK s_iK)^T. Note that if z_ik = 0, then the sparse code z_ik s_ik is exactly zero.

• Data are partially observed:

  y_i = Σ_i x_i

  where Σ_i is the projection matrix selecting the observed entries of the data, with Σ_i Σ_i^T = I_{||Σ_i||_0}.
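A forward-sampling sketch of this generative model (hyper-parameter values are illustrative; inference is the Gibbs sampler summarized on the following slides):

```python
import numpy as np

rng = np.random.default_rng(7)
P, N, K, c = 64, 500, 128, 1.0

# Precisions fixed here for readability; in the full model gamma_s ~ Gamma(c0, d0)
# and gamma_eps ~ Gamma(e0, f0), and both are inferred from the data.
gamma_s, gamma_eps = 1.0, 100.0

pi = rng.beta(c / K, c * (1 - 1 / K), size=K)          # atom usage probabilities pi_k
D = rng.normal(0, 1 / np.sqrt(P), size=(P, K))         # dictionary atoms d_k ~ N(0, P^{-1} I_P)
Z = (rng.random((K, N)) < pi[:, None]).astype(float)   # spikes z_ik ~ Bernoulli(pi_k)
S = rng.normal(0, 1 / np.sqrt(gamma_s), size=(K, N))   # slabs s_i ~ N(0, gamma_s^{-1} I_K)
X = D @ (Z * S) + rng.normal(0, 1 / np.sqrt(gamma_eps), size=(P, N))

print(Z.sum(axis=0).mean())   # average number of active atoms per data vector
```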

41 / 63

Page 42

• Full joint likelihood:

  P(Y, Σ, D, Z, S, π, γ_s, γ_ε)
  = ∏_{i=1}^N N(y_i; Σ_i D(z_i ⊙ s_i), γ_ε^(−1) I_{||Σ_i||_0}) N(s_i; 0, γ_s^(−1) I_K)
  × ∏_{k=1}^K N(d_k; 0, P^(−1) I_P) Beta(π_k; c/K, c(1 − 1/K))
  × ∏_{i=1}^N ∏_{k=1}^K Bernoulli(z_ik; π_k)
  × Gamma(γ_s; c0, d0) Gamma(γ_ε; e0, f0)

42 / 63

Page 43

• Gibbs sampling (details can be found in Zhou et al., IEEE TIP 2012):
  • Sample z_ik from a Bernoulli distribution
  • Sample s_ik from a normal distribution
  • Sample π_k from a beta distribution
  • Sample d_k from a multivariate normal distribution
  • Sample γ_s from a gamma distribution
  • Sample γ_ε from a gamma distribution

43 / 63

Page 44

• Logarithm of the posterior:

  − log p(Θ|X, H) = (γ_ε/2) Σ_{i=1}^N ||x_i − D(s_i ⊙ z_i)||²_2
  + (P/2) Σ_{k=1}^K ||d_k||²_2 + (γ_s/2) Σ_{i=1}^N ||s_i||²_2
  − log f_{Beta-Bern}({z_i}_{i=1}^N; H)
  − log Gamma(γ_ε|H) − log Gamma(γ_s|H)
  + const.,

  where Θ represents the set of model parameters and H represents the set of hyper-parameters.

• The sparse factor model tries to minimize the least squares of the data fitting errors while encouraging the representations of the data under the learned dictionary to be sparse.

44 / 63

Page 45: Handling data anomalies

• Missing data
  • full data: x_i; observed: y_i = Σ_i x_i; missing: Σ̄_i x_i

  N(x_i; D(s_i ⊙ z_i), γ_ε^(−1) I_P) = N(Σ_i^T y_i; Σ_i^T Σ_i D(s_i ⊙ z_i), Σ_i^T Σ_i γ_ε^(−1) I_P)
  × N(Σ̄_i^T Σ̄_i x_i; Σ̄_i^T Σ̄_i D(s_i ⊙ z_i), Σ̄_i^T Σ̄_i γ_ε^(−1) I_P)

• Spiky noise (outliers)

  x_i = D(s_i ⊙ z_i) + ε_i + v_i ⊙ m_i
  v_i ~ N(0, γ_v^(−1) I_P), m_ip ~ Bernoulli(π′_ip), π′_ip ~ Beta(a0, b0)

• Recovered data

  x̂_i = D(s_i ⊙ z_i)

45 / 63

Page 46: How to select K?

• As K → ∞, one can show that the parametric sparse factor analysis model using the spike-and-slab prior becomes a nonparametric Bayesian model governed by the beta-Bernoulli process, or the Indian buffet process if the beta process is marginalized out. This point will not be further discussed in this lecture.

• We set K to be large enough, making the parametric model a truncated version of the beta process factor analysis model. As long as K is large enough, the obtained results would be similar.

46 / 63

Page 47: Sparse factor analysis (Bayesian Lasso shrinkage prior)

• Hierarchical Bayesian model (Xing et al., SIIMS 2012):

  x_i ~ N(Ds_i, α^(−1) I_P), s_ik ~ N(0, α^(−1) η_ik)
  d_k ~ N(0, P^(−1) I_P), η_ik ~ Exp(γ_ik/2)
  α ~ Gamma(a0, b0), γ_ik ~ Gamma(a1, b1)

• Marginalizing out η_ik leads to

  P(s_ik|α, γ_ik) = (√(αγ_ik)/2) exp(−√(αγ_ik) |s_ik|)

• This Bayesian Lasso shrinkage prior based sparse factor model does not correspond to a nonparametric Bayesian model as K → ∞. Thus the number of dictionary atoms K needs to be carefully set.

47 / 63

Page 48

• Logarithm of the posterior:

  − log p(Θ|X, H) = (α/2) Σ_{i=1}^N ||x_i − Ds_i||²_2
  + (P/2) Σ_{k=1}^K ||d_k||²_2
  + Σ_{i=1}^N Σ_{k=1}^K √(αγ_ik) |s_ik|
  − log f(α, {γ_ik}_{i,k}; H)

• This model tries to minimize the least squares of the data fitting errors while encouraging the representations s_i to be sparse using L1 penalties.

48 / 63

Page 49: Bayesian dictionary learning

• Automatically decide the sparsity level for each image patch.
• Automatically decide the noise variance.
• Simple to handle data anomalies.
• Insensitive to initialization; does not require a pretrained dictionary.
• Assumption: image patches are fully exchangeable.

[Figure: image with 80% of pixels missing at random, the learned dictionary, and the recovered image (26.90 dB).]

49 / 63

Page 50: Image denoising

[Figure: noisy image; KSVD denoising with mismatched variance; KSVD denoising with matched variance; BPFA denoising; dictionaries.]

50 / 63

Page 51: Image denoising

51 / 63

Page 52: Image inpainting

Left to right: corrupted image (80% of pixels missing at random), restored image, original image.

[Plot: PSNR versus learning round.]

52 / 63

Page 53: Hyperspectral image inpainting

150 × 150 × 210 hyperspectral urban image, 95% of voxels missing at random.

[Figure: corrupted image (10.9475 dB), restored image (25.8839 dB), and original image, shown at spectral bands 1 and 100 (Zhou et al., 2011).]

53 / 63


Page 56: Hyperspectral image inpainting

845 × 512 × 106 hyperspectral image, 98% of voxels missing at random.

[Figure: original and restored images at spectral bands 50 and 90 (Zhou et al., 2011).]

54 / 63

Page 57: The exchangeability assumption is often not true

• Image patches that are spatially nearby tend to share similar features.

• Left: patches are treated as exchangeable. Right: spatial covariate dependence is considered.

55 / 63

Page 58: Covariate dependent dictionary learning (Zhou et al., 2011)

Idea: encourage data nearby in the covariate space to share similar features.

[Figure (image interpolation, BP vs. dHBP): BP and dHBP atom usage; BP and dHBP dictionaries; dictionary atom activation probability map; observed image, BP recovery, dHBP recovery, and original.]

56 / 63

Page 59: Image Interpolation: BP vs. dHBP

[Figure: observed image (20% of pixels), BP recovery, dHBP recovery, and original image.]

57 / 63

Page 60: Image Interpolation: BP vs. dHBP

[Figure: observed image (20% of pixels), BP recovery, dHBP recovery, and original image.]

58 / 63

Page 61: Spiky Noise Removal: BP vs. dHBP

[Figure: noisy image (white Gaussian noise plus sparse spiky noise), BP denoised image, dHBP denoised image, original image, and the BP and dHBP dictionaries.]

59 / 63

Page 62: Spiky Noise Removal: BP vs. dHBP

[Figure: noisy image (white Gaussian noise plus sparse spiky noise), BP denoised image, dHBP denoised image, original image, and the BP and dHBP dictionaries.]

60 / 63

Page 63: Summary for Bayesian dictionary learning

• A generative approach for data recovery from redundant, noisy, and incomplete observations.
• A single baseline model applicable to all of: gray-scale, RGB, and hyperspectral image denoising and inpainting.
• Automatically inferred noise variance and sparsity level.
• Dictionary learning and reconstruction on the data under test.
• Incorporates covariate dependence.
• Code available online for reproducible research.
• In a sampling based algorithm, the spike-and-slab sparse prior allows the representations to be exactly zero, whereas a shrinkage prior would not permit exact zeros; for dictionary learning, the spike-and-slab prior is often found to be more robust, easier to compute, and better performing.

61 / 63

Page 64: Summary

• Understand your data

• Define data likelihood

• Construct prior

• Derive inference using Bayes’ rule

• Implement in Matlab, R, Python, C/C++, ...

• Interpret model output

62 / 63

Page 65: Main references

M. Aharon, M. Elad, and A. M. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Processing, 2006.

M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans. Image Processing, 2006.

T. L. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In Proc. Advances in Neural Information Processing Systems, pages 475–482, 2005.

R. Thibaux and M. I. Jordan. Hierarchical beta processes and the Indian buffet process. In Proc. International Conference on Artificial Intelligence and Statistics, 2007.

T. Park and G. Casella. The Bayesian lasso. Journal of the American Statistical Association, 2008.

Z. Xing, M. Zhou, A. Castrodad, G. Sapiro, and L. Carin. Dictionary learning for noisy and incomplete hyperspectral images. SIAM Journal on Imaging Sciences, 2012.

M. Zhou, H. Chen, J. Paisley, L. Ren, G. Sapiro, and L. Carin. Non-parametric Bayesian dictionary learning for sparse image representations. In NIPS, 2009.

M. Zhou, H. Chen, J. Paisley, L. Ren, L. Li, Z. Xing, D. Dunson, G. Sapiro, and L. Carin. Nonparametric Bayesian dictionary learning for analysis of noisy and incomplete images. IEEE TIP, 2012.

M. Zhou, H. Yang, G. Sapiro, D. Dunson, and L. Carin. Dependent hierarchical beta process for image interpolation and denoising. In AISTATS, 2011.

63 / 63