Variational Inference for the Indian Buffet Process

Finale Doshi-Velez† (Cambridge University), Kurt T. Miller† (UC Berkeley), Jurgen Van Gael† (Cambridge University), Yee Whye Teh (Gatsby Unit)

† Authors contributed equally

Introduction

Motivating example

We are interested in extracting unobserved features from observed data. For example:

• Latent classes ⇒ Mixture models

• Latent features ⇒ Latent feature models


Introduction

Linear Gaussian Latent Feature Model

We will focus on one example of a latent feature model:

[Figure: X = ZA + noise, where X is the observed N × D data matrix (row i is the observation for object i), Z is the unobserved N × K binary matrix (row i gives the features for object i), and A is the unobserved K × D matrix of feature values.]

• N = Number of data points

• D = Dimension of observed data

• K = Number of latent features
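
To make the model concrete, here is a minimal sampling sketch (ours, not from the talk); the 0.5 activation probability for each feature is a placeholder until the prior on Z is specified later.

```python
import numpy as np

def sample_linear_gaussian(N=100, D=500, K=25, sigma_A=1.0, sigma_n=0.5, seed=0):
    """Sample X = Z A + noise from the linear Gaussian latent feature model."""
    rng = np.random.default_rng(seed)
    Z = rng.binomial(1, 0.5, size=(N, K))              # N x K binary features
    A = sigma_A * rng.standard_normal((K, D))          # K x D feature values
    X = Z @ A + sigma_n * rng.standard_normal((N, D))  # N x D observations
    return X, Z, A
```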


Introduction

Linear Gaussian Latent Feature Model

Goal: Infer Z and A given data X.

Approach: Bayes’ rule:

$$p(Z, A \mid X) \;\propto\; \underbrace{p(X \mid Z, A)\, p(A)}_{\text{model specific}} \;\times\; \underbrace{p(Z)}_{\text{prior on binary matrices}}$$

In the linear Gaussian model, we use

• p(X | Z, A) ∼ N(ZA, σ_n² I)

• p(A) ∼ N(0, σ_A² I)

• p(Z) ∼ ?

[Graphical model: σ_A → A and the unspecified prior → Z; A and Z → X; σ_n → X.]


The Indian Buffet Process

The Indian Buffet Process - Stick-breaking construction

• First generate v_1, v_2, … i.i.d. ∼ Beta(α, 1).

• Let π_k = ∏_{j=1}^{k} v_j.

• Sample z_nk ∼ Bernoulli(π_k).

[Figure: the stick weights v_1, …, v_9 and the resulting feature probabilities π_1, …, π_9, which shrink toward zero; column k of Z is sampled with probability π_k.]

(Teh et al., 2007)
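
The construction translates directly into code; a minimal sketch, where the truncation level K_max is our own device to keep the sample finite, not part of the definition:

```python
import numpy as np

def sample_ibp_stick_breaking(N, alpha, K_max=50, seed=0):
    """Sample Z from the IBP via the stick-breaking construction,
    truncated at K_max columns for practicality."""
    rng = np.random.default_rng(seed)
    v = rng.beta(alpha, 1.0, size=K_max)      # v_k i.i.d. Beta(alpha, 1)
    pi = np.cumprod(v)                        # pi_k = v_1 * ... * v_k
    Z = rng.binomial(1, pi, size=(N, K_max))  # z_nk ~ Bernoulli(pi_k)
    return Z, pi
```

Because each v_k < 1, the π_k decay geometrically in expectation, so columns far to the right are almost surely all zero.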


The Indian Buffet Process

Full Linear Gaussian Latent Feature Model

Model:

• p(X | Z, A) ∼ N(ZA, σ_n² I)

• p(A) ∼ N(0, σ_A² I)

• p(Z) ∼ IBP(α)

[Graphical model: α → v → Z; σ_A → A; Z and A → X; σ_n → X.]

Given X, how do we do inference on Z and A?

• Even for finite K, there are 2^{NK} possible Z (for N = 100 and K = 25, that is 2^{2500} ≈ 10^{752} binary matrices).

• Many local optima.


The Indian Buffet Process

Inference in the Linear Gaussian Model

[Figure: predictive log likelihood vs. time the sampler is run (minutes); left panel N = 100, D = 500, K = 25, right panel N = 500, D = 500, K = 25; curves for Collapsed Gibbs, Uncollapsed Gibbs, and Variational.]

Variational Inference for the IBP

Mean Field Variational Inference

Approximate p(Z, A | X) with a distribution q(Z, A) from a family Q that is “close” to p(Z, A | X).

How do we define “close”? We will attempt to find

$$q(Z, A) = \arg\min_{q \in Q} D\big(q(Z, A) \,\|\, p(Z, A \mid X)\big).$$
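
A standard identity, not spelled out on the slide, makes this minimization tractable: the KL divergence to the posterior differs from the negative evidence lower bound (ELBO) only by the constant log p(X),

$$D\big(q(Z, A) \,\|\, p(Z, A \mid X)\big) = \log p(X) - \Big(\mathbb{E}_q[\log p(X, Z, A)] - \mathbb{E}_q[\log q(Z, A)]\Big),$$

so minimizing the divergence over q ∈ Q is equivalent to maximizing the ELBO, which involves only expectations under q.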


Variational Inference for the IBP

How do we choose Q?

p(Z,A|X) is a distribution over infinitely many features.

Trick (Blei and Jordan, 2004): Let Q be a truncated family where we assume that Z is nonzero in at most the first K columns.

Why can we do this? Intuitively, the probability π_k that z_nk is one decreases exponentially quickly.


Variational Inference for the IBP

Truncation bound

More formally, let m_K(X) be the marginal of X, with Z and A integrated out, when the stick-breaking construction is truncated at column K.

Then we can show

$$\frac{1}{4} \int \big| m_K(X) - m_\infty(X) \big| \, dX \;\le\; 1 - \exp\left(-N \left(\frac{\alpha}{\alpha + 1}\right)^{K}\right).$$

[Figure: bound on L1 distance as a function of truncation level K, comparing the truncation bound with the true distance.]

This is the first such bound for the IBP and can serve as a guideline for how to choose K for the family Q.
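
As a usage sketch of the bound (under the formula as reconstructed above), a small helper can pick the smallest K whose bound falls below a tolerance:

```python
import numpy as np

def ibp_truncation_bound(N, alpha, K):
    """Upper bound on (1/4) * L1 distance between the truncated
    and the untruncated marginals of X."""
    return 1.0 - np.exp(-N * (alpha / (alpha + 1.0)) ** K)

def smallest_K(N, alpha, tol=0.1, K_cap=1000):
    """Smallest truncation level whose bound drops below tol."""
    return next(K for K in range(1, K_cap) if ibp_truncation_bound(N, alpha, K) < tol)
```

For illustrative values such as N = 100 and α = 5 the returned K is a few dozen; once N(α/(α+1))^K is small, the bound is approximately that quantity, so it shrinks geometrically in K.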


Variational Inference for the IBP

How do we choose Q?

We let our family Q be the parameterized family (introducing the stick-breaking variables v)

$$q(Z, A, v) = q_\nu(Z)\, q_\phi(A)\, q_\tau(v).$$

True distribution: [graphical model as before: α → v → Z; σ_A → A; Z and A → X; σ_n → X.]

Variational distribution: [fully factorized graphical model: τ → v, φ → A, ν → Z, with no edges among v, A, and Z.]

• q_ν_nk(z_nk) = Bernoulli(z_nk; ν_nk)

• q_φ_k(A_k·) = N(A_k·; φ̄_k, Φ_k)

• q_τ_k(v_k) = Beta(v_k; τ_k1, τ_k2)
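
To fix notation in code, a hypothetical container for these parameters (our sketch; we store only the diagonal of each Φ_k for simplicity, whereas the family allows a full covariance):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class VariationalParams:
    tau: np.ndarray  # (K, 2): Beta parameters (tau_k1, tau_k2) of q(v_k)
    phi: np.ndarray  # (K, D): mean of q(A_k.)
    Phi: np.ndarray  # (K, D): diagonal covariance of q(A_k.) -- simplifying assumption
    nu: np.ndarray   # (N, K): Bernoulli probabilities of q(z_nk)

def init_params(N, D, K, seed=0):
    """Random initialization; the scales are arbitrary choices."""
    rng = np.random.default_rng(seed)
    return VariationalParams(
        tau=np.ones((K, 2)),
        phi=0.01 * rng.standard_normal((K, D)),
        Phi=np.ones((K, D)),
        nu=rng.uniform(0.1, 0.9, size=(N, K)),
    )
```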


Variational Inference for the IBP

Inference

Inference now reduces to finding variational parameters (τ, φ, Φ, ν) such that q ∈ Q is “close” to p.

$$(\tau, \phi, \Phi, \nu) = \arg\min_{q \in Q} D\big(q(Z, A) \,\|\, p(Z, A \mid X)\big).$$

This is not a convex optimization, so we can only hope to find a local optimum.

⇒ Parameter updates done iteratively.
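
The iteration has the usual coordinate-ascent shape; a schematic sketch, where the three update callables are hypothetical stand-ins for the closed-form updates (not reproduced here):

```python
def fit(X, params, update_tau, update_A, update_nu, n_iters=100):
    """Coordinate ascent: set each parameter block to its optimum with the
    other blocks held fixed; no such step can decrease the objective."""
    for _ in range(n_iters):
        params.tau = update_tau(X, params)            # Beta parameters of q(v)
        params.phi, params.Phi = update_A(X, params)  # Gaussian q(A)
        params.nu = update_nu(X, params)              # Bernoulli probabilities of q(Z)
    return params
```

Because the objective is not convex, the fixed point reached depends on the initialization; random restarts are a natural remedy.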


Variational Inference for the IBP

Parameter updates

Many of the updates are straightforward exponential family calculations.

The only nontrivial calculation is E_{v,Z}[log p(z_nk | v)], which requires evaluating

$$\mathbb{E}_{v}\!\left[\log\Big(1 - \prod_{m=1}^{k} v_m\Big)\right],$$

the expected log probability that z_nk = 0.

We provide an efficient way to lower bound this term.
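
One route to such a bound (a sketch of the standard auxiliary-multinomial trick; the slides do not spell out the details) starts from the telescoping identity

$$1 - \prod_{m=1}^{k} v_m = \sum_{y=1}^{k} (1 - v_y) \prod_{m=1}^{y-1} v_m,$$

then introduces a variational multinomial q_k(y) over the index at which the stick “breaks” and applies Jensen’s inequality:

$$\mathbb{E}_v\!\left[\log\Big(1 - \prod_{m=1}^{k} v_m\Big)\right] \;\ge\; \sum_{y=1}^{k} q_k(y)\Big(\mathbb{E}_v[\log(1 - v_y)] + \sum_{m=1}^{y-1} \mathbb{E}_v[\log v_m]\Big) + H[q_k].$$

Under Beta-distributed v_m, each expectation on the right is a digamma expression, so the bound is cheap to evaluate and to optimize over q_k.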


Results

Results: Synthetic data

[Figure: predictive log likelihood vs. time the sampler is run (minutes); left panel N = 100, D = 500, K = 25, right panel N = 500, D = 500, K = 25; curves for Collapsed Gibbs, Uncollapsed Gibbs, and Variational.]


Results

Results: Real data

Two data sets:

• Yale faces data set: linear Gaussian model, N = 721, D = 1024 (32 × 32 images)

• Speech data set: iICA model, N = 245, D = 10

[Figure: speech waveforms plotted over time.]


Results

Results: Real data

Faces data set: N = 721, D = 1024

[Figure: negative log likelihood for K ∈ {5, 10, 25} on the faces data set.]

Large N and D: variational helps

Speech data set: N = 245, D = 10

[Figure: negative log likelihood for K ∈ {2, 5, 9} on the speech data set; Uncollapsed Gibbs vs. Variational.]

Small N and D: variational does not help


Summary

• We present the first variational inference algorithm for the IBP.

• For large N and D, it finds better local optima than the samplers.

• We also present the first truncation bound for the IBP.

Code will be available soon from our websites.

Questions?
