MAD-Bayes: MAP-based Asymptotic Derivations from Bayes


Transcript of MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Page 1: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Michael I. Jordan
INRIA

University of California, Berkeley

Acknowledgments: Brian Kulis, Tamara Broderick

May 11, 2013

Page 2: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Statistical Inference and Big Data

• Two major needs: models with open-ended complexity and scalable algorithms that allow those models to be fit to data

Page 3: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Statistical Inference and Big Data

• Two major needs: models with open-ended complexity and scalable algorithms that allow those models to be fit to data

• In Bayesian inference the focus is on the models– burgeoning literature on Bayesian nonparametrics

provides stochastic processes for representing flexible data structures

– but the algorithmic choices are limited

Page 4: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Statistical Inference and Big Data

• Two major needs: models with open-ended complexity and scalable algorithms that allow those models to be fit to data

• In Bayesian inference the focus is on the models– burgeoning literature on Bayesian nonparametrics

provides stochastic processes for representing flexible data structures

– but the algorithmic choices are limited• So Big Data research hasn’t made much use of

Bayes, and is instead optimization-based– but the model choices tend to be limited

Page 5: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Bayesian Nonparametric Modeling

• Examples of stochastic processes used in Bayesian nonparametrics include distributions on:
– directed trees of unbounded depth and unbounded fan-out
– partitions
– grammars
– Markov processes with unbounded state spaces
– infinite-dimensional matrices
– functions (smooth and non-smooth)
– copulae
– distributions

• Power laws arise naturally in these distributions

• Hierarchical modeling uses these stochastic processes as building blocks

Pages 6–10: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

The Optimization Perspective

• Write down a loss function and a regularizer
• Find scalable algorithms that minimize the sum of these terms
• Prove something about these algorithms

• The connection to model-based inference is often in the analysis, but is sometimes in the design
– e.g., Bayesian ideas are sometimes used to inspire the design of the regularizer

• Where does the loss function come from?
– Gauss, Huber, Fisher, …

• It’s all very parametric, and the transition to nonparametrics is a separate step

Pages 11–13: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

This Talk

• Bayesian nonparametrics meets optimization
– flexible, scalable modeling framework
– gives rise to new loss functions and regularizers that are naturally nonparametric
– no recourse to MCMC, SMC, etc.

• Inspiration: the venerable, scalable K-means algorithm can be derived as the limit of an EM algorithm for fitting a mixture model

• We do something similar in spirit, taking limits of various Bayesian nonparametric models:
– Dirichlet process mixtures
– hierarchical Dirichlet process mixtures
– beta processes and hierarchical beta processes

Page 14: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

K-means Clustering

• Represent the data set in terms of K clusters, each of which is summarized by a prototype

• Each data point is assigned to one of the K clusters
– represented by allocation variables such that each data index i belongs to exactly one cluster

• Example: 4 data points and 3 clusters

Page 15: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

K-means Clustering

• Cost function: the sum of squared distances from each data point to its assigned prototype (written out below)

• The K-means algorithm is coordinate descent on this cost function
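
In standard notation, with binary allocation variables z_ik indicating that point x_i belongs to cluster k and prototypes μ_k (symbols introduced here for concreteness), the cost is

$$ J \;=\; \sum_{i=1}^{N} \sum_{k=1}^{K} z_{ik}\, \lVert x_i - \mu_k \rVert^2 . $$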

Page 16: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Coordinate Descent

• Step 1: fix the prototypes and minimize with respect to the allocations
– assign each data point to the nearest prototype

• Step 2: fix the allocations and minimize with respect to the prototypes
– this gives each prototype as the mean of the data points assigned to it

• Iterate these two steps
• Convergence is guaranteed since there are a finite number of possible settings for the allocations
• It can only find local minima, so we should start the algorithm with many different initial settings
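
As a concrete illustration of these two steps, here is a minimal NumPy sketch of the coordinate descent (function and variable names are illustrative, not from the talk):

    import numpy as np

    def kmeans(X, K, n_iters=100, seed=0):
        """Minimal K-means coordinate descent: alternate assignments and mean updates."""
        rng = np.random.default_rng(seed)
        # Initialize prototypes with K randomly chosen data points.
        mu = X[rng.choice(len(X), size=K, replace=False)].copy()
        for _ in range(n_iters):
            # Step 1: assign each point to its nearest prototype.
            d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # N x K squared distances
            z = d2.argmin(axis=1)
            # Step 2: set each prototype to the mean of its assigned points.
            for k in range(K):
                if np.any(z == k):
                    mu[k] = X[z == k].mean(axis=0)
        return z, mu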

Pages 17–25: (figure-only slides; no transcribed text)
Page 26: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

From Gaussian Mixtures to K-means

• A Gaussian mixture model with a common spherical covariance σ²I for each component

• Set the mixing proportions to be uniform (1/K for each component)

• Write down the EM algorithm for fitting this model

• Take the variance σ² to zero and recover the K-means algorithm
– the E step of EM is Step 1 of K-means
– the M step of EM is Step 2 of K-means
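
To see why the limit produces hard assignments, note that with uniform mixing proportions and shared variance σ² the E-step responsibilities are

$$ r_{ik} \;=\; \frac{\exp\{-\lVert x_i - \mu_k \rVert^2 / 2\sigma^2\}}{\sum_{k'} \exp\{-\lVert x_i - \mu_{k'} \rVert^2 / 2\sigma^2\}} \;\longrightarrow\; \begin{cases} 1, & k = \arg\min_{k'} \lVert x_i - \mu_{k'} \rVert^2 \\ 0, & \text{otherwise} \end{cases} \quad \text{as } \sigma^2 \to 0, $$

so the E step reduces to nearest-prototype assignment and the weighted M-step mean update reduces to the ordinary cluster mean.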

Page 27: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

The K in K-means

• What if K is not known?
– a challenging model selection problem
– the algorithm itself is silent on the problem

• The Gaussian mixture model perspective brings the tools of Bayesian model selection to bear in principle, but not in the limit

• How about starting with Dirichlet process mixtures and Chinese restaurant processes instead of finite mixture models?

Page 28: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Chinese Restaurant Process (CRP)

• A random process in which customers sit down in a Chinese restaurant with an infinite number of tables
– the first customer sits at the first table
– each subsequent customer sits at a table drawn from the distribution written out below, which depends only on the current state of the restaurant (how many customers are seated at each table)
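
In standard notation (with n_k customers currently at table k, K occupied tables after n customers have been seated, and concentration parameter α > 0), customer n+1 chooses

$$ P(\text{table } k) = \frac{n_k}{n + \alpha} \quad (k = 1, \dots, K), \qquad P(\text{new table}) = \frac{\alpha}{n + \alpha}. $$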

Page 29: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

The CRP and Clustering

• Data points are customers; tables are mixture components
– the CRP defines a prior distribution on the partitioning of the data and on the number of tables

• This prior can be completed with:
– a likelihood, e.g., associate a parameterized probability distribution with each table
– a prior for the parameters: the first customer to sit at a table chooses the parameter vector for that table from a prior distribution

• We want to write out all of these probabilities and then take a scale parameter to zero

Page 30: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

CRP Prior, Gaussian Likelihood, Conjugate Prior
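
In symbols, one standard version of this model (the notation is introduced here for concreteness) is

$$ \pi \sim \mathrm{CRP}(\alpha), \qquad \mu_k \sim \mathcal{N}(0, \rho^2 I) \;\text{ for each block } k \text{ of } \pi, \qquad x_i \mid \pi, \{\mu_k\} \sim \mathcal{N}(\mu_{z_i}, \sigma^2 I), $$

where z_i indexes the block containing data point i.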

Page 31: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

The CRP Prior

• Let π denote a partition of the integers 1 through N, and let n_1, …, n_K denote the sizes of its K blocks

• Then, under the CRP, the probability of π is given by the formula below

• This function (the exchangeable partition probability function, or EPPF) depends only on the cardinalities of the blocks; this implies exchangeability
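
With K blocks of sizes n_1, …, n_K, the EPPF is

$$ P(\pi) \;=\; \frac{\alpha^{K} \prod_{k=1}^{K} (n_k - 1)!}{\prod_{n=1}^{N} (\alpha + n - 1)} \;=\; \frac{\alpha^{K}\,\Gamma(\alpha)}{\Gamma(\alpha + N)} \prod_{k=1}^{K} \Gamma(n_k). $$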

Page 32: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

The Joint Probability

• Encode the partition with allocation variables z = (z_1, …, z_N)

• The joint probability of the allocations and the data is the product of the EPPF and the usual mixture-model likelihood, where the number of clusters K is now random

• I.e., we obtain a joint probability of the data and the allocations

• And we can find the MAP estimate of a clustering by maximizing this joint probability over the allocations (and the cluster parameters)
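
With the Gaussian likelihood and prior of the preceding slides (a sketch; the precise form depends on the chosen prior), the MAP problem is

$$ \max_{z,\,\{\mu_k\}} \; \Big[ \log P(\pi(z)) \;+\; \sum_{k=1}^{K} \log p(\mu_k) \;+\; \sum_{i=1}^{N} \log \mathcal{N}(x_i \mid \mu_{z_i}, \sigma^2 I) \Big], $$

where π(z) is the partition induced by the allocations and K is its (random) number of blocks.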

Page 33: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Small Variance Asymptotics

• Now let the likelihood be Gaussian and take the variance σ² to zero

• We do this analytically by picking a rate constant and reparameterizing the concentration parameter as a function of σ²

• Letting σ² go to zero, we obtain a penalized form of the K-means objective (below)
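
One way to carry out the reparameterization (a sketch in the spirit of the Kulis and Jordan derivation; λ > 0 denotes the rate constant) is to set α = exp(−λ / 2σ²); multiplying the negative log joint probability by 2σ² and letting σ² → 0 then gives, up to constants,

$$ \sum_{k=1}^{K} \sum_{i : z_i = k} \lVert x_i - \mu_k \rVert^2 \;+\; \lambda K, $$

i.e., the K-means cost plus a penalty of λ for each cluster used.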

Page 34: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Coordinate Descent: DP Means

• Reassign a point to the cluster corresponding to the closest mean, unless the closest cluster has squared Euclidean distance greater than λ; in this case, start a new cluster.

• Given the cluster assignments, perform Gibbs moves on all the means, which amounts to sampling from the posterior based on the prior and all observations in a cluster (in the small-variance limit, this reduces to setting each mean to the sample mean of its cluster).
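
A minimal NumPy sketch of the resulting hard-assignment procedure (the mean update is taken to be the cluster sample mean, its small-variance limit; names are illustrative):

    import numpy as np

    def dp_means(X, lam, n_iters=100):
        """DP-means-style coordinate descent: K grows whenever a point is far from all means."""
        mu = [X.mean(axis=0)]                      # start with one cluster at the global mean
        z = np.zeros(len(X), dtype=int)
        for _ in range(n_iters):
            # Assignment step: reassign each point, opening a new cluster if needed.
            for i, x in enumerate(X):
                d2 = np.array([np.sum((x - m) ** 2) for m in mu])
                if d2.min() > lam:
                    mu.append(x.copy())            # start a new cluster at this point
                    z[i] = len(mu) - 1
                else:
                    z[i] = int(d2.argmin())
            # Mean update step: each mean becomes the sample mean of its cluster.
            mu = [X[z == k].mean(axis=0) for k in range(len(mu)) if np.any(z == k)]
            # Relabel so cluster indices stay contiguous after dropping empty clusters.
            labels = {old: new for new, old in enumerate(sorted(set(z)))}
            z = np.array([labels[k] for k in z])
        return z, np.array(mu)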

Page 35: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

The CRP and Exchangeability

• The CRP is a distribution on partitions; it is an exchangeable distribution on partitions

• By De Finetti’s theorem, there must exist an underlying random measure such that the CRP is obtained by integrating out that random measure

• That random measure turns out to be the Dirichlet process (e.g., Blackwell & MacQueen, 1973)

Page 36: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

The De Finetti Theorem

• An infinite sequence of random variables (x_1, x_2, …) is called infinitely exchangeable if the distribution of any finite subsequence is invariant to permutation

• Theorem: the sequence is infinitely exchangeable if and only if its finite-dimensional marginals have the mixture form below, for some random measure G
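
In symbols:

$$ p(x_1, \dots, x_N) \;=\; \int \Big[ \prod_{i=1}^{N} p(x_i \mid G) \Big]\, dP(G) \quad \text{for all } N, $$

where P is the distribution of the random measure G (the "De Finetti mixing measure").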

Page 37: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Random Measures and Their Marginals

• The De Finetti random measure is known for a number of interesting combinatorial stochastic processes:
– Dirichlet process => Chinese restaurant process (Pólya urn)
– Beta process => Indian buffet process
– Hierarchical Dirichlet process => Chinese restaurant franchise
– HDP-HMM => infinite HMM
– Nested Dirichlet process => nested Chinese restaurant process

Page 38: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Completely Random Measures

• Completely random measures are measures on a set that assign independent mass to nonintersecting subsets of– e.g., Poisson processes, gamma processes, beta

processes, compound Poisson processes and limits thereof• The Dirichlet process is not a completely random measure

– but it's a normalized gamma process• Completely random processes are discrete wp1 (up to a

possible deterministic continuous component)• Completely random measures are random measures, not

necessarily random probability measures

(Kingman, 1967)

Page 39: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Completely Random Measures

• Assigns independent mass to nonintersecting subsets of Ω

(Kingman, 1967)

[Figure: a draw from a completely random measure, shown as weighted atoms scattered over Ω]
Page 40: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Completely Random Measures

• Consider a Poisson random measure on the product space Ω × R+ with rate function specified as a product measure

• Sample from this Poisson process and connect the samples vertically to their coordinates in Ω

(Kingman, 1967)

[Figure: points of the Poisson process in Ω × R+, with vertical sticks connecting each point to its location in Ω]

Page 41: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Gamma Process

• The gamma process is a CRM for which the rate function is given as follows (on Ω × R+); see below

• Draw a sample {(θ_k, ω_k)} from a Poisson random measure with this rate measure

• The resulting random measure can then be written simply as a weighted sum of atoms (also below)
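
In one standard parameterization (base measure G_0 on Ω and concentration c > 0, introduced here for concreteness):

$$ \nu(d\omega, d\theta) \;=\; c\,\omega^{-1} e^{-c\,\omega}\, d\omega\; G_0(d\theta), \qquad G \;=\; \sum_{k=1}^{\infty} \omega_k\, \delta_{\theta_k}. $$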

Pages 42–43: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Dirichlet Process

• The Dirichlet process is a normalized gamma process

[Figures: the weighted atoms of a gamma process draw on Ω, and the normalized weights, which give a draw from a Dirichlet process]

Page 44: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Dirichlet Process Marginals

• Consider the following hierarchy (written out below): draw a random measure G from a Dirichlet process, then draw parameters θ_1, θ_2, … i.i.d. from G

• The variables θ_1, θ_2, … are clearly exchangeable

• The partition structure that they induce (ties among the θ_i) is exactly the Chinese restaurant process
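
In symbols (with concentration parameter α and base measure G_0):

$$ G \sim \mathrm{DP}(\alpha, G_0), \qquad \theta_i \mid G \overset{\text{iid}}{\sim} G, \quad i = 1, 2, \dots $$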

Page 45: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Dirichlet Process Mixture Models
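
In symbols, a Dirichlet process mixture model adds a likelihood to the hierarchy of the previous slide:

$$ G \sim \mathrm{DP}(\alpha, G_0), \qquad \theta_i \mid G \sim G, \qquad x_i \mid \theta_i \sim p(x \mid \theta_i). $$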

Page 46: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Multiple Estimation Problems

• We often face multiple, related estimation problems

• E.g., multiple Gaussian means, one per group of observations (see below)

• Maximum likelihood estimates each mean from its own group's data

• Maximum likelihood often doesn't work very well
– want to “share statistical strength”
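
In a standard formulation of this example (notation introduced here for concreteness), group i has observations

$$ x_{ij} \sim \mathcal{N}(\theta_i, \sigma^2), \quad j = 1, \dots, n_i, $$

and maximum likelihood gives the per-group sample mean, estimated separately for each group: θ̂_i = (1/n_i) Σ_j x_ij.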

Page 47: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Hierarchical Bayesian Approach

• The Bayesian or empirical Bayesian solution is to view the group-specific parameters as random variables, related via an underlying shared variable

• Given this overall model, posterior inference yields shrinkage: the posterior mean for each group's parameter combines data from all of the groups
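
For instance, with a conjugate Gaussian prior θ_i ~ N(μ, τ²) (introduced here for illustration) and group sample means x̄_i based on n_i observations of variance σ², the posterior mean is a precision-weighted average:

$$ \mathbb{E}[\theta_i \mid \text{data}] \;=\; \frac{n_i/\sigma^2}{n_i/\sigma^2 + 1/\tau^2}\,\bar{x}_i \;+\; \frac{1/\tau^2}{n_i/\sigma^2 + 1/\tau^2}\,\mu, $$

which pulls each group's estimate toward the shared mean μ.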

Page 48: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Hierarchical Modeling

• The plate notation (shown on the slide)

• Equivalent to the unrolled graphical model in which each group's parameter is drawn from the shared prior

Page 49: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Hierarchical Dirichlet Process Mixtures
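
In symbols, the hierarchical Dirichlet process mixture ties the group-specific mixing measures together through a shared random base measure:

$$ G_0 \sim \mathrm{DP}(\gamma, H), \qquad G_j \mid G_0 \sim \mathrm{DP}(\alpha, G_0), \qquad \theta_{ji} \mid G_j \sim G_j, \qquad x_{ji} \mid \theta_{ji} \sim p(x \mid \theta_{ji}). $$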

Page 50: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Application: Protein Modeling

• A protein is a folded chain of amino acids
• The backbone of the chain has two degrees of freedom per amino acid (phi and psi angles)
• Empirical plots of phi and psi angles are called Ramachandran diagrams

Page 51: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Application: Protein Modeling

• We want to model the density in the Ramachandran diagram to provide an energy term for protein folding algorithms

• We actually have a linked set of Ramachandran diagrams, one for each amino acid neighborhood

• We thus have a linked set of density estimation problems

Page 52: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Protein Folding (cont.)

• We have a linked set of Ramachandran diagrams, one for each amino acid neighborhood

Page 53: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Protein Folding (cont.)

Page 54: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Chinese Restaurant Franchise (CRF)

Global menu:

Restaurant 1:

Restaurant 2:

Restaurant 3:

Page 55: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Small Variance Asymptotics

• Define group-specific allocation variables, one set for each of the groups

• As before, let the variance in the likelihood go to zero; the resulting MAP problem is asymptotically a penalized K-means-style objective (see below)

• This leads to a simple coordinate descent algorithm with two levels of clustering decisions (local clusters within each group and global clusters shared across groups)
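
One form the limiting objective can take (a sketch in the spirit of Kulis and Jordan's hard-HDP derivation; λ_local and λ_global are penalty constants, k_j is the number of local clusters in group j, and g is the number of global clusters):

$$ \sum_{j} \sum_{i} \lVert x_{ji} - \mu_{z_{ji}} \rVert^2 \;+\; \lambda_{\mathrm{local}} \sum_{j} k_j \;+\; \lambda_{\mathrm{global}}\, g. $$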

Page 56: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Dirichlet Processes vs. Beta Processes

• The Dirichlet process yields a classification of each data point into a single class (a “cluster”)

• The beta process allows each data point to belong to multiple classes (a “feature vector”)

• Essentially, the Dirichlet process yields a single coin with an infinite number of sides

• Essentially, the beta process yields an infinite collection of coins with mostly small probabilities, and the Bernoulli process tosses those coins to yield a binary feature vector

Page 57: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Beta Processes

• For the beta process, the rate function is given as follows (on the space Ω × [0,1]); see below
– the weights follow a degenerate Beta(0, c) density, and B_0 is the base measure

• The resulting random measure can again be written simply as a discrete measure built from these weights (below)
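
In one standard parameterization (base measure B_0 on Ω, concentration c > 0):

$$ \nu(d\omega, d\theta) \;=\; c\,\omega^{-1} (1 - \omega)^{c - 1}\, d\omega\; B_0(d\theta), \qquad B \;=\; \sum_{k=1}^{\infty} \omega_k\, \delta_{\theta_k}, \quad \omega_k \in [0, 1]. $$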

Page 58: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Beta Processes

Page 59: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Beta Process and Bernoulli Process
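
In symbols: given a beta process draw B = Σ_k ω_k δ_{θ_k}, each data point n tosses the coins independently,

$$ z_{nk} \mid B \;\sim\; \mathrm{Bernoulli}(\omega_k) \quad \text{independently over } k, $$

so data point n is described by the binary feature vector (z_{n1}, z_{n2}, …), with feature k carrying the parameter θ_k.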

Page 60: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Indian Buffet Process (IBP)

• Indian restaurant with infinitely many dishes in a buffet line

• Customers 1 through N enter the restaurant
– the first customer samples a Poisson(α) number of dishes
– the n-th customer samples each previously sampled dish k with probability m_k / n, where m_k is the number of previous customers who sampled dish k, and then samples a Poisson(α/n) number of new dishes

(Griffiths & Ghahramani, 2002)

Page 61: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Beta Process Marginals

• Theorem: The beta process is the De Finetti mixing measure underlying the Indian buffet process (IBP)

(Thibaux & Jordan, 2007)

Page 62: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Hierarchical Beta Processes

• A hierarchical beta process is a beta process whose base measure is itself random and drawn from a beta process
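
In symbols (with concentration parameters c_0, c_j and a fixed base measure H):

$$ B_0 \sim \mathrm{BP}(c_0, H), \qquad B_j \mid B_0 \sim \mathrm{BP}(c_j, B_0) \quad \text{for each group } j. $$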

Page 63: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Towards the BP-Means Algorithm

• Let Z denote the Indian buffet (binary feature) matrix

• We again need to compute the joint probability of Z and the data

• And perform small-variance asymptotics on its negative logarithm

• To obtain a cost function to which we can apply coordinate descent

Page 64: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

The Exchangeable Feature Probability Function (EFPF)

• How do we compute the marginal probability of the feature matrix Z?

• Broderick, Jordan and Pitman (2013) develop a general approach to computing such exchangeable feature probability functions (EFPFs)

• Applied to the Indian buffet process, it yields the expression below
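
In the familiar form due to Griffiths and Ghahramani (with m_k the number of data points possessing feature k, K the total number of features, K_h the number of features possessed by exactly the same subset of data points, and H_N = Σ_{n=1}^N 1/n the N-th harmonic number):

$$ P(Z) \;=\; \frac{\alpha^{K}}{\prod_{h} K_h!}\; \exp\{-\alpha H_N\}\; \prod_{k=1}^{K} \frac{(m_k - 1)!\,(N - m_k)!}{N!}. $$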

Page 65: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

The Likelihood

• What about the likelihood p(X | Z)?

• Many possible choices; a popular one is given below
Page 66: MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Small-Variance Asymptotics

• Now take the likelihood variance σ² to zero; this yields the penalized cost function below

• Coordinate descent on this cost function gives an algorithm that we refer to as “BP-Means”
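
Carrying out the same recipe as in the DP-means case (a sketch; the penalty constant λ² arises from a reparameterization of the IBP mass parameter α), the limiting objective takes the form

$$ \min_{Z, A} \; \sum_{n=1}^{N} \lVert x_n - z_n A \rVert^2 \;+\; \lambda^2 K, $$

where K is the number of features in use. Coordinate descent alternates between updating each binary row z_n (opening a new feature only when doing so reduces the cost by more than λ²) and updating A by least squares.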