Expectation-Maximization & Belief Propagation

Transcript of Expectation-Maximization & Belief Propagation

Page 1: Expectation-Maximization & Belief Propagation

Expectation-Maximization & Belief Propagation

Alan Yuille

Dept. Statistics UCLA

Page 2: Expectation-Maximization & Belief Propagation

Goal of this Talk.

• The goal is to introduce the Expectation-Maximization (EM) and Belief Propagation (BP) algorithms.

• EM is one of the major algorithms used for inference in models with hidden/missing/latent variables.


Page 3: Expectation-Maximization & Belief Propagation

Example: Geman and Geman

Page 4: Expectation-Maximization & Belief Propagation

Images are piecewise smooth

Assume that images are smooth except at sharp discontinuities (edges). Justification from the statistics of real images (Zhu & Mumford).

Page 5: Expectation-Maximization & Belief Propagation

Graphical Model & Potential

The potential: if the gradient of u becomes too large, then the line process is activated and the smoothness term is cut.
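
The slide's formula is not reproduced in this transcript; a common way of writing this kind of coupled potential (a sketch, with constants and neighbourhood structure chosen for illustration) is the weak-membrane energy with a binary line process L:

E(u, L | d) = \sum_i \frac{(u_i - d_i)^2}{2\sigma^2} + \lambda \sum_{\langle i,j \rangle} (1 - L_{ij}) (u_i - u_j)^2 + \mu \sum_{\langle i,j \rangle} L_{ij}.

When the squared gradient (u_i - u_j)^2 exceeds roughly \mu / \lambda, it is cheaper to set L_{ij} = 1, which cuts the smoothness term between pixels i and j.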

The graphical model: an undirected graph (a hidden Markov model).

Page 6: Expectation-Maximization & Belief Propagation

The Posterior Distribution

• We apply Bayes rule to get a posterior distribution:
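
The slide's equation is not in the transcript; with u the underlying image, L the line process, and d the observed data, Bayes rule gives (up to the exact potentials used on the slide)

P(u, L | d) = \frac{P(d | u)\, P(u, L)}{P(d)} \propto \exp\{-E(u, L | d)\}.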

Page 7: Expectation-Maximization & Belief Propagation

Line Process: Off and On

• Illustration of line processes: the two states of the line process, edge vs. no edge.

Page 8: Expectation-Maximization & Belief Propagation

Choice of Task.

What do we want to estimate?
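
The options listed on the slide are not reproduced here; the standard choices for this model are to estimate u and L jointly by MAP,

(\hat{u}, \hat{L}) = \arg\max_{u, L} P(u, L | d),

or to treat the line process L as a hidden variable and estimate u alone after summing it out,

\hat{u} = \arg\max_{u} \sum_{L} P(u, L | d).

EM, introduced next, addresses the second formulation.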

Page 9: Expectation-Maximization & Belief Propagation

Expectation Maximization.

Page 10: Expectation-Maximization & Belief Propagation

Expectation-Maximization
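
The slide's equations are not in the transcript; for a posterior P(u, L | d) with hidden variables L, the standard EM iteration is

E-step: \quad Q(u \mid u^t) = \sum_{L} P(L \mid u^t, d) \log P(u, L \mid d),

M-step: \quad u^{t+1} = \arg\max_{u} Q(u \mid u^t).

Each iteration is guaranteed not to decrease \log P(u \mid d) = \log \sum_L P(u, L \mid d).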

Page 11: Expectation-Maximization & Belief Propagation

Back to the Geman & Geman model

Page 12: Expectation-Maximization & Belief Propagation

Image Example

Page 13: Expectation-Maximization & Belief Propagation

Neural Networks and the Brain

• An early variant of this algorithm was formulated as a Hopfield network.

• Koch, Marroquin, Yuille (1987).

• It is just possible that a variant of this algorithm is implemented in V1 – Prof. Tai Sing Lee (CMU).

Page 14: Expectation-Maximization & Belief Propagation

EM for a Mixture of two Gaussians

• A mixture model is of the form:
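
The formula on the slide is not reproduced here; a two-component Gaussian mixture is typically written as

p(x \mid \theta) = \pi\, \mathcal{N}(x; \mu_1, \sigma_1^2) + (1 - \pi)\, \mathcal{N}(x; \mu_2, \sigma_2^2),

with parameters \theta = (\pi, \mu_1, \sigma_1, \mu_2, \sigma_2) and, for each observation, a hidden assignment variable indicating which component generated it.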

Page 15: Expectation-Maximization & Belief Propagation

EM for a Mixture of two Gaussians

• Each observation has been generated by one of two Gaussians. But we do not know the parameters (i.e. mean and variance) of the Gaussians, and we do not know which Gaussian generated each observation.

• Colours indicate the assignment of points to clusters (red and blue). Intermediates (e.g. purple) represent probabilistic assignments. The ellipses represent the current parameter values of each cluster.
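
As a concrete illustration of the E and M steps for this example, here is a minimal NumPy sketch of EM for a 1-D mixture of two Gaussians (the function name, initialisation scheme, and test data are mine, not from the slides):

```python
import numpy as np

def em_two_gaussians(x, n_iters=100):
    """EM for a 1-D mixture of two Gaussians; x is an array of observations."""
    # Crude initialisation: put the two means at the 25th and 75th percentiles.
    pi, mu1, mu2 = 0.5, np.percentile(x, 25), np.percentile(x, 75)
    var1 = var2 = np.var(x)
    for _ in range(n_iters):
        # E-step: responsibility of component 1 for each point (soft assignment).
        p1 = pi * np.exp(-(x - mu1) ** 2 / (2 * var1)) / np.sqrt(2 * np.pi * var1)
        p2 = (1 - pi) * np.exp(-(x - mu2) ** 2 / (2 * var2)) / np.sqrt(2 * np.pi * var2)
        r = p1 / (p1 + p2)
        # M-step: re-estimate the mixing weight, means, and variances.
        pi = r.mean()
        mu1 = (r * x).sum() / r.sum()
        mu2 = ((1 - r) * x).sum() / (1 - r).sum()
        var1 = (r * (x - mu1) ** 2).sum() / r.sum()
        var2 = ((1 - r) * (x - mu2) ** 2).sum() / (1 - r).sum()
    return pi, mu1, var1, mu2, var2

# Usage: data drawn from two well-separated Gaussians.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(3.0, 1.0, 300)])
print(em_two_gaussians(x))
```

The responsibilities r are the probabilistic (soft) assignments described above; hard red/blue colours correspond to r near 1 or 0, and purple to intermediate values.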

Page 16: Expectation-Maximization & Belief Propagation

Expectation-Maximization: Summary

• We can apply EM to any inference problem with hidden variables.

• The following limitations apply:

• (1) Can we perform the E and M steps? For the image problem, the E step was analytic and the M step required solving linear equations.

• (2) Does the algorithm converge to the global maximum of P(u|d)? This is true for some problems, but not for all.

Page 17: Expectation-Maximization & Belief Propagation

Expectation Maximization: Summary

• For an important class of problems, EM has a nice symbiotic relationship with dynamic programming (see next lecture).

• Mathematically, the EM algorithm falls into a class of optimization techniques known as Majorization (statistics) and Variational Bounding (machine learning). Majorization (De Leeuw) is considerably older.

Page 18: Expectation-Maximization & Belief Propagation

Belief Propagation (BP) and Message Passing

• BP is an inference algorithm that is exact for graphical models defined on trees. It is similar to dynamic programming (see next lecture).

• It is often known as “loopy BP” when applied to graphs with closed loops.

• Empirically, it is often a successful approximate algorithm for graphs with closed loops. But it tends to degrade badly as the number of closed loops increases.

Page 19: Expectation-Maximization & Belief Propagation

BP and Message Passing

• We define a distribution on an undirected graph (a standard pairwise factorization is sketched after this list).

• BP comes in two forms: (I) sum-product, and (II) max-product.

• Sum-product (Pearl) is used for estimating the marginal distributions of the variables x.
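
The factorization on the slide is not in the transcript; a standard pairwise model of the kind BP operates on is

P(x) = \frac{1}{Z} \prod_i \phi_i(x_i) \prod_{(i,j) \in E} \psi_{ij}(x_i, x_j),

where \phi_i are unary potentials, \psi_{ij} are pairwise potentials on the edges E of the graph, and Z is the normalization constant.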

Page 20: Expectation-Maximization & Belief Propagation

Message Passing: Sum Product

• Sum-product proceeds by passing messages between nodes.
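
The update rule on the slide is not reproduced here; in the notation above, the standard sum-product message from node i to a neighbouring node j is

m_{i \to j}(x_j) \leftarrow \sum_{x_i} \psi_{ij}(x_i, x_j)\, \phi_i(x_i) \prod_{k \in N(i) \setminus j} m_{k \to i}(x_i),

where N(i) denotes the neighbours of i.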

Page 21: Expectation-Maximization & Belief Propagation

Message Passing: Max Product

• The max-product algorithm (Gallager) also uses messages, but it replaces the sum by a max.

• The update rule is:
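
The rule itself is not in the transcript; replacing the sum in the sum-product update by a max gives

m_{i \to j}(x_j) \leftarrow \max_{x_i} \psi_{ij}(x_i, x_j)\, \phi_i(x_i) \prod_{k \in N(i) \setminus j} m_{k \to i}(x_i).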

Page 22: Expectation-Maximization & Belief Propagation

Beliefs and Messages

• We construct “beliefs” – estimates of the marginal probabilities – from the messages:
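
The formulas on the slide are not reproduced here; the usual single-node and pairwise beliefs are

b_i(x_i) \propto \phi_i(x_i) \prod_{k \in N(i)} m_{k \to i}(x_i),

b_{ij}(x_i, x_j) \propto \psi_{ij}(x_i, x_j)\, \phi_i(x_i)\, \phi_j(x_j) \prod_{k \in N(i) \setminus j} m_{k \to i}(x_i) \prod_{l \in N(j) \setminus i} m_{l \to j}(x_j),

each normalized to sum to one.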

• For graphical models defined on trees (i.e. no closed loops):

• (i) sum-product will converge to the marginals of the distribution P(x).

• (ii) max-product converges to the maximum probability states of P(x).

But this is not very special, because other algorithms do this – see next lecture.
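
To make the tree case concrete, here is a small NumPy sketch of sum-product on a three-node chain, with the resulting beliefs checked against brute-force marginals (the potentials are arbitrary positive numbers chosen only for illustration):

```python
import numpy as np

# A 3-node chain x0 - x1 - x2, each variable with 2 states.
phi = [np.array([1.0, 2.0]), np.array([0.5, 1.5]), np.array([2.0, 1.0])]
psi01 = np.array([[1.0, 0.2], [0.2, 1.0]])   # pairwise potential on edge (0,1)
psi12 = np.array([[0.3, 1.0], [1.0, 0.3]])   # pairwise potential on edge (1,2)

# Sum-product: sweep messages inward from both ends of the chain.
m_0to1 = psi01.T @ phi[0]                    # m_{0->1}(x1) = sum_{x0} psi01(x0,x1) phi0(x0)
m_2to1 = psi12 @ phi[2]                      # m_{2->1}(x1) = sum_{x2} psi12(x1,x2) phi2(x2)
m_1to2 = psi12.T @ (phi[1] * m_0to1)         # m_{1->2}(x2)
m_1to0 = psi01 @ (phi[1] * m_2to1)           # m_{1->0}(x0)

# Beliefs: local potential times incoming messages, then normalize.
b0 = phi[0] * m_1to0; b0 /= b0.sum()
b1 = phi[1] * m_0to1 * m_2to1; b1 /= b1.sum()
b2 = phi[2] * m_1to2; b2 /= b2.sum()

# Brute-force check of the marginals against the full joint distribution.
joint = np.einsum('i,j,k,ij,jk->ijk', phi[0], phi[1], phi[2], psi01, psi12)
joint /= joint.sum()
print(np.allclose(b0, joint.sum(axis=(1, 2))),
      np.allclose(b1, joint.sum(axis=(0, 2))),
      np.allclose(b2, joint.sum(axis=(0, 1))))   # expected: True True True
```

On this chain (a tree) the beliefs equal the exact marginals; on a graph with closed loops the same message updates would only give approximations.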

Page 23: Expectation-Maximization & Belief Propagation

Loopy BP

• The major interest in BP is that it performs well empirically when applied to graphs with closed loops.

• But:

• (i) convergence is not guaranteed (the algorithm can oscillate);

• (ii) the resulting beliefs are only approximations to the correct marginals.

Page 24: Expectation-Maximization & Belief Propagation

Bethe Free Energy

• There is one major theoretical result (Yedidia et al.).

• The fixed points of BP correspond to extrema of the Bethe free energy.

• The Bethe free energy is one of a set of approximations to the free energy.
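
The formula is not reproduced in the transcript; one standard way of writing the Bethe free energy (Yedidia, Freeman & Weiss) in terms of the beliefs is

F_{\text{Bethe}} = \sum_{(i,j) \in E} \sum_{x_i, x_j} b_{ij}(x_i, x_j) \log \frac{b_{ij}(x_i, x_j)}{\psi_{ij}(x_i, x_j)\, \phi_i(x_i)\, \phi_j(x_j)} \;-\; \sum_i (n_i - 1) \sum_{x_i} b_i(x_i) \log \frac{b_i(x_i)}{\phi_i(x_i)},

where n_i is the number of neighbours of node i, and the beliefs are constrained to be normalized and pairwise-consistent (\sum_{x_j} b_{ij}(x_i, x_j) = b_i(x_i)).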

Page 25: Expectation-Maximization & Belief Propagation

BP without messages.

• Use the beliefs to construct local approximations B(.) to the distribution.

• Update beliefs by repeated marginalization.

Page 26: Expectation-Maximization & Belief Propagation

BP without messages

• Local approximations (consistent on trees).
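
The approximation on the slide is not reproduced here; on a tree the joint distribution factorizes exactly in terms of the local beliefs,

P(x) = \prod_i b_i(x_i) \prod_{(i,j) \in E} \frac{b_{ij}(x_i, x_j)}{b_i(x_i)\, b_j(x_j)},

and the message-free updates repeatedly re-impose the marginalization constraint \sum_{x_j} b_{ij}(x_i, x_j) = b_i(x_i).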

Page 27: Expectation-Maximization & Belief Propagation

Another Viewpoint of BP

• There is also a relationship between BP and Markov Chain Monte Carlo (MCMC).

• BP is like a deterministic form of the Gibbs sampler.

• MCMC will be described in later lectures.

Page 28: Expectation-Maximization & Belief Propagation

Summary of BP

• BP gives exact results on trees (similar to dynamic programming).

• BP gives surprisingly good approximate results on graphs with loops. There are no guarantees of convergence, but fixed points of BP correspond to extrema of the Bethe free energy.

• BP can be formulated without messages.

• BP is like a deterministic version of the Gibbs sampler in MCMC.