
Transcript of "Efficient Bayesian computation by proximal Markov chain Monte Carlo: when Langevin meets Moreau" (…mp71/slides/mpereyra.pdf)

Page 1:

Efficient Bayesian computation by proximal Markov chain Monte Carlo: when Langevin meets Moreau

Dr. Marcelo Pereyra
http://www.macs.hw.ac.uk/~mp71/

Maxwell Institute for Mathematical Sciences, Heriot-Watt University

June 2017, Heriot-Watt, Edinburgh.

Joint work with Alain Durmus and Eric Moulines

Page 2:

Outline

1 Bayesian inference in imaging inverse problems

2 Proximal Markov chain Monte Carlo

3 Experiments

4 Conclusion

Page 3:

Imaging inverse problems

We are interested in an unknown x ∈ R^d.

We measure y, related to x by a statistical model p(y|x).

The recovery of x from y is ill-posed or ill-conditioned, resulting in significant uncertainty about x.

For example, in many imaging problems

y = Ax + w,

for some operator A that is rank-deficient, and additive noise w.

Page 4:

The Bayesian framework

We use priors to reduce uncertainty and deliver accurate results.

Given the prior p(x), the posterior distribution of x given y

p(x|y) = p(y|x) p(x)/p(y)

models our knowledge about x after observing y.

In this talk we consider that p(x|y) is log-concave; i.e.,

p(x|y) = exp{−φ(x)}/Z,

where φ(x) is a convex function and Z = ∫ exp{−φ(x)} dx.

Page 5:

Inverse problems in mathematical imaging

More precisely, we consider models of the form

p(x|y) ∝ exp{−f(x) − g(x)}    (1)

where f(x) and g(x) are lower semicontinuous convex functions from R^d to (−∞, +∞] and f is L_f-Lipschitz differentiable. For example,

f(x) = ∥y − Ax∥₂² / (2σ²)

for some observation y ∈ R^p and linear operator A ∈ R^{p×n}, and

g(x) = α∥Bx∥† + 1_S(x)

for some norm ∥·∥†, dictionary B ∈ R^{n×n}, and convex set S (with 1_S the indicator of S). Often, g ∉ C¹.
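
In practice, the terms appearing in g often admit cheap, closed-form proximal mappings. A minimal numpy sketch for two common illustrative choices, t∥x∥₁ and the indicator of the non-negative orthant (assumptions of this sketch, not the slide's exact model):

```python
import numpy as np

def prox_l1(x, t):
    # Soft-thresholding: closed-form prox of u -> t * ||u||_1.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_indicator_nonneg(x):
    # Prox of the indicator 1_S for S = {x : x >= 0} is the projection onto S.
    return np.maximum(x, 0.0)
```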

Page 6:

Maximum-a-posteriori (MAP) estimation

The predominant Bayesian approach in imaging is MAP estimation

x̂_MAP = argmax_{x ∈ R^d} p(x|y) = argmin_{x ∈ R^d} f(x) + g(x),    (2)

that can be computed efficiently by “proximal” convex optimisation.

For example, the proximal gradient algorithm

x_{m+1} = prox_g^{1/L} {x_m − L^{-1} ∇f(x_m)},  with L = L_f,

where prox_g^λ(x) = argmin_{u ∈ R^d} g(u) + ∥u − x∥²/(2λ), converges at rate O(1/m).

However, x̂_MAP reveals very little information about p(x|y).
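
As an illustration, a minimal proximal gradient loop for the model f(x) = ∥y − Ax∥²/(2σ²) and g(x) = β∥x∥₁; A, y, sigma, beta are placeholder inputs, and this is a schematic sketch rather than the slide's implementation:

```python
import numpy as np

def map_estimate(y, A, sigma, beta, n_iter=500):
    # Proximal gradient descent for f(x) = ||y - Ax||^2 / (2 sigma^2)
    # and g(x) = beta * ||x||_1, with step 1/L and L = ||A||_2^2 / sigma^2.
    L = np.linalg.norm(A, 2) ** 2 / sigma ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad_f = A.T @ (A @ x - y) / sigma ** 2
        z = x - grad_f / L
        # prox of g with parameter 1/L: soft-thresholding at beta / L.
        x = np.sign(z) * np.maximum(np.abs(z) - beta / L, 0.0)
    return x
```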

Page 7:

Illustrative example: image resolution enhancement

Recover x ∈ Rd from low resolution and noisy measurements

y = Hx +w ,

where H is a circulant blurring matrix. We use the Bayesian model

p(x|y) ∝ exp(−∥y − Hx∥²/2σ² − β∥x∥₁).    (3)

Figure: Resolution enhancement of the Molecules image of size 256×256 pixels; panels show the observation y, the MAP estimate x̂_MAP, and the open question of uncertainty estimates.

Page 8:

Illustrative example: tomographic image reconstruction

Recover x ∈ Rd from partially observed and noisy Fourier measurements

y = ΦFx +w ,

where Φ is a mask and F is the 2D Fourier operator. We use the model

p(x|y) ∝ exp(−∥y − ΦFx∥²/2σ² − β∥∇_d x∥_{1−2}),    (4)

where ∇_d is the 2D discrete gradient operator and ∥·∥_{1−2} the ℓ1−ℓ2 norm.

Figure: Tomographic reconstruction of the Shepp-Logan phantom image; panels show the observation y, the MAP estimate x̂_MAP, and a possible alternative solution.

Page 9:

Modern Bayesian computation

Recent surveys on Bayesian computation...

25th anniversary special issue on Bayesian computation:
P. Green, K. Latuszynski, M. Pereyra, C. P. Robert, "Bayesian computation: a perspective on the current state, and sampling backwards and forwards", Statistics and Computing, vol. 25, no. 4, pp. 835-862, Jul. 2015.

Special issue on "Stochastic simulation and optimisation in signal processing":
M. Pereyra, P. Schniter, E. Chouzenoux, J.-C. Pesquet, J.-Y. Tourneret, A. Hero, and S. McLaughlin, "A Survey of Stochastic Simulation and Optimization Methods in Signal Processing", IEEE J. Sel. Topics in Signal Processing, in press.

Page 10:

Outline

1 Bayesian inference in imaging inverse problems

2 Proximal Markov chain Monte Carlo

3 Experiments

4 Conclusion

Page 11:

Inference by Markov chain Monte Carlo integration

Monte Carlo integration:
Given a set of samples X₁, ..., X_M distributed according to p(x|y), we approximate posterior expectations and probabilities:

(1/M) ∑_{m=1}^M h(X_m) → E{h(x)|y}, as M → ∞.

Guarantees from central limit theorems, e.g., √M [ (1/M) ∑_{m=1}^M h(X_m) − E{h(x)|y} ] → N(0, Σ).

Markov chain Monte Carlo:
Construct a Markov kernel X_{m+1}|X_m ∼ K(·|X_m) such that the Markov chain X₁, ..., X_M has p(x|y) as its stationary distribution.

MCMC simulation in high-dimensional spaces is very challenging.
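
In code, Monte Carlo integration is just an empirical average over the chain; a schematic numpy sketch (samples and event are placeholders):

```python
import numpy as np

def posterior_mean(samples):
    # samples: (M, d) array of MCMC draws from p(x|y).
    return samples.mean(axis=0)

def posterior_probability(samples, event):
    # P[x in event | y] estimated by the empirical frequency over the chain;
    # event is a boolean predicate on a single sample x.
    return np.mean([event(x) for x in samples])
```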

Page 12:

Unadjusted Langevin algorithm

Suppose for now that p(x|y) ∈ C¹. Then, we can generate samples by mimicking a Langevin diffusion process that converges to p(x|y) as t → ∞:

X: dX_t = (1/2) ∇ log p(X_t|y) dt + dW_t,  0 ≤ t ≤ T,  X_0 = x₀,

where W is the d-dimensional Brownian motion.

Because solving X_t exactly is generally not possible, we use an Euler-Maruyama approximation and obtain the "unadjusted Langevin algorithm"

ULA: X_{m+1} = X_m + δ ∇ log p(X_m|y) + √(2δ) Z_{m+1},  Z_{m+1} ∼ N(0, I_d).

ULA is remarkably efficient when p(x|y) is sufficiently regular.
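
A minimal ULA sketch; grad_log_post is a user-supplied placeholder for ∇ log p(x|y):

```python
import numpy as np

def ula(grad_log_post, x0, delta, n_samples, rng=None):
    # Unadjusted Langevin algorithm: Euler-Maruyama steps on the diffusion.
    if rng is None:
        rng = np.random.default_rng()
    x = np.asarray(x0, dtype=float).copy()
    samples = np.empty((n_samples, x.size))
    for m in range(n_samples):
        x = x + delta * grad_log_post(x) + np.sqrt(2 * delta) * rng.standard_normal(x.size)
        samples[m] = x
    return samples
```

For instance, grad_log_post = lambda x: -x targets the standard Gaussian, a convenient sanity check.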

Page 13:

Unadjusted Langevin algorithm

However, our interest is in high-dimensional models of the form

p(x|y) ∝ exp{−f(x) − g(x)}

with f, g l.s.c. convex, ∇f L_f-Lipschitz continuous, and g ∉ C¹.

Unfortunately, such models are beyond the scope of ULA, which may perform poorly if p(x|y) is not Lipschitz differentiable.

Idea: Regularise p(x|y) to enable efficient Langevin sampling.

Page 14:

Approximation of p(x|y)

Moreau-Yosida approximation of p(x|y) (Pereyra, 2015):

Let λ > 0. We propose to approximate p(x|y) with the density

p_λ(x|y) = exp[−f(x) − g_λ(x)] / ∫_{R^d} exp[−f(u) − g_λ(u)] du,

where g_λ is the Moreau-Yosida envelope of g, given by

g_λ(x) = inf_{u ∈ R^d} { g(u) + (2λ)^{-1} ∥u − x∥₂² },

and where λ controls the approximation error involved.
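
As a concrete instance, for g(x) = |x| both the proximal operator and the envelope have well-known closed forms (the envelope is the Huber function); a minimal numpy sketch:

```python
import numpy as np

def prox_abs(x, lam):
    # prox of g(u) = |u| with parameter lam: soft-thresholding.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def my_envelope_abs(x, lam):
    # Moreau-Yosida envelope of g(u) = |u|: quadratic near 0, linear beyond.
    return np.where(np.abs(x) <= lam, x ** 2 / (2 * lam), np.abs(x) - lam / 2)
```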

Page 15:

Moreau-Yosida approximations

Key properties (Pereyra, 2015; Durmus et al., 2017):

1 ∀λ > 0, p_λ defines a proper density of a probability measure on R^d.

2 Convexity and differentiability: p_λ is log-concave on R^d, and p_λ ∈ C¹ even if p is not differentiable, with

∇ log p_λ(x|y) = −∇f(x) + {prox_g^λ(x) − x}/λ,

where prox_g^λ(x) = argmin_{u ∈ R^d} g(u) + ∥u − x∥²/(2λ). Moreover, ∇ log p_λ is Lipschitz continuous with constant L ≤ L_f + λ^{-1}.

3 Approximation error between p_λ(x|y) and p(x|y):

lim_{λ→0} ∥p_λ − p∥_TV = 0.

If g is L_g-Lipschitz, then ∥p_λ − p∥_TV ≤ λ L_g².
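
Property 2 gives the gradient of the envelope without ever differentiating g; a quick finite-difference sanity check, reusing prox_abs and my_envelope_abs from the sketch above:

```python
# Finite-difference check of grad g_lam(x) = (x - prox_g^lam(x)) / lam,
# for the g(u) = |u| example (prox_abs, my_envelope_abs defined earlier).
lam, x, eps = 0.5, 1.3, 1e-6
grad_formula = (x - prox_abs(x, lam)) / lam
grad_numeric = (my_envelope_abs(x + eps, lam) - my_envelope_abs(x - eps, lam)) / (2 * eps)
assert abs(grad_formula - grad_numeric) < 1e-5
```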

Page 16:

Illustration

Examples of Moreau-Yoshida approximations:

p(x) ∝ exp(−|x|)    p(x) ∝ exp(−x⁴)    p(x) ∝ 1_{[−0.5, 0.5]}(x)

Figure: True densities (solid blue) and Moreau-Yosida approximations (dashed red).

Page 17:

Proximal ULA

We approximate X with the "regularised" auxiliary Langevin diffusion

X^λ: dX^λ_t = (1/2) ∇ log p_λ(X^λ_t|y) dt + dW_t,  0 ≤ t ≤ T,  X^λ_0 = x₀,

which targets p_λ(x|y). Remark: we can make X^λ arbitrarily close to X.

Finally, an Euler-Maruyama discretisation of X^λ leads to the (Moreau-Yosida regularised) proximal ULA

MYULA: X_{m+1} = (1 − δ/λ) X_m − δ ∇f(X_m) + (δ/λ) prox_g^λ(X_m) + √(2δ) Z_{m+1},

where we used that ∇g_λ(x) = {x − prox_g^λ(x)}/λ.
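
Putting the pieces together, a minimal MYULA sketch; grad_f and prox_g are user-supplied placeholders, with prox_g(x, lam) standing in for prox_g^λ(x):

```python
import numpy as np

def myula(grad_f, prox_g, x0, delta, lam, n_samples, rng=None):
    # MYULA: Langevin steps targeting p_lam(x|y), using the identity
    # grad g_lam(x) = (x - prox_g(x, lam)) / lam.
    if rng is None:
        rng = np.random.default_rng()
    x = np.asarray(x0, dtype=float).copy()
    samples = np.empty((n_samples, x.size))
    for m in range(n_samples):
        x = ((1 - delta / lam) * x
             - delta * grad_f(x)
             + (delta / lam) * prox_g(x, lam)
             + np.sqrt(2 * delta) * rng.standard_normal(x.size))
        samples[m] = x
    return samples
```

With δ below the admissible bound (L_f + 1/λ)^{-1} from the convergence results on the next slide, the chain approximately targets p_λ(x|y).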

Page 18:

Convergence results

Non-asymptotic estimation error bound

Theorem 2.1 (Durmus et al. (2017))

Let δ_max^λ = (L_f + 1/λ)^{-1}. Assume that g is Lipschitz continuous. Then there exist δ_ε ∈ (0, δ_max^λ] and M_ε ∈ N such that ∀δ < δ_ε and ∀M ≥ M_ε,

∥δ_{x₀} Q_δ^M − p∥_TV < ε + λ L_g²,

where Q_δ^M is the kernel associated with M iterations of MYULA with step δ.

Note: δ_ε and M_ε are explicit and tractable. If f + g is strongly convex outside some ball, then M_ε scales with order O(d log(d)) (otherwise at worst O(d⁵)). See Durmus et al. (2017) for other convergence results.
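
For instance, for the deblurring model above one has L_f = ∥H∥²/σ², and the choice λ = 1/L_f (an illustrative setting, not one prescribed by the theorem) gives admissible steps δ < δ_max^λ = (L_f + L_f)^{-1} = σ²/(2∥H∥²).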

Page 19:

Outline

1 Bayesian inference in imaging inverse problems

2 Proximal Markov chain Monte Carlo

3 Experiments

4 Conclusion

Page 20:

Sparse image deblurring

Bayesian credible region C*_α = {x : p(x|y) ≥ γ_α} with

P[x ∈ C*_α | y] = 1 − α,  and  p(x|y) ∝ exp(−∥y − Hx∥²/2σ² − β∥x∥₁).

Figure: Live-cell microscopy data (Zhu et al., 2012); panels show y, x̂_MAP, and uncertainty estimates. The uncertainty analysis (±78nm × ±125nm) is in close agreement with the experimental precision ±80nm.

Computing time 4 minutes. M = 10⁵ iterations. Estimation error 0.2%.
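
The threshold γ_α can be estimated directly from the chain; a hedged sketch, where phi is a placeholder for x ↦ f(x) + g(x) and the empirical-quantile rule is the standard estimator rather than a detail given on the slide:

```python
import numpy as np

def hpd_threshold(samples, phi, alpha=0.05):
    # Estimate gamma_alpha so that P[phi(x) <= gamma_alpha | y] ~ 1 - alpha,
    # i.e. the (1 - alpha) empirical quantile of phi over the MCMC chain.
    values = np.array([phi(x) for x in samples])
    return np.quantile(values, 1 - alpha)
```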

Page 21:

Sparse image deblurring

Estimation of the regularisation parameter β by marginal maximum likelihood:

β̂ = argmax_{β ∈ R⁺} p(y|β),  with  p(y|β) ∝ ∫ exp(−∥y − Hx∥²/2σ² − β∥x∥₁) dx

Figure: Maximum marginal likelihood estimation of the regularisation parameter β; panels show y, x̂_MAP, and the estimated regularisation parameter β.

Computing time 0.75 seconds.
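
One way to compute β̂ (an illustrative approach; the slide does not spell out its algorithm) is stochastic gradient ascent on log p(y|β), using the fact that for the normalised ℓ1 prior p(x|β) = (β/2)^d exp(−β∥x∥₁) one has d/dβ log p(y|β) = d/β − E{∥x∥₁ | y, β}; sample_x_given_beta is a placeholder that returns (approximate) draws from p(x|y, β), e.g. a few MYULA moves:

```python
import numpy as np

def estimate_beta(sample_x_given_beta, d, beta0=1.0, n_iter=200, step=1e-3):
    # Stochastic gradient ascent on log p(y|beta), with the gradient
    # d/dbeta log p(y|beta) = d/beta - E{||x||_1 | y, beta}
    # estimated from one posterior sample per iteration.
    beta = beta0
    for _ in range(n_iter):
        x = sample_x_given_beta(beta)          # e.g. one MYULA move
        grad = d / beta - np.sum(np.abs(x))
        beta = max(beta + step * grad, 1e-8)   # keep beta > 0
    return beta
```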

Page 22:

Bayesian model selection

p(M_k|y) = p(M_k) ∫ p(x, y|M_k) dx / p(y), with

p(x, y|M1) ∝ exp[−(∥y − H1x∥²/2σ²) − β TV(x)],

p(x, y|M2) ∝ exp[−(∥y − H2x∥²/2σ²) − β TV(x)].

Boat image deblurring experiment (computing time 30 minutes per model):

Figure: observation y (5×5 uniform blur, BSNR 40dB); x̂_{M1} (PSNR 34dB) with p(M1|y) = 0.96; x̂_{M2} (PSNR 33dB) with p(M2|y) = 0.04.

Page 23:

Uncertainty quantification of MRI tomographic image

Bayesian credible region C*_α = {x : p(x|y) ≥ γ_α} with

P[x ∈ C*_α | y] = 1 − α,  and  p(x|y) ∝ exp(−∥y − ΦFx∥²/2σ² − β∥∇_d x∥_{1−2}).

Figure: Shepp-Logan experiment; panels show x̂_MAP (tumour intensity 0.30) together with the minimum (0.27) and maximum (0.33) tumour intensities over the credible region: an uncertainty in tumour intensity of about 10%.

Computing time 1 minute. M = 10⁵ iterations. Estimation error 3%.

Page 24:

Outline

1 Bayesian inference in imaging inverse problems

2 Proximal Markov chain Monte Carlo

3 Experiments

4 Conclusion

Page 25:

Conclusion

The challenges facing modern image processing require a paradigm shift, and a new wave of analysis and computation methodologies.

Great potential for synergy between Bayesian and variational approaches at the algorithmic, methodological, and theoretical levels.

MYULA delivers reliable and computationally efficient approximate inferences, with good control of the accuracy vs. computing-time trade-off.

Page 26:

Thank you!

Bibliography:

Durmus, A., Moulines, E., and Pereyra, M. (2017). Efficient Bayesian computation by proximal Markov chain Monte Carlo: when Langevin meets Moreau. SIAM J. Imaging Sci., to appear.

Pereyra, M. (2015). Proximal Markov chain Monte Carlo algorithms. Statistics and Computing. Open access: http://dx.doi.org/10.1007/s11222-015-9567-4.

Zhu, L., Zhang, W., Elnatan, D., and Huang, B. (2012). Faster STORM usingcompressed sensing. Nat. Meth., 9(7):721–723.
