Transcript of: Advanced Lectures on Bayesian Analysis (Heavens_Lecture_5_GLM.pdf)

Page 1

Advanced Lectures on Bayesian Analysis

Alan Heavens

Imperial Centre for Inference and Cosmology (ICIC), Imperial College, London

[email protected]

November 23, 2016

Page 2

Overview

1 General linear models

2 Wiener filtering

3 Messenger Fields

4 Further Reading

Page 3

General linear models

Many problems are linear, in the sense that the measured data are linear combinations of the parameters of interest, i.e.

y = Ax + n,

where A is a matrix and n is noise.

Note that A may not be square (and hence not invertible).

Page 4

Map making

An example of this is map-making in the CMB. x would represent the pixel temperatures, y is the ‘time-ordered data’, and A is a very sparse matrix of 1s and 0s, being 1 if the telescope is pointing at the pixel and zero otherwise.

The model can be extended so that x can contain anything else on which the data depend linearly, which might involve calibration uncertainties or systematic effects.
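As a concrete illustration, here is a minimal sketch of this forward model in Python (the sizes, noise level and random pointing are invented for illustration; this is not code from the lecture):

```python
import numpy as np
from scipy import sparse

# Toy map-making setup: x holds pixel temperatures, A is a sparse pointing matrix
# with a single 1 per time sample, n is white noise, and y is the time-ordered data.
rng = np.random.default_rng(0)
n_pix, n_samp, sigma = 100, 5000, 0.5

x_true = rng.normal(size=n_pix)                          # underlying pixel map
hits = rng.integers(0, n_pix, size=n_samp)               # pixel observed at each time step
A = sparse.csr_matrix((np.ones(n_samp), (np.arange(n_samp), hits)),
                      shape=(n_samp, n_pix))
y = A @ x_true + sigma * rng.normal(size=n_samp)         # y = Ax + n
```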

Page 5

Generalised Linear Models

From a Bayesian perspective, we are most interested in the posterior distribution of x, given the data y, but let us think about making a map, i.e. we want an estimator for x, given the data.

We further assume that we know the noise covariance matrix N, i.e.:

⟨n⟩ = 0;   ⟨n n^T⟩ ≡ N.

N will have off-diagonal terms if the noise is correlated.

Later we also assume that we know the signal power spectrum or, equivalently, correlation function:

⟨x⟩ = 0;   ⟨x n^T⟩ = 0;   ⟨x x^T⟩ ≡ S.

Page 6

Generalised Linear Models

We are after the posterior (conditional on A, assumed known). If we assume nothing about the signal covariance, we want

p(x|y, N) ∝ p(y|x, N) p(x|N).

Since the signal is not dependent on the noise properties, p(x|N) = p(x), and we take it to be uniform. Hence, marginalising over the noise n,

p(x|y, N) ∝ p(y|x, N) ∝ ∫ p(y, n|x, N) dn ∝ ∫ p(y|x, n) p(n|N) dn.

In the last integral, we use the linear model, p(y|x, n) = δ(y − Ax − n), so

p(x|y, N) ∝ p(n = y − Ax | N) ∝ exp[−(1/2) (y − Ax)^T N^{-1} (y − Ax)].

The last equation applies if the noise is Gaussian, p(n|N) = exp(−n^T N^{-1} n / 2) / √|2πN|.

Page 7

Generalised Linear Models

p(x|y, N) ∝ exp[−(1/2) (y − Ax)^T N^{-1} (y − Ax)].

The maximum (i.e. the maximum likelihood [ML] estimate) of this distribution is given by differentiating w.r.t. an element of x, yielding two identical terms that give

A^T N^{-1} (y − Ax) = 0,

which we solve to give the ML estimate:

x_ML = W y,

where

W = (A^T N^{-1} A)^{-1} A^T N^{-1}.
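As a numerical sketch (a toy problem with invented sizes and white noise; not code from the lecture), the estimator can be applied directly:

```python
import numpy as np

# Minimal sketch of the ML (generalised least squares) map estimate:
#   x_ML = (A^T N^-1 A)^-1 A^T N^-1 y
rng = np.random.default_rng(1)
n_pix, n_samp, sigma = 50, 2000, 0.3

x_true = rng.normal(size=n_pix)
A = np.zeros((n_samp, n_pix))
A[np.arange(n_samp), rng.integers(0, n_pix, n_samp)] = 1.0   # toy pointing matrix
y = A @ x_true + sigma * rng.normal(size=n_samp)             # y = Ax + n

N_inv = np.eye(n_samp) / sigma**2                            # N^-1 for white noise
W = np.linalg.solve(A.T @ N_inv @ A, A.T @ N_inv)            # W = (A^T N^-1 A)^-1 A^T N^-1
x_ml = W @ y                                                 # ML estimate of the map
```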

Page 8

Generalised Linear Models

x_ML = W y;   W = (A^T N^{-1} A)^{-1} A^T N^{-1}.

There are several things to note about this estimate:

First, WA = I, which means that the error in x is independent of the value of the field:

ε ≡ x_ML − x
ε = W(Ax + n) − x
ε = (WA − I)x + W n = W n.

Second, it minimizes χ² = (y − Ax)^T N^{-1} (y − Ax) (evidently).

Third (exercise), it minimizes the mean square error ⟨|ε|²⟩ subject to the constraint WA = I.

Page 9

These are potentially desirable properties, so from a frequentist point of view, this is a useful estimator. For Gaussian n, it also happens to be the ML estimator of x.

You may like to show that the noise covariance in the map is

⟨ε ε^T⟩ = (A^T N^{-1} A)^{-1}.
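As a quick numerical sanity check (a sketch with an invented small problem; not from the lecture), this covariance can be verified by Monte Carlo:

```python
import numpy as np

# Monte Carlo check that the map noise covariance is <eps eps^T> = (A^T N^-1 A)^-1.
rng = np.random.default_rng(5)
n_pix, n_samp, sigma, n_real = 5, 200, 0.4, 20000

A = np.zeros((n_samp, n_pix))
A[np.arange(n_samp), rng.integers(0, n_pix, n_samp)] = 1.0
N_inv = np.eye(n_samp) / sigma**2
W = np.linalg.solve(A.T @ N_inv @ A, A.T @ N_inv)

eps = (W @ (sigma * rng.normal(size=(n_samp, n_real)))).T    # eps = W n for many noise draws
emp_cov = eps.T @ eps / n_real                               # empirical <eps eps^T>
pred_cov = np.linalg.inv(A.T @ N_inv @ A)                    # predicted (A^T N^-1 A)^-1
print(np.max(np.abs(emp_cov - pred_cov)))                    # should be close to zero
```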

Page 10

Wiener filtering

So far we have not exploited any knowledge of the two-point properties of the signal.

Let us suppose that we know S.

Then we can compute the posterior for x given y, N and S as well.

The treatment is similar:

p(x|y, N, S) ∝ p(y|x, N, S) p(x|N, S).

Since the signal is not dependent on the noise properties, p(x|N, S) = p(x|S), but now we assume it is Gaussian:

p(x|S) = exp[−(1/2) x^T S^{-1} x] / √|2πS|.

Again, we use the linear model, p(y|x, n) = δ(y − Ax − n), and marginalise over n, so

p(x|y, N, S) ∝ exp[−(1/2) (y − Ax)^T N^{-1} (y − Ax) − (1/2) x^T S^{-1} x].

Page 11

Generalised Linear Models

p(x|y, N, S) ∝ exp[−(1/2) (y − Ax)^T N^{-1} (y − Ax) − (1/2) x^T S^{-1} x].

The quadratic form in x in the exponent (multiplied by −2) can be manipulated to

x^T (A^T N^{-1} A + S^{-1}) x − x^T A^T N^{-1} y − y^T N^{-1} A x + y^T N^{-1} y,

and we can complete the square, giving (up to terms independent of x)

(x − x_WF)^T (A^T N^{-1} A + S^{-1}) (x − x_WF),

where ensuring agreement with the terms linear in x requires the Wiener-filtered map to be

x_WF = W_WF y,   W_WF = (A^T N^{-1} A + S^{-1})^{-1} A^T N^{-1}.

This is the maximum posterior estimate of x, given Gaussian noise and signal.
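A minimal numerical sketch of this filter, side by side with the ML estimate (the toy signal covariance, sizes and noise level are invented for illustration):

```python
import numpy as np

# Wiener filter x_WF = (A^T N^-1 A + S^-1)^-1 A^T N^-1 y, compared with the ML map.
rng = np.random.default_rng(2)
n_pix, n_samp, sigma = 50, 500, 1.0

S = np.diag(1.0 / (1.0 + np.arange(n_pix)))                   # toy signal covariance (prior)
x_true = rng.multivariate_normal(np.zeros(n_pix), S)
A = np.zeros((n_samp, n_pix))
A[np.arange(n_samp), rng.integers(0, n_pix, n_samp)] = 1.0    # toy pointing matrix
y = A @ x_true + sigma * rng.normal(size=n_samp)

N_inv = np.eye(n_samp) / sigma**2
F = A.T @ N_inv @ A                                           # A^T N^-1 A
x_ml = np.linalg.solve(F, A.T @ N_inv @ y)                    # ML estimate
x_wf = np.linalg.solve(F + np.linalg.inv(S), A.T @ N_inv @ y) # Wiener-filtered map
# The Wiener solution is pulled towards the prior mean (zero), suppressing noisy peaks.
```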

Page 12

Generalised Linear Models

From an estimator point of view, this also minimizes ⟨|ε|²⟩, without the condition WA = I.

The reconstruction error is no longer independent of x: the filter tends to suppress peaks.

Note that from a Bayesian perspective, the complete output of the experiment is the full posterior, not an estimator.

If one wants to do inference with the map, one should also include the Gaussian uncertainty around the Wiener filter solution, with covariance matrix

C_WF = (A^T N^{-1} A + S^{-1})^{-1}.

We can draw samples from the posterior for x, since it is a multivariate Gaussian, N(x_WF, C_WF).
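For instance (a sketch assuming x_wf, F = A^T N^-1 A and S from the previous snippet; the function name and draw count are made up), samples can be drawn with a Cholesky factor of C_WF:

```python
import numpy as np

def sample_posterior(x_wf, F, S_inv, n_draws, rng):
    """Draw n_draws samples from N(x_wf, (F + S_inv)^-1) via a Cholesky factor."""
    C_wf = np.linalg.inv(F + S_inv)          # posterior covariance C_WF
    L = np.linalg.cholesky(C_wf)             # C_WF = L L^T
    u = rng.normal(size=(n_draws, len(x_wf)))
    return x_wf + u @ L.T                    # each row is one posterior sample

# Example usage (with the quantities from the Wiener-filter sketch above):
# samples = sample_posterior(x_wf, F, np.linalg.inv(S), 1000, np.random.default_rng(3))
```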

Page 13

Wiener filtered images

Figure: Wiener-filtered map (d) from Pogrebnyak & Lukin (2003).

Page 14

Summary

There are powerful linear algebra tools for linear models; a lot is known.

Solutions that are most probable from a Bayesian perspective coincide with estimator-based solutions that are optimised subject to certain conditions, provided that the fields are Gaussian.

From a Bayesian point of view, the most probable (maximum a posteriori, or MAP) solution is not the whole story; we want and need the full posterior.

For Gaussian fields we can sample from the posterior if we can compute S^{-1} and (A^T N^{-1} A)^{-1}.

Page 15

Messenger Fields

Sometimes we do not know, or cannot compute, some of the conditional distributions in Bayesian hierarchical models (BHMs).

One elegant trick is Data Augmentation, where we introduce additional latent variables with conditional distributions that we can sample from.

These extra variables are sometimes called Messenger Fields

Example: Gaussian fields. We know that the posterior for the field is a Gaussian field, with mean given by the Wiener-filtered map and known covariance (we take the response matrix to be A = I for simplicity):

x ∼ N(x_WF, C_WF),

x_WF = (N^{-1} + S^{-1})^{-1} N^{-1} y,   C_WF = (N^{-1} + S^{-1})^{-1}.

Page 16

Messenger Fields (Elsner, Wandelt 2012)

Typical cosmology case: maps are large, so N^{-1} and S^{-1} are only computable if they are diagonal.

N is typically diagonal in pixel space (uncorrelated noise)

S is typically diagonal in the Fourier basis (statistical homogeneity or isotropy)

There is no basis in which N and S are both diagonal (unless N ∝ I).

Conclusion: we cannot compute x_WF = (N^{-1} + S^{-1})^{-1} N^{-1} y or C_WF = (N^{-1} + S^{-1})^{-1}.

Solution: make the problem harder: introduce another large (fictitious) map, the Messenger Field t. A sketch of the resulting iteration is given below.

t carries the white-noise contribution to N, so its covariance matrix T ∝ I, which is diagonal in both bases.
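A minimal sketch of that iteration for A = I (following the scheme described on these slides, cf. Elsner & Wandelt 2012; the 1-D toy field, power spectrum, noise levels and iteration count are all invented, and T is taken slightly below the minimum noise variance so that the non-white part stays positive):

```python
import numpy as np

# Messenger-field iteration for the Wiener filter, alternating between bases:
# pixel space, where Nbar and T are diagonal, and Fourier space, where S and T are diagonal.
rng = np.random.default_rng(4)
n_pix = 256

k = np.fft.rfftfreq(n_pix)                        # Fourier modes of a 1-D periodic map
S_k = 1.0 / (0.01 + k**2)                         # toy signal power spectrum (diagonal S)
noise_var = rng.uniform(0.5, 2.0, size=n_pix)     # diagonal N in pixel space

x_true = np.fft.irfft(np.sqrt(S_k) * rng.normal(size=k.size), n=n_pix, norm="ortho")
y = x_true + np.sqrt(noise_var) * rng.normal(size=n_pix)   # data y = x + n

tau = 0.9 * noise_var.min()                       # T = tau * I, diagonal in both bases
Nbar = noise_var - tau                            # N = Nbar + T, with Nbar > 0 here

s = np.zeros(n_pix)
for _ in range(200):
    # Pixel space: messenger field given the data and the current signal estimate
    t = (y / Nbar + s / tau) / (1.0 / Nbar + 1.0 / tau)
    # Fourier space: signal given the messenger field (orthonormal FFT keeps T = tau * I)
    t_k = np.fft.rfft(t, norm="ortho")
    s = np.fft.irfft(S_k / (S_k + tau) * t_k, n=n_pix, norm="ortho")
# s now approximates the Wiener-filtered map without inverting (N^-1 + S^-1) in any one basis.
```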

Page 17

Forward Model or Generative Model

Notation change for this diagram: (s → x; d → y; C → S).

[Figure: hierarchical-model diagrams. Original model: nodes P(C), C, P(s|C), s, P(d|s, N), N, d. Augmented model with the messenger field: nodes P(t|s, T), T, t, P(d|t, N̄), N̄, d.]

Cannot sample from s conditioned on the data d because we cannot compute the inverses of C and N in the same basis.

Page 18

Messenger Fields

[Figure: the same pair of hierarchical-model diagrams, with the messenger field t inserted between s and d: nodes P(C), C, P(s|C), s, P(t|s, T), T, t, P(d|t, N̄), N̄, d.]

N̄ is the non-white noise. The top half of the hierarchy is diagonal in Fourier space (C, T); the bottom half is diagonal in pixel space (N̄, T).

Page 19

Messenger Fields in Weak lensing

Alsing, AFH et al. (2016a). ∼130,000 parameters; Gibbs sampling.

Page 20

Messenger Fields in Weak lensing

Alsing, AFH et al. (2016b). ∼130,000 parameters; Gibbs sampling.

Page 21

Mega BHM, including more levels of the hierarchy

Page 22

Summary

There are powerful linear algebra tools for linear models; a lot is known.

Solutions that are most probable from a Bayesian perspective coincide with estimator-based solutions that are optimised subject to certain conditions, provided that the fields are Gaussian.

From a Bayesian point of view, the most probable (maximum a posteriori, or MAP) solution is not the whole story; we want and need the full posterior.

For Gaussian fields we can sample from the posterior if we can compute S^{-1} and (A^T N^{-1} A)^{-1}.

If we cannot, introducing extra latent variables (data augmentation; Messenger Fields) may allow a solution.

Very high-dimensional parameter spaces (e.g. ∼10^6 parameters) can in some cases be sampled with Gibbs sampling or HMC.

Page 23

Further Reading

Bayes in the Sky (Roberto Trotta, arXiv:0803.4089)

Bayesian Data Analysis (Andrew Gelman et al., CRC Press)

Information Theory, Inference and Learning Algorithms (David MacKay, CUP)

Berkeley course on Bayesian Modeling and Inference (Michael I. Jordan). This is an excellent resource, on which I have drawn for some bits of the material. Making these publicly available is acknowledged and appreciated.

http://www.cs.berkeley.edu/~jordan/courses/260-spring10/lectures/
