Prox versus Prior? Who cares—Just learn the damn thing! Philip Schniter and Jeremy Vila Supported in part by NSF-I/UCRC grant IIP-0968910, by NSF grant CCF-1018368, and by DARPA/ONR grant N66001-10-1-4090. BASP Frontiers — Jan ’13

Transcript of Prox versus Prior? Who cares—Just learn the damn thing! (schniter/pdf/basp13_em_slides.pdf)

Page 1:

Prox versus Prior?

Who cares—Just learn the damn thing!

Philip Schniter and Jeremy Vila

Supported in part by NSF-I/UCRC grant IIP-0968910, by NSF grant CCF-1018368, and by DARPA/ONR grant N66001-10-1-4090.

BASP Frontiers — Jan ’13

Page 2:

Compressive Sensing

Goal: recover signal x from noisy sub-Nyquist measurements

y = Ax + w,  x ∈ R^N,  y, w ∈ R^M,  M < N,

where x is K-sparse with K < M, or compressible.

With sufficient sparsity and appropriate conditions on the mixing matrix A (e.g., RIP, nullspace), accurate recovery of x is possible using polynomial-complexity algorithms.

A common approach (LASSO) is to solve the convex problem

min_x ‖y − Ax‖₂² + α‖x‖₁

where α can be tuned in accordance with sparsity and SNR.
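As a concrete illustration, a minimal ISTA (iterative soft-thresholding) sketch in Python/NumPy for this problem; ISTA is just one standard LASSO solver, not the approach advocated later in these slides, and alpha plays the role of the tuning weight α.

```python
import numpy as np

def soft_threshold(v, tau):
    # Proximal operator of tau * ||.||_1: elementwise soft thresholding.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(A, y, alpha, n_iters=500):
    """Minimize ||y - A x||_2^2 + alpha * ||x||_1 by proximal-gradient (ISTA) steps."""
    L = 2.0 * np.linalg.norm(A, 2) ** 2      # Lipschitz constant of the quadratic term's gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = 2.0 * A.T @ (A @ x - y)       # gradient of ||y - A x||_2^2
        x = soft_threshold(x - grad / L, alpha / L)
    return x
```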


Page 3:

Phase Transition Curves (PTC)

The PTC identifies ratios (M/N, K/M) for which perfect noiseless recovery of K-sparse x occurs (as M, N, K → ∞ under i.i.d sub-Gaussian A).

Suppose {x_n} are drawn i.i.d. as

p_X(x_n) = λ f_X(x_n) + (1−λ) δ(x_n)

with known λ ≜ K/N.
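For concreteness, a small NumPy sketch of this setup: an i.i.d sparse signal drawn from the prior above (with f_X taken Gaussian here for illustration) and noiseless measurements through an i.i.d Gaussian A. The specific values of N, M, and λ are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, lam = 1000, 500, 0.10                  # signal length, measurements, sparsity rate lambda = K/N

support = rng.random(N) < lam                # Bernoulli support: P(x_n != 0) = lambda
x = np.where(support, rng.standard_normal(N), 0.0)    # active coefficients drawn from f_X (Gaussian here)
A = rng.standard_normal((M, N)) / np.sqrt(M)          # i.i.d Gaussian mixing matrix
y = A @ x                                    # noiseless sub-Nyquist measurements
```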

LASSO's PTC is invariant to f_X(·). Thus, LASSO is robust in the face of unknown f_X(·).

MMSE reconstruction's PTC is far better than LASSO's, but requires knowing the prior f_X(·).

[Figure: phase-transition curves, K/M (sparsity) versus M/N (undersampling), comparing MMSE reconstruction, theoretical LASSO, and empirical AMP.]

Wu and Verdu, “Optimal phase transitions in compressed sensing,” arXiv Nov. 2011.


Page 4:

Motivations

For practical compressive sensing. . .

want minimal MSE

– distributions are unknown ⇒ can't formulate the MMSE estimator
– but there is hope:

various algorithms have been seen to outperform LASSO for specific signal classes
– really, we want a universal algorithm: good for all signal classes

want fast runtime

– especially for large signal-length N (i.e., scalable).

want to avoid algorithmic tuning parameters,

– who wants to tweak an algorithm when you can ski in the Alps!


Page 5:

Proposed Approach: “EM-GM-AMP”

Model the signal and noise using flexible distributions:

– i.i.d Bernoulli Gaussian-mixture (GM) signal (see the code sketch after this list)

p(x_n) = λ Σ_{l=1}^{L} ω_l N(x_n; θ_l, φ_l) + (1−λ) δ(x_n),   ∀n

– i.i.d Gaussian noise with variance ψ (but easy to generalize)

Learn the prior parameters q ≜ {λ, ω_l, θ_l, φ_l, ψ}_{l=1}^{L}

– treat as deterministic and use expectation-maximization (EM)

Exploit the learned priors in near-MMSE signal reconstruction

– use the approximate message passing (AMP) algorithm
– AMP also provides EM with everything it needs to know
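For reference, a small NumPy sketch that evaluates this Bernoulli Gaussian-mixture prior for given (λ, ω, θ, φ); the point mass at zero carries probability 1 − λ and is tracked separately since it has no density. The helper name and example parameter values are illustrative.

```python
import numpy as np

def gm_active_pdf(x, omega, theta, phi):
    """Density of the active part: sum_l omega_l * N(x; theta_l, phi_l), with phi_l variances."""
    x = np.atleast_1d(x)[:, None]                                  # shape (num points, 1)
    comps = np.exp(-(x - theta) ** 2 / (2 * phi)) / np.sqrt(2 * np.pi * phi)
    return comps @ omega                                           # weighted sum over the L components

# Illustrative parameters with L = 3 components:
lam   = 0.10
omega = np.array([0.5, 0.3, 0.2])      # mixture weights (sum to 1)
theta = np.array([-1.0, 0.0, 2.0])     # component means
phi   = np.array([0.5, 1.0, 0.25])     # component variances

grid = np.linspace(-4, 4, 9)
density = lam * gm_active_pdf(grid, omega, theta, phi)   # continuous part of p(x_n)
point_mass_at_zero = 1.0 - lam                           # discrete part at x_n = 0
```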


Page 6:

Approximate Message Passing (AMP)

AMP methods infer x from y = Ax + w using loopy belief propagation with carefully constructed approximations.

The original AMP [Donoho, Maleki, Montanari '09] solves the LASSO problem assuming an i.i.d sub-Gaussian matrix A.

The Bayesian AMP [Donoho, Maleki, Montanari '10] framework tackles MMSE inference under any factorized signal prior, AWGN, and i.i.d A.

The generalized AMP [Rangan '10] framework tackles MAP or MMSE inference under any factorized signal prior & likelihood, and generic A.

AMP is a form of iterative thresholding, requiring only two applications of A per iteration and ≈ 25 iterations (see the sketch below). Very fast!

Rigorous large-system analyses (under i.i.d sub-Gaussian A) have established that (G)AMP follows a state-evolution trajectory with certain optimalities [Bayati, Montanari '10], [Rangan '10].
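A minimal NumPy sketch of the original soft-threshold AMP iteration just described: each pass uses one multiplication by A and one by Aᵀ plus an elementwise threshold, and the Onsager correction term is what distinguishes it from plain iterative thresholding. The threshold schedule below (a multiple of the residual standard deviation) is one common heuristic, not the slides' prescription.

```python
import numpy as np

def soft(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def amp_soft_threshold(A, y, alpha=1.5, n_iters=25):
    """Soft-threshold AMP for y = A x + w with an i.i.d (unit-column-norm) A."""
    M, N = A.shape
    x = np.zeros(N)
    z = y.copy()                                     # corrected residual
    for _ in range(n_iters):
        tau = alpha * np.sqrt(np.mean(z ** 2))       # heuristic threshold ~ residual std
        x_new = soft(x + A.T @ z, tau)               # threshold the pseudo-data
        onsager = (z / M) * np.count_nonzero(x_new)  # Onsager correction term
        z = y - A @ x_new + onsager                  # residual with Onsager correction
        x = x_new
    return x
```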


Page 7:

AMP Heuristics (Sum-Product)

[Factor graph: prior factors p_X(x_1), …, p_X(x_N) attach to variable nodes x_1, …, x_N, which connect to measurement factors N(y_1; [Ax]_1, ψ), …, N(y_M; [Ax]_M, ψ); messages such as p_{1→1}(x_1) and p_{M←N}(x_N) travel along the edges.]

1. Message from the y_i node to the x_j node:

p_{i→j}(x_j) ∝ ∫_{{x_r}_{r≠j}} N(y_i; Σ_r a_{ir} x_r, ψ) Π_{r≠j} p_{i←r}(x_r)

By the central limit theorem, Σ_r a_{ir} x_r is approximately Gaussian, so the message behaves like ∫_{z_i} N(y_i; z_i, ψ) N(z_i; ẑ_i(x_j), ν^z_i(x_j)), i.e., it is (approximately) Gaussian.

To compute ẑ_i(x_j) and ν^z_i(x_j), the means and variances of {p_{i←r}}_{r≠j} suffice; thus, Gaussian message passing!

Remaining problem: we have 2MN messages to compute (too many!).

2. Exploiting similarity among the messages {p_{i←j}}_{i=1}^{M}, AMP employs a Taylor-series approximation of their difference whose error vanishes as M → ∞ for dense A (and similarly for the {p_{i→j}} as N → ∞). Finally, only O(M+N) messages need to be computed!


Page 8:

The GAMP Algorithm

Require: matrix A, Bayes ∈ {0, 1}, initializations x⁰, ν⁰_x
t = 0, s⁻¹ = 0, ∀mn: S_mn = |A_mn|²
repeat
  ν^t_p = S ν^t_x,   p^t = A x^t − s^{t−1} .* ν^t_p    (gradient step)
  if Bayes then
    ∀m: z^t_m = E(Z|P; p^t_m, ν^t_{pm}),   ν^t_{zm} = var(Z|P; p^t_m, ν^t_{pm})
  else
    ∀m: z^t_m = prox_{ν^t_{pm} f_z}(p^t_m),   ν^t_{zm} = ν^t_{pm} · prox′_{ν^t_{pm} f_z}(p^t_m)
  end if
  ν^t_s = (1 − ν^t_z ./ ν^t_p) ./ ν^t_p,   s^t = (z^t − p^t) ./ ν^t_p    (dual update)
  ν^t_r = 1 ./ (Sᵀ ν^t_s),   r^t = x^t + ν^t_r .* (Aᵀ s^t)    (gradient step)
  if Bayes then
    ∀n: x^{t+1}_n = E(X|R; r^t_n, ν^t_{rn}),   ν^{t+1}_{xn} = var(X|R; r^t_n, ν^t_{rn})
  else
    ∀n: x^{t+1}_n = prox_{ν^t_{rn} f_x}(r^t_n),   ν^{t+1}_{xn} = ν^t_{rn} · prox′_{ν^t_{rn} f_x}(r^t_n)
  end if
  t ← t + 1
until terminated

Note connections to primal-dual, ADMM, split-Bregman, proximal FB splitting, DR...
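For concreteness, a minimal NumPy sketch of the Bayes (sum-product) branch of this recursion, specialized to a Bernoulli-Gaussian signal prior and an AWGN likelihood. The two scalar denoisers E(X|R; r, ν_r) and E(Z|P; p, ν_p) are the only problem-dependent pieces; the parameter values here are placeholders (in EM-GM-AMP they would be learned by EM), and the function names are ours.

```python
import numpy as np

def bg_input_denoiser(r, nu_r, lam, phi):
    """E(X|R) and var(X|R) for X ~ lam*N(0, phi) + (1-lam)*delta_0 and R = X + N(0, nu_r)."""
    act = lam * np.exp(-r**2 / (2 * (phi + nu_r))) / np.sqrt(phi + nu_r)   # active hypothesis
    off = (1 - lam) * np.exp(-r**2 / (2 * nu_r)) / np.sqrt(nu_r)           # zero hypothesis
    pi = act / (act + off)                       # posterior activity probability
    gamma = phi * r / (phi + nu_r)               # active-component posterior mean
    v = phi * nu_r / (phi + nu_r)                # active-component posterior variance
    xhat = pi * gamma
    nu_x = pi * (v + gamma**2) - xhat**2
    return xhat, nu_x

def awgn_output_denoiser(p, nu_p, y, psi):
    """E(Z|P) and var(Z|P) for Z ~ N(p, nu_p) observed as Y = Z + N(0, psi)."""
    zhat = p + nu_p / (nu_p + psi) * (y - p)
    nu_z = nu_p * psi / (nu_p + psi)
    return zhat, nu_z

def gamp_bg_awgn(A, y, lam=0.1, phi=1.0, psi=1e-2, n_iters=25):
    """Sum-product GAMP loop following the pseudocode above."""
    M, N = A.shape
    S = A ** 2
    x, nu_x, s = np.zeros(N), np.full(N, lam * phi), np.zeros(M)
    for _ in range(n_iters):
        nu_p = S @ nu_x
        p = A @ x - s * nu_p                          # output-side "gradient" step
        z, nu_z = awgn_output_denoiser(p, nu_p, y, psi)
        nu_s = (1 - nu_z / nu_p) / nu_p
        s = (z - p) / nu_p                            # dual update
        nu_r = 1.0 / (S.T @ nu_s)
        r = x + nu_r * (A.T @ s)                      # input-side "gradient" step
        x, nu_x = bg_input_denoiser(r, nu_r, lam, phi)
    return x
```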


Page 9:

Expectation-Maximization

We use expectation-maximization (EM) to learn the signal and noise prior parameters q ≜ {λ, ω, θ, φ, ψ}.

Alternate between

E: maximizing a lower bound on ln p(y; q) for fixed q

M: maximizing over q for a fixed lower-bounding function.

Lower bound via the posterior approximation Π_{n=1}^{N} p̂(x_n|y; q^i) ≈ p(x|y; q^i):

ln p(y; q) = Σ_{n=1}^{N} ∫_{x_n} p̂(x_n|y; q^i) ln p(x_n; q)   [lower bound]   +   D(p̂_{x|y} ‖ p_{x|y})   [≥ 0]

Incremental maximization: λ, θ_1, …, θ_L, φ_1, …, φ_L, ω, ψ

All quantities needed for the EM updates are provided by AMP!
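As one concrete instance of such an update (a sketch for the Bernoulli-Gaussian special case, using only quantities AMP already reports; the function names are ours): the sparsity estimate is refreshed with the average posterior activity probability, and the AWGN variance with the average posterior residual energy.

```python
import numpy as np

def em_update_lambda(pi_post):
    """M-step sketch for the sparsity rate lambda:
    average of the posterior probabilities pi_n = Pr{x_n != 0 | y} reported by AMP."""
    return float(np.mean(pi_post))

def em_update_psi(y, z_hat, nu_z):
    """M-step sketch for the AWGN variance psi, using AMP's posterior means/variances of z = Ax."""
    return float(np.mean((y - z_hat) ** 2 + nu_z))
```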


Page 10:

Parameter Initialization

Initialization matters; EM can get stuck in a local max. Our approach:

initialize the sparsity λ according to the theoretical LASSO PTC.

initialize the noise and active-signal variances using the known energies ‖y‖₂², ‖A‖²_F and SNR⁰ = 20 dB (or a user-supplied value):

ψ⁰ = ‖y‖₂² / ((SNR⁰ + 1) M),    (σ²)⁰ = (‖y‖₂² − M ψ⁰) / (λ⁰ ‖A‖²_F)

fix L = 3 and initialize the GM parameters (ω, θ, φ) as the best fit to a uniform distribution with variance σ².
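In code, the two variance initializations above read as follows (a minimal sketch; SNR⁰ is converted from dB to a linear scale, and lam0 denotes the λ⁰ obtained from the LASSO PTC):

```python
import numpy as np

def init_variances(y, A, lam0, snr0_dB=20.0):
    """Initialize the noise variance psi^0 and active-signal variance (sigma^2)^0."""
    M = y.size
    snr0 = 10.0 ** (snr0_dB / 10.0)                   # SNR^0 on a linear scale
    psi0 = np.sum(y ** 2) / ((snr0 + 1.0) * M)        # psi^0 = ||y||^2 / ((SNR^0 + 1) M)
    sigma2_0 = (np.sum(y ** 2) - M * psi0) / (lam0 * np.linalg.norm(A, 'fro') ** 2)
    return psi0, sigma2_0
```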

As other options, we provide

a “heavy tailed” mode that forces zero-mean GM components.

a model-order-selection procedure to learn L.


Page 11:

Examples of Learned Signal-pdfs

Example comparing EM-GM-AMP-learned priors to the actual priors used to generate the data realization.

The top corresponds to an i.i.d triangle-mixture signal and the bottom to an i.i.d Student-t signal.

[Figure: true and EM-GM-AMP-learned signal prior pdfs (and pmfs) for the triangle-mixture and Student-t examples.]


Page 12:

Empirical PTCs: Bernoulli-Gaussian signals

We now evaluate noiseless reconstruction performance via phase-transition curves constructed using N=1000-length i.i.d signals, i.i.d Gaussian A, and 100 realizations.

We see all variants of EM-GM-AMP performing significantly better than LASSO for i.i.d Bernoulli-Gaussian signals.

Perhaps not a fair comparison, because the true prior is matched to our model.

[Figure: empirical noiseless Bernoulli-Gaussian PTCs, K/M versus M/N, for EM-GM-AMP, EM-GM-AMP-3, EM-BG-AMP, genie GM-AMP, theoretical LASSO, and Laplacian-AMP.]


Page 13:

PTCs for Bernoulli and Bernoulli-Rademacher signals

[Figure: empirical noiseless Bernoulli PTCs, K/M versus M/N, for EM-GM-AMP, EM-GM-AMP-3, EM-BG-AMP, genie GM-AMP, theoretical LASSO, and Laplacian-AMP.]

[Figure: empirical noiseless Bernoulli-Rademacher PTCs for the same algorithms.]

For these signals, we see EM-GM-AMP performing...

significantly better than LASSO,

with model-order-selection enabled, as good as genie-aided GM-AMP

significantly better than EM-BG-AMP with the i.i.d BR signal.


Page 14:

Noisy Recovery: Bernoulli-Rademacher (±1) signals

We now compare the normalized MSE of EM-GM-AMP to several state-of-the-art algorithms (SL0, T-MSBL, BCS, Lasso via SPGL1) for the task of noisy i.i.d signal recovery under i.i.d Gaussian A.

For this, we fixed N=1000, K=100, SNR=25dB and varied M.
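Here the normalized MSE (NMSE) can be read as ‖x̂ − x‖₂² / ‖x‖₂², reported in dB as 10 log₁₀ of that ratio.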

For these i.i.d BR signals, we see EM-GM-AMP outperforming the other algorithms for all undersampling ratios M/N.

Notice that the EM-BG-AMP algorithm cannot accurately model the Bernoulli-Rademacher prior.

[Figure: noisy Bernoulli-Rademacher recovery NMSE [dB] versus M/N for EM-GM-AMP, EM-GM-AMP-3, EM-BG-AMP, BCS, T-MSBL, SL0, genie Lasso, OMP, and SP.]


Page 15:

Noisy Recovery: Bernoulli-Gaussian and Bernoulli signals

[Figure: noisy Bernoulli-Gaussian recovery NMSE [dB] versus M/N for EM-GM-AMP, EM-GM-AMP-L, EM-BG-AMP, BCS, T-MSBL, SL0, genie Lasso, OMP, and SP.]

[Figure: noisy Bernoulli recovery NMSE [dB] versus M/N for EM-GM-AMP, EM-GM-AMP-3, EM-BG-AMP, BCS, T-MSBL, SL0, genie Lasso, OMP, and SP.]

For i.i.d Bernoulli-Gaussian and i.i.d Bernoulli signals, EM-GM-AMP again dominates the other algorithms.

We attribute the excellent performance of EM-GM-AMP to its ability to learn and exploit the true signal prior.


Page 16:

Noisy Recovery of Heavy-Tailed signals

[Figure: noisy Student-t recovery NMSE [dB] versus M/N for EM-GM-AMP, EM-GM-AMP-4, EM-BG-AMP, BCS, T-MSBL, SL0, genie Lasso, OMP, and SP.]

[Figure: noisy log-normal recovery NMSE [dB] versus M/N for the same algorithms.]

In its "heavy tailed" mode, EM-GM-AMP again uniformly outperforms all other algorithms.

Rankings among the other algorithms differ across signal types. (Compare OMP and SL0 performances.)


Page 17:

Runtime versus signal-length N

We fix M/N=0.5, K/N=0.1, SNR=25dB, and average 50 trials.

[Figure: noisy Bernoulli-Rademacher recovery runtime [sec] versus N for EM-GM-AMP, EM-GM-AMP-3, EM-BG-AMP, BCS, T-MSBL, SL0, SPGL1, OMP, SP, and the fft-operator variants (EM-GM-AMP fft, EM-GM-AMP-3 fft, EM-BG-AMP fft, SPGL1 fft).]

[Figure: noisy Bernoulli-Rademacher recovery NMSE [dB] versus N for the same algorithms.]

For all N > 1000, EM-GM-AMP has the fastest runtime!

EM-GM-AMP can also leverage fast operators for A (e.g., FFT).
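To illustrate the fast-operator point, a minimal NumPy sketch (ours; GAMPmatlab's operator interface differs) of an A formed from randomly selected rows of an orthonormal DFT, exposed as forward/adjoint function handles so each application costs O(N log N) rather than O(MN):

```python
import numpy as np

def make_subsampled_fft_operator(N, M, rng):
    """A = R F, where F is the orthonormal DFT and R selects M of the N rows."""
    rows = rng.choice(N, size=M, replace=False)

    def A_mul(x):                                  # y = A x
        return np.fft.fft(x, norm="ortho")[rows]

    def A_adj(y):                                  # x = A^H y
        full = np.zeros(N, dtype=complex)
        full[rows] = y
        return np.fft.ifft(full, norm="ortho")

    return A_mul, A_adj

# Usage sketch: an AMP loop would call A_mul / A_adj instead of forming A explicitly.
rng = np.random.default_rng(0)
A_mul, A_adj = make_subsampled_fft_operator(N=4096, M=1024, rng=rng)
```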


Page 18:

Extension to structured sparsity (Justin Ziniel)

Recovery of an audio signal sparsified via DCT Ψ and compressively sampled via i.i.d Gaussian Φ (so that A = ΦΨ).

Exploit persistence of support across time via Markov chains and turbo AMP.

[Figure: turbo AMP factor graph. For each time frame t = 0, …, T, an AMP sub-graph (measurement factors g_m^(t), signal variables x_n^(t), prior factors f_n^(t)) is coupled to per-coefficient support variables s_n^(t) and amplitude variables θ_n^(t), which are linked across frames by Markov-chain factors h_n^(t) and d_n^(t).]

algorithm          | M/N = 1/5         | M/N = 1/3          | M/N = 1/2
EM-GM-AMP-3        | -9.04 dB, 8.77 s  | -12.72 dB, 10.26 s | -17.17 dB, 11.92 s
turbo EM-GM-AMP-3  | -12.34 dB, 9.37 s | -16.07 dB, 11.05 s | -20.94 dB, 12.96 s


Page 19:

Conclusions

We proposed a sparse-reconstruction algorithm that uses EM and AMP to learn and exploit the GM signal prior and AWGN variance.

Advantages of EM-GM-AMP, for signal length N ≳ 1000:

State-of-the-art NMSE performance with all tested i.i.d priors.
State-of-the-art complexity.
Minimal tuning: choose between "sparse" or "heavy-tailed" modes.

Ongoing related work:

Extensions beyond AWGN (e.g., phase retrieval, binary classification).
Universal learning/exploitation of structured sparsity.
Extensions to matrix completion, dictionary learning, robust PCA, NNMF.

EM-AMP Theory:

Asymptotic consistency under a matched prior with certain identifiability conditions: Kamilov, Rangan, Fletcher, Unser, "Approximate message passing with consistent parameter estimation and applications to sparse learning," NIPS 2012.


Page 20:

EM-GM-AMP is integrated into GAMPmatlab: http://sourceforge.net/projects/gampmatlab/

Interface and examples at http://ece.osu.edu/~vilaj/EMGMAMP/EMGMAMP.html

Thanks!
