Transcript of: Fast Robustness Quantification with Variational Bayes
Tamara Broderick
ITT Career Development Assistant Professor, MIT
With: Ryan Giordano, Rachael Meager, Jonathan Huggins, Michael I. Jordan
Robustness quantification

• Bayesian inference: complex, modular models; posterior distribution
• Have to express prior beliefs in a distribution: challenges; time-consuming; subjective; complex models

Bayes' Theorem:  p(θ|x) ∝_θ p(x|θ) p(θ)

• Robustness: global & local; rarely used; requires approximation, MCMC
• Our solution: linear response variational Bayes

1
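Bayes' Theorem as written here is directly computable: evaluate likelihood times prior on a grid and normalize. A toy beta-binomial sketch (our own numbers, not from the talk):

```python
import numpy as np
from math import comb

# Bayes' Theorem on a grid: p(theta|x) is proportional to p(x|theta) p(theta).
# Toy setup: Beta(2, 2) prior on a coin's heads-probability theta,
# with 7 heads observed in 10 flips.
theta = np.linspace(0.001, 0.999, 999)
d = theta[1] - theta[0]
prior = theta**(2 - 1) * (1 - theta)**(2 - 1)          # Beta(2,2), unnormalized
likelihood = comb(10, 7) * theta**7 * (1 - theta)**3   # Binomial(10, theta)

posterior = likelihood * prior
posterior /= posterior.sum() * d                       # normalize numerically

# Conjugacy says the posterior is Beta(2+7, 2+3) with mean 9/14.
post_mean = (theta * posterior).sum() * d
assert abs(post_mean - 9 / 14) < 1e-3
```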
Robustness quantification

• Variational Bayes as an alternative to MCMC
• Challenges of VB
• Accurate uncertainties from VB
• Accurate robustness quantification from VB
• Big idea: derivatives/perturbations are easy in VB

2
Variational Bayes

• Variational Bayes (VB): approximation q(θ) for the posterior p(θ|x)
• Minimize the Kullback–Leibler (KL) divergence:  q*(θ) := argmin_q KL(q ‖ p(·|x))
• VB practical success: point estimates and prediction; fast, streaming, distributed

[Broderick, Boyd, Wibisono, Wilson, Jordan 2013]

3
What about uncertainty?

• Variational Bayes:
  KL(q ‖ p(·|x)) = ∫ q(θ) log [ q(θ) / p(θ|x) ] dθ
• Mean-field variational Bayes (MFVB):
  q(θ) = ∏_{j=1}^J q(θ_j)
• Underestimates variance (sometimes severely)
• No covariance estimates

[Figure: contours of p(θ|x) and the MFVB optimum q*(θ) over (θ₁, θ₂)]

[MacKay 2003; Bishop 2006; Wang, Titterington 2004; Turner, Sahani 2011]
[Dunson 2014; Bardenet, Doucet, Holmes 2015]

4
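The variance underestimation has a classic closed-form illustration on a correlated Gaussian (as in Bishop 2006); the numbers below are our own sketch, not from the talk:

```python
import numpy as np

# Mean-field VB on a correlated bivariate Gaussian posterior. With posterior
# precision Lambda, the MFVB factor for theta_i is Gaussian with variance
# 1/Lambda_ii, while the true marginal variance is (Lambda^{-1})_ii, which is
# always at least 1/Lambda_ii: MFVB never overestimates the marginal variance.
rho = 0.9
Sigma = np.array([[1.0, rho], [rho, 1.0]])   # true posterior covariance
Lam = np.linalg.inv(Sigma)                   # posterior precision

mfvb_var = 1.0 / np.diag(Lam)                # MFVB marginal variances
true_var = np.diag(Sigma)                    # true marginal variances

assert np.all(mfvb_var <= true_var + 1e-12)
print(true_var[0] / mfvb_var[0])             # variance underestimated ~5x here
```

With correlation 0.9, the MFVB marginal variance is 1 − ρ² = 0.19 of the truth, illustrating the "sometimes severely" on the slide.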
Linear response

• Cumulant-generating function:
  C(t) := log E[e^{tᵀθ}],  mean = dC(t)/dt |_{t=0}
• True posterior covariance vs. MFVB covariance:
  Σ := d²C_{p(·|x)}(t)/(dtᵀ dt) |_{t=0},  V := d²C_{q*}(t)/(dtᵀ dt) |_{t=0}
• "Linear response":
  log p_t(θ) := log p(θ|x) + tᵀθ − C(t),  with MFVB approximation q*_t
• The LRVB approximation:
  Σ = d/dtᵀ E_{p_t}[θ] |_{t=0} ≈ d/dtᵀ E_{q*_t}[θ] |_{t=0} =: Σ̂

[Bishop 2006]

5
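As a sanity check on these definitions (our own numeric illustration, not from the talk): for a multivariate normal the cumulant-generating function is available in closed form, and its derivatives at t = 0 recover exactly the mean and covariance, as the slide states.

```python
import numpy as np

# For theta ~ N(mu, Sigma): C(t) = log E[exp(t^T theta)] = t^T mu + t^T Sigma t / 2.
# Gradient at t=0 is the mean; Hessian at t=0 is the covariance.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])

def C(t):
    return t @ mu + 0.5 * t @ Sigma @ t

# Central finite differences (exact for a quadratic, up to rounding).
eps = 1e-4
grad = np.array([(C(eps * e) - C(-eps * e)) / (2 * eps) for e in np.eye(2)])
hess = np.empty((2, 2))
for i, ei in enumerate(np.eye(2)):
    for j, ej in enumerate(np.eye(2)):
        hess[i, j] = (C(eps*ei + eps*ej) - C(eps*ei - eps*ej)
                      - C(-eps*ei + eps*ej) + C(-eps*ei - eps*ej)) / (4 * eps**2)

assert np.allclose(grad, mu, atol=1e-6)    # dC/dt at 0 = mean
assert np.allclose(hess, Sigma, atol=1e-4) # d^2C/dt dt^T at 0 = covariance
```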
LRVB estimator

• LRVB covariance estimate:
  Σ̂ := d/dtᵀ E_{q*_t}[θ] |_{t=0} = (I − V H)⁻¹ V
• Suppose an exponential family with mean parametrization (q_t with mean m_t):
  Σ̂ = ( ∂²KL/∂m ∂mᵀ |_{m=m*} )⁻¹
• Symmetric and positive definite at a local minimum of KL
• The LRVB assumption: E_{p_t}[θ] ≈ E_{q*_t}[θ]
• The LRVB estimate is exact when MFVB gives the exact mean (e.g., multivariate normal)

[Bishop 2006]

6
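The exactness claim on this slide can be checked numerically. Below is a minimal sketch (our own construction): the target posterior is multivariate normal, V is the diagonal MFVB covariance, and H is taken as the off-diagonal part of the log-posterior Hessian, a choice under which (I − VH)⁻¹V reproduces the exact covariance in this Gaussian case.

```python
import numpy as np

# LRVB on a multivariate normal "posterior" p(theta|x) = N(mu, Lambda^{-1}).
# MFVB yields marginal variances 1/Lambda_ii, so V = diag(Lambda)^{-1}.
# Taking H = off-diagonal part of the log-posterior Hessian (-Lambda),
# the estimator (I - V H)^{-1} V recovers the exact covariance Sigma.
rho = 0.9
Sigma = np.array([[1.0, rho], [rho, 1.0]])
Lam = np.linalg.inv(Sigma)

V = np.diag(1.0 / np.diag(Lam))     # MFVB covariance (diagonal, underestimates)
H = np.diag(np.diag(Lam)) - Lam     # off-diagonal interaction terms

Sigma_hat = np.linalg.solve(np.eye(2) - V @ H, V)
assert np.allclose(Sigma_hat, Sigma, atol=1e-10)
```

Algebraically, I − VH = diag(Λ)⁻¹Λ, so (I − VH)⁻¹V = Λ⁻¹, matching the slide's exactness statement for the multivariate normal.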
Microcredit Experiment

• Simplified from Meager (2015)
• K microcredit trials (Mexico, Mongolia, Bosnia, India, Morocco, Philippines, Ethiopia)
• N_k businesses in the kth site (~900 to ~17K)
• Profit y_kn of the nth business at the kth site (T_kn = 1 if microcredit):
  y_kn ~ indep. N(μ_k + T_kn τ_k, σ_k²)
• Priors and hyperpriors:
  (μ_k, τ_k)ᵀ ~ iid N((μ, τ)ᵀ, C)
  (μ, τ)ᵀ ~ N((μ₀, τ₀)ᵀ, Λ⁻¹)
  σ_k⁻² ~ iid Γ(a, b)
  C ~ Sep&LKJ(η, c, d)

7
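The generative model above can be simulated in a few lines. The sketch below uses placeholder hyperparameter values (ours, not Meager's) and a smaller N_k, and checks that per-site difference-in-means roughly recovers the site effects τ_k:

```python
import numpy as np

# Minimal simulator for the simplified hierarchical model on the slide.
# Hyperparameter values here are illustrative placeholders only.
rng = np.random.default_rng(0)
K, N_k = 7, 1000
mu, tau = 0.0, 3.0                      # global mean profit and treatment effect
C = np.array([[1.0, 0.2], [0.2, 1.0]])  # covariance of site-level (mu_k, tau_k)
a, b = 2.0, 2.0                         # Gamma(a, b) prior on 1/sigma_k^2

site = rng.multivariate_normal([mu, tau], C, size=K)   # (mu_k, tau_k) per site
sigma2 = 1.0 / rng.gamma(a, 1.0 / b, size=K)           # sigma_k^2
T = rng.integers(0, 2, size=(K, N_k))                  # 1 if microcredit

# y_kn = mu_k + T_kn * tau_k + N(0, sigma_k^2) noise
y = site[:, [0]] + T * site[:, [1]] + rng.normal(
    0.0, np.sqrt(sigma2)[:, None], size=(K, N_k))

# Per-site difference in means roughly recovers each tau_k.
est_tau = np.array([y[k, T[k] == 1].mean() - y[k, T[k] == 0].mean()
                    for k in range(K)])
assert np.corrcoef(est_tau, site[:, 1])[0, 1] > 0.5
```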
Microcredit Experiment

[Figure: posterior summaries under MFVB and LRVB/MFVB]

• One set of 2500 MCMC draws: 45 minutes
• All of MFVB optimization, LRVB uncertainties, all sensitivity measures: 58 seconds!
• Many other models and data sets: mixture models, generalized linear mixed models, etc.

8
Robustness quantification

• Variational Bayes as an alternative to MCMC
• Challenges of VB
• Accurate uncertainties from VB
• Accurate robustness quantification from VB
• Big idea: derivatives/perturbations are easy in VB

9
Robustness quantification

• Bayes' Theorem, with prior hyperparameter α:
  p_α(θ) := p(θ|x, α) ∝_θ p(x|θ) p(θ|α)
• Sensitivity:
  S := dE_{p_α}[g(θ)]/dα |_α · Δα  ≈  dE_{q*_α}[g(θ)]/dα |_α · Δα =: Ŝ  (the LRVB estimator)
• When q*_α is in an exponential family:
  Ŝ = A ( ∂²KL/∂m ∂mᵀ |_{m=m*} )⁻¹ B

10
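The sensitivity S is just a derivative of a posterior expectation with respect to a prior hyperparameter; in a conjugate model it can be checked against a closed form. A toy sketch (our own model and numbers, not the microcredit model):

```python
import numpy as np

# Prior sensitivity in closed form: theta ~ N(alpha, s0^2), x_i | theta ~ N(theta, s^2).
# The posterior mean is a precision-weighted average, so its derivative with
# respect to the prior location alpha is w0 = (1/s0^2) / (1/s0^2 + n/s^2);
# a finite-difference perturbation of the prior matches this analytic sensitivity.
s0, s, n, xbar = 2.0, 1.0, 10, 0.7

def post_mean(alpha):
    prec0, prec_lik = 1 / s0**2, n / s**2
    return (prec0 * alpha + prec_lik * xbar) / (prec0 + prec_lik)

alpha0, d_alpha = 0.0, 1e-6
S_fd = (post_mean(alpha0 + d_alpha) - post_mean(alpha0 - d_alpha)) / (2 * d_alpha)
S_analytic = (1 / s0**2) / (1 / s0**2 + n / s**2)

assert abs(S_fd - S_analytic) < 1e-8
print(S_analytic)   # small: with n = 10 data points, the prior barely matters
```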
Microcredit Experiment (model recap)

• y_kn ~ indep. N(μ_k + T_kn τ_k, σ_k²), with T_kn = 1 if microcredit
• (μ_k, τ_k)ᵀ ~ iid N((μ, τ)ᵀ, C);  (μ, τ)ᵀ ~ N((μ₀, τ₀)ᵀ, Λ⁻¹);  σ_k⁻² ~ iid Γ(a, b);  C ~ Sep&LKJ(η, c, d)

11
Microcredit Experiment

[Figure: sensitivity of posterior summaries, MFVB and LRVB]

• Perturb Λ₁₁: 0.03 → 0.04

12
Microcredit Experiment

• Sensitivity of the expected microcredit effect (τ)
• Normalized to be on the scale of standard deviations in τ
• E.g., StdDev_q τ = 1.8 and E_q τ = 3.7 = 2.06 × StdDev_q τ
• After perturbing Λ₁₂ by +0.03: E_q τ < 1.0 × StdDev_q τ

13
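The normalization on this slide is simple arithmetic; for the record:

```python
# Reproducing the normalization on the slide: the expected effect E_q[tau]
# is reported in units of its posterior standard deviation, so
# E_q[tau] = 3.7 with StdDev_q[tau] = 1.8 is about 2.06 standard deviations.
E_tau, sd_tau = 3.7, 1.8
ratio = E_tau / sd_tau
print(round(ratio, 2))   # 2.06
```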
Conclusion

• We provide linear response variational Bayes (LRVB): supplements MFVB for fast & accurate covariance estimates
• More from LRVB: fast & accurate robustness quantification
• Interested in your data and models: sensitivity to prior perturbations; sensitivity to data perturbations

14
References

T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson, and M. I. Jordan. Streaming variational Bayes. NIPS, 2013.

R. Giordano, T. Broderick, and M. I. Jordan. Linear response methods for accurate covariance estimates from mean field variational Bayes. NIPS, 2015.

R. Giordano, T. Broderick, and M. I. Jordan. Robust inference with variational Bayes. NIPS AABI Workshop, 2015. arXiv:1512.02578.

https://github.com/rgiordan/MicrocreditLRVB

J. Huggins, T. Campbell, and T. Broderick. Coresets for scalable Bayesian logistic regression. Under review. arXiv:1605.06423.

15
References

R. Bardenet, A. Doucet, and C. Holmes. On Markov chain Monte Carlo methods for tall data. arXiv, 2015.

C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

D. Dunson. Robust and scalable approach to Bayesian inference. Talk at ISBA 2014.

D. J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.

R. Meager. Understanding the impact of microcredit expansions: A Bayesian hierarchical analysis of 7 randomised experiments. arXiv:1506.06669, 2015.

R. E. Turner and M. Sahani. Two problems with variational expectation maximisation for time-series models. In D. Barber, A. T. Cemgil, and S. Chiappa, editors, Bayesian Time Series Models, 2011.

B. Wang and M. Titterington. Inadequacy of interval estimates corresponding to variational Bayesian approximations. In AISTATS, 2004.

16