Bayesian Model Choice and Information Criteria in Sparse Generalized Linear Models
Mathias Drton
Department of Statistics, University of Chicago
(Paper with this title: Rina Foygel & M.D., arXiv:1112.5635)
Outline
1 BIC and extensions
2 Asymptotics for marginal likelihood of GLMs
3 Consistency for GLMs
4 Ising models
Bayesian information criterion (BIC)
Sample $Y_1, \dots, Y_n$

Parametric model $M$ with maximized log-likelihood $\hat\ell(M)$

Bayesian information criterion (Schwarz, 1978):

$$\mathrm{BIC}(M) := \hat\ell(M) - \frac{\dim(M)}{2}\,\log n$$

‘Generic’ model selection approach: maximize $\mathrm{BIC}(M)$ over the set of considered models.
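To make the generic approach concrete, here is a minimal sketch in Python (my own toy example, not from the talk) that scores polynomial regression models of increasing degree by BIC and keeps the maximizer:

```python
import numpy as np

# Toy illustration of 'maximize BIC(M) over the set of considered models':
# candidate models are polynomial regressions of degree 0..5; the true
# model has degree 1. All constants here are my own choices.

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1, 1, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

def bic(degree):
    X = np.vander(x, degree + 1)                     # design matrix incl. intercept
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2 = np.mean(resid ** 2)                     # Gaussian MLE of the variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    dim = degree + 2                                 # coefficients plus variance
    return loglik - 0.5 * dim * np.log(n)

print(max(range(6), key=bic))                        # typically selects degree 1
```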
Motivation: 1) Bayesian model choice
Posterior model probability in a fully Bayesian treatment:

$$P(M \mid Y_1, \dots, Y_n) \;\propto\; \underbrace{P(M)}_{\text{prior}} \; P(Y_1, \dots, Y_n \mid M).$$

Marginal likelihood:

$$L_n(M) := P(Y_1, \dots, Y_n \mid M) = \int \underbrace{P(Y_1, \dots, Y_n \mid \theta_M, M)}_{\text{likelihood fct.}} \; \underbrace{dP(\theta_M \mid M)}_{\text{prior}}$$
Motivation: 2) Asymptotics
$Y_1, \dots, Y_n$ i.i.d. sample from $P_0 \in M$

Theorem (Schwarz, 1978; Haughton, 1988; and others)
Assume $P(\theta_M \mid M)$ is a ‘nice’ prior on $\mathbb{R}^d$. Then in ‘nice’ models,

$$\log L_n(M) = \hat\ell_n(M) - \frac{d}{2}\log n + O_p(1),$$

and a better (Laplace) approximation is possible:

$$\log L_n(M) = \hat\ell_n(M) - \frac{d}{2}\log\Big(\frac{n}{2\pi}\Big) + \log P(\hat\theta_M \mid M) - \frac{1}{2}\log\det\Big[\frac{1}{n}\,\mathrm{Hessian}(\hat\theta_M)\Big] + O_p\big(n^{-1/2}\big)$$
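A quick numeric check of both displays (my own toy example, not from the talk): for $Y_i \sim N(\theta, 1)$ with prior $\theta \sim N(0, \tau^2)$, the marginal likelihood has a closed form, $d = 1$, and the scaled observed information equals 1, so the BIC error should stay $O_p(1)$ while the Laplace error shrinks:

```python
import numpy as np

rng = np.random.default_rng(1)
tau2 = 4.0  # prior variance, an arbitrary choice

def exact_log_marginal(y):
    # Closed-form log L_n for Y_i ~ N(theta, 1), theta ~ N(0, tau2).
    n, S = len(y), y.sum()
    A = n + 1.0 / tau2                               # posterior precision
    return (-0.5 * n * np.log(2 * np.pi) - 0.5 * np.log(tau2 * A)
            - 0.5 * np.sum(y ** 2) + S ** 2 / (2 * A))

for n in [50, 500, 5000]:
    y = rng.normal(0.7, 1.0, size=n)
    theta_hat = y.mean()                             # MLE
    loglik = -0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum((y - theta_hat) ** 2)
    bic = loglik - 0.5 * np.log(n)                   # d = 1
    log_prior = -0.5 * np.log(2 * np.pi * tau2) - theta_hat ** 2 / (2 * tau2)
    laplace = loglik - 0.5 * np.log(n / (2 * np.pi)) + log_prior
    # BIC error stays O_p(1); the Laplace error shrinks with n
    print(n, exact_log_marginal(y) - bic, exact_log_marginal(y) - laplace)
```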
Consistency
Theorem
Fix a finite set of ‘nice’ models. Then BIC selects a true model of smallest dimension with probability tending to one as $n \to \infty$.

Proof.
Finite set of models $\Longrightarrow$ pairwise comparisons suffice.

If $P_0 \in M_1 \subsetneq M_2$ and $d_1 < d_2$, then

$$\hat\ell_n(M_2) - \hat\ell_n(M_1) = O_p(1), \quad\text{and}\quad (d_2 - d_1)\log n \to \infty.$$

If $P_0 \in M_1 \setminus \mathrm{clos}(M_2)$, then with probability tending to one,

$$\frac{1}{n}\big[\hat\ell_n(M_1) - \hat\ell_n(M_2)\big] > \varepsilon > 0, \quad\text{and}\quad \log(n)/n \to 0.$$
Linear regression (covariates i.i.d. N(0, 1), φ1 = 1, σ = 2)
[Figure]
BIC in higher-dimensional linear regression
Exhaustive search up to 6 covariates
[Figure: probability of correct selection vs. p (10 to 80); n = p, σ = 1, k = 2, φ1 = φ2 = 1]
Higher-dimensional linear regression . . . too large models
[Figure: σ = 1, k = 2, φ1 = φ2 = 1]
Broman & Speed (2002)
Informative prior on models in higher dim. regression
[Figure: σ = 1, k = 2, φ1 = φ2 = 1]
Informative prior on models in higher dim. regression
Exhaustive search up to 6 covariates
[Figure: probability of correct selection vs. p (10 to 80) for BIC and EBIC; n = p, σ = 1, k = 2, φ1 = φ2 = 1]
Extended Bayesian information criterion
Linear regression
Models given by subsets of covariates $J \subset [p] := \{1, \dots, p\}$

Prior on models

$$P(J) = \frac{1}{p+1} \cdot \binom{p}{|J|}^{-1},$$

under which $k = \#\text{covariates}$ is uniform on $\{0, 1, \dots, p\}$ and $(J \mid k)$ is uniformly distributed.

Extended BIC defined as

$$\mathrm{EBIC}(J) = \mathrm{BIC}(J) - |J| \log p;$$

we have $|J| \ll p$ in mind.

Bogdan et al. (2004), Chen & Chen (2008), Scott & Berger (2010), ...
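A sketch of the exhaustive search behind the preceding plots (my own simplified version: $n = p = 60$ and models up to three covariates for speed, rather than six). BIC is recovered as the $\gamma = 0$ case of the $\mathrm{EBIC}_\gamma$ defined later:

```python
import numpy as np
from itertools import combinations

def loglik(y, X, J):
    # Maximized Gaussian log-likelihood of the regression of y on columns J.
    n = len(y)
    if J:
        XJ = X[:, list(J)]
        resid = y - XJ @ np.linalg.lstsq(XJ, y, rcond=None)[0]
    else:
        resid = y
    sigma2 = np.mean(resid ** 2)
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

def ebic(y, X, J, gamma):
    n, p = X.shape
    return loglik(y, X, J) - 0.5 * len(J) * np.log(n) - gamma * len(J) * np.log(p)

rng = np.random.default_rng(2)
n = p = 60
X = rng.normal(size=(n, p))
y = X[:, 0] + X[:, 1] + rng.normal(size=n)           # k = 2, phi_1 = phi_2 = 1
models = [J for k in range(4) for J in combinations(range(p), k)]
# ~36k candidate models; takes a few seconds
print("BIC: ", max(models, key=lambda J: ebic(y, X, J, gamma=0.0)))
print("EBIC:", max(models, key=lambda J: ebic(y, X, J, gamma=1.0)))
```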
Theory = consistency for EBIC
Chen & Chen '08: high-dimensional sparse linear regression (fixed design, # active covariates bounded).
Chen & Chen '11: generalized linear models (fixed design, canonical link).
Chen et al. '11: generalizations for fixed-design regression.
Gao et al. '10; Foygel & D. '10: Gaussian graphical models (adjust penalty for number of graphs).
Questions
Bayesian connection under high-dimensional asymptotics:
- Is the Laplace approximation to the marginal likelihood accurate uniformly over a growing number of models?
- Does EBIC capture the growth of the marginal likelihood?

Consistency for random designs?

Consistency for pseudo-likelihood approaches to graphical model selection?

Consistency of fully Bayesian model choice as corollaries? Shang & Clayton (2011)
Generalized linear model: Setup
Independent (response) observations Y1, . . . ,Yn
Distribution of $Y_i \sim p_{\theta_i}$ from a univariate exponential family:

$$p_\theta(y) \propto \exp\{y \cdot \theta - b(\theta)\}, \qquad \theta \in \Theta = \mathbb{R}.$$

Linearity:

$$\theta = (\theta_1, \dots, \theta_n)^T = X\phi, \qquad \phi \in \mathbb{R}^p,$$

for design matrix $X = (X_{ij}) \in \mathbb{R}^{n \times p}$ (rows = experiments, columns = covariates).

Random design with rows $X_{1\bullet}, \dots, X_{n\bullet}$ i.i.d.

Variable selection: find the support $J^* \subset [p]$ of the true parameter $\phi^*$.
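For concreteness, a short simulation sketch (constants are my own choices) of such a sparse GLM in the logistic case, where $b(\theta) = \log(1 + e^\theta)$:

```python
import numpy as np

# Sparse logistic GLM with random design: true support J* = {0, 1, 2}.
rng = np.random.default_rng(3)
n, p = 500, 100
phi_star = np.zeros(p)
phi_star[:3] = 1.0                        # ||phi*||_2 bounded, cf. assumption (B4)
X = rng.uniform(-1, 1, size=(n, p))       # bounded i.i.d. rows, cf. assumption (A)
theta = X @ phi_star                      # natural parameters theta = X phi
Y = rng.binomial(1, 1 / (1 + np.exp(-theta)))
```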
Assumptions
(A) Bounded covariates (or a moment condition).

(B1) Subexponential growth of dimension: $\log(p_n) = o(n)$.

(B2) Dimension of the smallest true model bounded by a fixed $q \in \mathbb{N}$.

(B3) Small sets of covariates have second-moment matrices with minimal eigenvalue bounded away from zero:

$$\lambda_{\min}\big(E[X_{1J} X_{1J}^T]\big) > a > 0 \quad \text{for all } |J| \le 2q.$$

(B4) Norm of the signal $\|\phi^*\|_2$ bounded.
Theorem (Laplace approximation)
Assume (A), (B1)-(B4) and ‘nice priors’ $(f_J : J \subset [p], |J| \le q)$. Then there is a constant $C$ such that the marginal likelihood sequence $L_n(J)$ satisfies

$$\log L_n(J) = \ell_n(\hat\phi_J) - \frac{|J|}{2}\log(n) + \log f_J(\hat\phi_J) + \frac{|J|}{2}\log(2\pi) - \frac{1}{2}\log\det\Big(\frac{1}{n}\,\mathrm{Hessian}_J(\hat\phi_J)\Big) \pm C\sqrt{\frac{\log(np)}{n}} \quad \text{for all } |J| \le q,$$

with probability tending to 1 as $n \to \infty$.
EBIC approximation
EBIC (with parameter $\gamma \ge 0$):

$$\mathrm{EBIC}_\gamma(J) = \ell_n(\hat\phi_J) - \frac{|J|}{2}\log(n) - \gamma\,|J| \cdot \log(p).$$

Corollary
Assume (A), (B1)-(B4) and ‘nice priors’ $(f_J : J \subset [p], |J| \le q)$. Adopt the unnormalized model prior

$$P_\gamma(J) = \binom{p}{|J|}^{-\gamma} \cdot \mathbf{1}\{|J| \le q\}.$$

Then there is a constant $C'$ such that with probability tending to 1 as $n \to \infty$, we have

$$\Big|\log\big[P_\gamma(J, Y)\big] - \mathrm{EBIC}_\gamma(J)\Big| \le C' \quad \text{for all } |J| \le q.$$
Laplace approximation to marginal likelihood
$$\int_{\mathbb{R}^J} \exp\big(\ell_n(\hat\phi_J + \gamma)\big) \cdot f_J(\hat\phi_J + \gamma)\, d\gamma$$

Taylor series:

$$\ell_n(\hat\phi_J + \gamma) = \ell_n(\hat\phi_J) - \frac{1}{2}\,\gamma^\top\, \mathrm{Hessian}_J(\hat\phi_J + t_\gamma \cdot \gamma)\, \gamma$$

Approximation by a Gaussian integral:

$$f_J(\hat\phi_J) \cdot \int_{\mathbb{R}^J} \exp\big(\ell_n(\hat\phi_J)\big) \cdot \exp\Big({-\frac{1}{2}\,\gamma^\top\, \mathrm{Hessian}_J(\hat\phi_J)\, \gamma}\Big)\, d\gamma = f_J(\hat\phi_J) \cdot \exp\big(\ell_n(\hat\phi_J)\big) \cdot \sqrt{\Big(\frac{2\pi}{n}\Big)^{|J|} \det\Big(\frac{1}{n}\,\mathrm{Hessian}_J(\hat\phi_J)\Big)^{-1}}$$
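A one-dimensional numeric check of this Gaussian-integral approximation (my own example: logistic regression with a single covariate and an assumed $N(0, 4)$ prior), comparing the Laplace formula against quadrature:

```python
import numpy as np
from scipy import integrate, optimize

rng = np.random.default_rng(4)
n = 200
x = rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-0.8 * x)))

def neg_loglik(phi):                       # -ell_n(phi) for logistic regression
    theta = phi * x
    return -np.sum(y * theta - np.log1p(np.exp(theta)))

def prior(phi):                            # f_J, assumed N(0, 4)
    return np.exp(-phi ** 2 / 8.0) / np.sqrt(8.0 * np.pi)

phi_hat = optimize.minimize_scalar(neg_loglik).x        # MLE
pr = 1 / (1 + np.exp(-phi_hat * x))
hess = np.sum(x ** 2 * pr * (1 - pr))                   # observed information

# Laplace: f_J(phi_hat) * exp(ell_n(phi_hat)) * sqrt(2*pi / Hessian), |J| = 1
log_laplace = (np.log(prior(phi_hat)) - neg_loglik(phi_hat)
               + 0.5 * np.log(2 * np.pi / hess))

# Quadrature of exp(ell_n(phi_hat + g)) * f_J(phi_hat + g), stabilized at the MLE
val, _ = integrate.quad(lambda g: np.exp(neg_loglik(phi_hat) - neg_loglik(phi_hat + g))
                        * prior(phi_hat + g), -10, 10)
log_exact = np.log(val) - neg_loglik(phi_hat)
print(log_laplace, log_exact)               # the two should nearly agree
```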
Laplace approximation to marginal likelihood
$$\int_{\mathbb{R}^J} \exp\Big(\underbrace{\ell_n(\hat\phi_J) - \tfrac{1}{2}\,\gamma^\top\, \mathrm{Hessian}_J(\hat\phi_J + t_\gamma \cdot \gamma)\,\gamma}_{\approx\; \ell_n(\hat\phi_J) - \frac{1}{2}\gamma^\top \mathrm{Hessian}_J(\hat\phi_J)\,\gamma}\Big)\, d\gamma$$

[Figure: the log-likelihood $\ell_n$ peaks at $\gamma = 0$, i.e., $\phi = \hat\phi_J$; the regions $\|\gamma\|_2 \le \sqrt{\log(p)/n}$ and $\|\gamma\|_2 \le 1$ are marked.]
Assumptions on priors
Family of priors $(f_J : J \subset [p], |J| \le q)$ is ‘nice’ if for constants $0 < F_1, F_2, F_3 < \infty$ we have, uniformly for all $|J| \le q$:

(i) an upper bound: $\sup_{\phi_J} f_J(\phi_J) \le F_1 < \infty$,

(ii) a lower bound over a compact set: $\inf_{\|\phi_J\|_2 \le R+1} f_J(\phi_J) \ge F_2 > 0$, where $R$ is a function of the constants in (A) & (B1)-(B4),

(iii) a Lipschitz property on the same compact set: $\sup_{\|\phi_J\|_2 \le R+1} \|\nabla f_J(\phi_J)\|_2 \le F_3 < \infty$.
(B5) Small true coefficients don’t decay too fast:

$$\sqrt{\frac{\log(n p_n)}{n}} = o\Big(\min\big\{|\phi_j^*| : j \in J^*\big\}\Big).$$

Theorem (EBIC consistency in GLM)
Assume (A), (B1)-(B5). Let

$$\kappa = \lim_{n\to\infty} \frac{\log p_n}{\log n} \in [0, \infty],$$

and take $\gamma > 1 - \frac{1}{2\kappa}$. Then with probability tending to 1 as $n \to \infty$, we have

$$\mathrm{EBIC}_\gamma(J^*) - \max_{J \ne J^*,\, |J| \le q} \mathrm{EBIC}_\gamma(J) \;\ge\; \log(p) \cdot C_{\mathrm{high}} + \log(n) \cdot C_{\mathrm{low}}$$

for constants $C_{\mathrm{high}}, C_{\mathrm{low}} > 0$.
EBIC approximates Bayesian model choice
Corollary (Consistency of Bayesian model choice)
Assume (A), (B1)-(B5) and ‘nice priors’. Then with probability tending to 1 as $n \to \infty$, we have

$$P_\gamma(J^* \mid Y) > \max_{J \ne J^*,\, |J| \le q} P_\gamma(J \mid Y).$$
Experiment for sparse logistic regression (with lasso)
Spambase data from the UCI Machine Learning Repository
$n_0 = 4601$ emails, $p_0 = 57$ covariates
Downsample to $n < n_0$ experiments.
Create $p - p_0$ noise covariates by random permutation.
Total number of covariates $p$ satisfies $p/n = p_0/25 \approx 2.28$.
Select a model from the lasso path using EBIC, cross-validation and stability selection (Meinshausen & Bühlmann, 2010).
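A rough sketch of this experiment (my own reconstruction on synthetic stand-in data, not the authors' code; the penalty grid and $\gamma = 0.5$ are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ebic_logistic(X, y, support, gamma):
    # EBIC_gamma of the (approximately) unpenalized refit on the support.
    n, p = X.shape
    k = len(support)
    if k == 0:
        q = y.mean()
        loglik = n * (q * np.log(q) + (1 - q) * np.log(1 - q))
    else:
        refit = LogisticRegression(C=1e6, max_iter=2000).fit(X[:, support], y)
        prob = np.clip(refit.predict_proba(X[:, support])[:, 1], 1e-12, 1 - 1e-12)
        loglik = np.sum(y * np.log(prob) + (1 - y) * np.log(1 - prob))
    return loglik - 0.5 * k * np.log(n) - gamma * k * np.log(p)

rng = np.random.default_rng(5)
n, p0 = 300, 57
X0 = rng.normal(size=(n, p0))                   # stand-in for the Spambase features
y = rng.binomial(1, 1 / (1 + np.exp(-X0[:, :3].sum(axis=1))))
noise = np.column_stack([rng.permutation(X0[:, j])
                         for j in range(p0) for _ in range(11)])
X = np.column_stack([X0, noise])                # p = 12 * p0 = 684, p / n = 2.28

path = [np.flatnonzero(LogisticRegression(penalty="l1", solver="liblinear",
                                          C=C).fit(X, y).coef_[0])
        for C in np.logspace(-2.5, 0, 20)]      # crude stand-in for a lasso path
best = max(path, key=lambda S: ebic_logistic(X, y, S, gamma=0.5))
print("EBIC-selected covariates:", best)
```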
Positive selection and false discovery rate
[Two figures: positive selection rate (PSR, 0% to 50%) and false discovery rate (FDR, 0% to 80%) vs. number of samples (100 to 600), for BIC_0, BIC_0.25, BIC_0.5, BIC_1, cross-validation, and stability selection.]
Comparison to full data
[Figure: p-value of feature in the full regression (sample size 4601) vs. smoothed probability of selection (subsample size 600), for BIC_0, BIC_0.25, BIC_0.5, BIC_1, cross-validation, and stability selection.]
Figure: Smoothed probability of selecting a true feature, as a function of thep-value of that feature in the full regression.
Ising model
Observe i.i.d. X (1), . . . ,X (n) ∈ {0, 1}p
Likelihood function:

$$\frac{1}{Z(\Theta)} \cdot \exp\Big\{\sum_j \Theta_{j0}\, x_j + \sum_{j<k} \Theta_{jk}\, x_j x_k\Big\}$$

with normalizing constant $Z(\Theta)$ and (sparse) potential matrix $\Theta$.

Full conditional for $X_j$ is proportional to

$$\exp\Big\{x_j \cdot \Big(\Theta_{j0} + \sum_{k \ne j} \Theta_{jk}\, x_k\Big)\Big\}$$

Model selection problem: find the support $E^*$ (the ‘graph’) of the true potential matrix $\Theta^*$.
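The full conditional makes Gibbs sampling straightforward; a small sketch (my own, storing the fields $\Theta_{j0}$ on the diagonal of $\Theta$):

```python
import numpy as np

def gibbs_sample(Theta, n_samples, burn=500, rng=None):
    # Theta: symmetric (p, p) potential matrix, Theta[j, j] holding Theta_j0.
    if rng is None:
        rng = np.random.default_rng()
    p = Theta.shape[0]
    x = rng.integers(0, 2, size=p)
    out = np.empty((n_samples, p), dtype=int)
    for t in range(burn + n_samples):
        for j in range(p):
            # P(X_j = 1 | rest) is logistic in Theta_j0 + sum_{k != j} Theta_jk x_k
            eta = Theta[j, j] + Theta[j] @ x - Theta[j, j] * x[j]
            x[j] = rng.random() < 1.0 / (1.0 + np.exp(-eta))
        if t >= burn:
            out[t - burn] = x
    return out

p = 5
Theta = np.zeros((p, p))
for j in range(p - 1):                      # chain graph, edge weights 1
    Theta[j, j + 1] = Theta[j + 1, j] = 1.0
X = gibbs_sample(Theta, n_samples=1000)
print(np.corrcoef(X.T).round(2))            # neighboring nodes should correlate
```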
Neighborhood selection for sparse Ising models
For each $X_j$, select its neighborhood via the lasso:

$$\hat\Theta^{(\lambda)}_{j\bullet} = \arg\max\Big[\ell_{X_j \mid X_{-j}}\big(\Theta_{j\bullet}\big) - \lambda \cdot \sum_{k \ne j} |\Theta_{jk}|\Big]$$

(Meinshausen & Bühlmann, 2006; Ravikumar et al., 2010)

How to choose $\lambda$, i.e., the neighborhoods from each path?

Cross-validation tends to select too large neighborhoods.

Apply EBIC (a sketch follows below):
- Let $E_{j,\lambda}$ be the edges incident to $j$ in the support of $\hat\Theta^{(\lambda)}_{j\bullet}$.
- Maximize

$$\ell_{X_j \mid X_{-j}}\big(\hat\Theta^{(\lambda)}_{j\bullet}\big) - \frac{|E_{j,\lambda}|}{2}\log(n) - |E_{j,\lambda}| \cdot \gamma \log(p)$$

with respect to $\lambda$.
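The sketch announced above (my own code, not the authors'): it scores the penalized fit itself along a grid of penalty strengths, as in the display.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_neighborhood(X, j, gamma=0.25, Cs=np.logspace(-2, 1, 30)):
    # l1-penalized logistic regression of X_j on X_{-j}; pick lambda by EBIC.
    n, p = X.shape
    others = np.delete(np.arange(p), j)
    y = X[:, j]
    best_score, best_nbhd = -np.inf, np.array([], dtype=int)
    for C in Cs:                                   # larger C means weaker penalty
        fit = LogisticRegression(penalty="l1", solver="liblinear",
                                 C=C).fit(X[:, others], y)
        nbhd = others[np.abs(fit.coef_[0]) > 1e-8]
        prob = np.clip(fit.predict_proba(X[:, others])[:, 1], 1e-12, 1 - 1e-12)
        loglik = np.sum(y * np.log(prob) + (1 - y) * np.log(1 - prob))
        score = loglik - 0.5 * len(nbhd) * np.log(n) - gamma * len(nbhd) * np.log(p)
        if score > best_score:
            best_score, best_nbhd = score, nbhd
    return best_nbhd

# e.g., on the Gibbs samples X from the previous sketch:
# for j in range(X.shape[1]): print(j, select_neighborhood(X, j))
```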
Consistency of EBIC for Ising model selection

Theorem
Consider subexponential growth of $p = p_n$ with

$$\kappa = \lim_{n\to\infty} \frac{\log p_n}{\log n} \in [0, \infty].$$

Assume
- all neighborhood sizes are bounded by a constant,
- $\sqrt{\log(np)/n} \ll |\Theta^*_{jk}| \le$ a constant, for all edges $(j, k)$.

Take $\gamma > 2 - \frac{1}{2\kappa}$. Then with probability tending to 1 as $n \to \infty$:
$\mathrm{EBIC}_\gamma$ selects the right neighborhood for every $X_j$.

Follows from consistency of EBIC for GLMs with random covariates.
Precipitation data (U.S. Historical Climatology Network)
89 weather stations
measure precipitation (1 or 0) on 278 (nonconsecutive) dates
discard the locations of the weather stations: can we recover the geographical layout?

[Map: the 89 weather stations plotted by longitude (−96 to −86) and latitude (36 to 42).]
[Four maps showing the estimated graph drawn over the station locations, one per method: BIC, Extended BIC, Cross-validation, Stability selection; γ = 0.25.]
Edge selection vs distance
[Figure: smoothed probability of selecting an edge (0.0 to 1.0) vs. distance between weather stations (0 to 600 miles), for BIC, extended BIC, cross-validation, and stability selection.]
Conclusion
Laplace approximation can be accurate uniformly over a large number of sparse GLMs.

Chen & Chen’s extended Bayesian information criterion (EBIC):
- connected to Bayesian model choice;
- its consistency proves consistency of ‘generic’ Bayesian procedures;
- computationally inexpensive alternative to stability selection and other resampling methods;
- seems useful for tuning regularization methods.

For details including references, see:
Bayesian model choice and information criteria in sparse generalized linear models (with Rina Foygel). arXiv:1112.5635