Bayesian Model Choice and Information Criteria in Sparse Generalized Linear Models
Mathias Drton
Department of Statistics, University of Chicago
(Paper with this title: Rina Foygel & M.D., arXiv:1112.5635)
Outline
1 BIC and extensions
2 Asymptotics for marginal likelihood of GLMs
3 Consistency for GLMs
4 Ising models
Bayesian information criterion (BIC)
Sample $Y_1, \dots, Y_n$

Parametric model $M$ with maximized log-likelihood $\hat\ell(M)$

Bayesian information criterion (Schwarz, 1978):

$$\mathrm{BIC}(M) := \hat\ell(M) - \frac{\dim(M)}{2}\,\log n$$

‘Generic’ model selection approach: maximize $\mathrm{BIC}(M)$ over the set of considered models.
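To make the generic approach concrete, here is a minimal sketch in Python (my own toy example, not from the talk) that scores polynomial regression models of increasing degree by BIC and keeps the maximizer:

```python
import numpy as np

# Toy illustration of 'maximize BIC(M) over the set of considered models':
# candidate models are polynomial regressions of degree 0..5; the true
# model has degree 1. All constants here are my own choices.

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1, 1, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

def bic(degree):
    X = np.vander(x, degree + 1)                     # design matrix incl. intercept
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2 = np.mean(resid ** 2)                     # Gaussian MLE of the variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    dim = degree + 2                                 # coefficients plus variance
    return loglik - 0.5 * dim * np.log(n)

print(max(range(6), key=bic))                        # typically selects degree 1
```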
Motivation: 1) Bayesian model choice
Posterior model probability in a fully Bayesian treatment:

$$P(M \mid Y_1, \dots, Y_n) \;\propto\; \underbrace{P(M)}_{\text{prior}} \; P(Y_1, \dots, Y_n \mid M).$$

Marginal likelihood:

$$L_n(M) := P(Y_1, \dots, Y_n \mid M) = \int \underbrace{P(Y_1, \dots, Y_n \mid \theta_M, M)}_{\text{likelihood fct.}} \; \underbrace{dP(\theta_M \mid M)}_{\text{prior}}$$
Motivation: 2) Asymptotics
$Y_1, \dots, Y_n$ i.i.d. sample from $P_0 \in M$

Theorem (Schwarz, 1978; Haughton, 1988; and others)
Assume $P(\theta_M \mid M)$ is a ‘nice’ prior on $\mathbb{R}^d$. Then in ‘nice’ models,

$$\log L_n(M) = \hat\ell_n(M) - \frac{d}{2}\log n + O_p(1),$$

and a better (Laplace) approximation is possible:

$$\log L_n(M) = \hat\ell_n(M) - \frac{d}{2}\log\Big(\frac{n}{2\pi}\Big) + \log P(\hat\theta_M \mid M) - \frac{1}{2}\log\det\Big[\frac{1}{n}\,\mathrm{Hessian}(\hat\theta_M)\Big] + O_p\big(n^{-1/2}\big)$$
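A quick numeric check of both displays (my own toy example, not from the talk): for $Y_i \sim N(\theta, 1)$ with prior $\theta \sim N(0, \tau^2)$, the marginal likelihood has a closed form, $d = 1$, and the scaled observed information equals 1, so the BIC error should stay $O_p(1)$ while the Laplace error shrinks:

```python
import numpy as np

rng = np.random.default_rng(1)
tau2 = 4.0  # prior variance, an arbitrary choice

def exact_log_marginal(y):
    # Closed-form log L_n for Y_i ~ N(theta, 1), theta ~ N(0, tau2).
    n, S = len(y), y.sum()
    A = n + 1.0 / tau2                               # posterior precision
    return (-0.5 * n * np.log(2 * np.pi) - 0.5 * np.log(tau2 * A)
            - 0.5 * np.sum(y ** 2) + S ** 2 / (2 * A))

for n in [50, 500, 5000]:
    y = rng.normal(0.7, 1.0, size=n)
    theta_hat = y.mean()                             # MLE
    loglik = -0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum((y - theta_hat) ** 2)
    bic = loglik - 0.5 * np.log(n)                   # d = 1
    log_prior = -0.5 * np.log(2 * np.pi * tau2) - theta_hat ** 2 / (2 * tau2)
    laplace = loglik - 0.5 * np.log(n / (2 * np.pi)) + log_prior
    # BIC error stays O_p(1); the Laplace error shrinks with n
    print(n, exact_log_marginal(y) - bic, exact_log_marginal(y) - laplace)
```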
Consistency
Theorem
Fix a finite set of ‘nice’ models. Then BIC selects a true model of smallest dimension with probability tending to one as $n \to \infty$.

Proof.
Finite set of models $\Longrightarrow$ pairwise comparisons suffice.

If $P_0 \in M_1 \subsetneq M_2$ and $d_1 < d_2$, then

$$\hat\ell_n(M_2) - \hat\ell_n(M_1) = O_p(1), \quad\text{and}\quad (d_2 - d_1)\log n \to \infty.$$

If $P_0 \in M_1 \setminus \mathrm{clos}(M_2)$, then with probability tending to one,

$$\frac{1}{n}\big[\hat\ell_n(M_1) - \hat\ell_n(M_2)\big] > \varepsilon > 0, \quad\text{and}\quad \log(n)/n \to 0.$$
Linear regression (covariates i.i.d. N(0, 1), φ1 = 1, σ = 2)
[Figure]
BIC in higher-dimensional linear regression
Exhaustive search up to 6 covariates
[Figure: probability of correct selection vs. p (10 to 80); n = p, σ = 1, k = 2, φ1 = φ2 = 1]
Higher-dimensional linear regression . . . too large models
[Figure: σ = 1, k = 2, φ1 = φ2 = 1]
Broman & Speed (2002)
Informative prior on models in higher dim. regression
[Figure: σ = 1, k = 2, φ1 = φ2 = 1]
Informative prior on models in higher dim. regression
Exhaustive search up to 6 covariates
[Figure: probability of correct selection vs. p (10 to 80) for BIC and EBIC; n = p, σ = 1, k = 2, φ1 = φ2 = 1]
Extended Bayesian information criterion
Linear regression
Models given by subsets of covariates $J \subset [p] := \{1, \dots, p\}$

Prior on models

$$P(J) = \frac{1}{p+1} \cdot \binom{p}{|J|}^{-1},$$

under which $k = \#\text{covariates}$ is uniform on $\{0, 1, \dots, p\}$ and $(J \mid k)$ is uniformly distributed.

Extended BIC defined as

$$\mathrm{EBIC}(J) = \mathrm{BIC}(J) - |J| \log p;$$

we have $|J| \ll p$ in mind.

Bogdan et al. (2004), Chen & Chen (2008), Scott & Berger (2010), ...
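A sketch of the exhaustive search behind the preceding plots (my own simplified version: $n = p = 60$ and models up to three covariates for speed, rather than six). BIC is recovered as the $\gamma = 0$ case of the $\mathrm{EBIC}_\gamma$ defined later:

```python
import numpy as np
from itertools import combinations

def loglik(y, X, J):
    # Maximized Gaussian log-likelihood of the regression of y on columns J.
    n = len(y)
    if J:
        XJ = X[:, list(J)]
        resid = y - XJ @ np.linalg.lstsq(XJ, y, rcond=None)[0]
    else:
        resid = y
    sigma2 = np.mean(resid ** 2)
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

def ebic(y, X, J, gamma):
    n, p = X.shape
    return loglik(y, X, J) - 0.5 * len(J) * np.log(n) - gamma * len(J) * np.log(p)

rng = np.random.default_rng(2)
n = p = 60
X = rng.normal(size=(n, p))
y = X[:, 0] + X[:, 1] + rng.normal(size=n)           # k = 2, phi_1 = phi_2 = 1
models = [J for k in range(4) for J in combinations(range(p), k)]
# ~36k candidate models; takes a few seconds
print("BIC: ", max(models, key=lambda J: ebic(y, X, J, gamma=0.0)))
print("EBIC:", max(models, key=lambda J: ebic(y, X, J, gamma=1.0)))
```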
Theory = consistency for EBIC
Chen & Chen '08: high-dimensional sparse linear regression (fixed design, # active covariates bounded).
Chen & Chen '11: generalized linear models (fixed design, canonical link).
Chen et al. '11: generalizations for fixed-design regression.
Gao et al. '10; Foygel & D. '10: Gaussian graphical models (adjust penalty for number of graphs).
Questions
Bayesian connection under high-dimensional asymptotics:
- Is the Laplace approximation to the marginal likelihood accurate uniformly over a growing number of models?
- Does EBIC capture the growth of the marginal likelihood?

Consistency for random designs?

Consistency for pseudo-likelihood approaches to graphical model selection?

Consistency of fully Bayesian model choice as corollaries? Shang & Clayton (2011)
Generalized linear model: Setup
Independent (response) observations Y1, . . . ,Yn
Distribution of $Y_i \sim p_{\theta_i}$ from a univariate exponential family:

$$p_\theta(y) \propto \exp\{y \cdot \theta - b(\theta)\}, \qquad \theta \in \Theta = \mathbb{R}.$$

Linearity:

$$\theta = (\theta_1, \dots, \theta_n)^T = X\phi, \qquad \phi \in \mathbb{R}^p,$$

for design matrix $X = (X_{ij}) \in \mathbb{R}^{n \times p}$ (rows = experiments, columns = covariates).

Random design with rows $X_{1\bullet}, \dots, X_{n\bullet}$ i.i.d.

Variable selection: find the support $J^* \subset [p]$ of the true parameter $\phi^*$.
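For concreteness, a short simulation sketch (constants are my own choices) of such a sparse GLM in the logistic case, where $b(\theta) = \log(1 + e^\theta)$:

```python
import numpy as np

# Sparse logistic GLM with random design: true support J* = {0, 1, 2}.
rng = np.random.default_rng(3)
n, p = 500, 100
phi_star = np.zeros(p)
phi_star[:3] = 1.0                        # ||phi*||_2 bounded, cf. assumption (B4)
X = rng.uniform(-1, 1, size=(n, p))       # bounded i.i.d. rows, cf. assumption (A)
theta = X @ phi_star                      # natural parameters theta = X phi
Y = rng.binomial(1, 1 / (1 + np.exp(-theta)))
```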
Assumptions
(A) Bounded covariates (or a moment condition).

(B1) Subexponential growth of dimension: $\log(p_n) = o(n)$.

(B2) Dimension of the smallest true model bounded by a fixed $q \in \mathbb{N}$.

(B3) Small sets of covariates have second-moment matrices with minimal eigenvalue bounded away from zero:

$$\lambda_{\min}\big(E[X_{1J} X_{1J}^T]\big) > a > 0 \quad \text{for all } |J| \le 2q.$$

(B4) Norm of the signal $\|\phi^*\|_2$ bounded.
Theorem (Laplace approximation)
Assume (A), (B1)-(B4) and ‘nice priors’ $(f_J : J \subset [p], |J| \le q)$. Then there is a constant $C$ such that the marginal likelihood sequence $L_n(J)$ satisfies

$$\log L_n(J) = \ell_n(\hat\phi_J) - \frac{|J|}{2}\log(n) + \log f_J(\hat\phi_J) + \frac{|J|}{2}\log(2\pi) - \frac{1}{2}\log\det\Big(\frac{1}{n}\,\mathrm{Hessian}_J(\hat\phi_J)\Big) \pm C\sqrt{\frac{\log(np)}{n}} \quad \text{for all } |J| \le q,$$

with probability tending to 1 as $n \to \infty$.
EBIC approximation
EBIC (with parameter $\gamma \ge 0$):

$$\mathrm{EBIC}_\gamma(J) = \ell_n(\hat\phi_J) - \frac{|J|}{2}\log(n) - \gamma\,|J| \cdot \log(p).$$

Corollary
Assume (A), (B1)-(B4) and ‘nice priors’ $(f_J : J \subset [p], |J| \le q)$. Adopt the unnormalized model prior

$$P_\gamma(J) = \binom{p}{|J|}^{-\gamma} \cdot \mathbf{1}\{|J| \le q\}.$$

Then there is a constant $C'$ such that with probability tending to 1 as $n \to \infty$, we have

$$\Big|\log\big[P_\gamma(J, Y)\big] - \mathrm{EBIC}_\gamma(J)\Big| \le C' \quad \text{for all } |J| \le q.$$
Laplace approximation to marginal likelihood
$$\int_{\mathbb{R}^J} \exp\big(\ell_n(\hat\phi_J + \gamma)\big) \cdot f_J(\hat\phi_J + \gamma)\, d\gamma$$

Taylor series:

$$\ell_n(\hat\phi_J + \gamma) = \ell_n(\hat\phi_J) - \frac{1}{2}\,\gamma^\top\, \mathrm{Hessian}_J(\hat\phi_J + t_\gamma \cdot \gamma)\, \gamma$$

Approximation by a Gaussian integral:

$$f_J(\hat\phi_J) \cdot \int_{\mathbb{R}^J} \exp\big(\ell_n(\hat\phi_J)\big) \cdot \exp\Big({-\frac{1}{2}\,\gamma^\top\, \mathrm{Hessian}_J(\hat\phi_J)\, \gamma}\Big)\, d\gamma = f_J(\hat\phi_J) \cdot \exp\big(\ell_n(\hat\phi_J)\big) \cdot \sqrt{\Big(\frac{2\pi}{n}\Big)^{|J|} \det\Big(\frac{1}{n}\,\mathrm{Hessian}_J(\hat\phi_J)\Big)^{-1}}$$
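A one-dimensional numeric check of this Gaussian-integral approximation (my own example: logistic regression with a single covariate and an assumed $N(0, 4)$ prior), comparing the Laplace formula against quadrature:

```python
import numpy as np
from scipy import integrate, optimize

rng = np.random.default_rng(4)
n = 200
x = rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-0.8 * x)))

def neg_loglik(phi):                       # -ell_n(phi) for logistic regression
    theta = phi * x
    return -np.sum(y * theta - np.log1p(np.exp(theta)))

def prior(phi):                            # f_J, assumed N(0, 4)
    return np.exp(-phi ** 2 / 8.0) / np.sqrt(8.0 * np.pi)

phi_hat = optimize.minimize_scalar(neg_loglik).x        # MLE
pr = 1 / (1 + np.exp(-phi_hat * x))
hess = np.sum(x ** 2 * pr * (1 - pr))                   # observed information

# Laplace: f_J(phi_hat) * exp(ell_n(phi_hat)) * sqrt(2*pi / Hessian), |J| = 1
log_laplace = (np.log(prior(phi_hat)) - neg_loglik(phi_hat)
               + 0.5 * np.log(2 * np.pi / hess))

# Quadrature of exp(ell_n(phi_hat + g)) * f_J(phi_hat + g), stabilized at the MLE
val, _ = integrate.quad(lambda g: np.exp(neg_loglik(phi_hat) - neg_loglik(phi_hat + g))
                        * prior(phi_hat + g), -10, 10)
log_exact = np.log(val) - neg_loglik(phi_hat)
print(log_laplace, log_exact)               # the two should nearly agree
```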
Laplace approximation to marginal likelihood
$$\int_{\mathbb{R}^J} \exp\Big(\underbrace{\ell_n(\hat\phi_J) - \tfrac{1}{2}\,\gamma^\top\, \mathrm{Hessian}_J(\hat\phi_J + t_\gamma \cdot \gamma)\,\gamma}_{\approx\; \ell_n(\hat\phi_J) - \frac{1}{2}\gamma^\top \mathrm{Hessian}_J(\hat\phi_J)\,\gamma}\Big)\, d\gamma$$

[Figure: the log-likelihood $\ell_n$ peaks at $\gamma = 0$, i.e., $\phi = \hat\phi_J$; the regions $\|\gamma\|_2 \le \sqrt{\log(p)/n}$ and $\|\gamma\|_2 \le 1$ are marked.]
Assumptions on priors
Family of priors $(f_J : J \subset [p], |J| \le q)$ is ‘nice’ if for constants $0 < F_1, F_2, F_3 < \infty$ we have, uniformly for all $|J| \le q$:

(i) an upper bound: $\sup_{\phi_J} f_J(\phi_J) \le F_1 < \infty$,

(ii) a lower bound over a compact set: $\inf_{\|\phi_J\|_2 \le R+1} f_J(\phi_J) \ge F_2 > 0$, where $R$ is a function of the constants in (A) & (B1)-(B4),

(iii) a Lipschitz property on the same compact set: $\sup_{\|\phi_J\|_2 \le R+1} \|\nabla f_J(\phi_J)\|_2 \le F_3 < \infty$.
(B5) Small true coefficients don’t decay too fast:

$$\sqrt{\frac{\log(n p_n)}{n}} = o\Big(\min\big\{|\phi_j^*| : j \in J^*\big\}\Big).$$

Theorem (EBIC consistency in GLM)
Assume (A), (B1)-(B5). Let

$$\kappa = \lim_{n\to\infty} \frac{\log p_n}{\log n} \in [0, \infty],$$

and take $\gamma > 1 - \frac{1}{2\kappa}$. Then with probability tending to 1 as $n \to \infty$, we have

$$\mathrm{EBIC}_\gamma(J^*) - \max_{J \ne J^*,\, |J| \le q} \mathrm{EBIC}_\gamma(J) \;\ge\; \log(p) \cdot C_{\mathrm{high}} + \log(n) \cdot C_{\mathrm{low}}$$

for constants $C_{\mathrm{high}}, C_{\mathrm{low}} > 0$.
EBIC approximates Bayesian model choice
Corollary (Consistency of Bayesian model choice)
Assume (A), (B1)-(B5) and ‘nice priors’. Then with probability tending to 1 as $n \to \infty$, we have

$$P_\gamma(J^* \mid Y) > \max_{J \ne J^*,\, |J| \le q} P_\gamma(J \mid Y).$$
Experiment for sparse logistic regression (with lasso)
Spambase data from the UCI Machine Learning Repository
$n_0 = 4601$ emails, $p_0 = 57$ covariates
Downsample to $n < n_0$ experiments.
Create $p - p_0$ noise covariates by random permutation.
Total number of covariates $p$ satisfies $p/n = p_0/25 \approx 2.28$.
Select a model from the lasso path using EBIC, cross-validation and stability selection (Meinshausen & Bühlmann, 2010).
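A rough sketch of this experiment (my own reconstruction on synthetic stand-in data, not the authors' code; the penalty grid and $\gamma = 0.5$ are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ebic_logistic(X, y, support, gamma):
    # EBIC_gamma of the (approximately) unpenalized refit on the support.
    n, p = X.shape
    k = len(support)
    if k == 0:
        q = y.mean()
        loglik = n * (q * np.log(q) + (1 - q) * np.log(1 - q))
    else:
        refit = LogisticRegression(C=1e6, max_iter=2000).fit(X[:, support], y)
        prob = np.clip(refit.predict_proba(X[:, support])[:, 1], 1e-12, 1 - 1e-12)
        loglik = np.sum(y * np.log(prob) + (1 - y) * np.log(1 - prob))
    return loglik - 0.5 * k * np.log(n) - gamma * k * np.log(p)

rng = np.random.default_rng(5)
n, p0 = 300, 57
X0 = rng.normal(size=(n, p0))                   # stand-in for the Spambase features
y = rng.binomial(1, 1 / (1 + np.exp(-X0[:, :3].sum(axis=1))))
noise = np.column_stack([rng.permutation(X0[:, j])
                         for j in range(p0) for _ in range(11)])
X = np.column_stack([X0, noise])                # p = 12 * p0 = 684, p / n = 2.28

path = [np.flatnonzero(LogisticRegression(penalty="l1", solver="liblinear",
                                          C=C).fit(X, y).coef_[0])
        for C in np.logspace(-2.5, 0, 20)]      # crude stand-in for a lasso path
best = max(path, key=lambda S: ebic_logistic(X, y, S, gamma=0.5))
print("EBIC-selected covariates:", best)
```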
Positive selection and false discovery rate
[Two figures: positive selection rate (PSR, 0% to 50%) and false discovery rate (FDR, 0% to 80%) vs. number of samples (100 to 600), for BIC_0, BIC_0.25, BIC_0.5, BIC_1, cross-validation, and stability selection.]
Comparison to full data
[Figure: p-value of feature in the full regression (sample size 4601) vs. smoothed probability of selection (subsample size 600), for BIC_0, BIC_0.25, BIC_0.5, BIC_1, cross-validation, and stability selection.]
Figure: Smoothed probability of selecting a true feature, as a function of thep-value of that feature in the full regression.
Ising model
Observe i.i.d. X (1), . . . ,X (n) ∈ {0, 1}p
Likelihood function:

$$\frac{1}{Z(\Theta)} \cdot \exp\Big\{\sum_j \Theta_{j0}\, x_j + \sum_{j<k} \Theta_{jk}\, x_j x_k\Big\}$$

with normalizing constant $Z(\Theta)$ and (sparse) potential matrix $\Theta$.

Full conditional for $X_j$ is proportional to

$$\exp\Big\{x_j \cdot \Big(\Theta_{j0} + \sum_{k \ne j} \Theta_{jk}\, x_k\Big)\Big\}$$

Model selection problem: find the support $E^*$ (the ‘graph’) of the true potential matrix $\Theta^*$.
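The full conditional makes Gibbs sampling straightforward; a small sketch (my own, storing the fields $\Theta_{j0}$ on the diagonal of $\Theta$):

```python
import numpy as np

def gibbs_sample(Theta, n_samples, burn=500, rng=None):
    # Theta: symmetric (p, p) potential matrix, Theta[j, j] holding Theta_j0.
    if rng is None:
        rng = np.random.default_rng()
    p = Theta.shape[0]
    x = rng.integers(0, 2, size=p)
    out = np.empty((n_samples, p), dtype=int)
    for t in range(burn + n_samples):
        for j in range(p):
            # P(X_j = 1 | rest) is logistic in Theta_j0 + sum_{k != j} Theta_jk x_k
            eta = Theta[j, j] + Theta[j] @ x - Theta[j, j] * x[j]
            x[j] = rng.random() < 1.0 / (1.0 + np.exp(-eta))
        if t >= burn:
            out[t - burn] = x
    return out

p = 5
Theta = np.zeros((p, p))
for j in range(p - 1):                      # chain graph, edge weights 1
    Theta[j, j + 1] = Theta[j + 1, j] = 1.0
X = gibbs_sample(Theta, n_samples=1000)
print(np.corrcoef(X.T).round(2))            # neighboring nodes should correlate
```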
Neighborhood selection for sparse Ising models
For each $X_j$, select its neighborhood via the lasso:

$$\hat\Theta^{(\lambda)}_{j\bullet} = \arg\max\Big[\ell_{X_j \mid X_{-j}}\big(\Theta_{j\bullet}\big) - \lambda \cdot \sum_{k \ne j} |\Theta_{jk}|\Big]$$

(Meinshausen & Bühlmann, 2006; Ravikumar et al., 2010)

How to choose $\lambda$, i.e., the neighborhoods from each path?

Cross-validation tends to select too large neighborhoods.

Apply EBIC (a sketch follows below):
- Let $E_{j,\lambda}$ be the edges incident to $j$ in the support of $\hat\Theta^{(\lambda)}_{j\bullet}$.
- Maximize

$$\ell_{X_j \mid X_{-j}}\big(\hat\Theta^{(\lambda)}_{j\bullet}\big) - \frac{|E_{j,\lambda}|}{2}\log(n) - |E_{j,\lambda}| \cdot \gamma \log(p)$$

with respect to $\lambda$.
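The sketch announced above (my own code, not the authors'): it scores the penalized fit itself along a grid of penalty strengths, as in the display.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_neighborhood(X, j, gamma=0.25, Cs=np.logspace(-2, 1, 30)):
    # l1-penalized logistic regression of X_j on X_{-j}; pick lambda by EBIC.
    n, p = X.shape
    others = np.delete(np.arange(p), j)
    y = X[:, j]
    best_score, best_nbhd = -np.inf, np.array([], dtype=int)
    for C in Cs:                                   # larger C means weaker penalty
        fit = LogisticRegression(penalty="l1", solver="liblinear",
                                 C=C).fit(X[:, others], y)
        nbhd = others[np.abs(fit.coef_[0]) > 1e-8]
        prob = np.clip(fit.predict_proba(X[:, others])[:, 1], 1e-12, 1 - 1e-12)
        loglik = np.sum(y * np.log(prob) + (1 - y) * np.log(1 - prob))
        score = loglik - 0.5 * len(nbhd) * np.log(n) - gamma * len(nbhd) * np.log(p)
        if score > best_score:
            best_score, best_nbhd = score, nbhd
    return best_nbhd

# e.g., on the Gibbs samples X from the previous sketch:
# for j in range(X.shape[1]): print(j, select_neighborhood(X, j))
```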
Consistency of EBIC for Ising model selection

Theorem
Consider subexponential growth of $p = p_n$ with

$$\kappa = \lim_{n\to\infty} \frac{\log p_n}{\log n} \in [0, \infty].$$

Assume
- all neighborhood sizes are bounded by a constant,
- $\sqrt{\log(np)/n} \ll |\Theta^*_{jk}| \le$ a constant, for all edges $(j, k)$.

Take $\gamma > 2 - \frac{1}{2\kappa}$. Then with probability tending to 1 as $n \to \infty$:
$\mathrm{EBIC}_\gamma$ selects the right neighborhood for every $X_j$.

Follows from consistency of EBIC for GLMs with random covariates.
Precipitation data (U.S. Historical Climatology Network)
89 weather stations
measure precipitation (1 or 0) on 278 (nonconsecutive) dates
discard the locations of the weather stations: can we recover the geographical layout?

[Map: the 89 weather stations plotted by longitude (−96 to −86) and latitude (36 to 42).]
[Four maps showing the estimated graph drawn over the station locations, one per method: BIC, Extended BIC, Cross-validation, Stability selection; γ = 0.25.]
Edge selection vs distance
[Figure: smoothed probability of selecting an edge (0.0 to 1.0) vs. distance between weather stations (0 to 600 miles), for BIC, extended BIC, cross-validation, and stability selection.]
Conclusion
Laplace approximation can be accurate uniformly over a large number of sparse GLMs.

Chen & Chen’s extended Bayesian information criterion (EBIC):
- connected to Bayesian model choice;
- its consistency proves consistency of ‘generic’ Bayesian procedures;
- computationally inexpensive alternative to stability selection and other resampling methods;
- seems useful for tuning regularization methods.

For details including references, see:
Bayesian model choice and information criteria in sparse generalized linear models (with Rina Foygel). arXiv:1112.5635