Cp, AIC, BIC: Three Criteria for Model Selection

Rachel Fan
Statistics Department, Columbia University
September 16, 2011


Outline

1 Mallows' Cp

2 Akaike Information Criterion (AIC)

3 Bayesian Information Criterion (BIC)

4 Comparison between AIC and BIC

1 Mallows' Cp

Linear Model

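$Y = X\beta + \varepsilon$

$Y_{n\times 1} = (y_1, \dots, y_n)^T$

$X_{n\times(k+1)} = (\mathbf{1}, x_1, \dots, x_k)$

$\beta_{(k+1)\times 1} = (\beta_0, \dots, \beta_k)^T$

$\varepsilon_{n\times 1} = (\varepsilon_1, \dots, \varepsilon_n)^T$, with independent components, $E\varepsilon_i = 0$, $V(\varepsilon_i) = \sigma^2$

$EY = X\beta$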


Notation

$K^+ = \{0, 1, \dots, k\}$, the set of indices

$P \subseteq K^+$; $Q = K^+ \setminus P$

$|P|$: the number of elements in $P$; let $|P| = p$, $|Q| = q$, so $p + q = k + 1$

$\hat\beta_P = X_P^- Y$: the LS estimator of $\beta$ using the subset of $X$ with indices in $P$

$X_P^-$ has zeros in the rows corresponding to $Q$, and the remaining rows contain the matrix $(Z_P^T Z_P)^{-1} Z_P^T$, where $Z_P$ is obtained from $X$ by deleting the columns corresponding to $Q$

$\hat Y_P = X\hat\beta_P = X X_P^- Y = H_P Y$
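The construction of $X_P^-$ and $H_P$ can be sketched numerically. This is only an illustration assuming NumPy; the function name `generalized_inverse`, the simulated design matrix, and the chosen subset `P` are hypothetical and not part of the slides.

```python
import numpy as np

def generalized_inverse(X, P):
    """Build X_P^-: zero rows for indices outside P, (Z^T Z)^{-1} Z^T elsewhere."""
    cols = sorted(P)
    Z = X[:, cols]                                   # X with the Q-columns deleted
    X_minus = np.zeros((X.shape[1], X.shape[0]))
    X_minus[cols, :] = np.linalg.solve(Z.T @ Z, Z.T)
    return X_minus

rng = np.random.default_rng(0)
n, k = 50, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # columns 0..k
P = {0, 1, 3}                                                # illustrative subset

X_minus = generalized_inverse(X, P)
H_P = X @ X_minus                                            # projection onto span(Z_P)
trace_H = float(np.trace(H_P))                               # should equal p = |P|
is_projection = np.allclose(H_P @ H_P, H_P) and np.allclose(H_P, H_P.T)
```

Under these assumptions `trace_H` matches $p = |P|$ up to rounding, which is the fact used later when the variance term is written as $p\sigma^2$.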


Scaled sum of squared errors

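$\Gamma_P = \dfrac{\|\hat Y_P - EY\|^2}{\sigma^2}$, a measure of prediction adequacy

$E\|\hat Y_P - EY\|^2 = V_P + B_P$

$V_P = \mathrm{tr}(V(\hat Y_P)) = \mathrm{tr}(H_P)\sigma^2 = p\sigma^2$

$B_P = \|EY - E\hat Y_P\|^2 = \beta^T X^T (I - H_P) X \beta$

$E\Gamma_P = p + \dfrac{B_P}{\sigma^2}$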


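Mallows' Cp

$C_P := \dfrac{SSE_P}{MSE_{k+1}} - n + 2p$

$SSE_P = \|\hat Y_P - Y\|^2$

$E(SSE_P) = (n - p)\sigma^2 + B_P$

$MSE_P = SSE_P/(n - p)$

If the full model contains all relevant variables, $E(MSE_{k+1}) = \sigma^2$

$E(C_P) \approx p + \dfrac{B_P}{\sigma^2} = E\Gamma_P$ (replacing $MSE_{k+1}$ by its expectation $\sigma^2$)

$C_P$ is an estimate of $\Gamma_P$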
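A minimal sketch of how $C_P$ might be computed for every subset that contains the intercept, assuming NumPy and simulated data. The helper names `sse` and `mallows_cp` and the coefficient vector are illustrative assumptions, not taken from the slides.

```python
import itertools
import numpy as np

def sse(X, y, cols):
    """Residual sum of squares of the least-squares fit on the given columns."""
    Z = X[:, list(cols)]
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ beta
    return float(r @ r)

def mallows_cp(X, y, cols):
    """C_P = SSE_P / MSE_{k+1} - n + 2p, with MSE_{k+1} from the full model."""
    n, k1 = X.shape
    mse_full = sse(X, y, range(k1)) / (n - k1)
    return sse(X, y, cols) / mse_full - n + 2 * len(cols)

rng = np.random.default_rng(1)
n, k = 200, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([1.0, 2.0, -1.5, 0.0, 0.0])      # variables 3 and 4 are unimportant
y = X @ beta + rng.normal(size=n)

cp_full = mallows_cp(X, y, tuple(range(k + 1)))  # equals k + 1 by construction
subsets = [(0,) + s for r in range(k + 1)
           for s in itertools.combinations(range(1, k + 1), r)]
best = min(subsets, key=lambda s: mallows_cp(X, y, s))
```

With effects this strong, the minimizing subset `best` keeps variables 1 and 2; whether it also keeps an unimportant variable depends on the noise draw.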


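Properties of Mallows' Cp

$C_P := \dfrac{SSE_P}{MSE_{k+1}} - n + 2p$

If the $P$-subset model is adequate, $SSE_P \approx (n - p)\sigma^2$ and $C_P \approx p$

$C_{K^+} = (n - k - 1) - n + 2(k + 1) = k + 1$

If $|P^*| = p + 1$ and $P \subset P^*$, then $C_{P^*} - C_P = 2 - \dfrac{SS}{MSE_{k+1}}$, where $SS$ is the contribution to SSR by the $(p + 1)$th variable

Assume $\varepsilon_i \sim N(0, \sigma^2)$; if the additional variable is unimportant, $\dfrac{SS}{MSE_{k+1}} \sim F_{1,\,n-k-1}$, the distribution of a squared $t$ variable with $n - k - 1$ degrees of freedom

If the additional variable is unimportant, then $B_P \approx B_{P^*}$, $E(SS) \approx \sigma^2$, and so $E(C_{P^*} - C_P) \approx 1$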
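The last property can be checked by simulation. The sketch below assumes NumPy, Gaussian errors, and a design in which the subset `P` is already adequate and the added variable has a zero coefficient; all names and parameter values are illustrative.

```python
import numpy as np

def sse(Z, y):
    """Residual sum of squares of the least-squares fit of y on Z."""
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ beta
    return float(r @ r)

def cp(X, y, cols, mse_full):
    """Mallows' C_P for the columns in cols, given the full-model MSE."""
    return sse(X[:, cols], y) / mse_full - len(y) + 2 * len(cols)

rng = np.random.default_rng(2)
n, k, reps = 60, 3, 2000
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([1.0, 2.0, 0.0, 0.0])     # P = {0, 1} is adequate
P, P_star = [0, 1], [0, 1, 2]             # P* adds the unimportant variable 2

diffs = []
for _ in range(reps):
    y = X @ beta + rng.normal(size=n)
    mse_full = sse(X, y) / (n - k - 1)
    diffs.append(cp(X, y, P_star, mse_full) - cp(X, y, P, mse_full))
mean_diff = float(np.mean(diffs))         # should be close to 1
```

Adding an unimportant variable therefore raises $C_P$ by about one on average, which is why subsets with unnecessary variables drift above the line $C_P = p$ only slowly.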


CP plot

Figure: $C_P$ plot with independent variables. $P$ is an adequate subset ($\beta = \beta_P$), $p = k - 2$, $K^+ \setminus P = \{1, 2, 3\}$, which are unimportant. Every non-zero element of $\beta$ is significant.


CP plot

Figure: Variables 1, 2, 3 are highly explanatory and highly correlated, and the variables in $P$ are also explanatory.


CP plot

Figure: $C_P$ plot. Variables 1 and 2 are jointly explanatory but separately not, and the variables in $P$ are also explanatory, where $|P| = k - 4$.

2 Akaike Information Criterion (AIC)

Model

$x_1, x_2, \dots, x_n \overset{\text{iid}}{\sim} g(x)$

True model: $g(x)$

Candidate parametric model: $f(x \mid \theta)$, $\theta \in \Theta_p$

MLE of $\theta$: $\hat\theta \in \Theta_p$

Fitted model: $f(x \mid \hat\theta)$


Kullback-Leibler Information

$I(g; f(\cdot \mid \theta)) := E\left(\log \dfrac{g(x)}{f(x \mid \theta)}\right) = S(g; g) - S(g; f(\cdot \mid \theta))$

$S(g; f(\cdot \mid \theta)) = E(\log f(x \mid \theta)) = \int \log f(x \mid \theta)\, g(x)\, dx$

$I$ represents the separation between $g$ and $f$

$I \ge 0$, with equality iff $f = g$ a.e.

The best fitting model minimizes $I(g; f(\cdot \mid \theta))$, or equivalently, maximizes $S(g; f(\cdot \mid \theta))$

$S(g; f(\cdot \mid \theta))$ cannot be computed since $g(x)$ is unknown.

By the SLLN, $\dfrac{1}{n} \sum_{i=1}^n \log f(x_i \mid \theta) \to S(g; f(\cdot \mid \theta))$ almost surely

Maximizing the average log-likelihood leads to the MLE $\hat\theta$
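As a small numerical illustration of these statements, the sketch below assumes $g = N(0, 1)$ and the candidate family $f(x \mid \theta) = N(\theta, 1)$, for which $I(g; f(\cdot \mid \theta)) = \theta^2/2$ in closed form. The distributions, sample size, and names are assumptions chosen for the example, not part of the slides.

```python
import numpy as np

def avg_loglik(x, theta):
    """Average log density of N(theta, 1) over the sample x."""
    return float(np.mean(-0.5 * np.log(2 * np.pi) - 0.5 * (x - theta) ** 2))

rng = np.random.default_rng(3)
x = rng.normal(loc=0.0, scale=1.0, size=200_000)     # draws from g = N(0, 1)

theta = 1.0
S_gg = -0.5 * np.log(2 * np.pi) - 0.5                     # E_g log g(x)
S_gf = -0.5 * np.log(2 * np.pi) - 0.5 * (1 + theta ** 2)  # E_g log f(x | theta)
kl_exact = S_gg - S_gf                                    # theta^2 / 2
kl_mc = avg_loglik(x, 0.0) - avg_loglik(x, theta)         # SLLN approximation

# Maximizing the average log-likelihood over a grid recovers a value near 0,
# the parameter at which f coincides with g.
thetas = np.linspace(-1.0, 1.0, 201)
theta_best = float(thetas[np.argmax([avg_loglik(x, t) for t in thetas])])
```

The sample average stands in for the unknown expectation under $g$, which is the step that turns minimizing $I$ into maximum likelihood.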


Derivation of AIC

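Consider $g(x) = f(x \mid \theta_0)$, $\theta_0 \in \Theta_k$

Denote $I(g; f(\cdot \mid \theta))$ and $S(g; f(\cdot \mid \theta))$ by $I(\theta_0, \theta)$ and $S(\theta_0, \theta)$

Assume $\theta_0 \notin \Theta_p$ ($p < k$), and let $\tilde\theta = \arg\max_{\theta \in \Theta_p} S(\theta_0, \theta)$

Suppose $\tilde\theta$ is sufficiently close to $\theta_0$

Try to find the model that minimizes $E(I(\theta_0, \hat\theta))$, where $\hat\theta$ is the MLE in $\Theta_p$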


Derivation of AIC

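$$
\begin{aligned}
E(2n I(\theta_0, \hat\theta)) &= E\big(2n (S(\theta_0, \theta_0) - S(\theta_0, \hat\theta))\big) \\
&= E\big(2n (S(\theta_0, \theta_0) - S(\theta_0, \tilde\theta) + S(\theta_0, \tilde\theta) - S(\theta_0, \hat\theta))\big) \\
&\approx E\big(n \|\tilde\theta - \theta_0\|_I^2 + n \|\hat\theta - \tilde\theta\|_I^2\big) \to n \|\tilde\theta - \theta_0\|_I^2 + p
\end{aligned}
$$

where $\|\tilde\theta - \theta_0\|_I^2 = (\tilde\theta - \theta_0)^T I(\theta_0)(\tilde\theta - \theta_0)$ and $\|\hat\theta - \tilde\theta\|_I^2 = (\hat\theta - \tilde\theta)^T I(\tilde\theta)(\hat\theta - \tilde\theta)$, with $\tilde\theta$ the maximizer of $S(\theta_0, \theta)$ over $\Theta_p$ and $\hat\theta$ the MLE

The last limit follows from $\sqrt{n}(\hat\theta - \tilde\theta) \to_d N(0, I^{-1}(\tilde\theta))$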
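The appearance of $p$ in the limit can be illustrated in the simplest case. The sketch assumes a correctly specified $p$-dimensional model $N(\theta, I_p)$, so that the maximizer of $S$ coincides with the true mean and the Fisher information is the identity; this is a simplification of the setting above, and all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, reps = 100, 3, 2000
theta = np.array([0.5, -1.0, 2.0])                 # illustrative true mean

data = theta + rng.normal(size=(reps, n, p))       # reps independent samples of size n
theta_hat = data.mean(axis=1)                      # MLE of the mean in each sample

# n * (theta_hat - theta)^T I (theta_hat - theta) with I = identity;
# each value is a chi-square draw with p degrees of freedom.
quad = n * np.sum((theta_hat - theta) ** 2, axis=1)
mean_quad = float(quad.mean())                     # should be close to p
```

The average of the quadratic form is close to $p$, the expected value of a $\chi^2_p$ variable, which is the source of the penalty term in AIC.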


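Estimate $n \|\tilde\theta - \theta_0\|_I^2$ (with $\tilde\theta$ the best approximation to $\theta_0$ in $\Theta_p$ and $\hat\theta$ the MLE) by

$2\left(\sum_{i=1}^n \log f(x_i \mid \theta_0) - \sum_{i=1}^n \log f(x_i \mid \hat\theta)\right)$

Needs bias correction by adding $p$

Therefore, an asymptotically unbiased estimator of $E(2n I(\theta_0, \hat\theta))$ is:

$2\left(\sum_{i=1}^n \log f(x_i \mid \theta_0) - \sum_{i=1}^n \log f(x_i \mid \hat\theta)\right) + 2p$

The first sum does not depend on the candidate model, so minimizing $E I(\theta_0, \hat\theta)$ is equivalent to minimizing
$AIC := -2 \log(\text{maximum likelihood}) + 2p$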
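A minimal sketch of AIC over nested parameter spaces, assuming data from $N(\theta_0, I_k)$ and candidate models in which only the first $p$ coordinates of the mean are free (the rest are fixed at zero). The function name `aic_nested_means` and the parameter values are hypothetical.

```python
import numpy as np

def aic_nested_means(x):
    """AIC for N(theta, I_k) models where only the first p mean coordinates are free."""
    n, k = x.shape
    xbar = x.mean(axis=0)
    const = n * k * np.log(2 * np.pi)
    scores = {}
    for p in range(k + 1):
        theta_hat = np.concatenate([xbar[:p], np.zeros(k - p)])   # MLE within Theta_p
        neg2_loglik = const + float(np.sum((x - theta_hat) ** 2))  # -2 log(max likelihood)
        scores[p] = neg2_loglik + 2 * p
    return scores

rng = np.random.default_rng(4)
theta0 = np.array([1.0, -0.8, 0.0, 0.0])      # last two coordinates are truly zero
x = rng.normal(size=(300, 4)) + theta0

scores = aic_nested_means(x)
best_p = min(scores, key=scores.get)
```

Freeing coordinate $j$ lowers $-2\log(\text{maximum likelihood})$ by $n\bar{x}_j^2$ and costs 2 in the penalty, so the coordinate is kept when $n\bar{x}_j^2 > 2$.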

3 Bayesian Information Criterion (BIC)

Model

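$x_1, x_2, \dots, x_n \overset{\text{iid}}{\sim} f(x \mid \theta)$

True model: $f(x \mid \theta) = \exp\{\theta \cdot y(x) - b(\theta)\}$ where $\theta \in \Theta_K$

Candidate models: $M_1, M_2, \dots, M_{2^k}$, where $M_j$ is a $k_j$-dimensional linear submanifold of $K$-dimensional space

Prior: $P(M = M_j) = \alpha_j$; $\theta_j \mid M_j \sim \mu_j$

Posterior: $f(M_j, \theta_j \mid x) = \dfrac{f(x \mid \theta_j)\, \alpha_j\, d\mu_j(\theta_j)}{m(x)}$, where $m(x)$ is the marginal density of $x$

$f(M_j \mid x) = \dfrac{\int \alpha_j f(x \mid \theta_j)\, d\mu_j(\theta_j)}{m(x)}$

Find the model with the highest posterior probability $f(M_j \mid x)$.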


BIC

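$S(Y, n, j) := \log \int \alpha_j \exp\big(n (Y \cdot \theta - b(\theta))\big)\, d\mu_j(\theta)$, where $Y = \frac{1}{n} \sum_{i=1}^n y(x_i)$

Proposition: For fixed $Y$ and $j$, as $n \to \infty$,

$S(Y, n, j) = n \sup_\theta (Y \cdot \theta - b(\theta)) - \dfrac{1}{2} k_j \log n + R$

where the supremum is taken over $\theta$ in $M_j$ and the remainder $R = R(Y, n, j)$ is bounded in $n$ for fixed $Y$ and $j$.

$BIC := -2 \log(\text{maximum likelihood}) + p \log n$

$p$ is the number of parameters in the candidate model

Select the model that minimizes BIC (equivalently, maximizes $S(Y, n, j)$ up to bounded terms)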


BIC

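Consider the general situation in which $f(x \mid \theta)$ need not belong to an exponential family

$-2 \log f(M_j \mid x) = 2 \log m(x) \underbrace{- 2 \log \alpha_j - 2 \log \int f(x \mid \theta_j)\, d\mu_j(\theta_j)}_{S'(M_j \mid x)}$

Since $m(x)$ does not depend on $j$, select the model that minimizes $S'(M_j \mid x)$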


BIC

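By a second-order Taylor expansion around the MLE $\hat\theta_j$,

$\log f(x \mid \theta_j) = \log L(\theta_j) \approx \log L(\hat\theta_j) - \dfrac{1}{2} (\theta_j - \hat\theta_j)^T [n I(\hat\theta_j)] (\theta_j - \hat\theta_j)$

where $I(\hat\theta_j) = -\dfrac{1}{n} \left.\dfrac{\partial^2 \log L(\theta_j)}{\partial \theta_j^2}\right|_{\theta_j = \hat\theta_j}$

If we have the noninformative prior $d\mu_j(\theta_j) = d\theta_j$,

$$
\begin{aligned}
\int L(\theta_j)\, d\mu_j(\theta_j) &\approx L(\hat\theta_j) \int \exp\left\{-\dfrac{1}{2} (\theta_j - \hat\theta_j)^T [n I(\hat\theta_j)] (\theta_j - \hat\theta_j)\right\} d\theta_j \\
&= L(\hat\theta_j) \left(\dfrac{2\pi}{n}\right)^{k_j/2} |I(\hat\theta_j)|^{-1/2}
\end{aligned}
$$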
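The quality of this approximation can be checked in a one-parameter case where the integral is available in closed form. The sketch assumes i.i.d. exponential data with rate $\theta$ and a flat prior on $(0, \infty)$, so that $\int L(\theta)\, d\theta = \Gamma(n+1)/S^{n+1}$ with $S = \sum x_i$; the model and all names are assumptions made for the example.

```python
import math
import numpy as np

rng = np.random.default_rng(5)
n = 100
x = rng.exponential(scale=0.5, size=n)      # illustrative data, true rate 2
S = float(x.sum())

# Log-likelihood of the rate: log L(theta) = n log(theta) - theta * S
theta_hat = n / S                            # MLE
log_L_hat = n * math.log(theta_hat) - theta_hat * S
info = 1.0 / theta_hat ** 2                  # I(theta_hat) = -(1/n) d^2 log L / d theta^2

# Laplace approximation with k_j = 1: L(theta_hat) (2 pi / n)^{1/2} |I|^{-1/2}
log_laplace = log_L_hat + 0.5 * math.log(2 * math.pi / n) - 0.5 * math.log(info)

# Exact log integral of L(theta) over (0, infinity)
log_exact = math.lgamma(n + 1) - (n + 1) * math.log(S)
gap = log_laplace - log_exact
```

In this example the gap between the two log integrals is of order $1/n$, so it is negligible next to the $k_j \log n$ term that survives in BIC.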


BIC

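Therefore,

$$
\begin{aligned}
S'(M_j \mid x) &\approx -2 \log \alpha_j - 2 \log\left[L(\hat\theta_j) \left(\dfrac{2\pi}{n}\right)^{k_j/2} |I(\hat\theta_j)|^{-1/2}\right] \\
&= -2 \log \alpha_j - 2 \log L(\hat\theta_j) + k_j \log \dfrac{n}{2\pi} + \log |I(\hat\theta_j)|
\end{aligned}
$$

Ignoring the terms that remain bounded as $n \to \infty$, we obtain

$S'(M_j \mid x) \approx -2 \log L(\hat\theta_j) + k_j \log n =: BIC$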

4 Comparison between AIC and BIC

Comparison between AIC and BIC

AIC = −2 log(maximum likelihood) + 2p

BIC = −2 log(maximum likelihood) + p log n

BIC is consistent but not asymptotically efficient; AIC is asymptotically efficient but not consistent.

Consistency: Suppose that the true model is of finite dimension, and that this model is nested in the candidate collection under consideration. A consistent criterion will asymptotically select the true model with probability one.

Efficiency: Suppose the true model is of infinite dimension and therefore lies outside the candidate collection under consideration. An asymptotically efficient criterion will asymptotically select the model that minimizes the mean squared error of prediction.


Comparison between AIC and BIC

AIC = −2 log(maximum likelihood) + 2p

BIC = −2 log(maximum likelihood) + p log n

The penalty term of BIC is more stringent than the penalty term of AIC (for $n \ge 8$, $p \log n$ is greater than $2p$, since $\log 8 \approx 2.08 > 2$)

Therefore, BIC favors smaller models than AIC
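Both points can be checked numerically. The sketch below assumes NumPy and a Gaussian polynomial regression with unknown error variance; the data-generating model, the candidate degrees, and the helper name `neg2_max_loglik` are illustrative assumptions.

```python
import math
import numpy as np

# Smallest sample size at which log n exceeds 2 (e^2 is about 7.39).
threshold = next(n for n in range(1, 100) if math.log(n) > 2)   # 8

def neg2_max_loglik(X, y):
    """-2 log(maximum likelihood) for a Gaussian linear model with unknown variance."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = float(np.sum((y - X @ beta) ** 2)) / n    # MLE of the error variance
    return n * (math.log(2 * math.pi * sigma2) + 1.0)

rng = np.random.default_rng(7)
n = 100
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 2.0 * x - x ** 2 + rng.normal(size=n)        # assumed true degree: 2

aic, bic = {}, {}
for d in range(7):                                     # candidate polynomial degrees
    X = np.vander(x, d + 1, increasing=True)
    p = d + 2                                          # d + 1 coefficients plus the variance
    dev = neg2_max_loglik(X, y)
    aic[d] = dev + 2 * p
    bic[d] = dev + p * math.log(n)

best_aic = min(aic, key=aic.get)
best_bic = min(bic, key=bic.get)
```

Because both criteria share the same likelihood term and BIC charges more per parameter once $n \ge 8$, the degree chosen by BIC is never larger than the degree chosen by AIC on the same data.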


Reference

Mallows, C. L. (1973), "Some Comments on $C_P$", Technometrics 15(4): 661-675.

Akaike, H. (1974), "A New Look at the Statistical Model Identification", IEEE Transactions on Automatic Control AC-19(6): 716-723.

Schwarz, G. (1978), "Estimating the Dimension of a Model", The Annals of Statistics 6(2): 461-464.


Proof of Proposition

Proposition: For fixed $Y$ and $j$, as $n \to \infty$,

$S(Y, n, j) = n \sup_\theta (Y \cdot \theta - b(\theta)) - \dfrac{1}{2} k_j \log n + R$

where the remainder $R = R(Y, n, j)$ is bounded as $n$ goes to $\infty$ for fixed $Y$ and $j$.

Lemma 1. The proposition holds when $Y \cdot \theta - b(\theta) = A - \lambda \|\theta - \theta_0\|^2$, where $\lambda > 0$, $\theta_0$ is a fixed vector in $M_j$, and $\mu_j$ is the Lebesgue measure on $M_j$.

Lemma 2. If two bounded positive random variables $U$ and $V$ agree on the set where either exceeds $\rho$ for some $0 < \rho < \sup U$, then $\log E(U^n) - \log E(V^n) \to 0$ as $n \to \infty$.

Lemma 3. For some $0 < \rho < e^A$, where $A = \sup_\theta (Y \cdot \theta - b(\theta))$, a vector $\theta_0$, and some positive $\lambda_1$ and $\lambda_2$, the following holds wherever $\exp(Y \cdot \theta - b(\theta)) > \rho$:

$A - \lambda_1 \|\theta - \theta_0\|^2 \le Y \cdot \theta - b(\theta) \le A - \lambda_2 \|\theta - \theta_0\|^2$


Proof of Lemma 1

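Lemma 1. The proposition holds when $Y \cdot \theta - b(\theta) = A - \lambda \|\theta - \theta_0\|^2$, where $\lambda > 0$, $\theta_0$ is a fixed vector in $M_j$, and $\mu_j$ is the Lebesgue measure on $M_j$.

Proof: In this case

$$
\begin{aligned}
S(Y, n, j) &= \log\left(\alpha_j e^{nA} \left(\dfrac{\pi}{n\lambda}\right)^{k_j/2}\right) \\
&= nA - \dfrac{1}{2} k_j \log \dfrac{n\lambda}{\pi} + \log \alpha_j
\end{aligned}
$$

Since $\sup_\theta (A - \lambda \|\theta - \theta_0\|^2) = A$, this establishes the proposition, with $R = \dfrac{1}{2} k_j \log \dfrac{\pi}{\lambda} + \log \alpha_j$.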
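The closed form for $S(Y, n, j)$ in this proof rests on a Gaussian integral that the slide leaves implicit. A worked version, assuming coordinates $u = \theta - \theta_0$ on the $k_j$-dimensional space $M_j$ (the symbol $u$ is introduced here only for the calculation), might read:

```latex
\begin{aligned}
\int \alpha_j \exp\big(n(A - \lambda \|\theta - \theta_0\|^2)\big)\, d\theta
  &= \alpha_j e^{nA} \prod_{i=1}^{k_j} \int_{-\infty}^{\infty} e^{-n\lambda u_i^2}\, du_i \\
  &= \alpha_j e^{nA} \left(\sqrt{\frac{\pi}{n\lambda}}\right)^{k_j}
   = \alpha_j e^{nA} \left(\frac{\pi}{n\lambda}\right)^{k_j/2},
\end{aligned}
\qquad \text{using } \int_{-\infty}^{\infty} e^{-a u^2}\, du = \sqrt{\frac{\pi}{a}} \text{ for } a > 0.
```

Taking logarithms and splitting $\log\frac{n\lambda}{\pi} = \log n - \log\frac{\pi}{\lambda}$ isolates the $-\frac{1}{2} k_j \log n$ term and leaves a remainder that does not depend on $n$.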


Proof of Lemma 2

Lemma 2. If two bounded positive random variables $U$ and $V$ agree on the set where either exceeds $\rho$ for some $0 < \rho < \sup U$, then

$\log E(U^n) - \log E(V^n) \to 0$ as $n \to \infty$

Proof: It suffices to show that this holds for $V$ that vanishes where $U \le \rho$. In this case $0 \le U^n - V^n \le \rho^n$ and therefore

$E(V^n) \le E(U^n) \le E(V^n) + \rho^n = E(V^n) \left(1 + \dfrac{\rho^n}{E(V^n)}\right)$

From $(E(V^n))^{1/n} \to \sup V$ and $\sup V = \sup U > \rho$, we know that $\dfrac{\rho}{E(V^n)^{1/n}} < 1$ in the limit, hence $\dfrac{\rho^n}{E(V^n)} \to 0$ and $\log\left(1 + \dfrac{\rho^n}{E(V^n)}\right) \to 0$, which establishes the result.
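A small closed-form illustration of Lemma 2, assuming $U$ uniform on $[0, 1]$ and $V = U$ on $\{U > \rho\}$, $V = 0$ elsewhere, so that $E(U^n) = 1/(n+1)$ and $E(V^n) = (1 - \rho^{n+1})/(n+1)$. The choice of distribution and of $\rho$ is an assumption made for the example.

```python
import math

rho = 0.9   # illustrative cutoff, below sup U = 1

def log_gap(n):
    """log E(U^n) - log E(V^n) for U ~ Uniform(0, 1) and V = U * 1{U > rho}."""
    e_un = 1.0 / (n + 1)
    e_vn = (1.0 - rho ** (n + 1)) / (n + 1)
    return math.log(e_un) - math.log(e_vn)

gaps = [log_gap(n) for n in (1, 10, 100, 1000)]
```

The gap equals $-\log(1 - \rho^{n+1})$, which shrinks geometrically in $n$: only the region where $U$ is near its supremum matters for large powers.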


Proof of Lemma 3

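Lemma 3. For some $0 < \rho < e^A$, where $A = \sup_\theta (Y \cdot \theta - b(\theta))$, a vector $\theta_0$, and some positive $\lambda_1$ and $\lambda_2$, the following holds wherever $\exp(Y \cdot \theta - b(\theta)) > \rho$:

$A - \lambda_1 \|\theta - \theta_0\|^2 \le Y \cdot \theta - b(\theta) \le A - \lambda_2 \|\theta - \theta_0\|^2$

Proof: As $b''(\theta) = \mathrm{Cov}(y)$, the Hessian of $b$ is positive definite. Therefore, $Y \cdot \theta - b(\theta)$ is strictly concave and attains its maximum. Let $\theta_0 = \arg\max_\theta (Y \cdot \theta - b(\theta))$. The second-order Taylor expansion at $\theta_0$ yields:

$Y \cdot \theta - b(\theta) \approx A - \dfrac{1}{2} (\theta - \theta_0)^T b''(\theta_0) (\theta - \theta_0)$

Choose $\lambda_1$ and $\lambda_2$ so that $2\lambda_1$ is larger and $2\lambda_2$ is smaller than all the eigenvalues of $b''(\theta_0)$; the bounds then hold in a neighborhood of $\theta_0$. It is now easy to determine $\rho < e^A$ so that it bounds $\exp(Y \cdot \theta - b(\theta))$ outside that neighborhood.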
