Cp, AIC, BIC: Three Criteria for Model Selection

Rachel Fan

Statistics Department, Columbia University

September 16, 2011


Outline

1 Mallows' Cp

2 Akaike Information Criterion (AIC)

3 Bayesian Information Criterion (BIC)

4 Comparison between AIC and BIC


Outline

1 Mallows' Cp





Linear Model

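$Y = X\beta + \varepsilon$

$Y_{n \times 1} = (y_1, \dots, y_n)^T$

$X_{n \times (k+1)} = (1, x_1, \dots, x_k)$

$\beta_{(k+1) \times 1} = (\beta_0, \dots, \beta_k)^T$

$\varepsilon_{n \times 1} = (\varepsilon_1, \dots, \varepsilon_n)^T$, with the $\varepsilon_i$ independent, $E\varepsilon_i = 0$, $V(\varepsilon_i) = \sigma^2$

$EY = X\beta$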


Notation

$K^+ = \{0, 1, \dots, k\}$, the set of indices

$P \subseteq K^+$; $Q = K^+ \setminus P$

$|P|$: the number of elements in $P$; let $|P| = p$, $|Q| = q$, so $p + q = k + 1$

$\hat\beta_P = X_P^- Y$: the LS estimator of $\beta$ using the subset of $X$ with indices in $P$

$X_P^-$ has zeros in the rows corresponding to $Q$, and the remaining rows contain the matrix $(Z_P^T Z_P)^{-1} Z_P^T$, where $Z_P$ is obtained from $X$ by deleting the columns corresponding to $Q$

$\hat Y_P = X \hat\beta_P = X X_P^- Y = H_P Y$


Scaled sum of squared errors

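$\Gamma_P = \dfrac{\|\hat Y_P - EY\|^2}{\sigma^2}$, a measure of prediction adequacy

$E\|\hat Y_P - EY\|^2 = V_P + B_P$

$V_P = \operatorname{tr}(V(\hat Y_P)) = \operatorname{tr}(H_P)\sigma^2 = p\sigma^2$

$B_P = \|EY - E\hat Y_P\|^2 = \beta^T X^T (I - H_P) X \beta$

$E\Gamma_P = p + \dfrac{B_P}{\sigma^2}$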
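The variance-plus-bias split can be verified in a few lines. The following is a sketch, using only the fact that $H_P = X X_P^-$ is the symmetric, idempotent projection onto the columns of $X$ indexed by $P$ (which follows from the definition of $X_P^-$):

```latex
\begin{align*}
E\|\hat Y_P - EY\|^2
  &= E\|\hat Y_P - E\hat Y_P\|^2 + \|E\hat Y_P - EY\|^2
  && \text{(the cross term has mean zero)} \\
E\|\hat Y_P - E\hat Y_P\|^2
  &= \operatorname{tr} V(\hat Y_P)
   = \operatorname{tr}(H_P\,\sigma^2 I\,H_P^T)
   = \sigma^2 \operatorname{tr}(H_P) = p\sigma^2 \\
\|E\hat Y_P - EY\|^2
  &= \|(H_P - I) X\beta\|^2
   = \beta^T X^T (I - H_P) X \beta
\end{align*}
```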


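Mallows' Cp

$C_P := \dfrac{SSE_P}{MSE_{k+1}} - n + 2p$

$SSE_P = \|\hat Y_P - Y\|^2$

$E(SSE_P) = (n - p)\sigma^2 + B_P$

$MSE_P = SSE_P / (n - p)$

If the full model contains all relevant variables, $E(MSE_{k+1}) = \sigma^2$

Treating $MSE_{k+1}$ as $\sigma^2$, $E(C_P) \approx p + \dfrac{B_P}{\sigma^2} = E\Gamma_P$

$C_P$ is an estimate of $\Gamma_P$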
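As an illustration of the definition, here is a minimal NumPy sketch that computes $C_P$ for a few column subsets of a simulated design. The helper names `sse` and `mallows_cp`, the simulated data, and the coefficient values are all assumptions made for this example, not part of the original slides.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares from the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def mallows_cp(X_full, y, subset):
    """C_P = SSE_P / MSE_{k+1} - n + 2p for the columns listed in `subset`."""
    n, k_plus_1 = X_full.shape
    mse_full = sse(X_full, y) / (n - k_plus_1)   # MSE of the full model
    p = len(subset)
    return sse(X_full[:, subset], y) / mse_full - n + 2 * p

# Hypothetical data: intercept plus k = 4 predictors, the last two irrelevant.
rng = np.random.default_rng(0)
n, k = 200, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([1.0, 2.0, -1.5, 0.0, 0.0])
y = X @ beta + rng.normal(size=n)

cp_full = mallows_cp(X, y, [0, 1, 2, 3, 4])   # equals k + 1 by construction
cp_adequate = mallows_cp(X, y, [0, 1, 2])     # adequate subset, expect roughly p = 3
cp_biased = mallows_cp(X, y, [0, 1])          # omits a relevant variable
print(cp_full, cp_adequate, cp_biased)
```

For the full model the value is exactly $k + 1$ up to floating-point error, while a subset that drops a relevant variable picks up the bias term $B_P / \sigma^2$ and lands far above its own $p$.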


Properties of Mallows' Cp

$C_P := \dfrac{SSE_P}{MSE_{k+1}} - n + 2p$

If the $P$-subset model is adequate, $SSE_P \approx (n - p)\sigma^2$ and $C_P \approx p$

$C_{K^+} = (n - k - 1) - n + 2(k + 1) = k + 1$

If $|P^*| = p + 1$ and $P \subset P^*$, then

$C_{P^*} - C_P = 2 - \dfrac{SS}{MSE_{k+1}}$

where $SS$ is the contribution to SSR by the $(p + 1)$th variable

Assume $\varepsilon_i \sim N(0, \sigma^2)$; then $\dfrac{SS}{MSE_{k+1}} \sim t_1^2$

If the additional variable is unimportant, then $B_P \approx B_{P^*}$, $E(SS) \approx \sigma^2$, and so $E(C_{P^*} - C_P) \approx 1$
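The increment formula follows directly from the definition of $C_P$. A short worked derivation, writing $SS = SSE_P - SSE_{P^*}$ for the reduction in residual sum of squares when the extra variable enters:

```latex
\begin{align*}
C_{P^*} - C_P
  &= \frac{SSE_{P^*} - SSE_P}{MSE_{k+1}} + 2(p + 1) - 2p \\
  &= 2 - \frac{SS}{MSE_{k+1}}, \\
E(C_{P^*} - C_P)
  &\approx 2 - \frac{\sigma^2}{\sigma^2} = 1
  \qquad \text{when } E(SS) \approx \sigma^2 \text{ and } MSE_{k+1} \approx \sigma^2 .
\end{align*}
```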


CP plot

Figure: $C_P$ plot with independent variables. $P$ is an adequate subset ($\beta = \beta_P$), $p = k - 2$, and $K^+ \setminus P = \{1, 2, 3\}$, which are unimportant. Every non-zero element of $\beta$ is significant.


CP plot

Figure: $C_P$ plot. Variables 1, 2, 3 are highly explanatory and highly correlated, and the variables in $P$ are also explanatory.


CP plot

Figure: $C_P$ plot. Variables 1, 2 are jointly explanatory but separately not, and the variables in $P$ are also explanatory, where $|P| = k - 4$.


Outline

2 Akaike Information Criterion (AIC)





Model

$x_1, x_2, \dots, x_n \overset{\text{iid}}{\sim} g(x)$

True model: $g(x)$

Candidate parametric model: $f(x \mid \theta)$, $\theta \in \Theta_p$

MLE of $\theta$: $\hat\theta \in \Theta_p$

Fitted model: $f(x \mid \hat\theta)$


Derivation of AIC

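Consider $g(x) = f(x \mid \theta_0)$, $\theta_0 \in \Theta_k$

Denote $I(g; f(\cdot \mid \theta))$ and $S(g; f(\cdot \mid \theta))$ by $I(\theta_0, \theta)$ and $S(\theta_0, \theta)$, where $S(g; f) = \int g(x) \log f(x)\,dx$ is the expected log-likelihood and $I(g; f) = S(g; g) - S(g; f)$ is the Kullback-Leibler information

Assume $\theta_0 \notin \Theta_p$ ($p < k$), and let $\tilde\theta = \arg\max_{\theta \in \Theta_p} S(\theta_0, \theta)$

Suppose $\tilde\theta$ is sufficiently close to $\theta_0$

Try to find the model that minimizes $E(I(\theta_0, \hat\theta))$, where $\hat\theta$ is the MLE in $\Theta_p$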


Derivation of AIC

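$E(2n I(\theta_0, \hat\theta)) = E\big(2n(S(\theta_0, \theta_0) - S(\theta_0, \hat\theta))\big)$

$= E\big(2n(S(\theta_0, \theta_0) - S(\theta_0, \tilde\theta) + S(\theta_0, \tilde\theta) - S(\theta_0, \hat\theta))\big)$

$\approx E\big(n\|\tilde\theta - \theta_0\|_I^2 + n\|\hat\theta - \tilde\theta\|_I^2\big) \to n\|\tilde\theta - \theta_0\|_I^2 + p$

Here $\hat\theta$ is the MLE in $\Theta_p$, $\tilde\theta$ maximizes $S(\theta_0, \cdot)$ over $\Theta_p$, and

$\|\tilde\theta - \theta_0\|_I^2 = (\tilde\theta - \theta_0)^T I(\theta_0)(\tilde\theta - \theta_0)$,

$\|\hat\theta - \tilde\theta\|_I^2 = (\hat\theta - \tilde\theta)^T I(\tilde\theta)(\hat\theta - \tilde\theta)$

The last limit follows from $\sqrt{n}(\hat\theta - \tilde\theta) \to_d N(0, I^{-1}(\tilde\theta))$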


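Estimate $n\|\tilde\theta - \theta_0\|_I^2$ by

$2\left(\sum_{i=1}^n \log f(x_i \mid \theta_0) - \sum_{i=1}^n \log f(x_i \mid \hat\theta)\right)$

This needs a bias correction, obtained by adding $p$

Therefore, an asymptotically unbiased estimator of $E(2n I(\theta_0, \hat\theta))$ is:

$2\left(\sum_{i=1}^n \log f(x_i \mid \theta_0) - \sum_{i=1}^n \log f(x_i \mid \hat\theta)\right) + 2p$

The first sum is the same for every candidate model, so minimizing $E I(\theta_0, \hat\theta)$ is equivalent to minimizing

AIC := −2 log(maximum likelihood) + 2p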
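To make the criterion concrete, the sketch below computes AIC for Gaussian polynomial regressions of increasing degree. It assumes a normal error model with the variance estimated by maximum likelihood and counts the variance as one extra parameter; the function name `gaussian_aic` and the simulated data are hypothetical choices for this example.

```python
import numpy as np

def gaussian_aic(X, y):
    """AIC = -2 log(maximum likelihood) + 2 * (number of parameters)
    for a linear model with i.i.d. normal errors."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2_hat = float(resid @ resid) / n          # MLE of the error variance
    max_loglik = -0.5 * n * (np.log(2.0 * np.pi * sigma2_hat) + 1.0)
    n_params = p + 1                               # coefficients plus variance
    return -2.0 * max_loglik + 2.0 * n_params

# Hypothetical data generated from a quadratic mean function.
rng = np.random.default_rng(1)
n = 300
x = rng.uniform(-2.0, 2.0, size=n)
y = 1.0 + 0.5 * x - 0.8 * x**2 + rng.normal(scale=0.5, size=n)

# Candidate models: polynomials of degree 0 through 5.
aic_by_degree = {
    d: gaussian_aic(np.vander(x, d + 1, increasing=True), y) for d in range(6)
}
best_degree = min(aic_by_degree, key=aic_by_degree.get)
print(aic_by_degree, best_degree)
```

Degrees below 2 are heavily penalized through the likelihood term. Degrees above 2 differ from the quadratic fit only by a small likelihood gain against the $2p$ penalty, so AIC can occasionally pick a slightly larger model.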


Outline

3 Bayesian Information Criterion (BIC)





Model

$x_1, x_2, \dots, x_n \overset{\text{iid}}{\sim} f(x \mid \theta)$

True model: $f(x \mid \theta) = \exp\{\theta \cdot y(x) - b(\theta)\}$ where $\theta \in \Theta_K$

Candidate models: $M_1, M_2, \dots, M_{2^k}$, where $M_j$ is a $k_j$-dimensional linear submanifold of $K$-dimensional space

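Prior: $P(M = M_j) = \alpha_j$; $\theta_j \mid M_j \sim \mu_j$

Posterior: $f(M_j, \theta_j \mid x) = \dfrac{f(x \mid \theta_j)\,\alpha_j\,d\mu_j(\theta_j)}{m(x)}$, where $m(x)$ is the marginal density of $x$

$f(M_j \mid x) = \dfrac{\int \alpha_j f(x \mid \theta_j)\,d\mu_j(\theta_j)}{m(x)}$

Find the model that gives the highest posterior probability.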


BIC

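$S(Y, n, j) := \log \int \alpha_j \exp\big(n(Y \cdot \theta - b(\theta))\big)\,d\mu_j(\theta)$, where $Y = \frac{1}{n}\sum_{i=1}^n y(x_i)$

Proposition: For fixed $Y$ and $j$, as $n \to \infty$,

$S(Y, n, j) = n \sup_\theta (Y \cdot \theta - b(\theta)) - \frac{1}{2} k_j \log n + R$

where the remainder $R = R(Y, n, j)$ is bounded in $n$ for fixed $Y$ and $j$.

BIC := −2 log(maximum likelihood) + p log n

$p$ is the number of parameters in the candidate model

Select the model that minimizes BIC (which maximizes $S(Y, n, j)$ up to the bounded remainder $R$)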
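A minimal sketch of BIC-based selection among nested Gaussian regression models follows. As assumptions for the example, the errors are taken to be normal, the variance is estimated by maximum likelihood and counted as a parameter, and the function name `gaussian_bic`, the candidate labels, and the simulated data are all hypothetical.

```python
import numpy as np

def gaussian_bic(X, y):
    """BIC = -2 log(maximum likelihood) + p log n for a linear model
    with i.i.d. normal errors."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2_hat = float(resid @ resid) / n          # MLE of the error variance
    max_loglik = -0.5 * n * (np.log(2.0 * np.pi * sigma2_hat) + 1.0)
    n_params = p + 1                               # coefficients plus variance
    return -2.0 * max_loglik + n_params * np.log(n)

# Hypothetical data: only the first of three predictors matters.
rng = np.random.default_rng(2)
n = 400
Z = rng.normal(size=(n, 3))
y = 2.0 + 1.5 * Z[:, 0] + rng.normal(size=n)

intercept = np.ones((n, 1))
candidates = {
    "intercept only": intercept,
    "intercept + z1": np.column_stack([intercept, Z[:, :1]]),
    "intercept + z1, z2, z3": np.column_stack([intercept, Z]),
}
bic = {name: gaussian_bic(X, y) for name, X in candidates.items()}
best_model = min(bic, key=bic.get)
print(bic, best_model)
```

Each extra parameter costs $\log n \approx 6$ here, so the two irrelevant predictors would need to improve $-2 \log L$ by about 12 in total to be selected.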


BIC

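By a second-order Taylor expansion around the MLE $\hat\theta_j$,

$\log f(x \mid \theta_j) = \log L(\theta_j) \approx \log L(\hat\theta_j) - \frac{1}{2}(\theta_j - \hat\theta_j)^T [n I(\hat\theta_j)] (\theta_j - \hat\theta_j)$

where $I(\hat\theta_j) = -\dfrac{1}{n} \left.\dfrac{\partial^2 \log L(\theta_j)}{\partial \theta_j^2}\right|_{\theta_j = \hat\theta_j}$

If we have a noninformative prior $d\mu_j(\theta_j) = d\theta_j$,

$\int L(\theta_j)\,d\mu_j(\theta_j) \approx L(\hat\theta_j) \int \exp\left\{-\frac{1}{2}(\theta_j - \hat\theta_j)^T [n I(\hat\theta_j)] (\theta_j - \hat\theta_j)\right\} d\theta_j$

$= L(\hat\theta_j) \left(\dfrac{2\pi}{n}\right)^{k_j/2} |I(\hat\theta_j)|^{-1/2}$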
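Taking logarithms of this approximation shows where the $p \log n$ penalty comes from. A short sketch of that last step, with $\hat\theta_j$ denoting the MLE in model $j$ and $k_j$ its dimension:

```latex
\begin{align*}
\log \int L(\theta_j)\,d\mu_j(\theta_j)
  &\approx \log L(\hat\theta_j) - \frac{k_j}{2}\log n
     + \frac{k_j}{2}\log(2\pi) - \frac{1}{2}\log\lvert I(\hat\theta_j)\rvert, \\
-2 \log \int L(\theta_j)\,d\mu_j(\theta_j)
  &\approx -2\log L(\hat\theta_j) + k_j \log n + O(1).
\end{align*}
```

The $O(1)$ terms do not grow with $n$, so for large samples the ranking of models is driven by $-2 \log L(\hat\theta_j) + k_j \log n$, which is BIC with $p = k_j$.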


Outline

4 Comparison between AIC and BIC





Comparison between AIC and BIC

AIC = −2 log(maximum likelihood) + 2p

BIC = −2 log(maximum likelihood) + p log n

BIC is consistent but not asymptotically efficient; AIC is asymptotically efficient but not consistent.

Consistency: Suppose that the true model is of a finite dimension, and that this model is nested in the candidate collection under consideration. A consistent criterion will asymptotically select the true model with probability one.

Efficiency: Suppose the true model is of an infinite dimension and therefore lies outside of the candidate collection under consideration. An asymptotically efficient criterion will asymptotically select the model which minimizes the mean squared error of prediction.


Comparison between AIC and BIC

AIC = −2 log(maximum likelihood) + 2p

BIC = −2 log(maximum likelihood) + p log n

The penalty term of BIC is more stringent than the penalty term of AIC (for $n \ge 8$, $p \log n$ is greater than $2p$)

Therefore, BIC favors smaller models than AIC
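The threshold $n \ge 8$ comes from comparing the per-parameter penalties: $\log n > 2$ exactly when $n > e^2 \approx 7.39$. A quick check using only the standard library (the variable names are chosen for this example):

```python
import math

AIC_PENALTY_PER_PARAMETER = 2.0

def bic_penalty_per_parameter(n):
    """Per-parameter BIC penalty, log n, for sample size n."""
    return math.log(n)

# Smallest integer sample size at which BIC penalizes more than AIC.
smallest_n = next(
    n for n in range(1, 100)
    if bic_penalty_per_parameter(n) > AIC_PENALTY_PER_PARAMETER
)
print(smallest_n, math.exp(2.0))
```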


References

Mallows, C. L. (1973), “Some Comments on Cp”, Technometrics 15(4): 661-675.

Akaike, H. (1974), “A New Look at the Statistical Model Identification”, IEEE Transactions on Automatic Control AC-19(6): 716-723.

Schwarz, G. (1978), “Estimating the Dimension of a Model”, The Annals of Statistics 6(2): 461-464.


Proof of Proposition

Proposition: For fixed $Y$ and $j$, as $n \to \infty$,

$S(Y, n, j) = n \sup_\theta (Y \cdot \theta - b(\theta)) - \frac{1}{2} k_j \log n + R$

where the remainder $R = R(Y, n, j)$ is bounded as $n$ goes to $\infty$ for fixed $Y$ and $j$.

Lemma 1. The proposition holds when $Y \cdot \theta - b(\theta) = A - \lambda\|\theta - \theta_0\|^2$, where $\lambda > 0$, $\theta_0$ is a fixed vector in $m_j$, and $\mu_j$ is the Lebesgue measure on $m_j$.

Lemma 2. If two bounded positive random variables $U$ and $V$ agree on the set where either exceeds $\rho$ for some $0 < \rho < \sup U$, then $\log E(U^n) - \log E(V^n) \to 0$ as $n \to \infty$.

Lemma 3. For some $0 < \rho < e^A$ where $A = \sup_\theta (Y \cdot \theta - b(\theta))$, a vector $\theta_0$, and some positive $\lambda_1$ and $\lambda_2$, the following holds wherever $\exp(Y \cdot \theta - b(\theta)) > \rho$:

$A - \lambda_1\|\theta - \theta_0\|^2 < Y \cdot \theta - b(\theta) < A - \lambda_2\|\theta - \theta_0\|^2$


Proof of Lemma 1

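Lemma 1. The proposition holds when $Y \cdot \theta - b(\theta) = A - \lambda\|\theta - \theta_0\|^2$, where $\lambda > 0$, $\theta_0$ is a fixed vector in $m_j$, and $\mu_j$ is the Lebesgue measure on $m_j$.

Proof: In this case

$S(Y, n, j) = \log\left(\alpha_j e^{nA} \left(\dfrac{\pi}{n\lambda}\right)^{k_j/2}\right) = nA - \frac{1}{2} k_j \log\dfrac{n\lambda}{\pi} + \log \alpha_j$

and $\sup_\theta (A - \lambda\|\theta - \theta_0\|^2) = A$.

This establishes the proposition, with $R = \frac{1}{2} k_j \log\dfrac{\pi}{\lambda} + \log \alpha_j$.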
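The closed form for $S(Y, n, j)$ is a product of one-dimensional Gaussian integrals. A sketch of that step, using $\int_{-\infty}^{\infty} e^{-a t^2}\,dt = \sqrt{\pi / a}$ for $a > 0$ and writing $t = \theta - \theta_0$ in coordinates on the $k_j$-dimensional space $m_j$:

```latex
\begin{align*}
\int_{m_j} \alpha_j \exp\{n(A - \lambda\|\theta - \theta_0\|^2)\}\,d\theta
  &= \alpha_j e^{nA} \prod_{i=1}^{k_j} \int_{-\infty}^{\infty} e^{-n\lambda t_i^2}\,dt_i \\
  &= \alpha_j e^{nA} \left(\frac{\pi}{n\lambda}\right)^{k_j/2}.
\end{align*}
```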


Proof of Lemma 2

Lemma 2. If two bounded positive random variables $U$ and $V$ agree on the set where either exceeds $\rho$ for some $0 < \rho < \sup U$, then

$\log E(U^n) - \log E(V^n) \to 0$ as $n \to \infty$

Proof: It suffices to show that this holds for $V$ that vanishes where $U \le \rho$. In this case $0 \le U^n - V^n \le \rho^n$ and therefore

$E(V^n) \le E(U^n) \le E(V^n) + \rho^n = E(V^n)\left(1 + \dfrac{\rho^n}{E(V^n)}\right)$

From $(E(V^n))^{1/n} \to \sup V$ and $\sup V = \sup U > \rho$, we know that $\dfrac{\rho}{(E(V^n))^{1/n}} < 1$ in the limit. Hence $\dfrac{\rho^n}{E(V^n)} \to 0$, so $\log\left(1 + \dfrac{\rho^n}{E(V^n)}\right) \to 0$, which establishes the result.


Proof of Lemma 3

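Lemma 3. For some $0 < \rho < e^A$ where $A = \sup_\theta (Y \cdot \theta - b(\theta))$, a vector $\theta_0$, and some positive $\lambda_1$ and $\lambda_2$, the following holds wherever $\exp(Y \cdot \theta - b(\theta)) > \rho$:

$A - \lambda_1\|\theta - \theta_0\|^2 < Y \cdot \theta - b(\theta) < A - \lambda_2\|\theta - \theta_0\|^2$

Proof: As $b''(\theta) = \operatorname{Cov}(y)$, it is positive definite. Therefore, $Y \cdot \theta - b(\theta)$ is strictly concave and attains its maximum. Let $\theta_0 = \arg\max_\theta (Y \cdot \theta - b(\theta))$. Since the gradient vanishes at $\theta_0$, the second-order Taylor expansion at $\theta_0$ yields:

$Y \cdot \theta - b(\theta) \approx A - \frac{1}{2}(\theta - \theta_0)^T b''(\theta_0)(\theta - \theta_0)$

Choose $2\lambda_1$ larger and $2\lambda_2$ (positive) smaller than all the eigenvalues of $b''(\theta_0)$; the two inequalities then hold in a neighborhood of $\theta_0$. It is now easy to determine $\rho < e^A$ so that it will bound $\exp(Y \cdot \theta - b(\theta))$ outside the neighborhood.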

