Chapter 1: Variable Selection and Model Selection
Maarten Jansen
Overview
1. Multiple linear regression (Slide 2)
2. Variable and model selection (Slide 68)
3. Model averaging (Slide 145)
4. High dimensional model selection (Slide 124)
© Maarten Jansen, STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression, p.1
PART 1: MULTIPLE LINEAR REGRESSION
1.1 The general linear model, slide 3
1.2 Algebraic solution in a Euclidean space, slide 7
1.3 Tools for statistical exploration, slide 19
1.4 Statistical properties of least squares estimators, slide 35
1.1 The general linear model
Y_i = β_0 + ∑_{j=1}^{p−1} β_j x_{i,j} + ε_i
With:
• Yi the ith observation of the response variable
– dependent variable
– explained, observed variable
• xi,j the (observed) ith value of the jth covariate
also called
– independent variable
– regression variable
– explanatory variable
– predictor, predicting variable
– control variable
The general linear model (2)
• β_j the regression coefficient corresponding to the jth covariate
Together the coefficients form a parameter vector β, to be estimated
• β_0 the intercept (the other coefficients are sometimes called “slopes”)
• εi the error (observational noise) in the ith observation
In matrix form: Y = Xβ + ε, where X ∈ R^{n×p} and the first column of X is all
ones (X_{i,0} = 1), corresponding to the intercept. The other columns of X are
observed covariate values.
We also define the expected or true response µ = Xβ
µ = E(Y )
Y = µ + ε
Examples, specific cases (1)
1. Polynomial regression: Y_i = β_0 + ∑_{j=1}^{p−1} β_j x_i^j + ε_i. So x_{i,j} = x_i^j.
There is only one explanatory variable in the strict sense (x, with observations
x_i), but the response depends on several powers of x.
2. Models with interaction
Suppose there are 2 explanatory variables, x_1 and x_2, but Y also depends on the
interaction between them: Y_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + β_{1,2} x_{i,1} x_{i,2} + ε_i
We then set x_3 = x_1 x_2, so the model is again linear in its covariates
Examples, specific cases (2)
3. One-way analysis of variance (ANOVA): Y_{ij} = µ_i + ε_{ij} with i = 1, . . . , k and
j = 1, . . . , n_i
Note that in this model the observations have double indices.
All observations at the same level µ_i differ only by the observational
noise, so in a regression model they should share common values of the
explanatory variables.
The model becomes Y_{ij} = ∑_{ℓ=1}^k x_{ij,ℓ} µ_ℓ + ε_{ij}, where we take the indices i and
j together into a single index ij, and x_{ij,ℓ} takes the value 1 if ℓ = i and 0
otherwise.
1.2 Algebraic solution in a Euclidean space
Model Y = Xβ + ε
We search for β so that Xβ is as close as possible to Y in quadratic norm.
Algebraically, this means that we look for the least squares solution
to the overdetermined system of equations Y = Xβ.
Formally, find β that minimizes ‖Y − Xβ‖₂²
In a Euclidean space (i.e., a space where the notion of orthogonality can be
defined using inner products) that means that the
residual vector e = Y − Xβ
has to be orthogonal (perpendicular) to the approximation (or projection) Xβ
in the subspace spanned by the columns of X.
In other words: the residual vector has minimum norm ⇔ it is orthogonal to
the estimator vector
The normal equation and least squares (LS) estimator
Developing the algebraic interpretation: (Xβ̂)^T (Y − Xβ̂) = 0
This is the case if X^T (Y − Xβ̂) = 0,
from which the normal equation (X^T X)β̂ = X^T Y follows, with solution
β̂ = (X^T X)^{−1} X^T Y
Let µ̂ be the estimator for µ = Xβ; then
µ̂ = X(X^T X)^{−1} X^T Y
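As a side illustration (not part of the slides), a minimal NumPy sketch of the normal-equation solution on simulated data; the dimensions, coefficients and noise level are invented for the example:

```python
import numpy as np

# Simulated data (invented for illustration)
rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept column of ones
beta = np.array([1.0, 2.0, -0.5])
Y = X @ beta + 0.1 * rng.normal(size=n)

# Normal equation: (X^T X) beta_hat = X^T Y; lstsq is the numerically safer route
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)

# The residual is orthogonal to the columns of X, as the slide states
assert np.allclose(X.T @ (Y - X @ beta_hat), 0, atol=1e-8)
mu_hat = X @ beta_hat  # fitted values: projection of Y onto span(X)
```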
Special case: p = 2, simple linear regression
Model Yi = β0 + β1xi,1 + εi, i = 1, . . . , n
Simple linear regression: x0 = 1 and x1 = x.
Explicit formulas for the LS estimators:
β̂_1 = ∑_{i=1}^n (Y_i − Ȳ)(x_{i,1} − x̄_1) / ∑_{i=1}^n (x_{i,1} − x̄_1)²
β̂_0 = Ȳ − β̂_1 x̄_1
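The explicit formulas can be checked against the general least squares solution; a small sketch with invented data:

```python
import numpy as np

rng = np.random.default_rng(12)
x = rng.normal(size=50)                       # single covariate (invented)
Y = 2.0 + 3.0 * x + 0.5 * rng.normal(size=50)

# Explicit simple-regression formulas from the slide
beta1 = np.sum((Y - Y.mean()) * (x - x.mean())) / np.sum((x - x.mean())**2)
beta0 = Y.mean() - beta1 * x.mean()

# They agree with the general LS solution for the design [1, x]
X = np.column_stack([np.ones_like(x), x])
assert np.allclose([beta0, beta1], np.linalg.lstsq(X, Y, rcond=None)[0])
```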
Least squares as an orthogonal projection
The prediction µ̂ is an orthogonal projection with projection matrix
P = X(X^T X)^{−1} X^T, so that µ̂ = PY
Orthogonality: P^T = P (see slide 12)
Projection: PP = P (idempotence)
Pµ = µ, PX = X
Interpretation: the columns of X are eigenvectors with eigenvalue 1. If
X^T Y = 0, then PY = 0 (from the definition of P): such Y are eigenvectors with
eigenvalue 0.
Next slides discuss projections and orthogonal projections in general
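The stated properties of P are easy to verify numerically; a short sketch with an arbitrary simulated design matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(8), rng.normal(size=(8, 2))])  # invented design
P = X @ np.linalg.solve(X.T @ X, X.T)  # P = X (X^T X)^{-1} X^T

assert np.allclose(P, P.T)      # orthogonal projection: symmetric
assert np.allclose(P @ P, P)    # idempotent
assert np.allclose(P @ X, X)    # columns of X: eigenvectors with eigenvalue 1
```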
Projections
P is a projection matrix ⇔ PP = P, i.e., P is idempotent
Properties
1. P has eigenvalues 0 and/or 1
Indeed, let Px = λx with x non-trivial; then λ²x = PPx = Px = λx, hence λ² = λ
2. There exists an orthogonal Q so that P = Q [ I  A ; 0  0 ] Q^T
Indeed, P = E [ I  0 ; 0  0 ] E^{−1} = QR [ I  0 ; 0  0 ] R^{−1} Q^T = Q [ I  R_{11}(R^{−1})_{12} ; 0  0 ] Q^T,
using the QR-decomposition E = QR and the notation of submatrices
R = [ R_{11}  R_{12} ; 0  R_{22} ], and similarly for R^{−1}
3. Singular values of P are either 0 or at least 1
Indeed, the squared singular values of P are the eigenvalues of
PP^T = Q [ I + AA^T  0 ; 0  0 ] Q^T, which are either 0 or 1 + (an eigenvalue of AA^T) ≥ 1
Orthogonal projections
Projection P is an orthogonal projection ⇔ (I − P)^T P = 0
So, the matrix P itself is NOT orthogonal: P^T P ≠ I
Projection P is an orthogonal projection ⇔ P^T = P
Indeed, (I − P)^T P = 0 ⇔ P^T P = P
But then P^T = (P^T P)^T = P^T P = P
Hence P must be symmetric
Residuals
The vector of residuals is e = Y − µ̂ = (I − P)Y (see also slide 7)
The normal equation finds the estimator that minimizes the mean squared
residual e^T e/n = ‖e‖²/n = (1/n) ∑_{i=1}^n e_i²
• The matrix I − P is idempotent: (I − P)(I − P) = I − P
• This matrix has rank n − p: all linear combinations of the columns of X
belong to the kernel of the matrix, since (I − P)X = 0
• An idempotent matrix with rank k has k independent eigenvectors with
eigenvalue 1 and n − k independent eigenvectors with eigenvalue 0. For I − P,
the latter span the column space of X
• The matrix I − P is symmetric, so eigenvectors corresponding to eigenvalue
1 are orthogonal to eigenvectors with eigenvalue 0. Within each of
the two groups of eigenvectors, orthogonalisation (using Gram-Schmidt) is
straightforward.
• So: there exists an orthogonal transform U so that I − P = U^T ΛU with
Λ = diag(1, . . . , 1, 0, . . . , 0)
Orthogonalisation
If X has orthonormal columns, then X^T X = I_p, and thus the projection
becomes
µ̂ = X · X^T · Y
The ith component is then
µ̂_i = ∑_{j=0}^{p−1} ∑_{k=1}^n x_{i,j} x_{k,j} · Y_k
The computation of all n components of µ̂ can proceed in parallel, without
numerical errors propagating from one component to another.
Otherwise, we orthogonalise X
Gram-Schmidt-orthogonalisation
Let x_j, j = 0, . . . , p − 1 be the columns of X, then define
q_0 = x_0 = 1
q_1 = x_1 − (⟨x_1, q_0⟩/‖q_0‖²) · q_0
. . .
q_j = x_j − ∑_{k=0}^{j−1} (⟨x_j, q_k⟩/‖q_k‖²) · q_k
That is: q_j is the residual of the orthogonal projection of x_j onto q_0, . . . , q_{j−1}
This looks like linear regression within the covariate matrix X
The new covariates q_j are orthogonal, but not orthonormal (they can be easily
normalised)
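The recursion above can be transcribed directly; a sketch of classical Gram-Schmidt (in practice a library QR routine is numerically safer):

```python
import numpy as np

def gram_schmidt(X):
    """Classical Gram-Schmidt on the columns of X (orthogonal, not normalised)."""
    n, p = X.shape
    Q = np.zeros((n, p))
    for j in range(p):
        q = X[:, j].copy()
        for k in range(j):
            # subtract the projection of x_j onto q_k
            q -= (X[:, j] @ Q[:, k]) / (Q[:, k] @ Q[:, k]) * Q[:, k]
        Q[:, j] = q
    return Q

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(6), rng.normal(size=(6, 2))])  # invented design
Q = gram_schmidt(X)
G = Q.T @ Q
# off-diagonal entries of Q^T Q vanish: the columns are orthogonal
assert np.allclose(G - np.diag(np.diag(G)), 0, atol=1e-10)
```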
Gram-Schmidt in matrix form
We have
x_j = q_j + ∑_{k=0}^{j−1} (⟨x_j, q_k⟩/‖q_k‖²) · q_k
Apply the scalar product with q_j on both sides to find that ⟨x_j, q_j⟩/‖q_j‖² = 1,
hence x_j = ∑_{k=0}^{j} (⟨x_j, q_k⟩/‖q_k‖²) · q_k
In matrix form: QR-decomposition X = Q · R, with the columns of Q equal to q_j, and
R an upper triangular matrix with entries
R_{k,j} = ⟨x_j, q_k⟩/‖q_k‖²
R ∈ R^{p×p} and R is invertible.
Interpretation: every column of X (every covariate x_j) can be written as a linear
combination of orthogonal covariates, which are the columns of Q (right-multiplication
with R).
Regression after orthogonalisation
Plugging in into the solution of the normal equation:
µ̂ = QR(R^T Q^T QR)^{−1} R^T Q^T · Y = Q(Q^T Q)^{−1} Q^T · Y
This is a projection onto the orthogonal (but not orthonormal) basis Q.
Written out: µ̂ = ∑_{k=0}^{p−1} (⟨Y, q_k⟩/‖q_k‖²) · q_k
The vector of coefficients γ̂ with γ̂_k = ⟨Y, q_k⟩/‖q_k‖²
satisfies µ̂ = Qγ̂ and γ̂ = (Q^T Q)^{−1} Q^T · Y
Relation between γ̂ and β̂:
β̂ = (X^T X)^{−1} X^T · Y = (R^T Q^T QR)^{−1} R^T Q^T · Y = R^{−1}(Q^T Q)^{−1} Q^T · Y
β̂ = R^{−1} γ̂, γ̂ = Rβ̂ (left-multiplication with R)
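A sketch of the same computation with a library QR factorisation; note that `numpy.linalg.qr` returns an orthonormal Q, so ‖q_k‖ = 1 and γ̂ = Q^T Y (the data are invented):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

Q, R = np.linalg.qr(X)               # Q has orthonormal columns here
gamma = Q.T @ Y                      # coefficients in the orthonormal basis
beta_hat = np.linalg.solve(R, gamma) # beta_hat = R^{-1} gamma (back-substitution)

assert np.allclose(beta_hat, np.linalg.lstsq(X, Y, rcond=None)[0])
```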
Orthogonalisation details for p = 2, simple linear regression
Model Y_i = β_0 + β_1 x_{i,1} + ε_i, i = 1, . . . , n
Simple linear regression: x_0 = 1 and x_1 = x.
q_1 = x_1 − (⟨x_1, q_0⟩/‖q_0‖²) · q_0 = x_1 − (⟨x_1, 1⟩/‖1‖²) · 1 = x_1 − (∑_{i=1}^n x_{i,1}/n) · 1 = x_1 − x̄_1 · 1
R = [ 1  x̄_1 ; 0  1 ], R^{−1} = [ 1  −x̄_1 ; 0  1 ]
γ̂_0 = Ȳ, γ̂_1 = ⟨q_1, Y⟩/‖q_1‖² = ⟨x_1 − x̄_1 · 1, Y⟩ / ‖x_1 − x̄_1 · 1‖²
Since the second row of R^{−1} equals [ 0  1 ], we find immediately that
β̂_1 = ⟨x_1 − x̄_1 · 1, Y⟩ / ‖x_1 − x̄_1 · 1‖²
which corresponds to the classical result, slide 9
1.3 Tools for statistical exploration
In order to further explore the properties of the least squares estimator, we
need to establish some classical results in multivariate statistics:
1. Covariance matrix
2. Schur complement
3. Multivariate normal distribution
4. Trace of a matrix
Tool 1: Covariance matrix
Definition
Let X be a p-variate random vector with joint density function f_X(x), joint
(cumulative) distribution function F_X(x), and joint characteristic function
φ_X(t) = E(e^{i t^T X})
µ = E(X)
Σ_X = E[(X − µ)(X − µ)^T]
We have (Σ_X)_{ij} = cov(X_i, X_j)
Σ_X is symmetric
Covariance matrix under linear transformations
If Y = AX, then Σ_Y = AΣ_X A^T
Proof:
It holds that µ_Y = Aµ_X, so
Σ_Y = E[(Y − µ_Y)(Y − µ_Y)^T] = E[(A(X − µ_X))(A(X − µ_X))^T] = AΣ_X A^T
So, with a_i^T the ith row of A, we have cov(Y_i, Y_j) = a_i^T Σ_X a_j = a_j^T Σ_X a_i
Special case: the matrix A has just one row, A = a^T; then Y = a^T X and
σ_Y² = a^T Σ_X a
Covariance matrix is positive (semi-) definite
A symmetric matrix S is called positive (semi-)definite if x^T S x > 0 (≥ 0) for
every vector x ≠ 0
A positive-definite matrix always has real, positive eigenvalues
Σ_X is positive semi-definite, since if a^T Σ_X a < 0, then Y = a^T X would have
a negative variance
We further assume that Σ_X is positive definite.
Eigenvalue decomposition of the covariance matrix
There exists a square, orthogonal matrix U so that Σ_X = UΛU^T, with Λ a
diagonal matrix with the eigenvalues of Σ_X on the diagonal
Furthermore, there exists a matrix A = UΛ^{1/2} so that Σ_X = AA^T
For this matrix it holds that [det(A)]² = det(Σ_X)
Suppose now Z = A^{−1}(X − µ) (with µ = EX), then
EZ = 0 and Σ_Z = A^{−1}(AA^T)A^{−T} = I
The components of Z are thus uncorrelated
Vice versa, if Σ_Z = I and A is an arbitrary matrix, and X = AZ, then
Σ_X = AA^T, and this arbitrary A can then be written as A = UΛ^{1/2}, with
U orthogonal.
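A numerical sketch of the factorisation Σ_X = AA^T and of the whitening transform Z = A^{−1}(X − µ); the covariance matrix is invented for illustration:

```python
import numpy as np

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])          # invented positive definite covariance
lam, U = np.linalg.eigh(Sigma)          # Sigma = U diag(lam) U^T
A = U @ np.diag(np.sqrt(lam))           # A = U Lambda^{1/2}, so Sigma = A A^T
assert np.allclose(A @ A.T, Sigma)

# Z = A^{-1}(X - mu) has identity covariance: A^{-1} Sigma A^{-T} = I
A_inv = np.linalg.inv(A)
Sigma_Z = A_inv @ Sigma @ A_inv.T
assert np.allclose(Sigma_Z, np.eye(2))
```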
The precision matrix or concentration matrix
The inverse of a non-singular covariance matrix plays an important role. It is
known as the precision matrix or the concentration matrix
Notation (capital kappa): K_X = Σ_X^{−1}
In the context of the multivariate normal distribution, we will need to develop
the quadratic form u^T K_X u, where u = [ u_1 ; u_2 ] and K_X = [ K_{11}  K_{12} ; K_{21}  K_{22} ]:
u^T K_X u = u_1^T K_{11} u_1 + u_1^T K_{12} u_2 + u_2^T K_{21} u_1 + u_2^T K_{22} u_2
= u_1^T K_{11} u_1 + 2 u_1^T K_{12} u_2 + u_2^T K_{22} u_2
(where we used the symmetry of K_X, i.e., K_{21} = K_{12}^T)
Tool 2: The Schur complement of a matrix block
Definition:
Given the matrix S = [ S_{11}  S_{12} ; S_{21}  S_{22} ], the Schur complement of submatrix S_{11} is
given by S_{1|2} = S_{11} − S_{12} S_{22}^{−1} S_{21}
Property: S_{1|2} = [(S^{−1})_{11}]^{−1}
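The property can be checked numerically on a random symmetric positive definite matrix (invented here) before going through the algebraic proof:

```python
import numpy as np

rng = np.random.default_rng(5)
M = rng.normal(size=(5, 5))
S = M @ M.T + np.eye(5)          # symmetric positive definite, so blocks invert
S11, S12 = S[:2, :2], S[:2, 2:]
S21, S22 = S[2:, :2], S[2:, 2:]

schur = S11 - S12 @ np.linalg.solve(S22, S21)   # S_{1|2}
inv_block = np.linalg.inv(S)[:2, :2]            # (S^{-1})_{11}
assert np.allclose(schur, np.linalg.inv(inv_block))
```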
Proof of S_{1|2} = [(S^{−1})_{11}]^{−1}
Construct L = [ I_{11}  0 ; −S_{22}^{−1}S_{21}  I_{22} ], then L^{−1} = [ I_{11}  0 ; +S_{22}^{−1}S_{21}  I_{22} ]
and
SL = [ S_{11}  S_{12} ; S_{21}  S_{22} ] [ I_{11}  0 ; −S_{22}^{−1}S_{21}  I_{22} ] = [ S_{11} − S_{12}S_{22}^{−1}S_{21}  S_{12} ; 0  S_{22} ]
= [ S_{1|2}  S_{12} ; 0  S_{22} ] = [ I_{11}  S_{12}S_{22}^{−1} ; 0  I_{22} ] [ S_{1|2}  0 ; 0  S_{22} ]
So, S = [ I_{11}  S_{12}S_{22}^{−1} ; 0  I_{22} ] [ S_{1|2}  0 ; 0  S_{22} ] [ I_{11}  0 ; S_{22}^{−1}S_{21}  I_{22} ]
and
S^{−1} = [ I_{11}  0 ; −S_{22}^{−1}S_{21}  I_{22} ] [ S_{1|2}^{−1}  0 ; 0  S_{22}^{−1} ] [ I_{11}  −S_{12}S_{22}^{−1} ; 0  I_{22} ] = [ S_{1|2}^{−1}  . . . ; . . .  . . . ]
Tool 3: The multivariate normal distribution
Definition: multivariate normal distribution
X ∼ N(µ, Σ_X) ⇔ X = µ + AZ with AA^T = Σ_X and Z_i i.i.d. ∼ N(0, 1)
As the components of Z are uncorrelated, and (here) normally distributed,
they must be mutually independent.
The joint density function:
f_X(x) = 1/((2π)^{p/2} √det(Σ_X)) · e^{−(1/2)(x−µ)^T Σ_X^{−1}(x−µ)}
The characteristic function:
φ_X(t) = exp(i t^T µ − (1/2) t^T Σ_X t)
Properties of the multivariate normal distribution
Property 1: If X ∼ N(µ, Σ_X), then (X − µ)^T Σ_X^{−1} (X − µ) ∼ χ²_p
This follows from (X − µ)^T Σ_X^{−1} (X − µ) = (X − µ)^T A^{−T} A^{−1} (X − µ) = Z^T Z =
∑_{i=1}^p Z_i², with Z_i² ∼ χ²_1 and the Z_i mutually independent
Property 2:
If X ∼ N(µ, Σ_X), then for C ∈ R^{q×p} with q ≤ p: CX ∼ N(Cµ, CΣ_X C^T)
Special cases
C = a^T ∈ R^{1×p}: X ∼ N(µ, Σ_X) ⇔ a^T X ∼ N(a^T µ, a^T Σ_X a)
C = [ 0  0 ; 0  I ]: if X = [ X_1 ; X_2 ] ∼ N(µ, Σ_X) with µ = [ µ_1 ; µ_2 ] and
Σ_X = [ Σ_{11}  Σ_{12} ; Σ_{21}  Σ_{22} ], then X_2 ∼ N(µ_2, Σ_{22})
Properties of the multivariate normal distribution
Property 3:
Suppose X = [ X_1 ; X_2 ] and X ∼ N(µ, Σ_X) with µ = [ µ_1 ; µ_2 ] and
Σ_X = [ Σ_{11}  Σ_{12} ; Σ_{21}  Σ_{22} ] and det(Σ_{22}) ≠ 0; then
X_1 | X_2 = x_2 ∼ N(µ_{1|2}, Σ_{1|2})
with µ_{1|2} = µ_1 + Σ_{12}Σ_{22}^{−1}(x_2 − µ_2) and Σ_{1|2} = Σ_{11} − Σ_{12}Σ_{22}^{−1}Σ_{21}
It holds that (see slide 25) Σ_{1|2}^{−1} = (Σ^{−1})_{11} = K_{11} and Σ_{12}Σ_{22}^{−1} = −K_{11}^{−1}K_{12}
Note also that Σ_{21} = Σ_{12}^T
Marginal and conditional covariance
(further details of slide 29)
Suppose X = [ X_1 ; X_2 ] is normally distributed with mean µ = [ µ_1 ; µ_2 ] and covariance
matrix Σ_X = [ Σ_{11}  Σ_{12} ; Σ_{21}  Σ_{22} ] and det(Σ_{22}) ≠ 0; then
Marginal covariance (holds in general):
cov(X_1) = Σ_{11}
Conditional covariance = Schur complement (holds for normal RV):
cov(X_1 | X_2) = Σ_{1|2} = Σ_{11} − Σ_{12}Σ_{22}^{−1}Σ_{21} = [(Σ^{−1})_{11}]^{−1}
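A sketch checking the Schur-complement identity for the conditional covariance on an invented 3-variate covariance matrix (X_1 is the first component, X_2 the last two):

```python
import numpy as np

rng = np.random.default_rng(6)
mu = np.array([0.0, 1.0, -1.0])         # invented mean
M = rng.normal(size=(3, 3))
Sigma = M @ M.T + np.eye(3)             # invented positive definite covariance

S11, S12 = Sigma[:1, :1], Sigma[:1, 1:]
S21, S22 = Sigma[1:, :1], Sigma[1:, 1:]
x2 = np.array([0.5, -0.2])              # invented conditioning value

mu_cond = mu[:1] + S12 @ np.linalg.solve(S22, x2 - mu[1:])
Sigma_cond = S11 - S12 @ np.linalg.solve(S22, S21)

# Schur-complement identity: Sigma_{1|2} = [(Sigma^{-1})_{11}]^{-1}
K = np.linalg.inv(Sigma)
assert np.allclose(Sigma_cond, np.linalg.inv(K[:1, :1]))
assert mu_cond.shape == (1,)
```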
Conditional mean & covariance for normal random vectors
We now prove the result on slide 29.
We know: if X = [ X_1 ; X_2 ] is normal, then X_2 is normal as well (see slide 28).
So, denoting U = X − µ,
f_{U_1|U_2}(u_1|u_2) = f_U(u) / f_{U_2}(u_2) ∝ exp[−(u^T Σ^{−1} u − u_2^T Σ_{22}^{−1} u_2)/2]
Now, with K = Σ^{−1}, we substitute u^T Σ^{−1} u = u_1^T K_{11} u_1 + 2u_1^T K_{12} u_2 + u_2^T K_{22} u_2
(see slide 24) and Σ_{22}^{−1} = [(K^{−1})_{22}]^{−1} = K_{22} − K_{21}K_{11}^{−1}K_{12}:
−2 log f_{U_1|U_2}(u_1|u_2)
= u_1^T K_{11} u_1 + 2u_1^T K_{12} u_2 + u_2^T K_{22} u_2 − u_2^T K_{22} u_2 + u_2^T K_{21}K_{11}^{−1}K_{12} u_2 + constant
= u_1^T K_{11} u_1 + 2u_1^T K_{12} u_2 + u_2^T K_{21}K_{11}^{−1}K_{12} u_2 + constant
= (u_1 + K_{11}^{−1}K_{12} u_2)^T K_{11} (u_1 + K_{11}^{−1}K_{12} u_2) + constant
From this, we conclude E(U_1|U_2 = u_2) = −K_{11}^{−1}K_{12} u_2
and cov(U_1|U_2 = u_2) = K_{11}^{−1} = Σ_{11} − Σ_{12}Σ_{22}^{−1}Σ_{21}
Conditional mean & covariance for normal random vectors
(2)
(Proof, continued)
And so,
E(X_1|X_2 = x_2) = E(µ_1 + U_1 | X_2 − µ_2 = x_2 − µ_2)
= µ_1 + E(U_1|U_2 = x_2 − µ_2)
= µ_1 − K_{11}^{−1}K_{12}(x_2 − µ_2)
From Σ_X K_X = I we find
(1) Σ_{11}K_{12} + Σ_{12}K_{22} = 0 and
(2) Σ_{21}K_{12} + Σ_{22}K_{22} = I ⇔ Σ_{21}K_{12} = I − Σ_{22}K_{22}
(1) and (2) are used in the following argument (with K_{11}^{−1} = Σ_{1|2}):
K_{11}^{−1}K_{12} = Σ_{11}K_{12} − Σ_{12}Σ_{22}^{−1}Σ_{21}K_{12}
= Σ_{11}K_{12} − Σ_{12}Σ_{22}^{−1}(I − Σ_{22}K_{22})
= Σ_{11}K_{12} − Σ_{12}Σ_{22}^{−1} + Σ_{12}Σ_{22}^{−1}Σ_{22}K_{22} = Σ_{11}K_{12} + Σ_{12}K_{22} − Σ_{12}Σ_{22}^{−1} = −Σ_{12}Σ_{22}^{−1}
Tool 4: Trace of a matrix
Tr(A) = ∑_{i=1}^p A_{ii}
Tr(AB) = Tr(BA) (A ∈ R^{p×q}, B ∈ R^{q×p})
Holds for both square (p = q) and rectangular matrices with matching sizes
Proof:
Tr(AB) = ∑_{i=1}^p (AB)_{ii} = ∑_{i=1}^p ∑_{j=1}^q A_{ij}B_{ji} = ∑_{j=1}^q ∑_{i=1}^p B_{ji}A_{ij} = ∑_{j=1}^q (BA)_{jj} = Tr(BA)
Corollary: for A, B, C square matrices, Tr(ABC) = Tr(BCA) = Tr(CAB)
The trace remains unchanged under cyclic permutation, so NOT under every permutation:
Tr(ABC) ≠ Tr(BAC) in general
This holds also for more than 3 matrices, and also for rectangular matrices with
matching dimensions
Tr(A + B) = Tr(A) + Tr(B)
Attention: Tr(A · B) ≠ Tr(A) · Tr(B), but det(A · B) = det(A) det(B)
Trace and eigenvalues
If A = EΛE^{−1}, with Λ = diag(λ_i), then Tr(A) = ∑_{i=1}^n λ_i and det(A) = ∏_{i=1}^n λ_i
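A quick numerical check of the cyclic property with random 3 × 3 matrices (invented data):

```python
import numpy as np

rng = np.random.default_rng(7)
A, B, C = (rng.normal(size=(3, 3)) for _ in range(3))

# cyclic permutations leave the trace unchanged
assert np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A))
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))
# but a non-cyclic permutation, e.g. Tr(BAC), generally differs
```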
Maximum likelihood for regression
Model Y = Xβ + ε (random variable Y instead of X in the previous slides)
Assume ε ∼ N.I.D.(0, σ²); then the joint observational density is multivariate
normal, more precisely L(β, σ²; Y) = (2πσ²)^{−n/2} e^{−(1/(2σ²))(Y−Xβ)^T(Y−Xβ)}
To be maximised over β and σ².
log L(β, σ²; Y) = −(n/2) log(2πσ²) − (1/(2σ²))(Y − Xβ)^T(Y − Xβ)
The dependence on β is only through the second term, so β maximises L(β, σ²; Y) or
log L(β, σ²; Y) iff β minimises
(Y − Xβ)^T(Y − Xβ) = ‖Y − Xβ‖₂²
This is least squares.
Conclusion: for normal errors, the least squares solution is the maximum
likelihood solution.
1.4 Statistical properties of LS estimators
Model Y = Xβ + ε; least squares: β̂ = (X^T X)^{−1} X^T Y
Property 1: unbiased
If ε has zero mean, then Eβ̂ = β
(The errors do not need to be uncorrelated or normally distributed)
Property 2: covariance
If ε is uncorrelated and homoscedastic, then Σ_β̂ = σ²(X^T X)^{−1}
(Normality and zero mean not required)
Property 3: normality
If ε is normally distributed, then β̂ is normally distributed
Indeed: the estimator is a linear combination of normally distributed random
variables
(No need for uncorrelated, zero mean errors)
Marginal distributions: β̂_j ∼ N(β_j, σ²(X^T X)^{−1}_{jj})
Property 4: best linear unbiased estimator (BLUE)
Suppose
• Zero mean errors: E(ε) = 0
• Uncorrelated, homoscedastic errors: Σ_ε = σ²I, i.e., var(ε_i) = σ² and
cov(ε_i, ε_j) = 0 for i ≠ j
(not necessarily independent)
• Normality is not necessary
While β̂ = (X^T X)^{−1} X^T Y, let β̃ = AY = [(X^T X)^{−1} X^T + ∆]Y be another
unbiased, linear estimator; then:
Gauss-Markov theorem
For any choice of c^T, under the assumptions above, MSE(c^T β̃) ≥ MSE(c^T β̂)
In other words, for any A so that E(AY) = β (unbiased), it holds that
Σ_β̃ − Σ_β̂ is positive semi-definite
Note: MSE(θ̂) = E[(θ̂ − θ)²] = [E(θ̂) − θ]² + var(θ̂)
Gauss-Markov theorem: proof
As E(AY) = β, we have to prove that var(c^T β̃) ≥ var(c^T β̂)
Using the linearity, we find, for any (arbitrary) value of β:
β = E(AY) = AE(Y) = A(Xβ) = (AX)β,
This amounts to AX = I: A is a left inverse of the n × p rectangular X
So is the least squares projection (X^T X)^{−1} X^T.
Defining ∆ = A − (X^T X)^{−1} X^T, we thus have ∆X = O
The covariance matrix of β̂ is σ²(X^T X)^{−1} (slide 35), so
var(c^T β̃) − var(c^T β̂) = σ² c^T (AA^T − (X^T X)^{−1}) c
We have
AA^T − (X^T X)^{−1} = [(X^T X)^{−1} X^T + ∆][(X^T X)^{−1} X^T + ∆]^T − (X^T X)^{−1}
= (X^T X)^{−1} X^T ∆^T + ∆X(X^T X)^{−1} + ∆∆^T
= ∆∆^T
The quadratic form c^T ∆∆^T c = ‖∆^T c‖² is always non-negative, thereby completing
the proof
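The structure of the proof can be replayed numerically: any unbiased linear estimator equals the LS one plus a ∆ with ∆X = 0, and the excess covariance is ∆∆^T. A sketch with an invented ∆ (constructed as M(I − P), which always satisfies ∆X = 0):

```python
import numpy as np

rng = np.random.default_rng(13)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # invented design
P = X @ np.linalg.solve(X.T @ X, X.T)

A_ls = np.linalg.solve(X.T @ X, X.T)                 # least squares left inverse
Delta = rng.normal(size=(p, n)) @ (np.eye(n) - P)    # any M(I - P) has Delta @ X = 0
A = A_ls + Delta

assert np.allclose(A @ X, np.eye(p))                 # so A Y is also unbiased
# covariance excess A A^T - (X^T X)^{-1} equals Delta Delta^T, pos. semi-definite
excess = A @ A.T - np.linalg.inv(X.T @ X)
assert np.allclose(excess, Delta @ Delta.T)
assert np.all(np.linalg.eigvalsh(excess) > -1e-10)
```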
Remark: the expected squared loss
Definition: risk or expected squared loss R(β̂) = (1/n) E(‖β̂ − β‖²)
This definition satisfies R(β̂) = (1/n) Tr(Σ_β̂) + (1/n)‖Eβ̂ − β‖²
We know (slide 22): Σ_β̃ − Σ_β̂ pos. semi-def. ⇔ it has non-negative eigenvalues
All eigenvalues non-negative ⇒ their sum (= trace) non-negative:
Σ_β̃ − Σ_β̂ pos. semi-def. ⇒ Tr(Σ_β̃ − Σ_β̂) ≥ 0,
Hence, for least squares estimators:
Property 4, BLUE: Σ_β̃ − Σ_β̂ pos. semi-def. ⇒ R(β̂) ≤ R(β̃)
The risk is sometimes referred to as MSE, but working with the MSE of any linear
combination MSE(c^T β̃) as on slide 36 puts a stronger condition
On the other hand, if we can find an example where R(β̂) ≥ R(β̃), then Σ_β̃ − Σ_β̂
cannot possibly be positive semi-definite
Stein’s phenomenon
The proof of the Gauss-Markov theorem hinges on the fact that
(X^T X)^{−1} X^T ∆^T = 0.
This is due to the linearity and unbiasedness of the estimator AY and to the
specific form of the LS estimator β̂ = (X^T X)^{−1} X^T Y.
Stein’s phenomenon
Biased estimators may have uniformly (i.e., for any true value) lower risk
Stein’s example
• Y = µ + ε; we want to estimate µ (i.e., no regression, just a signal-plus-
noise model)
• ε ∼ N_n(0, σ²I): the noise is normal, homoscedastic and uncorrelated (hence
i.i.d. normal)
The BLUE estimator is µ̂ = Y.
Its risk (slide 38) is R(Y) = σ²
A biased, shrinkage estimator: µ̂_i = Y_i(1 − α/‖Y‖²)
The expected squared loss of a shrinkage estimator
R(µ̂) = (1/n) ∑_{i=1}^n E(Y_i − µ_i)² − (2α/n) E( ∑_{i=1}^n (Y_i − µ_i)Y_i / ‖Y‖² ) + (α²/n) E( ∑_{i=1}^n Y_i² / ‖Y‖⁴ )
= σ² − (2α/n) E( (Y − µ)^T Y / ‖Y‖² ) + (α²/n) E( 1/‖Y‖² )
Now we use Stein’s lemma: if X ∼ N(µ, σ²), then E[(X − µ)h(X)] = σ² E[h′(X)],
with h(x) continuous
Here: X = Y_i, h(X) = h_i(Y) = Y_i/‖Y‖², so h′(X) = ∂h_i(Y)/∂Y_i = (‖Y‖² − 2Y_i²)/‖Y‖⁴
Using the vector h(Y) = Y/‖Y‖²:
E( (Y − µ)^T Y / ‖Y‖² ) = E[(Y − µ)^T h(Y)] = σ² ∑_{i=1}^n E[∂h_i(Y)/∂Y_i]
= σ² ∑_{i=1}^n E( 1/‖Y‖² − 2Y_i²/‖Y‖⁴ ) = σ² E( n/‖Y‖² − 2/‖Y‖² )
Suppose n > 2 and take α = n − 2; then R(µ̂) ≤ σ² = R(Y)
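A small Monte-Carlo sketch of Stein's example (σ = 1 assumed, so α = n − 2; the true mean vector and simulation sizes are invented):

```python
import numpy as np

rng = np.random.default_rng(8)
n, reps = 20, 2000
mu = rng.normal(size=n)                         # arbitrary invented true mean
Y = mu + rng.normal(size=(reps, n))             # sigma = 1

alpha = n - 2                                   # Stein's choice
shrink = 1 - alpha / np.sum(Y**2, axis=1, keepdims=True)
mu_js = shrink * Y                              # mu_i = Y_i (1 - alpha/||Y||^2)

risk_mle = np.mean(np.sum((Y - mu)**2, axis=1)) / n      # approx. sigma^2 = 1
risk_js = np.mean(np.sum((mu_js - mu)**2, axis=1)) / n
assert risk_js < risk_mle                       # shrinkage beats the BLUE on average
```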
A proof of Stein’s Lemma
If X ∼ N(µ, σ²), then E[(X − µ)h(X)] = σ² E[h′(X)], with h(x) continuous (not
necessarily differentiable in all points)
Proof: Let φ_σ(x) be the density of a zero mean normal RV with variance σ²; then
φ′_σ(x) = −xφ_σ(x)/σ², i.e., xφ_σ(x) = −σ²φ′_σ(x).
Then the density of X ∼ N(µ, σ²) is f_X(x) = φ_σ(x − µ), and so,
E[(X − µ)h(X)] = ∫_{−∞}^{∞} (x − µ)h(x)φ_σ(x − µ)dx
= −σ² ∫_{−∞}^{∞} h(x)φ′_σ(x − µ)dx
= −σ² [h(x)φ_σ(x − µ)]_{−∞}^{∞} + σ² ∫_{−∞}^{∞} h′(x)φ_σ(x − µ)dx
= 0 + σ² E[h′(X)]
Stein’s phenomenon: discussion
Remarkable result: a set of observations Y = µ + ε, without any assumed
connection among the components of µ (no regression; no correlation), is better
estimated if the components are taken together
It is a non-linear estimator, because the shrinkage factor depends on the observations
The next example is about a regression model. We will use a linear, but biased
estimator: β̃ = AY, with E(AY) ≠ β
Example 2: ridge regression/Tikhonov regularisation
If X has nearly linearly dependent columns, then X^T X is nearly singular
The variances of β̂ can be arbitrarily large (see slide 35)
We give up unbiasedness for the sake of variance reduction: regularisation of
a (nearly) singular system
The least squares problem min_β ‖Y − Xβ‖₂² leads to the (nearly) singular normal
equation (X^T X)β = X^T Y
We regularise: min_β ‖Y − Xβ‖₂² + ‖Lβ‖₂²
Very often, one takes L = √λ I, so min_β ‖Y − Xβ‖₂² + λ‖β‖₂²
The solution is
β̂ = (X^T X + L^T L)^{−1} X^T Y or β̂ = (X^T X + λI)^{−1} X^T Y
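A sketch of the ridge estimator on a deliberately near-collinear design; λ is an arbitrary illustrative value:

```python
import numpy as np

rng = np.random.default_rng(9)
n, p, lam = 30, 4, 5.0
X = rng.normal(size=(n, p))
X[:, 3] = X[:, 2] + 1e-3 * rng.normal(size=n)   # nearly collinear columns (invented)
beta = np.array([1.0, 0.0, 2.0, -2.0])
Y = X @ beta + rng.normal(size=n)

beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
beta_ls = np.linalg.lstsq(X, Y, rcond=None)[0]

# ridge shrinks the estimate: strictly smaller norm than least squares for lam > 0
assert np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ls)
```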
Covariance matrix of the regularised solution
Consider the regularised estimator as a function of λ: β̂(λ) = (X^T X + λI)^{−1} X^T Y
Then Σ_β̂(λ) = σ²(X^T X + λI)^{−1}(X^T X)(X^T X + λI)^{−1}
Let X^T X = UΛU^T, where Λ is a diagonal matrix; then
Σ_β̂(λ) = σ² U(Λ + λI)^{−1} U^T UΛU^T U(Λ + λI)^{−1} U^T = σ² U(Λ + λI)^{−2} ΛU^T
The effect of the regularisation can be described by
Σ_β̂(0) − Σ_β̂(λ) = σ² U[Λ^{−2} − (Λ + λI)^{−2}]ΛU^T
The diagonal matrix D = [Λ^{−2} − (Λ + λI)^{−2}]Λ has elements
D_{ii} = 1/Λ_{ii} − Λ_{ii}/(Λ_{ii} + λ)² ≥ 0
Hence, Σ_β̂(0) − Σ_β̂(λ) is positive definite, meaning that, for any c,
var(c^T β̂(0)) = c^T Σ_β̂(0) c ≥ c^T Σ_β̂(λ) c = var(c^T β̂(λ))
The bias of the regularised solution
We check if/when the decrease in variance dominates the increase in bias
bias(β̂(λ)) = E(β̂(λ)) − β = (X^T X + λI)^{−1} X^T Xβ − β
= (X^T X + λI)^{−1} [X^T X − (X^T X + λI)]β
= −λ(X^T X + λI)^{−1} β
bias(c^T β̂(λ)) = −λ c^T (X^T X + λI)^{−1} β depends on c and β
No general conclusion
We investigate the risk of the regularised estimator. If the risk for a nonzero λ
is smaller than for λ = 0, then λ = 0 cannot possibly minimise MSE(c^T β̂(λ))
for every c (see slide 38)
The risk of a ridge regression
R(β̂(λ)) = (1/n) Tr(Σ_β̂(λ)) + (1/n)‖Eβ̂(λ) − β‖₂²
= (σ²/n) ∑_{i=1}^n Λ_{ii}/(Λ_{ii} + λ)² + (λ²/n)‖(X^T X + λI)^{−1} β‖₂²
≤ (σ²/n) ∑_{i=1}^n Λ_{ii}/(Λ_{ii} + λ)² + (λ²/n)‖(X^T X + λI)^{−1}‖_F² ‖β‖₂²
= (σ²/n) ∑_{i=1}^n Λ_{ii}/(Λ_{ii} + λ)² + (λ²‖β‖₂²/n) ∑_{i=1}^n 1/(Λ_{ii} + λ)²
= (1/n) ∑_{i=1}^n (Λ_{ii}σ² + λ²‖β‖₂²)/(Λ_{ii} + λ)²
All terms in this sum have a negative derivative with respect to λ at λ = 0, hence:
For any vector β, there exists a parameter λ > 0 so that R(β̂(λ)) < R(β̂(0))
Risk as a function of λ
[Figure: the risk R(λ) as the sum of a decreasing variance term var(λ) and an increasing squared-bias term bias²(λ)]
Note that the exact expression of R(β̂(λ)) depends on the unknown β: this has
to be estimated in practice (see later)
Variance estimation in a linear model
S² = (Y − Xβ̂)^T(Y − Xβ̂)/(n − p) = e^T e/(n − p) (e = residual, see slide 13)
Note that the first factor is transposed (↔ expression of the covariance matrix)
ES² = σ²
Proof: We know µ̂ = Xβ̂ = PY,
so Y − Xβ̂ = (I − P)Y = (I − P)(Xβ + ε) = (I − P)ε
and so
E[(Y − Xβ̂)^T(Y − Xβ̂)] = E[ε^T(I − P)^T(I − P)ε] = E[ε^T(I − P)ε]
= E[Tr(ε^T(I − P)ε)] = E[Tr((I − P)(εε^T))]
= Tr[(I − P)E(εε^T)] = Tr(I − P)σ² = (n − p)σ²
(Using that scalar = trace(scalar) and then cyclic permutation within the trace, see
slide 33, plus the properties of P, see slide 10)
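A Monte-Carlo sketch confirming ES² = σ² (the design, σ² and replication count are invented):

```python
import numpy as np

rng = np.random.default_rng(10)
n, p, sigma2, reps = 25, 3, 4.0, 4000
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, -1.0, 0.5])
P = X @ np.linalg.solve(X.T @ X, X.T)

eps = np.sqrt(sigma2) * rng.normal(size=(reps, n))
Y = X @ beta + eps
resid = Y @ (np.eye(n) - P)               # residuals e = (I - P) Y, one row per replicate
S2 = np.sum(resid**2, axis=1) / (n - p)   # unbiased variance estimator

assert abs(np.mean(S2) - sigma2) < 0.2    # E[S^2] = sigma^2, up to simulation noise
```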
The sum of squared residuals is a biased estimator
Interpretation
The mean squared residual e^T e/n underestimates the expected squared error
σ² = E(ε_i²). This is because the least squares method minimizes the mean
squared residual, and so even the true model β has a larger mean squared
residual than the estimated model.
The minimum mean squared residual is therefore always smaller than the
mean squared error in the true model:
(1/n)‖Y − µ̂‖² ≤ (1/n)‖Y − µ‖²
⇒ (1/n)E(‖Y − µ̂‖²) ≤ (1/n)E(‖Y − µ‖²)
⇒ (1/n)E(e^T e) ≤ (1/n)E(ε^T ε) = σ²
Distribution of the variance estimator
Theorem
(n − p)S²/σ² ∼ χ²_{n−p}
• There exists an orthogonal transform U so that I − P = U^T ΛU with Λ =
diag(1, . . . , 1, 0, . . . , 0) (see slide 13)
• We know from the proof of ES² = σ² that
(n − p)S²/σ² = (Y − Xβ̂)^T(Y − Xβ̂)/σ² = (ε/σ)^T (I − P)(ε/σ) = (ε/σ)^T U^T ΛU(ε/σ)
• ε/σ is a vector of independent, standard normal random variables: Σ_{ε/σ} =
I. After application of the orthogonal transform Z = U(ε/σ), we have Σ_Z =
UIU^T = I, so Z is also a vector of independent, standard normal random
variables Z_i, so (n − p)S²/σ² = ∑_{i=1}^{n−p} Z_i² ∼ χ²_{n−p}
Hypothesis testing - likelihood ratio
We want to test H_0: Aβ = c with A ∈ R^{q×p}, where q ≤ p and the rank of A equal
to its number of rows q
Example: c = 0 and A = [ I  0 ]. This leads to the hypothesis test of
whether the first q coefficients β_j equal zero (other subsets of the β_j are possible)
As σ is unknown, it is a component of the parameter vector θ = (β, σ²) to be
estimated.
Let Ω = {θ} be the space of all values that the parameter vector can
take and Ω_0 = {θ ∈ Ω | Aβ = c} the subspace of values that satisfy the null
hypothesis.
Likelihood ratio
Λ = max_{θ_0 ∈ Ω_0} L(θ_0; Y) / max_{θ ∈ Ω} L(θ; Y)
Likelihood ratio
Theorem (from mathematical statistics)
If n → ∞, then, under H_0, −2 log Λ →d χ²_{ν−ν_0}
ν is the dimension of the space Ω, ν_0 is the dimension of Ω_0: in our case ν = p,
ν_0 = p − q
Maximum likelihood estimator in multiple regression
L(β, σ²; Y) = (2πσ²)^{−n/2} e^{−(1/(2σ²))(Y−Xβ)^T(Y−Xβ)}
log L(β, σ²; Y) = −(n/2) log(2πσ²) − (1/(2σ²))(Y − Xβ)^T(Y − Xβ)
Taking derivatives for σ² yields
σ̂² = (1/n)(Y − Xβ̂)^T(Y − Xβ̂) = ((n − p)/n) S², where β̂ follows from taking
derivatives for β_j (= least squares as discussed before; but we don’t need this for
the subsequent analysis)
Plugging in leads to L(β̂, σ̂²; Y) = (2πσ̂²)^{−n/2} e^{−n/2}
Likelihood ratio (2)
Maximum likelihood estimator within Ω_0
Maximise L(β, σ²; Y) under the restriction that Aβ = c
Technique: Lagrange multipliers
Find the stationary points of LL_0(β, σ², λ) = log L(β, σ²; Y) + 2λ^T(Aβ − c)
The term with λ^T does not depend on the parameter σ², so we get the same
expression as before for σ̂₀²:
σ̂₀² = (1/n)(Y − Xβ̂_0)^T(Y − Xβ̂_0)
where β̂_0 is now found by ∇_β LL_0(β, σ², λ) = 0
Use ∇_x ‖Bx − b‖₂² = 2B^T(Bx − b) and ∇_x(d^T x) = d (mind the transpose!)
2X^T(Xβ − Y) − 2A^Tλ = 0, which is (X^T X)β = X^T Y + A^Tλ, or β = β̂ + (X^T X)^{−1} A^Tλ
Substitution into Aβ = c leads to elimination of λ:
β̂_0 = β̂ + (X^T X)^{−1} A^T [A(X^T X)^{−1} A^T]^{−1} (c − Aβ̂)
And for this value of β̂_0, the likelihood still simplifies to L(β̂_0, σ̂₀²; Y) = (2πσ̂₀²)^{−n/2} e^{−n/2}
Likelihood ratio (3)
Expression for the likelihood ratio:
Λ = (σ̂²/σ̂₀²)^{n/2}
Large values of −2 log Λ, i.e., small values of Λ, lead to rejection of the null
hypothesis.
Variance estimator under reduced model (1)
Starting from the reduced model H_0: Aβ = c with A ∈ R^{q×p}, where q ≤ p and the
rank of A equal to its number of rows q
If H_0 holds, then β̂_0 = β̂ + (X^T X)^{−1} A^T [A(X^T X)^{−1} A^T]^{−1} (c − Aβ̂)
Define µ̂ = Xβ̂ and µ̂_0 = Xβ̂_0,
then µ̂_0 − µ̂ = X(β̂_0 − β̂) = X(X^T X)^{−1} A^T [A(X^T X)^{−1} A^T]^{−1} (c − Aβ̂)
and (Y − µ̂)^T(µ̂_0 − µ̂)
= [(I − X(X^T X)^{−1} X^T)Y]^T (µ̂_0 − µ̂)
= Y^T (I − X(X^T X)^{−1} X^T) X(X^T X)^{−1} A^T [A(X^T X)^{−1} A^T]^{−1} (c − Aβ̂)
= Y^T [(I − X(X^T X)^{−1} X^T) X(X^T X)^{−1}] · A^T [A(X^T X)^{−1} A^T]^{−1} (c − Aβ̂)
= Y^T [X(X^T X)^{−1} − X(X^T X)^{−1}(X^T X)(X^T X)^{−1}] · A^T [A(X^T X)^{−1} A^T]^{−1} (c − Aβ̂)
= Y^T · 0 · A^T [A(X^T X)^{−1} A^T]^{−1} (c − Aβ̂) = 0
Conclusion for Y − µ̂_0 = (Y − µ̂) + (µ̂ − µ̂_0): ‖Y − µ̂_0‖² = ‖Y − µ̂‖² + ‖µ̂_0 − µ̂‖²
Variance estimator under reduced model (2)
Conclusion from the previous slide:
‖Y − µ̂_0‖² = ‖Y − µ̂‖² + ‖µ̂_0 − µ̂‖²
Interpretation (Pythagoras): µ̂_0 and µ̂ are projections of Y onto the column space
of X, µ̂ being the orthogonal projection. Hence, the triangle with vertices µ̂_0, µ̂
and Y has a right angle at µ̂.
So
σ̂₀² = (1/n)‖Y − µ̂_0‖²
= (1/n)‖Y − µ̂‖² + (1/n)‖µ̂_0 − µ̂‖²
= σ̂² + (1/n)‖X(X^T X)^{−1} A^T [A(X^T X)^{−1} A^T]^{−1} (c − Aβ̂)‖²
= σ̂² + (1/n)(c − Aβ̂)^T [A(X^T X)^{−1} A^T]^{−T} A(X^T X)^{−T} X^T X(X^T X)^{−1} A^T [A(X^T X)^{−1} A^T]^{−1} (c − Aβ̂)
= σ̂² + (1/n)(c − Aβ̂)^T [A(X^T X)^{−1} A^T]^{−1} [A(X^T X)^{−1} A^T] [A(X^T X)^{−1} A^T]^{−1} (c − Aβ̂)
Conclusion: σ̂₀² − σ̂² = (1/n)(c − Aβ̂)^T [A(X^T X)^{−1} A^T]^{−1} (c − Aβ̂)
An equivalent F -test
The distribution of −2 log Λ is only asymptotically χ². The expression of −2 log Λ
leads to logarithms of sample variances. Intuitively, a ratio of sample variances
suggests the use of an F-test, which is exact, and without the need of taking
a log.
∼ ANOVA
Is the “extra sum of squares” (the part of the variance that is explained by
giving up the restrictions imposed by the null hypothesis) significant?
Formally: F = ((σ̂₀² − σ̂²)/q) / (σ̂²/(n − p))
We have: F = ((n − p)/q) · (σ̂₀² − σ̂²)/σ̂² = ((n − p)/q)(σ̂₀²/σ̂² − 1) = ((n − p)/q)(Λ^{−2/n} − 1)
We now show that, under H_0, F ∼ F_{q,n−p}
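A Monte-Carlo sketch of the F statistic under H_0, comparing its empirical mean to the mean (n − p)/(n − p − 2) of the F_{q,n−p} distribution (design, hypothesis and simulation sizes are invented):

```python
import numpy as np

rng = np.random.default_rng(11)
n, p, q, reps = 30, 4, 2, 3000
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
A = np.zeros((q, p)); A[0, 1] = 1.0; A[1, 2] = 1.0   # H0: beta_1 = beta_2 = 0 (c = 0)
beta = np.array([1.0, 0.0, 0.0, 0.5])                # H0 holds

XtX_inv = np.linalg.inv(X.T @ X)
middle = np.linalg.inv(A @ XtX_inv @ A.T)            # [A (X^T X)^{-1} A^T]^{-1}
F = np.empty(reps)
for r in range(reps):
    Y = X @ beta + rng.normal(size=n)                # sigma = 1
    b = XtX_inv @ (X.T @ Y)
    S2 = np.sum((Y - X @ b)**2) / (n - p)
    F[r] = (A @ b) @ middle @ (A @ b) / (q * S2)

# under H0, F ~ F_{q, n-p}, whose mean is (n - p)/(n - p - 2)
assert abs(np.mean(F) - (n - p) / (n - p - 2)) < 0.1
```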
The proof for the F -test
We have
1. σ̂² = ((n − p)/n) S² = (Y − Xβ̂)^T(Y − Xβ̂)/n
2. σ̂₀² − σ̂² = (Aβ̂ − c)^T [A(X^T X)^{−1} A^T]^{−1} (Aβ̂ − c)/n
We have to prove that
1. nσ̂²/σ² = (n − p)S²/σ² ∼ χ²_{n−p} (done before)
2. under H_0: n(σ̂₀² − σ̂²)/σ² = (Aβ̂ − c)^T [A(X^T X)^{−1} A^T]^{−1} (Aβ̂ − c)/σ² ∼ χ²_q
3. S² and β̂ are independent
Therefore, all functions of S² are also independent from all functions of β̂.
More precisely, σ̂² and σ̂₀² − σ̂² are mutually independent.
The proof for the F -test: distribution under H0
As β̂ ∼ N(β, (X^T X)^{−1}σ²),
we have that Aβ̂ − c is a vector of length q, which is normally distributed with
expected value Aβ − c = 0 (under H_0)
and with covariance matrix A(X^T X)^{−1} A^T σ².
Then (property of covariance matrices)
(Aβ̂ − c)^T [A(X^T X)^{−1} A^T]^{−1} (Aβ̂ − c)/σ² ∼ χ²_q
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.59
The proof for the F-test: independence of S² and β̂

We show that \(Y-X\hat\beta\) is independent from \(\hat\beta\). As both are normally distributed, it suffices to show that they are uncorrelated.

The covariance matrix (mind the positions of the transposes — outer product):
\[ \mathrm{cov}(Y-X\hat\beta,\hat\beta) = E\left[(Y-X\hat\beta)\hat\beta^T\right] - E(Y-X\hat\beta)\cdot E\hat\beta^T \]
\[ = E\left[(Y-X(X^TX)^{-1}X^TY)\cdot((X^TX)^{-1}X^TY)^T\right] - 0\cdot\beta^T \]
\[ = (I-X(X^TX)^{-1}X^T)\cdot E(YY^T)\cdot((X^TX)^{-1}X^T)^T \]
\[ = (I-X(X^TX)^{-1}X^T)\cdot(\sigma^2 I + X\beta\beta^TX^T)\,X(X^TX)^{-1} \]
\[ = (I-X(X^TX)^{-1}X^T)\cdot(X(X^TX)^{-1}\sigma^2 + X\beta\beta^T) = 0 \]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.60
Simultaneous confidence

We know that \(\hat\beta\) and \(S^2\) are independent, so it follows immediately that
\[ \frac{(\hat\beta-\beta)^T(X^TX)(\hat\beta-\beta)/p}{S^2} \sim F_{p,n-p} \]
(This is the F-statistic for the full model, i.e., for the test \(H_0\): all \(\beta_i\) zero, so \(q = p\).)

So
\[ \left\{\beta:\ \frac{(\hat\beta-\beta)^T(X^TX)(\hat\beta-\beta)}{S^2} \le p\,F_{p,n-p,\alpha}\right\} \]
is a \(1-\alpha\) confidence region for \(\beta\).

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.61
Testing a single linear combination of β

\[ a^T\hat\beta \sim N\left(a^T\beta,\ a^T(X^TX)^{-1}a\,\sigma^2\right) \quad\text{so}\quad \frac{a^T\hat\beta - a^T\beta}{S\sqrt{a^T(X^TX)^{-1}a}} \sim t_{n-p} \]

So
\[ P\left(\frac{|a^T(\hat\beta-\beta)|}{S\sqrt{a^T(X^TX)^{-1}a}} \ge t_{n-p,\alpha/2}\right) = \alpha \]

Alternative argument (for \(H_0: a^T\beta = 0\)): take \(A = a^T\) on slide 51, and \(c = 0\); then \(q = 1\), and \(F\) defined on slide 57 equals
\[ F = \frac{(a^T\hat\beta)^2}{S^2\,a^T(X^TX)^{-1}a} \sim F_{1,n-p} \quad\text{(under } H_0\text{)} \]

It holds that for \(B \in \{-1,1\}\) independent from \(F\):
\[ F \sim F_{1,n-p} \iff B\sqrt{F} \sim t_{n-p} \]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.62
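A minimal numerical sketch of this t-test and its link with the F-statistic; the design, coefficients and random seed below are illustrative choices, not taken from the course:

```python
import numpy as np

# Illustrative simulated design (not from the course notes)
rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, 0.0])            # the third coefficient is truly zero
Y = X @ beta + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
S2 = np.sum((Y - X @ beta_hat) ** 2) / (n - p)   # unbiased variance estimator

a = np.array([0.0, 0.0, 1.0])                # test H0: a^T beta = 0
t_stat = (a @ beta_hat) / np.sqrt(S2 * (a @ XtX_inv @ a))

# The F-statistic with A = a^T, c = 0, q = 1 equals t_stat**2 exactly
F_stat = (a @ beta_hat) ** 2 / (S2 * (a @ XtX_inv @ a))
```

The algebraic identity \(F = t^2\) for \(q=1\) holds exactly, whatever the data.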
Testing multiple linear combinations of β (1)

We now verify that the simultaneous confidence interval indeed follows from the F-test for the full model.

We search for \(M_\alpha\) so that
\[ P\left(\max_a \frac{|a^T(\hat\beta-\beta)|}{S\sqrt{a^T(X^TX)^{-1}a}} \ge M_\alpha\right) = \alpha \]

We square the expression and get
\[ \left[a^T(\hat\beta-\beta)\right]^2 = \left[a^T(\hat\beta-\beta)\right]^T\left[a^T(\hat\beta-\beta)\right] = (\hat\beta-\beta)^T(aa^T)(\hat\beta-\beta) \]

The value of the expression in \(a\) is the same as in \(ra\) with \(r \in \mathbb{R}\setminus\{0\}\), so we can maximise under the normalisation \(a^T(X^TX)^{-1}a = 1\).

\(X^TX\) is a \(p\times p\) symmetric, full-rank matrix, so it can be decomposed as \(X^TX = RR^T\) with \(R\) a \(p\times p\) invertible matrix. So we can write \(a^T(X^TX)^{-1}a = a^T(R^{-T}R^{-1})a = u^Tu\) with \(u = R^{-1}a\), i.e., \(a = Ru\).

We maximise over \(u\):
\[ \max_a \frac{(\hat\beta-\beta)^T aa^T(\hat\beta-\beta)}{a^T(X^TX)^{-1}a} = \max_u \frac{(\hat\beta-\beta)^T Ruu^TR^T(\hat\beta-\beta)}{u^Tu} \]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.63
Testing multiple linear combinations of β (2)

We maximise over \(u\):
\[ \max_a \frac{(\hat\beta-\beta)^T aa^T(\hat\beta-\beta)}{a^T(X^TX)^{-1}a} = \max_u \frac{(\hat\beta-\beta)^T Ruu^TR^T(\hat\beta-\beta)}{u^Tu} \]

The numerator contains a squared inner product of \(u\) with \(R^T(\hat\beta-\beta)\). According to the Cauchy–Schwarz inequality:
\[ (\hat\beta-\beta)^T Ruu^TR^T(\hat\beta-\beta) \le \|u\|^2\cdot\|R^T(\hat\beta-\beta)\|^2, \]
so
\[ \max_u \frac{(\hat\beta-\beta)^T Ruu^TR^T(\hat\beta-\beta)}{u^Tu} = \|R^T(\hat\beta-\beta)\|^2 = (\hat\beta-\beta)^T RR^T(\hat\beta-\beta) \]

Since \( (\hat\beta-\beta)^T RR^T(\hat\beta-\beta)/S^2 \sim p\,F_{p,n-p} \), we can take \( M_\alpha = \sqrt{p\,F_{p,n-p,\alpha}} \).

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.64
Residuals and outliers

Residuals (see also slides 7 and 13):
\[ e = Y-\hat\mu = (I-P)Y, \qquad \Sigma_e = \sigma^2(I-P) \]
This allows us to standardise/studentise the residuals and then test for normality.

Cook's distance; influence measure; influence points
It measures the influence of observation \(i\) on the eventual estimated vector:
\[ D_i = \frac{(\hat\beta_{(i)}-\hat\beta)^T(X^TX)(\hat\beta_{(i)}-\hat\beta)/p}{S^2} \]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.65
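A small sketch of Cook's distance, computed directly from the definition above (refitting without observation i) and checked against the standard leverage-based closed form; the simulated data are illustrative, not from the course:

```python
import numpy as np

# Illustrative simulated data (not from the course notes)
rng = np.random.default_rng(1)
n, p = 25, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
e = Y - X @ beta_hat
S2 = e @ e / (n - p)
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)      # leverages h_ii

# Cook's distance via the definition: refit without observation i
D_direct = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.lstsq(X[keep], Y[keep], rcond=None)[0]
    d = b_i - beta_hat
    D_direct[i] = d @ (X.T @ X) @ d / (p * S2)

# Standard closed form, algebraically equivalent to the definition
D_formula = e**2 * h / (p * S2 * (1 - h) ** 2)
```

The closed form avoids the n refits and shows how leverage and residual size combine into influence.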
Transformations of the response

Model:
\[ Y^{(\lambda)} = X\beta+\varepsilon \quad\text{with}\quad Y^{(\lambda)}_i = \frac{Y_i^\lambda-1}{\lambda} \]

General: transformation of a random variable. If \(Y = g(X)\), then
\[ f_Y(y) = f_X(x(y))\,|J(y)| = f_X(g^{-1}(y))\,|J(y)| \]
with \( J = \det\left(\dfrac{\partial x_i}{\partial y_j}\right) \) and \( \dfrac{\partial x_i}{\partial y_j} = \dfrac{\partial g_i^{-1}(y)}{\partial y_j} \).

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.66
Transformations of the response: our case

In our case \(Y^{(\lambda)} \sim N(X\beta,\sigma^2 I)\) takes the role of \(X\), so \(\partial x_i/\partial y_j = 0\) unless \(j = i\); in that case \(\partial x_i/\partial y_i = y_i^{\lambda-1}\).

So \( J = \prod_{i=1}^n y_i^{\lambda-1} \).

The likelihood becomes
\[ L(\beta,\sigma^2,\lambda;Y) = \frac{1}{(2\pi\sigma^2)^{n/2}}\, e^{-\frac{1}{2\sigma^2}(Y^{(\lambda)}-X\beta)^T(Y^{(\lambda)}-X\beta)}\ \prod_{i=1}^n Y_i^{\lambda-1} \]

Differentiation with respect to \(\sigma^2\) yields
\[ \hat\sigma^2 = \frac{1}{n}(Y^{(\lambda)}-X\hat\beta)^T(Y^{(\lambda)}-X\hat\beta) \]

Plugging in:
\[ L(\hat\beta,\hat\sigma^2,\lambda;Y) = \frac{1}{(2\pi\hat\sigma^2)^{n/2}}\, e^{-n/2}\ \prod_{i=1}^n Y_i^{\lambda-1} \]

This function can be evaluated for different values of \(\lambda\) (note that \(\hat\sigma^2\) is in turn also a function of \(\lambda\)).

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.67
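The profile likelihood over λ described above can be sketched as follows; the simulated data (constructed so that the square root of Y is linear in x, i.e., λ = 0.5 should fit well) and the grid of λ values are illustrative assumptions:

```python
import numpy as np

# Illustrative data: sqrt(Y) is linear in x, so lambda near 0.5 should win
rng = np.random.default_rng(2)
n = 80
x = np.linspace(1.0, 10.0, n)
X = np.column_stack([np.ones(n), x])
Y = (2.0 + 0.5 * x + rng.normal(scale=0.1, size=n)) ** 2

def profile_loglik(lam):
    # Box-Cox transform of the response
    Ylam = np.log(Y) if lam == 0 else (Y**lam - 1.0) / lam
    beta_hat = np.linalg.lstsq(X, Ylam, rcond=None)[0]
    sigma2 = np.mean((Ylam - X @ beta_hat) ** 2)       # MLE of sigma^2
    # log L(beta_hat, sigma2_hat, lam; Y), including the Jacobian term
    return (-0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * n
            + (lam - 1.0) * np.sum(np.log(Y)))

lams = np.linspace(0.1, 1.5, 29)
ll = np.array([profile_loglik(l) for l in lams])
lam_best = lams[np.argmax(ll)]
```

In practice one plots `ll` against `lams` and picks λ at (or near) the maximum.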
PART 2: VARIABLE AND MODEL SELECTION
2.1 Motivation and setup, slide 69
2.2 Selection criteria, slide 81
2.3 Minimisation of the information criterion, slide ??
© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.68
2.1 Model selection: Motivation and setup
• Motivation slide 70
• Model selection and variable selection
• The use of likelihood in model selection
© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.69
Motivation (1): full or large models are not the best

The analysis until now assumed that the full (global) model is correct. For instance, the expressions \(E\hat\beta = \beta\) and \(\Sigma_{\hat\beta} = \sigma^2(X^TX)^{-1}\) are true only if \(Y\) can indeed be written as \(X\beta + \varepsilon\) and no explanatory variable is missing.

In order to check whether a specific covariate \(x_i\) should be considered as a candidate in the global model, one could "play safe": put it in the full model and let the hypothesis test \(H_0: \beta_i = 0\) decide whether or not it is significant.

But if we have to play safe for a very wide range of possible explanatory variables, a lot of parameters need to be estimated. This has two effects:
1. The estimators have larger standard errors
2. Few degrees of freedom are left for the estimation of the variance σ²

So: larger errors AND more difficulty in assessing the uncertainty: hypothesis tests will have weak power, and really significant parameters may be left undiscovered.

Occam's razor: keep the model as simple as possible, but no simpler.

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.70
Motivation (1bis): uncontrolled variance inflation

The variance in the full model is unnecessarily large (previous slide).
Sometimes, it is also arbitrarily large: \(\Sigma_{\hat\beta} = \sigma^2(X^TX)^{-1}\) has large entries if \(X^TX\) is close to being singular.

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.71
Motivation (2): the true model is not always the best model

Previous slide: the full/largest model is not the best to work in.
This slide: the true/physical model (= DGP = data generating process) may not be the best statistical model.

Example: assume that two quantities, \(x\) and \(\mu\), are related by the physical law
\[ \mu = \beta_0+\beta_1 x+\beta_2 x^2 \]
We observe \(Y_i = \mu_i+\varepsilon_i\) for \(\mu_i = \beta_0+\beta_1 x_i+\beta_2 x_i^2\).

• Estimate \(y_0 = \beta_0+\beta_1 x_0+\beta_2 x_0^2\) for a given \(x_0\).
If \(\beta_2\) is (very) small, the model \(\mu_i = \beta_0+\beta_1 x_i\) is incorrect and introduces bias, but an estimator \(\hat\beta_2\) may add more variance than it reduces bias.
• To estimate \(\beta_2\) or \(\beta_2/\beta_1\), we do need an estimator \(\hat\beta_2\).

Conclusions
1. The "best" statistical model may not be the true, physical model.
2. The "best" statistical model depends on "what you want to estimate". This is called the focus.

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.72
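A Monte Carlo sketch of the first bullet: when β₂ is small relative to the noise, the misspecified linear model can beat the true quadratic model in prediction error. All numbers (design, β₂, noise level, number of replications) are illustrative assumptions:

```python
import numpy as np

# Illustrative simulation: tiny quadratic term, moderate noise
rng = np.random.default_rng(3)
n, sigma, b2 = 20, 1.0, 0.02
x = np.linspace(0.0, 1.0, n)
mu = 1.0 + 2.0 * x + b2 * x**2                    # true (physical) model
X_lin = np.column_stack([np.ones(n), x])          # reduced, biased model
X_quad = np.column_stack([np.ones(n), x, x**2])   # true model

def avg_pe(X, reps=2000):
    # Monte Carlo estimate of PE = E ||X beta_hat - mu||^2 / n
    total = 0.0
    for _ in range(reps):
        y = mu + rng.normal(scale=sigma, size=n)
        beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        total += np.mean((X @ beta_hat - mu) ** 2)
    return total / reps

pe_lin, pe_quad = avg_pe(X_lin), avg_pe(X_quad)
```

Here the extra estimated parameter costs roughly σ²/n of variance, far more than the (negligible) squared bias of the linear fit removes.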
Motivation (3): the larger the model, the more false positives

Large models raise more multiple-testing issues.
(Even if, in large models, the estimators had the same variance as in small models, there would be more false discoveries.)

Motivation (4): variable selection reduces collinearity

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.73
"All models are wrong, but some are useful"

George Box, Journal of the American Statistical Association, 1976:
"Since all models are wrong the scientist cannot obtain a 'correct' one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity."

Conclusion: full/large models are not the best
• Large standard errors in estimators
• Few degrees of freedom to estimate the noise variance
• As a result of these two: hypothesis tests with low power
• More false positives
• Even if the true model is large...

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.74
Objective and situation of model selection
• A procedure to limit the size of the full model
• before estimation + testing
• Compare nested models but also totally different models
• ↔ testing: no null hypothesis at input, and no statistical evidence at output
• Testing: asymmetry in H0 ↔ H1; model selection is symmetric between the
candidate models
• Compare model selection ∼ forensic profiling: profiling is not a proof of
guilt.
After profiling, suspects are innocent until proven guilty (=H0, presumption
of innocence → hypothesis testing)
• Information criterion expresses how well the data can be described in a
given model
© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.75
Model selection and variable selection
Variable selection
• Selects covariates
• Covariates require parameter βi to be estimated
Model selection
• selects statistical model, given the covariates
– polynomial, exponential, periodic, . . .
– How many terms (degrees, frequencies)
(variables-within-model selection)
• Model for distribution of observations
– Example: exponential distribution versus Weibull
– Complicated models introduce parameters not linked to covariates
• Model for interactions between covariates
© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.76
The use of likelihood in model selection

We know likelihood from statistical inference, where it is used for
• parameter estimation within a given model by MLE, maximum likelihood estimation
• hypothesis testing, e.g., the χ² test for goodness of fit, likelihood ratio tests
• validation of the model with significant parameters after testing (e.g., R² statistics)

We want to use likelihood in model selection, i.e., before statistical inference, to compare different models.

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.77
The use of likelihood in model selection (see also slide 103)

• Working principle: choose the model in which the maximum likelihood estimator maximizes the expected (log-)likelihood with respect to the true, underlying data generating process (DGP) = true/physical model (see also slide 72)
• Problem: the DGP is unknown, so the expected likelihood cannot be computed
• The observed likelihood measures closeness of fit within the model under consideration
• Problem: large models lead to better fit, but more noisy parameter estimators (this is exactly one of the arguments for model selection)
• Solution: the likelihood has to be adjusted or penalized: a closeness–complexity trade-off (where likelihood = closeness)
• The closeness–complexity trade-off is equivalent (in expected value) to the bias–variance trade-off: both trade-offs measure the expected distance to the DGP/true model = expected likelihood

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.78
Example of expected log-likelihood: normal model with known variance

Let \(Y \sim \mathrm{N.I.D.}(\mu,\sigma^2 I)\), with \(\sigma^2\) known; then use the expression of the likelihood on slide 34:
\[ E\left[-\log L(\hat\beta;Y)\right] = \frac{1}{2\sigma^2}E\left[(Y-\hat\mu)^T(Y-\hat\mu)\right] + \frac{n}{2}\log(2\pi\sigma^2) \]
\[ = \frac{1}{2\sigma^2}E\|Y-\hat\mu\|_2^2 + \frac{n}{2}\log(2\pi\sigma^2) \]
\[ = \frac{1}{2\sigma^2}\|\hat\mu-\mu\|_2^2 + \frac{1}{2\sigma^2}E\|Y-\mu\|_2^2 + \frac{n}{2}\log(2\pi\sigma^2) \]
\[ = \frac{1}{2\sigma^2}\|\hat\mu-\mu\|_2^2 + \frac{1}{2\sigma^2}n\sigma^2 + \frac{n}{2}\log(2\pi\sigma^2) \]

Definition (prediction error): replace \(\hat\mu\) inside an expectation:
\[ PE(\hat\beta) = \frac{1}{n}E\left(\|\hat\mu-\mu\|^2\right) = \frac{1}{n}E\left(\|X\hat\beta-X\beta\|^2\right) \]
See also slide 86.

[Figure: PE as a function of the model size in nested model selection]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.79
Prediction error and risk in linear models

Definition (1) (slide 38): risk or expected squared loss
\[ R(\hat\beta) = \frac{1}{n}E\left(\|\hat\beta-\beta\|^2\right) \]
This definition satisfies
\[ R(\hat\beta) = \frac{1}{n}\sum_{j=1}^{m}\left[\mathrm{var}(\hat\beta_j) + \left(E\hat\beta_j-\beta_j\right)^2\right] \]
Result slide 38: among all unbiased estimators, BLUE has the lowest risk.

Definition (2): prediction error (PE)
\[ PE(\hat\beta) = \frac{1}{n}E\left(\|\hat\mu-\mu\|^2\right) = \frac{1}{n}E\left(\|X\hat\beta-X\beta\|^2\right) \]
This definition satisfies
\[ PE(\hat\beta) = \frac{1}{n}\sum_{i=1}^{n}\left[\mathrm{var}(\hat\mu_i) + \left(E\hat\mu_i-\mu_i\right)^2\right] \]

\( \Sigma_{\tilde\mu} - \Sigma_{\hat\mu} = X(\Sigma_{\tilde\beta}-\Sigma_{\hat\beta})X^T \) is positive semi-definite, so among all unbiased estimators, BLUE has the lowest PE.

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.80
2.2 Model selection methods and criteria
Expected likelihood (slide 78) is just one possible measure of quality, but other
(expected) distances with respect to the DGP are similar. We develop several
variants
• Adjusted R2: slides 82 and further on
• Prediction error and Mallows’s Cp: slides 86 and further on
• Kullback-Leibler distance and Akaike’s Information Criterion: slides
101 and further on
• The Bayesian Information Criterion or Schwarz Criterion: slides 114
and further on
• Generalised Cross Validation: a criterion mainly used in sparse variable
selection; see slide 140
© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.81
Example of a validation of a given model after inference: R²

Least squares is an orthogonal projection: with \(\hat\mu = X\hat\beta\) we have \((Y-\hat\mu)^T\hat\mu = 0\) and also \((Y-\hat\mu)^T\mathbf{1} = 0\), since \(\mathbf{1}\) is the first column of \(X\) and the residual is orthogonal to all columns of \(X\). So
\[ \sum_{i=1}^n (Y_i-\hat\mu_i)(\hat\mu_i-\bar Y) = 0 \]
\[ \sum_{i=1}^n (Y_i-\bar Y)^2 = \sum_{i=1}^n (Y_i-\hat\mu_i)^2 + \sum_{i=1}^n (\hat\mu_i-\bar Y)^2 \]
SST = SSR + SSE, with
• Sum of squares Total: \( SST = \sum_{i=1}^n (Y_i-\bar Y)^2 \)
• Sum of squares Regression: \( SSR = \sum_{i=1}^n (\hat\mu_i-\bar Y)^2 \)
• Sum of squares of Errors of prediction, or sum of squared residuals: \( SSE = \sum_{i=1}^n (Y_i-\hat\mu_i)^2 = \sum_{i=1}^n e_i^2 \), where (see also p.7, 13) \( e_i = Y_i-\hat\mu_i \)

We define the coefficient of determination:
\[ R^2 = \frac{SSR}{SST} = 1-\frac{SSE}{SST} \]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.82
Using R² before inference

R² is a measure of how well the model explains the variance of the response variables. However, a submodel of a full model always has a smaller R², even if the full model only adds unnecessary, insignificant parameters to the submodel.

Therefore R² cannot be used before inference: it would just promote large models.

Note: the denominator in R² is the same for all models. The numerator is (up to a constant) the log-likelihood of a normal model with known variance. So R² measures the likelihood of a model. The more parameters the model has, the more likely the observations become, because there are more degrees of freedom to make the observations fit the model.

Adjusted coefficient of determination:
\[ \bar R^2 = 1-\frac{SSE/(n-p)}{SST/(n-1)} \qquad\text{or}\qquad 1-\bar R^2 = \frac{n-1}{n-p}\,(1-R^2) \]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.83
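A short sketch of both quantities on two nested models; the data and the five irrelevant extra covariates are simulated illustrations, not from the course:

```python
import numpy as np

# Illustrative data: only one covariate actually matters
rng = np.random.default_rng(4)
n = 40
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
junk = rng.normal(size=(n, 5))               # insignificant extra covariates

def r2_stats(X):
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    sse = np.sum((y - X @ beta_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    p = X.shape[1]
    r2 = 1.0 - sse / sst
    r2_adj = 1.0 - (sse / (n - p)) / (sst / (n - 1))
    return r2, r2_adj

X_small = np.column_stack([np.ones(n), x])
X_big = np.column_stack([X_small, junk])
r2_s, adj_s = r2_stats(X_small)
r2_b, adj_b = r2_stats(X_big)
```

The plain R² can only increase in the larger model; the adjusted R² penalises the lost degrees of freedom through the (n−1)/(n−p) factor.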
Adjusted R²: interpretation

We know:
\[ S^2 = \frac{(Y-\hat\mu)^T(Y-\hat\mu)}{n-p} = \frac{1}{n-p}\sum_{i=1}^n(Y_i-\hat\mu_i)^2 = \frac{1}{n-p}SSE \]
and
\[ \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n(Y_i-\hat\mu_i)^2 = \frac{n-p}{n}S^2 = \frac{1}{n}SSE = \frac{1}{n}(SST-SSR) = \frac{1}{n}SST\,(1-R^2) \]

Suppose we have two nested models, i.e., one model containing all covariates of the other (that is: a full model with \(p\) covariates and a reduced model with \(q\) restrictions, so with \(p-q\) out of \(p\) free covariates). The F-value in testing the significance of the full model is then:
\[ F = \frac{(\hat\sigma^2_{p-q}-\hat\sigma^2_p)/q}{\hat\sigma^2_p/(n-p)} \]
Dividing numerator and denominator by SST leads to:
\[ F = \frac{(R^2_p-R^2_{p-q})/q}{(1-R^2_p)/(n-p)} = \frac{(1-R^2_{p-q})-(1-R^2_p)}{q\,(1-R^2_p)/(n-p)} \]
We rewrite this in terms of the adjusted R²:
\[ F = \frac{n-p}{q}\left[\frac{1-\bar R^2_{p-q}}{1-\bar R^2_p}-1\right] + \frac{1-\bar R^2_{p-q}}{1-\bar R^2_p} \]
from which it follows that \( \bar R^2_{p-q} \le \bar R^2_p \iff F \ge 1 \).

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.84
Adjusted R2: interpretation (2)
The criterion that decides which model has higher quality is thus equivalent
to a “hypothesis test” with critical value F = 1, instead of Fq,n−p,α. This illus-
trates the symmetry of the model selection criterion, compared to the testing
procedure, which uses null and alternative hypotheses, with clearly unequal
roles.
© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.85
Mallows's Cp statistic: prediction error

Criterion: choose the model with the smallest mean prediction error.

The prediction for covariate values \(x_p\):
\[ \hat\mu_p = x_p^T\hat\beta_p = x_p^T(X_p^TX_p)^{-1}X_p^TY \]
where \(X_p\) is a model with \(p\) covariates (\(p\) columns, \(n\) rows of observations); \(x_p^T\) contains the covariate values in one point of interest. \(x_p^T\) may or may not coincide with one of the rows of \(X_p\).

The prediction error is the expected squared difference \(E(\hat\mu_p-\mu)^2\) between the prediction and the true, error-free value of the response for the given covariate values.

We want to assess the quality of the model \(X_p\). The precise covariate values in which we measure the quality are free in theory. However, since we observe the model in a limited number of covariate values, we restrict ourselves to those values. Indeed, it is hard in practice to say something about the quality of a prediction of a response value if we don't even observe that response value. In the points of observation, we still don't know the exact (noise-free) response.

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.86
The bias–variance trade-off

Prediction error = variance + bias²:
\[ E(\hat\mu_p-\mu)^2 = \mathrm{var}(\hat\mu_p) + (E\hat\mu_p-\mu)^2 \]
with \(\mu = x^T\beta\) in the correct model.

Intuitively: the smaller the model, the smaller the prediction variance, but the larger the bias.

Because of the linearity of expectations,
\[ E\hat\mu_p = E\left(x_p^T(X_p^TX_p)^{-1}X_p^TY\right) = x_p^T(X_p^TX_p)^{-1}X_p^T\mu \]
And
\[ \mathrm{var}(\hat\mu_p) = x_p^T\,\Sigma_{\hat\beta_p}\,x_p = \sigma^2\, x_p^T(X_p^TX_p)^{-1}x_p \]

We search for the minimum average prediction error, where the average is taken over all points of observation, so we want to minimise
\[ PE(\hat\beta_p) = \frac{1}{n}\sum_{i=1}^n E(\hat\mu_{p,i}-\mu_i)^2 \]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.87
The prediction variance of a model

The average variance can be written as
\[ \frac{1}{\sigma^2}\sum_{i=1}^n \mathrm{var}(\hat\mu_{p,i}) = \sum_{i=1}^n x_{p,i}^T(X_p^TX_p)^{-1}x_{p,i} = \sum_{i=1}^n \mathrm{Tr}\left(x_{p,i}^T(X_p^TX_p)^{-1}x_{p,i}\right) = \sum_{i=1}^n \mathrm{Tr}\left((X_p^TX_p)^{-1}(x_{p,i}x_{p,i}^T)\right) \]
\[ = \mathrm{Tr}\left(\sum_{i=1}^n (X_p^TX_p)^{-1}(x_{p,i}x_{p,i}^T)\right) = \mathrm{Tr}\left((X_p^TX_p)^{-1}\sum_{i=1}^n x_{p,i}x_{p,i}^T\right) = \mathrm{Tr}(I_p) = p \]
thereby using that the rows of \(X_p\) are \(x_{p,i}^T\), so
\[ X_p^TX_p = \sum_{i=1}^n x_{p,i}x_{p,i}^T \]

The average variance thus only depends on the size of the model. It does not depend on the exact locations of the points in which the prediction quality is evaluated.

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.88
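The trace identity above can be checked numerically with an arbitrary design; the random design, σ² and dimensions are illustrative choices:

```python
import numpy as np

# Illustrative random design: the result should only depend on p
rng = np.random.default_rng(5)
n, p, sigma2 = 25, 4, 1.3
X = rng.normal(size=(n, p))
XtX_inv = np.linalg.inv(X.T @ X)

# sum_i var(mu_hat_i) = sigma2 * sum_i x_i^T (X^T X)^{-1} x_i
var_sum = sigma2 * np.einsum('ij,jk,ik->', X, XtX_inv, X)
```

Whatever the (full-rank) design, `var_sum` equals σ²·p, confirming that the average prediction variance depends only on the model size.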
The prediction bias of a model

Remark: if we assume that the working model is correct (i.e., contains the DGP = true model), then the result on slide 35 applies: the bias is zero.

The average mean or expected error (bias):
\[ \frac{1}{\sigma^2}\sum_{i=1}^n (E\hat\mu_{p,i}-\mu_i)^2 = \frac{1}{\sigma^2}(E\hat\mu_p-\mu)^T(E\hat\mu_p-\mu) \]
Now
\[ E\hat\mu_p = X_p(X_p^TX_p)^{-1}X_p^T\mu = P_p\mu \]
so the average bias is
\[ \frac{1}{\sigma^2}\mu^T(I-P)^T(I-P)\mu = \frac{1}{\sigma^2}\mu^T(I-P)\mu \]
This contains the unknown, error-free response variable. We try to estimate the bias by plugging in the observed values instead. We find the expression
\[ Y^T(I-P)Y = SSE_p. \]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.89
The expression of Mallows's Cp

The estimator of the bias is biased again, as follows from
\[ E\left(Y^T(I-P)Y\right) = E\left[(\mu+\varepsilon)^T(I-P)(\mu+\varepsilon)\right] \]
\[ = \mu^T(I-P)\mu + E\left[\varepsilon^T(I-P)\mu\right] + E\left[\mu^T(I-P)\varepsilon\right] + E\left[\varepsilon^T(I-P)\varepsilon\right] \]
\[ = \mu^T(I-P)\mu + 0 + 0 + \mathrm{Tr}(I-P)\,\sigma^2 = \mu^T(I-P)\mu + (n-p)\sigma^2 \]
The bias of the estimator of the bias, \((n-p)\sigma^2\), however, only depends on the model size, no longer on the unknown response.

So
\[ PE(\hat\beta_p) = \frac{1}{n}E\left[SSE_p + (2p-n)\sigma^2\right] \]
The random variable on the right-hand side (between the expectation brackets) is observable and is an unbiased estimator of the prediction quality measure on the left-hand side of the expression.

In most practical situations \(\sigma^2\) has to be estimated by \(S^2\), and so the studentised value
\[ C_p = \frac{SSE_p}{S^2} + 2p - n \]
is a consistent estimator of \(\frac{n}{\sigma^2}PE(\hat\beta_p)\).

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.90
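A sketch of Cp over a nested family of polynomial models; the quadratic DGP, noise level and maximal degree are illustrative assumptions (note that S² is estimated once, in the largest model, as the next slide requires):

```python
import numpy as np

# Illustrative nested polynomial family with a quadratic DGP (p = 3)
rng = np.random.default_rng(6)
n, sigma = 60, 0.2
x = np.linspace(0.0, 1.0, n)
mu = 1.0 + x - 3.0 * x**2
y = mu + rng.normal(scale=sigma, size=n)

pmax = 10
X_full = np.column_stack([x**k for k in range(pmax)])
beta_full = np.linalg.lstsq(X_full, y, rcond=None)[0]
# S^2 estimated once, in the largest model, and kept fixed across models
S2 = np.sum((y - X_full @ beta_full) ** 2) / (n - pmax)

cp = []
for p in range(1, pmax + 1):
    Xp = X_full[:, :p]
    bp = np.linalg.lstsq(Xp, y, rcond=None)[0]
    sse = np.sum((y - Xp @ bp) ** 2)
    cp.append(sse / S2 + 2 * p - n)
p_best = int(np.argmin(cp)) + 1
```

The Cp curve drops steeply while bias dominates and then rises roughly linearly (penalty 2p) once the model contains the DGP.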
Remark on the estimation of σ²

The variance of the errors did not appear in the definition of \(PE(\hat\beta_p)\); it comes from the development of the expression.

We impose that the variance estimator must not depend on the selected model (unlike \(\hat\beta_p\)).

Therefore, \(\sigma^2\) must be kept outside the optimisation of the likelihood within a given model: \(S^2\) must not be model-dependent.

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.91
The closeness–complexity trade-off

The Cp criterion reformulates the bias–variance trade-off on slide 87 as a trade-off between closeness (\(SSE_p\)) and complexity (\(2p\)):

from
\[ PE(\hat\beta_p) = \frac{1}{n}E\|\hat\mu_p-\mu\|_2^2 = \frac{1}{n}\sum_{i=1}^n \mathrm{var}(\hat\mu_{p,i}) + \frac{1}{n}\|E\hat\mu_p-\mu\|_2^2 \]
to
\[ PE(\hat\beta_p) = \frac{1}{n}E\left[SSE_p + (2p-n)\sigma^2\right] \]

The equivalence holds in expected value:
\[ \frac{1}{n}E\|\hat\mu_p-\mu\|_2^2 = \frac{1}{n}E\left[SSE_p + (2p-n)\sigma^2\right] \]
but
\[ \frac{1}{n}\|\hat\mu_p-\mu\|_2^2 \ne \frac{1}{n}\left[SSE_p + (2p-n)\sigma^2\right] \]

The closeness–complexity trade-off can also be found directly from the expression of the prediction error; see the next slide.

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.92
Degrees of freedom (DoF)

Mallows's Cp without going by the variance–bias trade-off (straight to closeness–complexity):
\[ PE(\hat\beta_p) = \frac{1}{n}E\|\hat\mu_p-\mu\|_2^2 = \frac{1}{n}E\|\hat\mu_p-Y+Y-\mu\|_2^2 = \frac{1}{n}E\|-e_p+\varepsilon\|_2^2 \]
\[ = \frac{1}{n}\left[E\|e_p\|_2^2 + 2E(-e_p^T\varepsilon) + E\|\varepsilon\|_2^2\right] \]
\[ = \frac{1}{n}E\left[SSE(\hat\beta_p)\right] + \frac{2}{n}E\left[(\varepsilon-e_p)^T\varepsilon\right] - \frac{1}{n}E\|\varepsilon\|_2^2 \]
\[ = \frac{1}{n}E\left[SSE(\hat\beta_p)\right] + \frac{2\nu_p}{n}\sigma^2 - \sigma^2 \]
where \(\nu_p\) are the degrees of freedom, defined as \( \nu_p = \frac{1}{\sigma^2}E\left[\varepsilon^T(\varepsilon-e_p)\right] \).

And define a variant of Mallows's Cp based on degrees of freedom and without studentisation:
\[ \Delta(\hat\beta_p) = \frac{1}{n}SSE(\hat\beta_p) + \frac{2\nu_p}{n}\sigma^2 - \sigma^2 \qquad\text{Then}\quad E\left[\Delta(\hat\beta_p)\right] = PE(\hat\beta_p) \]

Note that the DoF depends on the selected model. It is not a fixed number (as in statistical inference).

Alternative expression for the DoF: \( \nu_p = E(\varepsilon^T\hat\mu_p)/\sigma^2 \)
Proof:
\[ \nu_p = \frac{1}{\sigma^2}E\left[\varepsilon^T(\varepsilon-e_p)\right] = \frac{1}{\sigma^2}E\left[\varepsilon^T(Y-\mu-Y+\hat\mu_p)\right] = \frac{1}{\sigma^2}E\left[\varepsilon^T(\hat\mu_p-\mu)\right] \]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.93
DoF in linear smoothing

Linear smoothing in a given/fixed model: \( \hat\mu_p = P_pY \).

The definition of the DoF (slide 93) does not rely on orthogonality, so \(P_p\) may be any projection or even any smoothing procedure. Depending on the context, the matrix \(P_p\) is termed projection matrix or influence matrix.

Then \( \nu_p = \mathrm{Tr}(P_p) \).

Proof:
\[ \nu_p = E(\varepsilon^T\hat\mu_p)/\sigma^2 = E(\varepsilon^TP_pY)/\sigma^2 = E\left[\varepsilon^TP_p(\mu+\varepsilon)\right]/\sigma^2 \]
\[ = 0 + E\left[\mathrm{Tr}(P_p\varepsilon\varepsilon^T)\right]/\sigma^2 = \mathrm{Tr}(P_p\,\sigma^2 I)/\sigma^2 \]

Special case: orthogonal projection (least squares) within a fixed/given model:
\[ \nu_p = p \]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.94
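A Monte Carlo sketch of the identity \(\nu_p = E(\varepsilon^T\hat\mu_p)/\sigma^2 = \mathrm{Tr}(P_p)\) for a fixed least-squares model; the random design, the fixed "true response" and the number of replications are illustrative assumptions:

```python
import numpy as np

# Illustrative random design; for least squares, Tr(P) = p
rng = np.random.default_rng(7)
n, p, sigma = 30, 4, 1.0
X = rng.normal(size=(n, p))
P = X @ np.linalg.inv(X.T @ X) @ X.T      # hat / influence matrix
mu = rng.normal(size=n)                   # arbitrary fixed true response

reps = 20000
acc = 0.0
for _ in range(reps):
    eps = rng.normal(scale=sigma, size=n)
    acc += eps @ (P @ (mu + eps))         # eps^T mu_hat, with mu_hat = P Y
nu_mc = acc / (reps * sigma**2)
```

The Monte Carlo average `nu_mc` should be close to `np.trace(P)`, which equals p exactly for an orthogonal projection.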
Degrees of freedom after optimisation

• For least squares in a fixed model, we find \(\nu_p = p\) (see p.94)
• If \(X_p\) (and \(P_p\)) result from the optimisation of Mallows's Cp, the selection depends on the errors
• \(\Delta(\hat\beta_p)\) (see the definition on p.93) is still unbiased, but \(\nu_p \ne p\), because the (selected) projection \(P_p\) depends on the errors \(\varepsilon\). Errors can cause false positives
• False positives are more likely when many true parameters are zero. Therefore, the effect of the optimisation is important in sparse models. Otherwise, it can be ignored. See further slide 139

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.95
The use of Cp

• In the notation Cp, the p refers to the size of the model under consideration. It suggests that Cp is a function of p only. But Cp can also be used to compare two non-overlapping models of the same size.
• Cp measures the quality of the model in terms of prediction error. It does not measure the correctness of the model.
prediction ↔ estimation
• Nested models, such as polynomial regression where we include all powers up to p: a larger model thus always includes a smaller one. Then Cp can be seen as a function of p. The curve of Cp has a typical form with a minimum that is the optimal balance between bias (descending as a function of p) and variance (proportional to p). If DGP = true model ∈ candidate models, then bias → 0.

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.96
Illustration of a nested model

Polynomial regression model
\[ Y_i = \sum_{k=0}^{p-1}\beta_k x_i^k + \varepsilon_i \]
with n = 30 observations in [0, 1], and p = 8.
(Degree 7, specified in the simulation by 7 zeros ⇒ 6 extremes; two zeros are outside [0, 1].)

[Figure: the simulated degree-7 polynomial with the n = 30 observations on [0, 1]]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.97
Prediction error

[Figure: the Cp plot — PE and Cp as functions of the model size p]

[Figure: the minimum prediction error estimator]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.98
Conclusions about the example

The true model (DGP) is a polynomial of degree 7. It has two zeros outside the interval of observations. The presence of these zeros has little impact on the observations; it can hardly be detected. As a consequence, the minimum prediction error is reached for a simpler model: a polynomial of degree 5 (p = 6).

Question
1. Why is \(PE(\hat\beta_p)\) almost a linear function for large p?

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.99
Kullback–Leibler divergence

• Data generating density – data generating process – DGP: \(g_Y(y;\sigma^2)\) (joint density of the observations \(Y\))
• Models for selection: \(f_Y(y;\beta_p,\sigma_p^2)\), where \(\beta_p\) has \(p\) parameters
• Then the Kullback–Leibler (KL) divergence
\[ KL\left(g_Y, f_Y(\cdot;\beta_p,\sigma_p^2)\right) = \frac{1}{n}\sum_{i=1}^n\left[E\log g_{Y_i}(Y_i;\sigma^2) - E\log f_{Y_i}(Y_i;\beta_p,\sigma_p^2)\right], \]
measures the distance between the true model (DGP) and the statistical model
• Again: bias–variance balance

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.100
Kullback–Leibler divergence is always positive

In general, if \(X \sim g_X\) (DGP – data generating process), then
\[ KL(g_X,f) = E\left[\log(g_X(X)) - \log(f(X))\right] \]
For all models \(f(x)\), we have
\[ KL(g_X,f) \ge 0 \]
Indeed, because of Jensen's inequality, we have \(E[\log(X)] \le \log(E(X))\), and so,
\[ KL(g_X,f) = E\left[\log\left(\frac{g_X(X)}{f(X)}\right)\right] = -E\left[\log\left(\frac{f(X)}{g_X(X)}\right)\right] \ge -\log\left(E\left[\frac{f(X)}{g_X(X)}\right]\right) \]
\[ = -\log\left(\int_{\mathbb{R}}\frac{f(x)}{g_X(x)}\,g_X(x)\,dx\right) = -\log\left(\int_{\mathbb{R}}f(x)\,dx\right) = -\log(1) = 0 \]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.101
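A numerical sketch of the nonnegativity of KL for two normal densities; the means and variances below are arbitrary illustrative choices, and the closed-form expression for the KL divergence between two normals is used as a cross-check:

```python
import numpy as np

# Illustrative densities: DGP N(0,1), candidate model N(0.5, 1.5^2)
mu1, s1 = 0.0, 1.0
mu2, s2 = 0.5, 1.5

x = np.linspace(-12.0, 12.0, 200001)
dx = x[1] - x[0]
g = np.exp(-(x - mu1) ** 2 / (2 * s1**2)) / np.sqrt(2 * np.pi * s1**2)
f = np.exp(-(x - mu2) ** 2 / (2 * s2**2)) / np.sqrt(2 * np.pi * s2**2)
# Riemann-sum approximation of KL(g || f) = E_g[log g - log f]
kl_num = np.sum(g * (np.log(g) - np.log(f))) * dx

# closed form of KL(N(mu1,s1^2) || N(mu2,s2^2)), for comparison
kl_exact = np.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2 * s2**2) - 0.5
```

Swapping `g` and `f` gives a different (but also nonnegative) value: KL is not symmetric, hence a divergence rather than a metric.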
Kullback–Leibler divergence and expected log-likelihood

\[ KL\left(g_Y, f_Y(\cdot;\beta_p,\sigma_p^2)\right) = \frac{1}{n}\sum_{i=1}^n\left[E\log g_{Y_i}(Y_i;\sigma^2) - E\log f_{Y_i}(Y_i;\beta_p,\sigma_p^2)\right], \]

\(E\log g_{Y_i}(Y_i;\sigma^2)\) does not depend on the chosen model, so
\[ KL\left(g_Y, f_Y(\cdot;\beta_p,\sigma_p^2)\right) = \text{Constant} - E\,LL(\beta_p,\sigma_p^2;Y) \]
with
\[ LL(\beta_p,\sigma_p^2;Y) = \frac{1}{n}\sum_{i=1}^n \log f_{Y_i}(Y_i;\beta_p,\sigma_p^2) \]

KL distance = constant − expected log-likelihood

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.102
Expected log-likelihood in model selection

• See also slide 78
• Definition
\[ \ell(\beta_p,\sigma_p^2) = E\,LL(\beta_p,\sigma_p^2;Y) = \frac{1}{n}\sum_{i=1}^n E\log f_{Y_i}(Y_i;\beta_p,\sigma_p^2) \]
\[ \ell(\beta_p,\sigma_p^2) = \frac{1}{n}\sum_{i=1}^n \int_{-\infty}^{\infty}\log f_{Y_i}(u;\beta_p,\sigma_p^2)\,g_{Y_i}(u;\sigma^2)\,du \]
• Estimation
– Substitute \(\beta_p\) and \(\sigma_p^2\) by sample-based estimators: \(\ell(\hat\beta_p,\hat\sigma_p^2)\) is random
– Estimate the expected value (integral) from the sample \(Y\):
\[ \hat\ell(\beta_p,\sigma_p^2) = \frac{1}{n}\sum_{i=1}^n \log f_{Y_i}(Y_i;\beta_p,\sigma_p^2) \]
– \( E\left[\hat\ell(\beta_p,\sigma_p^2)\right] = \ell(\beta_p,\sigma_p^2) \) but \( E\left[\hat\ell(\hat\beta_p,\hat\sigma_p^2)\right] \ne E\left[\ell(\hat\beta_p,\hat\sigma_p^2)\right] \)

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.103
Estimating KL from a sample

Problem: using the same sample for
• parameter estimation within a model, and
• measuring the distance of the model to the data generating density

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.104
Development of AIC for the linear model (1)

Suppose a linear model with normal errors; let \(\mu = E(Y)\) and \(\hat\mu_p = X_p\hat\beta_p\).

\[ \ell(\hat\beta_p,\hat\sigma_p^2) = -\frac{1}{n}\sum_{i=1}^n E\left[\frac{(Y_i-\hat\mu_{p,i})^2}{2\hat\sigma_p^2} + \frac{1}{2}\log(2\pi\hat\sigma_p^2)\right] \]
\[ = -\frac{1}{n}\sum_{i=1}^n\left[\frac{(\mu_i-\hat\mu_{p,i})^2}{2\hat\sigma_p^2} + \frac{E(Y_i-\mu_i)^2}{2\hat\sigma_p^2} + \frac{1}{2}\log(2\pi\hat\sigma_p^2)\right] \]
\[ = -\frac{1}{n}\sum_{i=1}^n\left[\frac{(\mu_i-\hat\mu_{p,i})^2}{2\hat\sigma_p^2} + \frac{\sigma^2}{2\hat\sigma_p^2} + \frac{1}{2}\log(2\pi\hat\sigma_p^2)\right] \]

Note \(\sigma_p^2 \ne \sigma^2\) (model ↔ data generating process).

Taking
\[ \hat\sigma_p^2 = \frac{1}{n-\nu_p}\sum_{i=1}^n (Y_i-\hat\mu_{p,i})^2 \]
(where \(\nu_p\) is to be chosen: unbiased, MLE or other), we have
\[ \hat\ell(\hat\beta_p,\hat\sigma_p^2) = -\frac{1}{n}\sum_{i=1}^n\left[\frac{(Y_i-\hat\mu_{p,i})^2}{2\hat\sigma_p^2} + \frac{1}{2}\log(2\pi\hat\sigma_p^2)\right] = -\frac{1}{2}\,\frac{n-\nu_p}{n} - \frac{1}{2}\log(2\pi\hat\sigma_p^2) \]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.105
Special case: σ² known or easy to estimate

See also slide 79.

Observed likelihood ∼ SSE:
\[ \hat\ell(\hat\beta_p) = -\frac{1}{n}\sum_{i=1}^n\left[\frac{(Y_i-\hat\mu_{p,i})^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2)\right] = -c_1\,\frac{1}{n}SSE(\hat\beta_p) + c_2 \]

Expected likelihood ∼ prediction error:
\[ \ell(\hat\beta_p) = -\frac{1}{n}\sum_{i=1}^n\left[\frac{(\mu_i-\hat\mu_{p,i})^2}{2\sigma^2} + \frac{1}{2} + \frac{1}{2}\log(2\pi\sigma^2)\right] \]
Then \( E\,\ell(\hat\beta_p) = -a_1\,PE(\hat\beta_p) + a_2 \).

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.106
Development of AIC for the linear model (2)

For the computation of \(E\left[\ell(\hat\beta_p,\hat\sigma_p^2)\right]\) we use that \(\hat\mu_{p,i} = (X_p\hat\beta_p)_i\) is independent from \(\hat\sigma_p^2 = \|Y-X_p\hat\beta_p\|_2^2/(n-\nu_p)\).

The proof is an extension of slide 60 (not relying on the fact that, in the true model/DGP, the estimator is unbiased):
\[ \mathrm{cov}(Y-X_p\hat\beta_p,\hat\beta_p) = E\left[(Y-X_p\hat\beta_p)\hat\beta_p^T\right] - E(Y-X_p\hat\beta_p)\cdot E\hat\beta_p^T \]
\[ = E\left[(Y-X_p(X_p^TX_p)^{-1}X_p^TY)\cdot((X_p^TX_p)^{-1}X_p^TY)^T\right] - E(Y-X_p(X_p^TX_p)^{-1}X_p^TY)\cdot E\left((X_p^TX_p)^{-1}X_p^TY\right)^T \]
\[ = (I-X_p(X_p^TX_p)^{-1}X_p^T)\cdot E(YY^T)\cdot((X_p^TX_p)^{-1}X_p^T)^T - (I-X_p(X_p^TX_p)^{-1}X_p^T)\cdot E(Y)E(Y^T)\cdot((X_p^TX_p)^{-1}X_p^T)^T \]
\[ = (I-X_p(X_p^TX_p)^{-1}X_p^T)\cdot\left[(\sigma^2 I + X\beta\beta^TX^T) - X\beta\beta^TX^T\right]X_p(X_p^TX_p)^{-1} = 0 \]

Hence, any function of \(Y-X_p\hat\beta_p\) is independent from any function of \(\hat\beta_p\).

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.107
Development of AIC for the linear model (3)

\[ E\left[\ell(\hat\beta_p,\hat\sigma_p^2)\right] = -\frac{1}{n}\sum_{i=1}^n E\left[\frac{(\mu_i-\hat\mu_{p,i})^2}{2\hat\sigma_p^2} + \frac{\sigma^2}{2\hat\sigma_p^2} + \frac{1}{2}\log(2\pi\hat\sigma_p^2)\right] \]
\[ = -\frac{1}{2}E\left[\frac{\sigma^2}{\hat\sigma_p^2}\left(\frac{1}{n\sigma^2}\|\mu-\hat\mu_p\|_2^2 + 1\right) + \log(2\pi\hat\sigma_p^2)\right] \]
\[ = -\frac{1}{2}E\left(\frac{\sigma^2}{\hat\sigma_p^2}\right)E\left(\frac{1}{n\sigma^2}\|\mu-\hat\mu_p\|_2^2 + 1\right) - \frac{1}{2}E\left[\log(2\pi\hat\sigma_p^2)\right] \]
where the factorisation uses the independence from the previous slide.

We now use
1. \( Y \sim \chi^2(\nu) \Rightarrow E(1/Y) = 1/(\nu-2) \)
2. (cfr. slides 88 and 89) \( E\left(\|\mu-\hat\mu_p\|_2^2\right) = p\sigma^2 + \mu^T(I-P)\mu \)

Hence, if \(\beta_p\) contains the true model (DGP), then
\[ E\left[\ell(\hat\beta_p,\hat\sigma_p^2)\right] = -\frac{1}{2}\left(\frac{n-\nu_p}{n-p-2}\right)\left(\frac{p}{n}+1\right) - \frac{1}{2}E\left[\log(2\pi\hat\sigma_p^2)\right] \]
while (p.105)
\[ E\left[\hat\ell(\hat\beta_p,\hat\sigma_p^2)\right] = -\frac{1}{2}\,\frac{n-\nu_p}{n} - \frac{1}{2}E\left[\log(2\pi\hat\sigma_p^2)\right] \]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.108
Akaike's Information Criterion — AIC: definition

\[ E\left[\ell(\hat\beta_p,\hat\sigma_p^2)\right] = E\left[\hat\ell(\hat\beta_p,\hat\sigma_p^2)\right] + \frac{1}{2}\,\frac{n-\nu_p}{n} - \frac{1}{2}\left(\frac{n-\nu_p}{n-p-2}\right)\left(\frac{p}{n}+1\right) \]
\[ = E\left[\hat\ell(\hat\beta_p,\hat\sigma_p^2)\right] - \left(\frac{p+1}{n}\right)\left(\frac{n-\nu_p}{n-p-2}\right) \]

As \((n-\nu_p)/(n-p-2) = O(1)\), we can define AIC as
\[ AIC(\hat\beta_p) = \frac{2}{n}\sum_{i=1}^n \log f_{Y_i}(Y_i;\hat\beta_p,\hat\sigma_p^2) - 2\left(\frac{p+1}{n}\right) \]

AIC = 2 · log-likelihood − 2 · model size

Intuition: the penalty comes from the fact that the maximum likelihood estimator substituted into the sample likelihood expression fails to detect the noise variance (the second term on the second line of p.105 disappears).

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.109
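A sketch of this AIC for nested polynomial models in the linear-normal case, using the MLE of σ² within each model (i.e., the choice ν_p = 0 in the slide's notation); the simulated straight-line DGP and degrees considered are illustrative assumptions:

```python
import numpy as np

# Illustrative data: the DGP is a straight line (p = 2 in the nested family)
rng = np.random.default_rng(8)
n = 100
x = np.linspace(0.0, 1.0, n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

def aic(X):
    n_, p = X.shape
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2 = np.mean((y - X @ beta_hat) ** 2)        # MLE of the variance
    # Gaussian log-likelihood at the MLE
    loglik = -0.5 * n_ * np.log(2 * np.pi * sigma2) - 0.5 * n_
    # AIC = (2/n) * loglik - 2 (p + 1)/n, per the slide's normalisation
    return (2.0 * loglik - 2.0 * (p + 1)) / n_

scores = {p: aic(np.column_stack([x**k for k in range(p)])) for p in (1, 2, 3, 4)}
p_best = max(scores, key=scores.get)
```

The model with the largest AIC value (in this normalisation) is selected; adding the true slope improves the criterion far more than the penalty costs.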
AIC — sketch of a general proof

• Let \(\hat\theta_p\) maximise \( \hat\ell(\theta_p) = \frac{1}{n}\log f_Y(y;\theta_p) \) (sample log-likelihood)
• Let \(\theta_p^*\) maximise \( \ell(\theta_p) = \frac{1}{n}E\left(\log f_Y(Y;\theta_p)\right) \) (expected log-likelihood)
• \(\theta_p^*\) is the least false parameter value. If the DGP is included in model p, then \(\theta_p^*\) is the true parameter value. This is because the expected score is zero in the true parameter value, i.e.,
\[ E\left(\frac{\partial}{\partial\theta_j}\log f_Y(Y;\theta^*)\right) = 0 \]
• \( E\left(\hat\ell(\theta_p^*)\right) = \ell(\theta_p^*) \) but \( E\left(\hat\ell(\hat\theta_p)\right) \ne E\left(\ell(\hat\theta_p)\right) \)
• We use Taylor series approximations for both \(\ell(\hat\theta_p)\) and \(\hat\ell(\hat\theta_p)\) around \(\theta_p^*\), in which both functions have the same expected value

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.110
Taylor series for use in the proof of AIC

• Denote by \(H_\ell(\theta)\) the Hessian matrix, i.e., \( \left[H_\ell(\theta)\right]_{ij} = \frac{\partial^2\ell(\theta)}{\partial\theta_i\partial\theta_j} \) (similarly for \(H_{\hat\ell}(\theta)\))
• \[ \ell(\hat\theta_p) = \ell(\theta_p^*) + \frac{1}{2}(\hat\theta_p-\theta_p^*)^T H_\ell(\theta_p^*)(\hat\theta_p-\theta_p^*) \]
where we use that \(\nabla_\theta\ell(\theta_p^*) = 0\) (least false value)
• \[ \hat\ell(\hat\theta_p) = \hat\ell(\theta_p^*) + \nabla_\theta\hat\ell(\theta_p^*)^T(\hat\theta_p-\theta_p^*) + \frac{1}{2}(\hat\theta_p-\theta_p^*)^T H_{\hat\ell}(\theta_p^*)(\hat\theta_p-\theta_p^*) \]
• Together:
\[ \hat\ell(\hat\theta_p) - \ell(\hat\theta_p) = \hat\ell(\theta_p^*) - \ell(\theta_p^*) + \nabla_\theta\hat\ell(\theta_p^*)^T(\hat\theta_p-\theta_p^*) + \frac{1}{2}(\hat\theta_p-\theta_p^*)^T\left(H_{\hat\ell}(\theta_p^*) - H_\ell(\theta_p^*)\right)(\hat\theta_p-\theta_p^*) \]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.111
Model misspecification

For the proof of AIC, we need some results under model misspecification.

Taylor series for \(\nabla_\theta\hat\ell(\theta)\) around \(\theta_p^*\):
\[ \nabla_\theta\hat\ell(\hat\theta_p) \approx \nabla_\theta\hat\ell(\theta_p^*) + H_{\hat\ell}(\theta_p^*)(\hat\theta_p-\theta_p^*) \]
By definition, \(\nabla_\theta\hat\ell(\hat\theta_p) = 0\), so the score \(\nabla_\theta\hat\ell(\theta_p^*)\) satisfies
\[ \nabla_\theta\hat\ell(\theta_p^*) \approx -H_{\hat\ell}(\theta_p^*)(\hat\theta_p-\theta_p^*) \approx -H_\ell(\theta_p^*)(\hat\theta_p-\theta_p^*) \]

Argument for the latter approximation:
• The score can be written as \( \nabla_\theta\hat\ell(\theta_p^*) = \frac{1}{n}\sum_{i=1}^n \nabla_\theta\log f_{Y_i}(Y_i;\theta_p^*) \). This has the form of a sample mean, fluctuating around \( E(\nabla_\theta\hat\ell(\theta_p^*)) = \nabla_\theta(E\hat\ell(\theta_p^*)) = \nabla_\theta\ell(\theta_p^*) = 0 \); hence, it can be expected that a standardised value converges in distribution to a multivariate standard normal (CLT).
• The Hessian: \( H_{\hat\ell}(\theta_p^*) = \frac{1}{n}\sum_{i=1}^n H_{\log f_{Y_i}}(Y_i;\theta_p^*) \xrightarrow{P} H_\ell(\theta_p^*) \) does not fluctuate around zero, but converges in probability to a constant.
• Convergence rates: score \(O_p(1/\sqrt{n})\); Hessian \(O_p(1/n)\)
• Conclusion: approximate the Hessian by \(H_\ell(\theta_p^*)\), keep the score as a random vector.

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.112
AIC - general proof (ctd.)
From slide 112, we have θ̂p − θ∗p ≈ −Hℓ̄(θ∗p)⁻¹ ∇θℓ(θ∗p)
Substitute into the expression for ℓ(θ̂p) − ℓ̄(θ̂p) on slide 111:
ℓ(θ̂p) − ℓ̄(θ̂p) ≈ ℓ(θ∗p) − ℓ̄(θ∗p) − ∇θℓ(θ∗p)ᵀ Hℓ̄(θ∗p)⁻¹ ∇θℓ(θ∗p) + op(1/n)
Expected values
Use: scalar = Tr(scalar); Tr(AB) = Tr(BA); ∇θℓ(θ∗p) has zero mean
E(ℓ(θ̂p)) − E(ℓ̄(θ̂p)) ≈ 0 − Tr[Hℓ̄(θ∗p)⁻¹ E(∇θℓ(θ∗p) ∇θℓ(θ∗p)ᵀ)] = −Tr[Hℓ̄(θ∗p)⁻¹ Σ∇θℓ(θ∗p)]
If θ∗p is the true model (DGP), then −Hℓ̄(θ∗p) = n Σ∇θℓ(θ∗p) (Fisher information matrix), so
E(ℓ(θ̂p)) − E(ℓ̄(θ̂p)) ≈ Tr(Ip/n) = p/n
So, take AIC(θ̂p) = 2ℓ(θ̂p) − 2p/n
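As an illustration (not from the slides): for a Gaussian linear model the maximised average log-likelihood has the closed form ℓ(θ̂) = −½ log(2πσ̂²) − ½ with σ̂² = SSE/n, so the criterion above is easy to evaluate. A minimal sketch comparing an intercept-only model with an intercept-plus-slope model on invented data:

```python
import math

def avg_loglik_gaussian(residuals):
    """Maximised average log-likelihood l(theta_hat) of a Gaussian model:
    -0.5*log(2*pi*sigma2_hat) - 0.5, with sigma2_hat = SSE/n (the MLE)."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    return -0.5 * math.log(2 * math.pi * sigma2) - 0.5

def aic(residuals, p):
    # AIC as on the slide: 2*l(theta_hat) - 2*p/n (larger is better)
    n = len(residuals)
    return 2 * avg_loglik_gaussian(residuals) - 2 * p / n

# Invented data: linear trend plus small deterministic perturbations
x = [float(i) for i in range(20)]
y = [1.0 + 0.5 * xi + 0.3 * math.sin(7.0 * i) for i, xi in enumerate(x)]

# Model 1: intercept only (p = 1)
ybar = sum(y) / len(y)
res1 = [yi - ybar for yi in y]

# Model 2: intercept + slope (p = 2), ordinary least squares
xbar = sum(x) / len(x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
     / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
res2 = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

aic1, aic2 = aic(res1, 1), aic(res2, 2)
```

With the trend present in the data, the larger model wins despite its extra penalty term 2p/n.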
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.113
Bayesian Information Criterion (BIC) or Schwarz Criterion
The Bayesian framework
• No true model, all models have prior probability of being the DGP
• Let Sm be the event that the data were generated in model m
• Let Θm be the set of possible values of θ within the mth model
• Let ϕ(θm|Sm) be a prior density for the parameters in model m
• Then (Bayes) P(Sm|y) = P(Sm) fY(y|Sm) / fY(y) = P(Sm) fY(y|Sm) / ∑_{k=1}^{K} P(Sk) fY(y|Sk)
with
fY(y|Sm) = ∫_{Θm} fY|θm(y; θm|Sm) ϕ(θm|Sm) dθm = ∫_{Θm} fY|θm(y; θm) ϕ(θm|Sm) dθm
• If P(Sm) is constant, then maximisation of P(Sm|y) amounts to maximisation of fY(y|Sm)
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.114
Differences in setup AIC — BIC
AIC has the notion of a data generating process/true model;
BIC starts from the probability that the data were generated by model m, for all models under consideration.
AIC: every model m has a true or least false parameter θ∗m.
BIC: within every model, the parameter values have a random/prior distribution.
AIC tries to find the model with the best expected (log-)likelihood.
BIC tries to maximise the posterior probability, given the observations; in case of a non-informative prior, this is the maximum sample likelihood (see slide 116).
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.115
BIC - formula
Development of the expression on p.114, with θ̂m the maximum likelihood estimator of θm:
fY|θm(y; θm) = exp[n ℓ(θm)] ≈ exp[n ℓ(θ̂m) − ½ (θm − θ̂m)ᵀ n Hℓ(θ̂m) (θm − θ̂m)]
Then we can apply Laplace's approximation for an integral (see slide 119):
fY(y|Sm) = ∫_{Θm} fY|θm(y; θm) ϕ(θm|Sm) dθm
≈ exp[n ℓ(θ̂m)] ϕ(θ̂m|Sm) (2π/n)^(p/2) |det(Hℓ(θ̂m))|^(−1/2)
= exp[(n/2)(2ℓ(θ̂m) − log(n)·p/n)] ·
exp[(n/2)(log(2π)·p/n + (2/n) log(ϕ(θ̂m|Sm)) − (1/n) log|det(Hℓ(θ̂m))|)]
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.116
BIC - formula simplified
All terms on slide 116 have the factor n/2 in the exponent in common. Between the brackets,
we see, for n → ∞,
• in the first term: ℓ(θ̂m) = (1/n) ∑_{i=1}^{n} log fYi(Yi; θ̂m) = Op(1) (see p.103)
• in the second term: O(log(n)/n)
• in all other terms: O(1/n) or Op(1/n)
(In particular, the Hessian converges in probability to a constant, independent of n;
see p.112)
Defining BIC(θ̂m) = 2ℓ(θ̂m) − log(n)·p/n = (2/n) ∑_{i=1}^{n} log fYi(Yi; θ̂m) − log(n)·p/n
we have
P(Sm|y) ≈ P(Sm) exp[n BIC(θ̂m)/2] / ∑_{k=1}^{K} P(Sk) exp[n BIC(θ̂k)/2]
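A sketch (invented Gaussian toy models, equal priors assumed) of how the expression above turns sample likelihoods into approximate posterior model probabilities; since ℓ here is an average per observation, the exponent carries a factor n:

```python
import math

def avg_loglik_gaussian(residuals):
    # Maximised average log-likelihood of a Gaussian fit (sigma2_hat = SSE/n)
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    return -0.5 * math.log(2 * math.pi * sigma2) - 0.5

def bic(residuals, p):
    # BIC as defined above: 2*l(theta_hat) - log(n)*p/n (larger is better)
    n = len(residuals)
    return 2 * avg_loglik_gaussian(residuals) - math.log(n) * p / n

def posterior_probs(bics, n):
    # P(S_m | y) proportional to exp(n*BIC_m/2) under equal model priors;
    # subtract the maximum before exponentiating, for numerical stability
    m = max(bics)
    w = [math.exp(n * (b - m) / 2.0) for b in bics]
    s = sum(w)
    return [wi / s for wi in w]

# Invented data: linear trend plus small deterministic perturbations
x = [float(i) for i in range(20)]
y = [1.0 + 0.5 * xi + 0.3 * math.sin(7.0 * i) for i, xi in enumerate(x)]

ybar, xbar = sum(y) / len(y), sum(x) / len(x)
res1 = [yi - ybar for yi in y]                        # intercept only, p = 1
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
     / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
res2 = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]  # with slope, p = 2

probs = posterior_probs([bic(res1, 1), bic(res2, 2)], n=len(y))
```

The model containing the true covariate receives essentially all posterior mass.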
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.117
BIC as a Bayes factor
When all prior probabilities are equal (or unknown, and thus supposed to be
equal, for convenience), the ratio of conditional distributions controls the selection.
This ratio is known as the Bayes factor:
Bayes factor = fY(y|Sm) / fY(y|Sk) = exp[n BIC(θ̂m)/2] / exp[n BIC(θ̂k)/2]
Bayes factors are the Bayesian analogues for likelihood ratios
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.118
Laplace’s approximation for integrals
Let I = ∫_{Rⁿ} f(x) e^(g(x)) dx, and let x0 be the global maximum of g(x); then
g(x) ≈ g(x0) + ½ (x − x0)ᵀ Hg(x0) (x − x0) = g(x0) − ½ (x − x0)ᵀ |Hg(x0)| (x − x0)
Moreover, f(x) ≈ f(x0) + ∇ᵀf(x0)(x − x0) + ½ (x − x0)ᵀ Hf(x0) (x − x0)
So
I ≈ e^(g(x0)) ∫_{Rⁿ} exp[−½ (x − x0)ᵀ |Hg(x0)| (x − x0)] (f(x0) + ∇ᵀf(x0)(x − x0) + …) dx
The exponential factor can be read as the density of a multivariate normal
around x0, times the missing constant (2π)^(n/2) |det(Hg(x0)⁻¹)|^(1/2); hence
∫_{Rⁿ} exp[−½ (x − x0)ᵀ |Hg(x0)| (x − x0)] ∇ᵀf(x0)(x − x0) dx = 0
I ≈ e^(g(x0)) f(x0) (2π)^(n/2) |det(Hg(x0))|^(−1/2)
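A one-dimensional numerical check (my own example, not from the slides): with f ≡ 1 and g(x) = −t cosh(x), which has its maximum at x0 = 0 with g″(0) = −t, the formula above predicts I ≈ e^(−t) √(2π/t); the relative error of Laplace's approximation here behaves like 1/(8t).

```python
import math

t = 50.0

def g(x):
    return -t * math.cosh(x)

# Laplace: I ≈ e^{g(0)} * f(0) * sqrt(2*pi / |g''(0)|), with g''(0) = -t
laplace = math.exp(-t) * math.sqrt(2 * math.pi / t)

# Brute-force numerical integral (midpoint rule; the integrand is
# negligible outside [-2, 2] for t = 50)
h, a, b = 1e-4, -2.0, 2.0
steps = int((b - a) / h)
numeric = sum(math.exp(g(a + (k + 0.5) * h)) for k in range(steps)) * h

rel_err = abs(numeric - laplace) / numeric
```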
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.119
A classical application of Laplace’s approximation:
Stirling’s formula
n! ≈ √(2πn) (n/e)ⁿ     Γ(z + 1) ≈ √(2πz) (z/e)ᶻ
Proof
Γ(z + 1) = ∫₀^∞ e⁻ˣ xᶻ dx = ∫₀^∞ e^(−zu) (zu)ᶻ z du = z^(z+1) ∫₀^∞ e^(−zu) uᶻ du = z^(z+1) ∫₀^∞ e^(z[log(u)−u]) du
The function gz(u) = z [log(u) − u] has a global maximum in u = 1, with
g′z(u) = z(1/u − 1), g′z(1) = 0 and g″z(u) = −z/u², so
Γ(z + 1) ≈ z^(z+1) √(2π) e^(gz(1)) / √|g″z(1)| = √(2πz) (z/e)ᶻ
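A quick numerical check of Stirling's formula (illustrative); the relative error decays like 1/(12n):

```python
import math

def stirling(n):
    # sqrt(2*pi*n) * (n/e)^n
    return math.sqrt(2 * math.pi * n) * (n / math.e) ** n

errs = []
for n in (5, 10, 20):
    exact = math.factorial(n)
    errs.append(abs(stirling(n) - exact) / exact)
# errs shrinks roughly like 1/(12n)
```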
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.120
AIC and BIC: Consistency
Page 113: AIC(θ̂p) = 2ℓ(θ̂p) − 2p/n
Page 117: BIC(θ̂m) = 2ℓ(θ̂m) − log(n)·p/n
Selection consistency
If the DGP exists and if it belongs to the models under consideration,
then an information criterion is (weakly/strongly) selection consistent if
(with probability tending to one/almost surely) the optimisation of the infor-
mation criterion selects the DGP.
BIC is consistent
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.121
AIC and BIC: Efficiency
Efficiency
An information criterion is asymptotically efficient if its optimisation finds a
model p for which the prediction error PE(β̂p) satisfies (for any ε > 0)
lim_{n→∞} P( PE(β̂p) / PE(β∗) < 1 + ε ) = 1
with β∗ = argmin PE(β)
Note: Prediction Error is considered as a function of the selection. If the
selection is random, then PE(βp) is random
AIC is efficient
A criterion cannot be consistent and efficient at the same time
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.122
Algorithms for searching the best model
• Suppose the full model has K possible variables; then the number of possible models is 2^K
• Computing all adjusted R²'s, Cp scores or other criteria is infeasible (exponential computational complexity)
• Possible strategies:
1. Forward selection: start with a zero model, then gradually add the most significant parameter in a test with H0 the current model.
2. Backward elimination
3. Forward-backward alternation: after a variable has been added, it is tested if any variable in the model can now be eliminated without significant increase in RSS.
4. Grouped/structured selection: the routine may impose that some parameters must be included before others. For instance, no interaction terms without main effects. Imposing nested models is an extreme example.
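A minimal sketch of strategy 1, using a decrease of Mallows's Cp = RSS/S² + 2p − n as the stopping rule instead of a formal F-test; the data, effect sizes and the variance estimate from the full model are all invented for illustration:

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of the least squares fit on the given columns
    (an intercept column is always included)."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

def forward_selection(X, y, s2):
    """Greedily add the variable that reduces RSS most, as long as
    Cp = RSS/s2 + 2p - n keeps decreasing (p counts fitted coefficients)."""
    n, K = X.shape
    selected = []
    best_cp = rss(X, y, []) / s2 + 2 * 1 - n
    while len(selected) < K:
        cand = min((j for j in range(K) if j not in selected),
                   key=lambda j: rss(X, y, selected + [j]))
        cp = rss(X, y, selected + [cand]) / s2 + 2 * (len(selected) + 2) - n
        if cp >= best_cp:
            break
        selected.append(cand)
        best_cp = cp
    return selected

rng = np.random.default_rng(0)
n, K = 50, 5
X = rng.standard_normal((n, K))
y = 2.0 * X[:, 1] - 3.0 * X[:, 3] + 0.1 * rng.standard_normal(n)
s2 = rss(X, y, list(range(K))) / (n - K - 1)  # variance estimate, full model
sel = forward_selection(X, y, s2)
```

With strong, nearly orthogonal signals, the greedy search recovers exactly the two active covariates.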
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.123
PART 3: HIGH DIM. MODEL SELECTION
Sparsity
• Parsimonious model selection: we search for models where “some” covariates have exactly zero contribution.
• Sparse model selection: the number of zeros dominates (by far) the number of nonzeros
Sparsity will be used as an assumption in high dimensional model selection,
thereby defining a solution.
On the other hand, sparsity will cause problems w.r.t. false positives
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.124
Model selection as a constrained optimization problem
Constrained optimization problem = regularized least squares
RegLS(β) = ‖Y − X·β‖² + λ‖β‖ℓq
• Minimum over β
• Defines the estimator within a model
• Operates within the full model, but the estimator is parsimonious (if q < 2)
RegLS(β) depends on the smoothing parameter λ. This parameter can be estimated
using penalized likelihood, for instance Mallows's Cp: Cp = SSEp/S² + 2p − n
• Information criterion
• Minimize over λ (or, equivalently, over model size p)
• The smoothing parameter λ determines the model size p (and vice versa)
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.125
Regularisation: from combinatorial to convex constraints
• In the constrained optimization problem RegLSℓ0(β) = ‖Y − X·β‖² + λp,
the constraint or penalty equals p = ∑ᵢ βi⁰ = ‖β‖₀, where we take 0⁰ = 0.
ℓ0 regularisation = combinatorial integer programming problem
• Replace ℓ0 by ℓ1 → convex quadratic programming problem:
RegLSℓ1(β) = ‖Y − X·β‖² + λ ∑ᵢ |βi|
[Figure: constraint sets for the system Xβ = Y. Combinatorial: I(|β1| > 0) + I(|β2| > 0) = 1; algebraic: |β1|^(1/2) + |β2|^(1/2) = 1; convex: |β1| + |β2| = 1; linear: β1² + β2² = r²]
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.126
Regularisation: linear vs. quadratic constraint
In this example, the design matrix overspecifies the parameter vector (X has
more rows than columns)
The dotted line depicts the optimal balance as a function of the regularisation parameter λ (λ = 0 corresponds to the centre of the ellipses)
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.127
Regularisation: linear vs. quadratic constraint: high-dim
case
In this example, the system of normal equations is singular
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.128
Equivalence λ ↔ p
• For q < 2, the outcome of the regularized minimum sum of squared residuals is a sparse model with, say, p selected covariates: with each λ corresponds one p.
• So, we can also regularize with p
• The outcomes are models that are not (necessarily) nested, but they can
be ordered (put on a line) from simple to complex.
• We can thus, for instance, plot PE(βp) as a function of p, just as in the case
with nested models
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.129
Solving the ℓ1 constraint problem: KKT conditions
Steepest descent for quadratic forms
Suppose F(β) = ‖Y − Xβ‖₂² = ∑_{i=1}^{n} (Yi − ∑ⱼ Xij βj)²
then ∂F/∂βk = −2 ∑_{i=1}^{n} (Yi − ∑ⱼ Xij βj) Xik
The sum over i is an inner product of the vectors Y − Xβ and the kth column of
X, so, all k together, the direction of steepest descent (the negative gradient, up to the factor 2) becomes Xᵀ(Y − Xβ)
Karush-Kuhn-Tucker (KKT) conditions
If β̂ solves the ℓ1 constrained optimization problem RegLSℓ1(β) = ‖Y − Xβ‖₂² + 2λ‖β‖₁, then
Xⱼᵀ(Y − Xβ̂) = λ sign(β̂j)  if β̂j ≠ 0
|Xⱼᵀ(Y − Xβ̂)| ≤ λ  if β̂j = 0
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.130
Intuitive interpretation of the KKT conditions
Intuition
• βj in the model
If βj ≠ 0, then ∂RegLSℓ1(β)/∂βj is continuous, and it must be zero. That is
expressed by the first line of the Karush-Kuhn-Tucker conditions.
• βj not in the model
If βj = 0, then the partial derivative is discontinuous. The steepest decrease
of ‖Y − Xβ‖₂², obtained by letting βj deviate from zero, should then be smaller than
the increase in the penalty, i.e., the cost (price to pay) for introducing βj into
the model is higher than the gain:
gain < cost ⇔ |∂/∂βj ‖Y − Xβ‖₂²| < 2λ · ∂/∂βj ‖β‖ℓ1 ⇔ |Xⱼᵀ(Y − Xβ)| < λ
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.131
Interpretation of the Karush-Kuhn-Tucker conditions
The solution of the ℓ1 constrained problem will be sparse, but it does not coincide
with the ℓ0 constrained solution. Indeed, the solution that satisfies the KKT
conditions cannot be an orthogonal projection (a least squares solution).
On the next slides is a simple, yet important, special case: soft- versus hard-thresholding.
The latter is an ℓ0 solution, and thus a projection; the former is projection
plus shrinkage.
Then, on slide 135, we extend the interpretation to the general case.
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.132
Special case of ℓ0 regularisation: Hard-Thresholding
• Suppose X = I, so we have the observational model
Y = β + ε
In this model, both ℓ0 and ℓ1-penalized closeness-of-fit can be optimized
componentwise
• ℓ0-penalized closeness-of-fit:
RegLSℓ0(β) = ∑_{i=1}^{n} [(Yi − βi)² + λ² I(|βi| > 0)]
Solution is hard-thresholding: β̂i = HTλ(Yi),
where HTλ(x) = x · I(|x| > λ)
[Figure: graph of the hard-thresholding function]
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.133
Special case of ℓ1 regularisation: Soft-Thresholding
• ℓ1-penalized closeness-of-fit:
RegLSℓ1(β) = ∑_{i=1}^{n} [(Yi − βi)² + 2λ|βi|]
Solution is soft-thresholding: β̂i = STλ(Yi),
where STλ(x) = sign(x) · max(|x| − λ, 0)
[Figure: graph of the soft-thresholding function]
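The two rules from these slides written out as functions (a sketch; HTλ(x) = x·I(|x| > λ) and STλ(x) = sign(x)·max(|x| − λ, 0) follow from minimising the componentwise objectives above):

```python
def hard_threshold(x, lam):
    # l0 solution: keep x unchanged if |x| exceeds the threshold, else zero
    return x if abs(x) > lam else 0.0

def soft_threshold(x, lam):
    # l1 solution: selection plus shrinkage of the surviving value by lam
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

# Hard thresholding is a projection (selection only); soft thresholding
# shrinks every selected coefficient towards zero by lam.
values = [(hard_threshold(v, 1.0), soft_threshold(v, 1.0))
          for v in (-3.0, -0.4, 0.0, 0.4, 3.0)]
```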
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.134
Shrinkage in ℓ1 constraint least squares for Y = Xβ + ε
Let I be the set of selected components; then the Karush-Kuhn-Tucker conditions are equivalent to
XIᵀ XI β̂I = STλ(XIᵀ Y)
From this we conclude that, within the optimal selection, the estimation is
not least squares, but contains a shrinkage
Note that XᵀXβ̂ ≠ STλ(XᵀY) (i.e., without selection I)
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.135
LASSO/Basis pursuit
• ℓ1 penalization/regularization/constraint is called least absolute shrinkage
and selection operator (LASSO) or Basis Pursuit
• For given λ, ℓ1 penalization leads to the same degree of sparsity as ℓ0 (see Figure on slide )
• For fixed λ, and if all nonzero β are large enough, ℓ1 penalization is variable selection consistent: for n → ∞, the set of nonzero variables in the selection equals the true set with probability tending to one
• The convex optimization problem can be solved by quadratic/dynamic programming or by specific methods, such as
– Least Angle Regression (LARS), a direct solver
– Iterative Soft Thresholding (It. ST), an iterative solver
We discuss both on the subsequent slides
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.136
Least Angle Regression
• Brings in the new variable with maximum projection on the current residual,
i.e., the component with highest magnitude in the vector Xᵀ(Y − Xβ0)
• Moves along the equiangular vector uI: β1 = β0 + α uI with
uI = XI(XIᵀXI)⁻¹1I / √(1Iᵀ(XIᵀXI)⁻¹1I)
where I is the set of active variables (variables currently in the model), until one
variable not in I has the same projection on the new residual
Xᵀ(Y − Xβ1)
• Stopping criterion: Mallows’s Cp: we stop if (we think that) Cp(p) has reached
a minimum
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.137
Iterative soft-thresholding
• General X, observational model Y = Xβ + ε
• We want ‖Y − Xβ‖₂² as small as possible, under the constraint that ‖β‖₁ = ∑ᵢ |βi| is restricted
• Suppose we have an estimator β(r); then we improve this estimator in two
steps
– We proceed in the direction of the steepest descent of ‖Y − Xβ‖₂².
The direction of steepest descent is Xᵀ(Y − Xβ(r)).
We define β(r+1/2) = β(r) + Xᵀ(Y − Xβ(r))
– We search for β(r+1), which is as close as possible to β(r+1/2), but which
has a restricted ℓ1-norm; so, we take
β(r+1) = STλ(β(r+1/2))
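A sketch of the iteration above in code. For a general X the plain update need not converge (it implicitly assumes ‖XᵀX‖ ≤ 1), so this version rescales the step and the threshold by L = ‖XᵀX‖₂; that step-size choice is my addition, not part of the slide. The fixed point then satisfies the Karush-Kuhn-Tucker conditions of slide 130:

```python
import numpy as np

def soft_threshold(v, t):
    # Componentwise ST_t(v) = sign(v) * max(|v| - t, 0)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(X, y, lam, iterations=5000):
    """Iterative soft-thresholding for ||y - X b||^2 + 2*lam*||b||_1,
    with the step rescaled by L = ||X^T X||_2 to ensure convergence."""
    L = np.linalg.norm(X.T @ X, 2)
    beta = np.zeros(X.shape[1])
    for _ in range(iterations):
        half = beta + (X.T @ (y - X @ beta)) / L  # steepest-descent half-step
        beta = soft_threshold(half, lam / L)      # shrinkage half-step
    return beta

# Invented sparse toy problem
rng = np.random.default_rng(1)
n, K = 40, 8
X = rng.standard_normal((n, K))
y = 3.0 * X[:, 0] - 2.0 * X[:, 4] + 0.1 * rng.standard_normal(n)
lam = 5.0
beta = ista(X, y, lam)
corr = X.T @ (y - X @ beta)   # should satisfy the KKT conditions
```

At convergence, corr[j] ≈ λ·sign(βj) on the selected components and |corr[j]| ≤ λ elsewhere, reproducing slide 130.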
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.138
Degrees of freedom in sparse variable selection
Degrees of freedom (see definition slide 93): νp = (1/σ²) E[εᵀ(ε − ep)]
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.139
Generalized Cross Validation (GCV):
compromise between CV and Cp
Definition: GCV(β̂p) = (1/n) SSE(β̂p) / (1 − νp/n)²
This comes from a formalization + simplification of the routine of
cross validation.
GCV is an estimator of the prediction error (like Cp).
The use of GCV assumes sparse models.
GCV combines the benefits of CV and Cp:
• Like Cp, GCV is much faster than CV
• Like CV, GCV does not need a variance estimator
The implicit variance estimator is quite robust
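A sketch for the orthogonal case X = I (soft-thresholding), where νλ can be estimated by the number N1(λ) of surviving coefficients, as on slide 142; the sparse signal below is invented for illustration:

```python
import math
import random

def soft_threshold(x, lam):
    return math.copysign(max(abs(x) - lam, 0.0), x)

def gcv(y, lam):
    """GCV(lam) = (1/n) SSE / (1 - nu/n)^2, with nu estimated by N1(lam),
    the number of nonzero coefficients after soft-thresholding."""
    n = len(y)
    beta = [soft_threshold(yi, lam) for yi in y]
    sse = sum((yi - bi) ** 2 for yi, bi in zip(y, beta))
    nu = sum(1 for bi in beta if bi != 0.0)
    if nu == n:
        return float("inf")           # guard against a zero denominator
    return (sse / n) / (1.0 - nu / n) ** 2

random.seed(3)
n, k = 200, 10
beta_true = [5.0] * k + [0.0] * (n - k)      # sparse signal
y = [b + random.gauss(0.0, 1.0) for b in beta_true]

lams = [0.1 * i for i in range(1, 81)]       # grid of thresholds
best = min(lams, key=lambda l: gcv(y, l))
n1 = sum(1 for yi in y if abs(yi) > best)    # selected model size
```

The GCV minimum lands at a moderate threshold, selecting a genuinely sparse model without any estimate of the noise variance.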
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.140
Generalised Cross Validation
GCV also estimates the prediction error, however without having to estimate
the variance of the errors.
GCV(β̂p) = (1/n) SSE(β̂p) / (1 − νp/n)²   while   ∆(β̂p) = (1/n) SSE(β̂p) + (2νp/n)σ² − σ²
(see slide 93 for the definition of ∆(β̂p))
The working of GCV is based on
GCV(β̂p) − σ² = [∆(β̂p) − (νp/n)² σ²] / (1 − νp/n)²
[GCV(β̂p) − σ² − ∆(β̂p)] / ∆(β̂p) = [2νp/n − (νp/n)²] / (1 − νp/n)² − [(νp/n)² / (1 − νp/n)²] · σ²/∆(β̂p)
Interpretation: GCV(β̂p) − σ² − ∆(β̂p) is small if νp/n is small; this requires the sparsity
assumption
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.141
Illustration: GCV in LASSO, with νλ = E(N1(λ))
[Figure: PE, ∆, PE + σ², GCV and its estimated version as functions of λ, from large models (small λ) to small models (large λ)]
GCV(λ) = (1/n) SSE(β̂λ) / (1 − νλ/n)²   where νλ = n1 = E(N1(λ))
ĜCV(λ) = (1/n) SSE(β̂λ) / (1 − ν̂λ/n)²   where ν̂λ = N1(λ)
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.142
Two results supporting the use of GCV
1. Uniform efficiency (requires sparsity)
There exists a sequence Λn of (large) subsets of R, so that
sup_{λ∈Λn} |GCV(λ) − σ² − ∆λ| / (∆λ + Vn) →P 0 as n → ∞
where Vn = max(0, sup_{λ∈Λn} (PE(β̂λ,p) − ∆λ))
Interpretation
On a large subset of R, GCV is “equivalent” to Mallows’s Cp (see further)
2. Behavior near zero λ in LASSO (full model)
lim_{λ→0} E[GCV(λ)] = cn > 0
Interpretation
The behavior near zero does not disturb the minimization procedure
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.143
Uniform efficiency: “equivalence” GCV and Cp
PE (Prediction Error) and Cp:
• E(∆p) = PE(β̂p) (see slide 93)
• Bias-variance ⇔ closeness-complexity
• Equivalence in expected value
Cp and GCV:
1. Asymptotic
2. No expectations
3. Requires sparsity
4. Conditions on the design matrix X
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.144
PART 4: MODEL AVERAGING
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.145