Chapter 1: Variable Selection and Model Selection
Maarten Jansen
Overview
1. Multiple linear regression (Slide 2)
2. Variable and model selection (Slide 68)
3. Model averaging (Slide 145)
4. High dimensional model selection (Slide 124)
© Maarten Jansen, STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression, p.1
PART 1: MULTIPLE LINEAR REGRESSION
1.1 The general linear model, slide 3
1.2 Algebraic solution in a Euclidean space, slide 7
1.3 Tools for statistical exploration, slide 19
1.4 Statistical properties of least squares estimators, slide 35
1.1 The general linear model
Y_i = β_0 + ∑_{j=1}^{p−1} β_j x_{i,j} + ε_i
With:
• Yi the ith observation of the response variable
– dependent variable
– explained, observed variable
• xi,j the (observed) ith value of the jth covariate
also called
– independent variable
– regression variable
– explanatory variable
– predictor, predicting variable
– control variable
The general linear model (2)
• β_j the regression coefficient corresponding to the jth covariate
Together the coefficients form a parameter vector β, to be estimated
• β_0 the intercept (the other coefficients are sometimes called “slopes”)
• εi the error (observational noise) in the ith observation
In matrix form: Y = Xβ + ε, where X ∈ R^{n×p} and the first column of X is all
ones (X_{i,0} = 1), corresponding to the intercept. The other columns of X are
observed covariate values.
We also define the expected or true response µ = Xβ
µ = E(Y )
Y = µ + ε
Examples, specific cases (1)
1. Polynomial regression: Y_i = β_0 + ∑_{j=1}^{p−1} β_j x_i^j + ε_i. So x_{i,j} = x_i^j.
There is only one explanatory variable in the strict sense (x, with observations
x_i), but the response depends on several powers of x.
2. Models with interaction
Suppose there are 2 explanatory variables, x_1 and x_2, but Y also depends on the
interaction between them: Y_i = β_0 + β_1 x_{i,1} + β_2 x_{i,2} + β_{1,2} x_{i,1} x_{i,2} + ε_i
We then set x_3 = x_1 x_2, so the model is again linear in its covariates
Examples, specific cases (2)
3. One-way analysis of variance (ANOVA): Y_{ij} = µ_i + ε_{ij} with i = 1, . . . , k and
j = 1, . . . , n_i
Note that in this model the observations have double indices.
All observations at the same level µ_i differ only by the observational
noise, so in a regression model they should share common values of the
explanatory variables.
The model becomes Y_{ij} = ∑_{ℓ=1}^k x_{ij,ℓ} µ_ℓ + ε_{ij}, where we take the indices i and
j together into a single index ij, and x_{ij,ℓ} takes the value 1 if ℓ = i and 0
otherwise.
1.2 Algebraic solution in a Euclidean space
Model Y = Xβ + ε
We search for β so that Xβ is as close as possible to Y in quadratic norm.
Algebraically, this means that we look for the least squares solution
to the overdetermined system of equations Y = Xβ.
Formally, find β that minimizes ‖Y − Xβ‖₂²
In a Euclidean space (i.e., a space where the notion of orthogonality can be
defined using inner products) that means that the
residual vector e = Y − Xβ
has to be orthogonal (perpendicular) to the approximation (or projection) Xβ
in the subspace spanned by the columns of X.
In other words: the residual vector has minimum norm ⇔ it is orthogonal to
the estimator vector
The normal equation and least squares (LS) estimator
Developing the algebraic interpretation: (Xβ̂)^T (Y − Xβ̂) = 0
This is the case if X^T (Y − Xβ̂) = 0,
from which the normal equation (X^T X)β̂ = X^T Y follows, with solution
β̂ = (X^T X)^{−1} X^T Y
Let µ̂ be the estimator for µ = Xβ; then
µ̂ = X(X^T X)^{−1} X^T Y
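As a side illustration (not part of the slides), a minimal NumPy sketch of the normal-equation solution on simulated data; the dimensions, coefficients and noise level are invented for the example:

```python
import numpy as np

# Simulated data (invented for illustration)
rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept column of ones
beta = np.array([1.0, 2.0, -0.5])
Y = X @ beta + 0.1 * rng.normal(size=n)

# Normal equation: (X^T X) beta_hat = X^T Y; lstsq is the numerically safer route
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)

# The residual is orthogonal to the columns of X, as the slide states
assert np.allclose(X.T @ (Y - X @ beta_hat), 0, atol=1e-8)
mu_hat = X @ beta_hat  # fitted values: projection of Y onto span(X)
```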
Special case: p = 2, simple linear regression
Model Yi = β0 + β1xi,1 + εi, i = 1, . . . , n
Simple linear regression: x0 = 1 and x1 = x.
Explicit formulas for the LS estimators:
β̂_1 = ∑_{i=1}^n (Y_i − Ȳ)(x_{i,1} − x̄_1) / ∑_{i=1}^n (x_{i,1} − x̄_1)²
β̂_0 = Ȳ − β̂_1 x̄_1
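The explicit formulas can be checked against the general least squares solution; a small sketch with invented data:

```python
import numpy as np

rng = np.random.default_rng(12)
x = rng.normal(size=50)                       # single covariate (invented)
Y = 2.0 + 3.0 * x + 0.5 * rng.normal(size=50)

# Explicit simple-regression formulas from the slide
beta1 = np.sum((Y - Y.mean()) * (x - x.mean())) / np.sum((x - x.mean())**2)
beta0 = Y.mean() - beta1 * x.mean()

# They agree with the general LS solution for the design [1, x]
X = np.column_stack([np.ones_like(x), x])
assert np.allclose([beta0, beta1], np.linalg.lstsq(X, Y, rcond=None)[0])
```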
Least squares as an orthogonal projection
The prediction µ̂ is an orthogonal projection with projection matrix
P = X(X^T X)^{−1} X^T, so that µ̂ = PY
Orthogonality: P^T = P (see slide 12)
Projection: PP = P (idempotence)
Pµ = µ, PX = X
Interpretation: the columns of X are eigenvectors with eigenvalue 1. If
X^T Y = 0, then PY = 0 (from the definition of P): such Y are eigenvectors with
eigenvalue 0.
Next slides discuss projections and orthogonal projections in general
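The stated properties of P are easy to verify numerically; a short sketch with an arbitrary simulated design matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(8), rng.normal(size=(8, 2))])  # invented design
P = X @ np.linalg.solve(X.T @ X, X.T)  # P = X (X^T X)^{-1} X^T

assert np.allclose(P, P.T)      # orthogonal projection: symmetric
assert np.allclose(P @ P, P)    # idempotent
assert np.allclose(P @ X, X)    # columns of X: eigenvectors with eigenvalue 1
```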
Projections
P is a projection matrix ⇔ PP = P, i.e., P is idempotent
Properties
1. P has eigenvalues 0 and/or 1
Indeed, let Px = λx with x non-trivial; then λ²x = PPx = Px = λx, hence λ² = λ
2. There exists an orthogonal Q so that P = Q [ I  A ; 0  0 ] Q^T
Indeed, P = E [ I  0 ; 0  0 ] E^{−1} = QR [ I  0 ; 0  0 ] R^{−1} Q^T = Q [ I  R_{11}(R^{−1})_{12} ; 0  0 ] Q^T,
using the QR-decomposition E = QR and the notation of submatrices
R = [ R_{11}  R_{12} ; 0  R_{22} ], and similarly for R^{−1}
3. Singular values of P are either 0 or at least 1
Indeed, the squared singular values of P are the eigenvalues of
PP^T = Q [ I + AA^T  0 ; 0  0 ] Q^T, which are either 0 or 1 + (an eigenvalue of AA^T) ≥ 1
Orthogonal projections
Projection P is an orthogonal projection ⇔ (I − P)^T P = 0
So, the matrix P itself is NOT orthogonal: P^T P ≠ I
Projection P is an orthogonal projection ⇔ P^T = P
Indeed, (I − P)^T P = 0 ⇔ P^T P = P
But then P^T = (P^T P)^T = P^T P = P
Hence P must be symmetric
Residuals
The vector of residuals is e = Y − µ̂ = (I − P)Y (see also slide 7)
The normal equation finds the estimator that minimizes the mean squared
residual e^T e/n = ‖e‖²/n = (1/n) ∑_{i=1}^n e_i²
• The matrix I − P is idempotent: (I − P)(I − P) = I − P
• This matrix has rank n − p: all linear combinations of the columns of X
belong to the kernel of the matrix, since (I − P)X = 0
• An idempotent matrix with rank k has k independent eigenvectors with
eigenvalue 1 and n − k independent eigenvectors with eigenvalue 0. For I − P,
the latter span the column space of X
• The matrix I − P is symmetric, so eigenvectors corresponding to eigenvalue
1 are orthogonal to eigenvectors with eigenvalue 0. Within each of
the two groups of eigenvectors, orthogonalisation (using Gram-Schmidt) is
straightforward.
• So: there exists an orthogonal transform U so that I − P = U^T ΛU with
Λ = diag(1, . . . , 1, 0, . . . , 0)
Orthogonalisation
If X has orthonormal columns, then X^T X = I_p, and thus the projection
becomes
µ̂ = X · X^T · Y
The ith component is then
µ̂_i = ∑_{j=0}^{p−1} ∑_{k=1}^n x_{i,j} x_{k,j} · Y_k
The computation of all n components of µ̂ can proceed in parallel, without
numerical errors propagating from one component to another.
Otherwise, we orthogonalise X
Gram-Schmidt-orthogonalisation
Let x_j, j = 0, . . . , p − 1 be the columns of X, then define
q_0 = x_0 = 1
q_1 = x_1 − (⟨x_1, q_0⟩/‖q_0‖²) · q_0
. . .
q_j = x_j − ∑_{k=0}^{j−1} (⟨x_j, q_k⟩/‖q_k‖²) · q_k
That is: q_j is the residual of the orthogonal projection of x_j onto q_0, . . . , q_{j−1}
This looks like linear regression within the covariate matrix X
The new covariates q_j are orthogonal, but not orthonormal (they can be easily
normalised)
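The recursion above can be transcribed directly; a sketch of classical Gram-Schmidt (in practice a library QR routine is numerically safer):

```python
import numpy as np

def gram_schmidt(X):
    """Classical Gram-Schmidt on the columns of X (orthogonal, not normalised)."""
    n, p = X.shape
    Q = np.zeros((n, p))
    for j in range(p):
        q = X[:, j].copy()
        for k in range(j):
            # subtract the projection of x_j onto q_k
            q -= (X[:, j] @ Q[:, k]) / (Q[:, k] @ Q[:, k]) * Q[:, k]
        Q[:, j] = q
    return Q

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(6), rng.normal(size=(6, 2))])  # invented design
Q = gram_schmidt(X)
G = Q.T @ Q
# off-diagonal entries of Q^T Q vanish: the columns are orthogonal
assert np.allclose(G - np.diag(np.diag(G)), 0, atol=1e-10)
```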
Gram-Schmidt in matrix form
We have
x_j = q_j + ∑_{k=0}^{j−1} (⟨x_j, q_k⟩/‖q_k‖²) · q_k
Apply the scalar product with q_j on both sides to find that ⟨x_j, q_j⟩/‖q_j‖² = 1,
hence x_j = ∑_{k=0}^{j} (⟨x_j, q_k⟩/‖q_k‖²) · q_k
In matrix form: QR-decomposition X = Q · R, with the columns of Q equal to q_j, and
R an upper triangular matrix with entries
R_{k,j} = ⟨x_j, q_k⟩/‖q_k‖²
R ∈ R^{p×p} and R is invertible.
Interpretation: every column of X (every covariate x_j) can be written as a linear
combination of orthogonal covariates, which are the columns of Q (right-multiplication
with R).
Regression after orthogonalisation
Plugging in into the solution of the normal equation:
µ̂ = QR(R^T Q^T QR)^{−1} R^T Q^T · Y = Q(Q^T Q)^{−1} Q^T · Y
This is a projection onto the orthogonal (but not orthonormal) basis Q.
Written out: µ̂ = ∑_{k=0}^{p−1} (⟨Y, q_k⟩/‖q_k‖²) · q_k
The vector of coefficients γ̂ with γ̂_k = ⟨Y, q_k⟩/‖q_k‖²
satisfies µ̂ = Qγ̂ and γ̂ = (Q^T Q)^{−1} Q^T · Y
Relation between γ̂ and β̂:
β̂ = (X^T X)^{−1} X^T · Y = (R^T Q^T QR)^{−1} R^T Q^T · Y = R^{−1}(Q^T Q)^{−1} Q^T · Y
β̂ = R^{−1} γ̂, γ̂ = Rβ̂ (left-multiplication with R)
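A sketch of the same computation with a library QR factorisation; note that `numpy.linalg.qr` returns an orthonormal Q, so ‖q_k‖ = 1 and γ̂ = Q^T Y (the data are invented):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

Q, R = np.linalg.qr(X)               # Q has orthonormal columns here
gamma = Q.T @ Y                      # coefficients in the orthonormal basis
beta_hat = np.linalg.solve(R, gamma) # beta_hat = R^{-1} gamma (back-substitution)

assert np.allclose(beta_hat, np.linalg.lstsq(X, Y, rcond=None)[0])
```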
Orthogonalisation details for p = 2, simple linear regression
Model Y_i = β_0 + β_1 x_{i,1} + ε_i, i = 1, . . . , n
Simple linear regression: x_0 = 1 and x_1 = x.
q_1 = x_1 − (⟨x_1, q_0⟩/‖q_0‖²) · q_0 = x_1 − (⟨x_1, 1⟩/‖1‖²) · 1 = x_1 − (∑_{i=1}^n x_{i,1}/n) · 1 = x_1 − x̄_1 · 1
R = [ 1  x̄_1 ; 0  1 ], R^{−1} = [ 1  −x̄_1 ; 0  1 ]
γ̂_0 = Ȳ, γ̂_1 = ⟨q_1, Y⟩/‖q_1‖² = ⟨x_1 − x̄_1 · 1, Y⟩ / ‖x_1 − x̄_1 · 1‖²
Since the second row of R^{−1} equals [ 0  1 ], we find immediately that
β̂_1 = ⟨x_1 − x̄_1 · 1, Y⟩ / ‖x_1 − x̄_1 · 1‖²
which corresponds to the classical result, slide 9
1.3 Tools for statistical exploration
In order to further explore the properties of the least squares estimator, we
need to establish some classical results in multivariate statistics:
1. Covariance matrix
2. Schur complement
3. Multivariate normal distribution
4. Trace of a matrix
Tool 1: Covariance matrix
Definition
Let X be a p-variate random vector with joint density function f_X(x), joint
(cumulative) distribution function F_X(x), and joint characteristic function
φ_X(t) = E(e^{i t^T X})
µ = E(X)
Σ_X = E[(X − µ)(X − µ)^T]
We have (Σ_X)_{ij} = cov(X_i, X_j)
Σ_X is symmetric
Covariance matrix under linear transformations
If Y = AX, then Σ_Y = AΣ_X A^T
Proof:
It holds that µ_Y = Aµ_X, so
Σ_Y = E[(Y − µ_Y)(Y − µ_Y)^T] = E[(A(X − µ_X))(A(X − µ_X))^T] = AΣ_X A^T
So, with a_i^T the ith row of A, we have cov(Y_i, Y_j) = a_i^T Σ_X a_j = a_j^T Σ_X a_i
Special case: the matrix A has just one row, A = a^T; then Y = a^T X and
σ_Y² = a^T Σ_X a
Covariance matrix is positive (semi-) definite
A symmetric matrix S is called positive (semi-)definite if x^T S x > 0 (≥ 0) for
every vector x ≠ 0
A positive-definite matrix always has real, positive eigenvalues
Σ_X is positive semi-definite, since if a^T Σ_X a < 0, then Y = a^T X would have
a negative variance
We further assume that Σ_X is positive definite.
Eigenvalue decomposition of the covariance matrix
There exists a square, orthogonal matrix U so that Σ_X = UΛU^T, with Λ a
diagonal matrix with the eigenvalues of Σ_X on the diagonal
Furthermore, there exists a matrix A = UΛ^{1/2} so that Σ_X = AA^T
For this matrix it holds that [det(A)]² = det(Σ_X)
Suppose now Z = A^{−1}(X − µ) (with µ = EX), then
EZ = 0 and Σ_Z = A^{−1}(AA^T)A^{−T} = I
The components of Z are thus uncorrelated
Vice versa, if Σ_Z = I and A is an arbitrary matrix, and X = AZ, then
Σ_X = AA^T, and this arbitrary A can then be written as A = UΛ^{1/2}, with
U orthogonal.
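A numerical sketch of the factorisation Σ_X = AA^T and of the whitening transform Z = A^{−1}(X − µ); the covariance matrix is invented for illustration:

```python
import numpy as np

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])          # invented positive definite covariance
lam, U = np.linalg.eigh(Sigma)          # Sigma = U diag(lam) U^T
A = U @ np.diag(np.sqrt(lam))           # A = U Lambda^{1/2}, so Sigma = A A^T
assert np.allclose(A @ A.T, Sigma)

# Z = A^{-1}(X - mu) has identity covariance: A^{-1} Sigma A^{-T} = I
A_inv = np.linalg.inv(A)
Sigma_Z = A_inv @ Sigma @ A_inv.T
assert np.allclose(Sigma_Z, np.eye(2))
```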
The precision matrix or concentration matrix
The inverse of a non-singular covariance matrix plays an important role. It is
known as the precision matrix or the concentration matrix
Notation (capital kappa): K_X = Σ_X^{−1}
In the context of the multivariate normal distribution, we will need to develop
the quadratic form u^T K_X u, where u = [ u_1 ; u_2 ] and K_X = [ K_{11}  K_{12} ; K_{21}  K_{22} ]:
u^T K_X u = u_1^T K_{11} u_1 + u_1^T K_{12} u_2 + u_2^T K_{21} u_1 + u_2^T K_{22} u_2
= u_1^T K_{11} u_1 + 2 u_1^T K_{12} u_2 + u_2^T K_{22} u_2
(where we used the symmetry of K_X, i.e., K_{21} = K_{12}^T)
Tool 2: The Schur complement of a matrix block
Definition:
Given the matrix S = [ S_{11}  S_{12} ; S_{21}  S_{22} ], the Schur complement of submatrix S_{11} is
given by S_{1|2} = S_{11} − S_{12} S_{22}^{−1} S_{21}
Property: S_{1|2} = [(S^{−1})_{11}]^{−1}
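The property can be checked numerically on a random symmetric positive definite matrix (invented here) before going through the algebraic proof:

```python
import numpy as np

rng = np.random.default_rng(5)
M = rng.normal(size=(5, 5))
S = M @ M.T + np.eye(5)          # symmetric positive definite, so blocks invert
S11, S12 = S[:2, :2], S[:2, 2:]
S21, S22 = S[2:, :2], S[2:, 2:]

schur = S11 - S12 @ np.linalg.solve(S22, S21)   # S_{1|2}
inv_block = np.linalg.inv(S)[:2, :2]            # (S^{-1})_{11}
assert np.allclose(schur, np.linalg.inv(inv_block))
```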
Proof of S_{1|2} = [(S^{−1})_{11}]^{−1}
Construct L = [ I_{11}  0 ; −S_{22}^{−1}S_{21}  I_{22} ], then L^{−1} = [ I_{11}  0 ; +S_{22}^{−1}S_{21}  I_{22} ]
and
SL = [ S_{11}  S_{12} ; S_{21}  S_{22} ] [ I_{11}  0 ; −S_{22}^{−1}S_{21}  I_{22} ] = [ S_{11} − S_{12}S_{22}^{−1}S_{21}  S_{12} ; 0  S_{22} ]
= [ S_{1|2}  S_{12} ; 0  S_{22} ] = [ I_{11}  S_{12}S_{22}^{−1} ; 0  I_{22} ] [ S_{1|2}  0 ; 0  S_{22} ]
So, S = [ I_{11}  S_{12}S_{22}^{−1} ; 0  I_{22} ] [ S_{1|2}  0 ; 0  S_{22} ] [ I_{11}  0 ; S_{22}^{−1}S_{21}  I_{22} ]
and
S^{−1} = [ I_{11}  0 ; −S_{22}^{−1}S_{21}  I_{22} ] [ S_{1|2}^{−1}  0 ; 0  S_{22}^{−1} ] [ I_{11}  −S_{12}S_{22}^{−1} ; 0  I_{22} ] = [ S_{1|2}^{−1}  . . . ; . . .  . . . ]
Tool 3: The multivariate normal distribution
Definition: multivariate normal distribution
X ∼ N(µ, Σ_X) ⇔ X = µ + AZ with AA^T = Σ_X and Z_i i.i.d. ∼ N(0, 1)
As the components of Z are uncorrelated, and (here) normally distributed,
they must be mutually independent.
The joint density function:
f_X(x) = 1/((2π)^{p/2} √det(Σ_X)) · e^{−(1/2)(x−µ)^T Σ_X^{−1}(x−µ)}
The characteristic function:
φ_X(t) = exp(i t^T µ − (1/2) t^T Σ_X t)
Properties of the multivariate normal distribution
Property 1: If X ∼ N(µ, Σ_X), then (X − µ)^T Σ_X^{−1} (X − µ) ∼ χ²_p
This follows from (X − µ)^T Σ_X^{−1} (X − µ) = (X − µ)^T A^{−T} A^{−1} (X − µ) = Z^T Z =
∑_{i=1}^p Z_i², with Z_i² ∼ χ²_1 and the Z_i mutually independent
Property 2:
If X ∼ N(µ, Σ_X), then for C ∈ R^{q×p} with q ≤ p: CX ∼ N(Cµ, CΣ_X C^T)
Special cases
C = a^T ∈ R^{1×p}: X ∼ N(µ, Σ_X) ⇔ a^T X ∼ N(a^T µ, a^T Σ_X a)
C = [ 0  0 ; 0  I ]: if X = [ X_1 ; X_2 ] ∼ N(µ, Σ_X) with µ = [ µ_1 ; µ_2 ] and
Σ_X = [ Σ_{11}  Σ_{12} ; Σ_{21}  Σ_{22} ], then X_2 ∼ N(µ_2, Σ_{22})
Properties of the multivariate normal distribution
Property 3:
Suppose X = [ X_1 ; X_2 ] and X ∼ N(µ, Σ_X) with µ = [ µ_1 ; µ_2 ] and
Σ_X = [ Σ_{11}  Σ_{12} ; Σ_{21}  Σ_{22} ] and det(Σ_{22}) ≠ 0; then
X_1 | X_2 = x_2 ∼ N(µ_{1|2}, Σ_{1|2})
with µ_{1|2} = µ_1 + Σ_{12}Σ_{22}^{−1}(x_2 − µ_2) and Σ_{1|2} = Σ_{11} − Σ_{12}Σ_{22}^{−1}Σ_{21}
It holds that (see slide 25) Σ_{1|2}^{−1} = (Σ^{−1})_{11} = K_{11} and Σ_{12}Σ_{22}^{−1} = −K_{11}^{−1}K_{12}
Note also that Σ_{21} = Σ_{12}^T
Marginal and conditional covariance
(further details of slide 29)
Suppose X = [ X_1 ; X_2 ] is normally distributed with mean µ = [ µ_1 ; µ_2 ] and covariance
matrix Σ_X = [ Σ_{11}  Σ_{12} ; Σ_{21}  Σ_{22} ] and det(Σ_{22}) ≠ 0; then
Marginal covariance (holds in general):
cov(X_1) = Σ_{11}
Conditional covariance = Schur complement (holds for normal RV):
cov(X_1 | X_2) = Σ_{1|2} = Σ_{11} − Σ_{12}Σ_{22}^{−1}Σ_{21} = [(Σ^{−1})_{11}]^{−1}
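A sketch checking the Schur-complement identity for the conditional covariance on an invented 3-variate covariance matrix (X_1 is the first component, X_2 the last two):

```python
import numpy as np

rng = np.random.default_rng(6)
mu = np.array([0.0, 1.0, -1.0])         # invented mean
M = rng.normal(size=(3, 3))
Sigma = M @ M.T + np.eye(3)             # invented positive definite covariance

S11, S12 = Sigma[:1, :1], Sigma[:1, 1:]
S21, S22 = Sigma[1:, :1], Sigma[1:, 1:]
x2 = np.array([0.5, -0.2])              # invented conditioning value

mu_cond = mu[:1] + S12 @ np.linalg.solve(S22, x2 - mu[1:])
Sigma_cond = S11 - S12 @ np.linalg.solve(S22, S21)

# Schur-complement identity: Sigma_{1|2} = [(Sigma^{-1})_{11}]^{-1}
K = np.linalg.inv(Sigma)
assert np.allclose(Sigma_cond, np.linalg.inv(K[:1, :1]))
assert mu_cond.shape == (1,)
```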
Conditional mean & covariance for normal random vectors
We now prove the result on slide 29.
We know: if X = [ X_1 ; X_2 ] is normal, then X_2 is normal as well (see slide 28).
So, denoting U = X − µ,
f_{U_1|U_2}(u_1|u_2) = f_U(u) / f_{U_2}(u_2) ∝ exp[−(u^T Σ^{−1} u − u_2^T Σ_{22}^{−1} u_2)/2]
Now, with K = Σ^{−1}, we substitute u^T Σ^{−1} u = u_1^T K_{11} u_1 + 2u_1^T K_{12} u_2 + u_2^T K_{22} u_2
(see slide 24) and Σ_{22}^{−1} = [(K^{−1})_{22}]^{−1} = K_{22} − K_{21}K_{11}^{−1}K_{12}:
−2 log f_{U_1|U_2}(u_1|u_2)
= u_1^T K_{11} u_1 + 2u_1^T K_{12} u_2 + u_2^T K_{22} u_2 − u_2^T K_{22} u_2 + u_2^T K_{21}K_{11}^{−1}K_{12} u_2 + constant
= u_1^T K_{11} u_1 + 2u_1^T K_{12} u_2 + u_2^T K_{21}K_{11}^{−1}K_{12} u_2 + constant
= (u_1 + K_{11}^{−1}K_{12} u_2)^T K_{11} (u_1 + K_{11}^{−1}K_{12} u_2) + constant
From this, we conclude E(U_1|U_2 = u_2) = −K_{11}^{−1}K_{12} u_2
and cov(U_1|U_2 = u_2) = K_{11}^{−1} = Σ_{11} − Σ_{12}Σ_{22}^{−1}Σ_{21}
Conditional mean & covariance for normal random vectors
(2)
(Proof, continued)
And so,
E(X_1|X_2 = x_2) = E(µ_1 + U_1 | X_2 − µ_2 = x_2 − µ_2)
= µ_1 + E(U_1|U_2 = x_2 − µ_2)
= µ_1 − K_{11}^{−1}K_{12}(x_2 − µ_2)
From Σ_X K_X = I we find
(1) Σ_{11}K_{12} + Σ_{12}K_{22} = 0 and
(2) Σ_{21}K_{12} + Σ_{22}K_{22} = I ⇔ Σ_{21}K_{12} = I − Σ_{22}K_{22}
(1) and (2) are used in the following argument (with K_{11}^{−1} = Σ_{1|2}):
K_{11}^{−1}K_{12} = Σ_{11}K_{12} − Σ_{12}Σ_{22}^{−1}Σ_{21}K_{12}
= Σ_{11}K_{12} − Σ_{12}Σ_{22}^{−1}(I − Σ_{22}K_{22})
= Σ_{11}K_{12} − Σ_{12}Σ_{22}^{−1} + Σ_{12}Σ_{22}^{−1}Σ_{22}K_{22} = Σ_{11}K_{12} + Σ_{12}K_{22} − Σ_{12}Σ_{22}^{−1} = −Σ_{12}Σ_{22}^{−1}
Tool 4: Trace of a matrix
Tr(A) = ∑_{i=1}^p A_{ii}
Tr(AB) = Tr(BA) (A ∈ R^{p×q}, B ∈ R^{q×p})
Holds for both square (p = q) and rectangular matrices with matching sizes
Proof:
Tr(AB) = ∑_{i=1}^p (AB)_{ii} = ∑_{i=1}^p ∑_{j=1}^q A_{ij}B_{ji} = ∑_{j=1}^q ∑_{i=1}^p B_{ji}A_{ij} = ∑_{j=1}^q (BA)_{jj} = Tr(BA)
Corollary: for A, B, C square matrices, Tr(ABC) = Tr(BCA) = Tr(CAB)
The trace remains unchanged under cyclic permutation, so NOT under every permutation:
Tr(ABC) ≠ Tr(BAC) in general
This holds also for more than 3 matrices, and also for rectangular matrices with
matching dimensions
Tr(A + B) = Tr(A) + Tr(B)
Attention: Tr(A · B) ≠ Tr(A) · Tr(B), but det(A · B) = det(A) det(B)
Trace and eigenvalues
If A = EΛE^{−1}, with Λ = diag(λ_i), then Tr(A) = ∑_{i=1}^n λ_i and det(A) = ∏_{i=1}^n λ_i
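A quick numerical check of the cyclic property with random 3 × 3 matrices (invented data):

```python
import numpy as np

rng = np.random.default_rng(7)
A, B, C = (rng.normal(size=(3, 3)) for _ in range(3))

# cyclic permutations leave the trace unchanged
assert np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A))
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))
# but a non-cyclic permutation, e.g. Tr(BAC), generally differs
```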
Maximum likelihood for regression
Model Y = Xβ + ε (random variable Y instead of X in the previous slides)
Assume ε ∼ N.I.D.(0, σ²); then the joint observational density is multivariate
normal, more precisely L(β, σ²; Y) = (2πσ²)^{−n/2} e^{−(1/(2σ²))(Y−Xβ)^T(Y−Xβ)}
To be maximised over β and σ².
log L(β, σ²; Y) = −(n/2) log(2πσ²) − (1/(2σ²))(Y − Xβ)^T(Y − Xβ)
The dependence on β is only through the second term, so β maximises L(β, σ²; Y) or
log L(β, σ²; Y) iff β minimises
(Y − Xβ)^T(Y − Xβ) = ‖Y − Xβ‖₂²
This is least squares.
Conclusion: for normal errors, the least squares solution is the maximum
likelihood solution.
1.4 Statistical properties of LS estimators
Model Y = Xβ + ε; least squares: β̂ = (X^T X)^{−1} X^T Y
Property 1: unbiased
If ε has zero mean, then Eβ̂ = β
(The errors do not need to be uncorrelated or normally distributed)
Property 2: covariance
If ε is uncorrelated and homoscedastic, then Σ_β̂ = σ²(X^T X)^{−1}
(Normality and zero mean not required)
Property 3: normality
If ε is normally distributed, then β̂ is normally distributed
Indeed: the estimator is a linear combination of normally distributed random
variables
(No need for uncorrelated, zero mean errors)
Marginal distributions: β̂_j ∼ N(β_j, σ²(X^T X)^{−1}_{jj})
Property 4: best linear unbiased estimator (BLUE)
Suppose
• Zero mean errors: E(ε) = 0
• Uncorrelated, homoscedastic errors: Σ_ε = σ²I, i.e., var(ε_i) = σ² and
cov(ε_i, ε_j) = 0 for i ≠ j
(not necessarily independent)
• Normality is not necessary
While β̂ = (X^T X)^{−1} X^T Y, let β̃ = AY = [(X^T X)^{−1} X^T + ∆]Y be another
unbiased, linear estimator; then:
Gauss-Markov theorem
For any choice of c^T, under the assumptions above, MSE(c^T β̃) ≥ MSE(c^T β̂)
In other words, for any A so that E(AY) = β (unbiased), it holds that
Σ_β̃ − Σ_β̂ is positive semi-definite
Note: MSE(θ̂) = E[(θ̂ − θ)²] = [E(θ̂) − θ]² + var(θ̂)
Gauss-Markov theorem: proof
As E(AY) = β, we have to prove that var(c^T β̃) ≥ var(c^T β̂)
Using the linearity, we find, for any (arbitrary) value of β:
β = E(AY) = AE(Y) = A(Xβ) = (AX)β,
This amounts to AX = I: A is a left inverse of the n × p rectangular X
So is the least squares projection (X^T X)^{−1} X^T.
Defining ∆ = A − (X^T X)^{−1} X^T, we thus have ∆X = O
The covariance matrix of β̂ is σ²(X^T X)^{−1} (slide 35), so
var(c^T β̃) − var(c^T β̂) = σ² c^T (AA^T − (X^T X)^{−1}) c
We have
AA^T − (X^T X)^{−1} = [(X^T X)^{−1} X^T + ∆][(X^T X)^{−1} X^T + ∆]^T − (X^T X)^{−1}
= (X^T X)^{−1} X^T ∆^T + ∆X(X^T X)^{−1} + ∆∆^T
= ∆∆^T
The quadratic form c^T ∆∆^T c = ‖∆^T c‖² is always non-negative, thereby completing
the proof
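The structure of the proof can be replayed numerically: any unbiased linear estimator equals the LS one plus a ∆ with ∆X = 0, and the excess covariance is ∆∆^T. A sketch with an invented ∆ (constructed as M(I − P), which always satisfies ∆X = 0):

```python
import numpy as np

rng = np.random.default_rng(13)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # invented design
P = X @ np.linalg.solve(X.T @ X, X.T)

A_ls = np.linalg.solve(X.T @ X, X.T)                 # least squares left inverse
Delta = rng.normal(size=(p, n)) @ (np.eye(n) - P)    # any M(I - P) has Delta @ X = 0
A = A_ls + Delta

assert np.allclose(A @ X, np.eye(p))                 # so A Y is also unbiased
# covariance excess A A^T - (X^T X)^{-1} equals Delta Delta^T, pos. semi-definite
excess = A @ A.T - np.linalg.inv(X.T @ X)
assert np.allclose(excess, Delta @ Delta.T)
assert np.all(np.linalg.eigvalsh(excess) > -1e-10)
```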
Remark: the expected squared loss
Definition: risk or expected squared loss R(β̂) = (1/n) E(‖β̂ − β‖²)
This definition satisfies R(β̂) = (1/n) Tr(Σ_β̂) + (1/n)‖Eβ̂ − β‖²
We know (slide 22): Σ_β̃ − Σ_β̂ pos. semi-def. ⇔ it has non-negative eigenvalues
All eigenvalues non-negative ⇒ their sum (= trace) non-negative:
Σ_β̃ − Σ_β̂ pos. semi-def. ⇒ Tr(Σ_β̃ − Σ_β̂) ≥ 0,
Hence, for least squares estimators:
Property 4, BLUE: Σ_β̃ − Σ_β̂ pos. semi-def. ⇒ R(β̂) ≤ R(β̃)
The risk is sometimes referred to as MSE, but working with the MSE of any linear
combination MSE(c^T β̃) as on slide 36 puts a stronger condition
On the other hand, if we can find an example where R(β̂) ≥ R(β̃), then Σ_β̃ − Σ_β̂
cannot possibly be positive semi-definite
Stein’s phenomenon
The proof of the Gauss-Markov theorem hinges on the fact that
(X^T X)^{−1} X^T ∆^T = 0.
This is due to the linearity and unbiasedness of the estimator AY and to the
specific form of the LS estimator β̂ = (X^T X)^{−1} X^T Y.
Stein’s phenomenon
Biased estimators may have uniformly (i.e., for any true value) lower risk
Stein’s example
• Y = µ + ε; we want to estimate µ (i.e., no regression, just a signal-plus-
noise model)
• ε ∼ N_n(0, σ²I): the noise is normal, homoscedastic and uncorrelated (hence
i.i.d. normal)
The BLUE estimator is µ̂ = Y.
Its risk (slide 38) is R(Y) = σ²
A biased, shrinkage estimator: µ̂_i = Y_i(1 − α/‖Y‖²)
The expected squared loss of a shrinkage estimator
R(µ̂) = (1/n) ∑_{i=1}^n E(Y_i − µ_i)² − (2α/n) E( ∑_{i=1}^n (Y_i − µ_i)Y_i / ‖Y‖² ) + (α²/n) E( ∑_{i=1}^n Y_i² / ‖Y‖⁴ )
= σ² − (2α/n) E( (Y − µ)^T Y / ‖Y‖² ) + (α²/n) E( 1/‖Y‖² )
Now we use Stein’s lemma: if X ∼ N(µ, σ²), then E[(X − µ)h(X)] = σ² E[h′(X)],
with h(x) continuous
Here: X = Y_i, h(X) = h_i(Y) = Y_i/‖Y‖², so h′(X) = ∂h_i(Y)/∂Y_i = (‖Y‖² − 2Y_i²)/‖Y‖⁴
Using the vector h(Y) = Y/‖Y‖²:
E( (Y − µ)^T Y / ‖Y‖² ) = E[(Y − µ)^T h(Y)] = σ² ∑_{i=1}^n E[∂h_i(Y)/∂Y_i]
= σ² ∑_{i=1}^n E( 1/‖Y‖² − 2Y_i²/‖Y‖⁴ ) = σ² E( n/‖Y‖² − 2/‖Y‖² )
Suppose n > 2 and take α = n − 2; then R(µ̂) ≤ σ² = R(Y)
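A small Monte-Carlo sketch of Stein's example (σ = 1 assumed, so α = n − 2; the true mean vector and simulation sizes are invented):

```python
import numpy as np

rng = np.random.default_rng(8)
n, reps = 20, 2000
mu = rng.normal(size=n)                         # arbitrary invented true mean
Y = mu + rng.normal(size=(reps, n))             # sigma = 1

alpha = n - 2                                   # Stein's choice
shrink = 1 - alpha / np.sum(Y**2, axis=1, keepdims=True)
mu_js = shrink * Y                              # mu_i = Y_i (1 - alpha/||Y||^2)

risk_mle = np.mean(np.sum((Y - mu)**2, axis=1)) / n      # approx. sigma^2 = 1
risk_js = np.mean(np.sum((mu_js - mu)**2, axis=1)) / n
assert risk_js < risk_mle                       # shrinkage beats the BLUE on average
```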
A proof of Stein’s Lemma
If X ∼ N(µ, σ²), then E[(X − µ)h(X)] = σ² E[h′(X)], with h(x) continuous (not
necessarily differentiable in all points)
Proof: Let φ_σ(x) be the density of a zero mean normal RV with variance σ²; then
φ′_σ(x) = −xφ_σ(x)/σ², i.e., xφ_σ(x) = −σ²φ′_σ(x).
Then the density of X ∼ N(µ, σ²) is f_X(x) = φ_σ(x − µ), and so,
E[(X − µ)h(X)] = ∫_{−∞}^{∞} (x − µ)h(x)φ_σ(x − µ)dx
= −σ² ∫_{−∞}^{∞} h(x)φ′_σ(x − µ)dx
= −σ² [h(x)φ_σ(x − µ)]_{−∞}^{∞} + σ² ∫_{−∞}^{∞} h′(x)φ_σ(x − µ)dx
= 0 + σ² E[h′(X)]
Stein’s phenomenon: discussion
Remarkable result: a set of observations Y = µ + ε, without any assumed
connection among the components of µ (no regression; no correlation), is better
estimated if the components are taken together
It is a non-linear estimator, because the shrinkage factor depends on the observations
The next example is about a regression model. We will use a linear, but biased
estimator: β̃ = AY, with E(AY) ≠ β
Example 2: ridge regression/Tikhonov regularisation
If X has nearly linearly dependent columns, then X^T X is nearly singular
The variances of β̂ can be arbitrarily large (see slide 35)
We give up unbiasedness for the sake of variance reduction: regularisation of
a (nearly) singular system
The least squares problem min_β ‖Y − Xβ‖₂² leads to the (nearly) singular normal
equation (X^T X)β = X^T Y
We regularise: min_β ‖Y − Xβ‖₂² + ‖Lβ‖₂²
Very often, one takes L = √λ I, so min_β ‖Y − Xβ‖₂² + λ‖β‖₂²
The solution is
β̂ = (X^T X + L^T L)^{−1} X^T Y or β̂ = (X^T X + λI)^{−1} X^T Y
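A sketch of the ridge estimator on a deliberately near-collinear design; λ is an arbitrary illustrative value:

```python
import numpy as np

rng = np.random.default_rng(9)
n, p, lam = 30, 4, 5.0
X = rng.normal(size=(n, p))
X[:, 3] = X[:, 2] + 1e-3 * rng.normal(size=n)   # nearly collinear columns (invented)
beta = np.array([1.0, 0.0, 2.0, -2.0])
Y = X @ beta + rng.normal(size=n)

beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
beta_ls = np.linalg.lstsq(X, Y, rcond=None)[0]

# ridge shrinks the estimate: strictly smaller norm than least squares for lam > 0
assert np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ls)
```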
Covariance matrix of the regularised solution
Consider the regularised estimator as a function of λ: β̂(λ) = (X^T X + λI)^{−1} X^T Y
Then Σ_β̂(λ) = σ²(X^T X + λI)^{−1}(X^T X)(X^T X + λI)^{−1}
Let X^T X = UΛU^T, where Λ is a diagonal matrix; then
Σ_β̂(λ) = σ² U(Λ + λI)^{−1} U^T UΛU^T U(Λ + λI)^{−1} U^T = σ² U(Λ + λI)^{−2} ΛU^T
The effect of the regularisation can be described by
Σ_β̂(0) − Σ_β̂(λ) = σ² U[Λ^{−2} − (Λ + λI)^{−2}]ΛU^T
The diagonal matrix D = [Λ^{−2} − (Λ + λI)^{−2}]Λ has elements
D_{ii} = 1/Λ_{ii} − Λ_{ii}/(Λ_{ii} + λ)² ≥ 0
Hence, Σ_β̂(0) − Σ_β̂(λ) is positive definite, meaning that, for any c,
var(c^T β̂(0)) = c^T Σ_β̂(0) c ≥ c^T Σ_β̂(λ) c = var(c^T β̂(λ))
The bias of the regularised solution
We check if/when the decrease in variance dominates the increase in bias
bias(β̂(λ)) = E(β̂(λ)) − β = (X^T X + λI)^{−1} X^T Xβ − β
= (X^T X + λI)^{−1} [X^T X − (X^T X + λI)]β
= −λ(X^T X + λI)^{−1} β
bias(c^T β̂(λ)) = −λ c^T (X^T X + λI)^{−1} β depends on c and β
No general conclusion
We investigate the risk of the regularised estimator. If the risk for a nonzero λ
is smaller than for λ = 0, then λ = 0 cannot possibly minimise MSE(c^T β̂(λ))
for every c (see slide 38)
The risk of a ridge regression
R(β̂(λ)) = (1/n) Tr(Σ_β̂(λ)) + (1/n)‖Eβ̂(λ) − β‖₂²
= (σ²/n) ∑_{i=1}^n Λ_{ii}/(Λ_{ii} + λ)² + (λ²/n)‖(X^T X + λI)^{−1} β‖₂²
≤ (σ²/n) ∑_{i=1}^n Λ_{ii}/(Λ_{ii} + λ)² + (λ²/n)‖(X^T X + λI)^{−1}‖_F² ‖β‖₂²
= (σ²/n) ∑_{i=1}^n Λ_{ii}/(Λ_{ii} + λ)² + (λ²‖β‖₂²/n) ∑_{i=1}^n 1/(Λ_{ii} + λ)²
= (1/n) ∑_{i=1}^n (Λ_{ii}σ² + λ²‖β‖₂²)/(Λ_{ii} + λ)²
All terms in this sum have a negative derivative with respect to λ at λ = 0, hence:
For any vector β, there exists a parameter λ > 0 so that R(β̂(λ)) < R(β̂(0))
Risk as a function of λ
[Figure: the risk R(λ) as the sum of a decreasing variance term var(λ) and an increasing squared-bias term bias²(λ)]
Note that the exact expression of R(β̂(λ)) depends on the unknown β: this has
to be estimated in practice (see later)
Variance estimation in a linear model
S² = (Y − Xβ̂)^T(Y − Xβ̂)/(n − p) = e^T e/(n − p) (e = residual, see slide 13)
Note that the first factor is transposed (↔ expression of the covariance matrix)
ES² = σ²
Proof: We know µ̂ = Xβ̂ = PY,
so Y − Xβ̂ = (I − P)Y = (I − P)(Xβ + ε) = (I − P)ε
and so
E[(Y − Xβ̂)^T(Y − Xβ̂)] = E[ε^T(I − P)^T(I − P)ε] = E[ε^T(I − P)ε]
= E[Tr(ε^T(I − P)ε)] = E[Tr((I − P)(εε^T))]
= Tr[(I − P)E(εε^T)] = Tr(I − P)σ² = (n − p)σ²
(Using that scalar = trace(scalar) and then cyclic permutation within the trace, see
slide 33, plus the properties of P, see slide 10)
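A Monte-Carlo sketch confirming ES² = σ² (the design, σ² and replication count are invented):

```python
import numpy as np

rng = np.random.default_rng(10)
n, p, sigma2, reps = 25, 3, 4.0, 4000
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, -1.0, 0.5])
P = X @ np.linalg.solve(X.T @ X, X.T)

eps = np.sqrt(sigma2) * rng.normal(size=(reps, n))
Y = X @ beta + eps
resid = Y @ (np.eye(n) - P)               # residuals e = (I - P) Y, one row per replicate
S2 = np.sum(resid**2, axis=1) / (n - p)   # unbiased variance estimator

assert abs(np.mean(S2) - sigma2) < 0.2    # E[S^2] = sigma^2, up to simulation noise
```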
The sum of squared residuals is a biased estimator
Interpretation
The mean squared residual e^T e/n underestimates the expected squared error
σ² = E(ε_i²). This is because the least squares method minimizes the mean
squared residual, and so even the true model β has a larger mean squared
residual than the estimated model.
The minimum mean squared residual is therefore always smaller than the
mean squared error in the true model:
(1/n)‖Y − µ̂‖² ≤ (1/n)‖Y − µ‖²
⇒ (1/n)E(‖Y − µ̂‖²) ≤ (1/n)E(‖Y − µ‖²)
⇒ (1/n)E(e^T e) ≤ (1/n)E(ε^T ε) = σ²
Distribution of the variance estimator
Theorem
(n − p)S²/σ² ∼ χ²_{n−p}
• There exists an orthogonal transform U so that I − P = U^T ΛU with Λ =
diag(1, . . . , 1, 0, . . . , 0) (see slide 13)
• We know from the proof of ES² = σ² that
(n − p)S²/σ² = (Y − Xβ̂)^T(Y − Xβ̂)/σ² = (ε/σ)^T (I − P)(ε/σ) = (ε/σ)^T U^T ΛU(ε/σ)
• ε/σ is a vector of independent, standard normal random variables: Σ_{ε/σ} =
I. After application of the orthogonal transform Z = U(ε/σ), we have Σ_Z =
UIU^T = I, so Z is also a vector of independent, standard normal random
variables Z_i, so (n − p)S²/σ² = ∑_{i=1}^{n−p} Z_i² ∼ χ²_{n−p}
Hypothesis testing - likelihood ratio
We want to test H_0: Aβ = c with A ∈ R^{q×p}, where q ≤ p and the rank of A equal
to its number of rows q
Example: c = 0 and A = [ I  0 ]. This leads to the hypothesis test of
whether the first q coefficients β_j equal zero (other subsets of the β_j are possible)
As σ is unknown, it is a component of the parameter vector θ = (β, σ²) to be
estimated.
Let Ω = {θ} be the space of all values that the parameter vector can
take and Ω_0 = {θ ∈ Ω | Aβ = c} the subspace of values that satisfy the null
hypothesis.
Likelihood ratio
Λ = max_{θ_0 ∈ Ω_0} L(θ_0; Y) / max_{θ ∈ Ω} L(θ; Y)
Likelihood ratio
Theorem (from mathematical statistics)
If n → ∞, then, under H_0, −2 log Λ →d χ²_{ν−ν_0}
ν is the dimension of the space Ω, ν_0 is the dimension of Ω_0: in our case ν = p,
ν_0 = p − q
Maximum likelihood estimator in multiple regression
L(β, σ²; Y) = (2πσ²)^{−n/2} e^{−(1/(2σ²))(Y−Xβ)^T(Y−Xβ)}
log L(β, σ²; Y) = −(n/2) log(2πσ²) − (1/(2σ²))(Y − Xβ)^T(Y − Xβ)
Taking derivatives for σ² yields
σ̂² = (1/n)(Y − Xβ̂)^T(Y − Xβ̂) = ((n − p)/n) S², where β̂ follows from taking
derivatives for β_j (= least squares as discussed before; but we don’t need this for
the subsequent analysis)
Plugging in leads to L(β̂, σ̂²; Y) = (2πσ̂²)^{−n/2} e^{−n/2}
Likelihood ratio (2)
Maximum likelihood estimator within Ω_0
Maximise L(β, σ²; Y) under the restriction that Aβ = c
Technique: Lagrange multipliers
Find the stationary points of LL_0(β, σ², λ) = log L(β, σ²; Y) + 2λ^T(Aβ − c)
The term with λ^T does not depend on the parameter σ², so we get the same
expression as before for σ̂₀²:
σ̂₀² = (1/n)(Y − Xβ̂_0)^T(Y − Xβ̂_0)
where β̂_0 is now found by ∇_β LL_0(β, σ², λ) = 0
Use ∇_x ‖Bx − b‖₂² = 2B^T(Bx − b) and ∇_x(d^T x) = d (mind the transpose!)
2X^T(Xβ − Y) − 2A^Tλ = 0, which is (X^T X)β = X^T Y + A^Tλ, or β = β̂ + (X^T X)^{−1} A^Tλ
Substitution into Aβ = c leads to elimination of λ:
β̂_0 = β̂ + (X^T X)^{−1} A^T [A(X^T X)^{−1} A^T]^{−1} (c − Aβ̂)
And for this value of β̂_0, the likelihood still simplifies to L(β̂_0, σ̂₀²; Y) = (2πσ̂₀²)^{−n/2} e^{−n/2}
Likelihood ratio (3)
Expression for the likelihood ratio:
Λ = (σ̂²/σ̂₀²)^{n/2}
Large values of −2 log Λ, i.e., small values of Λ, lead to rejection of the null
hypothesis.
Variance estimator under reduced model (1)
Starting from the reduced model H_0: Aβ = c with A ∈ R^{q×p}, where q ≤ p and the
rank of A equal to its number of rows q
If H_0 holds, then β̂_0 = β̂ + (X^T X)^{−1} A^T [A(X^T X)^{−1} A^T]^{−1} (c − Aβ̂)
Define µ̂ = Xβ̂ and µ̂_0 = Xβ̂_0,
then µ̂_0 − µ̂ = X(β̂_0 − β̂) = X(X^T X)^{−1} A^T [A(X^T X)^{−1} A^T]^{−1} (c − Aβ̂)
and (Y − µ̂)^T(µ̂_0 − µ̂)
= [(I − X(X^T X)^{−1} X^T)Y]^T (µ̂_0 − µ̂)
= Y^T (I − X(X^T X)^{−1} X^T) X(X^T X)^{−1} A^T [A(X^T X)^{−1} A^T]^{−1} (c − Aβ̂)
= Y^T [(I − X(X^T X)^{−1} X^T) X(X^T X)^{−1}] · A^T [A(X^T X)^{−1} A^T]^{−1} (c − Aβ̂)
= Y^T [X(X^T X)^{−1} − X(X^T X)^{−1}(X^T X)(X^T X)^{−1}] · A^T [A(X^T X)^{−1} A^T]^{−1} (c − Aβ̂)
= Y^T · 0 · A^T [A(X^T X)^{−1} A^T]^{−1} (c − Aβ̂) = 0
Conclusion for Y − µ̂_0 = (Y − µ̂) + (µ̂ − µ̂_0): ‖Y − µ̂_0‖² = ‖Y − µ̂‖² + ‖µ̂_0 − µ̂‖²
Variance estimator under reduced model (2)
Conclusion from the previous slide:
‖Y − µ̂_0‖² = ‖Y − µ̂‖² + ‖µ̂_0 − µ̂‖²
Interpretation (Pythagoras): µ̂_0 and µ̂ are projections of Y onto the column space
of X, µ̂ being the orthogonal projection. Hence, the triangle with vertices µ̂_0, µ̂
and Y has a right angle at µ̂.
So
σ̂₀² = (1/n)‖Y − µ̂_0‖²
= (1/n)‖Y − µ̂‖² + (1/n)‖µ̂_0 − µ̂‖²
= σ̂² + (1/n)‖X(X^T X)^{−1} A^T [A(X^T X)^{−1} A^T]^{−1} (c − Aβ̂)‖²
= σ̂² + (1/n)(c − Aβ̂)^T [A(X^T X)^{−1} A^T]^{−T} A(X^T X)^{−T} X^T X(X^T X)^{−1} A^T [A(X^T X)^{−1} A^T]^{−1} (c − Aβ̂)
= σ̂² + (1/n)(c − Aβ̂)^T [A(X^T X)^{−1} A^T]^{−1} [A(X^T X)^{−1} A^T] [A(X^T X)^{−1} A^T]^{−1} (c − Aβ̂)
Conclusion: σ̂₀² − σ̂² = (1/n)(c − Aβ̂)^T [A(X^T X)^{−1} A^T]^{−1} (c − Aβ̂)
An equivalent F -test
The distribution of −2 log Λ is only asymptotically χ². The expression of −2 log Λ
leads to logarithms of sample variances. Intuitively, a ratio of sample variances
suggests the use of an F-test, which is exact, and without the need of taking
a log.
∼ ANOVA
Is the “extra sum of squares” (the part of the variance that is explained by
giving up the restrictions imposed by the null hypothesis) significant?
Formally: F = ((σ̂₀² − σ̂²)/q) / (σ̂²/(n − p))
We have: F = ((n − p)/q) · (σ̂₀² − σ̂²)/σ̂² = ((n − p)/q)(σ̂₀²/σ̂² − 1) = ((n − p)/q)(Λ^{−2/n} − 1)
We now show that, under H_0, F ∼ F_{q,n−p}
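A Monte-Carlo sketch of the F statistic under H_0, comparing its empirical mean to the mean (n − p)/(n − p − 2) of the F_{q,n−p} distribution (design, hypothesis and simulation sizes are invented):

```python
import numpy as np

rng = np.random.default_rng(11)
n, p, q, reps = 30, 4, 2, 3000
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
A = np.zeros((q, p)); A[0, 1] = 1.0; A[1, 2] = 1.0   # H0: beta_1 = beta_2 = 0 (c = 0)
beta = np.array([1.0, 0.0, 0.0, 0.5])                # H0 holds

XtX_inv = np.linalg.inv(X.T @ X)
middle = np.linalg.inv(A @ XtX_inv @ A.T)            # [A (X^T X)^{-1} A^T]^{-1}
F = np.empty(reps)
for r in range(reps):
    Y = X @ beta + rng.normal(size=n)                # sigma = 1
    b = XtX_inv @ (X.T @ Y)
    S2 = np.sum((Y - X @ b)**2) / (n - p)
    F[r] = (A @ b) @ middle @ (A @ b) / (q * S2)

# under H0, F ~ F_{q, n-p}, whose mean is (n - p)/(n - p - 2)
assert abs(np.mean(F) - (n - p) / (n - p - 2)) < 0.1
```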
The proof for the F -test
We have
1. σ̂² = ((n − p)/n) S² = (Y − Xβ̂)^T(Y − Xβ̂)/n
2. σ̂₀² − σ̂² = (Aβ̂ − c)^T [A(X^T X)^{−1} A^T]^{−1} (Aβ̂ − c)/n
We have to prove that
1. nσ̂²/σ² = (n − p)S²/σ² ∼ χ²_{n−p} (done before)
2. under H_0: n(σ̂₀² − σ̂²)/σ² = (Aβ̂ − c)^T [A(X^T X)^{−1} A^T]^{−1} (Aβ̂ − c)/σ² ∼ χ²_q
3. S² and β̂ are independent
Therefore, all functions of S² are also independent from all functions of β̂.
More precisely, σ̂² and σ̂₀² − σ̂² are mutually independent.
The proof for the F -test: distribution under H0
As β̂ ∼ N(β, (X^T X)^{−1}σ²),
we have that Aβ̂ − c is a vector of length q, which is normally distributed with
expected value Aβ − c = 0 (under H_0)
and with covariance matrix A(X^T X)^{−1} A^T σ².
Then (property of covariance matrices)
(Aβ̂ − c)^T [A(X^T X)^{−1} A^T]^{−1} (Aβ̂ − c)/σ² ∼ χ²_q
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.59
The proof for the F-test: independence of S² and β̂

We show that \(Y-X\hat\beta\) is independent from \(\hat\beta\). As both are normally distributed, it suffices to show that they are uncorrelated.

The covariance matrix (mind the positions of the transposes — outer product):
\[ \mathrm{cov}(Y-X\hat\beta,\hat\beta) = E\left[(Y-X\hat\beta)\hat\beta^T\right] - E(Y-X\hat\beta)\cdot E\hat\beta^T \]
\[ = E\left[(Y-X(X^TX)^{-1}X^TY)\cdot((X^TX)^{-1}X^TY)^T\right] - 0\cdot\beta^T \]
\[ = (I-X(X^TX)^{-1}X^T)\cdot E(YY^T)\cdot((X^TX)^{-1}X^T)^T \]
\[ = (I-X(X^TX)^{-1}X^T)\cdot(\sigma^2 I + X\beta\beta^TX^T)\,X(X^TX)^{-1} \]
\[ = (I-X(X^TX)^{-1}X^T)\cdot(X(X^TX)^{-1}\sigma^2 + X\beta\beta^T) = 0 \]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.60
Simultaneous confidence

We know that \(\hat\beta\) and \(S^2\) are independent, so it follows immediately that
\[ \frac{(\hat\beta-\beta)^T(X^TX)(\hat\beta-\beta)/p}{S^2} \sim F_{p,n-p} \]
(This is the F-statistic for the full model, i.e., for the test \(H_0\): all \(\beta_i\) zero, so \(q = p\).)

So
\[ \left\{\beta:\ \frac{(\hat\beta-\beta)^T(X^TX)(\hat\beta-\beta)}{S^2} \le p\,F_{p,n-p,\alpha}\right\} \]
is a \(1-\alpha\) confidence region for \(\beta\).

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.61
Testing a single linear combination of β

\[ a^T\hat\beta \sim N\left(a^T\beta,\ a^T(X^TX)^{-1}a\,\sigma^2\right) \quad\text{so}\quad \frac{a^T\hat\beta - a^T\beta}{S\sqrt{a^T(X^TX)^{-1}a}} \sim t_{n-p} \]

So
\[ P\left(\frac{|a^T(\hat\beta-\beta)|}{S\sqrt{a^T(X^TX)^{-1}a}} \ge t_{n-p,\alpha/2}\right) = \alpha \]

Alternative argument (for \(H_0: a^T\beta = 0\)): take \(A = a^T\) on slide 51, and \(c = 0\); then \(q = 1\), and \(F\) defined on slide 57 equals
\[ F = \frac{(a^T\hat\beta)^2}{S^2\,a^T(X^TX)^{-1}a} \sim F_{1,n-p} \quad\text{(under } H_0\text{)} \]

It holds that for \(B \in \{-1,1\}\) independent from \(F\):
\[ F \sim F_{1,n-p} \iff B\sqrt{F} \sim t_{n-p} \]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.62
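A minimal numerical sketch of this t-test and its link with the F-statistic; the design, coefficients and random seed below are illustrative choices, not taken from the course:

```python
import numpy as np

# Illustrative simulated design (not from the course notes)
rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, 0.0])            # the third coefficient is truly zero
Y = X @ beta + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
S2 = np.sum((Y - X @ beta_hat) ** 2) / (n - p)   # unbiased variance estimator

a = np.array([0.0, 0.0, 1.0])                # test H0: a^T beta = 0
t_stat = (a @ beta_hat) / np.sqrt(S2 * (a @ XtX_inv @ a))

# The F-statistic with A = a^T, c = 0, q = 1 equals t_stat**2 exactly
F_stat = (a @ beta_hat) ** 2 / (S2 * (a @ XtX_inv @ a))
```

The algebraic identity \(F = t^2\) for \(q=1\) holds exactly, whatever the data.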
Testing multiple linear combinations of β (1)

We now verify that the simultaneous confidence interval indeed follows from the F-test for the full model.

We search for \(M_\alpha\) so that
\[ P\left(\max_a \frac{|a^T(\hat\beta-\beta)|}{S\sqrt{a^T(X^TX)^{-1}a}} \ge M_\alpha\right) = \alpha \]

We square the expression and get
\[ \left[a^T(\hat\beta-\beta)\right]^2 = \left[a^T(\hat\beta-\beta)\right]^T\left[a^T(\hat\beta-\beta)\right] = (\hat\beta-\beta)^T(aa^T)(\hat\beta-\beta) \]

The value of the expression in \(a\) is the same as in \(ra\) with \(r \in \mathbb{R}\setminus\{0\}\), so we can maximise under the normalisation \(a^T(X^TX)^{-1}a = 1\).

\(X^TX\) is a \(p\times p\) symmetric, full-rank matrix, so it can be decomposed as \(X^TX = RR^T\) with \(R\) a \(p\times p\) invertible matrix. So we can write \(a^T(X^TX)^{-1}a = a^T(R^{-T}R^{-1})a = u^Tu\) with \(u = R^{-1}a\), i.e., \(a = Ru\).

We maximise over \(u\):
\[ \max_a \frac{(\hat\beta-\beta)^T aa^T(\hat\beta-\beta)}{a^T(X^TX)^{-1}a} = \max_u \frac{(\hat\beta-\beta)^T Ruu^TR^T(\hat\beta-\beta)}{u^Tu} \]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.63
Testing multiple linear combinations of β (2)

We maximise over \(u\):
\[ \max_a \frac{(\hat\beta-\beta)^T aa^T(\hat\beta-\beta)}{a^T(X^TX)^{-1}a} = \max_u \frac{(\hat\beta-\beta)^T Ruu^TR^T(\hat\beta-\beta)}{u^Tu} \]

The numerator contains a squared inner product of \(u\) with \(R^T(\hat\beta-\beta)\). According to the Cauchy–Schwarz inequality:
\[ (\hat\beta-\beta)^T Ruu^TR^T(\hat\beta-\beta) \le \|u\|^2\cdot\|R^T(\hat\beta-\beta)\|^2, \]
so
\[ \max_u \frac{(\hat\beta-\beta)^T Ruu^TR^T(\hat\beta-\beta)}{u^Tu} = \|R^T(\hat\beta-\beta)\|^2 = (\hat\beta-\beta)^T RR^T(\hat\beta-\beta) \]

Since \( (\hat\beta-\beta)^T RR^T(\hat\beta-\beta)/S^2 \sim p\,F_{p,n-p} \), we can take \( M_\alpha = \sqrt{p\,F_{p,n-p,\alpha}} \).

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.64
Residuals and outliers

Residuals (see also slides 7 and 13):
\[ e = Y-\hat\mu = (I-P)Y, \qquad \Sigma_e = \sigma^2(I-P) \]
This allows us to standardise/studentise the residuals and then test for normality.

Cook's distance; influence measure; influence points
It measures the influence of observation \(i\) on the eventual estimated vector:
\[ D_i = \frac{(\hat\beta_{(i)}-\hat\beta)^T(X^TX)(\hat\beta_{(i)}-\hat\beta)/p}{S^2} \]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.65
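A small sketch of Cook's distance, computed directly from the definition above (refitting without observation i) and checked against the standard leverage-based closed form; the simulated data are illustrative, not from the course:

```python
import numpy as np

# Illustrative simulated data (not from the course notes)
rng = np.random.default_rng(1)
n, p = 25, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
e = Y - X @ beta_hat
S2 = e @ e / (n - p)
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)      # leverages h_ii

# Cook's distance via the definition: refit without observation i
D_direct = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.lstsq(X[keep], Y[keep], rcond=None)[0]
    d = b_i - beta_hat
    D_direct[i] = d @ (X.T @ X) @ d / (p * S2)

# Standard closed form, algebraically equivalent to the definition
D_formula = e**2 * h / (p * S2 * (1 - h) ** 2)
```

The closed form avoids the n refits and shows how leverage and residual size combine into influence.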
Transformations of the response

Model:
\[ Y^{(\lambda)} = X\beta+\varepsilon \quad\text{with}\quad Y^{(\lambda)}_i = \frac{Y_i^\lambda-1}{\lambda} \]

General: transformation of a random variable. If \(Y = g(X)\), then
\[ f_Y(y) = f_X(x(y))\,|J(y)| = f_X(g^{-1}(y))\,|J(y)| \]
with \( J = \det\left(\dfrac{\partial x_i}{\partial y_j}\right) \) and \( \dfrac{\partial x_i}{\partial y_j} = \dfrac{\partial g_i^{-1}(y)}{\partial y_j} \).

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.66
Transformations of the response: our case

In our case \(Y^{(\lambda)} \sim N(X\beta,\sigma^2 I)\) takes the role of \(X\), so \(\partial x_i/\partial y_j = 0\) unless \(j = i\); in that case \(\partial x_i/\partial y_i = y_i^{\lambda-1}\).

So \( J = \prod_{i=1}^n y_i^{\lambda-1} \).

The likelihood becomes
\[ L(\beta,\sigma^2,\lambda;Y) = \frac{1}{(2\pi\sigma^2)^{n/2}}\, e^{-\frac{1}{2\sigma^2}(Y^{(\lambda)}-X\beta)^T(Y^{(\lambda)}-X\beta)}\ \prod_{i=1}^n Y_i^{\lambda-1} \]

Differentiation with respect to \(\sigma^2\) yields
\[ \hat\sigma^2 = \frac{1}{n}(Y^{(\lambda)}-X\hat\beta)^T(Y^{(\lambda)}-X\hat\beta) \]

Plugging in:
\[ L(\hat\beta,\hat\sigma^2,\lambda;Y) = \frac{1}{(2\pi\hat\sigma^2)^{n/2}}\, e^{-n/2}\ \prod_{i=1}^n Y_i^{\lambda-1} \]

This function can be evaluated for different values of \(\lambda\) (note that \(\hat\sigma^2\) is in turn also a function of \(\lambda\)).

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.67
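The profile likelihood over λ described above can be sketched as follows; the simulated data (constructed so that the square root of Y is linear in x, i.e., λ = 0.5 should fit well) and the grid of λ values are illustrative assumptions:

```python
import numpy as np

# Illustrative data: sqrt(Y) is linear in x, so lambda near 0.5 should win
rng = np.random.default_rng(2)
n = 80
x = np.linspace(1.0, 10.0, n)
X = np.column_stack([np.ones(n), x])
Y = (2.0 + 0.5 * x + rng.normal(scale=0.1, size=n)) ** 2

def profile_loglik(lam):
    # Box-Cox transform of the response
    Ylam = np.log(Y) if lam == 0 else (Y**lam - 1.0) / lam
    beta_hat = np.linalg.lstsq(X, Ylam, rcond=None)[0]
    sigma2 = np.mean((Ylam - X @ beta_hat) ** 2)       # MLE of sigma^2
    # log L(beta_hat, sigma2_hat, lam; Y), including the Jacobian term
    return (-0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * n
            + (lam - 1.0) * np.sum(np.log(Y)))

lams = np.linspace(0.1, 1.5, 29)
ll = np.array([profile_loglik(l) for l in lams])
lam_best = lams[np.argmax(ll)]
```

In practice one plots `ll` against `lams` and picks λ at (or near) the maximum.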
PART 2: VARIABLE AND MODEL SELECTION
2.1 Motivation and setup, slide 69
2.2 Selection criteria, slide 81
2.3 Minimisation of the information criterion, slide ??
© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.68
2.1 Model selection: Motivation and setup
• Motivation slide 70
• Model selection and variable selection
• The use of likelihood in model selection
© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.69
Motivation (1): full or large models are not the best

The analysis until now assumed that the full (global) model is correct. For instance, the expressions \(E\hat\beta = \beta\) and \(\Sigma_{\hat\beta} = \sigma^2(X^TX)^{-1}\) are true only if \(Y\) can indeed be written as \(X\beta + \varepsilon\) and no explanatory variable is missing.

In order to check whether a specific covariate \(x_i\) should be considered as a candidate in the global model, one could "play safe": put it in the full model and let the hypothesis test \(H_0: \beta_i = 0\) decide whether or not it is significant.

But if we have to play safe for a very wide range of possible explanatory variables, a lot of parameters need to be estimated. This has two effects:
1. The estimators have larger standard errors
2. Few degrees of freedom are left for the estimation of the variance σ²

So: larger errors AND more difficulty in assessing the uncertainty: hypothesis tests will have weak power, and really significant parameters may be left undiscovered.

Occam's razor: keep the model as simple as possible, but no simpler.

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.70
Motivation (1bis): uncontrolled variance inflation

The variance in the full model is unnecessarily large (previous slide).
Sometimes, it is also arbitrarily large: \(\Sigma_{\hat\beta} = \sigma^2(X^TX)^{-1}\) has large entries if \(X^TX\) is close to being singular.

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.71
Motivation (2): the true model is not always the best model

Previous slide: the full/largest model is not the best to work in.
This slide: the true/physical model (= DGP = data generating process) may not be the best statistical model.

Example: assume that two quantities, \(x\) and \(\mu\), are related by the physical law
\[ \mu = \beta_0+\beta_1 x+\beta_2 x^2 \]
We observe \(Y_i = \mu_i+\varepsilon_i\) for \(\mu_i = \beta_0+\beta_1 x_i+\beta_2 x_i^2\).

• Estimate \(y_0 = \beta_0+\beta_1 x_0+\beta_2 x_0^2\) for a given \(x_0\).
If \(\beta_2\) is (very) small, the model \(\mu_i = \beta_0+\beta_1 x_i\) is incorrect and introduces bias, but an estimator \(\hat\beta_2\) may add more variance than it reduces bias.
• To estimate \(\beta_2\) or \(\beta_2/\beta_1\), we do need an estimator \(\hat\beta_2\).

Conclusions
1. The "best" statistical model may not be the true, physical model.
2. The "best" statistical model depends on "what you want to estimate". This is called the focus.

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.72
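A Monte Carlo sketch of the first bullet: when β₂ is small relative to the noise, the misspecified linear model can beat the true quadratic model in prediction error. All numbers (design, β₂, noise level, number of replications) are illustrative assumptions:

```python
import numpy as np

# Illustrative simulation: tiny quadratic term, moderate noise
rng = np.random.default_rng(3)
n, sigma, b2 = 20, 1.0, 0.02
x = np.linspace(0.0, 1.0, n)
mu = 1.0 + 2.0 * x + b2 * x**2                    # true (physical) model
X_lin = np.column_stack([np.ones(n), x])          # reduced, biased model
X_quad = np.column_stack([np.ones(n), x, x**2])   # true model

def avg_pe(X, reps=2000):
    # Monte Carlo estimate of PE = E ||X beta_hat - mu||^2 / n
    total = 0.0
    for _ in range(reps):
        y = mu + rng.normal(scale=sigma, size=n)
        beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        total += np.mean((X @ beta_hat - mu) ** 2)
    return total / reps

pe_lin, pe_quad = avg_pe(X_lin), avg_pe(X_quad)
```

Here the extra estimated parameter costs roughly σ²/n of variance, far more than the (negligible) squared bias of the linear fit removes.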
Motivation (3): the larger the model, the more false positives

Large models raise more multiple-testing issues.
(Even if, in large models, the estimators had the same variance as in small models, there would be more false discoveries.)

Motivation (4): variable selection reduces collinearity

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.73
"All models are wrong, but some are useful"

George Box, Journal of the American Statistical Association, 1976:
"Since all models are wrong the scientist cannot obtain a 'correct' one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity."

Conclusion: full/large models are not the best
• Large standard errors in estimators
• Few degrees of freedom to estimate the noise variance
• As a result of these two: hypothesis tests with low power
• More false positives
• Even if the true model is large...

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.74
Objective and situation of model selection
• A procedure to limit the size of the full model
• before estimation + testing
• Compare nested models but also totally different models
• ↔ testing: no null hypothesis at input, and no statistical evidence at output
• Testing: asymmetry in H0 ↔ H1; model selection is symmetric between the
candidate models
• Compare model selection ∼ forensic profiling: profiling is not a proof of
guilt.
After profiling, suspects are innocent until proven guilty (=H0, presumption
of innocence → hypothesis testing)
• Information criterion expresses how well the data can be described in a
given model
© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.75
Model selection and variable selection
Variable selection
• Selects covariates
• Covariates require parameter βi to be estimated
Model selection
• selects statistical model, given the covariates
– polynomial, exponential, periodic, . . .
– How many terms (degrees, frequencies)
(variables-within-model selection)
• Model for distribution of observations
– Example: exponential distribution versus Weibull
– Complicated models introduce parameters not linked to covariates
• Model for interactions between covariates
© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.76
The use of likelihood in model selection

We know likelihood from statistical inference, where it is used for
• parameter estimation within a given model by MLE, maximum likelihood estimation
• hypothesis testing, e.g., the χ² test for goodness of fit, likelihood ratio tests
• validation of the model with significant parameters after testing (e.g., R² statistics)

We want to use likelihood in model selection, i.e., before statistical inference, to compare different models.

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.77
The use of likelihood in model selection (see also slide 103)

• Working principle: choose the model in which the maximum likelihood estimator maximizes the expected (log-)likelihood with respect to the true, underlying data generating process (DGP) = true/physical model (see also slide 72)
• Problem: the DGP is unknown, so the expected likelihood cannot be computed
• The observed likelihood measures closeness of fit within the model under consideration
• Problem: large models lead to better fit, but more noisy parameter estimators (this is exactly one of the arguments for model selection)
• Solution: the likelihood has to be adjusted or penalized: a closeness–complexity trade-off (where likelihood = closeness)
• The closeness–complexity trade-off is equivalent (in expected value) to the bias–variance trade-off: both trade-offs measure the expected distance to the DGP/true model = expected likelihood

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.78
Example of expected log-likelihood: normal model with known variance

Let \(Y \sim \mathrm{N.I.D.}(\mu,\sigma^2 I)\), with \(\sigma^2\) known; then use the expression of the likelihood on slide 34:
\[ E\left[-\log L(\hat\beta;Y)\right] = \frac{1}{2\sigma^2}E\left[(Y-\hat\mu)^T(Y-\hat\mu)\right] + \frac{n}{2}\log(2\pi\sigma^2) \]
\[ = \frac{1}{2\sigma^2}E\|Y-\hat\mu\|_2^2 + \frac{n}{2}\log(2\pi\sigma^2) \]
\[ = \frac{1}{2\sigma^2}\|\hat\mu-\mu\|_2^2 + \frac{1}{2\sigma^2}E\|Y-\mu\|_2^2 + \frac{n}{2}\log(2\pi\sigma^2) \]
\[ = \frac{1}{2\sigma^2}\|\hat\mu-\mu\|_2^2 + \frac{1}{2\sigma^2}n\sigma^2 + \frac{n}{2}\log(2\pi\sigma^2) \]

Definition (prediction error): replace \(\hat\mu\) inside an expectation:
\[ PE(\hat\beta) = \frac{1}{n}E\left(\|\hat\mu-\mu\|^2\right) = \frac{1}{n}E\left(\|X\hat\beta-X\beta\|^2\right) \]
See also slide 86.

[Figure: PE as a function of the model size in nested model selection]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.79
Prediction error and risk in linear models

Definition (1) (slide 38): risk or expected squared loss
\[ R(\hat\beta) = \frac{1}{n}E\left(\|\hat\beta-\beta\|^2\right) \]
This definition satisfies
\[ R(\hat\beta) = \frac{1}{n}\sum_{j=1}^{m}\left[\mathrm{var}(\hat\beta_j) + \left(E\hat\beta_j-\beta_j\right)^2\right] \]
Result slide 38: among all unbiased estimators, BLUE has the lowest risk.

Definition (2): prediction error (PE)
\[ PE(\hat\beta) = \frac{1}{n}E\left(\|\hat\mu-\mu\|^2\right) = \frac{1}{n}E\left(\|X\hat\beta-X\beta\|^2\right) \]
This definition satisfies
\[ PE(\hat\beta) = \frac{1}{n}\sum_{i=1}^{n}\left[\mathrm{var}(\hat\mu_i) + \left(E\hat\mu_i-\mu_i\right)^2\right] \]

\( \Sigma_{\tilde\mu} - \Sigma_{\hat\mu} = X(\Sigma_{\tilde\beta}-\Sigma_{\hat\beta})X^T \) is positive semi-definite, so among all unbiased estimators, BLUE has the lowest PE.

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.80
2.2 Model selection methods and criteria
Expected likelihood (slide 78) is just one possible measure of quality, but other
(expected) distances with respect to the DGP are similar. We develop several
variants
• Adjusted R2: slides 82 and further on
• Prediction error and Mallows’s Cp: slides 86 and further on
• Kullback-Leibler distance and Akaike’s Information Criterion: slides
101 and further on
• The Bayesian Information Criterion or Schwarz Criterion: slides 114
and further on
• Generalised Cross Validation: a criterion mainly used in sparse variable
selection; see slide 140
© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.81
Example of a validation of a given model after inference: R²

Least squares is an orthogonal projection: with \(\hat\mu = X\hat\beta\) we have \((Y-\hat\mu)^T\hat\mu = 0\) and also \((Y-\hat\mu)^T\mathbf{1} = 0\), since \(\mathbf{1}\) is the first column of \(X\) and the residual is orthogonal to all columns of \(X\). So
\[ \sum_{i=1}^n (Y_i-\hat\mu_i)(\hat\mu_i-\bar Y) = 0 \]
\[ \sum_{i=1}^n (Y_i-\bar Y)^2 = \sum_{i=1}^n (Y_i-\hat\mu_i)^2 + \sum_{i=1}^n (\hat\mu_i-\bar Y)^2 \]
SST = SSR + SSE, with
• Sum of squares Total: \( SST = \sum_{i=1}^n (Y_i-\bar Y)^2 \)
• Sum of squares Regression: \( SSR = \sum_{i=1}^n (\hat\mu_i-\bar Y)^2 \)
• Sum of squares of Errors of prediction, or sum of squared residuals: \( SSE = \sum_{i=1}^n (Y_i-\hat\mu_i)^2 = \sum_{i=1}^n e_i^2 \), where (see also p.7, 13) \( e_i = Y_i-\hat\mu_i \)

We define the coefficient of determination:
\[ R^2 = \frac{SSR}{SST} = 1-\frac{SSE}{SST} \]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.82
Using R² before inference

R² is a measure of how well the model explains the variance of the response variables. However, a submodel of a full model always has a smaller R², even if the full model only adds unnecessary, insignificant parameters to the submodel.

Therefore R² cannot be used before inference: it would just promote large models.

Note: the denominator in R² is the same for all models. The numerator is (up to a constant) the log-likelihood of a normal model with known variance. So R² measures the likelihood of a model. The more parameters the model has, the more likely the observations become, because there are more degrees of freedom to make the observations fit the model.

Adjusted coefficient of determination:
\[ \bar R^2 = 1-\frac{SSE/(n-p)}{SST/(n-1)} \qquad\text{or}\qquad 1-\bar R^2 = \frac{n-1}{n-p}\,(1-R^2) \]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.83
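A short sketch of both quantities on two nested models; the data and the five irrelevant extra covariates are simulated illustrations, not from the course:

```python
import numpy as np

# Illustrative data: only one covariate actually matters
rng = np.random.default_rng(4)
n = 40
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
junk = rng.normal(size=(n, 5))               # insignificant extra covariates

def r2_stats(X):
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    sse = np.sum((y - X @ beta_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    p = X.shape[1]
    r2 = 1.0 - sse / sst
    r2_adj = 1.0 - (sse / (n - p)) / (sst / (n - 1))
    return r2, r2_adj

X_small = np.column_stack([np.ones(n), x])
X_big = np.column_stack([X_small, junk])
r2_s, adj_s = r2_stats(X_small)
r2_b, adj_b = r2_stats(X_big)
```

The plain R² can only increase in the larger model; the adjusted R² penalises the lost degrees of freedom through the (n−1)/(n−p) factor.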
Adjusted R²: interpretation

We know:
\[ S^2 = \frac{(Y-\hat\mu)^T(Y-\hat\mu)}{n-p} = \frac{1}{n-p}\sum_{i=1}^n(Y_i-\hat\mu_i)^2 = \frac{1}{n-p}SSE \]
and
\[ \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n(Y_i-\hat\mu_i)^2 = \frac{n-p}{n}S^2 = \frac{1}{n}SSE = \frac{1}{n}(SST-SSR) = \frac{1}{n}SST\,(1-R^2) \]

Suppose we have two nested models, i.e., one model containing all covariates of the other (that is: a full model with \(p\) covariates and a reduced model with \(q\) restrictions, so with \(p-q\) out of \(p\) free covariates). The F-value in testing the significance of the full model is then:
\[ F = \frac{(\hat\sigma^2_{p-q}-\hat\sigma^2_p)/q}{\hat\sigma^2_p/(n-p)} \]
Dividing numerator and denominator by SST leads to:
\[ F = \frac{(R^2_p-R^2_{p-q})/q}{(1-R^2_p)/(n-p)} = \frac{(1-R^2_{p-q})-(1-R^2_p)}{q\,(1-R^2_p)/(n-p)} \]
We rewrite this in terms of the adjusted R²:
\[ F = \frac{n-p}{q}\left[\frac{1-\bar R^2_{p-q}}{1-\bar R^2_p}-1\right] + \frac{1-\bar R^2_{p-q}}{1-\bar R^2_p} \]
from which it follows that \( \bar R^2_{p-q} \le \bar R^2_p \iff F \ge 1 \).

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.84
Adjusted R2: interpretation (2)
The criterion that decides which model has higher quality is thus equivalent
to a “hypothesis test” with critical value F = 1, instead of Fq,n−p,α. This illus-
trates the symmetry of the model selection criterion, compared to the testing
procedure, which uses null and alternative hypotheses, with clearly unequal
roles.
© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.85
Mallows's Cp statistic: prediction error

Criterion: choose the model with the smallest mean prediction error.

The prediction for covariate values \(x_p\):
\[ \hat\mu_p = x_p^T\hat\beta_p = x_p^T(X_p^TX_p)^{-1}X_p^TY \]
where \(X_p\) is a model with \(p\) covariates (\(p\) columns, \(n\) rows of observations); \(x_p^T\) contains the covariate values in one point of interest. \(x_p^T\) may or may not coincide with one of the rows of \(X_p\).

The prediction error is the expected squared difference \(E(\hat\mu_p-\mu)^2\) between the prediction and the true, error-free value of the response for the given covariate values.

We want to assess the quality of the model \(X_p\). The precise covariate values in which we measure the quality are free in theory. However, since we observe the model in a limited number of covariate values, we restrict ourselves to those values. Indeed, it is hard in practice to say something about the quality of a prediction of a response value if we don't even observe that response value. In the points of observation, we still don't know the exact (noise-free) response.

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.86
The bias–variance trade-off

Prediction error = variance + bias²:
\[ E(\hat\mu_p-\mu)^2 = \mathrm{var}(\hat\mu_p) + (E\hat\mu_p-\mu)^2 \]
with \(\mu = x^T\beta\) in the correct model.

Intuitively: the smaller the model, the smaller the prediction variance, but the larger the bias.

Because of the linearity of expectations,
\[ E\hat\mu_p = E\left(x_p^T(X_p^TX_p)^{-1}X_p^TY\right) = x_p^T(X_p^TX_p)^{-1}X_p^T\mu \]
And
\[ \mathrm{var}(\hat\mu_p) = x_p^T\,\Sigma_{\hat\beta_p}\,x_p = \sigma^2\, x_p^T(X_p^TX_p)^{-1}x_p \]

We search for the minimum average prediction error, where the average is taken over all points of observation, so we want to minimise
\[ PE(\hat\beta_p) = \frac{1}{n}\sum_{i=1}^n E(\hat\mu_{p,i}-\mu_i)^2 \]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.87
The prediction variance of a model

The average variance can be written as
\[ \frac{1}{\sigma^2}\sum_{i=1}^n \mathrm{var}(\hat\mu_{p,i}) = \sum_{i=1}^n x_{p,i}^T(X_p^TX_p)^{-1}x_{p,i} = \sum_{i=1}^n \mathrm{Tr}\left(x_{p,i}^T(X_p^TX_p)^{-1}x_{p,i}\right) = \sum_{i=1}^n \mathrm{Tr}\left((X_p^TX_p)^{-1}(x_{p,i}x_{p,i}^T)\right) \]
\[ = \mathrm{Tr}\left(\sum_{i=1}^n (X_p^TX_p)^{-1}(x_{p,i}x_{p,i}^T)\right) = \mathrm{Tr}\left((X_p^TX_p)^{-1}\sum_{i=1}^n x_{p,i}x_{p,i}^T\right) = \mathrm{Tr}(I_p) = p \]
thereby using that the rows of \(X_p\) are \(x_{p,i}^T\), so
\[ X_p^TX_p = \sum_{i=1}^n x_{p,i}x_{p,i}^T \]

The average variance thus only depends on the size of the model. It does not depend on the exact locations of the points in which the prediction quality is evaluated.

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.88
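The trace identity above can be checked numerically with an arbitrary design; the random design, σ² and dimensions are illustrative choices:

```python
import numpy as np

# Illustrative random design: the result should only depend on p
rng = np.random.default_rng(5)
n, p, sigma2 = 25, 4, 1.3
X = rng.normal(size=(n, p))
XtX_inv = np.linalg.inv(X.T @ X)

# sum_i var(mu_hat_i) = sigma2 * sum_i x_i^T (X^T X)^{-1} x_i
var_sum = sigma2 * np.einsum('ij,jk,ik->', X, XtX_inv, X)
```

Whatever the (full-rank) design, `var_sum` equals σ²·p, confirming that the average prediction variance depends only on the model size.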
The prediction bias of a model

Remark: if we assume that the working model is correct (i.e., contains the DGP = true model), then the result on slide 35 applies: the bias is zero.

The average mean or expected error (bias):
\[ \frac{1}{\sigma^2}\sum_{i=1}^n (E\hat\mu_{p,i}-\mu_i)^2 = \frac{1}{\sigma^2}(E\hat\mu_p-\mu)^T(E\hat\mu_p-\mu) \]
Now
\[ E\hat\mu_p = X_p(X_p^TX_p)^{-1}X_p^T\mu = P_p\mu \]
so the average bias is
\[ \frac{1}{\sigma^2}\mu^T(I-P)^T(I-P)\mu = \frac{1}{\sigma^2}\mu^T(I-P)\mu \]
This contains the unknown, error-free response variable. We try to estimate the bias by plugging in the observed values instead. We find the expression
\[ Y^T(I-P)Y = SSE_p. \]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.89
The expression of Mallows's Cp

The estimator of the bias is biased again, as follows from
\[ E\left(Y^T(I-P)Y\right) = E\left[(\mu+\varepsilon)^T(I-P)(\mu+\varepsilon)\right] \]
\[ = \mu^T(I-P)\mu + E\left[\varepsilon^T(I-P)\mu\right] + E\left[\mu^T(I-P)\varepsilon\right] + E\left[\varepsilon^T(I-P)\varepsilon\right] \]
\[ = \mu^T(I-P)\mu + 0 + 0 + \mathrm{Tr}(I-P)\,\sigma^2 = \mu^T(I-P)\mu + (n-p)\sigma^2 \]
The bias of the estimator of the bias, \((n-p)\sigma^2\), however, only depends on the model size, no longer on the unknown response.

So
\[ PE(\hat\beta_p) = \frac{1}{n}E\left[SSE_p + (2p-n)\sigma^2\right] \]
The random variable on the right-hand side (between the expectation brackets) is observable and is an unbiased estimator of the prediction quality measure on the left-hand side of the expression.

In most practical situations \(\sigma^2\) has to be estimated by \(S^2\), and so the studentised value
\[ C_p = \frac{SSE_p}{S^2} + 2p - n \]
is a consistent estimator of \(\frac{n}{\sigma^2}PE(\hat\beta_p)\).

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.90
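A sketch of Cp over a nested family of polynomial models; the quadratic DGP, noise level and maximal degree are illustrative assumptions (note that S² is estimated once, in the largest model, as the next slide requires):

```python
import numpy as np

# Illustrative nested polynomial family with a quadratic DGP (p = 3)
rng = np.random.default_rng(6)
n, sigma = 60, 0.2
x = np.linspace(0.0, 1.0, n)
mu = 1.0 + x - 3.0 * x**2
y = mu + rng.normal(scale=sigma, size=n)

pmax = 10
X_full = np.column_stack([x**k for k in range(pmax)])
beta_full = np.linalg.lstsq(X_full, y, rcond=None)[0]
# S^2 estimated once, in the largest model, and kept fixed across models
S2 = np.sum((y - X_full @ beta_full) ** 2) / (n - pmax)

cp = []
for p in range(1, pmax + 1):
    Xp = X_full[:, :p]
    bp = np.linalg.lstsq(Xp, y, rcond=None)[0]
    sse = np.sum((y - Xp @ bp) ** 2)
    cp.append(sse / S2 + 2 * p - n)
p_best = int(np.argmin(cp)) + 1
```

The Cp curve drops steeply while bias dominates and then rises roughly linearly (penalty 2p) once the model contains the DGP.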
Remark on the estimation of σ²

The variance of the errors did not appear in the definition of \(PE(\hat\beta_p)\); it comes from the development of the expression.

We impose that the variance estimator must not depend on the selected model (unlike \(\hat\beta_p\)).

Therefore, \(\sigma^2\) must be kept outside the optimisation of the likelihood within a given model: \(S^2\) must not be model-dependent.

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.91
The closeness–complexity trade-off

The Cp criterion reformulates the bias–variance trade-off on slide 87 as a trade-off between closeness (\(SSE_p\)) and complexity (\(2p\)):

from
\[ PE(\hat\beta_p) = \frac{1}{n}E\|\hat\mu_p-\mu\|_2^2 = \frac{1}{n}\sum_{i=1}^n \mathrm{var}(\hat\mu_{p,i}) + \frac{1}{n}\|E\hat\mu_p-\mu\|_2^2 \]
to
\[ PE(\hat\beta_p) = \frac{1}{n}E\left[SSE_p + (2p-n)\sigma^2\right] \]

The equivalence holds in expected value:
\[ \frac{1}{n}E\|\hat\mu_p-\mu\|_2^2 = \frac{1}{n}E\left[SSE_p + (2p-n)\sigma^2\right] \]
but
\[ \frac{1}{n}\|\hat\mu_p-\mu\|_2^2 \ne \frac{1}{n}\left[SSE_p + (2p-n)\sigma^2\right] \]

The closeness–complexity trade-off can also be found directly from the expression of the prediction error; see the next slide.

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.92
Degrees of freedom (DoF)

Mallows's Cp without going by the variance–bias trade-off (straight to closeness–complexity):
\[ PE(\hat\beta_p) = \frac{1}{n}E\|\hat\mu_p-\mu\|_2^2 = \frac{1}{n}E\|\hat\mu_p-Y+Y-\mu\|_2^2 = \frac{1}{n}E\|-e_p+\varepsilon\|_2^2 \]
\[ = \frac{1}{n}\left[E\|e_p\|_2^2 + 2E(-e_p^T\varepsilon) + E\|\varepsilon\|_2^2\right] \]
\[ = \frac{1}{n}E\left[SSE(\hat\beta_p)\right] + \frac{2}{n}E\left[(\varepsilon-e_p)^T\varepsilon\right] - \frac{1}{n}E\|\varepsilon\|_2^2 \]
\[ = \frac{1}{n}E\left[SSE(\hat\beta_p)\right] + \frac{2\nu_p}{n}\sigma^2 - \sigma^2 \]
where \(\nu_p\) are the degrees of freedom, defined as \( \nu_p = \frac{1}{\sigma^2}E\left[\varepsilon^T(\varepsilon-e_p)\right] \).

And define a variant of Mallows's Cp based on degrees of freedom and without studentisation:
\[ \Delta(\hat\beta_p) = \frac{1}{n}SSE(\hat\beta_p) + \frac{2\nu_p}{n}\sigma^2 - \sigma^2 \qquad\text{Then}\quad E\left[\Delta(\hat\beta_p)\right] = PE(\hat\beta_p) \]

Note that the DoF depends on the selected model. It is not a fixed number (as in statistical inference).

Alternative expression for the DoF: \( \nu_p = E(\varepsilon^T\hat\mu_p)/\sigma^2 \)
Proof:
\[ \nu_p = \frac{1}{\sigma^2}E\left[\varepsilon^T(\varepsilon-e_p)\right] = \frac{1}{\sigma^2}E\left[\varepsilon^T(Y-\mu-Y+\hat\mu_p)\right] = \frac{1}{\sigma^2}E\left[\varepsilon^T(\hat\mu_p-\mu)\right] \]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.93
DoF in linear smoothing

Linear smoothing in a given/fixed model: \( \hat\mu_p = P_pY \).

The definition of the DoF (slide 93) does not rely on orthogonality, so \(P_p\) may be any projection or even any smoothing procedure. Depending on the context, the matrix \(P_p\) is termed projection matrix or influence matrix.

Then \( \nu_p = \mathrm{Tr}(P_p) \).

Proof:
\[ \nu_p = E(\varepsilon^T\hat\mu_p)/\sigma^2 = E(\varepsilon^TP_pY)/\sigma^2 = E\left[\varepsilon^TP_p(\mu+\varepsilon)\right]/\sigma^2 \]
\[ = 0 + E\left[\mathrm{Tr}(P_p\varepsilon\varepsilon^T)\right]/\sigma^2 = \mathrm{Tr}(P_p\,\sigma^2 I)/\sigma^2 \]

Special case: orthogonal projection (least squares) within a fixed/given model:
\[ \nu_p = p \]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.94
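A Monte Carlo sketch of the identity \(\nu_p = E(\varepsilon^T\hat\mu_p)/\sigma^2 = \mathrm{Tr}(P_p)\) for a fixed least-squares model; the random design, the fixed "true response" and the number of replications are illustrative assumptions:

```python
import numpy as np

# Illustrative random design; for least squares, Tr(P) = p
rng = np.random.default_rng(7)
n, p, sigma = 30, 4, 1.0
X = rng.normal(size=(n, p))
P = X @ np.linalg.inv(X.T @ X) @ X.T      # hat / influence matrix
mu = rng.normal(size=n)                   # arbitrary fixed true response

reps = 20000
acc = 0.0
for _ in range(reps):
    eps = rng.normal(scale=sigma, size=n)
    acc += eps @ (P @ (mu + eps))         # eps^T mu_hat, with mu_hat = P Y
nu_mc = acc / (reps * sigma**2)
```

The Monte Carlo average `nu_mc` should be close to `np.trace(P)`, which equals p exactly for an orthogonal projection.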
Degrees of freedom after optimisation

• For least squares in a fixed model, we find \(\nu_p = p\) (see p.94)
• If \(X_p\) (and \(P_p\)) result from the optimisation of Mallows's Cp, the selection depends on the errors
• \(\Delta(\hat\beta_p)\) (see the definition on p.93) is still unbiased, but \(\nu_p \ne p\), because the (selected) projection \(P_p\) depends on the errors \(\varepsilon\). Errors can cause false positives
• False positives are more likely when many true parameters are zero. Therefore, the effect of the optimisation is important in sparse models. Otherwise, it can be ignored. See further slide 139

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.95
The use of Cp

• In the notation Cp, the p refers to the size of the model under consideration. It suggests that Cp is a function of p only. But Cp can also be used to compare two non-overlapping models of the same size.
• Cp measures the quality of the model in terms of prediction error. It does not measure the correctness of the model.
prediction ↔ estimation
• Nested models, such as polynomial regression where we include all powers up to p: a larger model thus always includes a smaller one. Then Cp can be seen as a function of p. The curve of Cp has a typical form with a minimum that is the optimal balance between bias (descending as a function of p) and variance (proportional to p). If DGP = true model ∈ candidate models, then bias → 0.

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.96
Illustration of a nested model

Polynomial regression model
\[ Y_i = \sum_{k=0}^{p-1}\beta_k x_i^k + \varepsilon_i \]
with n = 30 observations in [0, 1], and p = 8.
(Degree 7, specified in the simulation by 7 zeros ⇒ 6 extremes; two zeros are outside [0, 1].)

[Figure: the simulated degree-7 polynomial with the n = 30 observations on [0, 1]]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.97
Prediction error

[Figure: the Cp plot — PE and Cp as functions of the model size p]

[Figure: the minimum prediction error estimator]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.98
Conclusions about the example

The true model (DGP) is a polynomial of degree 7. It has two zeros outside the interval of observations. The presence of these zeros has little impact on the observations; it can hardly be detected. As a consequence, the minimum prediction error is reached for a simpler model: a polynomial of degree 5 (p = 6).

Question
1. Why is \(PE(\hat\beta_p)\) almost a linear function for large p?

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.99
Kullback–Leibler divergence

• Data generating density – data generating process – DGP: \(g_Y(y;\sigma^2)\) (joint density of the observations \(Y\))
• Models for selection: \(f_Y(y;\beta_p,\sigma_p^2)\), where \(\beta_p\) has \(p\) parameters
• Then the Kullback–Leibler (KL) divergence
\[ KL\left(g_Y, f_Y(\cdot;\beta_p,\sigma_p^2)\right) = \frac{1}{n}\sum_{i=1}^n\left[E\log g_{Y_i}(Y_i;\sigma^2) - E\log f_{Y_i}(Y_i;\beta_p,\sigma_p^2)\right], \]
measures the distance between the true model (DGP) and the statistical model
• Again: bias–variance balance

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.100
Kullback–Leibler divergence is always positive

In general, if \(X \sim g_X\) (DGP – data generating process), then
\[ KL(g_X,f) = E\left[\log(g_X(X)) - \log(f(X))\right] \]
For all models \(f(x)\), we have
\[ KL(g_X,f) \ge 0 \]
Indeed, because of Jensen's inequality, we have \(E[\log(X)] \le \log(E(X))\), and so,
\[ KL(g_X,f) = E\left[\log\left(\frac{g_X(X)}{f(X)}\right)\right] = -E\left[\log\left(\frac{f(X)}{g_X(X)}\right)\right] \ge -\log\left(E\left[\frac{f(X)}{g_X(X)}\right]\right) \]
\[ = -\log\left(\int_{\mathbb{R}}\frac{f(x)}{g_X(x)}\,g_X(x)\,dx\right) = -\log\left(\int_{\mathbb{R}}f(x)\,dx\right) = -\log(1) = 0 \]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.101
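A numerical sketch of the nonnegativity of KL for two normal densities; the means and variances below are arbitrary illustrative choices, and the closed-form expression for the KL divergence between two normals is used as a cross-check:

```python
import numpy as np

# Illustrative densities: DGP N(0,1), candidate model N(0.5, 1.5^2)
mu1, s1 = 0.0, 1.0
mu2, s2 = 0.5, 1.5

x = np.linspace(-12.0, 12.0, 200001)
dx = x[1] - x[0]
g = np.exp(-(x - mu1) ** 2 / (2 * s1**2)) / np.sqrt(2 * np.pi * s1**2)
f = np.exp(-(x - mu2) ** 2 / (2 * s2**2)) / np.sqrt(2 * np.pi * s2**2)
# Riemann-sum approximation of KL(g || f) = E_g[log g - log f]
kl_num = np.sum(g * (np.log(g) - np.log(f))) * dx

# closed form of KL(N(mu1,s1^2) || N(mu2,s2^2)), for comparison
kl_exact = np.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2 * s2**2) - 0.5
```

Swapping `g` and `f` gives a different (but also nonnegative) value: KL is not symmetric, hence a divergence rather than a metric.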
Kullback–Leibler divergence and expected log-likelihood

\[ KL\left(g_Y, f_Y(\cdot;\beta_p,\sigma_p^2)\right) = \frac{1}{n}\sum_{i=1}^n\left[E\log g_{Y_i}(Y_i;\sigma^2) - E\log f_{Y_i}(Y_i;\beta_p,\sigma_p^2)\right], \]

\(E\log g_{Y_i}(Y_i;\sigma^2)\) does not depend on the chosen model, so
\[ KL\left(g_Y, f_Y(\cdot;\beta_p,\sigma_p^2)\right) = \text{Constant} - E\,LL(\beta_p,\sigma_p^2;Y) \]
with
\[ LL(\beta_p,\sigma_p^2;Y) = \frac{1}{n}\sum_{i=1}^n \log f_{Y_i}(Y_i;\beta_p,\sigma_p^2) \]

KL distance = constant − expected log-likelihood

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.102
Expected log-likelihood in model selection

• See also slide 78
• Definition
\[ \ell(\beta_p,\sigma_p^2) = E\,LL(\beta_p,\sigma_p^2;Y) = \frac{1}{n}\sum_{i=1}^n E\log f_{Y_i}(Y_i;\beta_p,\sigma_p^2) \]
\[ \ell(\beta_p,\sigma_p^2) = \frac{1}{n}\sum_{i=1}^n \int_{-\infty}^{\infty}\log f_{Y_i}(u;\beta_p,\sigma_p^2)\,g_{Y_i}(u;\sigma^2)\,du \]
• Estimation
– Substitute \(\beta_p\) and \(\sigma_p^2\) by sample-based estimators: \(\ell(\hat\beta_p,\hat\sigma_p^2)\) is random
– Estimate the expected value (integral) from the sample \(Y\):
\[ \hat\ell(\beta_p,\sigma_p^2) = \frac{1}{n}\sum_{i=1}^n \log f_{Y_i}(Y_i;\beta_p,\sigma_p^2) \]
– \( E\left[\hat\ell(\beta_p,\sigma_p^2)\right] = \ell(\beta_p,\sigma_p^2) \) but \( E\left[\hat\ell(\hat\beta_p,\hat\sigma_p^2)\right] \ne E\left[\ell(\hat\beta_p,\hat\sigma_p^2)\right] \)

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.103
Estimating KL from a sample

Problem: using the same sample for
• parameter estimation within a model, and
• measuring the distance of the model to the data generating density

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.104
Development of AIC for the linear model (1)

Suppose a linear model with normal errors; let \(\mu = E(Y)\) and \(\hat\mu_p = X_p\hat\beta_p\).

\[ \ell(\hat\beta_p,\hat\sigma_p^2) = -\frac{1}{n}\sum_{i=1}^n E\left[\frac{(Y_i-\hat\mu_{p,i})^2}{2\hat\sigma_p^2} + \frac{1}{2}\log(2\pi\hat\sigma_p^2)\right] \]
\[ = -\frac{1}{n}\sum_{i=1}^n\left[\frac{(\mu_i-\hat\mu_{p,i})^2}{2\hat\sigma_p^2} + \frac{E(Y_i-\mu_i)^2}{2\hat\sigma_p^2} + \frac{1}{2}\log(2\pi\hat\sigma_p^2)\right] \]
\[ = -\frac{1}{n}\sum_{i=1}^n\left[\frac{(\mu_i-\hat\mu_{p,i})^2}{2\hat\sigma_p^2} + \frac{\sigma^2}{2\hat\sigma_p^2} + \frac{1}{2}\log(2\pi\hat\sigma_p^2)\right] \]

Note \(\sigma_p^2 \ne \sigma^2\) (model ↔ data generating process).

Taking
\[ \hat\sigma_p^2 = \frac{1}{n-\nu_p}\sum_{i=1}^n (Y_i-\hat\mu_{p,i})^2 \]
(where \(\nu_p\) is to be chosen: unbiased, MLE or other), we have
\[ \hat\ell(\hat\beta_p,\hat\sigma_p^2) = -\frac{1}{n}\sum_{i=1}^n\left[\frac{(Y_i-\hat\mu_{p,i})^2}{2\hat\sigma_p^2} + \frac{1}{2}\log(2\pi\hat\sigma_p^2)\right] = -\frac{1}{2}\,\frac{n-\nu_p}{n} - \frac{1}{2}\log(2\pi\hat\sigma_p^2) \]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.105
Special case: σ² known or easy to estimate

See also slide 79.

Observed likelihood ∼ SSE:
\[ \hat\ell(\hat\beta_p) = -\frac{1}{n}\sum_{i=1}^n\left[\frac{(Y_i-\hat\mu_{p,i})^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2)\right] = -c_1\,\frac{1}{n}SSE(\hat\beta_p) + c_2 \]

Expected likelihood ∼ prediction error:
\[ \ell(\hat\beta_p) = -\frac{1}{n}\sum_{i=1}^n\left[\frac{(\mu_i-\hat\mu_{p,i})^2}{2\sigma^2} + \frac{1}{2} + \frac{1}{2}\log(2\pi\sigma^2)\right] \]
Then \( E\,\ell(\hat\beta_p) = -a_1\,PE(\hat\beta_p) + a_2 \).

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.106
Development of AIC for the linear model (2)

For the computation of \(E\left[\ell(\hat\beta_p,\hat\sigma_p^2)\right]\) we use that \(\hat\mu_{p,i} = (X_p\hat\beta_p)_i\) is independent from \(\hat\sigma_p^2 = \|Y-X_p\hat\beta_p\|_2^2/(n-\nu_p)\).

The proof is an extension of slide 60 (not relying on the fact that, in the true model/DGP, the estimator is unbiased):
\[ \mathrm{cov}(Y-X_p\hat\beta_p,\hat\beta_p) = E\left[(Y-X_p\hat\beta_p)\hat\beta_p^T\right] - E(Y-X_p\hat\beta_p)\cdot E\hat\beta_p^T \]
\[ = E\left[(Y-X_p(X_p^TX_p)^{-1}X_p^TY)\cdot((X_p^TX_p)^{-1}X_p^TY)^T\right] - E(Y-X_p(X_p^TX_p)^{-1}X_p^TY)\cdot E\left((X_p^TX_p)^{-1}X_p^TY\right)^T \]
\[ = (I-X_p(X_p^TX_p)^{-1}X_p^T)\cdot E(YY^T)\cdot((X_p^TX_p)^{-1}X_p^T)^T - (I-X_p(X_p^TX_p)^{-1}X_p^T)\cdot E(Y)E(Y^T)\cdot((X_p^TX_p)^{-1}X_p^T)^T \]
\[ = (I-X_p(X_p^TX_p)^{-1}X_p^T)\cdot\left[(\sigma^2 I + X\beta\beta^TX^T) - X\beta\beta^TX^T\right]X_p(X_p^TX_p)^{-1} = 0 \]

Hence, any function of \(Y-X_p\hat\beta_p\) is independent from any function of \(\hat\beta_p\).

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.107
Development of AIC for the linear model (3)

\[ E\left[\ell(\hat\beta_p,\hat\sigma_p^2)\right] = -\frac{1}{n}\sum_{i=1}^n E\left[\frac{(\mu_i-\hat\mu_{p,i})^2}{2\hat\sigma_p^2} + \frac{\sigma^2}{2\hat\sigma_p^2} + \frac{1}{2}\log(2\pi\hat\sigma_p^2)\right] \]
\[ = -\frac{1}{2}E\left[\frac{\sigma^2}{\hat\sigma_p^2}\left(\frac{1}{n\sigma^2}\|\mu-\hat\mu_p\|_2^2 + 1\right) + \log(2\pi\hat\sigma_p^2)\right] \]
\[ = -\frac{1}{2}E\left(\frac{\sigma^2}{\hat\sigma_p^2}\right)E\left(\frac{1}{n\sigma^2}\|\mu-\hat\mu_p\|_2^2 + 1\right) - \frac{1}{2}E\left[\log(2\pi\hat\sigma_p^2)\right] \]
where the factorisation uses the independence from the previous slide.

We now use
1. \( Y \sim \chi^2(\nu) \Rightarrow E(1/Y) = 1/(\nu-2) \)
2. (cfr. slides 88 and 89) \( E\left(\|\mu-\hat\mu_p\|_2^2\right) = p\sigma^2 + \mu^T(I-P)\mu \)

Hence, if \(\beta_p\) contains the true model (DGP), then
\[ E\left[\ell(\hat\beta_p,\hat\sigma_p^2)\right] = -\frac{1}{2}\left(\frac{n-\nu_p}{n-p-2}\right)\left(\frac{p}{n}+1\right) - \frac{1}{2}E\left[\log(2\pi\hat\sigma_p^2)\right] \]
while (p.105)
\[ E\left[\hat\ell(\hat\beta_p,\hat\sigma_p^2)\right] = -\frac{1}{2}\,\frac{n-\nu_p}{n} - \frac{1}{2}E\left[\log(2\pi\hat\sigma_p^2)\right] \]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.108
Akaike's Information Criterion — AIC: definition

\[ E\left[\ell(\hat\beta_p,\hat\sigma_p^2)\right] = E\left[\hat\ell(\hat\beta_p,\hat\sigma_p^2)\right] + \frac{1}{2}\,\frac{n-\nu_p}{n} - \frac{1}{2}\left(\frac{n-\nu_p}{n-p-2}\right)\left(\frac{p}{n}+1\right) \]
\[ = E\left[\hat\ell(\hat\beta_p,\hat\sigma_p^2)\right] - \left(\frac{p+1}{n}\right)\left(\frac{n-\nu_p}{n-p-2}\right) \]

As \((n-\nu_p)/(n-p-2) = O(1)\), we can define AIC as
\[ AIC(\hat\beta_p) = \frac{2}{n}\sum_{i=1}^n \log f_{Y_i}(Y_i;\hat\beta_p,\hat\sigma_p^2) - 2\left(\frac{p+1}{n}\right) \]

AIC = 2 · log-likelihood − 2 · model size

Intuition: the penalty comes from the fact that the maximum likelihood estimator substituted into the sample likelihood expression fails to detect the noise variance (the second term on the second line of p.105 disappears).

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.109
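A sketch of this AIC for nested polynomial models in the linear-normal case, using the MLE of σ² within each model (i.e., the choice ν_p = 0 in the slide's notation); the simulated straight-line DGP and degrees considered are illustrative assumptions:

```python
import numpy as np

# Illustrative data: the DGP is a straight line (p = 2 in the nested family)
rng = np.random.default_rng(8)
n = 100
x = np.linspace(0.0, 1.0, n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

def aic(X):
    n_, p = X.shape
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2 = np.mean((y - X @ beta_hat) ** 2)        # MLE of the variance
    # Gaussian log-likelihood at the MLE
    loglik = -0.5 * n_ * np.log(2 * np.pi * sigma2) - 0.5 * n_
    # AIC = (2/n) * loglik - 2 (p + 1)/n, per the slide's normalisation
    return (2.0 * loglik - 2.0 * (p + 1)) / n_

scores = {p: aic(np.column_stack([x**k for k in range(p)])) for p in (1, 2, 3, 4)}
p_best = max(scores, key=scores.get)
```

The model with the largest AIC value (in this normalisation) is selected; adding the true slope improves the criterion far more than the penalty costs.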
AIC — sketch of a general proof

• Let \(\hat\theta_p\) maximise \( \hat\ell(\theta_p) = \frac{1}{n}\log f_Y(y;\theta_p) \) (sample log-likelihood)
• Let \(\theta_p^*\) maximise \( \ell(\theta_p) = \frac{1}{n}E\left(\log f_Y(Y;\theta_p)\right) \) (expected log-likelihood)
• \(\theta_p^*\) is the least false parameter value. If the DGP is included in model p, then \(\theta_p^*\) is the true parameter value. This is because the expected score is zero in the true parameter value, i.e.,
\[ E\left(\frac{\partial}{\partial\theta_j}\log f_Y(Y;\theta^*)\right) = 0 \]
• \( E\left(\hat\ell(\theta_p^*)\right) = \ell(\theta_p^*) \) but \( E\left(\hat\ell(\hat\theta_p)\right) \ne E\left(\ell(\hat\theta_p)\right) \)
• We use Taylor series approximations for both \(\ell(\hat\theta_p)\) and \(\hat\ell(\hat\theta_p)\) around \(\theta_p^*\), in which both functions have the same expected value

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.110
Taylor series for use in the proof of AIC

• Denote by \(H_\ell(\theta)\) the Hessian matrix, i.e., \( \left[H_\ell(\theta)\right]_{ij} = \frac{\partial^2\ell(\theta)}{\partial\theta_i\partial\theta_j} \) (similarly for \(H_{\hat\ell}(\theta)\))
• \[ \ell(\hat\theta_p) = \ell(\theta_p^*) + \frac{1}{2}(\hat\theta_p-\theta_p^*)^T H_\ell(\theta_p^*)(\hat\theta_p-\theta_p^*) \]
where we use that \(\nabla_\theta\ell(\theta_p^*) = 0\) (least false value)
• \[ \hat\ell(\hat\theta_p) = \hat\ell(\theta_p^*) + \nabla_\theta\hat\ell(\theta_p^*)^T(\hat\theta_p-\theta_p^*) + \frac{1}{2}(\hat\theta_p-\theta_p^*)^T H_{\hat\ell}(\theta_p^*)(\hat\theta_p-\theta_p^*) \]
• Together:
\[ \hat\ell(\hat\theta_p) - \ell(\hat\theta_p) = \hat\ell(\theta_p^*) - \ell(\theta_p^*) + \nabla_\theta\hat\ell(\theta_p^*)^T(\hat\theta_p-\theta_p^*) + \frac{1}{2}(\hat\theta_p-\theta_p^*)^T\left(H_{\hat\ell}(\theta_p^*) - H_\ell(\theta_p^*)\right)(\hat\theta_p-\theta_p^*) \]

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.111
Model misspecification

For the proof of AIC, we need some results under model misspecification.

Taylor series for \(\nabla_\theta\hat\ell(\theta)\) around \(\theta_p^*\):
\[ \nabla_\theta\hat\ell(\hat\theta_p) \approx \nabla_\theta\hat\ell(\theta_p^*) + H_{\hat\ell}(\theta_p^*)(\hat\theta_p-\theta_p^*) \]
By definition, \(\nabla_\theta\hat\ell(\hat\theta_p) = 0\), so the score \(\nabla_\theta\hat\ell(\theta_p^*)\) satisfies
\[ \nabla_\theta\hat\ell(\theta_p^*) \approx -H_{\hat\ell}(\theta_p^*)(\hat\theta_p-\theta_p^*) \approx -H_\ell(\theta_p^*)(\hat\theta_p-\theta_p^*) \]

Argument for the latter approximation:
• The score can be written as \( \nabla_\theta\hat\ell(\theta_p^*) = \frac{1}{n}\sum_{i=1}^n \nabla_\theta\log f_{Y_i}(Y_i;\theta_p^*) \). This has the form of a sample mean, fluctuating around \( E(\nabla_\theta\hat\ell(\theta_p^*)) = \nabla_\theta(E\hat\ell(\theta_p^*)) = \nabla_\theta\ell(\theta_p^*) = 0 \); hence, it can be expected that a standardised value converges in distribution to a multivariate standard normal (CLT).
• The Hessian: \( H_{\hat\ell}(\theta_p^*) = \frac{1}{n}\sum_{i=1}^n H_{\log f_{Y_i}}(Y_i;\theta_p^*) \xrightarrow{P} H_\ell(\theta_p^*) \) does not fluctuate around zero, but converges in probability to a constant.
• Convergence rates: score \(O_p(1/\sqrt{n})\); Hessian \(O_p(1/n)\)
• Conclusion: approximate the Hessian by \(H_\ell(\theta_p^*)\), keep the score as a random vector.

© Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.112
AIC - general proof (ctd.)
From slide 112, we have θ̂p − θ∗p ≈ −Hℓ̄(θ∗p)⁻¹ ∇θℓ(θ∗p)
Substitute into the expression for ℓ(θ̂p) − ℓ̄(θ̂p) on slide 111:
ℓ(θ̂p) − ℓ̄(θ̂p) ≈ ℓ(θ∗p) − ℓ̄(θ∗p) − ∇θℓ(θ∗p)ᵀ Hℓ̄(θ∗p)⁻¹ ∇θℓ(θ∗p) + op(1/n)
Expected values
Use: scalar = Tr(scalar); Tr(AB) = Tr(BA); ∇θℓ(θ∗p) has zero mean
E(ℓ(θ̂p)) − E(ℓ̄(θ̂p)) ≈ 0 − Tr[Hℓ̄(θ∗p)⁻¹ E(∇θℓ(θ∗p) ∇θℓ(θ∗p)ᵀ)] = −Tr[Hℓ̄(θ∗p)⁻¹ Σ∇θℓ(θ∗p)]
If θ∗p is the true model (DGP), then −Hℓ̄(θ∗p) = n Σ∇θℓ(θ∗p) (Fisher information matrix), so
E(ℓ(θ̂p)) − E(ℓ̄(θ̂p)) ≈ Tr(Ip/n) = p/n
So, take AIC(θ̂p) = 2ℓ(θ̂p) − 2p/n
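As an illustration (not from the slides): for a Gaussian linear model the maximised average log-likelihood has the closed form ℓ(θ̂) = −½ log(2πσ̂²) − ½ with σ̂² = SSE/n, so the criterion above is easy to evaluate. A minimal sketch comparing an intercept-only model with an intercept-plus-slope model on invented data:

```python
import math

def avg_loglik_gaussian(residuals):
    """Maximised average log-likelihood l(theta_hat) of a Gaussian model:
    -0.5*log(2*pi*sigma2_hat) - 0.5, with sigma2_hat = SSE/n (the MLE)."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    return -0.5 * math.log(2 * math.pi * sigma2) - 0.5

def aic(residuals, p):
    # AIC as on the slide: 2*l(theta_hat) - 2*p/n (larger is better)
    n = len(residuals)
    return 2 * avg_loglik_gaussian(residuals) - 2 * p / n

# Invented data: linear trend plus small deterministic perturbations
x = [float(i) for i in range(20)]
y = [1.0 + 0.5 * xi + 0.3 * math.sin(7.0 * i) for i, xi in enumerate(x)]

# Model 1: intercept only (p = 1)
ybar = sum(y) / len(y)
res1 = [yi - ybar for yi in y]

# Model 2: intercept + slope (p = 2), ordinary least squares
xbar = sum(x) / len(x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
     / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
res2 = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

aic1, aic2 = aic(res1, 1), aic(res2, 2)
```

With the trend present in the data, the larger model wins despite its extra penalty term 2p/n.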
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.113
Bayesian Information Criterion (BIC) or Schwarz Criterion
The Bayesian framework
• No true model, all models have prior probability of being the DGP
• Let Sm be the event that the data were generated in model m
• Let Θm be the set of possible values of θ within the mth model
• Let ϕ(θm|Sm) be a prior density for the parameters in model m
• Then (Bayes) P(Sm|y) = P(Sm) fY(y|Sm) / fY(y) = P(Sm) fY(y|Sm) / ∑_{k=1}^{K} P(Sk) fY(y|Sk)
with
fY(y|Sm) = ∫_{Θm} fY|θm(y; θm|Sm) ϕ(θm|Sm) dθm = ∫_{Θm} fY|θm(y; θm) ϕ(θm|Sm) dθm
• If P(Sm) is constant, then maximisation of P(Sm|y) amounts to maximisation of fY(y|Sm)
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.114
Differences in setup AIC — BIC
AIC has the notion of a data generating process/true model;
BIC starts from the probability that the data were generated by model m, for all models under consideration.
AIC: every model m has a true or least false parameter θ∗m.
BIC: within every model, the parameter values have a random/prior distribution.
AIC tries to find the model with the best expected (log-)likelihood.
BIC tries to maximise the posterior probability, given the observations; in case of a non-informative prior, this is the maximum sample likelihood (see slide 116).
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.115
BIC - formula
Development of the expression on p.114, with θ̂m the maximum likelihood estimator of θm:
fY|θm(y; θm) = exp[n ℓ(θm)] ≈ exp[n ℓ(θ̂m) − ½ (θm − θ̂m)ᵀ n Hℓ(θ̂m) (θm − θ̂m)]
Then we can apply Laplace's approximation for an integral (see slide 119):
fY(y|Sm) = ∫_{Θm} fY|θm(y; θm) ϕ(θm|Sm) dθm
≈ exp[n ℓ(θ̂m)] ϕ(θ̂m|Sm) (2π/n)^(p/2) |det(Hℓ(θ̂m))|^(−1/2)
= exp[(n/2)(2ℓ(θ̂m) − log(n)·p/n)] ·
exp[(n/2)(log(2π)·p/n + (2/n) log(ϕ(θ̂m|Sm)) − (1/n) log|det(Hℓ(θ̂m))|)]
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.116
BIC - formula simplified
All terms on slide 116 have the factor n/2 in the exponent in common. Between the brackets,
we see, for n → ∞,
• in the first term: ℓ(θ̂m) = (1/n) ∑_{i=1}^{n} log fYi(Yi; θ̂m) = Op(1) (see p.103)
• in the second term: O(log(n)/n)
• in all other terms: O(1/n) or Op(1/n)
(In particular, the Hessian converges in probability to a constant, independent of n;
see p.112)
Defining BIC(θ̂m) = 2ℓ(θ̂m) − log(n)·p/n = (2/n) ∑_{i=1}^{n} log fYi(Yi; θ̂m) − log(n)·p/n
we have
P(Sm|y) ≈ P(Sm) exp[n BIC(θ̂m)/2] / ∑_{k=1}^{K} P(Sk) exp[n BIC(θ̂k)/2]
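A sketch (invented Gaussian toy models, equal priors assumed) of how the expression above turns sample likelihoods into approximate posterior model probabilities; since ℓ here is an average per observation, the exponent carries a factor n:

```python
import math

def avg_loglik_gaussian(residuals):
    # Maximised average log-likelihood of a Gaussian fit (sigma2_hat = SSE/n)
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    return -0.5 * math.log(2 * math.pi * sigma2) - 0.5

def bic(residuals, p):
    # BIC as defined above: 2*l(theta_hat) - log(n)*p/n (larger is better)
    n = len(residuals)
    return 2 * avg_loglik_gaussian(residuals) - math.log(n) * p / n

def posterior_probs(bics, n):
    # P(S_m | y) proportional to exp(n*BIC_m/2) under equal model priors;
    # subtract the maximum before exponentiating, for numerical stability
    m = max(bics)
    w = [math.exp(n * (b - m) / 2.0) for b in bics]
    s = sum(w)
    return [wi / s for wi in w]

# Invented data: linear trend plus small deterministic perturbations
x = [float(i) for i in range(20)]
y = [1.0 + 0.5 * xi + 0.3 * math.sin(7.0 * i) for i, xi in enumerate(x)]

ybar, xbar = sum(y) / len(y), sum(x) / len(x)
res1 = [yi - ybar for yi in y]                        # intercept only, p = 1
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
     / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
res2 = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]  # with slope, p = 2

probs = posterior_probs([bic(res1, 1), bic(res2, 2)], n=len(y))
```

The model containing the true covariate receives essentially all posterior mass.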
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.117
BIC as a Bayes factor
When all prior probabilities are equal (or unknown, and thus supposed to be
equal, for convenience), the ratio of conditional distributions controls the selection.
This ratio is known as the Bayes factor:
Bayes factor = fY(y|Sm) / fY(y|Sk) = exp[n BIC(θ̂m)/2] / exp[n BIC(θ̂k)/2]
Bayes factors are the Bayesian analogues for likelihood ratios
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.118
Laplace’s approximation for integrals
Let I = ∫_{Rⁿ} f(x) e^(g(x)) dx, and let x0 be the global maximum of g(x); then
g(x) ≈ g(x0) + ½ (x − x0)ᵀ Hg(x0) (x − x0) = g(x0) − ½ (x − x0)ᵀ |Hg(x0)| (x − x0)
Moreover, f(x) ≈ f(x0) + ∇ᵀf(x0)(x − x0) + ½ (x − x0)ᵀ Hf(x0) (x − x0)
So
I ≈ e^(g(x0)) ∫_{Rⁿ} exp[−½ (x − x0)ᵀ |Hg(x0)| (x − x0)] (f(x0) + ∇ᵀf(x0)(x − x0) + …) dx
The exponential factor can be read as the density of a multivariate normal
around x0, times the missing constant (2π)^(n/2) |det(Hg(x0)⁻¹)|^(1/2); hence
∫_{Rⁿ} exp[−½ (x − x0)ᵀ |Hg(x0)| (x − x0)] ∇ᵀf(x0)(x − x0) dx = 0
I ≈ e^(g(x0)) f(x0) (2π)^(n/2) |det(Hg(x0))|^(−1/2)
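A one-dimensional numerical check (my own example, not from the slides): with f ≡ 1 and g(x) = −t cosh(x), which has its maximum at x0 = 0 with g″(0) = −t, the formula above predicts I ≈ e^(−t) √(2π/t); the relative error of Laplace's approximation here behaves like 1/(8t).

```python
import math

t = 50.0

def g(x):
    return -t * math.cosh(x)

# Laplace: I ≈ e^{g(0)} * f(0) * sqrt(2*pi / |g''(0)|), with g''(0) = -t
laplace = math.exp(-t) * math.sqrt(2 * math.pi / t)

# Brute-force numerical integral (midpoint rule; the integrand is
# negligible outside [-2, 2] for t = 50)
h, a, b = 1e-4, -2.0, 2.0
steps = int((b - a) / h)
numeric = sum(math.exp(g(a + (k + 0.5) * h)) for k in range(steps)) * h

rel_err = abs(numeric - laplace) / numeric
```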
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.119
A classical application of Laplace’s approximation:
Stirling’s formula
n! ≈ √(2πn) (n/e)ⁿ     Γ(z + 1) ≈ √(2πz) (z/e)ᶻ
Proof
Γ(z + 1) = ∫₀^∞ e⁻ˣ xᶻ dx = ∫₀^∞ e^(−zu) (zu)ᶻ z du = z^(z+1) ∫₀^∞ e^(−zu) uᶻ du = z^(z+1) ∫₀^∞ e^(z[log(u)−u]) du
The function gz(u) = z [log(u) − u] has a global maximum in u = 1, with
g′z(u) = z(1/u − 1), g′z(1) = 0 and g″z(u) = −z/u², so
Γ(z + 1) ≈ z^(z+1) √(2π) e^(gz(1)) / √|g″z(1)| = √(2πz) (z/e)ᶻ
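A quick numerical check of Stirling's formula (illustrative); the relative error decays like 1/(12n):

```python
import math

def stirling(n):
    # sqrt(2*pi*n) * (n/e)^n
    return math.sqrt(2 * math.pi * n) * (n / math.e) ** n

errs = []
for n in (5, 10, 20):
    exact = math.factorial(n)
    errs.append(abs(stirling(n) - exact) / exact)
# errs shrinks roughly like 1/(12n)
```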
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.120
AIC and BIC: Consistency
Page 113: AIC(θ̂p) = 2ℓ(θ̂p) − 2p/n
Page 117: BIC(θ̂m) = 2ℓ(θ̂m) − log(n)·p/n
Selection consistency
If the DGP exists and if it belongs to the models under consideration,
then an information criterion is (weakly/strongly) selection consistent if
(with probability tending to one/almost surely) the optimisation of the infor-
mation criterion selects the DGP.
BIC is consistent
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.121
AIC and BIC: Efficiency
Efficiency
An information criterion is asymptotically efficient if its optimisation finds a
model p for which the prediction error PE(β̂p) satisfies (for any ε > 0)
lim_{n→∞} P( PE(β̂p) / PE(β∗) < 1 + ε ) = 1
with β∗ = argmin PE(β)
Note: Prediction Error is considered as a function of the selection. If the
selection is random, then PE(βp) is random
AIC is efficient
A criterion cannot be consistent and efficient at the same time
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.122
Algorithms for searching the best model
• Suppose the full model has K possible variables; then the number of possible models is 2^K
• Computing all adjusted R²'s, Cp scores or other criteria is infeasible (exponential computational complexity)
• Possible strategies:
1. Forward selection: start with a zero model, then gradually add the most significant parameter in a test with H0 the current model.
2. Backward elimination
3. Forward-backward alternation: after a variable has been added, it is tested if any variable in the model can now be eliminated without significant increase in RSS.
4. Grouped/structured selection: the routine may impose that some parameters must be included before others. For instance, no interaction terms without main effects. Imposing nested models is an extreme example.
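A minimal sketch of strategy 1, using a decrease of Mallows's Cp = RSS/S² + 2p − n as the stopping rule instead of a formal F-test; the data, effect sizes and the variance estimate from the full model are all invented for illustration:

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of the least squares fit on the given columns
    (an intercept column is always included)."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

def forward_selection(X, y, s2):
    """Greedily add the variable that reduces RSS most, as long as
    Cp = RSS/s2 + 2p - n keeps decreasing (p counts fitted coefficients)."""
    n, K = X.shape
    selected = []
    best_cp = rss(X, y, []) / s2 + 2 * 1 - n
    while len(selected) < K:
        cand = min((j for j in range(K) if j not in selected),
                   key=lambda j: rss(X, y, selected + [j]))
        cp = rss(X, y, selected + [cand]) / s2 + 2 * (len(selected) + 2) - n
        if cp >= best_cp:
            break
        selected.append(cand)
        best_cp = cp
    return selected

rng = np.random.default_rng(0)
n, K = 50, 5
X = rng.standard_normal((n, K))
y = 2.0 * X[:, 1] - 3.0 * X[:, 3] + 0.1 * rng.standard_normal(n)
s2 = rss(X, y, list(range(K))) / (n - K - 1)  # variance estimate, full model
sel = forward_selection(X, y, s2)
```

With strong, nearly orthogonal signals, the greedy search recovers exactly the two active covariates.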
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.123
PART 3: HIGH DIM. MODEL SELECTION
Sparsity
• Parsimonious model selection: we search for models where “some” covariates have exactly zero contribution.
• Sparse model selection: the number of zeros dominates (by far) the number of nonzeros
Sparsity will be used as an assumption in high dimensional model selection,
thereby defining a solution.
On the other hand, sparsity will cause problems w.r.t. false positives
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.124
Model selection as a constrained optimization problem
Constrained optimization problem = regularized least squares
RegLS(β) = ‖Y − X·β‖² + λ‖β‖ℓq
• Minimum over β
• Defines the estimator within a model
• Operates within the full model, but the estimator is parsimonious (if q < 2)
RegLS(β) depends on the smoothing parameter λ. This parameter can be estimated
using penalized likelihood, for instance Mallows's Cp: Cp = SSEp/S² + 2p − n
• Information criterion
• Minimize over λ (or, equivalently, over model size p)
• The smoothing parameter λ determines the model size p (and vice versa)
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.125
Regularisation: from combinatorial to convex constraints
• In the constrained optimization problem RegLSℓ0(β) = ‖Y − X·β‖² + λp,
the constraint or penalty equals p = ∑ᵢ βi⁰ = ‖β‖₀, where we take 0⁰ = 0.
ℓ0 regularisation = combinatorial integer programming problem
• Replace ℓ0 by ℓ1 → convex quadratic programming problem:
RegLSℓ1(β) = ‖Y − X·β‖² + λ ∑ᵢ |βi|
[Figure: constraint sets for the system Xβ = Y. Combinatorial: I(|β1| > 0) + I(|β2| > 0) = 1; algebraic: |β1|^(1/2) + |β2|^(1/2) = 1; convex: |β1| + |β2| = 1; linear: β1² + β2² = r²]
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.126
Regularisation: linear vs. quadratic constraint
In this example, the design matrix overspecifies the parameter vector (X has
more rows than columns)
The dotted line depicts the optimal balance as a function of the regularisation parameter λ (λ = 0 corresponds to the centre of the ellipses)
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.127
Regularisation: linear vs. quadratic constraint: high-dim
case
In this example, the system of normal equations is singular
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.128
Equivalence λ ↔ p
• For q < 2, the outcome of the regularized minimum sum of squared residuals is a sparse model with, say, p selected covariates: with each λ corresponds one p.
• So, we can also regularize with p
• The outcomes are models that are not (necessarily) nested, but they can
be ordered (put on a line) from simple to complex.
• We can thus, for instance, plot PE(βp) as a function of p, just as in the case
with nested models
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.129
Solving the ℓ1 constraint problem: KKT conditions
Steepest descent for quadratic forms
Suppose F(β) = ‖Y − Xβ‖₂² = ∑_{i=1}^{n} (Yi − ∑ⱼ Xij βj)²
then ∂F/∂βk = −2 ∑_{i=1}^{n} (Yi − ∑ⱼ Xij βj) Xik
The sum over i is an inner product of the vectors Y − Xβ and the kth column of
X, so, all k together, the direction of steepest descent (the negative gradient, up to the factor 2) becomes Xᵀ(Y − Xβ)
Karush-Kuhn-Tucker (KKT) conditions
If β̂ solves the ℓ1 constrained optimization problem RegLSℓ1(β) = ‖Y − Xβ‖₂² + 2λ‖β‖₁, then
Xⱼᵀ(Y − Xβ̂) = λ sign(β̂j)  if β̂j ≠ 0
|Xⱼᵀ(Y − Xβ̂)| ≤ λ  if β̂j = 0
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.130
Intuitive interpretation of the KKT conditions
Intuition
• βj in the model
If βj ≠ 0, then ∂RegLSℓ1(β)/∂βj is continuous, and it must be zero. That is
expressed by the first line of the Karush-Kuhn-Tucker conditions.
• βj not in the model
If βj = 0, then the partial derivative is discontinuous. The steepest decrease
of ‖Y − Xβ‖₂², obtained by letting βj deviate from zero, should then be smaller than
the increase in the penalty, i.e., the cost (price to pay) for introducing βj into
the model is higher than the gain:
gain < cost ⇔ |∂/∂βj ‖Y − Xβ‖₂²| < 2λ · ∂/∂βj ‖β‖ℓ1 ⇔ |Xⱼᵀ(Y − Xβ)| < λ
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.131
Interpretation of the Karush-Kuhn-Tucker conditions
The solution of the ℓ1 constrained problem will be sparse, but it does not coincide
with the ℓ0 constrained solution. Indeed, the solution that satisfies the KKT
conditions cannot be an orthogonal projection (a least squares solution).
On the next slides is a simple, yet important, special case: soft- versus hard-thresholding.
The latter is an ℓ0 solution, and thus a projection; the former is projection
plus shrinkage.
Then, on slide 135, we extend the interpretation to the general case.
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.132
Special case of ℓ0 regularisation: Hard-Thresholding
• Suppose X = I, so we have the observational model
Y = β + ε
In this model, both ℓ0 and ℓ1-penalized closeness-of-fit can be optimized
componentwise
• ℓ0-penalized closeness-of-fit:
RegLSℓ0(β) = ∑_{i=1}^{n} [(Yi − βi)² + λ² I(|βi| > 0)]
Solution is hard-thresholding: β̂i = HTλ(Yi),
where HTλ(x) = x · I(|x| > λ)
[Figure: graph of the hard-thresholding function]
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.133
Special case of ℓ1 regularisation: Soft-Thresholding
• ℓ1-penalized closeness-of-fit:
RegLSℓ1(β) = ∑_{i=1}^{n} [(Yi − βi)² + 2λ|βi|]
Solution is soft-thresholding: β̂i = STλ(Yi),
where STλ(x) = sign(x) · max(|x| − λ, 0)
[Figure: graph of the soft-thresholding function]
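The two rules from these slides written out as functions (a sketch; HTλ(x) = x·I(|x| > λ) and STλ(x) = sign(x)·max(|x| − λ, 0) follow from minimising the componentwise objectives above):

```python
def hard_threshold(x, lam):
    # l0 solution: keep x unchanged if |x| exceeds the threshold, else zero
    return x if abs(x) > lam else 0.0

def soft_threshold(x, lam):
    # l1 solution: selection plus shrinkage of the surviving value by lam
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

# Hard thresholding is a projection (selection only); soft thresholding
# shrinks every selected coefficient towards zero by lam.
values = [(hard_threshold(v, 1.0), soft_threshold(v, 1.0))
          for v in (-3.0, -0.4, 0.0, 0.4, 3.0)]
```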
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.134
Shrinkage in ℓ1 constraint least squares for Y = Xβ + ε
Let I be the set of selected components; then the Karush-Kuhn-Tucker conditions are equivalent to
XIᵀ XI β̂I = STλ(XIᵀ Y)
From this we conclude that, within the optimal selection, the estimation is
not least squares, but contains a shrinkage
Note that XᵀXβ̂ ≠ STλ(XᵀY) (i.e., without selection I)
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.135
LASSO/Basis pursuit
• ℓ1 penalization/regularization/constraint is called least absolute shrinkage
and selection operator (LASSO) or Basis Pursuit
• For given λ, ℓ1 penalization leads to the same degree of sparsity as ℓ0 (see Figure on slide )
• For fixed λ, and if all nonzero β are large enough, ℓ1 penalization is variable selection consistent: for n → ∞, the set of nonzero variables in the selection equals the true set with probability tending to one
• The convex optimization problem can be solved by quadratic/dynamic programming or by specific methods, such as
– Least Angle Regression (LARS), a direct solver
– Iterative Soft Thresholding (It. ST), an iterative solver
We discuss both on the subsequent slides
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.136
Least Angle Regression
• Brings in the new variable with maximum projection on the current residual,
i.e., the component with highest magnitude in the vector Xᵀ(Y − Xβ0)
• Moves along the equiangular vector uI: β1 = β0 + α uI with
uI = XI(XIᵀXI)⁻¹1I / √(1Iᵀ(XIᵀXI)⁻¹1I)
where I is the set of active variables (variables currently in the model), until one
variable not in I has the same projection on the new residual
Xᵀ(Y − Xβ1)
• Stopping criterion: Mallows’s Cp: we stop if (we think that) Cp(p) has reached
a minimum
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.137
Iterative soft-thresholding
• General X, observational model Y = Xβ + ε
• We want ‖Y − Xβ‖₂² as small as possible, under the constraint that ‖β‖₁ = ∑ᵢ |βi| is restricted
• Suppose we have an estimator β(r); then we improve this estimator in two
steps
– We proceed in the direction of the steepest descent of ‖Y − Xβ‖₂².
The direction of steepest descent is Xᵀ(Y − Xβ(r)).
We define β(r+1/2) = β(r) + Xᵀ(Y − Xβ(r))
– We search for β(r+1), which is as close as possible to β(r+1/2), but which
has a restricted ℓ1-norm; so, we take
β(r+1) = STλ(β(r+1/2))
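A sketch of the iteration above in code. For a general X the plain update need not converge (it implicitly assumes ‖XᵀX‖ ≤ 1), so this version rescales the step and the threshold by L = ‖XᵀX‖₂; that step-size choice is my addition, not part of the slide. The fixed point then satisfies the Karush-Kuhn-Tucker conditions of slide 130:

```python
import numpy as np

def soft_threshold(v, t):
    # Componentwise ST_t(v) = sign(v) * max(|v| - t, 0)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(X, y, lam, iterations=5000):
    """Iterative soft-thresholding for ||y - X b||^2 + 2*lam*||b||_1,
    with the step rescaled by L = ||X^T X||_2 to ensure convergence."""
    L = np.linalg.norm(X.T @ X, 2)
    beta = np.zeros(X.shape[1])
    for _ in range(iterations):
        half = beta + (X.T @ (y - X @ beta)) / L  # steepest-descent half-step
        beta = soft_threshold(half, lam / L)      # shrinkage half-step
    return beta

# Invented sparse toy problem
rng = np.random.default_rng(1)
n, K = 40, 8
X = rng.standard_normal((n, K))
y = 3.0 * X[:, 0] - 2.0 * X[:, 4] + 0.1 * rng.standard_normal(n)
lam = 5.0
beta = ista(X, y, lam)
corr = X.T @ (y - X @ beta)   # should satisfy the KKT conditions
```

At convergence, corr[j] ≈ λ·sign(βj) on the selected components and |corr[j]| ≤ λ elsewhere, reproducing slide 130.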
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.138
Degrees of freedom in sparse variable selection
Degrees of freedom (see definition slide 93): νp = (1/σ²) E[εᵀ(ε − ep)]
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.139
Generalized Cross Validation (GCV):
compromise between CV and Cp
Definition: GCV(β̂p) = (1/n) SSE(β̂p) / (1 − νp/n)²
This comes from a formalization + simplification of the routine of
cross validation.
GCV is an estimator of the prediction error (like Cp).
The use of GCV assumes sparse models.
GCV combines the benefits of CV and Cp:
• Like Cp, GCV is much faster than CV
• Like CV, GCV does not need a variance estimator
The implicit variance estimator is quite robust
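A sketch for the orthogonal case X = I (soft-thresholding), where νλ can be estimated by the number N1(λ) of surviving coefficients, as on slide 142; the sparse signal below is invented for illustration:

```python
import math
import random

def soft_threshold(x, lam):
    return math.copysign(max(abs(x) - lam, 0.0), x)

def gcv(y, lam):
    """GCV(lam) = (1/n) SSE / (1 - nu/n)^2, with nu estimated by N1(lam),
    the number of nonzero coefficients after soft-thresholding."""
    n = len(y)
    beta = [soft_threshold(yi, lam) for yi in y]
    sse = sum((yi - bi) ** 2 for yi, bi in zip(y, beta))
    nu = sum(1 for bi in beta if bi != 0.0)
    if nu == n:
        return float("inf")           # guard against a zero denominator
    return (sse / n) / (1.0 - nu / n) ** 2

random.seed(3)
n, k = 200, 10
beta_true = [5.0] * k + [0.0] * (n - k)      # sparse signal
y = [b + random.gauss(0.0, 1.0) for b in beta_true]

lams = [0.1 * i for i in range(1, 81)]       # grid of thresholds
best = min(lams, key=lambda l: gcv(y, l))
n1 = sum(1 for yi in y if abs(yi) > best)    # selected model size
```

The GCV minimum lands at a moderate threshold, selecting a genuinely sparse model without any estimate of the noise variance.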
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.140
Generalised Cross Validation
GCV also estimates the prediction error, however without having to estimate
the variance of the errors.
GCV(β̂p) = (1/n) SSE(β̂p) / (1 − νp/n)²   while   ∆(β̂p) = (1/n) SSE(β̂p) + (2νp/n)σ² − σ²
(see slide 93 for the definition of ∆(β̂p))
The working of GCV is based on
GCV(β̂p) − σ² = [∆(β̂p) − (νp/n)² σ²] / (1 − νp/n)²
[GCV(β̂p) − σ² − ∆(β̂p)] / ∆(β̂p) = [2νp/n − (νp/n)²] / (1 − νp/n)² − [(νp/n)² / (1 − νp/n)²] · σ²/∆(β̂p)
Interpretation: GCV(β̂p) − σ² − ∆(β̂p) is small if νp/n is small; this requires the sparsity
assumption
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.141
Illustration: GCV in LASSO, with νλ = E(N1(λ))
[Figure: PE, ∆, PE + σ², GCV and its estimated version as functions of λ, from large models (small λ) to small models (large λ)]
GCV(λ) = (1/n) SSE(β̂λ) / (1 − νλ/n)²   where νλ = n1 = E(N1(λ))
ĜCV(λ) = (1/n) SSE(β̂λ) / (1 − ν̂λ/n)²   where ν̂λ = N1(λ)
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.142
Two results supporting the use of GCV
1. Uniform efficiency (requires sparsity)
There exists a sequence Λn of (large) subsets of R, so that
sup_{λ∈Λn} |GCV(λ) − σ² − ∆λ| / (∆λ + Vn) →P 0 as n → ∞
where Vn = max(0, sup_{λ∈Λn} (PE(β̂λ,p) − ∆λ))
Interpretation
On a large subset of R, GCV is “equivalent” to Mallows’s Cp (see further)
2. Behavior near zero λ in LASSO (full model)
lim_{λ→0} E[GCV(λ)] = cn > 0
Interpretation
The behavior near zero does not disturb the minimization procedure
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.143
Uniform efficiency: “equivalence” GCV and Cp
PE (Prediction Error) and Cp:
• E(∆p) = PE(β̂p) (see slide 93)
• Bias-variance ⇔ closeness-complexity
• Equivalence in expected value
Cp and GCV:
1. Asymptotic
2. No expectations
3. Requires sparsity
4. Conditions on the design matrix X
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.144
PART 4: MODEL AVERAGING
c©Maarten Jansen STAT-F-408 Comp. Stat. — Chap. 1: Multiple Regression p.145