Statistical Foundations of Data Science
Jianqing Fan
Princeton University
http://www.princeton.edu/~jqfan
Jianqing Fan (Princeton University) ORF 525, S19: Statistical Foundations of Data Sciences 1 / 46
Big Data are ubiquitous

Big Data arise in the Internet, Business, Finance, Government, Medicine, Engineering, Science, Digital Humanities, and Biological Science.

Global data volume: 2003: 5 EB; 2010: 1.2 ZB; 2012: 2.7 ZB; 2015: 8 ZB; 2020: 40 ZB.

• Volume • Velocity • Variety

"There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days" — Eric Schmidt, CEO of Google
Deep Impact of Data Tsunami
System: storage, communication, computation architectures
Analysis: statistics, computation, optimization, privacy
Data Science pipeline: Acquisition & Storage → Computing → Analysis → Applications
Big Data =⇒ Smart Data
What can big data do?
Holds great promise for understanding
• Heterogeneity: personalized medicine or services
• Commonality: in the presence of large variations (noise)
from large pools of variables, factors, genes, environments and their interactions, as well as latent factors.

Aims of Data Science:
• Prediction: to construct as effective a method as possible to predict future observations. (correlation)
• Inference and Prediction: to gain insight into the relationship between features and response for scientific purposes, and to construct an improved prediction method. (causation)
Common Features and Techniques
Common Features of Big Data:
• Heterogeneity, endogeneity
• Dependence, heavy tails
• Missing data, measurement errors, spurious correlation
• Survivor biases, sampling biases

Common Techniques for Data Science:
• Statistical techniques: MLE, least-squares, M-estimation
• Regression: parametric, nonparametric, sparse
• Principal component analysis: supervised, unsupervised
1. Multiple and Nonparametric Regression

1.1. Theory of Least-Squares
1.2. Model Building
1.3. Ridge Regression
1.4. Regression in RKHS
1.5. Cross-Validation
1.1. Theory of Least-Squares
Read materials and R implementations here:
http://orfe.princeton.edu/%7Ejqfan/fan/classes/245/chap11.pdf
Purpose of multiple regression
• Study association between dependent & independent variables
• Screen irrelevant variables and select useful ones
• Prediction

Example: Hong Kong environmental dataset
Interest: association between levels of pollutants and the number of daily hospital admissions for circulatory and respiratory problems.

Response Y = daily number of hospital admissions
Covariates:
▸ level of the pollutant Sulphur Dioxide, X1
▸ level of the pollutant Nitrogen Dioxide, X2
▸ level of respirable suspended particles, X3
▸ · · ·
Multiple linear regression model
Y = β1X1 + β2X2 + · · ·+ βpXp + ε
Y : response / dependent variable
Xj ’s: explanatory / independent variables or covariates
ε: random error not explained / predicted by covariates
include intercept by setting X1 = 1
[Figure: the regression line y = β0 + β1x, with mean responses β0 + β1x1, β0 + β1x2, β0 + β1x3 at covariate values x1, x2, x3]
Method of least-squares

Data: {(xi1, xi2, · · · , xip, yi)}1≤i≤n

Model: yi = ∑_{j=1}^p xij βj + εi

Method of Least-Squares:
minimize_{β∈R^p} RSS(β) := ∑_{i=1}^n (yi − ∑_{j=1}^p xij βj)²

• RSS stands for residual sum-of-squares
• When εi i.i.d. ∼ N(0, σ²), the least-squares estimator is the MLE
Regression in matrix notation
y = (y1, · · · , yn)^T,  X = (xij)_{1≤i≤n, 1≤j≤p},  β = (β1, · · · , βp)^T,  ε = (ε1, · · · , εn)^T
Model becomes
y = Xβ + ε
RSS becomes
RSS(β) = ‖y−Xβ‖2
Closed-form solution
Least-squares: Minimize wrt β ∈ Rp
‖y−Xβ‖2 = (y−Xβ)T (y−Xβ)
Setting gradients to zero yields normal equations:
XT y = XT Xβ
Least-squares estimator (assuming X has full column rank):
β̂ = (X^T X)^{-1} X^T y
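As a numerical sketch of the closed-form solution (the simulated data, true coefficients, noise level, and seed are illustrative assumptions), the normal equations can be solved directly and checked against a standard solver:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative simulated data: n = 50 observations, intercept plus 2 covariates
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Normal equations X^T y = X^T X beta  =>  beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq minimizes the same RSS, but more stably (via SVD)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With full column rank both computations agree; in practice `lstsq` (or a QR factorization) is preferred to explicitly forming X^T X.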
Geometric interpretation of least-squares

Fitted value: ŷ = Xβ̂ = Py, where P := X(X^T X)^{-1} X^T ∈ R^{n×n}

Theorem 1.1 [Properties of the projection matrix]
• Pxj = xj, j = 1, 2, · · · , p
• P² = P, or equivalently P(In − P) = 0

[Figure: y projected onto the plane spanned by X1 and X2, giving fitted value ŷ and residual y − ŷ]

P projects the response vector y onto the linear space spanned by the columns of X.
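The projection properties are easy to verify numerically; a minimal sketch with simulated data (dimensions and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3
X = rng.normal(size=(n, p))            # full column rank almost surely
y = rng.normal(size=n)

# Projection matrix P = X (X^T X)^{-1} X^T onto the column space of X
P = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = P @ y                          # fitted values: the projection of y

# P x_j = x_j for every column of X, P^2 = P, and the residual
# y - y_hat is orthogonal to the column space of X
```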
Statistical properties of least-squares estimator

Assumptions:
• Exogeneity: E(ε|X) = 0;
• Homoscedasticity: var(ε|X) = σ² In.

Statistical Properties:
• Unbiasedness: E(β̂|X) = β
• Variance: var(β̂|X) = σ² (X^T X)^{-1}
(the conditioning on X is often dropped from the notation)

Recall cov(U, V) = E(U − µu)(V − µv)^T and var(U) = cov(U, U);
cov(a^T U, b^T V) = a^T cov(U, V) b,  var(a^T U) = a^T var(U) a.
Gauss-Markov Theorem

How large is the variance, compared with other estimators?

Theorem 1.2 [Gauss-Markov Theorem]
The LSE β̂ is the best linear unbiased estimator (BLUE):
• a^T β̂ is a linear unbiased estimator of the parameter θ = a^T β;
• for any linear unbiased estimator b^T y of θ,
var(b^T y|X) ≥ var(a^T β̂|X).

Estimation of σ²: σ̂² = RSS/(n − p) = ‖y − Xβ̂‖²/(n − p)

σ̂² is an unbiased estimator of σ².
Statistical inference

Additional assumption: ε ∼ N(0, σ² In)

Under fixed design or conditioning on X,
β̂ = β + (X^T X)^{-1} X^T ε  =⇒  β̂ ∼ N(β, σ² (X^T X)^{-1})

• β̂j ∼ N(βj, vj σ²), where vj is the j-th diagonal element of (X^T X)^{-1}
• (n − p)σ̂² ∼ σ² χ²_{n−p}, and σ̂² is independent of β̂
• 1 − α CI for βj: β̂j ± t_{n−p}(1 − α/2) √vj σ̂ (homework)
• H0: βj = 0: test statistic tj = β̂j/(√vj σ̂) ∼ t_{n−p} under H0
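These formulas translate directly into code; a sketch on simulated data (the design, true coefficients, and seed are illustrative, with the third coefficient set to zero so its t-statistic should be small):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([0.5, 2.0, 0.0])      # last coefficient truly zero
y = X @ beta_true + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)       # unbiased estimate of sigma^2
v = np.diag(XtX_inv)                       # v_j = j-th diagonal of (X^T X)^{-1}

se = np.sqrt(v * sigma2_hat)               # standard errors sqrt(v_j) * sigma_hat
t_stats = beta_hat / se                    # t_j ~ t_{n-p} under H0: beta_j = 0
# a 1 - alpha CI for beta_j is beta_hat[j] +/- t_{n-p}(1 - alpha/2) * se[j]
```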
Non-normal error

Appeal to asymptotic theory:

√n(β̂ − β) = (n^{-1} X^T X)^{-1} · n^{-1/2} X^T ε,

where n^{-1} X^T X = n^{-1} ∑_{i=1}^n Xi Xi^T → Σ by the LLN, and
n^{-1/2} X^T ε = n^{-1/2} ∑_{i=1}^n Xi εi ⇝ N(0, σ²Σ) by the CLT.

Using Slutsky's theorem (homework),
√n(β̂ − β) →_d N(0, σ² Σ^{-1}),  or approximately  β̂ ∼ N(β, σ² (X^T X)^{-1}).
Correlated errors

y = Xβ + ε, where var(ε|X) = σ² W

Transform the data as follows:
y* = W^{-1/2} y,  X* = W^{-1/2} X,  ε* = W^{-1/2} ε.

Generalized Least-Squares:
min_{β∈R^p} ‖y* − X*β‖² = (y − Xβ)^T W^{-1} (y − Xβ)

Heteroscedastic errors: W = diag(v1, · · · , vn)
Weighted Least-Squares: min_β ∑_{i=1}^n (yi − Xi^T β)²/vi.
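Weighted least squares is just OLS after the W^{-1/2} transformation; a sketch with known variance weights v_i (the data-generating model and weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, -2.0])
v = rng.uniform(0.5, 4.0, size=n)          # assumed known: var(eps_i) proportional to v_i
y = X @ beta_true + np.sqrt(v) * rng.normal(size=n)

# Transform: y* = W^{-1/2} y, X* = W^{-1/2} X with W = diag(v_1, ..., v_n),
# then run ordinary least squares on the transformed data
X_star = X / np.sqrt(v)[:, None]
y_star = y / np.sqrt(v)
beta_wls, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
```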
1.2. Model Building
Nonlinear and nonparametric regression
Nonlinear regression

Univariate: Y = f(X) + ε,
where f(·) has a structural property: smooth, monotone, convex, ...

Weierstrass theorem: any continuous f(X) on [0,1] can be uniformly approximated by a polynomial function.

Polynomial regression:
Y = β0 + β1 X + · · · + βd X^d + ε,  with β0 + β1 X + · · · + βd X^d ≈ f(X)

• Multiple regression with X1 = X, · · · , Xd = X^d
Drawback: not suitable for functions with varying degrees of smoothness
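Polynomial regression is literally multiple regression on the powers of X; a sketch on simulated data (the target sin(2x), the noise level, and the seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 100, 3
x = rng.uniform(-1, 1, size=n)
f = lambda t: np.sin(2 * t)                      # smooth target, illustrative
y = f(x) + 0.1 * rng.normal(size=n)

# Design matrix with columns 1, x, x^2, x^3: multiple regression on powers of x
X = np.vander(x, d + 1, increasing=True)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

x0 = np.array([0.0, 0.5])
pred = np.vander(x0, d + 1, increasing=True) @ beta_hat
```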
Polynomial versus cubic spline regressions

[Figure: polynomial versus cubic spline fit on simulated data]
• Red: polynomial regression with degree 3
• Blue: spline regression with degree 3
Spline regression

• Piecewise polynomials of degree d, with continuous derivatives up to order d − 1.
• Knots {τj}_{j=1}^K: the points where discontinuity (of the d-th derivative) occurs.

Example: linear splines with 0 = τ0 < τ1 < τ2 < τ3 = 1
• Linearity on [0, τ1] yields l(x) = β0 + β1 x, x ∈ [0, τ1].
• Linearity on [τ1, τ2] + continuity at τ1 gives
l(x) = β0 + β1 x + β2 (x − τ1)+, x ∈ [τ1, τ2].
• Linearity on [τ2, 1] + continuity at τ2 gives
l(x) = β0 + β1 x + β2 (x − τ1)+ + β3 (x − τ2)+, x ∈ [τ2, 1].
Basis functions for Linear Splines

Basis functions for linear splines on 0 = τ0 < τ1 < τ2 < τ3 = 1:
B0(x) = 1, B1(x) = x, B2(x) = (x − τ1)+, B3(x) = (x − τ2)+

Spline regression:
Y = β0 B0(X) + β1 B1(X) + β2 B2(X) + β3 B3(X) + ε,  with the sum ≈ f(X)

• Multiple regression with X0 = B0(X), X1 = B1(X), X2 = B2(X), X3 = B3(X)

General case: {1, x, (x − τj)+, j = 1, · · · , K}
Nonparametric: K = Kn is large, with Kn → ∞
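The truncated power basis makes spline fitting another multiple regression; a sketch with illustrative knots and a piecewise-linear target (all choices below are assumptions for the example):

```python
import numpy as np

def linear_spline_basis(x, knots):
    """Truncated power basis {1, x, (x - tau_j)_+, j = 1..K} for linear splines."""
    cols = [np.ones_like(x), x] + [np.maximum(x - t, 0.0) for t in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(4)
n = 300
x = rng.uniform(0, 1, size=n)
f = lambda t: 1.0 + 2.0 * t - 3.0 * np.maximum(t - 0.5, 0.0)  # kink at 0.5
y = f(x) + 0.05 * rng.normal(size=n)

knots = [0.25, 0.5, 0.75]                       # illustrative knot placement
B = linear_spline_basis(x, knots)
coef, *_ = np.linalg.lstsq(B, y, rcond=None)    # ordinary least squares on the basis

x0 = np.array([0.1, 0.6])
fit = linear_spline_basis(x0, knots) @ coef
```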
Cubic splines

Piecewise cubic polynomial with continuous 1st and 2nd derivatives:
c(x) = β0 + β1 x + β2 x² + β3 x³,  x ≤ τ1,
c(x) = β0 + β1 x + β2 x² + β3 x³ + β4 (x − τ1)+^3,  x ∈ [τ1, τ2],
c(x) = β0 + β1 x + β2 x² + β3 x³ + β4 (x − τ1)+^3 + β5 (x − τ2)+^3,  x > τ2.

Basis functions:
B0(x) = 1, B1(x) = x, B2(x) = x², B3(x) = x³,
B4(x) = (x − τ1)+^3, B5(x) = (x − τ2)+^3.

• Widely used  • A multiple regression
Extension to multiple covariates

Bivariate quadratic regression model:
Y = β0 + β1 X1 + β2 X2 + β3 X1² + β4 X1X2 + β5 X2² + ε,
where X1X2 is the interaction term.

Multivariate quadratic regression:
Y = ∑_{j=1}^p βj Xj + ∑_{j≤k} βjk Xj Xk + ε

Multivariate regression with main effects (linear terms) and interactions:
Y = ∑_{j=1}^p βj Xj + ∑_{j<k} βjk Xj Xk + ε
Multivariate spline regression

Idea: tensor products of univariate basis functions
Drawback: curse of dimensionality, namely, the number of basis functions scales exponentially with p
Remedy: add additional structure to f(·)

Example: additive model
Y = f1(X1) + · · · + fp(Xp) + ε
• Number of basis functions scales linearly with p

Example: bivariate interaction model
Y = ∑_{1≤i≤j≤p} fij(Xi, Xj) + ε
• Number of basis functions scales quadratically with p
1.3. Ridge Regression
Ridge Regression

Drawbacks of OLS: • requires n > p;  • large variance under collinearity
Remedy: ridge regression (Hoerl and Kennard, 1970)

β̂λ = (X^T X + λI)^{-1} X^T y

• λ > 0 is a regularization parameter.

Interpretation: penalized LS, minimizing ‖y − Xβ‖² + λ‖β‖².
— Setting the gradient to zero gives X^T (Xβ − y) + λβ = 0.

Bayesian estimator: posterior mode under the prior β ∼ N(0, (σ²/λ) Ip).
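A minimal sketch of the ridge estimator in its closed form (the simulated data and λ values are illustrative assumptions):

```python
import numpy as np

def ridge(X, y, lam):
    """Minimizer of ||y - X beta||^2 + lam ||beta||^2:
    beta_lam = (X^T X + lam I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(5)
n, p = 50, 5
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(size=n)

b_ols = ridge(X, y, 0.0)        # lam = 0 recovers ordinary least squares
b_ridge = ridge(X, y, 10.0)     # lam > 0 shrinks the coefficients toward 0
```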
Bias-Variance Tradeoff

Smaller variance (due to the prior):
Var(β̂λ) = σ² (X^T X + λI)^{-1} X^T X (X^T X + λI)^{-1} ≺ Var(β̂).

Larger bias (due to the prior):
E(β̂λ) − β = (X^T X + λI)^{-1} X^T Xβ − β = −λ(X^T X + λI)^{-1} β.

Overall error:
MSE(β̂λ) = E‖β̂λ − β‖² = tr{(X^T X + λI)^{-2} [λ² ββ^T + σ² X^T X]}.

(d/dλ) MSE(β̂λ)|_{λ=0} < 0  ⇒  ∃ λ > 0 that outperforms OLS.
Generalization: ℓq Penalized Least Squares

ℓq penalized least-squares estimate:
min_β ‖y − Xβ‖² + λ‖β‖_q^q,  q > 0.

• Known as the Bridge estimator (Frank and Friedman, 1993);
• When q = 1, called the Lasso estimator (Tibshirani, 1996);
• The penalty is strictly concave in |βj| when 0 < q < 1 and convex when q ≥ 1;
• Only q = 2 admits a closed-form solution.
Generalized Ridge Regression

Assume that β ∼ N(0, Σ). Then
p(β|y, X) ∝ e^{−‖y−Xβ‖²/(2σ²)} e^{−β^T Σ^{-1} β/2}.

The MAP estimate is
β̂_MAP = argmax_β p(β|y, X) = (X^T X + σ² Σ^{-1})^{-1} X^T y.

• Σ can take into account the different scales of the covariates.
Ridge Regression Solution Path

β̂λ = (X^T X + λI)^{-1} X^T y as a function of λ.

Efficient computation: let X = UDV^T be the SVD.
• X^T X = VDU^T UDV^T = VD²V^T;
• β̂λ = V(D² + λI)^{-1} D U^T y = ∑_{j=1}^p (dj/(dj² + λ)) (uj^T y) vj,
— uj and vj are the j-th columns of U and V.

• Very fast to compute for many values of {λk}.
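The SVD identity yields the whole solution path from a single factorization; a sketch (the data and the grid of λ values are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 40, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# One SVD of X serves every lambda on the grid
U, d, Vt = np.linalg.svd(X, full_matrices=False)
Uty = U.T @ y

def ridge_path(lams):
    # beta_lam = sum_j d_j / (d_j^2 + lam) * (u_j^T y) * v_j
    return np.array([Vt.T @ (d / (d**2 + lam) * Uty) for lam in lams])

lams = [0.1, 1.0, 10.0]
path = ridge_path(lams)
```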
Kernel Ridge Regression

Theorem 1.3. Alternative expression: β̂λ = X^T (XX^T + λI)^{-1} y

Prediction at x is ŷ = x^T β̂λ = x^T X^T (XX^T + λI)^{-1} y.

Note that (XX^T)ij = ⟨xi, xj⟩ and x^T X^T = (⟨x, x1⟩, · · · , ⟨x, xn⟩).
• Prediction depends only on pairwise inner products;
• Generalizes to other similarity measures K(·, ·): the kernel trick.

[Figure: kernel values as similarity scores between pairs of example inputs, e.g. K = +1.0 for a similar pair and K = −1.0 for a dissimilar pair]
Kernel regression

Kernel: K(·, ·) such that the matrix (K(xi, xj))_{n×n} is PSD for any {xi}_{i=1}^n.

Commonly used kernels:
• linear: ⟨u, v⟩
• polynomial: (1 + ⟨u, v⟩)^d, d = 2, 3, · · ·
• Gaussian: e^{−γ‖u−v‖²}
• Laplacian: e^{−γ‖u−v‖}

Fit the model y = ∑_{j=1}^n αj K(x, xj) + ε with basis {K(·, xj)}_{j=1}^n by
min_{α∈R^n} {‖y − Kα‖² + λ α^T K α}
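A sketch of this fit with a Gaussian kernel (the bandwidth γ, the value of λ, and the sin target are illustrative assumptions):

```python
import numpy as np

def gauss_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix: K_ij = exp(-gamma * ||a_i - b_j||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(7)
n = 100
x = rng.uniform(-2, 2, size=(n, 1))
y = np.sin(x[:, 0]) + 0.1 * rng.normal(size=n)

lam = 0.1
K = gauss_kernel(x, x)
# Minimizing ||y - K alpha||^2 + lam alpha^T K alpha gives alpha = (K + lam I)^{-1} y
alpha = np.linalg.solve(K + lam * np.eye(n), y)

x_new = np.array([[0.5]])
pred = gauss_kernel(x_new, x) @ alpha      # prediction sum_j alpha_j K(x, x_j)
```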
Kernel ridge regression

With K = (K(xi, xj)) ∈ R^{n×n}, the prediction at a testing point x from the training points {xi} is
ŷ = (K(x, x1), · · · , K(x, xn))(K + λI)^{-1} y,

• ŷ = f̂(x) = ∑_{i=1}^n αi K(x, xi), with weights α = (K + λI)^{-1} y;
• Tune the parameter λ to minimize prediction errors.
1.4 Reproducing Kernel Hilbert Spaces

Justification of the Kernel Trick by the Representer Theorem
Hilbert Space

Hilbert space: a (complete) space endowed with an inner product.

Let X be a set and H a space of functions on X with inner product ⟨·, ·⟩H.

The evaluation functional at x is Lx(f) = f(x), ∀f ∈ H.

Reproducing kernel Hilbert space (RKHS): for any x, ∃ Cx s.t.
|Lx(f)| = |f(x)| ≤ Cx ‖f‖H, ∀f ∈ H.

Then Lx is a continuous linear functional. By the Riesz representation theorem,
∃ Kx ∈ H s.t. ⟨Kx, f⟩H = Lx(f), ∀f ∈ H.
Reproducing Kernel

Reproducing kernel: K(x, x′) = ⟨Kx, Kx′⟩H.
• Symmetry: K(x, x′) = K(x′, x);
• PSD: the matrix (K(xi, xj))_{n×n} is PSD for all {xi}_{i=1}^n.

K admits the eigen-decomposition
K(x, x′) = ∑_{j=1}^∞ γj ψj(x) ψj(x′),  with ∑_{j=1}^∞ γj² < ∞,
— {γj}_{j=1}^∞ are the eigenvalues, and {ψj}_{j=1}^∞ are the eigen-functions.
Kernel Ridge Regression

For any g, g′ ∈ HK with g = ∑_{j=1}^∞ βj ψj and g′ = ∑_{j=1}^∞ β′j ψj, define
⟨g, g′⟩_{HK} = ∑_{j=1}^∞ γj^{-1} βj β′j;  ‖g‖_{HK} = √⟨g, g⟩_{HK}.

Reproducibility: ⟨K(·, x′), g⟩_{HK} = ∑_j γj^{-1} {γj ψj(x′)} βj = g(x′).

Nonparametric regression: yi = f(xi) + εi with f ∈ HK, i ∈ [n]. Find
f̂ = argmin_{f∈HK} { ∑_{i=1}^n [yi − f(xi)]² + λ‖f‖²_{HK} },  λ > 0.

Writing f = ∑_{j=1}^∞ βj ψj and setting β̃j = βj/√γj and ψ̃j = √γj ψj, we get
min_{{β̃j}} { ∑_{i=1}^n [yi − ∑_{j=1}^∞ β̃j ψ̃j(xi)]² + λ ∑_{j=1}^∞ β̃j² }.
Representer Theorem

Theorem 1.4
For a loss L(y, f(x)) and an increasing function Pλ(·), let
f̂ = argmin_{f∈HK} { ∑_{i=1}^n L(yi, f(xi)) + Pλ(‖f‖_{HK}) },  λ > 0.
Then (homework)
f̂ = ∑_{i=1}^n α̂i K(·, xi),
where α̂ = (α̂1, · · · , α̂n)^T solves
min_α { ∑_{i=1}^n L(yi, ∑_{j=1}^n αj K(xi, xj)) + Pλ(√(α^T K α)) }.

• An infinite-dimensional regression problem;
• A finite-dimensional representation for the solution.
Applications of Representer Theorem

Apply the representer theorem to kernel ridge regression:
f̂ = argmin_{f∈HK} { ∑_{i=1}^n (yi − f(xi))² + λ‖f‖²_{HK} }.

We must have f̂ = ∑_{i=1}^n α̂i K(·, xi), with α̂ ∈ R^n solving
min_{α∈R^n} {‖y − Kα‖² + λ α^T K α}.

It is easily seen that
α̂ = (K + λI)^{-1} y.
1.5 Cross-Validation
k-fold Cross-Validation

Purpose: to estimate the prediction error of a procedure

k-fold Cross-Validation (CV)
• Divide the data randomly and evenly into k subsets;
• Use one subset as the testing set and the remaining ones as the training set to compute a testing error;
• Repeat for each of the k subsets and average the testing errors.

Commonly used: k = 5 or 10

Leave-one-out CV: k = n, with CV = (1/n) ∑_{i=1}^n [yi − f̂^{−i}(xi)]²,
where f̂^{−i}(xi) is the predicted value based on {(xj, yj)}_{j≠i}.
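The steps above can be sketched in a few lines (the OLS model and simulated data are illustrative; any fit/predict pair works):

```python
import numpy as np

def kfold_cv_mse(X, y, fit, predict, k=5, seed=0):
    """Average test mean-squared error over k random, roughly equal folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    errs = []
    for test in np.array_split(idx, k):
        train = np.setdiff1d(idx, test)
        model = fit(X[train], y[train])
        errs.append(np.mean((y[test] - predict(model, X[test])) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(8)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + 0.5 * rng.normal(size=n)

ols_fit = lambda Xt, yt: np.linalg.lstsq(Xt, yt, rcond=None)[0]
ols_pred = lambda b, Xt: Xt @ b
cv_err = kfold_cv_mse(X, y, ols_fit, ols_pred, k=5)
# estimated prediction error; the true noise variance here is 0.25
```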
Linear smoother

ŷ = Sy for data {(xi, yi)}_{i=1}^n, where S depends only on X.

Self-stable: f̂(x) = f̃(x), where f̂ is the function estimated from the data {(xi, yi)}_{i=1}^n and f̃ is estimated from the augmented data {(xi, yi)}_{i=1}^n together with (x, f̂(x)).

Theorem 1.5. For a self-stable linear smoother ŷ = Sy,
yi − f̂^{−i}(xi) = (yi − ŷi)/(1 − Sii), ∀i ∈ [n],  and hence  CV = (1/n) ∑_{i=1}^n ((yi − ŷi)/(1 − Sii))².
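Theorem 1.5 can be checked numerically for the ridge smoother S = X(X^T X + λI)^{-1} X^T, which is self-stable: brute-force leave-one-out refits match the shortcut exactly (the data and λ are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 30, 3
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.3 * rng.normal(size=n)
lam = 1.0

S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)   # ridge smoother matrix
y_hat = S @ y

# Shortcut: CV = (1/n) sum_i ((y_i - yhat_i) / (1 - S_ii))^2
cv_short = np.mean(((y - y_hat) / (1 - np.diag(S))) ** 2)

# Brute force: refit with observation i held out, then predict at x_i
cv_brute = 0.0
for i in range(n):
    keep = np.arange(n) != i
    b = np.linalg.solve(X[keep].T @ X[keep] + lam * np.eye(p), X[keep].T @ y[keep])
    cv_brute += (y[i] - X[i] @ b) ** 2
cv_brute /= n
```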
Generalized Cross-Validation

GCV (Golub et al., 1979):
GCV = (1/n) ∑_{i=1}^n (yi − ŷi)² / [1 − tr(S)/n]².

tr(S) is called the effective degrees of freedom.

| Self-stable method | S | tr(S) |
| --- | --- | --- |
| Multiple linear regression | X(X^T X)^{-1} X^T | p |
| Ridge regression | X(X^T X + λI)^{-1} X^T | ∑_{j=1}^p dj²/(dj² + λ) |
| Kernel ridge regression in RKHS | K(K + λI)^{-1} | ∑_{j=1}^n γj/(γj + λ) |

• {dj} and {γj} are the singular values of X and K.

Use GCV to choose λ by minimizing
GCV(λ) = (1/n) ‖(I − Sλ)y‖² / [1 − tr(Sλ)/n]².
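For ridge regression, GCV can be evaluated on a λ grid using tr(Sλ) = ∑ dj²/(dj² + λ); a sketch (the data and the grid are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(10)
n, p = 60, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

d = np.linalg.svd(X, compute_uv=False)          # singular values of X

def gcv(lam):
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)   # ridge smoother
    resid = y - S @ y
    df = np.sum(d**2 / (d**2 + lam))            # tr(S_lam): effective degrees of freedom
    return np.mean(resid**2) / (1 - df / n) ** 2

lams = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = [gcv(l) for l in lams]
best_lam = lams[int(np.argmin(scores))]         # GCV choice of lambda on this grid
```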
Bias variance decomposition

Double expectation: EZ = E{E(Z|X)}, for any X.

Best prediction: E(Y|X) = argmin_f E(Y − f(X))².

Bias-variance in prediction: letting f*(X) = E(Y|X),
E(Y − f̂(X))² = E(Y − f*(X))² + E(f*(X) − f̂(X))²,
where the first term is the irreducible variance Eσ²(X) and the second is the estimation error.

Bias-variance in estimation: letting f̄(x) = E f̂n(x),
E(f̂n(X) − f*(X))² = E(f̂n(X) − f̄(X))² + E(f̄(X) − f*(X))²,
where the first term is the variance and the second is the squared bias.

• The variance is small when n is large, and big when the number of parameters is big;
• The bias is small when the model is complex.