Transcript of "Statistical Foundations of Data Science" (ORF 525 lecture notes), 46 slides.

Page 1:

Statistical Foundations of Data Science

Jianqing Fan

Princeton University

http://www.princeton.edu/~jqfan

Jianqing Fan (Princeton University) ORF 525, S19: Statistical Foundations of Data Sciences 1 / 46

Page 2:

Big Data are ubiquitous

Big Data

Internet

Business

Finance

Government

Medicine

Data volume: 2003: 5 EB; 2010: 1.2 ZB; 2012: 2.7 ZB; 2015: 8 ZB; 2020: 40 ZB

Engineering

Science

Digital Humanities

Biological Science

• Volume • Velocity • Variety

"There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days." — Eric Schmidt, CEO of Google

Page 3:

Deep Impact of Data Tsunami

System: storage, communication, computation architectures

Analysis: statistics, computation, optimization, privacy

Acquisition & Storage

Data Science

Computing

Analysis

Applications

Big Data =⇒ Smart Data

Page 4:

What can big data do?

Holds great promise for understanding
• Heterogeneity: personalized medicine or services
• Commonality: in the presence of large variations (noise)
from large pools of variables, factors, genes, environments and their interactions, as well as latent factors.

Aims of Data Science:

• Prediction: to construct as effective a method as possible for predicting future observations. (correlation)
• Inference and prediction: to gain insight into the relationship between features and response for scientific purposes, and to construct an improved prediction method. (causation)

Page 5:

Common Features and Techniques

Common Features of Big Data:
• Heterogeneity, endogeneity
• Dependence, heavy tails
• Missing data, measurement errors, spurious correlation
• Survivorship biases, sampling biases

Common Techniques for Data Science:
• Statistical techniques: MLE, least-squares, M-estimation
• Regression: parametric, nonparametric, sparse
• Principal component analysis: supervised, unsupervised

Page 6:

1. Multiple and Nonparametric Regression

1.1. Theory of Least-Squares
1.2. Model Building
1.3. Ridge Regression
1.4. Regression in RKHS
1.5. Cross-Validation

Page 7:

1.1. Theory of Least-Squares

• Reading materials and R implementations:

http://orfe.princeton.edu/%7Ejqfan/fan/classes/245/chap11.pdf

Page 8:

Purposes of multiple regression
• Study the association between the dependent and independent variables
• Screen out irrelevant variables and select useful ones
• Prediction

Example: Hong Kong environmental dataset
Interest: the association between levels of pollutants and the number of daily hospital admissions for circulatory and respiratory problems.

Response Y = daily number of hospital admissions

Covariates:
– X1: level of the pollutant sulphur dioxide
– X2: level of the pollutant nitrogen dioxide
– X3: level of respirable suspended particles
– · · ·

Page 9:

Multiple linear regression model

Y = β1X1 + β2X2 + · · ·+ βpXp + ε

Y : response / dependent variable

Xj ’s: explanatory / independent variables or covariates

ε: random error not explained / predicted by covariates

An intercept is included by setting X1 = 1.

[Figure: regression lines y = β0 + β1x evaluated at covariate values x1, x2, x3]

Page 10:

Method of least-squares

Data: {(x_{i1}, x_{i2}, …, x_{ip}, y_i)}_{1≤i≤n}

Model: y_i = ∑_{j=1}^p x_{ij} β_j + ε_i


Method of Least-Squares:

minimize_{β∈R^p} RSS(β) = ∑_{i=1}^n (y_i − ∑_{j=1}^p x_{ij} β_j)²

RSS stands for residual sum-of-squares.

When ε_i ~ i.i.d. N(0, σ²), the least-squares estimator is the MLE.

Page 11:

Regression in matrix notation

y = (y_1, …, y_n)^T, X = (x_{ij})_{1≤i≤n, 1≤j≤p}, β = (β_1, …, β_p)^T, ε = (ε_1, …, ε_n)^T

Model becomes

y = Xβ + ε

RSS becomes

RSS(β) = ‖y − Xβ‖²

Page 12:

Closed-form solution

Least-squares: minimize over β ∈ R^p

‖y − Xβ‖² = (y − Xβ)^T (y − Xβ)

Setting the gradient to zero yields the normal equations:

X^T y = X^T X β

Least-squares estimator (assuming X has full column rank):

β̂ = (X^T X)^{-1} X^T y
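The closed form can be sanity-checked on simulated data; a minimal NumPy sketch (all names and parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # include intercept
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + 0.1 * rng.normal(size=n)

# Normal equations X^T y = X^T X beta  =>  beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Agrees with the numerically stabler SVD-based solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

In practice one solves the linear system (or uses QR/SVD) rather than forming the inverse explicitly.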

Page 13:

Geometric interpretation of least-squares

Fitted value: ŷ = Xβ̂ = Py, where P ≜ X(X^T X)^{-1} X^T ∈ R^{n×n} is the projection (hat) matrix.

Theorem 1.1 [Properties of the projection matrix]
• P x_j = x_j, j = 1, 2, …, p
• P² = P, or equivalently P(I_n − P) = 0

[Figure: y projected onto the plane spanned by X1 and X2; the residual y − ŷ is orthogonal to that plane]

• Least-squares projects the response vector y onto the linear space spanned by the columns of X.
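The properties in Theorem 1.1 are easy to verify numerically (an illustrative sketch on random data):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 4
X = rng.normal(size=(n, p))
P = X @ np.linalg.solve(X.T @ X, X.T)   # P = X (X^T X)^{-1} X^T

# P fixes every column of X, and P is idempotent: P^2 = P
fixes_columns = np.allclose(P @ X, X)
idempotent = np.allclose(P @ P, P)

# The residual y - Py is orthogonal to the column space of X
y = rng.normal(size=n)
resid = y - P @ y
orthogonal = np.allclose(X.T @ resid, 0)
```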

Page 14:

Statistical properties of least-squares estimator

Assumptions:
• Exogeneity: E(ε|X) = 0;
• Homoscedasticity: var(ε|X) = σ²I_n.

Statistical properties:
• Unbiasedness: E(β̂|X) = β
• Variance: var(β̂|X) = σ²(X^T X)^{-1}
(conditioning on X is often dropped)

• Recall cov(U, V) = E[(U − µ_U)(V − µ_V)^T], var(U) = cov(U, U), cov(a^T U, b^T V) = a^T cov(U, V) b, and var(a^T U) = a^T var(U) a.

Page 15:

Gauss-Markov Theorem

How large is the variance, compared with other estimators?

Theorem 1.2 [Gauss–Markov Theorem]
The LSE β̂ is the best linear unbiased estimator (BLUE):
• a^T β̂ is a linear unbiased estimator of the parameter θ = a^T β;
• for any linear unbiased estimator b^T y of θ, var(b^T y|X) ≥ var(a^T β̂|X).

Estimation of σ²: σ̂² = RSS/(n − p) = ‖y − Xβ̂‖²/(n − p)

σ̂² is an unbiased estimator of σ².

Page 16:

Statistical inference

Additional assumption: ε ~ N(0, σ²I_n)

Under a fixed design or conditioning on X,

β̂ = β + (X^T X)^{-1} X^T ε ⟹ β̂ ~ N(β, σ²(X^T X)^{-1})

• β̂_j ~ N(β_j, v_j σ²), where v_j is the j-th diagonal element of (X^T X)^{-1}
• (n − p) σ̂² ~ σ² χ²_{n−p}, and σ̂² is independent of β̂
• 1 − α CI for β_j: β̂_j ± t_{n−p}(1 − α/2) √v_j σ̂ (homework)
• Testing H0: β_j = 0: the test statistic t_j = β̂_j/(√v_j σ̂) follows t_{n−p} under H0.
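These facts translate directly into standard errors, t-statistics, and confidence intervals. A NumPy-only sketch (the critical value 1.985 ≈ t_{96}(0.975) is hard-coded to avoid a SciPy dependency; all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([0.5, 1.0, 0.0, -2.0])
y = X @ beta + rng.normal(size=n)           # true sigma = 1

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)        # unbiased estimate of sigma^2

v = np.diag(XtX_inv)                        # v_j = j-th diagonal of (X^T X)^{-1}
se = np.sqrt(v * sigma2_hat)
t_stat = beta_hat / se                      # t-statistics for H0: beta_j = 0

tcrit = 1.985                               # approx. t_{96}(0.975), hard-coded
ci = np.column_stack([beta_hat - tcrit * se, beta_hat + tcrit * se])
```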

Page 17:

Non-normal error

Appeal to asymptotic theory:

√n (β̂ − β) = (n^{-1} X^T X)^{-1} · n^{-1/2} X^T ε,

where n^{-1} X^T X = n^{-1} ∑_{i=1}^n X_i X_i^T → Σ by the LLN, and n^{-1/2} X^T ε = n^{-1/2} ∑_{i=1}^n X_i ε_i →_d N(0, σ²Σ) by the CLT.

Using Slutsky's theorem (homework),

√n (β̂ − β) →_d N(0, σ²Σ^{-1}), i.e., approximately β̂ ~ N(β, σ²(X^T X)^{-1}).

Page 18:

Correlated errors

y = Xβ + ε, where var(ε|X) = σ²W

Transform the data as follows:

y* = W^{-1/2} y, X* = W^{-1/2} X, ε* = W^{-1/2} ε.

Generalized least-squares:

min_{β∈R^p} ‖y* − X*β‖² = (y − Xβ)^T W^{-1} (y − Xβ)

Heteroscedastic errors: W = diag(v_1, …, v_n)

Weighted least-squares: min_β ∑_{i=1}^n (y_i − X_i^T β)² / v_i.
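The equivalence between weighted least-squares and OLS on the transformed data can be checked directly (a sketch assuming the variances v_i are known; names illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
v = rng.uniform(0.5, 4.0, size=n)           # known error variances v_i
beta = np.array([1.0, -1.0])
y = X @ beta + np.sqrt(v) * rng.normal(size=n)

# OLS on transformed data: y* = W^{-1/2} y, X* = W^{-1/2} X with W = diag(v)
w = 1.0 / np.sqrt(v)
beta_wls, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)

# Same as solving the weighted normal equations X^T W^{-1} X b = X^T W^{-1} y
A = X.T @ (X / v[:, None])
b = X.T @ (y / v)
beta_direct = np.linalg.solve(A, b)
```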

Page 19:

1.2. Model Building

Nonlinear and nonparametric regression

Page 20:

Nonlinear regression

Univariate: Y = f(X) + ε
• f(·) has a structural property: smooth, monotone, convex, ...

Weierstrass approximation theorem: any continuous f on [0,1] can be uniformly approximated by a polynomial.

Polynomial regression:

Y = β0 + β1 X + · · · + βd X^d + ε, where the polynomial part ≈ f(X)

• This is multiple regression with X1 = X, …, Xd = X^d.

Drawback: not suitable for functions with varying degrees of smoothness.
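Since polynomial regression is just multiple regression on the powers of X, it can be fit with ordinary least-squares; a minimal sketch (the target sin(2x) and degree 3 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 200, 3
x = rng.uniform(-1, 1, size=n)
y = np.sin(2 * x) + 0.1 * rng.normal(size=n)   # smooth target f(x) = sin(2x)

# Design matrix with columns 1, x, x^2, x^3: the polynomial fit is plain OLS
X = np.vander(x, d + 1, increasing=True)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

fitted = X @ coef
rmse = np.sqrt(np.mean((fitted - np.sin(2 * x)) ** 2))
```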

Page 21:

Polynomial versus cubic spline regressions

[Figure: scatter plot of data over x ∈ (10, 50) with two overlaid fits — polynomial versus cubic spline fit]
• Red: polynomial regression of degree 3
• Blue: spline regression of degree 3

Page 22:

Spline regression

• Piecewise polynomials of degree d, with continuous derivatives up to order d − 1.
• Knots: {τ_j}_{j=1}^K, where the discontinuities occur.

Example: linear splines with 0 = τ0 < τ1 < τ2 < τ3 = 1
• Linearity on [0, τ1] yields l(x) = β0 + β1 x, x ∈ [0, τ1].
• Linearity on [τ1, τ2] plus continuity at τ1 gives l(x) = β0 + β1 x + β2 (x − τ1)_+, x ∈ [τ1, τ2].
• Linearity on [τ2, 1] plus continuity at τ2 gives l(x) = β0 + β1 x + β2 (x − τ1)_+ + β3 (x − τ2)_+, x ∈ [τ2, 1].

Page 23:

Basis functions for Linear Splines

Basis functions for linear splines on 0 = τ0 < τ1 < τ2 < τ3 = 1:

B0(x) = 1, B1(x) = x, B2(x) = (x − τ1)_+, B3(x) = (x − τ2)_+

Spline regression:

Y = β0 B0(X) + β1 B1(X) + β2 B2(X) + β3 B3(X) + ε, where the sum of basis terms ≈ f(X)

• Multiple regression with X0 = B0(X), X1 = B1(X), X2 = B2(X), X3 = B3(X)

General case: basis {1, x, (x − τ_j)_+, j = 1, …, K}
Nonparametric: when K is large, K_n → ∞
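The linear-spline basis can be assembled into a design matrix and fit by ordinary least-squares; a sketch with illustrative knots (the helper name and target function are assumptions for the example):

```python
import numpy as np

def linear_spline_basis(x, knots):
    """Columns 1, x, and (x - tau_j)_+ for each interior knot tau_j."""
    cols = [np.ones_like(x), x] + [np.maximum(x - t, 0.0) for t in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, size=300)
y = np.abs(x - 0.5) + 0.05 * rng.normal(size=300)   # target with a kink at 0.5

B = linear_spline_basis(x, knots=[0.25, 0.5, 0.75])
coef, *_ = np.linalg.lstsq(B, y, rcond=None)        # plain multiple regression
fitted = B @ coef
rmse = np.sqrt(np.mean((fitted - y) ** 2))
```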

Page 24:

Cubic splines

Piecewise cubic polynomial with continuous first and second derivatives:

c(x) = β0 + β1 x + β2 x² + β3 x³, x ≤ τ1,
c(x) = β0 + β1 x + β2 x² + β3 x³ + β4 (x − τ1)_+³, x ∈ [τ1, τ2],
c(x) = β0 + β1 x + β2 x² + β3 x³ + β4 (x − τ1)_+³ + β5 (x − τ2)_+³, x > τ2.

Basis functions:

B0(x) = 1, B1(x) = x, B2(x) = x², B3(x) = x³, B4(x) = (x − τ1)_+³, B5(x) = (x − τ2)_+³.

• Widely used; • again a multiple regression.

Page 25:

Extension to multiple covariates

• Bivariate quadratic regression model:

Y = β0 + β1 X1 + β2 X2 + β3 X1² + β4 X1X2 + β5 X2² + ε, where X1X2 is the interaction term

• Multivariate quadratic regression:

Y = ∑_{j=1}^p β_j X_j + ∑_{j≤k} β_{jk} X_j X_k + ε

• Multivariate regression with main effects (linear terms) and interactions:

Y = ∑_{j=1}^p β_j X_j + ∑_{j<k} β_{jk} X_j X_k + ε

Page 26:

Multivariate spline regression

Idea: tensor products of univariate basis functions

Drawback: curse of dimensionality, namely, the number of basis functions scales exponentially with p

Remedy: add additional structure to f(·)

Example: additive model

Y = f_1(X_1) + · · · + f_p(X_p) + ε

• Number of basis functions scales linearly with p

Example: bivariate interaction model

Y = ∑_{1≤i≤j≤p} f_{ij}(X_i, X_j) + ε

• Number of basis functions scales quadratically with p

Page 27:

1.3. Ridge Regression

Page 28:

Ridge Regression

Drawbacks of OLS: • requires n > p; • large variance in the presence of collinearity

Remedy: ridge regression (Hoerl and Kennard, 1970)

β̂_λ = (X^T X + λI)^{-1} X^T y

• λ > 0 is a regularization parameter.

Interpretation: penalized least-squares, minimizing ‖y − Xβ‖² + λ‖β‖².
— Setting the gradient to zero gives X^T (Xβ − y) + λβ = 0, whose solution is β̂_λ.

Bayesian estimator: the posterior mode under the prior β ~ N(0, (σ²/λ) I_p).
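Both the closed form and the penalized-least-squares interpretation can be checked numerically (a sketch; λ and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 60, 10
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)
lam = 2.0

# Closed form: beta_lambda = (X^T X + lam I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient of ||y - X b||^2 + lam ||b||^2 vanishes at beta_ridge
grad = 2 * X.T @ (X @ beta_ridge - y) + 2 * lam * beta_ridge
```

The ridge solution also has a smaller ℓ2 norm than OLS, reflecting the shrinkage induced by the penalty.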

Page 29:

Bias-Variance Tradeoff

Smaller variance (due to the prior):

Var(β̂_λ) = σ² (X^T X + λI)^{-1} X^T X (X^T X + λI)^{-1} ≺ Var(β̂).

Larger bias (due to the prior):

E(β̂_λ) − β = (X^T X + λI)^{-1} X^T X β − β = −λ (X^T X + λI)^{-1} β.

Overall error:

MSE(β̂_λ) = E‖β̂_λ − β‖² = tr{(X^T X + λI)^{-2} [λ² ββ^T + σ² X^T X]}.

(d/dλ) MSE(β̂_λ)|_{λ=0} < 0 ⟹ there exists λ > 0 that outperforms OLS.

Page 30:

Generalization: `q Penalized Least Squares

ℓ_q penalized least-squares estimate:

min_β ‖y − Xβ‖² + λ‖β‖_q^q, q > 0.

• Known as the bridge estimator (Frank and Friedman, 1993);
• when q = 1, it is the lasso estimator (Tibshirani, 1996);
• the penalty is strictly concave when 0 < q < 1 and convex when q ≥ 1;
• only q = 2 admits a closed-form solution.

Page 31:

Generalized Ridge Regression

Assume that β ~ N(0, Σ). Then

p(β|y, X) ∝ e^{−‖y−Xβ‖²/(2σ²)} e^{−β^T Σ^{-1} β/2}.

The MAP estimate is

β̂_MAP = argmax_β p(β|y, X) = (X^T X + σ² Σ^{-1})^{-1} X^T y.

• Σ can take into account the different scales of the covariates.

Page 32:

Ridge Regression Solution Path

• β̂_λ = (X^T X + λI)^{-1} X^T y as a function of λ.

Efficient computation: let X = UDV^T be the SVD. Then

X^T X = V D U^T U D V^T = V D² V^T,

β̂_λ = V (D² + λI)^{-1} D U^T y = ∑_{j=1}^p [d_j/(d_j² + λ)] (u_j^T y) v_j,

— u_j and v_j are the j-th columns of U and V.

• Very fast to compute for many values {λ_k}.
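With one SVD up front, the whole solution path is cheap to evaluate; a sketch (grid values illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 80, 5
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

U, d, Vt = np.linalg.svd(X, full_matrices=False)   # X = U D V^T
uty = U.T @ y

def ridge_svd(lam):
    # beta_lambda = sum_j d_j / (d_j^2 + lam) * (u_j^T y) v_j
    return Vt.T @ (d / (d ** 2 + lam) * uty)

# Matches the direct formula for every lambda on a grid
lams = [0.1, 1.0, 10.0]
direct = [np.linalg.solve(X.T @ X + l * np.eye(p), X.T @ y) for l in lams]
via_svd = [ridge_svd(l) for l in lams]
```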

Page 33:

Kernel Ridge Regression

Theorem 1.3 [Alternative expression]: β̂_λ = X^T (XX^T + λI)^{-1} y

Prediction at x: ŷ = x^T β̂_λ = x^T X^T (XX^T + λI)^{-1} y.

Note that (XX^T)_{ij} = ⟨x_i, x_j⟩ and x^T X^T = (⟨x, x_1⟩, …, ⟨x, x_n⟩).

• Prediction depends only on pairwise inner products;
• generalizing to other similarity measures K(·, ·) is called the kernel trick.
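Theorem 1.3 rests on the matrix identity X^T (XX^T + λI)^{-1} = (X^T X + λI)^{-1} X^T, which is easy to confirm numerically (a sketch):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 30, 6
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
lam = 0.5

primal = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)   # p x p system
dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)     # n x n system
```

When p > n, the dual n×n system is the cheaper of the two.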

Page 34:

Kernel regression

Kernel: K(·, ·) such that the matrix (K(x_i, x_j))_{n×n} is PSD for any {x_i}_{i=1}^n.

Commonly used kernels:
• linear: ⟨u, v⟩
• polynomial: (1 + ⟨u, v⟩)^d, d = 2, 3, …
• Gaussian: e^{−γ‖u−v‖²}
• Laplacian: e^{−γ‖u−v‖}

• Fit the model y = ∑_{j=1}^n α_j K(x, x_j) + ε with basis {K(·, x_j)}_{j=1}^n by

min_{α∈R^n} {‖y − Kα‖² + λ α^T K α}

Page 35:

Kernel ridge regression

With K = (K(x_i, x_j)) ∈ R^{n×n}, the prediction at a testing point x is

ŷ = (K(x, x_1), …, K(x, x_n)) (K + λI)^{-1} y,

• i.e., ŷ = f̂(x) = ∑_{i=1}^n α̂_i K(x, x_i) with weights α̂ = (K + λI)^{-1} y, where each K(x, x_i) pairs the testing point with a training point;
• tune the parameter λ to minimize prediction errors.
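A compact kernel ridge regression with the Gaussian kernel (a sketch; the helper name, γ, and λ are illustrative choices):

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    # K(u, v) = exp(-gamma * ||u - v||^2), computed for all pairs of rows
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(9)
x_train = rng.uniform(-3, 3, size=(100, 1))
y_train = np.sin(x_train[:, 0]) + 0.1 * rng.normal(size=100)

lam = 0.1
K = gaussian_kernel(x_train, x_train)
alpha = np.linalg.solve(K + lam * np.eye(100), y_train)  # alpha = (K + lam I)^{-1} y

x_test = np.array([[0.0], [1.5]])
y_pred = gaussian_kernel(x_test, x_train) @ alpha        # sum_i alpha_i K(x, x_i)
```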

Page 36:

1.4. Reproducing Kernel Hilbert Spaces

Justification of the kernel trick by the representer theorem

Page 37:

Hilbert Space

Hilbert space: a complete space endowed with an inner product.

• X = a set; H = a space of functions on X with inner product ⟨·, ·⟩_H.
• The evaluation functional at x is L_x(f) = f(x), ∀f ∈ H.

Reproducing kernel Hilbert space (RKHS): for any x, ∃ C_x such that

|L_x(f)| = |f(x)| ≤ C_x ‖f‖_H, ∀f ∈ H.

• L_x is then continuous and linear. By the Riesz representation theorem, ∃ K_x ∈ H such that ⟨K_x, f⟩_H = L_x(f), ∀f ∈ H.

Page 38:

Reproducing Kernel

Reproducing kernel: K(x, x′) = ⟨K_x, K_{x′}⟩_H.

• Symmetry: K(x, x′) = K(x′, x);
• PSD: the matrix (K(x_i, x_j))_{n×n} is PSD for all {x_i}_{i=1}^n.

• K admits the eigen-decomposition

K(x, x′) = ∑_{j=1}^∞ γ_j ψ_j(x) ψ_j(x′), ∑_{j=1}^∞ γ_j² < ∞,

— {γ_j}_{j=1}^∞ are the eigenvalues and {ψ_j}_{j=1}^∞ the eigenfunctions.

Page 39:

Kernel Ridge Regression

• For any g, g′ ∈ H_K with g = ∑_{j=1}^∞ β_j ψ_j and g′ = ∑_{j=1}^∞ β′_j ψ_j, define

⟨g, g′⟩_{H_K} = ∑_{j=1}^∞ γ_j^{-1} β_j β′_j; ‖g‖_{H_K} = √⟨g, g⟩_{H_K}.

Reproducibility: ⟨K(·, x′), g⟩_{H_K} = ∑_j γ_j^{-1} {γ_j ψ_j(x′)} β_j = g(x′).

Nonparametric regression: y_i = f(x_i) + ε_i with f ∈ H_K, i ∈ [n]. Find

f̂ = argmin_{f∈H_K} {∑_{i=1}^n [y_i − f(x_i)]² + λ‖f‖²_{H_K}}, λ > 0.

Writing f = ∑_{j=1}^∞ β_j ψ_j and setting β̃_j = β_j/√γ_j and ψ̃_j = √γ_j ψ_j, we get

min_{{β̃_j}_{j=1}^∞} {∑_{i=1}^n [y_i − ∑_{j=1}^∞ β̃_j ψ̃_j(x_i)]² + λ ∑_{j=1}^∞ β̃_j²}.

Page 40:

Representer Theorem

Theorem 1.4
For a loss L(y, f(x)) and an increasing function P_λ(·), let

f̂ = argmin_{f∈H_K} {∑_{i=1}^n L(y_i, f(x_i)) + P_λ(‖f‖_{H_K})}, λ > 0.

Then (homework)

f̂ = ∑_{i=1}^n α̂_i K(·, x_i),

where α̂ = (α̂_1, …, α̂_n)^T solves

min_α {∑_{i=1}^n L(y_i, ∑_{j=1}^n α_j K(x_i, x_j)) + P_λ(√(α^T K α))}.

• An infinite-dimensional regression problem;
• a finite-dimensional representation of the solution.

Page 41:

Applications of Representer Theorem

Applying the representer theorem to kernel ridge regression

f̂ = argmin_{f∈H_K} {∑_{i=1}^n (y_i − f(x_i))² + λ‖f‖²_{H_K}},

we must have f̂ = ∑_{i=1}^n α̂_i K(·, x_i) with α̂ ∈ R^n solving

min_{α∈R^n} {‖y − Kα‖² + λ α^T K α}.

It is easily seen that

α̂ = (K + λI)^{-1} y.

Page 42:

1.5 Cross-Validation

Page 43:

k-fold Cross-Validation

Purpose: to estimate the prediction error of a procedure.

k-fold cross-validation (CV)
• Divide the data randomly and evenly into k subsets;
• use one subset as the testing set and the remaining k − 1 subsets as the training set to compute a testing error;
• repeat for each of the k subsets and average the testing errors.

Commonly used: k = 5 or 10.

Leave-one-out CV: k = n, with

CV = (1/n) ∑_{i=1}^n [y_i − f̂_{−i}(x_i)]²,

where f̂_{−i}(x_i) is the predicted value based on {(x_j, y_j)}_{j≠i}.
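The k-fold recipe above, sketched for ridge regression (the helper name and parameter values are illustrative):

```python
import numpy as np

def kfold_cv_ridge(X, y, lam, k=5, seed=0):
    """Average test MSE of ridge regression over k random folds."""
    n, p = X.shape
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    errs = []
    for f in folds:
        train = np.setdiff1d(idx, f)        # remaining k-1 folds
        b = np.linalg.solve(X[train].T @ X[train] + lam * np.eye(p),
                            X[train].T @ y[train])
        errs.append(np.mean((y[f] - X[f] @ b) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(10)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + rng.normal(size=100)
cv_err = kfold_cv_ridge(X, y, lam=1.0)
```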

Page 44:

Linear smoother

• ŷ = Sy for data {(x_i, y_i)}_{i=1}^n, where S depends only on X.

A smoother is self-stable if f̃(x) = f̂(x), where f̃ is the function estimated from the augmented data {(x_i, y_i)}_{i=1}^n ∪ {(x, f̂(x))} and f̂ is based on {(x_i, y_i)}_{i=1}^n alone.

Theorem 1.5. For a self-stable linear smoother ŷ = Sy,

y_i − f̂_{−i}(x_i) = (y_i − ŷ_i)/(1 − S_ii), ∀i ∈ [n], so CV = (1/n) ∑_{i=1}^n [(y_i − ŷ_i)/(1 − S_ii)]².
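Theorem 1.5 can be verified numerically for ordinary least-squares, whose hat matrix S = X(X^T X)^{-1} X^T is a self-stable linear smoother (a sketch comparing the shortcut against brute-force refitting):

```python
import numpy as np

rng = np.random.default_rng(11)
n, p = 40, 3
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

S = X @ np.linalg.solve(X.T @ X, X.T)       # hat matrix
y_hat = S @ y
shortcut = (y - y_hat) / (1 - np.diag(S))   # (y_i - yhat_i) / (1 - S_ii)

# Brute force: refit n times, leaving out one observation each time
loo = np.empty(n)
for i in range(n):
    m = np.ones(n, dtype=bool)
    m[i] = False
    b = np.linalg.solve(X[m].T @ X[m], X[m].T @ y[m])
    loo[i] = y[i] - X[i] @ b
```

The shortcut computes all n leave-one-out residuals from a single fit.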

Page 45:

Generalized Cross-Validation

GCV (Golub et al., 1979): GCV = (1/n) ∑_{i=1}^n (y_i − ŷ_i)² / [1 − tr(S)/n]².

• tr(S) is called the effective degrees of freedom.

Self-stable method                     S                          tr(S)
Multiple linear regression             X(X^T X)^{-1} X^T          p
Ridge regression                       X(X^T X + λI)^{-1} X^T     ∑_{j=1}^p d_j²/(d_j² + λ)
Kernel ridge regression in RKHS        K(K + λI)^{-1}             ∑_{j=1}^n γ_j/(γ_j + λ)

• {d_j} and {γ_j} are the singular values of X and K, respectively.

Use GCV to choose λ by minimizing

GCV(λ) = (1/n) ‖(I − S_λ)y‖² / [1 − tr(S_λ)/n]².
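Choosing λ by GCV for ridge regression, using the singular values of X for tr(S_λ) (a sketch; grid and dimensions illustrative):

```python
import numpy as np

rng = np.random.default_rng(12)
n, p = 100, 8
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

d = np.linalg.svd(X, compute_uv=False)      # singular values of X

def gcv(lam):
    # GCV(lam) = (1/n) ||(I - S_lam) y||^2 / [1 - tr(S_lam)/n]^2
    b = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    rss = np.sum((y - X @ b) ** 2) / n
    df = np.sum(d ** 2 / (d ** 2 + lam))    # tr(S_lam) via singular values
    return rss / (1 - df / n) ** 2

grid = np.logspace(-3, 3, 25)
lam_best = grid[np.argmin([gcv(l) for l in grid])]
```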

Page 46:

Bias–variance decomposition

Double expectation: EZ = E{E(Z|X)}, for any X.

Best prediction: E(Y|X) = argmin_f E(Y − f(X))².

Bias–variance in prediction: letting f*(X) = E(Y|X),

E(Y − f̂(X))² = E(Y − f*(X))² + E(f*(X) − f̂(X))²,

where the first term is the variance E σ²(X) and the second is the (squared) bias.

Bias–variance in estimation: letting f̄(x) = E f̂_n(x),

E(f̂_n(X) − f*(X))² = E(f̂_n(X) − f̄(X))² + E(f̄(X) − f*(X))²,

where the first term is the variance and the second the squared bias.

• The variance is small when n is large, and big when the number of parameters is big.
• The bias is small when the model is complex.
