Statistical Foundations of Data Science
Jianqing Fan
Princeton University
http://www.princeton.edu/~jqfan
Jianqing Fan (Princeton University) ORF 525, S19: Statistical Foundations of Data Sciences 1 / 46
Big Data are ubiquitous

Big Data arise in the Internet, Business, Finance, Government, Medicine, Engineering, Science, Digital Humanities, and Biological Science.

Global data volume: 2003: 5 EB; 2010: 1.2 ZB; 2012: 2.7 ZB; 2015: 8 ZB; 2020: 40 ZB.

• Volume • Velocity • Variety

"There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days" — Eric Schmidt, CEO of Google
Deep Impact of Data Tsunami
System: storage, communication, computation architectures
Analysis: statistics, computation, optimization, privacy
Data Science pipeline: Acquisition & Storage → Computing → Analysis → Applications
Big Data =⇒ Smart Data
What can big data do?
Holds great promise for understanding
• Heterogeneity: personalized medicine or services
• Commonality: in the presence of large variations (noise)
from large pools of variables, factors, genes, environments and their interactions, as well as latent factors.

Aims of Data Science:
• Prediction: to construct as effective a method as possible to predict future observations. (correlation)
• Inference and Prediction: to gain insight into the relationship between features and response for scientific purposes, and to construct an improved prediction method. (causation)
Common Features and Techniques
Common Features of Big Data:
• Heterogeneity, endogeneity
• Dependence, heavy tails
• Missing data, measurement errors, spurious correlation
• Survivor biases, sampling biases

Common Techniques for Data Science:
• Statistical techniques: MLE, least-squares, M-estimation
• Regression: parametric, nonparametric, sparse
• Principal component analysis: supervised, unsupervised
1. Multiple and Nonparametric Regression

1.1. Theory of Least-Squares
1.2. Model Building
1.3. Ridge Regression
1.4. Regression in RKHS
1.5. Cross-Validation
1.1. Theory of Least-Squares
Read materials and R implementations here:
http://orfe.princeton.edu/%7Ejqfan/fan/classes/245/chap11.pdf
Purpose of multiple regression
• Study association between dependent & independent variables
• Screen irrelevant variables and select useful ones
• Prediction

Example: Hong Kong environmental dataset
Interest: association between levels of pollutants and the number of daily hospital admissions for circulatory and respiratory problems.

Response Y = daily number of hospital admissions
Covariates:
▸ level of the pollutant Sulphur Dioxide, X1
▸ level of the pollutant Nitrogen Dioxide, X2
▸ level of respirable suspended particles, X3
▸ · · ·
Multiple linear regression model
Y = β1X1 + β2X2 + · · ·+ βpXp + ε
Y : response / dependent variable
Xj ’s: explanatory / independent variables or covariates
ε: random error not explained / predicted by covariates
include intercept by setting X1 = 1
[Figure: the regression line y = β0 + β1x, with mean responses β0 + β1x1, β0 + β1x2, β0 + β1x3 at covariate values x1, x2, x3]
Method of least-squares

Data: {(xi1, xi2, · · · , xip, yi)}1≤i≤n

Model: yi = ∑_{j=1}^p xij βj + εi

Method of Least-Squares:
minimize_{β∈R^p} RSS(β) := ∑_{i=1}^n (yi − ∑_{j=1}^p xij βj)²

• RSS stands for residual sum-of-squares
• When εi i.i.d. ∼ N(0, σ²), the least-squares estimator is the MLE
Regression in matrix notation
y = (y1, · · · , yn)^T,  X = (xij)_{1≤i≤n, 1≤j≤p},  β = (β1, · · · , βp)^T,  ε = (ε1, · · · , εn)^T
Model becomes
y = Xβ + ε
RSS becomes
RSS(β) = ‖y−Xβ‖2
Closed-form solution
Least-squares: Minimize wrt β ∈ Rp
‖y−Xβ‖2 = (y−Xβ)T (y−Xβ)
Setting gradients to zero yields normal equations:
XT y = XT Xβ
Least-squares estimator (assuming X has full column rank):
β̂ = (X^T X)^{-1} X^T y
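As a numerical sketch of the closed-form solution (the simulated data, true coefficients, noise level, and seed are illustrative assumptions), the normal equations can be solved directly and checked against a standard solver:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative simulated data: n = 50 observations, intercept plus 2 covariates
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Normal equations X^T y = X^T X beta  =>  beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq minimizes the same RSS, but more stably (via SVD)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With full column rank both computations agree; in practice `lstsq` (or a QR factorization) is preferred to explicitly forming X^T X.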
Geometric interpretation of least-squares

Fitted value: ŷ = Xβ̂ = Py, where P := X(X^T X)^{-1} X^T ∈ R^{n×n}

Theorem 1.1 [Properties of the projection matrix]
• Pxj = xj, j = 1, 2, · · · , p
• P² = P, or equivalently P(In − P) = 0

[Figure: y projected onto the plane spanned by X1 and X2, giving fitted value ŷ and residual y − ŷ]

P projects the response vector y onto the linear space spanned by the columns of X.
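The projection properties are easy to verify numerically; a minimal sketch with simulated data (dimensions and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3
X = rng.normal(size=(n, p))            # full column rank almost surely
y = rng.normal(size=n)

# Projection matrix P = X (X^T X)^{-1} X^T onto the column space of X
P = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = P @ y                          # fitted values: the projection of y

# P x_j = x_j for every column of X, P^2 = P, and the residual
# y - y_hat is orthogonal to the column space of X
```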
Statistical properties of least-squares estimator

Assumptions:
• Exogeneity: E(ε|X) = 0;
• Homoscedasticity: var(ε|X) = σ² In.

Statistical Properties:
• Unbiasedness: E(β̂|X) = β
• Variance: var(β̂|X) = σ² (X^T X)^{-1}
(the conditioning on X is often dropped from the notation)

Recall cov(U, V) = E(U − µu)(V − µv)^T and var(U) = cov(U, U);
cov(a^T U, b^T V) = a^T cov(U, V) b,  var(a^T U) = a^T var(U) a.
Gauss-Markov Theorem

How large is the variance, compared with other estimators?

Theorem 1.2 [Gauss-Markov Theorem]
The LSE β̂ is the best linear unbiased estimator (BLUE):
• a^T β̂ is a linear unbiased estimator of the parameter θ = a^T β;
• for any linear unbiased estimator b^T y of θ,
var(b^T y|X) ≥ var(a^T β̂|X).

Estimation of σ²: σ̂² = RSS/(n − p) = ‖y − Xβ̂‖²/(n − p)

σ̂² is an unbiased estimator of σ².
Statistical inference

Additional assumption: ε ∼ N(0, σ² In)

Under fixed design or conditioning on X,
β̂ = β + (X^T X)^{-1} X^T ε  =⇒  β̂ ∼ N(β, σ² (X^T X)^{-1})

• β̂j ∼ N(βj, vj σ²), where vj is the j-th diagonal element of (X^T X)^{-1}
• (n − p)σ̂² ∼ σ² χ²_{n−p}, and σ̂² is independent of β̂
• 1 − α CI for βj: β̂j ± t_{n−p}(1 − α/2) √vj σ̂ (homework)
• H0: βj = 0: test statistic tj = β̂j/(√vj σ̂) ∼ t_{n−p} under H0
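These formulas translate directly into code; a sketch on simulated data (the design, true coefficients, and seed are illustrative, with the third coefficient set to zero so its t-statistic should be small):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([0.5, 2.0, 0.0])      # last coefficient truly zero
y = X @ beta_true + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)       # unbiased estimate of sigma^2
v = np.diag(XtX_inv)                       # v_j = j-th diagonal of (X^T X)^{-1}

se = np.sqrt(v * sigma2_hat)               # standard errors sqrt(v_j) * sigma_hat
t_stats = beta_hat / se                    # t_j ~ t_{n-p} under H0: beta_j = 0
# a 1 - alpha CI for beta_j is beta_hat[j] +/- t_{n-p}(1 - alpha/2) * se[j]
```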
Non-normal error

Appeal to asymptotic theory:

√n(β̂ − β) = (n^{-1} X^T X)^{-1} · n^{-1/2} X^T ε,

where n^{-1} X^T X = n^{-1} ∑_{i=1}^n Xi Xi^T → Σ by the LLN, and
n^{-1/2} X^T ε = n^{-1/2} ∑_{i=1}^n Xi εi ⇝ N(0, σ²Σ) by the CLT.

Using Slutsky's theorem (homework),
√n(β̂ − β) →_d N(0, σ² Σ^{-1}),  or approximately  β̂ ∼ N(β, σ² (X^T X)^{-1}).
Correlated errors

y = Xβ + ε, where var(ε|X) = σ² W

Transform the data as follows:
y* = W^{-1/2} y,  X* = W^{-1/2} X,  ε* = W^{-1/2} ε.

Generalized Least-Squares:
min_{β∈R^p} ‖y* − X*β‖² = (y − Xβ)^T W^{-1} (y − Xβ)

Heteroscedastic errors: W = diag(v1, · · · , vn)
Weighted Least-Squares: min_β ∑_{i=1}^n (yi − Xi^T β)²/vi.
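Weighted least squares is just OLS after the W^{-1/2} transformation; a sketch with known variance weights v_i (the data-generating model and weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, -2.0])
v = rng.uniform(0.5, 4.0, size=n)          # assumed known: var(eps_i) proportional to v_i
y = X @ beta_true + np.sqrt(v) * rng.normal(size=n)

# Transform: y* = W^{-1/2} y, X* = W^{-1/2} X with W = diag(v_1, ..., v_n),
# then run ordinary least squares on the transformed data
X_star = X / np.sqrt(v)[:, None]
y_star = y / np.sqrt(v)
beta_wls, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
```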
1.2. Model Building
Nonlinear and nonparametric regression
Nonlinear regression

Univariate: Y = f(X) + ε,
where f(·) has a structural property: smooth, monotone, convex, ...

Weierstrass theorem: any continuous f(X) on [0,1] can be uniformly approximated by a polynomial function.

Polynomial regression:
Y = β0 + β1 X + · · · + βd X^d + ε,  with β0 + β1 X + · · · + βd X^d ≈ f(X)

• Multiple regression with X1 = X, · · · , Xd = X^d
Drawback: not suitable for functions with varying degrees of smoothness
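Polynomial regression is literally multiple regression on the powers of X; a sketch on simulated data (the target sin(2x), the noise level, and the seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 100, 3
x = rng.uniform(-1, 1, size=n)
f = lambda t: np.sin(2 * t)                      # smooth target, illustrative
y = f(x) + 0.1 * rng.normal(size=n)

# Design matrix with columns 1, x, x^2, x^3: multiple regression on powers of x
X = np.vander(x, d + 1, increasing=True)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

x0 = np.array([0.0, 0.5])
pred = np.vander(x0, d + 1, increasing=True) @ beta_hat
```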
Polynomial versus cubic spline regressions

[Figure: polynomial versus cubic spline fit on simulated data]
• Red: polynomial regression with degree 3
• Blue: spline regression with degree 3
Spline regression

• Piecewise polynomials of degree d, with continuous derivatives up to order d − 1.
• Knots {τj}_{j=1}^K: the points where discontinuity (of the d-th derivative) occurs.

Example: linear splines with 0 = τ0 < τ1 < τ2 < τ3 = 1
• Linearity on [0, τ1] yields l(x) = β0 + β1 x, x ∈ [0, τ1].
• Linearity on [τ1, τ2] + continuity at τ1 gives
l(x) = β0 + β1 x + β2 (x − τ1)+, x ∈ [τ1, τ2].
• Linearity on [τ2, 1] + continuity at τ2 gives
l(x) = β0 + β1 x + β2 (x − τ1)+ + β3 (x − τ2)+, x ∈ [τ2, 1].
Basis functions for Linear Splines

Basis functions for linear splines on 0 = τ0 < τ1 < τ2 < τ3 = 1:
B0(x) = 1, B1(x) = x, B2(x) = (x − τ1)+, B3(x) = (x − τ2)+

Spline regression:
Y = β0 B0(X) + β1 B1(X) + β2 B2(X) + β3 B3(X) + ε,  with the sum ≈ f(X)

• Multiple regression with X0 = B0(X), X1 = B1(X), X2 = B2(X), X3 = B3(X)

General case: {1, x, (x − τj)+, j = 1, · · · , K}
Nonparametric: K = Kn is large, with Kn → ∞
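The truncated power basis makes spline fitting another multiple regression; a sketch with illustrative knots and a piecewise-linear target (all choices below are assumptions for the example):

```python
import numpy as np

def linear_spline_basis(x, knots):
    """Truncated power basis {1, x, (x - tau_j)_+, j = 1..K} for linear splines."""
    cols = [np.ones_like(x), x] + [np.maximum(x - t, 0.0) for t in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(4)
n = 300
x = rng.uniform(0, 1, size=n)
f = lambda t: 1.0 + 2.0 * t - 3.0 * np.maximum(t - 0.5, 0.0)  # kink at 0.5
y = f(x) + 0.05 * rng.normal(size=n)

knots = [0.25, 0.5, 0.75]                       # illustrative knot placement
B = linear_spline_basis(x, knots)
coef, *_ = np.linalg.lstsq(B, y, rcond=None)    # ordinary least squares on the basis

x0 = np.array([0.1, 0.6])
fit = linear_spline_basis(x0, knots) @ coef
```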
Cubic splines

Piecewise cubic polynomial with continuous 1st and 2nd derivatives:
c(x) = β0 + β1 x + β2 x² + β3 x³,  x ≤ τ1,
c(x) = β0 + β1 x + β2 x² + β3 x³ + β4 (x − τ1)+^3,  x ∈ [τ1, τ2],
c(x) = β0 + β1 x + β2 x² + β3 x³ + β4 (x − τ1)+^3 + β5 (x − τ2)+^3,  x > τ2.

Basis functions:
B0(x) = 1, B1(x) = x, B2(x) = x², B3(x) = x³,
B4(x) = (x − τ1)+^3, B5(x) = (x − τ2)+^3.

• Widely used  • A multiple regression
Extension to multiple covariates

Bivariate quadratic regression model:
Y = β0 + β1 X1 + β2 X2 + β3 X1² + β4 X1X2 + β5 X2² + ε,
where X1X2 is the interaction term.

Multivariate quadratic regression:
Y = ∑_{j=1}^p βj Xj + ∑_{j≤k} βjk Xj Xk + ε

Multivariate regression with main effects (linear terms) and interactions:
Y = ∑_{j=1}^p βj Xj + ∑_{j<k} βjk Xj Xk + ε
Multivariate spline regression

Idea: tensor products of univariate basis functions
Drawback: curse of dimensionality, namely, the number of basis functions scales exponentially with p
Remedy: add additional structure to f(·)

Example: additive model
Y = f1(X1) + · · · + fp(Xp) + ε
• Number of basis functions scales linearly with p

Example: bivariate interaction model
Y = ∑_{1≤i≤j≤p} fij(Xi, Xj) + ε
• Number of basis functions scales quadratically with p
1.3. Ridge Regression
Ridge Regression

Drawbacks of OLS: • requires n > p;  • large variance under collinearity
Remedy: ridge regression (Hoerl and Kennard, 1970)

β̂λ = (X^T X + λI)^{-1} X^T y

• λ > 0 is a regularization parameter.

Interpretation: penalized LS, minimizing ‖y − Xβ‖² + λ‖β‖².
— Setting the gradient to zero gives X^T (Xβ − y) + λβ = 0.

Bayesian estimator: posterior mode under the prior β ∼ N(0, (σ²/λ) Ip).
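A minimal sketch of the ridge estimator in its closed form (the simulated data and λ values are illustrative assumptions):

```python
import numpy as np

def ridge(X, y, lam):
    """Minimizer of ||y - X beta||^2 + lam ||beta||^2:
    beta_lam = (X^T X + lam I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(5)
n, p = 50, 5
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(size=n)

b_ols = ridge(X, y, 0.0)        # lam = 0 recovers ordinary least squares
b_ridge = ridge(X, y, 10.0)     # lam > 0 shrinks the coefficients toward 0
```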
Bias-Variance Tradeoff

Smaller variance (due to the prior):
Var(β̂λ) = σ² (X^T X + λI)^{-1} X^T X (X^T X + λI)^{-1} ≺ Var(β̂).

Larger bias (due to the prior):
E(β̂λ) − β = (X^T X + λI)^{-1} X^T Xβ − β = −λ(X^T X + λI)^{-1} β.

Overall error:
MSE(β̂λ) = E‖β̂λ − β‖² = tr{(X^T X + λI)^{-2} [λ² ββ^T + σ² X^T X]}.

(d/dλ) MSE(β̂λ)|_{λ=0} < 0  ⇒  ∃ λ > 0 that outperforms OLS.
Generalization: ℓq Penalized Least Squares

ℓq penalized least-squares estimate:
min_β ‖y − Xβ‖² + λ‖β‖_q^q,  q > 0.

• Known as the Bridge estimator (Frank and Friedman, 1993);
• When q = 1, called the Lasso estimator (Tibshirani, 1996);
• The penalty is strictly concave in |βj| when 0 < q < 1 and convex when q ≥ 1;
• Only q = 2 admits a closed-form solution.
Generalized Ridge Regression

Assume that β ∼ N(0, Σ). Then
p(β|y, X) ∝ e^{−‖y−Xβ‖²/(2σ²)} e^{−β^T Σ^{-1} β/2}.

The MAP estimate is
β̂_MAP = argmax_β p(β|y, X) = (X^T X + σ² Σ^{-1})^{-1} X^T y.

• Σ can take into account the different scales of the covariates.
Ridge Regression Solution Path

β̂λ = (X^T X + λI)^{-1} X^T y as a function of λ.

Efficient computation: let X = UDV^T be the SVD.
• X^T X = VDU^T UDV^T = VD²V^T;
• β̂λ = V(D² + λI)^{-1} D U^T y = ∑_{j=1}^p (dj/(dj² + λ)) (uj^T y) vj,
— uj and vj are the j-th columns of U and V.

• Very fast to compute for many values of {λk}.
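The SVD identity yields the whole solution path from a single factorization; a sketch (the data and the grid of λ values are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 40, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# One SVD of X serves every lambda on the grid
U, d, Vt = np.linalg.svd(X, full_matrices=False)
Uty = U.T @ y

def ridge_path(lams):
    # beta_lam = sum_j d_j / (d_j^2 + lam) * (u_j^T y) * v_j
    return np.array([Vt.T @ (d / (d**2 + lam) * Uty) for lam in lams])

lams = [0.1, 1.0, 10.0]
path = ridge_path(lams)
```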
Kernel Ridge Regression

Theorem 1.3. Alternative expression: β̂λ = X^T (XX^T + λI)^{-1} y

Prediction at x is ŷ = x^T β̂λ = x^T X^T (XX^T + λI)^{-1} y.

Note that (XX^T)ij = ⟨xi, xj⟩ and x^T X^T = (⟨x, x1⟩, · · · , ⟨x, xn⟩).
• Prediction depends only on pairwise inner products;
• Generalizes to other similarity measures K(·, ·): the kernel trick.

[Figure: kernel values as similarity scores between pairs of example inputs, e.g. K = +1.0 for a similar pair and K = −1.0 for a dissimilar pair]
Kernel regression

Kernel: K(·, ·) such that the matrix (K(xi, xj))_{n×n} is PSD for any {xi}_{i=1}^n.

Commonly used kernels:
• linear: ⟨u, v⟩
• polynomial: (1 + ⟨u, v⟩)^d, d = 2, 3, · · ·
• Gaussian: e^{−γ‖u−v‖²}
• Laplacian: e^{−γ‖u−v‖}

Fit the model y = ∑_{j=1}^n αj K(x, xj) + ε with basis {K(·, xj)}_{j=1}^n by
min_{α∈R^n} {‖y − Kα‖² + λ α^T K α}
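A sketch of this fit with a Gaussian kernel (the bandwidth γ, the value of λ, and the sin target are illustrative assumptions):

```python
import numpy as np

def gauss_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix: K_ij = exp(-gamma * ||a_i - b_j||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(7)
n = 100
x = rng.uniform(-2, 2, size=(n, 1))
y = np.sin(x[:, 0]) + 0.1 * rng.normal(size=n)

lam = 0.1
K = gauss_kernel(x, x)
# Minimizing ||y - K alpha||^2 + lam alpha^T K alpha gives alpha = (K + lam I)^{-1} y
alpha = np.linalg.solve(K + lam * np.eye(n), y)

x_new = np.array([[0.5]])
pred = gauss_kernel(x_new, x) @ alpha      # prediction sum_j alpha_j K(x, x_j)
```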
Kernel ridge regression

With K = (K(xi, xj)) ∈ R^{n×n}, the prediction at a testing point x from the training points {xi} is
ŷ = (K(x, x1), · · · , K(x, xn))(K + λI)^{-1} y,

• ŷ = f̂(x) = ∑_{i=1}^n αi K(x, xi), with weights α = (K + λI)^{-1} y;
• Tune the parameter λ to minimize prediction errors.
1.4 Reproducing Kernel Hilbert Spaces

Justification of the Kernel Trick by the Representer Theorem
Hilbert Space

Hilbert space: a (complete) space endowed with an inner product.

Let X be a set and H a space of functions on X with inner product ⟨·, ·⟩H.

The evaluation functional at x is Lx(f) = f(x), ∀f ∈ H.

Reproducing kernel Hilbert space (RKHS): for any x, ∃ Cx s.t.
|Lx(f)| = |f(x)| ≤ Cx ‖f‖H, ∀f ∈ H.

Then Lx is a continuous linear functional. By the Riesz representation theorem,
∃ Kx ∈ H s.t. ⟨Kx, f⟩H = Lx(f), ∀f ∈ H.
Reproducing Kernel

Reproducing kernel: K(x, x′) = ⟨Kx, Kx′⟩H.
• Symmetry: K(x, x′) = K(x′, x);
• PSD: the matrix (K(xi, xj))_{n×n} is PSD for all {xi}_{i=1}^n.

K admits the eigen-decomposition
K(x, x′) = ∑_{j=1}^∞ γj ψj(x) ψj(x′),  with ∑_{j=1}^∞ γj² < ∞,
— {γj}_{j=1}^∞ are the eigenvalues, and {ψj}_{j=1}^∞ are the eigen-functions.
Kernel Ridge Regression

For any g, g′ ∈ HK with g = ∑_{j=1}^∞ βj ψj and g′ = ∑_{j=1}^∞ β′j ψj, define
⟨g, g′⟩_{HK} = ∑_{j=1}^∞ γj^{-1} βj β′j;  ‖g‖_{HK} = √⟨g, g⟩_{HK}.

Reproducibility: ⟨K(·, x′), g⟩_{HK} = ∑_j γj^{-1} {γj ψj(x′)} βj = g(x′).

Nonparametric regression: yi = f(xi) + εi with f ∈ HK, i ∈ [n]. Find
f̂ = argmin_{f∈HK} { ∑_{i=1}^n [yi − f(xi)]² + λ‖f‖²_{HK} },  λ > 0.

Writing f = ∑_{j=1}^∞ βj ψj and setting β̃j = βj/√γj and ψ̃j = √γj ψj, we get
min_{{β̃j}} { ∑_{i=1}^n [yi − ∑_{j=1}^∞ β̃j ψ̃j(xi)]² + λ ∑_{j=1}^∞ β̃j² }.
Representer Theorem

Theorem 1.4
For a loss L(y, f(x)) and an increasing function Pλ(·), let
f̂ = argmin_{f∈HK} { ∑_{i=1}^n L(yi, f(xi)) + Pλ(‖f‖_{HK}) },  λ > 0.
Then (homework)
f̂ = ∑_{i=1}^n α̂i K(·, xi),
where α̂ = (α̂1, · · · , α̂n)^T solves
min_α { ∑_{i=1}^n L(yi, ∑_{j=1}^n αj K(xi, xj)) + Pλ(√(α^T K α)) }.

• An infinite-dimensional regression problem;
• A finite-dimensional representation for the solution.
Applications of Representer Theorem

Apply the representer theorem to kernel ridge regression:
f̂ = argmin_{f∈HK} { ∑_{i=1}^n (yi − f(xi))² + λ‖f‖²_{HK} }.

We must have f̂ = ∑_{i=1}^n α̂i K(·, xi), with α̂ ∈ R^n solving
min_{α∈R^n} {‖y − Kα‖² + λ α^T K α}.

It is easily seen that
α̂ = (K + λI)^{-1} y.
1.5 Cross-Validation
k-fold Cross-Validation

Purpose: to estimate the prediction error of a procedure

k-fold Cross-Validation (CV)
• Divide the data randomly and evenly into k subsets;
• Use one subset as the testing set and the remaining ones as the training set to compute a testing error;
• Repeat for each of the k subsets and average the testing errors.

Commonly used: k = 5 or 10

Leave-one-out CV: k = n, with CV = (1/n) ∑_{i=1}^n [yi − f̂^{−i}(xi)]²,
where f̂^{−i}(xi) is the predicted value based on {(xj, yj)}_{j≠i}.
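The steps above can be sketched in a few lines (the OLS model and simulated data are illustrative; any fit/predict pair works):

```python
import numpy as np

def kfold_cv_mse(X, y, fit, predict, k=5, seed=0):
    """Average test mean-squared error over k random, roughly equal folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    errs = []
    for test in np.array_split(idx, k):
        train = np.setdiff1d(idx, test)
        model = fit(X[train], y[train])
        errs.append(np.mean((y[test] - predict(model, X[test])) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(8)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + 0.5 * rng.normal(size=n)

ols_fit = lambda Xt, yt: np.linalg.lstsq(Xt, yt, rcond=None)[0]
ols_pred = lambda b, Xt: Xt @ b
cv_err = kfold_cv_mse(X, y, ols_fit, ols_pred, k=5)
# estimated prediction error; the true noise variance here is 0.25
```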
Linear smoother

ŷ = Sy for data {(xi, yi)}_{i=1}^n, where S depends only on X.

Self-stable: f̂(x) = f̃(x), where f̂ is the function estimated from the data {(xi, yi)}_{i=1}^n and f̃ is estimated from the augmented data {(xi, yi)}_{i=1}^n together with (x, f̂(x)).

Theorem 1.5. For a self-stable linear smoother ŷ = Sy,
yi − f̂^{−i}(xi) = (yi − ŷi)/(1 − Sii), ∀i ∈ [n],  and hence  CV = (1/n) ∑_{i=1}^n ((yi − ŷi)/(1 − Sii))².
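Theorem 1.5 can be checked numerically for the ridge smoother S = X(X^T X + λI)^{-1} X^T, which is self-stable: brute-force leave-one-out refits match the shortcut exactly (the data and λ are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 30, 3
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.3 * rng.normal(size=n)
lam = 1.0

S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)   # ridge smoother matrix
y_hat = S @ y

# Shortcut: CV = (1/n) sum_i ((y_i - yhat_i) / (1 - S_ii))^2
cv_short = np.mean(((y - y_hat) / (1 - np.diag(S))) ** 2)

# Brute force: refit with observation i held out, then predict at x_i
cv_brute = 0.0
for i in range(n):
    keep = np.arange(n) != i
    b = np.linalg.solve(X[keep].T @ X[keep] + lam * np.eye(p), X[keep].T @ y[keep])
    cv_brute += (y[i] - X[i] @ b) ** 2
cv_brute /= n
```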
Generalized Cross-Validation

GCV (Golub et al., 1979):
GCV = (1/n) ∑_{i=1}^n (yi − ŷi)² / [1 − tr(S)/n]².

tr(S) is called the effective degrees of freedom.

| Self-stable method | S | tr(S) |
| --- | --- | --- |
| Multiple linear regression | X(X^T X)^{-1} X^T | p |
| Ridge regression | X(X^T X + λI)^{-1} X^T | ∑_{j=1}^p dj²/(dj² + λ) |
| Kernel ridge regression in RKHS | K(K + λI)^{-1} | ∑_{j=1}^n γj/(γj + λ) |

• {dj} and {γj} are the singular values of X and K.

Use GCV to choose λ by minimizing
GCV(λ) = (1/n) ‖(I − Sλ)y‖² / [1 − tr(Sλ)/n]².
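For ridge regression, GCV can be evaluated on a λ grid using tr(Sλ) = ∑ dj²/(dj² + λ); a sketch (the data and the grid are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(10)
n, p = 60, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

d = np.linalg.svd(X, compute_uv=False)          # singular values of X

def gcv(lam):
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)   # ridge smoother
    resid = y - S @ y
    df = np.sum(d**2 / (d**2 + lam))            # tr(S_lam): effective degrees of freedom
    return np.mean(resid**2) / (1 - df / n) ** 2

lams = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = [gcv(l) for l in lams]
best_lam = lams[int(np.argmin(scores))]         # GCV choice of lambda on this grid
```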
Bias variance decomposition

Double expectation: EZ = E{E(Z|X)}, for any X.

Best prediction: E(Y|X) = argmin_f E(Y − f(X))².

Bias-variance in prediction: letting f*(X) = E(Y|X),
E(Y − f̂(X))² = E(Y − f*(X))² + E(f*(X) − f̂(X))²,
where the first term is the irreducible variance Eσ²(X) and the second is the estimation error.

Bias-variance in estimation: letting f̄(x) = E f̂n(x),
E(f̂n(X) − f*(X))² = E(f̂n(X) − f̄(X))² + E(f̄(X) − f*(X))²,
where the first term is the variance and the second is the squared bias.

• The variance is small when n is large, and big when the number of parameters is big;
• The bias is small when the model is complex.