Page 1

Nonlinear Regression (Part 1)

Christof Seiler

Stanford University, Spring 2016, STATS 205

Page 2

Overview

- Smoothing or estimating curves
  - Density estimation
  - Nonlinear regression
- Rank-based linear regression

Page 3

Curve Estimation

- A curve of interest can be a probability density function f
- In density estimation, we observe X_1, ..., X_n from some unknown cdf F with density f:

  X_1, ..., X_n ∼ f

- The goal is to estimate the density f; a minimal code sketch follows
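The slides contain no code, but the setup is easy to sketch. The sample, the true density, and the use of scipy.stats.gaussian_kde below are illustrative assumptions, not from the slides:

    # Minimal density-estimation sketch (illustrative; the data-generating
    # density and sample size are assumptions, not from the slides).
    import numpy as np
    from scipy.stats import gaussian_kde

    rng = np.random.default_rng(205)
    X = rng.normal(loc=1.0, scale=1.0, size=500)  # X_1, ..., X_n ~ f (here f is N(1,1))

    f_hat = gaussian_kde(X)               # kernel density estimate of f
    grid = np.linspace(-2.0, 4.0, 200)    # evaluation grid, mirroring the plots below
    density = f_hat(grid)                 # estimated density values on the grid
    print(density.max())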

Page 4

Density Estimation

[Figure: density estimate of the rating data; x-axis: rating (−2 to 4), y-axis: density (0.0 to 0.4)]

Page 5

Density Estimation

[Figure: density estimate of the rating data; x-axis: rating (−2 to 4), y-axis: density (0.0 to 0.4)]

Page 6

Nonlinear Regression

- A curve of interest can be a regression function r
- In regression, we observe pairs (x_1, Y_1), ..., (x_n, Y_n) that are related as

  Y_i = r(x_i) + ε_i,   with E(ε_i) = 0

- The goal is to estimate the regression function r; a minimal sketch follows
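As one concrete estimator of r (an assumed choice, not prescribed by the slides), here is a minimal Nadaraya–Watson kernel smoother; the true r, the design points, and the bandwidth h are hypothetical:

    # Minimal Nadaraya-Watson kernel regression sketch (illustrative; the true
    # r, the design, and the bandwidth are assumptions, not from the slides).
    import numpy as np

    def nadaraya_watson(x0, x, y, h):
        """Locally weighted average of y around x0 with a Gaussian kernel."""
        w = np.exp(-0.5 * ((x - x0) / h) ** 2)   # kernel weights K((x_i - x0)/h)
        return np.sum(w * y) / np.sum(w)

    def r_true(x):
        return 0.1 * np.exp(-0.5 * ((x - 12.0) / 2.0) ** 2)  # hypothetical true r

    rng = np.random.default_rng(205)
    x = np.sort(rng.uniform(10.0, 25.0, size=200))       # design points x_i
    y = r_true(x) + rng.normal(scale=0.02, size=x.size)  # Y_i = r(x_i) + eps_i

    r_hat = np.array([nadaraya_watson(x0, x, y, h=1.0) for x0 in x])
    print(np.abs(r_hat - r_true(x)).mean())              # average estimation error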

Page 7

Nonlinear Regression

[Figure: Bone Mineral Density Data; x-axis: Age (10 to 25), y-axis: Change in BMD (0.0 to 0.2); points colored by gender (female, male)]

Page 8

Nonlinear Regression

[Figure: Bone Mineral Density Data; x-axis: Age (10 to 25), y-axis: Change in BMD (−0.1 to 0.2); points colored by gender (female, male)]

Page 9

Nonlinear Regression

[Figure: Bone Mineral Density Data; x-axis: Age (10 to 25), y-axis: Change in BMD (0.0 to 0.2); points colored by gender (female, male)]

Page 10

Nonlinear Regression

[Figure: Bone Mineral Density Data; x-axis: Age (10 to 25), y-axis: Change in BMD (0.0 to 0.2); points colored by gender (female, male)]

Page 11

The Bias–Variance Tradeoff

- Let f̂_n(x) be an estimate of a function f(x)
- Define the squared error loss function as

  Loss = L(f(x), f̂_n(x)) = (f(x) − f̂_n(x))²

- Define the average of this loss as the risk, or mean squared error (MSE):

  MSE = R(f(x), f̂_n(x)) = E(Loss)

- The expectation is taken with respect to f̂_n, which is random
- The MSE can be decomposed into a bias term and a variance term:

  MSE = Bias² + Var

- The decomposition is easy to show

Page 12

The Bias–Variance Tradeoff

- Expand:

  E((f − f̂)²) = E(f² + f̂² − 2f f̂) = E(f²) + E(f̂²) − 2E(f f̂)

- Use Var(X) = E(X²) − E(X)²:

  E((f − f̂)²) = Var(f) + E(f)² + Var(f̂) + E(f̂)² − 2E(f f̂)

- Use E(f) = f and Var(f) = 0, since f is deterministic:

  E((f − f̂)²) = f² + Var(f̂) + E(f̂)² − 2f E(f̂)

- Use (E(f̂) − f)² = f² + E(f̂)² − 2f E(f̂):

  E((f − f̂)²) = (E(f̂) − f)² + Var(f̂) = Bias² + Var
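The identity can also be confirmed numerically. The following sketch checks MSE = Bias² + Var by Monte Carlo for a deliberately biased estimator of a scalar target; the target value, sample size, and shrinkage factor are all illustrative assumptions:

    # Monte Carlo check that MSE = Bias^2 + Var (illustrative sketch; all
    # numbers below are assumptions, not from the slides).
    import numpy as np

    rng = np.random.default_rng(205)
    f_true, n, reps = 2.0, 50, 100_000

    samples = rng.normal(loc=f_true, scale=1.0, size=(reps, n))
    f_hat = 0.9 * samples.mean(axis=1)   # deliberately biased estimator of f_true

    mse = np.mean((f_hat - f_true) ** 2)
    bias = f_hat.mean() - f_true
    var = f_hat.var()
    print(mse, bias**2 + var)            # the two agree up to Monte Carlo error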

Page 13

The Bias–Variance Tradeoff

- This describes the risk at a single point x
- To summarize the risk for density estimation problems, we integrate:

  R(f, f̂_n) = ∫ R(f(x), f̂_n(x)) dx

- For regression problems, we sum over all design points:

  R(r, r̂_n) = ∑_{i=1}^n R(r(x_i), r̂_n(x_i))

Page 14

The Bias–Variance Tradeoff

- Consider the regression model

  Y_i = r(x_i) + ε_i

- Suppose we draw a new observation Y*_i = r(x_i) + ε*_i at each x_i
- If we predict Y*_i with r̂_n(x_i), then the squared prediction error is

  (Y*_i − r̂_n(x_i))² = (r(x_i) + ε*_i − r̂_n(x_i))²

- Define the predictive risk as

  E( (1/n) ∑_{i=1}^n (Y*_i − r̂_n(x_i))² )

Page 15

The Bias–Variance Tradeoff

- Up to a constant, the average risk and the predictive risk are the same:

  E( (1/n) ∑_{i=1}^n (Y*_i − r̂_n(x_i))² ) = R(r, r̂_n) + (1/n) ∑_{i=1}^n E((ε*_i)²)

- In particular, if the error ε_i has variance σ², then

  E( (1/n) ∑_{i=1}^n (Y*_i − r̂_n(x_i))² ) = R(r, r̂_n) + σ²

  (a simulation sketch follows)
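A quick simulation illustrates the relationship. In this sketch R(r, r̂_n) is taken to be the average risk over the design points (one reading of "average risk" above); the true r, the estimator (a polynomial fit), and σ are assumptions:

    # Simulation sketch: predictive risk ~= average risk + sigma^2
    # (true r, estimator, and noise level are assumptions, not from the slides).
    import numpy as np

    rng = np.random.default_rng(205)
    n, sigma, reps = 100, 0.5, 2000
    x = np.linspace(0.0, 1.0, n)
    r = np.sin(2 * np.pi * x)                 # hypothetical true regression function

    avg_risk, pred_risk = 0.0, 0.0
    for _ in range(reps):
        y = r + rng.normal(scale=sigma, size=n)          # training data Y_i
        r_hat = np.polyval(np.polyfit(x, y, deg=5), x)   # some estimator r_hat_n
        y_star = r + rng.normal(scale=sigma, size=n)     # new observations Y*_i
        avg_risk += np.mean((r - r_hat) ** 2) / reps
        pred_risk += np.mean((y_star - r_hat) ** 2) / reps

    print(pred_risk, avg_risk + sigma**2)     # approximately equal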

Page 16

The Bias–Variance Tradeoff

- The challenge in smoothing is to determine how much smoothing to do
- When the data are oversmoothed, the bias term is large and the variance is small
- When the data are undersmoothed, the opposite is true
- This is called the bias–variance tradeoff
- Minimizing risk corresponds to balancing bias and variance

Page 17

The Bias–Variance Tradeoff

[Figure omitted. Source: Wasserman (2006)]

Page 18

The Bias–Variance Tradeoff (Example)

- Let f be a pdf
- Consider estimating f(0)
- Let h be a small, positive number
- Define

  p_h := P(−h/2 < X < h/2) = ∫_{−h/2}^{h/2} f(x) dx ≈ h f(0)

- Hence

  f(0) ≈ p_h / h

Page 19

The Bias–Variance Tradeoff (Example)

- Let X be the number of observations in the interval (−h/2, h/2)
- Then X ∼ Binomial(n, p_h)
- An estimate of p_h is p̂_h = X/n, and an estimate of f(0) is

  f̂_n(0) = p̂_h / h = X / (nh)

- We now show that the MSE of f̂_n(0) is (for some constants A and B)

  MSE = A h⁴ + B/(nh) = Bias² + Variance

Page 20

The Bias–Variance Tradeoff (Example)

- Taylor expand f around 0:

  f(x) ≈ f(0) + x f′(0) + (x²/2) f″(0)

- Plug in; the odd f′(0) term integrates to zero over the symmetric interval (a symbolic check follows):

  p_h = ∫_{−h/2}^{h/2} f(x) dx ≈ ∫_{−h/2}^{h/2} ( f(0) + x f′(0) + (x²/2) f″(0) ) dx = h f(0) + f″(0) h³/24
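The integral is mechanical, and a short sympy sketch confirms it; here f0, f1, f2 stand in for f(0), f′(0), f″(0):

    # Symbolic check that the Taylor polynomial integrates to h*f(0) + f''(0)*h^3/24.
    import sympy as sp

    x = sp.symbols("x")
    h = sp.symbols("h", positive=True)
    f0, f1, f2 = sp.symbols("f0 f1 f2")   # stand-ins for f(0), f'(0), f''(0)

    taylor = f0 + f1 * x + f2 * x**2 / 2
    p_h = sp.integrate(taylor, (x, -h / 2, h / 2))
    print(sp.expand(p_h))   # f0*h + f2*h**3/24 -- the odd f1 term vanishes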

Page 21

The Bias–Variance Tradeoff (Example)

- Since X is binomial, we have E(X) = n p_h
- Use the Taylor approximation p_h ≈ h f(0) + f″(0) h³/24 and combine:

  E(f̂_n(0)) = E(X)/(nh) = p_h / h ≈ f(0) + f″(0) h²/24

- After rearranging, the bias is

  Bias = E(f̂_n(0)) − f(0) ≈ f″(0) h²/24

Page 22

The Bias–Variance Tradeoff (Example)

- For the variance term, note that Var(X) = n p_h (1 − p_h), so

  Var(f̂_n(0)) = Var(X)/(n²h²) = p_h (1 − p_h) / (nh²)

- Use 1 − p_h ≈ 1, since h is small:

  Var(f̂_n(0)) ≈ p_h / (nh²)

- Combine with the Taylor expansion:

  Var(f̂_n(0)) ≈ (h f(0) + f″(0) h³/24) / (nh²) = f(0)/(nh) + f″(0) h/(24n) ≈ f(0)/(nh)

Page 23

The Bias–Variance Tradeoff (Example)

- Combining both terms,

  MSE = Bias² + Var(f̂_n(0)) = (f″(0))² h⁴/576 + f(0)/(nh) ≡ A h⁴ + B/(nh)

- As we smooth more (increase h), the bias term increases and the variance term decreases
- As we smooth less (decrease h), the bias term decreases and the variance term increases
- This is a typical bias–variance analysis (a simulation sketch follows)
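The tradeoff shows up directly in simulation. This sketch computes the Monte Carlo bias and variance of f̂_n(0) = X/(nh) for a standard normal f; the sample size, bandwidth grid, and replication count are assumptions:

    # Monte Carlo bias/variance of f_hat_n(0) = X/(nh) for f = standard normal
    # (sample size, bandwidth grid, and replications are assumptions).
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(205)
    n, reps = 500, 5_000
    f0 = norm.pdf(0.0)                             # true f(0) ~= 0.3989

    for h in [0.01, 0.05, 0.2, 1.0, 3.0]:
        data = rng.normal(size=(reps, n))
        X = np.sum(np.abs(data) < h / 2, axis=1)   # counts in (-h/2, h/2)
        f_hat = X / (n * h)
        bias2, var = (f_hat.mean() - f0) ** 2, f_hat.var()
        print(f"h={h:4.2f}  bias^2={bias2:.2e}  var={var:.2e}  mse={bias2 + var:.2e}")

Small h gives a large variance term, large h a large bias term, matching the Ah⁴ + B/(nh) analysis above.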

Page 24

The Curse of Dimensionality

- A problem with smoothing is the curse of dimensionality in high dimensions
- Estimation gets harder as the dimension of the observations increases:
- Computational: the computational burden increases exponentially with dimension, and
- Statistical: if the data have dimension d, then we need the sample size n to grow exponentially with d
- The MSE of any nonparametric estimator of a smooth curve has the form (for some c > 0)

  MSE ≈ c n^(−4/(4+d))

- If we want a fixed MSE equal to some small number δ, then solving for n gives

  n ∝ (c/δ)^(d/4)

Page 25

The Curse of Dimensionality

- We see that n ∝ (c/δ)^(d/4) grows exponentially with the dimension d
- The reason is that smoothing involves estimating a function in a local neighborhood
- But in high-dimensional problems the data are very sparse, so local neighborhoods contain very few points

Page 26

The Curse of Dimensionality (Example)

- Suppose n data points are uniformly distributed on the interval [0, 1]
- How many data points will we find in the interval [0, 0.1]?
- The answer: about n/10 points
- Now suppose n points lie in the 10-dimensional unit cube [0, 1]^10
- How many data points fall in [0, 0.1]^10?
- The answer: about

  n × (0.1/1)^10 = n / 10,000,000,000

- Thus n has to be huge to ensure that small neighborhoods contain any data at all
- Smoothing methods can in principle be used in high-dimensional problems
- But the estimator won't be accurate, and the confidence interval around the estimate will be distressingly large

Page 27

The Curse of Dimensionality (Example)

[Figure omitted. Source: Hastie, Tibshirani, and Friedman (2009)]

- In ten dimensions, we need 80% of the range of each coordinate to capture 10% of the data (a one-line check follows)
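The figure's takeaway is a one-line computation: a subcube of [0,1]^d that captures a fraction p of uniformly distributed data needs edge length p^(1/d):

    # Edge length p**(1/d) of a subcube of [0,1]**d that contains
    # a fraction p of uniformly distributed data.
    p = 0.10
    for d in [1, 2, 3, 10]:
        print(f"d={d:2d}  edge length={p ** (1 / d):.3f}")  # d=10 gives ~0.794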

Page 28

References

- Wasserman, L. (2006). All of Nonparametric Statistics. Springer.
- Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer.