Page 1

Nonlinear Regression (Part 1)

Christof Seiler

Stanford University, Spring 2016, STATS 205

Page 2

Overview

- Smoothing or estimating curves
  - Density estimation
  - Nonlinear regression
- Rank-based linear regression

Page 3

Curve Estimation

- A curve of interest can be a probability density function f
- In density estimation, we observe X_1, ..., X_n from some unknown cdf F with density f:

  X_1, ..., X_n ∼ f

- The goal is to estimate the density f; a minimal code sketch follows
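The slides contain no code, but the setup is easy to sketch. The sample, the true density, and the use of scipy.stats.gaussian_kde below are illustrative assumptions, not from the slides:

    # Minimal density-estimation sketch (illustrative; the data-generating
    # density and sample size are assumptions, not from the slides).
    import numpy as np
    from scipy.stats import gaussian_kde

    rng = np.random.default_rng(205)
    X = rng.normal(loc=1.0, scale=1.0, size=500)  # X_1, ..., X_n ~ f (here f is N(1,1))

    f_hat = gaussian_kde(X)               # kernel density estimate of f
    grid = np.linspace(-2.0, 4.0, 200)    # evaluation grid, mirroring the plots below
    density = f_hat(grid)                 # estimated density values on the grid
    print(density.max())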

Page 4

Density Estimation

[Figure: density estimate of the rating data; x-axis: rating (−2 to 4), y-axis: density (0.0 to 0.4)]

Page 5

Density Estimation

[Figure: density estimate of the rating data; x-axis: rating (−2 to 4), y-axis: density (0.0 to 0.4)]

Page 6

Nonlinear Regression

- A curve of interest can be a regression function r
- In regression, we observe pairs (x_1, Y_1), ..., (x_n, Y_n) that are related as

  Y_i = r(x_i) + ε_i,   with E(ε_i) = 0

- The goal is to estimate the regression function r; a minimal sketch follows
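As one concrete estimator of r (an assumed choice, not prescribed by the slides), here is a minimal Nadaraya–Watson kernel smoother; the true r, the design points, and the bandwidth h are hypothetical:

    # Minimal Nadaraya-Watson kernel regression sketch (illustrative; the true
    # r, the design, and the bandwidth are assumptions, not from the slides).
    import numpy as np

    def nadaraya_watson(x0, x, y, h):
        """Locally weighted average of y around x0 with a Gaussian kernel."""
        w = np.exp(-0.5 * ((x - x0) / h) ** 2)   # kernel weights K((x_i - x0)/h)
        return np.sum(w * y) / np.sum(w)

    def r_true(x):
        return 0.1 * np.exp(-0.5 * ((x - 12.0) / 2.0) ** 2)  # hypothetical true r

    rng = np.random.default_rng(205)
    x = np.sort(rng.uniform(10.0, 25.0, size=200))       # design points x_i
    y = r_true(x) + rng.normal(scale=0.02, size=x.size)  # Y_i = r(x_i) + eps_i

    r_hat = np.array([nadaraya_watson(x0, x, y, h=1.0) for x0 in x])
    print(np.abs(r_hat - r_true(x)).mean())              # average estimation error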

Page 7

Nonlinear Regression

[Figure: Bone Mineral Density Data; x-axis: Age (10 to 25), y-axis: Change in BMD (0.0 to 0.2); points colored by gender (female, male)]

Page 8

Nonlinear Regression

[Figure: Bone Mineral Density Data; x-axis: Age (10 to 25), y-axis: Change in BMD (−0.1 to 0.2); points colored by gender (female, male)]

Page 9

Nonlinear Regression

[Figure: Bone Mineral Density Data; x-axis: Age (10 to 25), y-axis: Change in BMD (0.0 to 0.2); points colored by gender (female, male)]

Page 10

Nonlinear Regression

[Figure: Bone Mineral Density Data; x-axis: Age (10 to 25), y-axis: Change in BMD (0.0 to 0.2); points colored by gender (female, male)]

Page 11

The Bias–Variance Tradeoff

- Let f̂_n(x) be an estimate of a function f(x)
- Define the squared error loss function as

  Loss = L(f(x), f̂_n(x)) = (f(x) − f̂_n(x))²

- Define the average of this loss as the risk, or mean squared error (MSE):

  MSE = R(f(x), f̂_n(x)) = E(Loss)

- The expectation is taken with respect to f̂_n, which is random
- The MSE can be decomposed into a bias term and a variance term:

  MSE = Bias² + Var

- The decomposition is easy to show

Page 12

The Bias–Variance Tradeoff

- Expand:

  E((f − f̂)²) = E(f² + f̂² − 2f f̂) = E(f²) + E(f̂²) − 2E(f f̂)

- Use Var(X) = E(X²) − E(X)²:

  E((f − f̂)²) = Var(f) + E(f)² + Var(f̂) + E(f̂)² − 2E(f f̂)

- Use E(f) = f and Var(f) = 0, since f is deterministic:

  E((f − f̂)²) = f² + Var(f̂) + E(f̂)² − 2f E(f̂)

- Use (E(f̂) − f)² = f² + E(f̂)² − 2f E(f̂):

  E((f − f̂)²) = (E(f̂) − f)² + Var(f̂) = Bias² + Var
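The identity can also be confirmed numerically. The following sketch checks MSE = Bias² + Var by Monte Carlo for a deliberately biased estimator of a scalar target; the target value, sample size, and shrinkage factor are all illustrative assumptions:

    # Monte Carlo check that MSE = Bias^2 + Var (illustrative sketch; all
    # numbers below are assumptions, not from the slides).
    import numpy as np

    rng = np.random.default_rng(205)
    f_true, n, reps = 2.0, 50, 100_000

    samples = rng.normal(loc=f_true, scale=1.0, size=(reps, n))
    f_hat = 0.9 * samples.mean(axis=1)   # deliberately biased estimator of f_true

    mse = np.mean((f_hat - f_true) ** 2)
    bias = f_hat.mean() - f_true
    var = f_hat.var()
    print(mse, bias**2 + var)            # the two agree up to Monte Carlo error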

Page 13

The Bias–Variance Tradeoff

- This describes the risk at a single point x
- To summarize the risk for density estimation problems, we integrate:

  R(f, f̂_n) = ∫ R(f(x), f̂_n(x)) dx

- For regression problems, we sum over all design points:

  R(r, r̂_n) = ∑_{i=1}^n R(r(x_i), r̂_n(x_i))

Page 14

The Bias–Variance Tradeoff

- Consider the regression model

  Y_i = r(x_i) + ε_i

- Suppose we draw a new observation Y*_i = r(x_i) + ε*_i at each x_i
- If we predict Y*_i with r̂_n(x_i), then the squared prediction error is

  (Y*_i − r̂_n(x_i))² = (r(x_i) + ε*_i − r̂_n(x_i))²

- Define the predictive risk as

  E( (1/n) ∑_{i=1}^n (Y*_i − r̂_n(x_i))² )

Page 15

The Bias–Variance Tradeoff

- Up to a constant, the average risk and the predictive risk are the same:

  E( (1/n) ∑_{i=1}^n (Y*_i − r̂_n(x_i))² ) = R(r, r̂_n) + (1/n) ∑_{i=1}^n E((ε*_i)²)

- In particular, if the error ε_i has variance σ², then

  E( (1/n) ∑_{i=1}^n (Y*_i − r̂_n(x_i))² ) = R(r, r̂_n) + σ²

  (a simulation sketch follows)
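A quick simulation illustrates the relationship. In this sketch R(r, r̂_n) is taken to be the average risk over the design points (one reading of "average risk" above); the true r, the estimator (a polynomial fit), and σ are assumptions:

    # Simulation sketch: predictive risk ~= average risk + sigma^2
    # (true r, estimator, and noise level are assumptions, not from the slides).
    import numpy as np

    rng = np.random.default_rng(205)
    n, sigma, reps = 100, 0.5, 2000
    x = np.linspace(0.0, 1.0, n)
    r = np.sin(2 * np.pi * x)                 # hypothetical true regression function

    avg_risk, pred_risk = 0.0, 0.0
    for _ in range(reps):
        y = r + rng.normal(scale=sigma, size=n)          # training data Y_i
        r_hat = np.polyval(np.polyfit(x, y, deg=5), x)   # some estimator r_hat_n
        y_star = r + rng.normal(scale=sigma, size=n)     # new observations Y*_i
        avg_risk += np.mean((r - r_hat) ** 2) / reps
        pred_risk += np.mean((y_star - r_hat) ** 2) / reps

    print(pred_risk, avg_risk + sigma**2)     # approximately equal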

Page 16

The Bias–Variance Tradeoff

- The challenge in smoothing is to determine how much smoothing to do
- When the data are oversmoothed, the bias term is large and the variance is small
- When the data are undersmoothed, the opposite is true
- This is called the bias–variance tradeoff
- Minimizing risk corresponds to balancing bias and variance

Page 17

The Bias–Variance Tradeoff

[Figure omitted. Source: Wasserman (2006)]

Page 18

The Bias–Variance Tradeoff (Example)

- Let f be a pdf
- Consider estimating f(0)
- Let h be a small, positive number
- Define

  p_h := P(−h/2 < X < h/2) = ∫_{−h/2}^{h/2} f(x) dx ≈ h f(0)

- Hence

  f(0) ≈ p_h / h

Page 19

The Bias–Variance Tradeoff (Example)

- Let X be the number of observations in the interval (−h/2, h/2)
- Then X ∼ Binomial(n, p_h)
- An estimate of p_h is p̂_h = X/n, and an estimate of f(0) is

  f̂_n(0) = p̂_h / h = X / (nh)

- We now show that the MSE of f̂_n(0) is (for some constants A and B)

  MSE = A h⁴ + B/(nh) = Bias² + Variance

Page 20

The Bias–Variance Tradeoff (Example)

- Taylor expand f around 0:

  f(x) ≈ f(0) + x f′(0) + (x²/2) f″(0)

- Plug in; the odd f′(0) term integrates to zero over the symmetric interval (a symbolic check follows):

  p_h = ∫_{−h/2}^{h/2} f(x) dx ≈ ∫_{−h/2}^{h/2} ( f(0) + x f′(0) + (x²/2) f″(0) ) dx = h f(0) + f″(0) h³/24
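The integral is mechanical, and a short sympy sketch confirms it; here f0, f1, f2 stand in for f(0), f′(0), f″(0):

    # Symbolic check that the Taylor polynomial integrates to h*f(0) + f''(0)*h^3/24.
    import sympy as sp

    x = sp.symbols("x")
    h = sp.symbols("h", positive=True)
    f0, f1, f2 = sp.symbols("f0 f1 f2")   # stand-ins for f(0), f'(0), f''(0)

    taylor = f0 + f1 * x + f2 * x**2 / 2
    p_h = sp.integrate(taylor, (x, -h / 2, h / 2))
    print(sp.expand(p_h))   # f0*h + f2*h**3/24 -- the odd f1 term vanishes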

Page 21

The Bias–Variance Tradeoff (Example)

- Since X is binomial, we have E(X) = n p_h
- Use the Taylor approximation p_h ≈ h f(0) + f″(0) h³/24 and combine:

  E(f̂_n(0)) = E(X)/(nh) = p_h / h ≈ f(0) + f″(0) h²/24

- After rearranging, the bias is

  Bias = E(f̂_n(0)) − f(0) ≈ f″(0) h²/24

Page 22

The Bias–Variance Tradeoff (Example)

- For the variance term, note that Var(X) = n p_h (1 − p_h), so

  Var(f̂_n(0)) = Var(X)/(n²h²) = p_h (1 − p_h) / (nh²)

- Use 1 − p_h ≈ 1, since h is small:

  Var(f̂_n(0)) ≈ p_h / (nh²)

- Combine with the Taylor expansion:

  Var(f̂_n(0)) ≈ (h f(0) + f″(0) h³/24) / (nh²) = f(0)/(nh) + f″(0) h/(24n) ≈ f(0)/(nh)

Page 23

The Bias–Variance Tradeoff (Example)

- Combining both terms,

  MSE = Bias² + Var(f̂_n(0)) = (f″(0))² h⁴/576 + f(0)/(nh) ≡ A h⁴ + B/(nh)

- As we smooth more (increase h), the bias term increases and the variance term decreases
- As we smooth less (decrease h), the bias term decreases and the variance term increases
- This is a typical bias–variance analysis (a simulation sketch follows)
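The tradeoff shows up directly in simulation. This sketch computes the Monte Carlo bias and variance of f̂_n(0) = X/(nh) for a standard normal f; the sample size, bandwidth grid, and replication count are assumptions:

    # Monte Carlo bias/variance of f_hat_n(0) = X/(nh) for f = standard normal
    # (sample size, bandwidth grid, and replications are assumptions).
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(205)
    n, reps = 500, 5_000
    f0 = norm.pdf(0.0)                             # true f(0) ~= 0.3989

    for h in [0.01, 0.05, 0.2, 1.0, 3.0]:
        data = rng.normal(size=(reps, n))
        X = np.sum(np.abs(data) < h / 2, axis=1)   # counts in (-h/2, h/2)
        f_hat = X / (n * h)
        bias2, var = (f_hat.mean() - f0) ** 2, f_hat.var()
        print(f"h={h:4.2f}  bias^2={bias2:.2e}  var={var:.2e}  mse={bias2 + var:.2e}")

Small h gives a large variance term, large h a large bias term, matching the Ah⁴ + B/(nh) analysis above.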

Page 24

The Curse of Dimensionality

- A problem with smoothing is the curse of dimensionality in high dimensions
- Estimation gets harder as the dimension of the observations increases:
- Computational: the computational burden increases exponentially with dimension, and
- Statistical: if the data have dimension d, then we need the sample size n to grow exponentially with d
- The MSE of any nonparametric estimator of a smooth curve has the form (for some c > 0)

  MSE ≈ c n^(−4/(4+d))

- If we want a fixed MSE equal to some small number δ, then solving for n gives

  n ∝ (c/δ)^(d/4)

Page 25

The Curse of Dimensionality

- We see that n ∝ (c/δ)^(d/4) grows exponentially with the dimension d
- The reason is that smoothing involves estimating a function in a local neighborhood
- But in high-dimensional problems the data are very sparse, so local neighborhoods contain very few points

Page 26

The Curse of Dimensionality (Example)

- Suppose n data points are uniformly distributed on the interval [0, 1]
- How many data points will we find in the interval [0, 0.1]?
- The answer: about n/10 points
- Now suppose n points lie in the 10-dimensional unit cube [0, 1]^10
- How many data points fall in [0, 0.1]^10?
- The answer: about

  n × (0.1/1)^10 = n / 10,000,000,000

- Thus n has to be huge to ensure that small neighborhoods contain any data at all
- Smoothing methods can in principle be used in high-dimensional problems
- But the estimator won't be accurate, and the confidence interval around the estimate will be distressingly large

Page 27

The Curse of Dimensionality (Example)

[Figure omitted. Source: Hastie, Tibshirani, and Friedman (2009)]

- In ten dimensions, we need 80% of the range of each coordinate to capture 10% of the data (a one-line check follows)
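The figure's takeaway is a one-line computation: a subcube of [0,1]^d that captures a fraction p of uniformly distributed data needs edge length p^(1/d):

    # Edge length p**(1/d) of a subcube of [0,1]**d that contains
    # a fraction p of uniformly distributed data.
    p = 0.10
    for d in [1, 2, 3, 10]:
        print(f"d={d:2d}  edge length={p ** (1 / d):.3f}")  # d=10 gives ~0.794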

Page 28

References

- Wasserman, L. (2006). All of Nonparametric Statistics. Springer.
- Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer.