
Computational Statistics
Lectures 10-13: Smoothing and Nonparametric Inference

Dr Jennifer Rogers

Hilary Term 2017

Background

Smoothing and nonparametric methods

- Approximating functions that attempt to capture important patterns in datasets or images
- Leave out noise
- Aid data analysis by allowing us to extract more information from the data
- Analyses are flexible and robust
- Should we always just use a nonparametric estimator?

Smoothing and nonparametric methods

No!

- There is no miracle!
- There is a price to pay for the gain in generality
- When we have clear evidence of a good parametric model for the data, we should use it
- Nonparametric estimators converge to the true curve more slowly than VALID parametric estimators
- But as soon as the parametric model is incorrect, a parametric estimator will never converge to the true curve

So nonparametric methods have their place!

The regression problem

- Goal of regression → discover the relationship between two variables, $X$ and $Y$
- Wish to find a curve $m$ that passes "in the middle" of the points
- Observations $(x_i, Y_i)$ for $i = 1, \dots, n$
  - $x_i$ is a real-valued variable
  - $Y_i$ is a real-valued random response
- $Y_i = m(x_i) + \varepsilon_i$ for $i = 1, \dots, n$
  - $E(\varepsilon_i \mid X_i) = 0$
  - $\mathrm{Var}(\varepsilon_i \mid X_i) = \sigma^2(X_i)$
  - $m(x) = E(Y \mid X = x)$
- $m(\cdot)$: regression function
  - Reflects the relationship between $X$ and $Y$
  - Curve of interest, which "lies in the middle" of all the points
- Goal is to infer $m(x)$ from the observations $(x_i, Y_i)$

Example: Cosmic microwave background data

Example: FTSE stock market index

Linear smoothers

Linear smoothers

- In the parametric context, we assume we know the shape of $m(\cdot)$
- Linear model: $Y = \alpha + \beta X + \varepsilon$
  - $m(x) = E(Y \mid X = x) = \alpha + \beta x$
- We estimate $\alpha$ and $\beta$ from the data
- Least squares estimator:

$$(\hat\alpha, \hat\beta) = \operatorname{argmin}_{(\alpha,\beta)} \sum_i (Y_i - \alpha - \beta X_i)^2$$

- Consider $m(x_i) = 1 + 2x_i$
- We can fit a linear model to the data and obtain $\hat\alpha = 0.9905$ and $\hat\beta = 2.0025$

Linear modelling

> x <- seq(from=0, to=1, length.out=1000)
> e <- rnorm(1000, 0, 0.2)
> y1 <- 2*x + 1 + e
> lm(y1 ~ x)

Call:
lm(formula = y1 ~ x)

Coefficients:
(Intercept)            x
     0.9905       2.0025

Linear modelling

Linear modelling

- Linear model: linear in the parameters!
  - No higher-order terms such as $\alpha\beta$ or $\beta^2$
- Not necessarily linear in $X_i$
- Examples of linear models:
  - $Y_i = 8X_i^3 - 13.6X_i^2 + 7.28X_i - 1.176 + \varepsilon$
  - $Y_i = \cos(20X_i) + \varepsilon$

Linear modelling

> lm(y2 ~ x + I(x^2) + I(x^3))

Linear modelling

> lm(y3 ~ I(cos(20*x)))
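As a quick illustration, a minimal sketch of how these two fits could be reproduced; the responses y2 and y3 are not constructed on the slides, so simulating them from the two example models above (reusing x and e from the earlier simulation) is an assumption here.

x  <- seq(from = 0, to = 1, length.out = 1000)
e  <- rnorm(1000, 0, 0.2)
y2 <- 8*x^3 - 13.6*x^2 + 7.28*x - 1.176 + e   # assumed: simulated from the cubic example
y3 <- cos(20*x) + e                           # assumed: simulated from the cosine example
lm(y2 ~ x + I(x^2) + I(x^3))                  # approximately recovers 8, -13.6, 7.28, -1.176
lm(y3 ~ I(cos(20*x)))                         # coefficient on cos(20*x) close to 1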

Linear modelling

- Can still use linear modelling
- Requires knowledge of the functional form of the explanatory variables
  - May not always be obvious
- Consider linear smoothers - much more general
- Obtain a non-trivial smoothing matrix even for just a single 'predictor' variable ($p = 1$)

Linear smoothers

- For some $n \times n$ matrix $S$, $\hat Y = SY$
- Fitted value $\hat Y_i$ at design point $x_i$ is a linear combination of the measurements:

$$\hat Y_i = \sum_{j=1}^{n} S_{ij} Y_j$$

- Linear regression with $p$ predictor variables:

$$\hat\theta = \operatorname{argmin}_\theta \sum_i (Y_i - X_i\theta)^2 = \operatorname{argmin}_\theta (Y - X\theta)^T(Y - X\theta) = \operatorname{argmin}_\theta \big(Y^T Y - 2\theta^T X^T Y + \theta^T X^T X\theta\big)$$

- Differentiating with respect to $\theta$ and setting to zero:

$$-2X^T Y + 2X^T X\theta = 0 \quad\Longrightarrow\quad \hat\theta = (X^T X)^{-1} X^T Y$$

Linear smoothers

$$\hat\theta = (X^T X)^{-1} X^T Y$$

- Estimated (fitted) values $\hat Y$ are $X\hat\theta$:

$$\hat Y = X(X^T X)^{-1} X^T Y = SY,$$

- $S$ is an $n \times n$ matrix
- This is the hat matrix, $H$, from linear regression

Linear smoothers

- Degrees of freedom for linear regression:

$$\operatorname{tr}\big(X(X^T X)^{-1}X^T\big) = \operatorname{tr}\big(X^T X (X^T X)^{-1}\big) = \operatorname{tr}(I_p) = p$$

- How large is the expected residual sum of squares

$$E(\mathrm{RSS}) = E\Big(\sum_{i=1}^n (Y_i - \hat Y_i)^2\Big)?$$

- If $\hat Y = X\theta$ (i.e. we use the true parameter), then $E(\mathrm{RSS}) = E\big(\sum_{i=1}^n \varepsilon_i^2\big) = n\sigma^2$
- If $\hat Y = SY$, then $\hat Y - Y = S\varepsilon - \varepsilon$ and

$$E(\mathrm{RSS}) = \sigma^2(n - p)$$

A good estimator of $\sigma^2$ is thus

$$\hat\sigma^2 = \frac{\mathrm{RSS}}{n - p} = \frac{\mathrm{RSS}}{n - \mathrm{df}}$$
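A minimal R sketch of these quantities (the simulated data-generating model here is an assumption for illustration): it builds the hat matrix, checks that its trace equals $p$, and forms $\hat\sigma^2 = \mathrm{RSS}/(n - \mathrm{df})$.

set.seed(1)
n <- 100
x <- runif(n)
y <- 1 + 2*x + rnorm(n, sd = 0.2)      # assumed simulation, true sigma = 0.2
X <- cbind(1, x)                       # design matrix with intercept, so p = 2
S <- X %*% solve(t(X) %*% X) %*% t(X)  # hat matrix S = X (X'X)^{-1} X'
yhat <- S %*% y                        # fitted values Y_hat = S Y
sum(diag(S))                           # degrees of freedom: trace(S) = p = 2
RSS <- sum((y - yhat)^2)
RSS / (n - sum(diag(S)))               # sigma^2 estimate, close to 0.2^2 = 0.04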

Linear smoothers

$$\hat Y = \hat m(x) = SY,$$

- Serum data, taken in connection with diabetes research
- $Y$: log-concentration of a serum
- $x$: age of children in months
- There are various ways in which $S$ can be chosen:

Linear smoothers

The methods considered fall into two categories:

- Local regression, including
  - Kernel estimators and
  - Local polynomial regression.
- Penalized estimators, mainly
  - Smoothing splines.

Local estimation: Kernel estimators

Histogram

- $X \sim F$
- $P(x \le X \le x + \Delta x) = \int_x^{x+\Delta x} f_X(v)\,dv$
- Thus, for any $u \in [x, x + \Delta x]$:

$$P(x \le X \le x + \Delta x) \approx \Delta x \cdot f_X(u),$$

- This implies

$$f_X(u) \approx \frac{P(x \le X \le x + \Delta x)}{\Delta x}$$

$$\hat f_X(u) = \frac{\#\{X_i : x \le X_i \le x + \Delta x\}}{n\Delta x}$$

This is the idea behind histograms.

Histogram

- Choose an origin $t_0$
- Choose a bin size, $h$
- Partition the real line into intervals $I_k = [t_k, t_{k+1}]$ of equal length $h$
- Histogram estimator, for $x \in I_k$:

$$\hat f_H(x) = \frac{\#\{X_i : t_k \le X_i \le t_{k+1}\}}{nh}.$$

A step function that depends heavily on both the origin, $t_0$, and the bin width, $h$.
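A minimal hand-rolled sketch of this estimator; the simulated data, the origin t0 and the bin width h below are arbitrary choices for illustration.

# Histogram density estimate at a point x0, for a chosen origin t0 and bin width h
hist_est <- function(x0, data, t0, h) {
  k  <- floor((x0 - t0) / h)           # index of the bin I_k containing x0
  lo <- t0 + k*h; hi <- lo + h         # the bin [t_k, t_{k+1})
  sum(data >= lo & data < hi) / (length(data) * h)
}
z <- rnorm(500)                        # assumed: simulated standard normal sample
hist_est(0.25, z, t0 = -4, h = 0.5)    # roughly dnorm(0.25) ~ 0.39 for a large sample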

Example: Old Faithful geyser

Duration in minutes of 272 eruptions of the Old Faithful geyser in Yellowstone National Park

> hist(faithful$eruptions,probability = T)

Example: Old Faithful geyser

What happens if we change the time origin and the bin width?

> hist(faithful$eruptions,breaks=seq(1.5,5.5,1),probability = T,xlim=c(1,5.5))

> hist(faithful$eruptions,breaks=seq(1.1,5.1,1),probability = T,xlim=c(1,5.5))

> hist(faithful$eruptions,breaks=seq(0.5,5.5,0.5),probability = T,xlim=c(1,5.5))

> hist(faithful$eruptions,breaks=seq(0.75,5.75,0.5),probability = T,xlim=c(1,5.5))

Example: Old Faithful geyser

Density estimator

Can we do better? We can get rid of the time origin.

$$f_X(x) = \lim_{h\to 0} \frac{F_X(x+h) - F_X(x)}{h} = \lim_{h\to 0} \frac{F_X(x) - F_X(x-h)}{h}$$

Combining the two expressions:

$$f_X(x) = \lim_{h\to 0} \frac{F_X(x+h) - F_X(x-h)}{2h} = \lim_{h\to 0} \frac{P(x-h < X < x+h)}{2h}$$

which we can estimate using proportions:

$$\hat f_X(x) = \frac{1}{nh} \sum_{i=1}^{n} K\Big(\frac{x - X_i}{h}\Big),$$

with $K(x) = \tfrac{1}{2} \cdot I\{|x| < 1\}$.

Density estimator

- Similar to the histogram
  - No longer have the origin, $t_0$
  - More flexible
- Constructs a box of length $2h$ around each observation $X_i$
- Estimator is then the sum of the boxes at $x$
- A density that depends on the bandwidth, $h$

Kernel estimators

- Put a smooth symmetric 'bump' of shape $K$ around each observation
- Estimator at $x$ is now the sum of the bumps at $x$
- We define

$$\hat f_X(x) = \frac{1}{nh} \sum_{i=1}^{n} K\Big(\frac{x - X_i}{h}\Big),$$

- $K$: 'kernel' function
- The estimator inherits the properties of $K$
  - If $K$ is continuous and differentiable → so is the estimator
  - Estimator is a density if $K$ is a density
- Shape of $K$ does not greatly influence the resulting estimator
- Estimator does depend heavily on $h$
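A sketch of this estimator in R with a Gaussian kernel, applied to the Old Faithful eruption durations used earlier; the bandwidth value is an arbitrary choice for illustration.

# Kernel density estimate: f_hat(x) = (1/(n*h)) * sum_i K((x - X_i)/h)
kde <- function(x, data, h, K = dnorm) {
  sapply(x, function(x0) sum(K((x0 - data) / h)) / (length(data) * h))
}
grid <- seq(1, 6, length.out = 200)
plot(grid, kde(grid, faithful$eruptions, h = 0.3), type = "l",
     xlab = "eruption duration (min)", ylab = "density")
# for the Gaussian kernel this agrees (up to a binning approximation) with density()
lines(density(faithful$eruptions, bw = 0.3), col = 2, lty = 2)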

Kernel estimators

Kernels

A kernel is a real-valued function $K(x)$ such that
- $K(x) \ge 0$ for all $x \in \mathbb{R}$,
- $\int K(x)\,dx = 1$,
- $\int xK(x)\,dx = 0$.

In practice, the choice of $K$ does not influence the results much, but the value of $h$ is crucial.

Kernels

Commonly used kernels include
- Boxcar: $K(x) = I(x)/2$
- Gaussian: $K(x) = (2\pi)^{-1/2}\exp(-x^2/2)$
- Epanechnikov: $K(x) = \frac{3}{4}(1 - x^2)I(x)$
- Biweight: $K(x) = \frac{15}{16}(1 - x^2)^2 I(x)$
- Triweight: $K(x) = \frac{35}{32}(1 - x^2)^3 I(x)$
- Uniform: $K(x) = \frac{1}{2} I(x)$

where $I(x) = 1$ if $|x| \le 1$ and $I(x) = 0$ otherwise.

Example: Old Faithful geyser

Kernel regression

Kernel regression

$$Y = m(x) + \varepsilon$$

1. We want to estimate $E(Y \mid X = x)$. Naive estimator:

$$\hat m(x) = \frac{\sum_{i=1}^n Y_i}{n}.$$

Same for all $x$.

2. Average the $Y_i$s of only those $X_i$s that are close to $x$ (local average):

$$\hat m(x) = \frac{\sum_{i=1}^n Y_i \cdot I\{|X_i - x| < h\}}{\sum_{i=1}^n I\{|X_i - x| < h\}}.$$

$h$: bandwidth, determines the size of the neighbourhood around $x$.

Kernel regression

$$Y = m(x) + \varepsilon$$

3. Give a slowly decreasing weight to $X_i$ as it gets further from $x$, rather than giving the same weight to all observations close to $x$:

$$\hat m(x) = \sum_{i=1}^n Y_i W(x - X_i),$$

$W(\cdot)$: weight function that decreases as $|x - X_i|$ increases, with $\sum_{i=1}^n W(x - X_i) = 1$.

Nadaraya-Watson estimator

$$W(x - X_i) = K\Big(\frac{x - X_i}{h}\Big) \Big/ \sum_{j=1}^n K\Big(\frac{x - X_j}{h}\Big)$$

Hence, the Nadaraya-Watson kernel estimator is

$$\hat m(x) = \frac{\sum_{i=1}^n Y_i K\big(\frac{x - X_i}{h}\big)}{\sum_{j=1}^n K\big(\frac{x - X_j}{h}\big)}$$

The estimated function values $\hat Y_j = \hat m(x_j)$ at the observed design points are given by

$$\hat Y_j = \sum_i S_{ji} Y_i, \quad \text{where} \quad S_{ji} = \frac{K\big(\frac{x_j - x_i}{h}\big)}{\sum_k K\big(\frac{x_j - x_k}{h}\big)},$$

The kernel smoother is thus a linear smoother.
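A minimal R sketch of the Nadaraya-Watson estimator written as a linear smoother with a Gaussian kernel; the simulated data and the bandwidth are assumptions for illustration.

set.seed(2)
n <- 200
x <- sort(runif(n))
y <- sin(2*pi*x) + rnorm(n, sd = 0.3)                # assumed simulation
h <- 0.05
W <- outer(x, x, function(a, b) dnorm((a - b) / h))  # W[j, i] = K((x_j - x_i)/h)
S <- W / rowSums(W)                                  # S[j, i] = K((x_j - x_i)/h) / sum_k K((x_j - x_k)/h)
yhat <- S %*% y                                      # m_hat(x_j): a linear combination of the Y_i
range(rowSums(S))                                    # each row of S sums to 1
plot(x, y); lines(x, yhat, col = 2)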

Local least squares

We can rewrite the kernel regression estimator as

$$\hat m(x) = \operatorname{argmin}_{m_x \in \mathbb{R}} \sum_{i=1}^n K\Big(\frac{x - x_i}{h}\Big)(Y_i - m_x)^2.$$

Exercise: this can be verified by solving

$$\frac{d}{dm_x} \sum_{i=1}^n K\Big(\frac{x - x_i}{h}\Big)(Y_i - m_x)^2 = 0$$

(a short derivation is sketched below).

- Thus, for every fixed $x$, we have to search for the best local constant $m_x$ such that the localized sum of squares is minimized
- Localization is here described by the kernel, which gives a large weight to those observations $(x_i, Y_i)$ where $x_i$ is close to the point $x$ of interest.

The choice of the bandwidth $h$ is crucial.
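Working through the exercise, setting the derivative to zero gives

$$0 = \frac{d}{dm_x}\sum_{i=1}^n K\Big(\frac{x - x_i}{h}\Big)(Y_i - m_x)^2 = -2\sum_{i=1}^n K\Big(\frac{x - x_i}{h}\Big)(Y_i - m_x),$$

so that

$$\hat m(x) = m_x = \frac{\sum_{i=1}^n Y_i K\big(\frac{x - x_i}{h}\big)}{\sum_{j=1}^n K\big(\frac{x - x_j}{h}\big)},$$

which is exactly the Nadaraya-Watson estimator.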

Example: FTSE stock market index

Example: Cosmic microwave background data

Choosing the bandwidth

Choosing the bandwidth

- Measure "success" of the fit using the mean squared error on new observations (MSE),

$$\mathrm{MSE}(h) = E\big(Y - \hat m_h(x)\big)^2,$$

- Splitting into noise, bias and variance:

$$\mathrm{MSE}(h) = \text{Noise} + \text{Bias}^2 + \text{Variance}$$

- Bias decreases as $h \downarrow 0$
- Variance increases as $h \downarrow 0$
- Choosing the bandwidth is a bias-variance trade-off.

Choosing the bandwidth

- But... we don't know $\mathrm{MSE}(h)$, as we just have $n$ observations and cannot generate new random observations $Y$
- A first idea is to compute, for various values of the bandwidth $h$:
  - The estimator $\hat m_h$ for a training sample
  - The training error

$$n^{-1}\sum_{i=1}^n \big(Y_i - \hat m_h(x_i)\big)^2 = n^{-1}\sum_{i=1}^n \big(Y_i - \hat Y_i\big)^2,$$

- Choose the bandwidth with the smallest training error

CMB data

Choosing the bandwidth

- We would choose a bandwidth close to $h = 0$, giving near-perfect interpolation of the data, that is $\hat m(x_i) \approx Y_i$
- This is unsurprising
- Parametric context
  - Shape of the model is fixed
  - Minimising the MSE makes the parametric model as close as possible to the data
- Nonparametric setting
  - Don't have a fixed shape
  - Value of $h$ dictates the model
  - Minimising the MSE → fitted model as close as possible to the data
  - Leads us to choose $h$ as small as possible
  - Interpolation of the data
- Misleading result → only noise is fitted for very small bandwidths

Cross-validation

- Solution... don't use $X_i$ to construct $\hat m(X_i)$
- This is the idea behind cross-validation
  - Leave-one-out cross-validation
  - Least squares cross-validation
- For each value of $h$:
  - For each $i = 1, \dots, n$, compute the estimator $\hat m_h^{(-i)}(x)$, where $\hat m_h^{(-i)}(x)$ is computed without using observation $i$
  - The estimated MSE is then given by

$$\widehat{\mathrm{MSE}}(h) = n^{-1}\sum_i \big(Y_i - \hat m_h^{(-i)}(x_i)\big)^2$$

CMB data

CMB data

A bandwidth of 44 minimises the estimated MSE

Cross-validation

- A drawback of LOO-CV is that it is expensive to compute
  - The fit has to be recalculated $n$ times (once for each left-out observation)
- We can avoid needing to calculate $\hat m^{(-i)}(x)$ for all $i$
- For some $n \times n$ matrix $S$, the linear smoother fulfills

$$\hat Y = SY$$

The risk (MSE) under LOO-CV can subsequently be written as

$$\widehat{\mathrm{MSE}}(h) = n^{-1}\sum_{i=1}^n \Big(\frac{Y_i - \hat m_h(x_i)}{1 - S_{ii}}\Big)^2$$

Cross-validation

- Do not need to recompute $\hat m_h$ while leaving out each of the $n$ observations in turn
- Results can be obtained much faster by rescaling the residuals

$$Y_i - \hat m_h(x_i)$$

by the factor $(1 - S_{ii})$

- $S_{ii}$ is the $i$-th diagonal entry in the smoothing matrix

Generalized Cross-Validation

$$\widehat{\mathrm{MSE}}(h) = n^{-1}\sum_{i=1}^n \Big(\frac{Y_i - \hat m_h(x_i)}{1 - S_{ii}}\Big)^2,$$

Replace $S_{ii}$ by its average $\nu/n$ (where $\nu = \sum_i S_{ii}$)

Choose the bandwidth $h$ that minimizes

$$\mathrm{GCV}(h) = n^{-1}\sum_{i=1}^n \Big(\frac{Y_i - \hat m_h(x_i)}{1 - \nu/n}\Big)^2.$$
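A small R sketch of both criteria for the Nadaraya-Watson smoother, computed directly from the smoother matrix; the simulated data and the bandwidth grid are assumptions for illustration.

set.seed(2)
n <- 200
x <- sort(runif(n)); y <- sin(2*pi*x) + rnorm(n, sd = 0.3)   # assumed simulation
nw_smoother <- function(x, h) {
  W <- outer(x, x, function(a, b) dnorm((a - b) / h))
  W / rowSums(W)                                    # Nadaraya-Watson smoother matrix S
}
cv_scores <- function(h) {
  S <- nw_smoother(x, h)
  yhat <- as.vector(S %*% y)
  loocv <- mean(((y - yhat) / (1 - diag(S)))^2)         # LOO-CV via the (1 - S_ii) shortcut
  gcv   <- mean(((y - yhat) / (1 - mean(diag(S))))^2)   # GCV: replace S_ii by its average nu/n
  c(LOOCV = loocv, GCV = gcv)
}
hs <- seq(0.01, 0.2, by = 0.01)
scores <- sapply(hs, cv_scores)
hs[which.min(scores["LOOCV", ])]                  # bandwidth minimising the estimated MSE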

Local polynomial regression

Nadaraya-Watson kernel estimator

- Major disadvantage of the Nadaraya-Watson kernel estimator → boundary bias
- Bias is of large order at the boundaries

Local polynomial regression

- Even when a curve doesn't look like a polynomial:
  - Restrict to a small neighbourhood of a given point, $x$
  - Approximate the curve by a polynomial in that neighbourhood
- Fit its coefficients using only observations $X_i$ close to $x$ (or rather, putting more weight on observations close to $x$)
- Repeat this procedure at every point $x$ where we want to estimate $m(x)$

Local polynomial regression

- The kernel estimator approximates the data by taking local averages within small bandwidths
- Use local linear regression to obtain an approximation

Kernel regression estimator

Recall that the kernel regression estimator is the solution to:

$$\hat m(x) = \operatorname{argmin}_{m(x) \in \mathbb{R}} \sum_{i=1}^n K\Big(\frac{x - x_i}{h}\Big)(Y_i - m(x))^2$$

This is given by

$$\hat m(x) = \frac{\sum_{i=1}^n Y_i K\big(\frac{x - X_i}{h}\big)}{\sum_{j=1}^n K\big(\frac{x - X_j}{h}\big)}.$$

Thus estimation corresponds to the solution of a weighted sum of squares problem.

Local polynomial regression

Using a Taylor series, we can approximate $m(x)$, where $x$ is close to a point $x_0$, by the following polynomial:

$$m(x) \approx m(x_0) + m^{(1)}(x_0)(x - x_0) + \frac{m^{(2)}(x_0)}{2!}(x - x_0)^2 + \dots + \frac{m^{(p)}(x_0)}{p!}(x - x_0)^p$$
$$= m(x_0) + \beta_1(x - x_0) + \beta_2(x - x_0)^2 + \dots + \beta_p(x - x_0)^p$$

where $m^{(k)}(x_0) = k!\,\beta_k$, provided that all the required derivatives exist.

Local polynomial regression

- Use the data to estimate the polynomial of degree $p$ which best approximates $m(x_i)$ in a small neighbourhood around the point $x$
- Minimise with respect to $\beta_0, \beta_1, \dots, \beta_p$ the function:

$$\sum_{i=1}^n \big\{Y_i - \beta_0 - \beta_1(x_i - x) - \dots - \beta_p(x_i - x)^p\big\}^2 K\Big(\frac{x - x_i}{h}\Big)$$

- A weighted least squares problem, where the weights are given by the kernel values $K((x - x_i)/h)$
- As $m^{(k)}(x) = k!\,\beta_k$, we then have that $m(x) = \beta_0$, or

$$\hat m(x) = \hat\beta_0$$
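A minimal sketch of this weighted least squares problem for degree $p = 1$, solved at each point of a grid with lm() and kernel weights; the simulated data and the bandwidth are assumptions for illustration.

set.seed(3)
n <- 300
x <- sort(runif(n)); y <- sin(2*pi*x) + rnorm(n, sd = 0.3)   # assumed simulation
h <- 0.1
loclin <- function(x0) {
  w   <- dnorm((x - x0) / h)              # kernel weights K((x0 - x_i)/h)
  fit <- lm(y ~ I(x - x0), weights = w)   # local linear fit in (x_i - x0)
  unname(coef(fit)[1])                    # beta_0_hat = m_hat(x0)
}
grid <- seq(0, 1, length.out = 100)
mhat <- sapply(grid, loclin)
plot(x, y); lines(grid, mhat, col = 2)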

CMB data

Red: Kernel smoother (p = 0)
Green: Local linear regression (p = 1)

Boundary bias: kernel estimator

Let $\ell_i(x) = \omega_i(x)/\sum_j \omega_j(x)$, so that

$$\hat m(x) = \sum_i \ell_i(x) Y_i$$

For the kernel smoother ($p = 0$), the bias of the linear smoother is thus

$$\mathrm{Bias} = E(\hat m(x)) - m(x) = m'(x)\sum_i (x_i - x)\ell_i(x) + \frac{m''(x)}{2}\sum_i (x_i - x)^2\ell_i(x) + R,$$

Boundary bias: Kernel estimator

The first term in the expansion is equal to

$$m'(x)\,\frac{\sum_i (x_i - x) K\big(\frac{x - x_i}{h}\big)}{\sum_j K\big(\frac{x - x_j}{h}\big)}$$

- vanishes if the design points $x_i$ are centred symmetrically around $x$
- does not vanish if $x$ sits at the boundary (all $x_i - x$ will have the same sign)

Boundary bias: polynomial estimator

- $m(x)$ is truly a local polynomial of degree $p$
- There are at least $p + 1$ points with positive weights in the neighbourhood of $x$

The bias will hence be of order

$$\mathrm{Bias} = E(\hat m(x)) - m(x) = \frac{m^{(p+1)}(x)}{(p+1)!}\sum_j (x_j - x)^{p+1}\ell_j(x) + R$$

Why not choose p = 20?

Boundary bias: polynomial estimator

- $Y_i = m(x_i) + \sigma\varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0,1)$
- The variance of the linear smoother, $\hat m(x) = \sum_j \ell_j(x)Y_j$, is

$$\mathrm{Var}(\hat m(x)) = \sigma^2 \sum_j \ell_j^2(x) = \sigma^2\|\ell(x)\|^2$$

- $\|\ell(x)\|^2$ tends to be large if $p$ is large
- In practice, $p = 1$ is a good choice

Example: Doppler function

$$m(x) = \sqrt{x(1-x)}\,\sin\Big(\frac{2.1\pi}{x + 0.05}\Big), \quad 0 \le x \le 1.$$

Example: Doppler function

> library(KernSmooth)   # provides dpill() and locpoly()
> n <- 1000
> x <- seq(0, 1, length=n)
> m <- sqrt(x*(1-x))*sin(2.1*pi/(x+0.05))
> plot(x, m, type='l')
> y <- m + rnorm(n)*0.075
> plot(x, y)
> fit <- locpoly(x, y, bandwidth=dpill(x,y)*2, degree=1)
> lines(fit, col=2)
> plot(x, y)
> fit2 <- locpoly(x, y, bandwidth=dpill(x,y)/2, degree=1)
> lines(fit2, col=2)
> plot(x, y)
> fit3 <- locpoly(x, y, bandwidth=dpill(x,y)/4, degree=1)
> lines(fit3, col=2)

Example: Doppler function

Penalised regression

Penalised regression

Regression model, $i = 1, \dots, n$:

$$Y_i = m(x_i) + \varepsilon_i, \qquad E(\varepsilon_i) = 0$$

Estimating $m$ by choosing $\hat m(x)$ to minimize

$$\sum_{i=1}^n (Y_i - \hat m(x_i))^2$$

leads to
- the linear regression estimate if minimizing over all linear functions
- an interpolation of the data if minimizing over all functions.

Penalised regression

Estimate $m$ by choosing $\hat m(x)$ to minimize

$$\sum_{i=1}^n (Y_i - \hat m(x_i))^2 + \lambda J(\hat m),$$

$J(m)$: roughness penalty. Typically

$$J(m) = \int (m''(x))^2\,dx.$$

The parameter $\lambda$ controls the trade-off between fit and penalty
- For $\lambda = 0$: interpolation
- For $\lambda \to \infty$: the linear least squares line

Example: Doppler function

Splines

Splines

- Kernel regression
  - Researcher isn't interested in actually calculating $\hat m(x)$ for a single location $x$
  - $\hat m(x)$ is calculated on a sufficiently small grid of $x$-values
  - Curve obtained by interpolation
- Local polynomial regression: the unknown mean function was assumed to be locally well approximated by a polynomial
- Alternative approach
  - Represent the fit as a piecewise polynomial
  - Pieces connecting at points called knots
  - Once the knots are selected, an estimator can be computed globally, in a manner similar to that for a parametrically specified mean function

This is the idea behind splines

Splines

IID sample $(X_1,Y_1), (X_2,Y_2), \dots, (X_n,Y_n)$ coming from the model

$$Y_i = m(X_i) + \varepsilon_i$$

Want to estimate the mean of the variable $Y$ with $m(x) = E(Y \mid X = x)$

A very naive estimator of $E(Y \mid X = x)$ would be the sample mean of the $Y_i$:

$$\hat m(x) = \frac{\sum_{i=1}^n Y_i}{n}$$

Not very good (same for all $x$)

Splines

Approximate $m$ by piecewise polynomials, each on a small interval:

$$m(x) = \begin{cases} c_1 & \text{if } x < \xi_1 \\ c_2 & \text{if } \xi_1 \le x < \xi_2 \\ \vdots \\ c_k & \text{if } \xi_{k-1} \le x < \xi_k \\ c_{k+1} & \text{if } x \ge \xi_k \end{cases}$$

Splines

Use more general lines, which join at the $\xi$s:

$$m(x) = \begin{cases} a_1 + b_1 x & \text{if } x < \xi_1 \\ a_2 + b_2 x & \text{if } \xi_1 \le x < \xi_2 \\ \vdots \\ a_k + b_k x & \text{if } \xi_{k-1} \le x < \xi_k \\ a_{k+1} + b_{k+1} x & \text{if } x \ge \xi_k \end{cases}$$

The $a$s and $b$s are such that the lines join at each $\xi$.

Splines

Approximate $m(x)$ by polynomials

$$m(x) = \begin{cases} \sum_{j=0}^p \beta_{1,j} x^j & \text{if } x < \xi_1 \\ \sum_{j=0}^p \beta_{2,j} x^j & \text{if } \xi_1 \le x < \xi_2 \\ \vdots \\ \sum_{j=0}^p \beta_{k,j} x^j & \text{if } \xi_{k-1} \le x < \xi_k \\ \sum_{j=0}^p \beta_{k+1,j} x^j & \text{if } x \ge \xi_k \end{cases}$$

The $\beta_j$s are such that the polynomials join at each $\xi$ and the approximation has $p - 1$ continuous derivatives.

Splines which are piecewise polynomials of degree $p$ are called
- Splines of order $p + 1$
- Splines of degree $p$
- $\xi$: knots

Splines

Piecewise constant splines

Knots

How many knots should we have?
- Choose a lot of knots, well spread over the data range → reduce the bias of the estimator
- If we make it too local → the estimator will be too wiggly
- Overcome the bias problem without increasing the variance → take a lot of knots, but constrain their influence
- We can do this using penalised regression

Spline order

What order spline should we use?
- Increase the value of $p$ → make the estimator $\hat m_p$ smoother (since it has $p - 1$ continuous derivatives)
- If we take $p$ too large → increase the number of parameters to estimate
- In practice it is rarely useful to take $p > 3$
- $p = 2$
  - Splines of order three, or quadratic splines
- $p = 3$
  - Splines of order 4, or cubic splines
- A $p$-th order spline is a piecewise degree-$(p-1)$ polynomial with $p - 2$ continuous derivatives at the knots

Natural splines

- Natural spline: a spline that is constrained to be linear beyond the boundary knots
- Why this constraint?
  - We usually have very few observations beyond the two extreme knots
  - Want to obtain an estimator of the regression curve there
  - Cannot reasonably estimate anything correct there
  - Rather use a simplified model (e.g. linear)
  - Often gives more or less reasonable results

Natural cubic splines

$\xi_1 < \xi_2 < \dots < \xi_n$ is a set of ordered points, so-called knots, contained in an interval $(a,b)$.

A cubic spline is a continuous function $m$ such that
(i) $m$ is a cubic polynomial over $(\xi_1, \xi_2), \dots$, and
(ii) $m$ has continuous first and second derivatives at the knots.

The solution to

$$\sum_{i=1}^n (y_i - \hat m(x_i))^2 + \lambda \int (m''(x))^2\,dx$$

is a natural cubic spline with knots at the data points.

$\hat m(x)$ is called a smoothing spline.

Natural cubic splines

- Sequence of values $f_1, \dots, f_n$ at specified locations $x_1 < x_2 < \dots < x_n$
- Find a smooth curve $g(x)$ that passes through the points $(x_1, f_1), (x_2, f_2), \dots, (x_n, f_n)$
- The natural cubic spline $g$ is an interpolating function that satisfies the following conditions:
  (i) $g(x_j) = f_j$, $j = 1, \dots, n$,
  (ii) $g(x)$ is cubic on each subinterval $(x_k, x_{k+1})$, $k = 1, \dots, (n-1)$,
  (iii) $g(x)$ is continuous and has continuous first and second derivatives,
  (iv) $g''(x_1) = g''(x_n) = 0$.

B-splines

- Need a basis for natural polynomial splines
- Convenient is the so-called B-spline basis
- Data points $a = \xi_0 < \xi_1 < \xi_2 < \dots < \xi_n \le \xi_{n+1} = b$ in $(a,b)$
  - There are $n + 2$ real values
  - The $n \ge 0$ interior points are called 'interior knots' or 'control points'
  - And there are always two endpoints, $\xi_0$ and $\xi_{n+1}$
- When the knots are equidistant they are said to be 'uniform'

B-splines

Now define new knots $\tau$ as
- $\tau_1 \le \dots \le \tau_p = \xi_0 = a$
- $\tau_{j+p} = \xi_j$, $j = 1, \dots, n$
- $b = \xi_{n+1} = \tau_{n+p+1} \le \tau_{n+p+2} \le \dots \le \tau_{n+2p}$

- $p$: degree of the polynomial
- $p + 1$ is the order of the spline
- Append the lower and upper boundary knots $\xi_0$ and $\xi_{n+1}$ $p$ times
- Needed due to the recursive nature of B-splines

B-splines

Define recursively

- For $k = 0$ and $i = 1, \dots, n + 2p - 1$:

$$B_{i,0}(x) = \begin{cases} 1 & \tau_i \le x < \tau_{i+1} \\ 0 & \text{otherwise} \end{cases}$$

- For $k = 1, 2, \dots, p$ and $i = 1, \dots, n + 2p - k - 1$:

$$B_{i,k}(x) = \frac{x - \tau_i}{\tau_{i+k} - \tau_i} B_{i,k-1}(x) + \frac{\tau_{i+k+1} - x}{\tau_{i+k+1} - \tau_{i+1}} B_{i+1,k-1}(x)$$

The support of $B_{i,k}(x)$ is $[\tau_i, \tau_{i+k+1}]$.
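A hand-rolled R sketch of this recursion, with any 0/0 terms arising from repeated knots treated as zero; the knot sequence below, with the boundary knots repeated p + 1 rather than p times, is an assumption chosen so that the basis sums to one on the whole interval.

bspline <- function(x, tau, i, k) {
  # B_{i,k}(x) on knot sequence tau, degree k (k = 0: indicator of [tau_i, tau_{i+1}))
  if (k == 0) return(as.numeric(tau[i] <= x & x < tau[i + 1]))
  d1 <- tau[i + k] - tau[i]
  d2 <- tau[i + k + 1] - tau[i + 1]
  a  <- if (d1 > 0) (x - tau[i]) / d1 * bspline(x, tau, i, k - 1) else 0
  b  <- if (d2 > 0) (tau[i + k + 1] - x) / d2 * bspline(x, tau, i + 1, k - 1) else 0
  a + b
}
# Example: cubic B-splines (p = 3) on [0, 1] with interior knots 0.25, 0.5, 0.75
p   <- 3
tau <- c(rep(0, p + 1), c(0.25, 0.5, 0.75), rep(1, p + 1))   # assumed knot construction
xg  <- seq(0, 1 - 1e-9, length.out = 200)
B   <- sapply(1:(length(tau) - p - 1), function(i) bspline(xg, tau, i, p))
range(rowSums(B))   # the basis functions sum to 1 on [0, 1)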

B-splines

Solving

- Solution depends on the regularization parameter $\lambda$
  - Determines the amount of "roughness"
- Choosing $\lambda$ isn't necessarily intuitive
- Degrees of freedom = trace of the smoothing matrix $S$
  - Sum of the eigenvalues

$$S = B(B^T B + \lambda\Omega)^{-1}B^T$$

- Monotone relationship between df and $\lambda$
- Search for a value of $\lambda$ giving the desired df
  - df = 2 → linear regression
  - df = n → interpolate data exactly
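An R sketch of this monotone relationship using smooth.spline(), which accepts a target df and searches for the corresponding λ internally; the data reuse the Doppler simulation from the earlier slide, and the df values here are arbitrary choices.

n <- 1000; x <- seq(0, 1, length = n)
m <- sqrt(x*(1-x)) * sin(2.1*pi/(x + 0.05))
y <- m + rnorm(n) * 0.075                      # Doppler simulation as on the earlier slide
fit_smooth <- smooth.spline(x, y, df = 10)     # small df: heavily smoothed fit
fit_rough  <- smooth.spline(x, y, df = 100)    # large df: much closer to interpolation
c(fit_smooth$lambda, fit_rough$lambda)         # larger df corresponds to smaller lambda
plot(x, y); lines(fit_smooth, col = 4); lines(fit_rough, col = 2)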

Example: Doppler function

Example: Doppler function

Could of course choose λ by LOO-CV or GCV

Cross validation

> plot(x,y)
> fitcv <- smooth.spline(x,y,cv=T)
> lines(fitcv,col=2)
> fitcv
Call:
smooth.spline(x = x, y = y, cv = T)

Smoothing Parameter  spar= 0.157514  lambda= 2.291527e-08 (16 iterations)
Equivalent Degrees of Freedom (Df): 124.738
Penalized Criterion: 6.071742
PRESS: 0.007898575

Generalised cross validation

> plot(x,y)
> fitgcv <- smooth.spline(x,y,cv=F)
> lines(fitgcv,col=4)
> fitgcv
Call:
smooth.spline(x = x, y = y, cv = F)

Smoothing Parameter  spar= 0.1597504  lambda= 2.378386e-08 (15 iterations)
Equivalent Degrees of Freedom (Df): 124.2353
Penalized Criterion: 6.078626
GCV: 0.007925571

Multivariate smoothing

Multivariate smoothing

- So far we have only considered univariate functions
- Suppose there are several predictors that we would like to treat nonparametrically
- Most 'interesting' statistical problems nowadays are high-dimensional with, easily, $p > 1000$
  - Biology: microarrays, gene maps, network inference
  - Finance: prediction from multivariate time series
  - Physics: climate models
- Can we just extend the methods and model functions $\mathbb{R}^p \mapsto \mathbb{R}$ nonparametrically?

Curse of dimensionality

- One might consider multidimensional smoothers aimed at estimating:

$$Y = m(x_1, x_2, \dots, x_p)$$

- The methods considered rely on 'local' approximations
  - Examine behaviour of data points in the neighbourhood of the point of interest
- What is 'local' and 'neighbourhood' if $p \to \infty$ and $n$ stays constant?

Curse of dimensionality

$$x = (x^{(1)}, x^{(2)}, \dots, x^{(p)}) \in [0,1]^p.$$

To get 5% of all $n$ sample points into a cube-shaped neighbourhood of $x$, we need a cube with side length $0 < \ell < 1$ such that

$$\ell^p \ge 0.05, \quad \text{i.e.} \quad \ell \ge 0.05^{1/p}$$

Dimension p    Side length ℓ
1              0.05
2              0.22
5              0.54
10             0.74
1000           0.997
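The side lengths in the table can be reproduced directly (a one-line illustration):

p <- c(1, 2, 5, 10, 1000)
round(0.05^(1/p), 3)   # side length capturing 5% of the volume: 0.050 0.224 0.549 0.741 0.997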

Additive models

Require the function $m : \mathbb{R}^p \mapsto \mathbb{R}$ to be of the form

$$m_{add}(x) = \mu + m_1(x^{(1)}) + m_2(x^{(2)}) + \dots + m_p(x^{(p)}) = \mu + \sum_{j=1}^p m_j(x^{(j)}), \quad \mu \in \mathbb{R}$$

$m_j(\cdot) : \mathbb{R} \mapsto \mathbb{R}$ is just a univariate nonparametric function, with

$$E[m_j(x^{(j)})] = 0, \quad j = 1, \dots, p$$

- Choice of smoother is left open
- Avoids the curse of dimensionality → 'less flexible'
- Functions can be estimated by 'backfitting'

Backfitting

Data $x^{(j)}_i$, $1 \le i \le n$ and $1 \le j \le p$

A linear smoother for variable $j$ can be described by an $n \times n$ matrix $S^{(j)}$, so that

$$\hat m_j = S^{(j)} Y,$$

- $Y = (Y_1, \dots, Y_n)^T$: observed vector of responses
- $\hat m_j = (\hat m_j(x^{(j)}_1), \dots, \hat m_j(x^{(j)}_n))$: regression fit
- $S^{(j)}$: smoother with bandwidth estimated by LOO-CV or GCV

Backfitting

$$m_{add}(x) = \mu + \sum_{j=1}^p m_j(x^{(j)}),$$

Suppose $\hat\mu$ and $\hat m_k$ are given for all $k \ne j$:

$$m_{add}(x_i) = \Big(\mu + \sum_{k \ne j} m_k(x^{(k)}_i)\Big) + m_j(x^{(j)}_i)$$

Now apply the smoother $S^{(j)}$ to

$$Y - \Big(\hat\mu + \sum_{k \ne j} \hat m_k\Big)$$

Cycle through all $j = 1, \dots, p$ to get

$$\hat m_{add}(x_i) = \hat\mu + \sum_{j=1}^p \hat m_j(x^{(j)}_i).$$

Backfitting

1. Use $\hat\mu \leftarrow n^{-1}\sum_{i=1}^n Y_i$. Start with $\hat m_j \equiv 0$ for all $j = 1, \dots, p$

2. Cycle through the indices $j = 1, 2, \dots, p, 1, 2, \dots, p, \dots$:

$$\hat m_j \leftarrow S^{(j)}\Big(Y - \hat\mu\mathbf{1} - \sum_{k \ne j} \hat m_k\Big).$$

Also normalize

$$\hat m_j(\cdot) \leftarrow \hat m_j(\cdot) - n^{-1}\sum_{i=1}^n \hat m_j(x^{(j)}_i)$$

and update $\hat\mu \leftarrow n^{-1}\sum_{i=1}^n \big(Y_i - \sum_k \hat m_k(x^{(k)}_i)\big)$. Stop the iterations if the functions do not change very much.

3. Return

$$\hat m_{add}(x_i) \leftarrow \hat\mu + \sum_{j=1}^p \hat m_j(x^{(j)}_i)$$
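A minimal backfitting sketch in R using smooth.spline() as the univariate smoother $S^{(j)}$; the simulated additive data, the fixed df, and the convergence tolerance are all assumptions for illustration.

set.seed(4)
n <- 500; p <- 2
X <- matrix(runif(n * p), n, p)                        # assumed simulation
Y <- sin(2*pi*X[, 1]) + (X[, 2] - 0.5)^2 + rnorm(n, sd = 0.2)

mu   <- mean(Y)                        # step 1: initialise
mhat <- matrix(0, n, p)                # columns hold m_j evaluated at the data
for (it in 1:20) {                     # step 2: cycle through j = 1, ..., p
  old <- mhat
  for (j in 1:p) {
    r <- Y - mu - rowSums(mhat[, -j, drop = FALSE])    # partial residuals
    fit <- smooth.spline(X[, j], r, df = 6)            # univariate smoother S^(j)
    mhat[, j] <- predict(fit, X[, j])$y
    mhat[, j] <- mhat[, j] - mean(mhat[, j])           # normalise to mean zero
  }
  mu <- mean(Y - rowSums(mhat))                        # update the intercept
  if (max(abs(mhat - old)) < 1e-6) break               # stop if functions barely change
}
fitted_add <- mu + rowSums(mhat)       # step 3: the additive fit at the data points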

Example: Ozone data

Example: Ozone data

Iteration 1

Iteration 2

Iteration 3

Iteration 7