Linear Regression


Lecture Notes for 36-707 Linear Regression

Fall 2010

Revised from Larry Wasserman’s 707 notes

1 Prediction

Regression analysis is used to answer questions about how one variable depends on the level of one or more other variables. Does diet correlate with cholesterol level, and does this relationship depend on other factors, such as age, smoking status, and level of exercise?

We start by studying linear regression. Virtually all other methods for studying dependence among variables are variations on the idea of linear regression. For this reason the textbook focuses on linear regression. The notes start with linear regression and then go on to cover other modern regression-based techniques, such as nonparametric regression, generalized linear regression, tree-based methods, and classification.

In the simplest scenario we have one response variable (Y) and one predictor variable (X). For example, we might predict a son's height, based on his father's height (Figure 1). Or we might predict a cat's heart weight, based on its total body weight (Figure 2).

Suppose (X, Y) have a joint distribution f(x, y). You observe X = x. What is your best prediction of Y? Let g(x) be any prediction function, for instance a linear relationship. The prediction error (or risk) is

R(g) = E(Y − g(X))²

where E is the expected value with respect to the joint distribution f(x, y). Condition on X = x and let

r(x) = E(Y | X = x) = ∫ y f(y|x) dy

be the regression function.

Let ε = Y − r(X). Then

E(ε) = E[ E(Y − r(X) | X) ] = 0

and we can write

Y = r(X) + ε.    (1)

Key result: for any g, the regression function r(x) minimizes the prediction error:

R(r) ≤ R(g).

We don't know r(x), so we estimate it from the data. This is the fundamental problem in regression analysis.
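As a quick illustration (not part of the original notes), here is a small R simulation of the key result, using an arbitrary made-up model with r(x) = 2 + 3x; the risk of r is the noise variance, and any other predictor g does worse.

## Monte Carlo check that r(x) = E(Y | X = x) minimizes R(g) = E(Y - g(X))^2.
## The model below is a hypothetical example, not data from the notes.
set.seed(1)
n <- 1e5
x <- runif(n)
y <- 2 + 3*x + rnorm(n, sd = 1)     # so r(x) = 2 + 3x and sigma^2 = 1
r <- function(x) 2 + 3*x            # the true regression function
g <- function(x) 1 + 4*x            # some other prediction function
mean((y - r(x))^2)                  # approximately 1 (the irreducible error)
mean((y - g(x))^2)                  # larger, as the key result predicts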


1.1 Some Terminology

Given data (X1, Y1), . . . , (Xn, Yn) we have two goals:

estimation: Find an estimate r̂(x) of the regression function r(x).
prediction: Given a new X, predict Y; we use Ŷ = r̂(X) as the prediction.

At first we assume that Yi ∈ R. Later in the course, we consider other cases such as Yi ∈ {0, 1}.

             r linear                        r arbitrary
X scalar     r(x) = β0 + β1x                 r(x) is some smooth function
             (simple linear regression)      (nonparametric regression)
X vector     r(x) = β0 + ∑_j βj xj           r(x1, . . . , xp) is some smooth function
             (multiple linear regression)    (multiple nonparametric regression)

Figure 1: Galton data. Predict Son's height from Father's height.

2 Simple Linear Regression: X scalar and r(x) linear

Suppose that Yi ∈ R, Xi ∈ R and that

h    r(x) = β0 + β1x.    (2)


Figure 2: Cat example. X = Body Weight, Y = Heart Weight.

This model is only an approximation to the truth, but oftentimes it is close enough to correct that it is worth seeing what we can learn with a simple model. Later on we'll learn that we needn't assume that r is linear. I use the h symbol to alert you to model-based statements.

We can write

Yi = β0 + β1Xi + εi    (3)

where E(εi) = 0 and ε1, . . . , εn are independent. We also assume that V(εi) = σ² does not depend on x (homoskedasticity). The unknown parameters are β0, β1, σ². Define the residual sum of squares

RSS(β0, β1) = ∑_{i=1}^n (Yi − (β0 + β1Xi))².    (4)

The least squares estimators (LS) minimize: RSS(β0, β1).

2.1 Theorem. The LS estimators are

β̂1 = ∑_{i=1}^n (Xi − X̄)(Yi − Ȳ) / ∑_{i=1}^n (Xi − X̄)²    (5)

β̂0 = Ȳ − β̂1 X̄    (6)

where X̄ = n⁻¹ ∑_{i=1}^n Xi and Ȳ = n⁻¹ ∑_{i=1}^n Yi.

For details, see Weisberg, p. 273.
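A minimal R sketch (not from the notes) computing (5) and (6) directly and checking the result against lm(), using the cats data (MASS package) shown in Figure 2:

## LS estimates by hand vs. lm(), cats data
library(MASS)
x <- cats$Bwt; y <- cats$Hwt
beta1.hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0.hat <- mean(y) - beta1.hat * mean(x)
c(beta0.hat, beta1.hat)
coef(lm(y ~ x))          # same two numbers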


Compare RSS to the risk. The former is an empirical version of the latter calculated for g(X) = β0 + β1X.

We define:

The fitted line: r̂(x) = β̂0 + β̂1x
The predicted or fitted values: Ŷi = r̂(Xi) = β̂0 + β̂1Xi
The residuals: ε̂i = Yi − Ŷi
The residual sum of squares: RSS = ∑_{i=1}^n ε̂i²

An unbiased estimate of σ² is

h    σ̂² = RSS / (n − 2).    (7)

The estimators are random variables and have the following properties (conditional on X1, . . . , Xn):

E(β̂0) = β0,    E(β̂1) = β1,    V(β̂1) = σ² / (n s²x)

where s²x = n⁻¹ ∑_{i=1}^n (Xi − X̄)².

Let's derive some of these facts. Let

di = (Xi − X̄) / ∑_{i=1}^n (Xi − X̄)².

Note that ∑_i di = 0 and ∑_i di(Xi − X̄) = 1. Then

E(β̂1) = E[ ∑_{i=1}^n di (Yi − Ȳ) ]
       = ∑_{i=1}^n di E(Yi) − E(Ȳ) ∑_{i=1}^n di
       = ∑_{i=1}^n di [β0 + β1X̄ + β1(Xi − X̄)]
       = β1

and

V(β̂1) = ∑_{i=1}^n di² V(Yi) = σ² / (n s²x).


E(β̂0) = (1/n) ∑_{i=1}^n (β0 + β1Xi) − X̄ E[β̂1]
       = β0 + β1 (1/n) ∑_{i=1}^n Xi − X̄ E[β̂1]
       = β0.

Also, E(σ̂²) = σ². The standard error is

se(β̂1) = σ / (√n sx).

Both β̂0 and β̂1 are linear combinations of Y1, . . . , Yn, so it follows from the Central Limit Theorem that they are approximately normally distributed.

Approximate Normality

h    β̂0 ∼ N(β0, se²(β̂0)),    β̂1 ∼ N(β1, se²(β̂1))    (8)

If εi ∼ N(0, σ2) then:

1. Equation (8) is exact.

2. The least squares estimators are the maximum likelihood estimators.

3. The variance estimator satisfies

   σ̂² ∼ σ² χ²_{n−2} / (n − 2)

   and E[σ̂²] = σ²(n − 2)/(n − 2) = σ².

Note: To verify these results, again assume calculations are performed conditional on X1, . . . , Xn. Then

Yi ∼ N(β0 + β1xi, σ²).

The likelihood is

f(y1, . . . , yn; β0, β1, σ²) = ∏_{i=1}^n f(yi; β0, β1, σ²).

If we write out the likelihood of the normal model, the result follows directly.


2.1 Inference

It follows from (8) that an approximate 1 − α confidence interval for β1 is

h    β̂1 ± z_{α/2} se(β̂1)    (9)

where z_{α/2} is the upper α/2 quantile of a standard Normal:

P(Z > z_{α/2}) = α/2, where Z ∼ N(0, 1).

For α = .05, z_{α/2} = 1.96 ≈ 2, so an approximate 95 per cent confidence interval for β1 is

β̂1 ± 2 se(β̂1).    (10)

2.2 Remark. If the residuals are Normal, then an exact 1 − α confidence interval for β1 is

β̂1 ± t_{α/2,n−2} se(β̂1)    (11)

where t_{α/2,n−2} is the upper α/2 quantile of a t with n − 2 degrees of freedom. Although technically correct, and used in practice, this interval is bogus. If n is large, t_{α/2,n−2} ≈ z_{α/2}, so one may as well just use the Normal interval. If n is so small that t_{α/2,n−2} is much different from z_{α/2}, then n is too small to be doing statistical inference. (Do you really believe that the residuals are exactly Normal?)

To test

h    H0 : β1 = 0 versus H1 : β1 ≠ 0    (12)

use the test statistic

z = (β̂1 − 0) / se(β̂1).    (13)

Under H0, z ≈ N(0, 1). The p-value is

p-value = P(|Z| > |z|) = 2Φ(−|z|)    (14)

where Z ∼ N(0, 1). Reject H0 if the p-value is small.
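A short R sketch (not in the notes) of the interval (9) and the test (13)-(14), using the cats data again:

## Approximate CI and z-test for the slope, cats data
library(MASS)
fit <- lm(Hwt ~ Bwt, data = cats)
est <- coef(summary(fit))["Bwt", "Estimate"]
se  <- coef(summary(fit))["Bwt", "Std. Error"]
est + c(-1, 1) * qnorm(.975) * se     # approximate 95% CI for beta1
z <- (est - 0) / se                   # test statistic (13)
2 * pnorm(-abs(z))                    # p-value (14)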


2.2 R Statistical Package

R is a flexible, powerful, statistical programming language. This flexibility is its biggest strength, but also its greatest challenge. Like the English language, there are many ways to say the same thing. Throughout these notes examples are provided that help you to learn at least one way to accomplish the tasks at hand. To find out more about any function in R, use help(functionname). To search for a function you don't know the name of, use help.search("keyword"). Refer to the R documents on the class web page.

2.2.1 Matrices

> satdata = matrix(c(61, 70, 63, 72, 66, 1100, 1320, 1500, 1230, 1400),
+   ncol = 2, dimnames = list(rep(" ", 5), c("Height", "SAT")))
> satdata
  Height  SAT
      61 1100
      70 1320
      63 1500
      72 1230
      66 1400

> satdata[,1]
61 70 63 72 66
> satdata[4,]
Height    SAT
    72   1230
> satdata[3,2]
[1] 1500

> score = satdata[,2]
> height = satdata[,1]
> which(score >= 1300)
2 3 5
> any(score > 1510)
[1] FALSE

More commands: max(height) min(score) var(score) sd(height) median(score) sum(height)

2.2.2 Basic Graphs

> plot(satdata, xlab = "Height (inches)", ylab = "SAT Score",
+   main = "Scatterplot of Height vs. SAT Score", pch = 19, col = "red")
> abline(h = mean(satdata[,2]))
> legend(66, 1150, "Mean SAT Score", lty = 1)


Figure 3:

2.2.3 Extras

heightcm = height*2.54    # scalar operators apply to the entire vector

samplescores = rnorm(10, mean = 1200, sd = 100)

ls()
rm(heightcm)

regressionline = lm(score ~ height)
summary(regressionline)

read.table("filename.txt", ....)
# reads in data from a file called filename.txt and creates an R object

2.2.4 More on Plotting

These commands can help you to make attractive graphs with more meaningful labels, etc.

Commands to use before 'plot':

par(mfrow = c(numrow, numcol)): Sets the number of graphs per window.
For example, par(mfrow = c(3, 2)) will make a window for 3 rows and 2 columns
of graphs. You can return your graphing window to its default with
par(mfrow = c(1, 1)).


Graphics Parameters for Plot Commands:

Labels:
  xlab: set the x-axis label
  ylab: set the y-axis label
  main: set the main label
  e.g. plot(x, y, xlab = "Height (cm)", ylab = "Weight (kg)",
       main = "Scatterplot of Height v. Weight")

Points:
  pch: set the shape of plotted points
  cex: set the size of plotted points, 1 is default
  col: set the color of plotted points
  Each of these can be vectors. For example, if you have four points, you could
  set each color as follows:
  plot(x, y, col = c("red", "orange", "cyan", "sandybrown")).

Window modifiers:
  xlim: set the x-range of the window, e.g. plot(..., xlim = c(10, 30))
        will only show points in the x range of 10 to 30.
  ylim: set the y-range of the window.

Some other graph types:
  hist(x): plots a histogram
  pie(y): pie graph for categorical variables
  boxplot(x): plots a boxplot
  barplot(x): bar plot for categorical variables

Plotting Additional Lines:
  abline(intercept, slope): plots a line on the previous graph. You can
  graph vertical lines with abline(v = xvalue) and horizontal lines with
  abline(h = yvalue). Additionally, you can use this command to plot
  regression lines on graphs, e.g.:
    regline = lm(y ~ x)
    plot(x, y)
    abline(regline)
  You can modify the line's width and type:
  lwd: set line width, 1 is default
  lty: set line type
  For example, abline(regline, lwd = 2.5, lty = 2, col = "purple") plots a
  thick dashed purple line on the graph.


Plotting Additional Points:
  points(x, y): plot additional points.
  Useful if you have different vectors for different groups, e.g.
    plot(x1, y1, col = "green")
    points(x2, y2, col = "orange")

Adding a legend:
  legend(xposition, yposition, c("label1", "label2", ...), ...)
  Add a legend to your plot. For example:
    legend(10, 2, c("Freshmen", "Sophomores", "Juniors", "Seniors"),
      col = c("yellow", "orange", "red", "blue"), pch = 19)

2.3 Example. Here is an example based on the relationship between a cat's body weight and heart weight (Figure 2). To get the data used in this example we load the library "MASS". This is the bundle of functions and datasets supporting Venables and Ripley, 'Modern Applied Statistics with S'.

### Cat Example ###
> library(MASS)      # load library containing data
> attach(cats)
> dim(cats)
[1] 144 3            # There are 144 cats in the study.
> help(cats)

The help file displays basic information about a data set. It is very important to learn how the variables in the data set are recorded before proceeding with any analysis. For example, when dealing with population data, some variables might be recorded as a per-capita measurement, while others might be raw counts.

Description
  The heart and body weights of samples of male and female cats used for
  digitalis experiments. The cats were all adult, over 2 kg body weight.

Format
  This data frame contains the following columns:

  Sex   Sex factor. Levels "F" and "M".
  Bwt   Body weight in kg.
  Hwt   Heart weight in g.


> names(cats)
[1] "Sex" "Bwt" "Hwt"
> summary(cats)
 Sex        Bwt             Hwt
 F:47   Min.   :2.000   Min.   : 6.30
 M:97   1st Qu.:2.300   1st Qu.: 8.95
        Median :2.700   Median :10.10
        Mean   :2.724   Mean   :10.63
        3rd Qu.:3.025   3rd Qu.:12.12
        Max.   :3.900   Max.   :20.50

> par(mfrow = c(2,2))    # configure output window for 4 plots, 2x2
> boxplot(Bwt, Hwt, names = c("Body Weight (kg)", "Heart Weight (g)"),
+   main = "Boxplot of Body and Heart Weights")

> regline = lm(Hwt ~ Bwt)
> summary(regline)

Call:
lm(formula = Hwt ~ Bwt)

Residuals:
     Min       1Q   Median       3Q      Max
-3.56937 -0.96341 -0.09212  1.04255  5.12382

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.3567     0.6923  -0.515    0.607
Bwt           4.0341     0.2503  16.119   <2e-16 ***
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Residual standard error: 1.452 on 142 degrees of freedom
Multiple R-squared: 0.6466,     Adjusted R-squared: 0.6441
F-statistic: 259.8 on 1 and 142 DF,  p-value: < 2.2e-16

The most important section of the R summary output is "Coefficients:". Here we see that the fitted regression line is Ĥwt = −0.3567 + 4.0341 · Bwt. The final column gives symbols indicating the significance of a coefficient. For example, the three stars (***) for Bwt indicate that the p-value for the Bwt coefficient is between 0 and 0.001. That is, if Bwt and Hwt were independent, the probability of observing a Bwt coefficient as large in magnitude as 4.0341 is less than 0.001. In fact, we see that the associated p-value is < 2e−16.

Generally we are not concerned with the significance of the intercept coefficient.


> plot(Bwt, Hwt, xlab = "Body Weight (kg)", ylab = "Heart Weight (g)",
+   main = "Scatterplot of Body Weight v. Heart Weight", pch = 19)
> abline(regline, lwd = 2)

> names(regline)
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"
> r = regline$residuals
> plot(Bwt, r, pch = 19, xlab = "Body Weight (kg)", ylab = "Residuals",
+   main = "Plot of Body Weight v. Residuals")
> abline(h = 0, col = "red", lwd = 2)

> r = scale(r)
> qqnorm(r)
> qqline(r)

Plots are shown in Figure 4. A Q-Q plot displays the sample (observed) quantiles against theoretical quantiles. Sample quantiles are scaled residuals. Theoretical quantiles are quantiles drawn from the standard normal distribution. An ideal Q-Q plot has points falling more or less on the diagonal line, indicating that our residuals are approximately normally distributed. If the points fall far from the line, a transformation may improve the reliability of the inferences (more later).

2.4 Example (Election 2000). Background: In 2000 Bush and Gore were the main candidates for President. Buchanan, a strongly conservative candidate, was also on the ballot. In the state of Florida, Bush and Gore essentially tied, hence the counts were examined carefully county by county. Palm Beach County exhibited strange results. Even though the people in this county are not conservative, many votes were cast for Buchanan. Examination of the voting ballot revealed that it was easy to mistakenly vote for Buchanan when intending to vote for Gore. Let's look at the count of votes by county.

Figure 5 shows the plot of votes for Buchanan (Y) versus votes for Bush (X) in Florida. The least squares estimates (omitting Palm Beach County) and the standard errors are

β̂0 = 66.0991,    se(β̂0) = 17.2926
β̂1 = 0.0035,     se(β̂1) = 0.0002.

The fitted line is

Buchanan = 66.0991 + 0.0035 Bush.

Figure 5 also shows the residuals. The inferences from linear regression are most accurate when the residuals behave like random normal numbers. Based on the residual plot, this appears not to be the case in this example. If we repeat the analysis replacing votes with log(votes) we get

β̂0 = −2.3298,    se(β̂0) = 0.3529
β̂1 = 0.7303,     se(β̂1) = 0.0358.


Figure 4:


Figure 5: Voting Data for Election 2000. (Top row, left) Bush versus Buchanan (vertical); (top row, right) Bush versus residuals. The bottom row replaces votes with log votes.

This gives the fit

log(Buchanan) = −2.3298 + 0.7303 log(Bush).

The residuals look much healthier. Later, we shall address the following question: how do we see if Palm Beach County has a statistically plausible outcome?

The statistic for testing H0 : β1 = 0 versus H1 : β1 ≠ 0 is |Z| = |.7303 − 0|/.0358 = 20.40, with a p-value of P(|Z| > 20.40) ≈ 0. This is strong evidence that the true slope is not 0.


2.3 h ANOVA and R²

In the olden days, statisticians were obsessed with summarizing things in analysis of variance (ANOVA) tables. The entries are called "sums of squares", and the sum of squared deviations between the observed and fitted values is called the residual sum of squares (RSS). It works like this. We can write

∑_{i=1}^n (Yi − Ȳ)² = ∑_{i=1}^n (Yi − Ŷi)² + ∑_{i=1}^n (Ŷi − Ȳ)²

SStotal = RSS + SSreg

Then we create this table:

Source       df      SS        MS            F
Regression   1       SSreg     SSreg/1       MSreg/MSE
Residual     n − 2   RSS       RSS/(n − 2)
Total        n − 1   SStotal

Under H0 : β1 = 0, F ∼ F_{1,n−2}. This is just another (equivalent) way to test this hypothesis. The coefficient of determination is

R² = SSreg / SStot = 1 − RSS / SStot    (15)

This is the amount of variability in Y explained by X. Also, R² = r² where

r = ∑_{i=1}^n (Yi − Ȳ)(Xi − X̄) / √( ∑_{i=1}^n (Yi − Ȳ)² ∑_{i=1}^n (Xi − X̄)² )

is the sample correlation. This is an estimate of the correlation

ρ = E[(X − µX)(Y − µY)] / (σX σY).

Note that −1 ≤ ρ ≤ 1.
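A quick R check (not from the notes) of the identity R² = r² in (15), using the cats regression from Section 2.2:

## R^2 from lm() equals the squared sample correlation
library(MASS)
fit <- lm(Hwt ~ Bwt, data = cats)
summary(fit)$r.squared       # about 0.6466, as in the summary output above
cor(cats$Bwt, cats$Hwt)^2    # same value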

2.4 Prediction Intervals

Given new value X∗, we want to predict

Y∗ = β0 + β1X∗ + ε.


The prediction is

Ŷ∗ = β̂0 + β̂1X∗.    (16)

The standard error of the estimated regression line at X∗ is

se_line(Ŷ∗) = σ̂ √( 1/n + (X∗ − X̄)² / ∑_{i=1}^n (Xi − X̄)² ).    (17)

The variance of a predicted value at X∗ is σ² plus the variance of the estimated regression line at X∗. Hence

se_pred(Ŷ∗) = σ̂ √( 1 + 1/n + (X∗ − X̄)² / ∑_{i=1}^n (Xi − X̄)² ).    (18)

A confidence interval for Y∗ is

h    Ŷ∗ ± z_{α/2} se_pred(Ŷ∗).

2.5 Remark. This is not really the standard error of the quantity Ŷ∗. It is the standard error of β0 + β1X∗ + ε∗. Note that se_pred(Ŷ∗) does not go to 0 as n → ∞. Why?
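In R, both intervals can be obtained with predict(); a small sketch (not in the notes) on the cats data. Note that predict() uses t rather than Normal quantiles, which makes essentially no difference here since n = 144.

## Prediction and confidence intervals at a new X*, cats data
library(MASS)
fit <- lm(Hwt ~ Bwt, data = cats)
new <- data.frame(Bwt = 3.0)                            # a hypothetical new body weight
predict(fit, newdata = new, interval = "prediction")    # uses se_pred, as in (18)
predict(fit, newdata = new, interval = "confidence")    # uses se_line, as in (17)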

2.6 Example (Election Data Revisited). On the log scale, our linear regression gives the following prediction equation:

log(Buchanan) = −2.3298 + 0.7303 log(Bush).

In Palm Beach, Bush had 152,954 votes and Buchanan had 3,467 votes. On the log scale this is 11.93789 and 8.151045. How likely is this outcome, assuming our regression model is appropriate? Our prediction for log Buchanan votes is −2.3298 + .7303(11.93789) = 6.388441. Now, 8.151045 is bigger than 6.388441, but is it "significantly" bigger? Let us compute a prediction interval. We find that se_pred = .093775 and the approximate 95 per cent prediction interval is (6.200, 6.578), which clearly excludes 8.151. Indeed, 8.151 is nearly 20 standard errors from Ŷ∗. Going back to the vote scale by exponentiating, the interval is (493, 717), compared to the actual number of votes, which is 3,467.

2.5 Confidence Bands

2.7 Theorem (Scheffé, 1959). Let

I(x) = ( r̂(x) − c σ̂, r̂(x) + c σ̂ )    (19)

where

r̂(x) = β̂0 + β̂1x

c = √(2 F_{α,2,n−2}) √( 1/n + (x − x̄)² / ∑_i (xi − x̄)² ).


Then,

h    P( r(x) ∈ I(x) for all x ) ≥ 1 − α.    (20)

2.8 Example. Let us return to the cat example. The R code is:

library(MASS)
attach(cats)
plot(Bwt, Hwt, xlab = "Body Weight (kg)", ylab = "Heart Weight (g)",
  main = "Body Weight vs. Heart Weight in Cats")
regression.line = lm(Hwt ~ Bwt)

abline(regression.line, lwd = 3)
r = regression.line$residuals
n = length(Bwt)
x = seq(min(Bwt), max(Bwt), length = 1000)
# Creates a sequence of 1000 numbers equally spaced between
# the smallest and largest body weights.

d = qf(.95, 2, n - 2)
# Finds the critical value (quantile of .95) for an F distribution with degrees
# of freedom 2 and n-2. All major distributions have a 'q' function, e.g.
# qt, qnorm, qbinom, etc.
beta = regression.line$coeff
xbar = mean(Bwt)
ssx = sum( (Bwt - xbar)^2 )
sigma.hat = sqrt(sum(r^2)/(n - 2))
stuff = sqrt(2*d)*sqrt( (1/n) + ((x - xbar)^2/ssx) )*sigma.hat
### Important: Note that these are all scalars except that x is a vector.

r.hat = beta[1] + beta[2]*x
upper = r.hat + stuff
lower = r.hat - stuff
lines(x, upper, lty = 2, col = 2, lwd = 3)
lines(x, lower, lty = 2, col = 2, lwd = 3)   # the lower band, plotted the same way

The bands are shown in Figure 6.


Figure 6: Confidence Band for Cat Example.

2.6 Why Are We Doing This If The Model is Wrong?

The model Y = β0 + β1x + ε is certainly false. There is no reason why r(x) should be exactly linear. Nonetheless, the linear assumption might be adequate. But how do we assess whether the linear assumption is adequate? There are three ways.

1. We can do a goodness-of-fit test.

2. We can do a nonparametric regression that does not assume linearity.

3. We can take a purely predictive point of view and regard β̂0 + β̂1x as an estimate of the best linear predictor, not as an estimate of the true regression function.

We will return to these points later.

3 Association Versus Causation

There is much confusion about the difference between causation and association. Roughly speaking, the statement "X causes Y" means that changing the value of X will change the distribution


of Y. When X causes Y, X and Y will be associated, but the reverse is not, in general, true. Association does not necessarily imply causation.

For example, there is a strong linear relationship between death rate due to breast cancer and fat intake. So,

RISK OF DEATH = β0 + β1 FAT + ε    (21)

where β1 > 0. Does that mean that FAT causes breast cancer? Consider two interpretations of (21).

ASSOCIATION (or correlation). Fat intake and breast cancer are associated. Therefore, if I observe someone's fat intake, I can use equation (21) to predict their chance of dying from breast cancer.

CAUSATION. Fat intake causes breast cancer. Therefore, if I observe someone's fat intake, I can use equation (21) to predict their chance of dying from breast cancer. Moreover, if I change someone's fat intake by one unit, their risk of death from breast cancer changes by β1.

If the data are from a randomized study (X is randomly assigned), then the causal interpretation is correct. If the data are from an observational study (X is not randomly assigned), then the association interpretation is correct. To see why the causal interpretation is wrong in the observational study, suppose that people with high fat intake are the rich people. And suppose, for the sake of the example, that rich people smoke a lot. Further, suppose that smoking does cause cancer. Then it will be true that high fat intake predicts high cancer rate. But changing someone's fat intake will not change their cancer risk.

How can we make these ideas precise? The answer is to use either counterfactuals or directed acyclic graphs.

Look at the top left plot in Figure 7. These are observed data on vitamin C (X) and colds (Y). You conclude that increasing vitamin C decreases colds. You tell everyone to take more vitamin C, but the prevalence of colds stays the same. Why? Look at the second plot. The dotted lines show the counterfactuals. The counterfactual yi(x) is the value of Y that person i would have had if they had taken dose X = x. Note that

Yi = yi(Xi). (22)

In other words:

Yi is the function yi(·) evaluated at Xi.

The causal regression is the average of the counterfactual curves yi(x):


Figure 7: Causation. For each of the two examples, the panels show the observed data (x vs. y), the counterfactuals (x vs. y), and the causal regression function (x vs. c(x)).

c(x) = E(yi(x)). (23)

The average is over the population. In other words, fix a value of x, then average yi(x) over all individuals. In general:

r(x) ≠ c(x)    association does not equal causation    (24)

In this example, changing everyone's dose does not change the outcome. The causal regression curve c(x) is shown in the third plot. In the second example (right side of Figure 7) it is worse. You tell everyone to take more vitamin C but the prevalence of colds increases.

Suppose that we randomly assign dose X. Then Xi is independent of the counterfactuals {yi(x) : x ∈ R}. In that case:

c(x) = E(y(x))                                                   (25)
     = E(y(x) | X = x)    since X is indep. of {y(x) : x ∈ R}    (26)
     = E(Y | X = x)       since Y = y(X)                         (27)
     = r(x).                                                     (28)

Thus, if X is randomly assigned then association is equal to causation. In an observational (non-randomized) study, the best we can do is try to measure confounding variables. These are variables that affect both X and Y. If we can find all the confounding variables Z, then {y(x) : x ∈ R} is independent of X given Z. In other words, given Z, the problem is like a randomized experiment. Consider the breast cancer scenario. Suppose Z = 0 for the poor and Z = 1 for the rich, and that smoking (not fat) causes cancer. If the rich smoke and the poor do not, then c(x) = β0 + 0 × x + β2z. Formally,

c(x) = E(y(x))                                                                        (29)
     = ∫ E(y(x) | Z = z) f(z) dz                                                      (30)
     = ∫ E(y(x) | Z = z, X = x) f(z) dz    since X is indep. of {y(x) : x ∈ R} given Z    (31)
     = ∫ E(Y | X = x, Z = z) f(z) dz                                                  (32)
     = ∫ (β1x + β2z) f(z) dz    if linear                                             (33)
     = β1x + β2 E(Z).                                                                 (34)

This is called adjusting for confounders. Specifically, we regress Y on X and Z, obtaining β̂1x + β̂2z, which approximates c(x). Of course, we can never be sure we have included all confounders. This is why observational studies have to be treated with caution.

Note the following difference:

c(x) = ∫ E(Y | Z = z, X = x) f(z) dz    (35)

E(Y | X = x) = ∫ E(Y | Z = z, X = x) f(z|x) dz.    (36)

In the former, f(z) smooths over the distribution of the confounding variable. In the latter, if X and Z are correlated, f(z|x) does not smooth over the likely spectrum of values. This enhances the impression that X causes Y, when in fact it might be that Z causes Y.
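A small simulation (an illustration, not from the notes) of the phenomenon: a confounder Z drives both X and Y, so the regression of Y on X shows a strong slope even though X has no causal effect, while adjusting for Z removes it.

## Confounding: r(x) is not c(x)
set.seed(2)
n <- 5000
z <- rnorm(n)           # confounder (e.g., wealth)
x <- z + rnorm(n)       # "exposure" driven by z
y <- 2*z + rnorm(n)     # outcome depends only on z, not on x
coef(lm(y ~ x))         # slope near 1: pure association
coef(lm(y ~ x + z))     # slope on x near 0 after adjusting for the confounder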

4 Review of Linear Algebra

Before starting multiple regression, we will briefly review some linear algebra. Read pages 278-287 of Weisberg. The inner product of two vectors x and y is

⟨x, y⟩ = xᵀy = ∑_j xj yj.

Two vectors are orthogonal if ⟨x, y⟩ = 0. We then write x ⊥ y. The norm of a vector is

||x|| = √⟨x, x⟩ = √( ∑_j xj² ).

If A is a matrix, denote its inverse by A⁻¹ and its transpose by Aᵀ. The trace of a square matrix, denoted by tr(A), is the sum of its diagonal elements.


PROJECTIONS. We will make extensive use of projections. Let us start with a simple example. Let

e1 = (1, 0), e2 = (0, 1)

and note that R² is the linear span of e1 and e2: any vector (a, b) ∈ R² is a linear combination of e1 and e2. Let

L = {a e1 : a ∈ R}

be the set of vectors of the form (a, 0). Note that L is a linear subspace of R². Given a vector x = (a, b) ∈ R², the projection x̂ of x onto L is the vector in L that is closest to x. In other words, x̂ minimizes ||x − x̂|| among all vectors in L. Another characterization of x̂ is this: it is the unique vector such that (i) x̂ ∈ L and (ii) x − x̂ ⊥ y for all y ∈ L.

It is easy to see, in our simple example, that the projection of x = (a, b) is just (a, 0). Note that we can write

x̂ = Px

where

P = [ 1  0
      0  0 ].

This is the projection matrix. In general, given a vector space V and a linear subspace L, there is a projection matrix P that maps any vector v into its projection Pv.

The projection matrix satisfies these properties:

• Pv exists and is unique.

• P is linear: if a and b are scalars then P(ax + by) = aPx + bPy.

• P is symmetric.

• P is idempotent: P² = P.

• If x ∈ L then Px = x.

Now let X be some n × q matrix and suppose that XᵀX is invertible. The column space is the space L of all vectors that can be obtained by taking linear combinations of the columns of X. It can be shown that the projection matrix for the column space is

P = X(XᵀX)⁻¹Xᵀ.

Exercise: check that P is idempotent and that if x ∈ L then Px = x.
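A minimal numerical version of this exercise (not from the notes), using a small random X:

## Check: P = X (X'X)^{-1} X' is idempotent and fixes vectors in col(X)
set.seed(3)
X <- cbind(1, matrix(rnorm(20), ncol = 2))   # n = 10, q = 3
P <- X %*% solve(t(X) %*% X) %*% t(X)
max(abs(P %*% P - P))      # ~ 0: idempotent
v <- X %*% c(1, -2, 0.5)   # some vector in the column space of X
max(abs(P %*% v - v))      # ~ 0: Pv = v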

Recall that E[∑_{i=1}^n ai Yi] = ∑_{i=1}^n ai E[Yi] and V[∑_{i=1}^n ai Yi] = ∑_{i=1}^n ai² V[Yi] + ∑_{i<j} 2 ai aj Cov[Yi, Yj].


RANDOM VECTORS. Let Y be a random vector. Denote the mean vector by µ and the covariance matrix by Σ = V(Y) (also written Cov(Y)). If a is a vector then

E(aᵀY) = aᵀµ,    V(aᵀY) = aᵀΣa.    (37)

If A is a matrix then

E(AY) = Aµ,    V(AY) = AΣAᵀ.    (38)

5 Multiple Linear Regression

5.1 Fitting the model

If Y depends on several variables, then we can extend our simple linear regression model to include more X's. For example, we might predict the height of a child based on the height of the father, the height of the mother, and the sex of the child. The multiple linear regression model is

h    Y = β0 + β1X1 + · · · + βpXp + ε = βᵀX + ε    (39)

where β = (β0, . . . , βp)ᵀ and X = (1, X1, . . . , Xp)ᵀ. The value of the jth covariate for the ith subject is denoted by Xij. Thus

Yi = β0 + β1Xi1 + · · · + βpXip + εi.    (40)

At this point, it is convenient to use matrix notation. Let X be the n × q matrix

X = [ 1  X11  X12  ...  X1p
      1  X21  X22  ...  X2p
      ⋮   ⋮    ⋮          ⋮
      1  Xn1  Xn2  ...  Xnp ].

Each subject corresponds to one row. The number of columns of X corresponds to the number of features plus 1 for the intercept: q = p + 1. Now define

Y = (Y1, Y2, . . . , Yn)ᵀ,    ε = (ε1, ε2, . . . , εn)ᵀ,    β = (β0, β1, . . . , βp)ᵀ.    (41)

We can then rewrite (39) as


Y = Xβ + ε (42)

Notational conventions. Following the notational conventions used in Hastie et al., we will denote a feature by the symbol X. If X is a vector, its components can be accessed by subscripts: Xj. An output, or response variable, is denoted by Y. We use uppercase letters such as X and Y when referring to the variables. Observed values are written in lower case; for example, the i'th observation of X is xi. Matrices are represented using "mathbold font"; for example, a set of n input q-vectors, xi, i = 1, . . . , n, would be represented by the n × q matrix X. In general vectors will not be bold, except when they have n components; this convention distinguishes a q-vector of inputs xi for the i'th observation from the n-vector xj consisting of all the observations on the variable Xj. Since all vectors are assumed to be column vectors, the i'th row of X is xiᵀ, the vector transpose of xi.

5.1 Theorem. The least squares estimator is

β̂ = SY    (43)

where

S = (XᵀX)⁻¹Xᵀ    (44)

assuming that (XᵀX) is invertible.

The fitted values are

Ŷ = Xβ̂ = X(XᵀX)⁻¹XᵀY = HY,

where H is the projection matrix that maps Y onto L, the set of vectors that can be written as Xa (where a is a column vector). The residuals are ε̂ = Y − Ŷ. Of course Ŷ ∈ L and ε̂ is orthogonal to L. Thus, RSS = ||ε̂||² = ε̂ᵀε̂. The variance is estimated by

σ̂² = RSS / (n − p − 1) = RSS / (n − q).    (45)
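A short sketch (not from the notes) computing (43)-(44) from the design matrix and comparing with lm(), using the cats data with two predictors:

## beta-hat = (X'X)^{-1} X'Y by hand vs. lm()
library(MASS)
fit <- lm(Hwt ~ Bwt + Sex, data = cats)
X <- model.matrix(fit)               # n x q design matrix, intercept included
Y <- cats$Hwt
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% Y
cbind(beta.hat, coef(fit))           # identical columns
n <- nrow(X); q <- ncol(X)
sum((Y - X %*% beta.hat)^2) / (n - q)   # sigma-hat squared, equation (45)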


5.2 Theorem. h The estimators satisfy the following properties.

1. E(β̂) = β.

2. V(β̂) = σ²(XᵀX)⁻¹ ≡ Σ.

3. β̂ ≈ MN(β, σ²(XᵀX)⁻¹).

4. An approximate 1 − α confidence interval for βj is

   β̂j ± z_{α/2} se(β̂j)    (46)

   where se(β̂j) is the square root of the appropriate diagonal element of the matrix σ̂²(XᵀX)⁻¹.

Let's prove the first two assertions. Note that

E(β̂) = E(SY) = S E(Y) = SXβ = (XᵀX)⁻¹XᵀXβ = β.

Also, by assumption V(Y) = σ²I, where I is the identity matrix, so

V(β̂) = V(SY) = S V(Y) Sᵀ = σ² SSᵀ = σ² (XᵀX)⁻¹Xᵀ ((XᵀX)⁻¹Xᵀ)ᵀ
      = σ² (XᵀX)⁻¹XᵀX(XᵀX)⁻¹ = σ² (XᵀX)⁻¹.

The ANOVA table is

Source       df                   SS        MS             F
Regression   q − 1 = p            SSreg     SSreg/p        MSreg/MSE
Residual     n − q = n − p − 1    RSS       RSS/(n−p−1)
Total        n − 1                SStotal

where SSreg = ∑_i (Ŷi − Ȳ)² and SStotal = ∑_i (Yi − Ȳ)². We often loosely refer to "the degrees of freedom", but we should indicate whether we mean the df model (p) or the df error (n − p − 1).

The F test F = MSreg/MSE is distributed F_{p,n−p−1}. This is testing the hypothesis

H0 : β1 = · · · = βp = 0.

Testing this hypothesis is of limited value. More frequently we test H0 : βj = 0. Based on an assumption of asymptotic normality one typically performs a t-test. The test statistic is of the form

T = β̂j / se(β̂j).


Reject H0 if |T| is large relative to the quantiles of a t distribution with (n − q) degrees of freedom.

5.3 Example (SAT Data, from the Sleuth text).

Reading in the Data

> data = read.table("CASE1201.ASC", header = TRUE)
> data[1:4,]
        state  sat takers income years public expend rank
1        Iowa 1088      3    326 16.79   87.8  25.60 89.7
2 SouthDakota 1075      2    264 16.07   86.2  19.95 90.6
3 NorthDakota 1068      3    317 16.57   88.3  20.62 89.8
4      Kansas 1045      5    338 16.30   83.9  27.14 86.3
> dim(data)
[1] 50 8

Description of Data

In 1982, average SAT scores were published with breakdowns of state-by-state performance in the United States. The average SAT scores varied considerably by state, with mean scores falling between 790 (South Carolina) and 1088 (Iowa).

Two researchers examined compositional and demographic variables to determine to what extent these characteristics were tied to SAT scores. The variables in the data set were:

state: state name
sat: mean SAT score (verbal and quantitative combined)
takers: percentage of total eligible students (high school seniors) in the state who took the exam
income: median income of families of test takers, in hundreds of dollars
years: average number of years that test takers had in social sciences, natural sciences, and humanities (combined)
public: percentage of test takers who attended public schools
expend: state expenditure on secondary schools, in hundreds of dollars per student
rank: median percentile ranking of test takers within their secondary school classes. Possible values range from 0-99, with 99th percentile students being the highest achieving.

"Notice that the states with high average SATs had low percentages of takers. One reason is that these are mostly midwestern states that administer other tests to students bound for college in-state. Only their best students planning to attend college out of state take the SAT exams. As the percentage of takers increases for other states, so does the likelihood that the takers include lower-qualified students."

Research Question: "After accounting for the percentage of students who took the test and the median class rank of the test takers (to adjust, somewhat, for the selection bias in the samples from each state), which variables are associated with state SAT scores? After accounting for the percentage of takers and the median class rank of the takers, how do the states rank? Which states perform best for the amount of money they spend?"

Exploratory Data Analysis


> par(mfrow = c(2, 4))
> hist(sat, main = "Histogram of SAT Scores", xlab = "Mean SAT Score", col = 1)
> hist(takers, main = "Histogram of Takers",
+   xlab = "Percentage of students tested", col = 2)
> hist(income, main = "Histogram of Income",
+   xlab = "Mean Household Income ($100s)", col = 3)
> hist(years, main = "Histogram of Years",
+   xlab = "Mean Years of Sciences and Humanities", col = 4)
> hist(public, main = "Public Schools Percentage",
+   xlab = "Percentage of Students in Public Schools", col = 5)
> hist(expend, main = "Histogram of Expenditures",
+   xlab = "Schooling Expenditures per Student ($100s)", col = 6)
> hist(rank, main = "Histogram of Class Rank",
+   xlab = "Median Class Ranking Percentile", col = 7)

Exploratory data analysis allows us to look at the variables contained in the data set before beginning any formal analysis. First we examine the variables individually through histograms (Fig. 8). Here we can see the general range of the data, shape (skewed, gapped, symmetric, etc.), as well as any other trends. For example, we note that one state has almost double the secondary schooling expenditures of any other state. We may be interested in determining which state this is, and can do so in one line of code:

> data[which(expend == max(expend)), ]
    state sat takers income years public expend rank
29 Alaska 923     31    401 15.32   96.5   50.1 79.6

Next we look at the variables together.

> par(mfrow = c(1, 1))
> plot(data[,-1])    # scatterplot matrix of 'data', ignoring the first column

> round(cor(data[,-1]), 2)
         sat takers income years public expend  rank
sat     1.00  -0.86   0.58  0.33  -0.08  -0.06  0.88
takers -0.86   1.00  -0.66 -0.10   0.12   0.28 -0.94
income  0.58  -0.66   1.00  0.13  -0.31   0.13  0.53
years   0.33  -0.10   0.13  1.00  -0.42   0.06  0.07
public -0.08   0.12  -0.31 -0.42   1.00   0.28  0.05
expend -0.06   0.28   0.13  0.06   0.28   1.00 -0.26
rank    0.88  -0.94   0.53  0.07   0.05  -0.26  1.00

The scatterplot matrix shows the relationships between the variables at a glance. Generally we are looking for trends here. Does the value of one variable tend to affect the value of another? If so, is that relationship linear? These types of questions help us think of what type of interaction and higher order terms we might want to include in the regression model.


Figure 8: Histogram of SAT data


Here we can confirm some of the observations of the problem statement. The scatterplot matrix shows clear relationships between sat, takers, and rank (Fig. 9). Interestingly, we can also note Alaska's features, since we know it's the state with the very high 'expend' value. We can see that Alaska has a rather average sat score despite its very high levels of spending. For now we will leave Alaska in the data set, but a more complete analysis would seek to remove outliers and high influence points (to be discussed in later sections of the notes). In fact, this data set contains two rather obvious outliers.

One feature visible in both the scatterplot and the histogram is the gap in the distribution of takers. When there is such a distinct gap in a variable's distribution, sometimes it is a good idea to consider a transformation from a continuous variable to an indicator variable.

Since subtle trends are often difficult to spot in scatterplot matrices, sometimes a correlation matrix can be useful, as seen above. Correlation matrices usually print 8-10 significant digits, so the use of the 'round' command tends to make the output more easily readable. We note that both the income and the years variables have moderately strong positive correlations with the response variable (sat). The respective correlations of 0.58 and 0.33 indicate that higher levels of income and years of education in sciences and humanities are generally associated with higher mean sat scores. However, this does not imply causation, and each of these trends may be nullified or even reversed when accounting for the other variables in the data set!

A variable such as 'years' may be of particular interest to researchers. Although neither science nor humanities are directly tested on the SAT, researchers may be interested in whether an increase in the number of years of such classes is associated with a significant increase in SAT score. This may help them make recommendations to schools as to how to plan their curricula.

Full Regression Line

# Fit a full regression line
> attach(data)
> regression.line = lm(sat ~ takers + income + years + public + expend + rank)
> summary(regression.line)

Call:
lm(formula = sat ~ takers + income + years + public + expend + rank)

Residuals:
    Min      1Q  Median      3Q     Max
-60.046  -6.768   0.972  13.947  46.332

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -94.659109 211.509584  -0.448 0.656731
takers       -0.480080   0.693711  -0.692 0.492628
income       -0.008195   0.152358  -0.054 0.957353
years        22.610082   6.314577   3.581 0.000866 ***
public       -0.464152   0.579104  -0.802 0.427249
expend        2.212005   0.845972   2.615 0.012263 *
rank          8.476217   2.107807   4.021 0.000230 ***
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Residual standard error: 26.34 on 43 degrees of freedom
Multiple R-squared: 0.8787,     Adjusted R-squared: 0.8618
F-statistic: 51.91 on 6 and 43 DF,  p-value: < 2.2e-16

Figure 9: Scatterplot of SAT data

> resid = regression.line$residuals
> qqnorm(scale(resid))
> qqline(scale(resid))

The q-q plot indicates that the residuals in our regression model have heavy tails (Fig. 10). On both the negative and positive sides, observed (sample) quantiles are much larger in magnitude than theoretical quantiles.

5.2 Testing Subsets of Coefficients

Suppose you want to test if a set of coefficients is 0. Use

F = [ (RSS_small − RSS_big) / (df_small − df_big) ] / [ RSS_big / df_big ]    (47)

which has an F_{a,b} distribution under H0, where df means degrees of freedom error, a = df_small − df_big and b = df_big. (Note: we often say "reduced" for the small model and "full" for the big model.)

5.4 Example. In this example we'll use the anova command, rather than summary. This gives sequential sums of squares. The order matters: it gives the SS explained by the first variable, then the second variable, conditional on including the first, then the third variable, conditional on the first and second, and so forth. For the SAT data, let's try dropping income, years, public and expend.

> regression.line = lm(sat ~ takers + income + years + public + expend + rank)
> anova(regression.line)
Analysis of Variance Table

Response: sat
          Df Sum Sq Mean Sq  F value    Pr(>F)
takers     1 181024  181024 260.8380 < 2.2e-16 ***
income     1    121     121   0.1749 0.6778321
years      1  14661   14661  21.1253 3.753e-05 ***
public     1   5155    5155   7.4272 0.0092545 **
expend     1   3984    3984   5.7409 0.0209970 *
rank       1  11223   11223  16.1712 0.0002295 ***
Residuals 43  29842     694
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Figure 10: qqplot of residuals for SAT analysis

> reduced.line = lm(sat ~ takers + rank)
> anova(reduced.line)
Analysis of Variance Table

Response: sat
          Df Sum Sq Mean Sq  F value    Pr(>F)
takers     1 181024  181024 158.2095 < 2.2e-16 ***
rank       1  11209   11209   9.7964  0.003003 **
Residuals 47  53778    1144
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

> top = (53778 - 29842)/(47 - 43)
> bottom = 29842/43
> f = top/bottom
> f
[1] 8.622478
> p = 1-pf(f, 4, 43)
> p
[1] 3.348509e-05

Since p is small, we conclude that the set of variables not included in the reduced model collectively contain valuable information about the relationship with SAT score. We don't know yet which are important, but the p-value indicates that removing them all would be unwise.
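The same test can be carried out in one call by handing both fitted models to anova(); a quick sketch (not in the notes):

> anova(reduced.line, regression.line)
# The F statistic and p-value should match the hand computation above
# (F = 8.62 on 4 and 43 degrees of freedom).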

5.5 Example (Using Residuals to Create Better Rankings for the SAT data).

In Display 12.1, the states are ranked based on raw SAT scores, which doesn't seem reasonable. Some state universities require the SAT and some require a competing exam (the ACT). States with a high proportion of takers probably have in-state requirements for the SAT. In states without this requirement, only the more elite students will take the SAT, causing a bias. In Display 12.2, the states are ranked based on SAT scores, corrected for percent taking the exam and median class rank. Let's explore this thinking further.

To address the research question of how the states rank after accounting for the percentage of takers and median class rank, we use our reduced model ('reduced.line' above). Instead of ranking by actual SAT score, we can rank the states by how far they fall above or below their fitted regression line value. A residual is defined as the difference between the observed value and the predicted value.

For example, we have the reduced regression model:

> summary(reduced.line)


Figure 11:


Figure 12:


Call:
lm(formula = sat ~ takers + rank)

Residuals:
   Min     1Q Median     3Q    Max
-98.49 -22.31   5.46  21.40  53.89

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 412.8554   194.2227   2.126  0.03881 *
takers       -0.8170     0.6584  -1.241  0.22082
rank          6.9574     2.2229   3.130  0.00300 **
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Residual standard error: 33.83 on 47 degrees of freedom
Multiple R-squared: 0.7814,     Adjusted R-squared: 0.7721
F-statistic: 84 on 2 and 47 DF,  p-value: 3.032e-16

To manually find the fitted value for Iowa,

  state  sat takers income years public expend rank
1  Iowa 1088      3    326 16.79   87.8  25.60 89.7

we have the equation: fitted = 412.85 − 0.8170*3 + 6.9574*89.7 = 1034.48. Additionally, we can verify this value by typing 'reduced.line$fit' to get a vector of the fitted values for all 50 states. The residual would then be: residual = observed − fitted = 1088 − 1034.48 = 53.52.

> order.vec = order(reduced.line$res, decreasing = TRUE)
> states = factor(data[order.vec, 1])
> newtable = data.frame(State = states,
+   Residual = as.numeric(round(reduced.line$res[order.vec], 1)),
+   oldrank = (1:50)[order.vec])
> newtable

           State Residual oldrank
1    Connecticut     53.9      35
2           Iowa     53.5       1
3   NewHampshire     45.8      28
4  Massachusetts     41.9      41
5        NewYork     40.9      36
6      Minnesota     40.6       7
7         Kansas     35.8       4
8    SouthDakota     33.4       2
9    NorthDakota     32.8       3
10      Illinois     28.0      21
11       Montana     25.6       6


12     NewJersey     22.8      44
13      Delaware     21.7      34
14     Wisconsin     20.5      10
15      Nebraska     20.5       5
16      Maryland     19.5      39
17   RhodeIsland     15.6      43
18          Utah     14.8       8
19      Colorado     14.1      18
20      Virginia     13.9      40
21     Tennessee     13.3      13
22      Missouri      9.6      23
23     NewMexico      8.3      14
24       Vermont      7.9      32
25    Washington      5.8      19
26          Ohio      5.1      27
27  Pennsylvania      2.3      42
28       Wyoming     -0.5       9
29      Michigan     -0.8      24
30      Oklahoma     -3.3      11
31        Hawaii     -3.8      47
32         Maine     -4.3      37
33       Arizona     -9.4      20
34         Idaho     -9.8      15
35     Louisiana    -10.5      22
36       Florida    -11.0      38
37        Alaska    -18.3      29
38    California    -23.6      33
39        Oregon    -23.9      31
40      Kentucky    -24.1      17
41       Alabama    -27.7      26
42       Indiana    -29.2      46
43      Arkansas    -31.2      12
44  WestVirginia    -38.9      25
45        Nevada    -45.4      30
46   Mississippi    -49.3      16
47         Texas    -50.3      45
48       Georgia    -63.0      49
49 NorthCarolina    -71.3      48
50 SouthCarolina    -98.5      50

Above, the order command is used to sort the vectors by residual value. Saving this ordering into the vector order.vec, we were able to sort the state and old ranking by the same ordering.

Note how dramatically the rankings shift once we control for the variables 'takers' and 'rank'. Connecticut, for example, shifts from 35th in raw score (896) to 1st in residuals. Similar shifts happened in the reverse direction, with Arkansas sliding from 12th to 43rd. We could further


Figure 13: Residual plots for SAT data analysis

analyze the ranks by accounting for such things as expenditures to get a sense of which states appear to make 'efficient' use of their spending.

One of the assumptions of the basic regression model is that the magnitude of residuals is relatively constant at all levels of the response. It is important to check that this assumption is upheld here.

res = reduced.line$res
par(mfrow = c(1,3))
plot(sat, res, xlab = "SAT Score", ylab = "Residuals", pch = 19)
abline(h = 0)
plot(takers, res, xlab = "Percent of Students Tested",
  ylab = "Residuals", pch = 19)
abline(h = 0)
plot(rank, res, xlab = "Median Class Ranking Percentile",
  ylab = "Residuals", pch = 19)
abline(h = 0)

Often residuals will "fan out" (increase in magnitude) as the value of a variable increases. This is called 'nonconstant variance'.

Sometimes there will be a pattern in the residual plots, such as a U shape or a curve. This is due to nonlinearity. Patterns are generally an indication that a variable transformation is needed. Ideally, a residual plot will look like a rectangular blob, with no clear pattern.

In the attached residual plots, the first and third plots (SAT and Rank) appear to fit the ideal rectangular blob distribution (Fig. 13); however, 'Takers' (percentage of students tested) had high residuals on the edges and low residuals in the center. This is a product of nonlinearity. Fig. 14 is a graph showing the scatterplots of Takers vs. SAT before and after a transformation.


Figure 14: Relationship before and after transformation

5.3 The Hat Matrix

Recall that

Ŷ = Xβ̂ = X(XᵀX)⁻¹XᵀY = HY    (48)

where

H = X(XᵀX)⁻¹Xᵀ    (49)

is called the hat matrix. The hat matrix is the projector onto the column space of X.

The residuals are

ε̂ = Y − Ŷ = Y − HY = (I − H)Y.    (50)

The hat matrix will play an important role in all that follows.

5.6 Theorem. The hat matrix has the following properties.

1. HX = X.

2. H is symmetric and idempotent: H² = H.

3. H projects Y onto the column space of X.

4. rank(X) = tr(H).

5.7 Theorem. Properties of residuals:

1. True residuals: E(ε) = 0, V(ε) = σ²I.

2. Estimated residuals: E(ε̂) = 0, V(ε̂) = σ²(I − H).

3. ∑_i ε̂i = 0.

4. V(ε̂i) = σ²(1 − hii), where hii is the ith diagonal element of H.

Let's prove a few of these. First,

E(ε̂) = (I − H) E(Y) = (I − H)Xβ
      = Xβ − HXβ
      = Xβ − Xβ    since HX = X
      = 0.

Next,

V(ε̂) = (I − H) V(Y) (I − H)ᵀ
      = σ²(I − H)(I − H)ᵀ
      = σ²(I − H)(I − H)
      = σ²(I − H − H + H²)
      = σ²(I − H − H + H)    since H² = H
      = σ²(I − H).

To see that the sum of the residuals is 0, note that ∑_i ε̂i = ⟨1, ε̂⟩ where 1 denotes a vector of ones. Now, 1 ∈ L, Ŷ is the projection onto L, and ε̂ = Y − Ŷ. By the properties of the projection, ε̂ is perpendicular to every vector in L. Hence, ∑_i ε̂i = ⟨1, ε̂⟩ = 0.
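A quick numerical check of these properties (not from the notes), using the cats regression:

## Residuals sum to zero; hat-matrix diagonal and trace
library(MASS)
fit <- lm(Hwt ~ Bwt, data = cats)
sum(resid(fit))        # ~ 0
h <- hatvalues(fit)    # the diagonal elements h_ii of H
sum(h)                 # = tr(H) = q = 2 for simple linear regression
range(h)               # leverages are largest for extreme body weights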

5.8 Example. Suppose Yi = β0 + εi. Let X = (1, 1, . . . , 1)ᵀ. Then

H = (1/n) [ 1 1 · · · 1
            1 1 · · · 1
            ⋮         ⋮
            1 1 · · · 1 ].

The column space is V = {(a, a, . . . , a)ᵀ : a ∈ R} and HY = (Ȳ, Ȳ, . . . , Ȳ)ᵀ.

5.9 Example. Suppose that the X matrix has two columns. Denote these columns by x1 and x2. The column space is V = {a1x1 + a2x2 : a1, a2 ∈ R}. The hat matrix projects Y ∈ Rⁿ onto V.


Figure 15: Projection of Y onto the column space spanned by x1 and x2.

5.4 Weighted Least Squares

So far we have assumed that the εi's are independent and have the same variance. What happens if the variance is not constant? For example, Sheather's text gives a simple example about a cleaning company. The building maintenance company keeps track of how many crews it has working (X) and the number of rooms cleaned (Y). The number of crews varied from 2 to 16, and for each level of X, at least 4 observations of Y are available. A plot of X versus Y reveals that the relationship is linear, but that the variance grows as X increases. Because we have several observations for each level of X we can estimate σ² as a function of X. (Of course, we don't usually have multiple measures of Y for each level of X, so we will need more subtle ways of handling this problem.)

For another example, suppose Di is the number of diseased individuals in a population of size mi and Yi = Di/mi. Under certain assumptions, it might be reasonable to assume that Di is binomial, in which case V[Yi] would be proportional to 1/mi. If the disease is contagious, the binomial assumption would not be correct. Nevertheless, provided mi is large for each i, it might be reasonable to assume that Yi is approximately normal with mean β0 + β1xi and variance σ²/mi. In this case the variance is a function of mi, and we could model this variance as described below.

Suppose that

Y = Xβ + ε

where

V(ε) = Σ.

Suppose we use the usual least squares estimator β̂. Then,

E(β̂) = E((XᵀX)⁻¹XᵀY)
      = (XᵀX)⁻¹Xᵀ E(Y)
      = (XᵀX)⁻¹XᵀXβ = β.

So β̂ is still unbiased. Also, under weak conditions, it can be shown that β̂ is consistent (converges to β as we get more data). The usual estimator has reasonable properties. However, there are two problems.

First, with constant variance, the usual least squares estimator is not just unbiased, it is optimal in the sense that it is the minimum variance, linear, unbiased estimator. This is no longer true with non-constant variance. Second, and more importantly, the formula for the standard error of β̂ is wrong. To see this, recall that V(AY) = A V(Y) Aᵀ. Hence,

V(β̂) = V((XᵀX)⁻¹XᵀY) = (XᵀX)⁻¹Xᵀ V(Y) X(XᵀX)⁻¹ = (XᵀX)⁻¹XᵀΣX(XᵀX)⁻¹,

which is different from the usual formula. It can be shown that the minimum variance, linear, unbiased estimator is obtained by minimizing

RSS(β) = (Y − Xβ)ᵀΣ⁻¹(Y − Xβ).

The solution is

β̂ = SY    (51)

where

S = (XᵀΣ⁻¹X)⁻¹XᵀΣ⁻¹.    (52)

This is unbiased with variance

V(β̂) = (XᵀΣ⁻¹X)⁻¹.

This is called weighted least squares.

Let B denote the square root of Σ. Thus, B is a symmetric matrix that satisfies

BᵀB = BBᵀ = Σ.

It can be shown that B⁻¹ is the square root of Σ⁻¹. Let Z = B⁻¹Y. Then we have

Z = B⁻¹Y = B⁻¹(Xβ + ε)
  = B⁻¹Xβ + B⁻¹ε
  = Mβ + δ

where M = B⁻¹X and δ = B⁻¹ε. Moreover,

V(δ) = B⁻¹ V(ε) B⁻¹ = B⁻¹ΣB⁻¹ = B⁻¹BBB⁻¹ = I.

Thus we can simply regress Z on M and do ordinary regression.


Let us look more closely at a special case. If the residuals are uncorrelated, then Σ is diagonal:

Σ = diag( σ²/w1, σ²/w2, . . . , σ²/wn ).

In this case,

RSS(β) = (Y − Xβ)ᵀΣ⁻¹(Y − Xβ) ∝ ∑_{i=1}^n wi (Yi − xiᵀβ)².

Thus, in weighted least squares we are simply giving lower weight to the more variable (less precise) observations.

Now we have to address the following question: where do we get the weights? Or equivalently, how do we estimate σi² = V(εi)? There are four approaches.

(1) Do a transformation to make the variances approximately equal. Then we don't need to do a weighted regression.

(2) Use external information. There are some cases where other information (besides the current data) will allow you to know (or estimate) σi. These cases are rare but they do occur. For example, σi² could be the measurement error of the instrument.

(3) Use replications. If there are several Y values corresponding to each x value, we can use the sample variance of those Y values to estimate σi². However, it is rare that you would have so many replications.

(4) Estimate σ(x) as a function of x. Just as we can estimate the regression line, we can also estimate the variance, thinking of it as a function of x. We could assume a simple model like

σ(xi) = α0 + α1xi,

for example. Then we could try to find a way to estimate the parameters α0 and α1 from the data. In fact, we will do something more ambitious. We will estimate σ(x) assuming only that it is a smooth function of x. We will do this later in the course when we discuss nonparametric regression.

In R we simply include weights in the lm command: lm(Y ~ X, weights = 1/StdDev^2), where StdDev is an estimate of the standard deviation of Y given X (so StdDev² estimates V[Y | X]).
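A small sketch (simulated data, not from the notes) of the weighted fit; here wi = mi, as in the disease-proportion example above, so that V[Yi] ∝ 1/mi:

## Weighted least squares with lm(..., weights = )
set.seed(4)
m <- sample(50:500, 40, replace = TRUE)        # group sizes
x <- runif(40)
y <- 0.1 + 0.5*x + rnorm(40, sd = 1/sqrt(m))   # variance proportional to 1/m
fit.ols <- lm(y ~ x)                 # unbiased, but standard errors are wrong
fit.wls <- lm(y ~ x, weights = m)    # weights proportional to 1/V(y_i)
summary(fit.wls)$coefficients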

6 Diagnostics

Figure 16 shows a famous example: four different data sets with the same fit. The moral: looking at the fit is not enough. We should also use some diagnostics. Generally, we diagnose problems by looking at the residuals. When we do this, we are looking for: (1) outliers, (2) influential points, (3) nonconstant variance, (4) nonlinearity, (5) nonnormality. The remedies are:


Figure 16: The Anscombe Example (four data sets with the same fitted line).

Problem                    Remedy
1. Outliers                Non-influential: don't worry about it.
                           Influential: remove or use robust regression.
2. Influential points      Fit regression with and without the point
                           and report both analyses.
3. Nonconstant variance    Use transformation or nonparametric methods.
                           Note: doesn't affect the fit too much;
                           mainly an issue for confidence intervals.
4. Nonlinearity            Use transformation or nonparametric methods.
5. Nonnormality            Large samples: not a problem.
                           Small samples: use transformations.


Three types of residuals:

Name                     Formula                          R command (assume lm output is in tmp)
residual                 ε̂i = Yi − Ŷi                     resid(tmp)
standardized residual    (Yi − Ŷi) / (σ̂ √(1 − hii))       rstandard(tmp)
studentized residual     (Yi − Ŷi) / (σ̂(i) √(1 − hii))    rstudent(tmp)

6.1 Outliers

Outliers can be found (i) graphically or (ii) by testing. Let us write

Yj = Xjᵀβ + εj + δ    if j = i
Yj = Xjᵀβ + εj        if j ≠ i.

Test

H0 : δ = 0 (case i is not an outlier) versus H1 : δ ≠ 0 (case i is an outlier).

Do the following: (1) Delete case i. (2) Compute β̂(i) and σ̂(i). (3) Predict the deleted case: Ŷi = Xiᵀβ̂(i). (4) Compute

ti = (Yi − Ŷi) / se.

(5) Reject H0 if the p-value is less than α/n.

Note that

V(Yi − Ŷi) = V(Yi) + V(Ŷi) = σ² + σ² xiᵀ(X(i)ᵀX(i))⁻¹xi.

So,

se(Yi − Ŷi) = σ √( 1 + xiᵀ(X(i)ᵀX(i))⁻¹xi ).

How do the residuals come into this? Internally studentized residuals:

ri = ε̂i / (σ̂ √(1 − hii)).

Externally studentized residuals:

r(i) = ε̂i / (σ̂(i) √(1 − hii)).

6.1 Theorem.

t_i = r_i √( (n − p − 2) / (n − p − 1 − r_i²) ) = r_(i).
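In practice the test can be run directly from the externally studentized residuals. The sketch below assumes a fitted lm object called tmp (as in the table above); the object names are placeholders.

## Sketch: Bonferroni outlier test from externally studentized residuals.
ti   <- rstudent(tmp)                  # t_i = r_(i)
df   <- tmp$df.residual - 1            # degrees of freedom for the t statistic
pval <- 2 * pt(-abs(ti), df = df)      # two-sided p-values
n    <- length(ti)
which(pval < 0.05 / n)                 # cases flagged at overall level 0.05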


6.2 Influence

Cook's distance is

D_i = (Ŷ_(i) − Ŷ)^T (Ŷ_(i) − Ŷ) / (q σ̂²) = (1/q) r_i² ( h_ii / (1 − h_ii) ),

where Ŷ = Xβ̂ and Ŷ_(i) = Xβ̂_(i). Points with D_i ≥ 1 might be influential. Points near the edge are typically the influential points.

6.2 Example (Rats).

> data = c(176,6.5,.88,.42,176,9.5,.88,.25,190,9.0,1.00,.56,176,8.9,.88,.23,...149,6.4,.75,.46)
> data = matrix(data,ncol=4,byrow=T)
> bwt = data[,1]
> lwt = data[,2]
> dose = data[,3]
> y = data[,4]
> dim(data)
[1] 19  4
> n = length(y)

The four variables:
bwt: the rat's body weight
lwt: the rat's lung weight
dose: the dosage given to the rat
y: the amount of the dose that reached the rat's liver

> data2 = cbind(bwt, lwt, dose, y)
> datam = as.data.frame(data2)
> pch.vec = c(1, 1, 19, rep(1, 16))
> plot(datam, pch = pch.vec)

To produce a scatterplot matrix, the data must be formatted in R as a data frame. In the scatterplot matrix, I have colored as black the observation that is ultimately removed as our high influence point (Fig. 17). This observation is pretty typical as far as high influence points go, and we can learn a lot just by looking at these graphs. How does this observation differ from the other 18?


Figure 17: Rat Data


Most obviously, it has an unusually large value for y. Furthermore, it is on the "edge" of the data (high values of bwt, lwt, and dose). Observations on the edge of a data space have higher influence by nature.

The plot of body weight versus dosage is particularly interesting. The relationship between weight and dosage is nearly perfectly linear, with the exception of our high influence point. Perhaps since one rat was given an abnormally large dosage for its weight, we see an abnormally large amount of the dose ending up in the liver.

We note that in the plots on the bottom row, none of the three predictors appears to have an obvious relationship with the response variable y.

> out = lm(y ~ bwt + lwt + dose, qr=TRUE)
> summary(out)

Call:
lm(formula = y ~ bwt + lwt + dose, qr = TRUE)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.265922   0.194585   1.367   0.1919
bwt         -0.021246   0.007974  -2.664   0.0177 *
lwt          0.014298   0.017217   0.830   0.4193
dose         4.178111   1.522625   2.744   0.0151 *

r = rstandard(out)   ### standardized
r = rstudent(out)    ### studentized
plot(fitted(out), rstudent(out), pch=pch.vec, xlab = "Fitted Value", ylab = "Studentized Residual")
abline(h=0)

This first model appears to have some explanatory power, with the coefficients for both bwt and dose appearing as significant. However, consider the scatterplot matrix again. These two variables were nearly perfectly correlated, with the exception of one observation. So what's happening here?

In short, they're cancelling each other out for most observations, while still accounting for our high influence point. For every observation (except for our high influence point), the expression β1*bwt + β3*dose evaluates to nearly 0! For our high influence point, however, it evaluates to 0.141. The model has been heavily influenced by one point, creating artificial significance to account for the unusually high dosage (with respect to body weight) given to one rat.

To formally quantify this influence, we look at the Cook’s distance for each observation.

> I = influence.measures(out)
> names(I)
[1] "infmat" "is.inf" "call"


Figure 18: Rat Data

> I$infmat[1:5,]
       dfb.1_     dfb.bwt    dfb.lwt   dfb.dose      dffit     cov.r     cook.d
1 -0.03835128  0.31491627 -0.7043633 -0.2437488  0.8920451 0.6310012 0.16882682
2  0.14256373 -0.09773917 -0.4817784  0.1256122 -0.6087606 1.0164073 0.08854024
3 -0.23100202 -1.66770314  0.3045718  1.7471972  1.9047699 7.4008047 0.92961596
4  0.12503004 -0.12685888 -0.3036512  0.1400908 -0.4943610 0.8599033 0.05718456
5  0.52160605 -0.39626771  0.5500161  0.2747418 -0.9094531 1.5241607 0.20291617

> cook = I$infmat[,7]
> plot(cook, type="h", lwd=3, col="red", ylab = "Cook's Distance")

In the Cook's distance plot, we see that our high influence point (the third observation) has a much larger Cook's distance than any of the others (Fig. 18). This generally indicates that the observation should be removed from the analysis.

> y = y[-3]
> bwt = bwt[-3]
> lwt = lwt[-3]
> dose = dose[-3]
> out = lm(y ~ bwt + lwt + dose)
> summary(out)

Call:
lm(formula = y ~ bwt + lwt + dose)


Residuals:
      Min        1Q    Median        3Q       Max
-0.102154 -0.056486  0.002838  0.046519  0.137059

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.311427   0.205094   1.518    0.151
bwt         -0.007783   0.018717  -0.416    0.684
lwt          0.008989   0.018659   0.482    0.637
dose         1.484877   3.713064   0.400    0.695

Residual standard error: 0.07825 on 14 degrees of freedom
Multiple R-squared: 0.02106, Adjusted R-squared: -0.1887
F-statistic: 0.1004 on 3 and 14 DF, p-value: 0.9585

After removing the high influence point, we refit the original model. Now we find a regression relationship with nearly no significance (p = 0.9585). This seems consistent with what we observed in the original scatterplot matrix.

6.3 Tweaking the Regression

If residual plots indicate some problem, we need to apply some remedies. Look at Figure 6.3, p. 132 of Weisberg.

Possible remedies are:

• Transformation

• Robust regression

• Nonparametric regression

Examples of transformations: √Y, log(Y), log(Y + c), 1/Y.

These can be applied to Y or x. We transform to make the assumptions valid, not to chase statistical significance.

6.3 Example (Bacteria). This example is from Chatterjee and Price (1991, p. 36). Bacteria were exposed to radiation. Figure 19 shows the number of surviving bacteria versus time of exposure to radiation.

The program and output look like this.

Figure 19: Bacteria Data


> time = 1:15
> survivors = c(355,211,197,166,142,106,104,60,56,38,36,32,21,19,15)
> plot(time,survivors)
> out = lm(survivors ~ time)
> abline(out)
> plot(out,which=c(1,2,4))
> print(summary(out))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   259.58      22.73  11.420 3.78e-08 ***
time          -19.46       2.50  -7.786 3.01e-06 ***
---
Residual standard error: 41.83 on 13 degrees of freedom
Multiple R-Squared: 0.8234, Adjusted R-squared: 0.8098
F-statistic: 60.62 on 1 and 13 DF, p-value: 3.006e-06

The residual plot suggests a problem. Consider the following transformation.

> logsurv = log(survivors)
> plot(time,logsurv)
> out = lm(logsurv ~ time)
> abline(out)
> plot(out,which=c(1,2,4))
> print(summary(out))

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.973160   0.059778   99.92  < 2e-16 ***
time        -0.218425   0.006575  -33.22 5.86e-14 ***
---
Residual standard error: 0.11 on 13 degrees of freedom
Multiple R-Squared: 0.9884, Adjusted R-squared: 0.9875
F-statistic: 1104 on 1 and 13 DF, p-value: 5.86e-14

Check out Figure 20. Much better. In fact, theory predicts N_t = N_0 e^{βt}, where N_t is the number of survivors at exposure t and N_0 is the number of bacteria before exposure. So the fact that the log transformation is useful here is not surprising.

7 Misc Topics in Multiple Regression

7.1 Qualitative Variables

If X ∈ {0, 1}, then it is called a dummy variable. More generally, if X takes discrete values, it is called a qualitative variable or a factor. Let D be a dummy variable. Consider

E(Y ) = β0 + β1X + β2D.

Then

        intercept   slope
d = 0   β0          β1
d = 1   β0 + β2     β1

These are parallel lines. Now consider this model:

E(Y ) = β0 + β1X + β2D + β3X D

Then:

        intercept   slope
d = 0   β0          β1
d = 1   β0 + β2     β1 + β3

These are nonparallel lines. To include a discrete variable with k levels, use k − 1 dummy variables. For example, if z ∈ {1, 2, 3}, do this:

z   d1   d2
1   1    0
2   0    1
3   0    0

In the model

Y = β0 + β1 D1 + β2 D2 + β3 X + ε

we see

E(Y |z = 1) = β0 + β1 + β3X

E(Y |z = 2) = β0 + β2 + β3X

E(Y |z = 3) = β0 + β3X

You should not create k dummy variables because, together with the intercept, they will not be linearly independent. Then X^T X is not invertible.
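A quick way to see the k − 1 coding that R uses by default (and that appears in the salary example below) is to inspect the design matrix built from a factor; this small sketch uses made-up values of z.

## Sketch: R's default treatment coding for a 3-level factor.
z <- factor(c(1, 2, 3, 1, 2, 3))
model.matrix(~ z)   # intercept plus k - 1 = 2 dummy columns (z2 and z3)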

7.1 Example. Salary data from Chatterjee and Price p 96.

Figure 20: Bacteria Data


##salary example p 97 chatterjee and price
sdata = read.table("salaries.dat",skip=1)
names(sdata) = c("salary","experience","education","management")
attach(sdata)

n = length(salary)
d1 = rep(0,n)
d1[education==1] = 1
d2 = rep(0,n)
d2[education==2] = 1

out1 = lm(salary ~ experience + d1 + d2 + management)
summary(out1)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  11031.81     383.22  28.787  < 2e-16 ***
experience     546.18      30.52  17.896  < 2e-16 ***
d1           -2996.21     411.75  -7.277 6.72e-09 ***
d2             147.82     387.66   0.381    0.705
management    6883.53     313.92  21.928  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1027 on 41 degrees of freedom
Multiple R-Squared: 0.9568, Adjusted R-squared: 0.9525
F-statistic: 226.8 on 4 and 41 DF, p-value: < 2.2e-16

Interpretation: Each year of experience increases our prediction by 546 dollars. The increment for a management position is 6883 dollars. Compare bachelors to high school. For high school, d1 = 1 and d2 = 0, so:

E(Y) = β0 + β1·experience − 2996 + β4·management.

For bachelors, d1 = 0 and d2 = 1, so:

E(Y) = β0 + β1·experience + 147 + β4·management.

So

E_bach(Y) − E_high(Y) = 3144.

### another way
ed = as.factor(education)


out2 = lm(salary ~ experience + ed + management)
summary(out2)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   8035.60     386.69  20.781  < 2e-16 ***
experience     546.18      30.52  17.896  < 2e-16 ***
ed2           3144.04     361.97   8.686 7.73e-11 ***
ed3           2996.21     411.75   7.277 6.72e-09 ***
management    6883.53     313.92  21.928  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1027 on 41 degrees of freedom
Multiple R-Squared: 0.9568, Adjusted R-squared: 0.9525
F-statistic: 226.8 on 4 and 41 DF, p-value: < 2.2e-16

Apparently, R codes the dummy variables differently:

level         mean    d1   d2   ed2   ed3
high-school   8036    1    0    0     0
BS            11179   0    1    1     0
advanced      11032   0    0    0     1

You can change the way R does this; see help(C) and help(contr.treatment).

7.2 Collinearity

If one of the predictor variables is a linear combination of the others, then we say that the variables are collinear. The result is that X^T X is not invertible. Formally, this means that the standard error of β̂ is infinite and the standard error for predictions is infinite.

For example, suppose that x_{1i} = 2 for every i and suppose we include an intercept. Then the X matrix is

1  2
1  2
⋮  ⋮
1  2

and so

X^T X = n ( 1  2
            2  4 )

which is not invertible. The implied model in this example is

Y_i = β0 + 2β1 + ε_i ≡ β̃0 + ε_i


where β̃0 = β0 + 2β1. We can estimate β̃0 using Ȳ but there is no way to separate this into estimates for β0 and β1.
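The singularity is easy to see numerically; the following small sketch (with a made-up response) reproduces the X^T X matrix above and shows how lm reacts to the redundant column.

## Sketch: a constant predictor plus an intercept makes X^T X singular.
set.seed(1)
n  <- 10
x1 <- rep(2, n)               # x_{1i} = 2 for every i
y  <- rnorm(n)                # arbitrary response, for illustration only
X  <- cbind(1, x1)
crossprod(X)                  # equals n * matrix(c(1, 2, 2, 4), 2, 2)
# solve(crossprod(X))         # would fail: the matrix is exactly singular
coef(lm(y ~ x1))              # lm returns NA for the redundant coefficient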

Sometimes the variables are close to collinear. The result is that it may be difficult to invert X^T X. However, the bigger problem is that the standard errors will be huge.

The solution is easy. Don't use all the variables; use variable selection (stay tuned...). Multicollinearity is just an extreme example of the bias-variance tradeoff we face whenever we do regression. If we include too many variables, we get poor predictions due to increased variance (more later).

7.3 Case Study

This example is drawn from the "Sleuth" text. When men and women of the same size and drinking history consume equal amounts of alcohol, the women tend to maintain a higher blood alcohol concentration. To explain this, researchers conjectured that enzymes in the stomach are more active in men than in women. In this study we examine the level of gastric enzyme as a predictor of first pass metabolism. These two variables are known to be positively related. The question is, does this relationship differ between men and women?

18 women and 14 men were in the study. Of the 32 subjects, 8 were considered alcoholics. All subjects were given a dose of alcohol and then the researchers measured their first pass metabolism. The higher this quantity, the more rapidly they were processing the alcohol.

Here are the data:

   subject metabol gastric female alcohol
1        1     0.6     1.0      1       1
2        2     0.6     1.6      1       1
3        3     1.5     1.5      1       1
4        4     0.4     2.2      1       0
5        5     0.1     1.1      1       0
6        6     0.2     1.2      1       0
7        7     0.3     0.9      1       0
8        8     0.3     0.8      1       0
9        9     0.4     1.5      1       0
10      10     1.0     0.9      1       0
11      11     1.1     1.6      1       0
12      12     1.2     1.7      1       0
13      13     1.3     1.7      1       0
14      14     1.6     2.2      1       0
15      15     1.8     0.8      1       0
16      16     2.0     2.0      1       0
17      17     2.5     3.0      1       0
18      18     2.9     2.2      1       0
19      19     1.5     1.3      0       1
20      20     1.9     1.2      0       1
21      21     2.7     1.4      0       1
22      22     3.0     1.3      0       1
23      23     3.7     2.7      0       1
24      24     0.3     1.1      0       0
25      25     2.5     2.3      0       0
26      26     2.7     2.7      0       0
27      27     3.0     1.4      0       0
28      28     4.0     2.2      0       0
29      29     4.5     2.0      0       0
30      30     6.1     2.8      0       0
31      31     9.5     5.2      0       0
32      32    12.3     4.1      0       0

In Figure 21 you can see the relationship between gastric enzyme and metabolism and how it relates to sex and alcoholism. Consider the full model, including all interactions:

> out = lm(formula = metabol ~ (gastric + female + alcohol)^3, data = dat)

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)              -1.6597     0.9996  -1.660   0.1099
gastric                   2.5142     0.3434   7.322 1.46e-07 ***
female                    1.4657     1.3326   1.100   0.2823
alcohol                   2.5521     1.9460   1.311   0.2021
gastric:female           -1.6734     0.6202  -2.698   0.0126 *
gastric:alcohol          -1.4587     1.0529  -1.386   0.1786
female:alcohol           -2.2517     4.3937  -0.512   0.6130
gastric:female:alcohol    1.1987     2.9978   0.400   0.6928
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> plot(fitted(out),resid(out),xlab="Fitted value",ylab="Residual")

From the residual plot we see that two observations have very high fitted values and large residuals. These are likely to be high influence observations that require careful consideration. First we look to see if these observations are affecting our inferences. If we drop observations 31 and 32 from the full model (above), only gastric is significant (results not shown). Consequently we believe these observations have high influence on our analysis. This makes sense because both of these males have very high gastric activity. All other subjects have activity between 0.5 and 3. Perhaps these males are athletes, or extremely large people, or they differ in some fundamental way from the remaining individuals. If we restrict our inferences to people with gastric activity less than 3, we can feel more confident about our inferences.

Next we need to simplify the model before proceeding because we have a small number of observations for the complexity of our original model. There is no indication of a detectable effect

Figure 21: Alcohol metabolism

Figure 22: Alcohol metabolism


due to alcoholism, so we drop alcohol from the model and then do a formal investigation of the influence of observations 31 and 32. We calculate the influence measures and look at Cook's distance. Below we show it for these two observations. All other observations have small distances. Hence we remove these two observations.

> outsimple = lm(formula = metabol ~ (gastric + female)^2, data = dat)
> I = influence.measures(outsimple)

Cook's distance for records 31 and 32:
> I$infmat[31:32,7]
      31       32
0.960698 1.167255

> datclean = dat[-c(31,32), ]
> attach(datclean)
> final = lm(metabol ~ (gastric + female)^2, data = datclean)
> summary(final)

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      0.06952    0.80195   0.087 0.931580
gastric          1.56543    0.40739   3.843 0.000704 ***
female          -0.26679    0.99324  -0.269 0.790352
gastric:female  -0.72849    0.53937  -1.351 0.188455
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In contrast, with 31 and 32 included we had a significant interaction between gastric and female:

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      -1.1858     0.7117  -1.666   0.1068
gastric           2.3439     0.2801   8.367 4.22e-09 ***
female            0.9885     1.0724   0.922   0.3645
gastric:female   -1.5069     0.5591  -2.695   0.0118 *

From this output it appears metabolism does not depend on sex, but the effect of sex was clearly visible in the plot of the original data. Perhaps the model is still overparameterized. We try a model with no intercept for males or females, since metabolism is known to be approximately 0 when gastric activity is 0. This model forces the line through the origin. (Note: I have found that forcing no intercept is often a good modeling choice if the fitted line appears to go through the origin.) The model we are fitting for males is

Y = β1X + ε,


and for females is

Y = (β1 + β2)X + ε.

> femgastric = female*gastric
> outnoint = lm(metabol ~ (gastric + femgastric) - 1)
> summary(outnoint)

Coefficients:
           Estimate Std. Error t value Pr(>|t|)
gastric      1.5989     0.1249  12.800 3.20e-13 ***
femgastric  -0.8732     0.1740  -5.019 2.63e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8518 on 28 degrees of freedom
Multiple R-squared: 0.877, Adjusted R-squared: 0.8683
F-statistic: 99.87 on 2 and 28 DF, p-value: 1.804e-13

A plot of residuals reveals that this model provides a good fit (results not shown). We consider this our best model.

Conclusions: As expected, metabolism increases with gastric activity; however, mean first pass metabolism is higher for males than females, even if we account for gastric activity (p-value < 0.0001). Specifically it is 2.2 = β1/(β1 + β2) times higher for males than for females. The experiment supports our hypothesis that males process alcohol more quickly than females, even when we account for gastric enzyme levels.

Although it was not mentioned, I believe that gastric enzyme levels are higher for larger people, thus including this variable in the model controls for the size of the subject. It would be interesting to look at this study in more depth.


8 Bias-Variance Decomposition and Model Selection

8.1 The Predictive Viewpoint

The main motivation for studying regression is prediction. Suppose we observe X and then predict Y with g(X). Recall that the prediction error, or prediction risk, is

R(g) = E(Y − g(X))²

and this is minimized by taking g(x) = r(x) where r(x) = E(Y | X = x).

Consider the set of linear predictors

L = { ℓ(x) = x^T β : β ∈ R^p }.

(We usually assume that x1 = 1.) The best linear predictor, or linear oracle, is ℓ*(x) = x^T β* where

R(ℓ*) = min_{ℓ ∈ L} R(ℓ).

In other words, ℓ*(x) = x^T β* gives the smallest prediction error of all linear predictors. Note that ℓ* is well-defined even without assuming that the true regression function is linear. One way to think about linear regression is as follows. When we are using least squares, we are trying to estimate the linear oracle, not the true regression function.

Let us make the connection between the best linear predictor and least squares more explicit. (Notation note: Remember that the column vector of features X is a row vector in the matrix X. Consequently, in the following text, every time we move between X and X, it seems like we've made a transpositional error, but we have not.)

We have

R(β) = E(Y²) − 2E(Y X^T β) + β^T E(XX^T) β
     = E(Y²) − 2E(Y X^T β) + β^T Σ β

where Σ = E(XX^T). By differentiating R(β) with respect to β and setting the derivative equal to 0, we see that the best value of β is

β* = Σ^{−1} C     (53)

where C is the p × 1 vector whose jth element is E(Y X_j). We can estimate Σ with the matrix

Σ̂_n = (1/n) X^T X

and we can estimate C with n^{−1} X^T Y. An estimate of the oracle is thus

β̂ = (X^T X)^{−1} X^T Y

which is the least squares estimator.


Summary

1. The linear oracle, or best linear predictor at x, is x^T β* where β* = Σ^{−1} C. An estimate of β* is β̂ = (X^T X)^{−1} X^T Y.

2. The least squares estimator is β̂ = (X^T X)^{−1} X^T Y. We can regard β̂ as an estimate of the linear oracle. If the regression function r(x) is actually linear so that r(x) = x^T β, then the least squares estimator is unbiased and has variance matrix σ²(X^T X)^{−1}.

3. The predicted values are Ŷ = Xβ̂ = HY where H = X(X^T X)^{−1} X^T is the hat matrix, which projects Y onto the column space of X.

8.2 The Bias-Variance Decomposition

If X and Y are random variables, recall the rule of iterated expectations

E[g(X, Y)] = E[ E(g(X, Y) | X) ],

where the inner expectation is taken with respect to Y | X and the outer one is taken with respect to the marginal distribution of X. Throughout the following section, we use this rule, conditioning on X to obtain the risk function R(x) at X = x.

Let r̂(x) be any predictor, based on the observations (X_i, Y_i), i = 1, . . . , n. As a function of random variables, r̂(x) is a random variable itself, calculated at a fixed value of x. In the calculations below think of (X, Y) as new input and output variables, independent of (X_1, Y_1), . . . , (X_n, Y_n). Then define the risk as

R = E(Y − r̂(X))² = ∫ R(x) f(x) dx

where R(x) = E((Y − r̂(x))² | X = x). Let

r̄(x) = E(r̂(x)),
V(x) = V(r̂(x)),
σ²(x) = V(Y | X = x).

Now

R(x) = E((Y − r̂(X))² | X = x)
     = E( ((Y − r(x)) + (r(x) − r̄(x)) + (r̄(x) − r̂(x)))² | X = x )
     = σ²(x) + (r(x) − r̄(x))² + V(x),     (54)

where the three terms are the irreducible error, the squared bias, and the variance, respectively.

We call (54) the bias-variance decomposition. Note: The irreducible error is the error due to unmodeled variation, such as instrument error and population variability around the model. The bias is the lack of fit between the assumed model and the true relationship between Y and X. This will be zero if the assumed model r̂(x) includes the truth E[Y | X = x]. The variance is the statistical variability in the estimation procedure. As n → ∞ this quantity goes to zero. Finally, all the cross product terms have expectation 0 because Y is independent of r̂(x).

If we combine the last two terms, we can also write

R(x) = σ²(x) + MSE_n(x)

where MSE_n(x) = E((r̂(x) − r(x))² | X = x) is the conditional mean squared error of r̂(x). Now

R = ∫ R(x) f(x) dx ≈ (1/n) Σ_{i=1}^n R(x_i) ≡ R_av

and R_av is called the average prediction risk. It averages over the observed X's as an approximation to the theoretical average over the marginal distribution of the X's. We have

R_av = (1/n) Σ_{i=1}^n R(x_i) = (1/n) Σ_{i=1}^n σ²(x_i) + (1/n) Σ_{i=1}^n (r̄(x_i) − r(x_i))² + (1/n) Σ_{i=1}^n V(x_i).

To summarize, we wish to know R, the prediction risk. R_av provides an excellent approximation, but R_av is not a quantity that we can readily calculate empirically because we do not know R(x_i). Let us explore why it is challenging to calculate R. Let Ŷ_i = r̂(x_i), the fitted value of the regression at x_i. Define the training error

R̂_training = (1/n) Σ_{i=1}^n (Ŷ_i − Y_i)².

We might guess that R̂_training estimates the prediction error R well, but this is not true. The reason is that we used the observed pairs (x_i, Y_i) to obtain Ŷ_i = r̂(x_i). As a consequence, Ŷ_i and Y_i are correlated. Typically Ŷ_i "predicts" Y_i better than it predicts a new Y at the same x_i. Let us explore this formally. Let r̄(x_i) = E(r̂(x_i)) and compute

E(Y_i − Ŷ_i)² = E(Y_i − r(x_i) + r(x_i) − r̄(x_i) + r̄(x_i) − Ŷ_i)²
             = σ² + (r(x_i) − r̄(x_i))² + V(r̂(x_i)) − 2 Cov(Y_i, Ŷ_i).

Note: this time the cross-product involving the 1st and 3rd terms is not 0 because Cov(Y_i, Ŷ_i) ≠ 0. This is because Y_i is a particular observation from which we calculated Ŷ_i, hence the two terms are correlated. This introduces a bias into the estimate of risk:

E(R̂_training) = E(R_av) − (2/n) Σ_{i=1}^n Cov(Y_i, Ŷ_i).     (55)

Typically, Cov(Y_i, Ŷ_i) > 0 and so R̂_training underestimates the risk. Later, we shall see how to estimate the prediction risk.
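A tiny simulation makes the optimism of the training error visible. Everything below (the cubic truth, n = 50, the degree-5 polynomial fit) is invented purely for illustration.

## Sketch: training error vs. prediction error at the same design points.
set.seed(1)
n <- 50
x <- runif(n, -2, 2)
r <- function(x) x^3 - 2 * x                # assumed "true" regression function
y <- r(x) + rnorm(n)
fit  <- lm(y ~ poly(x, 5))                  # a somewhat flexible fit
yhat <- fitted(fit)
train_err <- mean((y - yhat)^2)             # training error
ystar     <- r(x) + rnorm(n)                # fresh responses at the same x's
pred_err  <- mean((ystar - yhat)^2)         # one draw of the prediction error
c(training = train_err, prediction = pred_err)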


8.3 Variable Selection

When selecting variables for a model, one needs to consider the research hypothesis as well as any potential confounding variables to control for. For example, in most medical studies, age and gender are always included in the model since they are common confounders. Researchers are looking for the effect of other predictors on the response once age and gender have been accounted for.

If your research hypothesis specifically addresses the effect of a variable, say expenditure, you need to either include it in your model or show explicitly in your analysis why the variable does not belong.

Furthermore, one needs to consider the purpose of the analysis. If the purpose is simply to come up with accurate predictions for the response, researchers tend to look for variables that are easily obtained and that account for a high degree of variation in the response.

However we choose to select our variables, we should always be wary of overinterpretation of the model in a multiple regression setting. Here's why: (1) The selected variables are not necessarily special. Variable selection methods are highly influenced by correlations between variables. Particularly when two predictors are highly correlated (R² > 0.8), usually one will be omitted despite the fact that the other may be a good predictor on its own. The problem is that since the two variables contain so much overlapping information, once you include one, the second variable accounts for very little additional variability in the response. (2) Interpretation of coefficients. If we have a regression coefficient of 0.2 for variable A, the interpretation is as follows: "While holding the values of all other predictors constant, a 1-unit increase in the value of A is associated with an increase of 0.2 in the expected value of the response." (3) Lastly, for observational studies, causality is rarely implied.

If the dimension p of the covariate X is large, then we might get better predictions by omitting some covariates. Models with many covariates have low bias but high variance; models with few covariates have high bias but low variance. The best predictions come from balancing these two extremes. This is called the bias-variance tradeoff. To reiterate:

including many covariates leads to low bias and high variance

including few covariates leads to high bias and low variance

The problem of deciding which variables to include in the regression model to achieve a good tradeoff is called model selection or variable selection.

It is convenient in model selection to first standardize all the variables by subtracting off the mean and dividing by the standard deviation. For example, we replace x_ij with (x_ij − x̄_j)/s_j where x̄_j = n^{−1} Σ_{i=1}^n x_ij is the mean of covariate x_j and s_j is the standard deviation. The R function scale will do this for you. Thus, we assume throughout this section that


(1/n) Σ_{i=1}^n y_i = 0,   (1/n) Σ_{i=1}^n y_i² = 1     (56)

(1/n) Σ_{i=1}^n x_ij = 0,   (1/n) Σ_{i=1}^n x_ij² = 1,   j = 1, . . . , p.     (57)

Given S ⊂ {1, . . . , p}, let (X_j : j ∈ S) denote a subset of the covariates. There are 2^p such subsets. Let β(S) = (β_j : j ∈ S) denote the coefficients of the corresponding set of covariates and let β̂(S) = (X_S^T X_S)^{−1} X_S^T Y denote the least squares estimate of β(S), where X_S denotes the design matrix for this subset of covariates. Thus, β̂(S) is the least squares estimate of β(S) from the submodel Y = X_S β(S) + ε. The vector of predicted values from model S is Ŷ(S) = X_S β̂(S). For the null model S = ∅, Ŷ is defined to be a vector of 0's. Let r̂_S(x) = Σ_{j ∈ S} β̂_j(S) x_j denote the estimated regression function for the submodel. We measure the predictive quality of the model via the prediction risk.

The prediction risk of the submodel S is defined to be

R(S) = (1/n) Σ_{i=1}^n E(Ŷ_i(S) − Y*_i)²     (58)

where Y*_i = r(X_i) + ε*_i denotes the value of a future observation of Y at covariate value X_i.

Ideally, we want to select a submodel S to make R(S) as small as possible. We face two problems:

• estimating R(S)

• searching through all the submodels S

8.4 The Bias-Variance Tradeoff

All results in this subsection are calculated conditionally on X1, X2, . . . , Xn.

Before discussing the estimation of the prediction risk, we recall an important result.


Bias-Variance Decomposition of the Prediction Risk

R_av(S) = σ² + (1/n) Σ_{i=1}^n b_i² + (1/n) Σ_{i=1}^n v_i     (59)

(unavoidable error + squared bias + variance), where b_i = E(r̂_S(X_i) | X_i = x_i) − r(x_i) is the bias and v_i = V(r̂_S(X_i) | X_i = x_i) is the variance.

Let us look at the bias-variance tradeoff in some simpler settings. In both settings we induce a bias by choosing an estimator that is closer to zero than the observed data. Surprisingly, this bias can yield an estimator with smaller MSE than the unbiased estimator.

8.1 Example. Suppose that we observe a single observation Y ∼ N(µ, σ²). The minimum variance unbiased estimator of µ is Y. Now consider the estimator µ̂ = αY where 0 ≤ α ≤ 1. The bias is E(µ̂) − µ = (α − 1)µ, the variance is σ²α², and the mean squared error is

MSE = bias² + variance = (1 − α)²µ² + σ²α².     (60)

Notice that the bias increases and the variance decreases as α → 0. Conversely, the bias decreases and the variance increases as α → 1. The optimal estimator is obtained by taking α = µ²/(σ² + µ²). The simple estimator Y did not produce the minimum MSE.

8.2 Example. Consider the following model:

Y_i ∼ N(µ_i, σ²),   i = 1, . . . , p.     (61)

We want to estimate µ = (µ1, . . . , µp)^T. Fix 1 ≤ k ≤ p and let

µ̂_i = Y_i  if i ≤ k,   µ̂_i = 0  if i > k.     (62)

Because the first k terms have no bias and the last p − k terms have no variance, the MSE is

MSE = Σ_{i=k+1}^p µ_i² + kσ².     (63)

As k increases the bias term decreases and the variance term increases. Using the fact that E(Y_i² − σ²) = µ_i², which matches the bias term above, we can form an unbiased estimate of the risk, namely,

M̂SE = Σ_{i=k+1}^p (y_i² − σ²) + kσ²
     = Σ_{i=k+1}^p y_i² + 2kσ² − pσ².


(Note RSS = Σ_{i=k+1}^p y_i².) Assuming σ² is known, we can estimate the optimal choice of k by minimizing

RSS + 2kσ²     (64)

over k. While admittedly weird, this example shows that the estimated risk equals the observed error (RSS) plus a term that increases with model complexity (k).
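Here is a small simulation of Example 8.2 (the µ_i's, p, and σ are invented for illustration, and the means are assumed to be sorted so that the largest come first) that picks k by minimizing RSS + 2kσ².

## Sketch: choosing k in the normal means problem by minimizing RSS + 2*k*sigma^2.
set.seed(1)
p     <- 20
mu    <- c(5, 4, 3, 2, 1, rep(0, p - 5))   # made-up means, ordered by size
sigma <- 1
y     <- rnorm(p, mean = mu, sd = sigma)
k     <- 0:p
rss   <- sapply(k, function(kk) if (kk < p) sum(y[(kk + 1):p]^2) else 0)
score <- rss + 2 * k * sigma^2             # estimated risk, up to a constant
k[which.min(score)]                        # estimated optimal k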

8.5 Risk Estimation and Model Scoring

An obvious candidate to estimate R(S) is the training error

R̂_tr(S) = (1/n) Σ_{i=1}^n (Ŷ_i(S) − Y_i)².     (65)

For the null model S = ∅, Ŷ_i = 0, i = 1, . . . , n, and R̂_tr(S) is an unbiased estimator of R(S), and this is the risk estimator we will use for this model. But in general, this is a poor estimator of R(S) because it is very biased. Indeed, if we add more and more covariates to the model, we can track the data better and better and make R̂_tr(S) smaller and smaller. For example, in the previous example R̂_tr(S) = 0 if k = p. Thus if we used R̂_tr(S) for model selection we would be led to include every covariate in the model.

8.3 Theorem. The training error is a downward-biased estimate of the prediction risk, meaning that E(R̂_tr(S)) < R(S). In fact,

bias(R̂_tr(S)) = E(R̂_tr(S)) − R(S) = −(2/n) Σ_{i=1}^n Cov(Ŷ_i, Y_i).     (66)

Now we discuss some better estimates of risk. For each one we obtain an estimate of risk that can be approximately expressed in the form

R̂_tr(S) + penalty(S).

One picks the model that yields the minimum value. The first term decreases, while the second term increases with model complexity. A challenge for most estimators of risk is that they require an estimate of σ².

Mallows' Cp

Mallows' Cp statistic is defined by

R̂(S) = R̂_tr(S) + 2|S|σ̂²/n     (67)

where |S| denotes the number of terms in S and σ̂² is the estimate of σ² obtained from the full model (with all covariates in the model). This is simply the training error plus a bias correction.


This estimate is named in honor of Colin Mallows, who invented it. The first term in (67) measures the fit of the model while the second measures the complexity of the model. Think of the Cp statistic as:

lack of fit + complexity penalty. (68)

The disadvantage of Cp is that we need to supply an estimate of σ.

leave-one-out cross-validation

Another method for estimating risk is leave-one-out cross-validation.

The leave-one-out cross-validation (CV) estimator of risk is

R̂_CV(S) = (1/n) Σ_{i=1}^n (Y_i − Ŷ_(i))²     (69)

where Ŷ_(i) is the prediction for Y_i obtained by fitting the model with the ith observation omitted. It can be shown that

R̂_CV(S) = (1/n) Σ_{i=1}^n ( (Y_i − Ŷ_i(S)) / (1 − H_ii(S)) )²     (70)

where H_ii(S) is the ith diagonal element of the hat matrix

H(S) = X_S (X_S^T X_S)^{−1} X_S^T.     (71)

From equation (70) it follows that we can compute the leave-one-out cross-validation estimator without actually dropping out each observation and refitting the model. An important advantage of cross-validation is that it does not require an estimate of σ.
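In R the shortcut in (70) is one line, since hatvalues() returns the H_ii's; the sketch below assumes some already-fitted lm object called fit.

## Sketch: leave-one-out CV risk estimate without refitting, using (70).
loocv <- function(fit) {
  h <- hatvalues(fit)                # diagonal of the hat matrix
  mean((resid(fit) / (1 - h))^2)     # (1/n) sum ((Y_i - Yhat_i)/(1 - H_ii))^2
}
# Example (hypothetical): loocv(lm(y ~ x1 + x2, data = mydata))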

We can relate CV to Cp as follows. First, approximate each H_ii(S) with the average value n^{−1} Σ_{i=1}^n H_ii(S) = trace(H(S))/n = |S|/n. This yields

R̂_CV(S) ≈ (1/n) RSS(S) / (1 − |S|/n)².     (72)

The right hand side of (72) is called the generalized cross validation (GCV) score and will come up again later. Next, use the fact that 1/(1 − x)² ≈ 1 + 2x and conclude that

R̂_CV(S) ≈ R̂_tr(S) + 2σ̃²|S|/n     (73)

where σ̃² = RSS(S)/n. This is identical to Cp except that the estimator of σ² is different.


Akaike Information Criterion

Another criterion for model selection is AIC (Akaike Information Criterion). The idea is to choose S to maximize

ℓ_S − |S|,     (74)

or minimize

−2ℓ_S + 2|S|,

where ℓ_S = ℓ_S(β̂_S, σ̂²) is the log-likelihood (assuming Normal errors) of the model evaluated at the MLE. This can be thought of as "goodness of fit" minus "complexity." Assuming Normal errors,

ℓ(β, σ²) = constant − (n/2) log σ² − (1/(2σ²)) ||Y − Xβ||².

Define RSS(S) as the residual sum of squares in model S. Inserting β̂ yields

ℓ(β̂, σ²) = constant − (n/2) log σ² − RSS(S)/(2σ²).

In this expression we can ignore (n/2) log σ² because it does not include any terms that depend on the fit of model S. Thus up to a constant we can write

AIC(S) = RSS(S)/σ̂² + 2|S|.     (75)

Equivalently AIC finds the model that minimizes

RSS(S)/n + 2|S|σ̂²/n.     (76)

If we estimate σ using the error from the largest model, then minimizing AIC is equivalent to minimizing Mallows' Cp.

Bayesian information criterion

Yet another criterion for model selection is BIC (Bayesian information criterion). Here we choose a model to maximize

BIC(S) = ℓ_S − (|S|/2) log n = −(n/2) log(RSS(S)/n) − (|S|/2) log n.     (77)

The BIC score has a Bayesian interpretation. Let S = {S1, . . . , Sm}, where m = 2^p, denote all the models. Suppose we assign the prior P(Sj) = 1/m over the models. Also, assume we put a smooth prior on the parameters within each model. It can be shown that the posterior probability for a model is approximately

P(Sj | data) ≈ e^{BIC(Sj)} / Σ_r e^{BIC(Sr)}.     (78)


Hence, choosing the model with the highest BIC is like choosing the model with the highest posterior probability. But this interpretation is poor unless n is large relative to p. The BIC score also has an information-theoretic interpretation in terms of something called minimum description length. The BIC score is identical to AIC except that it puts a more severe penalty on complexity. It thus leads one to choose a smaller model than the other methods.
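For a fitted lm object these scores are directly available; the sketch below assumes some already-fitted model called out and shows the built-in AIC/BIC functions and how step() switches between the AIC and BIC penalties via its k argument.

## Sketch: AIC and BIC for a fitted linear model, and BIC-penalized stepwise search.
# out <- lm(y ~ x1 + x2 + x3, data = mydata)   # hypothetical fit
AIC(out)                 # -2*loglik + 2*(number of estimated parameters)
BIC(out)                 # -2*loglik + log(n)*(number of estimated parameters)
n <- nrow(model.frame(out))
step(out, k = 2)         # stepwise search scored by AIC (the default)
step(out, k = log(n))    # the same search scored by BIC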

Summary

• Cp:  R̂(S) = R̂_tr(S) + 2|S|σ̂²_full / n.

• CV:  R̂(S) ≈ R̂_tr(S) + 2|S|σ̂²_S / n.

• AIC:  −2ℓ(S) + 2|S|.

• −2·BIC:  −2ℓ(S) + |S| log n.

Note: the key term in ℓ(S) is R̂_tr(S), so each of these methods has a similar form. They vary in how they estimate σ² and how substantial a penalty is paid for model complexity.

8.6 Model Search

Once we choose a model selection criterion, such as cross-validation or AIC, we then need to search through all 2^p models, assign a score to each one, and choose the model with the best score. We will consider 4 methods for searching through the space of models:

1. Fit all submodels.

2. Forward stepwise regression.

3. Ridge Regression.

4. The Lasso.

Fitting All Submodels. If p is not too large we can do a complete search over all the models.

8.4 Example. Consider the SAT data but let us only consider three variables: Public, Expenditure, and Rank. There are 8 possible submodels.

Here, x is a matrix of explanatory variables. (Do not include a column of 1's.) You can also use the "nbest=" option, for example,

out = leaps(x,y,method="Cp",nbest=10)


This will report only the best 10 subsets of each size model. The output is a list with several components. In particular, out$which shows which variables are in the model, out$size shows how many parameters are in the model, and out$Cp shows the Cp statistic.

> library(leaps)
> x = cbind(expend, public, rank)
> out = leaps(x, sat, method = "Cp")
> out$which
      1     2     3
1 FALSE FALSE  TRUE
1 FALSE  TRUE FALSE
1  TRUE FALSE FALSE
2  TRUE FALSE  TRUE
2 FALSE  TRUE  TRUE
2  TRUE  TRUE FALSE
3  TRUE  TRUE  TRUE

$label
[1] "(Intercept)" "1" "2" "3"

$size
[1] 2 2 2 3 3 3 4

$Cp
[1]  19.36971 241.68429 242.40923  12.34101  16.84098 243.17995   4.00000

The best model is the one with the lowest Mallows' Cp. Here, that model is the one containing all three covariates.

Stepwise. When p is large, searching through all 2^p models is infeasible. In that case we need to search over a subset of all the models. One common method is to use stepwise regression. Stepwise regression can be run forward, backward, or in both directions.

In forward stepwise regression, we start with no covariates in the model. We then add the one variable that leads to the best score. We continue adding variables one at a time this way. See Figure 23. Backwards stepwise regression is the same except that we start with the biggest model and drop one variable at a time. Both are greedy searches; neither is guaranteed to find the model with the best score. Backward selection is infeasible when p is larger than n since β̂ will not be defined for the largest model. Hence, forward selection is preferred when p is large.

8.5 Example. Figure 24 shows forward stepwise regression on a data set with 13 correlated predictors. The x-axis shows the order in which the variables entered. The y-axis is the cross-validation


Forward Stepwise Regression

1. For j = 1, . . . , p, regress Y on the jth covariate X_j and let R̂_j be the estimated risk. Set ĵ = argmin_j R̂_j and let S = {ĵ}.

2. For each j ∈ S^c, fit the regression model Y = β_j X_j + Σ_{s ∈ S} β_s X_s + ε and let R̂_j be the estimated risk. Set ĵ = argmin_{j ∈ S^c} R̂_j and update S ← S ∪ {ĵ}.

3. Repeat the previous step until all variables are in S or until it is not possible to fit the regression.

4. Choose the final model to be the one with the smallest estimated risk.

Figure 23: Forward stepwise regression.

score. We start with a null model and we find that adding x4 reduces the cross-validation score the most. Next we try adding each of the remaining variables to the model and find that x13 leads to the most improvement. We continue this way until all the variables have been added. The sequence of models chosen by the algorithm is

Ŷ = β̂4 X4                            S = {4}
Ŷ = β̂4 X4 + β̂13 X13                  S = {4, 13}
Ŷ = β̂4 X4 + β̂13 X13 + β̂3 X3          S = {4, 13, 3}
...                                   ...                (79)

The best overall model we find is the model with five variables {x4, x13, x3, x1, x11}, although the model with seven variables is essentially just as good.

8.6.1 Regularization: Ridge Regression and the Lasso.

Another way to deal with variable selection is to use regularization or penalization. Specifically, we define β̂ to minimize the penalized sums of squares

Q(β) = Σ_{i=1}^n (y_i − x_i^T β)² + λ pen(β)

where pen(β) is a penalty and λ ≥ 0 is a tuning parameter. The bigger λ, the bigger the penalty for model complexity. We consider three choices for the penalty:

L0 penalty:  ||β||_0 = #{j : β_j ≠ 0}

Figure 24: Forward stepwise regression on the 13 variable data.


L1 penalty:  ||β||_1 = Σ_{j=1}^p |β_j|

L2 penalty:  ||β||_2 = Σ_{j=1}^p β_j².

The L0 penalty would force us to choose estimates which make many of the β_j's equal to 0. But there is no way to minimize Q(β) without searching through all the submodels.

The L2 penalty is easy to implement. The estimate β̂ that minimizes

Σ_{i=1}^n (Y_i − Σ_{j=1}^p β_j X_ij)² + λ Σ_{j=1}^p β_j²

is called the ridge estimator. It can be shown that the estimator β̂ that minimizes the penalized sums of squares (assuming the features are standardized) is

β̂ = (X^T X + λI)^{−1} X^T Y,

where I is the identity. When λ = 0 we get the least squares estimate (low bias, high variance). When λ → ∞ we get β̂ = 0 (high bias, low variance).
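The closed form is easy to compute directly; the sketch below is a bare-bones version that assumes X is an n x p matrix of standardized predictors and y is a centered response.

## Sketch: ridge coefficients from the closed form (X^T X + lambda I)^{-1} X^T y.
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
}
# ridge_coef(X, y, lambda = 0) reproduces least squares (when X^T X is invertible).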

Ridge regression produces a linear estimator: β̂ = SY where

S = (X^T X + λI)^{−1} X^T

and Ŷ = HY where

H = X(X^T X + λI)^{−1} X^T.

For regression we keep track of two types of degrees of freedom: model df (p, the number of covariates), and degrees of freedom error (n − p − 1). As the model incorporates more covariates, it becomes more complex, fitting the data better and better and eventually over-fitting the data. For the remainder of the notes, when we say "effective degrees of freedom", we are always referring to an analog of the model degrees of freedom.

For regularized regression, the effective degrees of freedom is defined to be

df(λ) = trace(H).

When λ = 0 we have df(λ) = p (maximum complexity) and when λ → ∞, df(λ) → 0 (minimum complexity).

How do we choose λ? Recall that r̂_(−i)(x_i) = Ŷ_(−i), the leave-one-out fitted value, and the cross-validation estimate of predictive risk is

CV = Σ_{i=1}^n (y_i − r̂_(−i)(x_i))².


It can be shown that

CV = Σ_{i=1}^n ( (y_i − r̂(x_i)) / (1 − H_ii) )².

Thus we can choose λ to minimize CV.

An alternative criterion that is sometimes used is generalized cross validation, or GCV. This is just an approximation to CV where H_ii is replaced with its average: n^{−1} Σ_{i=1}^n H_ii. Thus,

GCV = Σ_{i=1}^n ( (Y_i − r̂(x_i)) / (1 − b) )²

where

b = (1/n) Σ_{i=1}^n H_ii = df(λ)/n.
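Both quantities are easy to compute from the ridge hat matrix; the following sketch assumes a standardized n x p predictor matrix X and a centered response y, and follows the GCV formula as written above (software implementations may scale GCV differently).

## Sketch: effective degrees of freedom and GCV for ridge regression.
ridge_gcv <- function(X, y, lambda) {
  H    <- X %*% solve(crossprod(X) + lambda * diag(ncol(X))) %*% t(X)
  df   <- sum(diag(H))                # df(lambda) = trace(H)
  yhat <- H %*% y
  b    <- df / length(y)
  gcv  <- sum(((y - yhat) / (1 - b))^2)
  c(df = df, gcv = gcv)
}
# One could evaluate ridge_gcv over a grid of lambda values and pick the minimizer.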

The problem with ridge regression is that we really haven't done variable selection because we haven't forced any β_j's to be 0. This is where the L1 penalty comes in.

The lasso estimator β̂(λ) is the value of β that solves:

min_{β ∈ R^p} ( Σ_{i=1}^n (y_i − x_i^T β)² + λ ||β||_1 )     (80)

where λ > 0 and ||β||_1 = Σ_{j=1}^p |β_j| is the L1 norm of the vector β.

The lasso is called basis pursuit in the signal processing literature. Equation (80) defines a convex optimization problem with a unique solution β̂(λ) that depends on λ. Typically, it turns out that many of the β̂_j(λ)'s are zero. (See Figure 25 for intuition on this process.) Thus, the lasso performs estimation and model selection simultaneously. The selected model, for a given λ, is

Ŝ(λ) = {j : β̂_j(λ) ≠ 0}.     (81)

The constant λ can be chosen by cross-validation. The estimator has to be computed numerically but this is a convex optimization and so can be solved quickly.

To see the difference in shrinkage and selection of terms in ridge versus lasso regression, compare Figures 26 and 27, respectively. For the chosen tuning parameters shown (selected by cross-validation), only three of the variables are included in the final lasso model (svi, lweight and cavol). In contrast, because ridge regression is not a model selection procedure, all of the terms are in the ridge model. The three chosen by lasso do have the largest coefficients in the ridge model.

What is special about the L1 penalty? First, this is the closest penalty to the L0 penalty that makes Q(β) convex. Moreover, the L1 penalty captures sparsity.


Figure 25: from Hastie et al. 2001


Figure 26: from Hastie et al. 2001


Figure 27: from Hastie et al. 2001

Digression on Sparsity. We would like our estimator β̂ to be sparse, meaning that most β̂_j's are zero (or close to zero). Consider the following two vectors, each of length p:

u = (1, 0, . . . , 0)
v = (1/√p, 1/√p, . . . , 1/√p).

Intuitively, u is sparse while v is not. Let us now compute the norms:

||u||_1 = 1     ||u||_2 = 1
||v||_1 = √p    ||v||_2 = 1.

So the L1 norm correctly captures sparseness.

Two related variable selection methods are forward stagewise regression and lars. In forward stagewise regression we first set Ŷ = (0, . . . , 0)^T and we choose a small, positive constant ε. Now we build the predicted values incrementally. Let Ŷ denote the current vector of predicted values. Find the current correlations c = c(Ŷ) = X^T(Y − Ŷ) and set

ĵ = argmax_j |c_j|.     (82)

Finally, we update Ŷ by the following equation:

Ŷ ← Ŷ + ε sign(c_ĵ) x_ĵ.     (83)

This is like forward stepwise regression except that we only take small, incremental steps towards the next variable and we do not go back and refit the previous variables by least squares.

A modification of forward stagewise regression is called least angle regression (lars). We begin with all coefficients set to 0 and then find the predictor x_j most correlated with Y. Then we increase β_j in the direction of the sign of its correlation with Y and set ε = Y − Ŷ. When some other predictor x_k has as much correlation with ε as x_j has, we increase (β_j, β_k) in their joint least squares direction, until some other predictor x_m has as much correlation with the residual ε. Continue until all predictors are in the model. A formal description is in Figure 28.

lars can be easily modified to produce the lasso estimator. If a non-zero coefficient ever hits zero, remove it from the active set A of predictors and recompute the joint direction. This is why the lars function in R is used to compute the lasso estimator. You need to download the lars package first.


lars

1. Set Ŷ = 0, k = 0, A = ∅. Now repeat steps 2–3 until A^c = ∅.

2. Compute the following quantities:

c = X^T(Y − Ŷ),   C = max_j |c_j|,   A = {j : |c_j| = C},
s_j = sign(c_j) for j ∈ A,   X_A = (s_j x_j : j ∈ A),   G = X_A^T X_A,
B = (1^T G^{−1} 1)^{−1/2},   w = B G^{−1} 1,   u = X_A w,   a = X^T u,     (84)

where 1 is a vector of 1’s of length |A|.

3. Set

Ŷ ← Ŷ + γ̂ u     (85)

where

γ̂ = min⁺_{j ∈ A^c} { (C − c_j)/(B − a_j), (C + c_j)/(B + a_j) }.     (86)

Here, min+ means that the minimum is only over positive components.

Figure 28: A formal description of lars.


Summary

1. The prediction risk R(S) = n^{−1} Σ_{i=1}^n (Ŷ_i(S) − Y*_i)² can be decomposed into unavoidable error, bias and variance.

2. Large models have low bias and high variance. Small models have high bias and low variance. This is the bias-variance tradeoff.

3. Model selection methods aim to find a model which balances bias and variance, yielding a small risk.

4. Cp or cross-validation are used to estimate the risk.

5. Search methods look through a subset of models and find the one with the smallest value of estimated risk R̂(S).

6. The lasso estimates β with the penalized residual sums of squares Σ_{i=1}^n (Y_i − X_i^T β)² + λ||β||_1. Some of the estimates will be 0 and this corresponds to omitting them from the model. lars is an efficient algorithm for computing the lasso estimates.

8.6.2 Model selection on SAT data

Forward Stepwise Regression

> step(lm(sat ~ log(takers) + rank), scope = list(lower = sat ~ log(takers) + rank,
+   upper = sat ~ log(takers) + rank + expend + years + income + public), direction = "forward")
Start: AIC=346.7
sat ~ log(takers) + rank

         Df Sum of Sq   RSS AIC
+ expend  1     13149 32380 332
+ years   1      9827 35703 337
<none>                45530 347
+ income  1      1305 44224 347
+ public  1        16 45514 349

Step: AIC=331.66
sat ~ log(takers) + rank + expend

         Df Sum of Sq   RSS AIC
+ years   1      5744 26637 324
<none>                32380 332
+ public  1       421 31959 333
+ income  1       317 32063 333

Step: AIC=323.9
sat ~ log(takers) + rank + expend + years

         Df Sum of Sq     RSS   AIC
<none>                26636.8 323.9
+ income  1      26.6 26610.2 325.9
+ public  1       4.6 26632.2 325.9

Call:
lm(formula = sat ~ log(takers) + rank + expend + years)

Coefficients:
(Intercept)  log(takers)         rank       expend        years
    388.425      -38.015        4.004        2.423       17.857

With forward stepwise regression, we can define the scope of the model to ensure that any variables that we wish to always include will be in the model, regardless of which variables would otherwise have been selected. Above, we start with the model sat ~ log(takers) + rank, and the algorithm picks variables to add one by one. At any given step, forward stepwise regression will select the variable that gives the greatest decrease in the AIC criterion. If no variable will decrease the AIC (or we have reached the upper scope of the model), stepwise regression will stop.

Above, we started with an AIC of 346.7 with our lower scope model of sat ~ log(takers) + rank (Fig. 29). By adding the variable 'expend', the AIC dropped to 331.66. Further adding 'years' lowered the AIC to 323.9. Given the presence of these four variables, neither the addition of income nor of public would have resulted in a reduction in AIC. As such, the algorithm stopped.

par(mfrow = c(1,1))
AIC = c(346.7, 331.66, 323.9)
omitted = c(325.85, 327.8)
plot(1:3, AIC, xlim = c(1,5), type = "l", xaxt = "n", xlab = " ",
     main = "Forward Stepwise AIC Plot")
points(1:3, AIC, pch = 19)
points(4:5, omitted)
axis(1, at = 1:5, labels = c("log(takers)+rank", "expend", "years", "income", "public"))
abline(h = 323.9, lty = 2)
# Descriptions of most graphics functions used above can be found in
# help(par) and help(axis)

As seen in the graph, the addition of income and public would have increased the AIC criterion.

Backward Stepwise Regression


Figure 29: AIC

Backward stepwise regression works like forward stepwise but in reverse. We generally start with a full model containing all possible variables, and remove one variable at a time until the AIC is minimized. With simple data sets, forward and backward regression often find the same model. One drawback to backward regression is that n must be larger than p. For example, we can't start with a model with 25 variables and 20 observations.

> full = lm(sat ~ log(takers) + rank + expend + years + income + public)
> minimum = lm(sat ~ log(takers) + rank)
> step(full, scope = list(lower = minimum, upper = full), direction = "backward")
Start: AIC=327.8
sat ~ log(takers) + rank + expend + years + income + public

         Df Sum of Sq   RSS AIC
- public  1        25 26610 326
- income  1        47 26632 326
<none>                26585 328
- years   1      4589 31174 334
- expend  1      6264 32850 336

Step: AIC=325.85
sat ~ log(takers) + rank + expend + years + income

         Df Sum of Sq   RSS AIC
- income  1        27 26637 324
<none>                26610 326
- years   1      5453 32063 333
- expend  1      7430 34040 336

Step: AIC=323.9
sat ~ log(takers) + rank + expend + years

         Df Sum of Sq   RSS AIC
<none>                26637 324
- years   1      5744 32380 332
- expend  1      9066 35703 337

Call:
lm(formula = sat ~ log(takers) + rank + expend + years)

Coefficients:
(Intercept)  log(takers)         rank       expend        years
    388.425      -38.015        4.004        2.423       17.857

"Both" Stepwise Regression

While forward and backward stepwise regression are both "greedy" algorithms, a third option is to run stepwise regression in both directions. It essentially toggles between: (1) one step of forward selection, and (2) one step of backward selection. As before, a step is only performed if it lowers AIC; otherwise it is skipped. The algorithm stops if two consecutive steps are skipped. Note that while "both" stepwise regression is not greedy, we are not guaranteed to find the model with the lowest AIC. It could be that the "best" model would require exchanging sets of multiple variables, but stepwise can only move one step at a time. In a sense, we could find a local minimum, but not the global minimum.

step(minimum, scope = list(lower = minimum, upper = full), direction = "both")
step(full, scope = list(lower = minimum, upper = full), direction = "both")

For this particular data set, "both" stepwise regression finds the same model whether we start with the minimum or the full model. In fact, it is the same model found by both forward and backward stepwise regression.

Ridge Regression

> library(MASS)
> ltakers = log(takers)
> predictors = cbind(ltakers, income, years, public, expend, rank)
> predictors = scale(predictors)
> sat.scaled = scale(sat)

The function for ridge regression (lm.ridge) is contained in the MASS library. For simplicity, I column-bind the predictors into a single matrix. Also, ridge regression requires the variables to be scaled.


> lambda = seq(0, 10, by=0.25)
> length(lambda)
[1] 41
> out = lm.ridge(sat.scaled ~ predictors, lambda = lambda)
> round(out$GCV, 5)
   0.00    0.25    0.50    0.75    1.00    1.25    1.50    1.75    2.00    2.25
0.00274 0.00271 0.00269 0.00268 0.00267 0.00266 0.00266 0.00265 0.00265 0.00265
   2.50    2.75    3.00    3.25    3.50    3.75    4.00    4.25    4.50    4.75
0.00265 0.00265 0.00265 0.00266 0.00266 0.00266 0.00267 0.00267 0.00268 0.00268
   5.00    5.25    5.50    5.75    6.00    6.25    6.50    6.75    7.00    7.25
0.00269 0.00269 0.00270 0.00270 0.00271 0.00271 0.00272 0.00273 0.00274 0.00274
   7.50    7.75    8.00    8.25    8.50    8.75    9.00    9.25    9.50    9.75
0.00275 0.00276 0.00276 0.00277 0.00278 0.00279 0.00280 0.00280 0.00281 0.00282
  10.00
0.00283
> which(out$GCV == min(out$GCV))
2.25
  10

A primary feature of the lm.ridge function is its ability to accept a vector of lambda penalization values. Above, the lambda vector is created as a sequence from 0 to 10 by increments of 0.25, totaling 41 elements. We will choose the model whose lambda value minimizes the generalized cross validation (GCV) value. We see above that the lambda value which minimizes GCV is 2.25, our 10th element (Fig. 30).

> dim(out$coef)
[1]  6 41
> round(out$coef[,10], 4)
predictorsltakers  predictorsincome   predictorsyears  predictorspublic
          -0.4771            0.0223            0.1796           -0.0028
 predictorsexpend    predictorsrank
           0.1808            0.4195

In these models, we have 6 predictors and 41 values of lambda. The various model coefficients are listed in a 6 by 41 matrix, one column for each value of lambda. Above we found 2.25, the 10th element of the lambda vector, to minimize GCV, so we simply need to pull the 10th column from the coefficient matrix.

Ridge regression is unique in that it doesn't actually perform model selection. Instead, variables of lesser predictive importance will have coefficients that are lower in magnitude (which is why scaling was needed). Above, we see that income and public have coefficients that are close to 0 in magnitude, indicating that they are relatively unimportant given the presence of the other variables.

Log(takers) and rank, the variables that we have controlled for, have the largest coefficients. Lastly, years and expend have moderately large coefficients. These four variables were selected in each of the stepwise regression models.


Figure 30: GCV for Ridge regression


In the following plot we see how the values of the regression coefficients change with lambda (Fig. 31).

par(mfrow = c(1,1))
plot(lambda, out$coef[1,], type = "l", col = 1, xlab = "Lambda",
     ylab = "Coefficients",
     main = "Plot of Regression Coefficients vs. Lambda Penalty\nRidge Regression",
     ylim = c(min(out$coef), max(out$coef)))
abline(h = 0, lty = 2, lwd = 2)
abline(v = 2.25, lty = 2)
for(i in 2:6) points(lambda, out$coef[i,], type = "l", col = i)

Lasso

> library(lars)
> object = lars(x = predictors, y = sat.scaled)
> object

Call:
lars(x = predictors, y = sat.scaled)
R-squared: 0.892
Sequence of LASSO moves:
     ltakers rank years expend income public
Var        1    6     3      5      2      4
Step       1    2     3      4      5      6
> plot(object)
> object$Cp
         0          1          2          3          4          5          6
349.908084 103.404397  46.890645  35.639307   3.101503   5.089719   7.000000
attr(,"sigma2")
60.1231440
attr(,"n")
[1] 50
> plot.lars(object, xvar="df", plottype="Cp")

In lasso regression, the coefficients vary as a function of the tuning parameter. For λ near 0, the coefficients are set to zero. Even as the terms enter the model they are attenuated (shrunk towards zero) until λ = 1 (Fig. 32). We seek the model that minimizes the Mallows' Cp criterion. The order in which the variables are added is listed above after 'Sequence of LASSO moves'. This shows us that ltakers was added first, then rank, then years, etc. By examining 'object$Cp', we can see the Cp after each of the steps; see also Figure 33. The lowest Cp is found after 4 steps, that is, after ltakers, rank, years, and expend. This is consistent with the models found previously (in stepwise and ridge regression).


Figure 31: ridge regression coefficients


Figure 32: lasso coefficients

91

Page 92: Linear Regression

Figure 33: Mallows Cp versus degrees of freedom

92

Page 93: Linear Regression

8.7 Variable Selection versus Hypothesis Testing

The difference between variable selection and hypothesis testing can be confusing. Look at a simple example. Let

Y1, . . . , Yn ∼ N(µ, 1).

We want to compare two models:

M0 : N(0, 1), and M1 : N(µ, 1).

Hypothesis Testing. We test H0 : µ = 0 versus H1 : µ ≠ 0.

The test statistic is
$$Z = \frac{\bar Y - 0}{\sqrt{\mathbb{V}(\bar Y)}} = \sqrt{n}\,\bar Y.$$
We reject H0 if |z| > z_{α/2}. For α = 0.05, we reject H0 if |z| > 2, i.e., if
$$|\bar y| > \frac{2}{\sqrt{n}}.$$

AIC. The likelihood is proportional to
$$L(\mu) = \prod_{i=1}^n e^{-(y_i-\mu)^2/2} = e^{-n(\bar y-\mu)^2/2}\, e^{-n s^2/2}$$
where $s^2 = n^{-1}\sum_i (y_i - \bar y)^2$. Hence,
$$\ell(\mu) = -\frac{n(\bar y-\mu)^2}{2} - \frac{n s^2}{2}.$$
Recall that $\mathrm{AIC} = \ell_S - |S|$. The AIC scores are
$$\mathrm{AIC}_0 = \ell(0) - 0 = -\frac{n\bar y^2}{2} - \frac{n s^2}{2}$$
and
$$\mathrm{AIC}_1 = \ell(\hat\mu) - 1 = -\frac{n s^2}{2} - 1$$
since $\hat\mu = \bar y$. We choose model 1 if $\mathrm{AIC}_1 > \mathrm{AIC}_0$, that is, if
$$-\frac{n s^2}{2} - 1 > -\frac{n\bar y^2}{2} - \frac{n s^2}{2},$$
or
$$|\bar y| > \frac{\sqrt{2}}{\sqrt{n}}.$$
This is similar to, but not the same as, the hypothesis test.

BIC. The BIC scores are
$$\mathrm{BIC}_0 = \ell(0) - \frac{0}{2}\log n = -\frac{n\bar y^2}{2} - \frac{n s^2}{2}$$
and
$$\mathrm{BIC}_1 = \ell(\hat\mu) - \frac{1}{2}\log n = -\frac{n s^2}{2} - \frac{1}{2}\log n.$$
We choose model 1 if $\mathrm{BIC}_1 > \mathrm{BIC}_0$, that is, if
$$|\bar y| > \sqrt{\frac{\log n}{n}}.$$

Summary

Hypothesis testing controls type I errors.
AIC/CV/Cp finds the most predictive model.
BIC finds the true model (with high probability).
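A minimal simulation sketch of these three cutoffs (an assumed illustration, not from the original notes):

# Compare the three decision rules for M0 vs M1 on simulated data.
set.seed(1)
n = 50; mu = 0.3
y = rnorm(n, mean = mu, sd = 1)
ybar = mean(y)
abs(ybar) > 2/sqrt(n)          # hypothesis test (alpha = 0.05): reject H0?
abs(ybar) > sqrt(2)/sqrt(n)    # AIC prefers M1?
abs(ybar) > sqrt(log(n)/n)     # BIC prefers M1?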


9 Nonlinear Regression

We can fit regression models when the regression is nonlinear:

Yi = r(Xi; β) + εi

where the regression function r(x; β) is a known function except for some parameters β =(β1, . . . , βk).

9.1 Example. Figure 34 shows the weight of a patient on a weight rehabilitation program as a function of the number of days in the program. The data are from Venables and Ripley (1994). It is hypothesized that

$$Y_i = r(x_i;\beta) + \varepsilon_i,$$
where
$$r(x;\beta) = \beta_0 + \beta_1\, 2^{-x/\beta_2}.$$
Since
$$\lim_{x\to\infty} r(x;\beta) = \beta_0,$$
we see that β0 is the ideal stable lean weight. Also,
$$r(0;\beta) - r(\infty;\beta) = \beta_1,$$
so β1 is the amount of weight to be lost. Finally, the expected remaining weight r(x; β) − β0 is one-half the starting remaining weight r(0; β) − β0 when x = β2. So β2 is the half-life, i.e., the time to lose half the remaining weight.

The parameter estimate is found by minimizing
$$\mathrm{RSS} = \sum_{i=1}^n (y_i - r(x_i;\beta))^2.$$
Generally, this must be done numerically. The algorithms are iterative and you must supply starting values for the parameters. Here is how to fit the example in R.

> library(MASS)
> attach(wtloss)
> help(wtloss)

Description

The data frame gives the weight, in kilograms, of an obese patient at52 time points over an 8 month period of a weight rehabilitation programme.

Format


This data frame contains the following columns:

Days
    Time in days since the start of the programme.
Weight
    Weight in kilograms of the patient.

> plot(Days,Weight,pch=19)
> out = nls(Weight ~ b0 + b1*2^(-Days/b2), data=wtloss,
+           start=list(b0=90, b1=95, b2=120))
> info = summary(out)
> print(info)

Formula: Weight ~ b0 + b1 * 2^(-Days/b2)

Parameters:
   Estimate Std. Error t value Pr(>|t|)
b0   81.374      2.269   35.86   <2e-16 ***
b1  102.684      2.083   49.30   <2e-16 ***
b2  141.911      5.295   26.80   <2e-16 ***
---

Residual standard error: 0.8949 on 49 degrees of freedom

Correlation of Parameter Estimates:
        b0      b1
b1 -0.9891
b2 -0.9857  0.9561

> b = info$parameters[,1]
> grid = seq(0,250,length=1000)
> fit = b[1] + b[2]*2^(-grid/b[3])
> lines(grid,fit,lty=1,lwd=3,col=2)
> plot(Days,info$residuals)
> lines(Days,rep(0,length(Days)))
> dev.off()

The fit and residuals are shown in Figure 34.
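The interpretation of the parameters also suggests rough starting values automatically. A minimal sketch (an assumed illustration, not from the notes; it reuses the attached wtloss variables Days and Weight):

b0.start = min(Weight)                  # rough guess at the stable lean weight
b1.start = max(Weight) - min(Weight)    # rough guess at the weight to be lost
b2.start = max(Days)/2                  # rough guess at the half-life
out2 = nls(Weight ~ b0 + b1*2^(-Days/b2), data = wtloss,
           start = list(b0 = b0.start, b1 = b1.start, b2 = b2.start))
coef(out2)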

Figure 34: Weight Loss Data

10 Logistic Regression

10.1 Example. Our first example concerns the probability of extinction as a function of island size (Sleuth case study 21.1). The data provide island size, number of bird species present in 1949 and the number of these extinct by 1959.

island            area  atrisk  extinctions
Ulkokrunni      185.80      75            5
Maakrunni       105.80      67            3
Ristikari        30.70      66           10
Isonkivenletto    8.50      51            6
Hietakraasukka    4.80      28            3
Kraasukka         4.50      20            4
Lansiletto        4.30      43            8
Pihlajakari       3.60      31            3
Tyni              2.60      28            5
Tasasenletto      1.70      32            6
Raiska            1.20      30            8
Pohjanletto       0.70      20            2
Toro              0.70      31            9
Luusiletto        0.60      16            5
Vatunginletto     0.40      15            7
Vatunginnokka     0.30      33            8
Tiirakari         0.20      40           13
Ristikarenletto   0.07       6            3

Let ni be the number of species at risk and Yi be the number of extinctions out of ni. We assume Yi ∼ Binomial(ni, pi), where pi is a function of the area. Define xi = log(areai). Plotting pi = Yi/ni as a function of xi we see an s-shaped decline in the response variable, but log[pi/(1 − pi)] declines linearly with xi (Display 21.2). This example motivates logistic regression.
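As a sketch (assumed, not from the notes), the empirical proportions and logits described here can be computed directly, assuming the table has been read into a data frame called birds with columns area, atrisk and extinctions:

p = birds$extinctions / birds$atrisk
x = log(birds$area)
par(mfrow = c(1,2))
plot(x, p, xlab = "log(area)", ylab = "proportion extinct")
plot(x, log(p/(1 - p)), xlab = "log(area)", ylab = "empirical logit")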

Logistic regression is a generalization of regression that is used when the outcome Y is binary or binomial. We start with binary data (ni = 1), which is most common in practice. In a later section we revisit the binomial with ni > 1.

Suppose that Yi ∈ {0, 1} and we want to relate Y to some covariate x. The usual regression model is not appropriate since it does not constrain Y to be binary.

With the logistic regression model we assume that
$$E(Y_i\mid X_i) = P(Y_i = 1\mid X_i) = \frac{e^{\beta_0+\beta_1 X_i}}{1 + e^{\beta_0+\beta_1 X_i}}.$$
Note that since Yi is binary, E(Yi|Xi) = P(Yi = 1|Xi). We assume this probability follows the logistic function e^{β0+β1x}/(1 + e^{β0+β1x}). The parameter β1 controls the steepness of the curve. The parameter β0 controls the horizontal shift of the curve (see Fig 36).

Figure 35:

Figure 36: The logistic function p = e^x/(1 + e^x).

Figure 37: From Sleuth

Define the logit function
$$\mathrm{logit}(z) = \log\left(\frac{z}{1-z}\right).$$
Also, define
$$\pi_i = P(Y_i = 1\mid X_i).$$
Then we can rewrite the logistic model as
$$\mathrm{logit}(\pi_i) = \beta_0 + \beta_1 X_i.$$
The extension to several covariates is straightforward:
$$\mathrm{logit}(\pi_i) = \beta_0 + \sum_{j=1}^p \beta_j x_{ij} = x_i^T\beta.$$
The logit is the log odds function. Exponentiating the logit yields the odds, so the odds of a Y = 1 response at X = x are e^{β0+β1x}. The ratio of the odds at X = A to the odds at X = B is the odds ratio, which simplifies to e^{β1(A−B)}. For instance, if A = B + 1 then the odds ratio is e^{β1}. In Fig 37, the nonlinearity of the probability is contrasted with the linear log odds. In this plot η = β0 + β1x. Increasing η by one unit anywhere on the horizontal axis multiplies the odds by the same factor. In contrast, the probability tends toward an asymptote of zero or one as η becomes very small or very large, respectively.

How do we estimate the parameters of the logistic regression function? Usually we use maximum likelihood. Let's review the basics of maximum likelihood. Let Y ∈ {0, 1} denote the outcome of a coin toss. We call Y a Bernoulli random variable. Let π = P(Y = 1) and 1 − π = P(Y = 0). The probability function is
$$f(y;\pi) = \pi^y(1-\pi)^{1-y}.$$
The probability function for n independent tosses, Y1, . . . , Yn, is
$$f(y_1,\ldots,y_n;\pi) = \prod_{i=1}^n f(y_i;\pi) = \prod_{i=1}^n \pi^{y_i}(1-\pi)^{1-y_i}.$$
The likelihood function is just the probability function regarded as a function of the parameter π, treating the data as fixed:
$$L(\pi) = \prod_{i=1}^n \pi^{y_i}(1-\pi)^{1-y_i}.$$
The maximum likelihood estimator, or MLE, is the value π̂ that maximizes L(π). Maximizing the likelihood is equivalent to maximizing the loglikelihood function
$$\ell(\pi) = \log L(\pi) = \sum_{i=1}^n \big(y_i\log\pi + (1-y_i)\log(1-\pi)\big).$$


Setting the derivative of ℓ(π) to zero yields
$$\hat\pi = \frac{\sum_{i=1}^n Y_i}{n}.$$
Thus the MLE for π is easily obtained. The MLE for β is not so readily obtained. Recall that the Fisher information is defined to be
$$I(\pi) = -E\left(\frac{\partial^2\ell(\pi)}{\partial\pi^2}\right).$$
The approximate standard error is
$$\mathrm{se}(\hat\pi) = \sqrt{\frac{1}{I(\hat\pi)}} = \sqrt{\frac{\hat\pi(1-\hat\pi)}{n}}.$$

Notice that the standard error of π̂ is a function of the mean π̂. Returning to logistic regression, the likelihood function is
$$L(\beta) = \prod_{i=1}^n f(y_i\mid X_i;\beta) = \prod_{i=1}^n \pi_i^{Y_i}(1-\pi_i)^{1-Y_i}$$
where
$$\pi_i = \frac{e^{X_i^T\beta}}{1 + e^{X_i^T\beta}}.$$

The maximum likelihood estimator β̂ has to be found numerically. The usual algorithm is called iteratively reweighted least squares and works as follows. First set starting values β^(0). Now, for k = 1, 2, . . . do the following steps until convergence:

1. Compute fitted values
$$\pi_i = \frac{e^{X_i^T\beta^{(k)}}}{1 + e^{X_i^T\beta^{(k)}}}, \qquad i = 1,\ldots,n.$$

2. Define an n × n diagonal weight matrix W whose ith diagonal element is πi(1 − πi).

3. Define the adjusted response vector
$$Z = X\beta^{(k)} + W^{-1}(Y - \pi)$$
where π^T = (π1, . . . , πn).

4. Take
$$\beta^{(k+1)} = (X^T W X)^{-1} X^T W Z,$$
which is the weighted linear regression of Z on X.


Iterating this until k is large yields β̂, the MLE.

The standard errors are given by
$$\hat{\mathbb{V}}(\hat\beta) = (X^T W X)^{-1}.$$
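To make the algorithm concrete, here is a minimal IRLS sketch in R (an assumed illustration, not from the notes). It takes a design matrix X whose first column is all 1's and a binary response vector y, runs a fixed number of iterations, and returns the coefficients and standard errors:

irls.logit = function(X, y, iters = 25) {
  beta = rep(0, ncol(X))                      # starting values beta^(0)
  for (k in 1:iters) {
    eta = X %*% beta                          # linear predictor
    pi  = as.vector(exp(eta)/(1 + exp(eta)))  # fitted probabilities
    W   = diag(pi*(1 - pi))                   # weight matrix
    Z   = eta + solve(W) %*% (y - pi)         # adjusted response
    beta = solve(t(X) %*% W %*% X, t(X) %*% W %*% Z)   # weighted LS update
  }
  list(coef = as.vector(beta),
       se = sqrt(diag(solve(t(X) %*% W %*% X))))       # diagonal of (X^T W X)^{-1}
}

The result should agree closely with glm(y ~ X - 1, family = binomial).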

10.2 Example. This example is drawn from the Sleuth text. An association was noted between keeping pet birds and increased risk of lung cancer. To study this further, a case-control study was conducted. Among patients under 65 years of age, 49 cases with lung cancer were identified. From the general population, 98 controls of similar age were selected. The investigators recorded Sex (1=F, 0=M), Age, Socioeconomic status (SS = high or low), years of smoking, average rate of smoking, and whether a pet bird was kept for at least 6 months during the 5 to 14 years prior to diagnosis (for cases) or examination (for controls) (BK=1). We call those with lung cancer cases and the others controls.

Age and smoking history are known to be risk factors for cancer. The question is whether BKis an additional risk factor. Figure 38 shows the number of years a subject smoked versus theirage. The plotting symbols show BK=1 (triangles) or BK=0 (circles). Symbols are filled if thesubject is a case. In this figure it is obvious that smoking is associated with cancer. To see therelationship with BK, look at the distribution of triangles over horizontal stripes. For instance,among the non-smokers, the only lung cancer case was a bird keeper.

The first step of the analysis is to find a good model for the relationship between lung cancer and the other covariates (excluding BK). To visualize this, bin years of smoking (0, 1-20, . . . , 41-50) and calculate the proportion of cases among the subjects in each bin. An empirical estimate of the logit versus binned years of smoking shows that the logit increases as years of smoking increases (plot not presented). Using the available covariates, and including potential interactions and quadratic terms, we explore models predicting case/control status with logistic regression. From this class of models we chose a simpler model that includes sex, age, SS and years of smoking.

The final step is to include birdkeeping in the model. Including BK leads to a drop in deviance of 11.29 (with one df), which is clearly significant (p-value of .0008). The estimated coefficient of birdkeeping is β̂ = 1.33. The odds of lung cancer are thus estimated to be e^1.33 = 3.8 times higher for those who keep birds than for those who do not. With a 95% CI for β of (0.533, 2.14), the odds of lung cancer are estimated to be between e^0.533 = 1.70 and e^2.14 = 8.47 times greater.

Scope of inference. These inferences apply to the Netherlands in 1985. Because this is an ob-servational study, no causal inferences can be drawn, but these analyses do control for the effects ofsmoking and age. In the publication they cite medical rationale supporting the statistical findings.

Note: This case/control study is also known as a retrospective study. When the response variable has a rare outcome, like lung cancer, it is common to sample the study subjects retrospectively. In this way we can oversample the population of people who have the rare outcome. In a random sample virtually everyone would not have lung cancer. Using a retrospective study we cannot estimate the probability of lung cancer, but we can estimate the log odds ratio.

Figure 38: From Sleuth

10.3 Example. The Coronary Risk-Factor Study (CORIS) data involve 462 males between the ages of 15 and 64 from three rural areas in South Africa. The outcome Y is the presence (Y = 1) or absence (Y = 0) of coronary heart disease. There are 9 covariates: systolic blood pressure, cumulative tobacco (kg), ldl (low density lipoprotein cholesterol), adiposity, famhist (family history of heart disease), typea (type-A behavior), obesity, alcohol (current alcohol consumption), and age. A logistic regression yields the following estimates and Wald statistics Wj for the coefficients:

Covariate     β̂j      se      Wj     p-value
Intercept   -6.145   1.300  -4.738    0.000
sbp          0.007   0.006   1.138    0.255
tobacco      0.079   0.027   2.991    0.003
ldl          0.174   0.059   2.925    0.003
adiposity    0.019   0.029   0.637    0.524
famhist      0.925   0.227   4.078    0.000
typea        0.040   0.012   3.233    0.001
obesity     -0.063   0.044  -1.427    0.153
alcohol      0.000   0.004   0.027    0.979
age          0.045   0.012   3.754    0.000

Are you surprised by the fact that systolic blood pressure is not significant or by the minus signfor the obesity coefficient? If yes, then you are confusing association and causation. The fact thatblood pressure is not significant does not mean that blood pressure is not an important cause ofheart disease. It means that it is not an important predictor of heart disease relative to the othervariables in the model.

Model selection can be done using AIC or BIC:
$$\mathrm{AIC}_S = -2\ell(\hat\beta_S) + 2|S|$$
where S is a subset of the covariates.

When ni = 1 it is not possible to examine residuals to evaluate the fit of our regression model. To fit this model in R we use the glm command, which stands for generalized linear model.

> attach(sa.data)
> out = glm(chd ~ ., family=binomial, data=sa.data)
> print(summary(out))

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.1482113  1.2977108  -4.738 2.16e-06 ***
sbp          0.0065039  0.0057129   1.138 0.254928
tobacco      0.0793674  0.0265321   2.991 0.002777 **
ldl          0.1738948  0.0594451   2.925 0.003441 **
adiposity    0.0185806  0.0291616   0.637 0.524020
famhist      0.9252043  0.2268939   4.078 4.55e-05 ***
typea        0.0395805  0.0122417   3.233 0.001224 **
obesity     -0.0629112  0.0440721  -1.427 0.153447
alcohol      0.0001196  0.0044703   0.027 0.978655
age          0.0452028  0.0120398   3.754 0.000174 ***
---

> out2 = step(out)

Start:  AIC= 492.14
chd ~ sbp + tobacco + ldl + adiposity + famhist + typea + obesity +
    alcohol + age

etc.

Step:  AIC= 487.69
chd ~ tobacco + ldl + famhist + typea + age

          Df Deviance    AIC
<none>         475.69 487.69
- ldl      1   484.71 494.71
- typea    1   485.44 495.44
- tobacco  1   486.03 496.03
- famhist  1   492.09 502.09
- age      1   502.38 512.38
>

> p = out2$fitted.values
> names(p) = NULL
> n = nrow(sa.data)
> predict = rep(0,n)
> predict[p > .5] = 1
> print(table(chd,predict))

   predict
chd   0   1
  0 256  46
  1  73  87
> error = sum( ((chd==1)&(predict==0)) | ((chd==0)&(predict==1)) )/n
> print(error)
[1] 0.2575758

From these results we see that the model predicts the wrong outcome just over 25% of the time in these data. If we use this model to predict the outcome in new data, we will find the predictions slightly less accurate (more later).

10.1 More About Logistic Regression

Just when you thought you understood logistic regression...

Suppose we have a binary outcome Yi and a continuous covariate Xi. To examine the relationship between x and Y we used the logistic model
$$P(Y = 1\mid x) = \frac{e^{\beta_0+\beta_1 x}}{1 + e^{\beta_0+\beta_1 x}}.$$

To formally test if there is a relationship between x and Y we test
$$H_0 : \beta_1 = 0 \quad\text{versus}\quad H_1 : \beta_1 \neq 0.$$

When the Xi's are random (so I am writing them with a capital letter) there is another way to think about this, and it is instructive to do so. Suppose, for example, that X is the amount of exposure to a chemical and Y is presence or absence of disease. Instead of regressing Y on X, you might simply compare the distribution of X among the sick (Y = 1) and among the healthy (Y = 0). Let's consider both methods for analyzing the data.

Method 1: Logistic Regression. (Y|X) The first plot in Figure 39 shows Y versus x and the fitted logistic model. The results of the regression are:

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.2785     0.5422  -4.202 2.64e-05 ***
x             2.1933     0.4567   4.802 1.57e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 138.629 on 99 degrees of freedom
Residual deviance:  72.549 on 98 degrees of freedom
AIC: 76.55


The test for H0 : β1 = 0 is highly significant and we conclude that there is a strong relationship between Y and X.

Method 2: Comparing Two Distributions. (X|Y) Think of X as the outcome and Y as a group indicator. Examine the boxplots and the histograms in the figure. To test whether these distributions (or at least the means of the distributions) are the same, we can do a standard t-test for
$$H_0 : E(X\mid Y=1) = E(X\mid Y=0) \quad\text{versus}\quad H_1 : E(X\mid Y=1) \neq E(X\mid Y=0).$$

> x0 = x[y==0]
> x1 = x[y==1]
> print(t.test(x0,x1))

        Welch Two Sample t-test

data:  x0 and x1
t = -9.3604, df = 97.782, p-value = 3.016e-15
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.341486 -1.522313
sample estimates:
mean of x mean of y
0.1148648 2.0467645

Again we conclude that there is a difference.

What's the connection? Let f0 and f1 be the probability density functions for the two groups. By Bayes' theorem, and letting π = P(Y = 1),
$$P(Y=1\mid X=x) = \frac{f(x\mid Y=1)\pi}{f(x\mid Y=1)\pi + f(x\mid Y=0)(1-\pi)}
= \frac{f_1(x)\pi}{f_1(x)\pi + f_0(x)(1-\pi)}
= \frac{\frac{f_1(x)\pi}{f_0(x)(1-\pi)}}{1 + \frac{f_1(x)\pi}{f_0(x)(1-\pi)}}.$$

Now suppose that X|Y = 0 ∼ N(µ0, σ²) and that X|Y = 1 ∼ N(µ1, σ²). Also, let π = P(Y = 1). Then the last equation becomes
$$P(Y=1\mid X=x) = \frac{e^{\beta_0+\beta_1 x}}{1 + e^{\beta_0+\beta_1 x}}$$

where
$$\beta_0 = \log\left(\frac{\pi}{1-\pi}\right) + \frac{\mu_0^2 - \mu_1^2}{2\sigma^2} \qquad (87)$$
and
$$\beta_1 = \frac{\mu_1 - \mu_0}{\sigma^2}. \qquad (88)$$
This is exactly the logistic regression model! Moreover, β1 = 0 if and only if µ0 = µ1. Thus, the two approaches are testing the same thing.

In fact, here is how I generated the data for the example. I took P(Y = 1) = 1/2, f0 = N(0, 1) and f1 = N(2, 1). Plugging into (87) and (88) we see that β0 = −2 and β1 = 2. Indeed, we found β̂0 = −2.3 and β̂1 = 2.2.

Figure 39: Logistic Regression

There are two different ways of answering the same question.
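A minimal simulation sketch (assumed, not from the notes) that reproduces this setup and shows the two analyses agreeing:

set.seed(42)
n = 100
y = rbinom(n, 1, 0.5)                 # P(Y = 1) = 1/2
x = rnorm(n, mean = 2*y, sd = 1)      # X|Y=0 ~ N(0,1), X|Y=1 ~ N(2,1)
coef(glm(y ~ x, family = binomial))   # roughly beta0 = -2, beta1 = 2
t.test(x[y == 0], x[y == 1])          # the two-sample comparison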

10.2 Logistic Regression With Replication

When there are replications (ni > 1), we can say more about diagnostics. Suppose there is one covariate taking values x1, . . . , xk and suppose there are ni observations at each Xi. An example was given for extinctions as a function of island size.

Now we let Yi denote the number of successes at Xi. Hence,

Yi ∼ Binomial(ni, πi).

We can fit the logistic regression as before:

$$\mathrm{logit}(\pi_i) = X_i^T\beta$$
and now we define the Pearson residuals
$$r_i = \frac{Y_i - n_i\hat\pi_i}{\sqrt{n_i\hat\pi_i(1-\hat\pi_i)}}$$
and deviance residuals
$$d_i = \mathrm{sign}(Y_i - \hat Y_i)\sqrt{2Y_i\log\!\left(\frac{Y_i}{\hat Y_i}\right) + 2(n_i - Y_i)\log\!\left(\frac{n_i - Y_i}{n_i - \hat Y_i}\right)}$$
where Ŷi = niπ̂i, and 0 log 0 = 0. In glm models, deviances play the role of the sum of squares in a linear model.

Pearson residuals follow directly from a normal approximation to the binomial. The deviance residuals are the signed square root of twice the loglikelihood difference between the saturated model (Ŷi = Yi) and the fitted model (Ŷi = niπ̂i). These diagnostics are approximately the same in practice. The residuals will behave like N(0, 1) random variables when the model is correct.

We can also form standardized versions of these. Let
$$H = W^{1/2}X(X^T W X)^{-1}X^T W^{1/2}$$
where W is diagonal with ith element niπ̂i(1 − π̂i). The standardized Pearson residuals are
$$\tilde r_i = \frac{r_i}{\sqrt{1 - H_{ii}}},$$
which should behave like N(0, 1) random variables if the model is correct. Similarly, define standardized deviance residuals by
$$\tilde d_i = \frac{d_i}{\sqrt{1 - H_{ii}}}.$$

Goodness-of-Fit Test. Now we ask the question: is the model right? The Pearson χ²
$$\chi^2 = \sum_i r_i^2$$
and deviance
$$D = \sum_i d_i^2$$
both have, approximately, a χ²_{n−p} distribution if the model is correct. Large values are indicative of a problem.

Let us now discuss the use of residuals. We'll do this in the context of an example. Here are the data:

y = c(2, 7, 9, 14, 23, 29, 29, 29)
n = c(29, 30, 28, 27, 30, 31, 30, 29)
x = c(49.06, 52.99, 56.91, 60.84, 64.76, 68.69, 72.61, 76.54)

The data, from Strand (1930) and Collett (1991) are the number of flour beetles killed by carbondisulphide (CS2). The covariate is the dose of CS2 in mg/l.

There are two ways to run the regression. In the first, the response goes in as two columns, (y, n − y). In this way R knows how many trials (n) there are for each binomial observation. This is the preferred approach.

> out = glm(cbind(y,n-y) ~ x, family=binomial)
> print(summary(out))

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -14.7312     1.8300  -8.050 8.28e-16 ***
x             0.2478     0.0303   8.179 2.87e-16 ***


    Null deviance: 137.7204 on 7 degrees of freedom
Residual deviance:   2.6558 on 6 degrees of freedom

> b = out$coef
> grid = seq(min(x),max(x),length=1000)
> l = b[1] + b[2]*grid
> fit = exp(l)/(1+exp(l))
> plot(x,y/n)
> lines(grid,fit,lwd=3)
>

The second way to input the data enters each trial as a distinct row.

> Y = c(rep(1,sum(y)),rep(0,sum(n)-sum(y)))
> X = c(rep(x,y),rep(x,n-y))
> out2 = glm(Y ~ X, family=binomial)
> print(summary(out2))

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -14.73081    1.82170  -8.086 6.15e-16 ***
X             0.24784    0.03016   8.218  < 2e-16 ***

    Null deviance: 313.63 on 233 degrees of freedom
Residual deviance: 178.56 on 232 degrees of freedom

Notice that the outcome is the same if the binomials are entered as distinct Bernoulli trialsexcept for the deviance. The correct deviance, which is useful as a goodness-of-fit test, is obtainedfrom the first method.

Going back to our original approach, to test goodness of fit:

> print(out$dev)
[1] 2.655771
> pvalue = 1-pchisq(out$dev,out$df.residual)
> print(pvalue)
[1] 0.8506433

Conclude that the fit is good. Still, we should look at the residuals.

> r = resid(out,type="deviance")
> p = out$linear.predictors
> plot(p,r,pch=19,xlab="linear predictor",ylab="deviance residuals")

Figure 40: Beetles

Note that

> print(sum(r^2))
[1] 2.655771

gives back the deviance test. Now let’s create standardized residuals.

> r = rstandard(out)
> plot(x,r,xlab="linear predictor",ylab="standardized deviance residuals")

10.3 Deviance Tests

For linear models, if we wish to compare the fit of a "full" model with a "reduced" model we examine the difference in residual sums of squares via an F test:
$$F = \frac{(\mathrm{RSS}_{red} - \mathrm{RSS}_{full})/df_1}{\mathrm{RSS}_{full}/df_2} \sim F_{df_1,\,df_2},$$
where df1 = p_full − p_red and df2 = n − (p_full + 1). Now, assuming normality, RSS_full ∼ σ²χ²_{df2}, so E[RSS_full/df2] = σ², and RSS_red − RSS_full ∼ σ²χ²_{df1} under the reduced model. Consequently, σ² cancels out of the F statistic. The test assesses whether the difference in RSS between the full and reduced models is bigger than expected, with degrees of freedom equal to df1. The denominator of the F test is included simply as an estimate of σ².

For glm models a deviance test plays the same role for comparing the fit of a "full" model with a "reduced" model. If the log likelihood is $\ell(\theta) = \sum_i \log f(y_i\mid\theta)$, then the deviance is defined as $2[\ell_{\mathrm{sat}} - \ell(\hat\theta)]$, twice the difference between the loglikelihood of the saturated model and that of the fitted model. For a linear model the deviance reduces to RSS/σ². For Poisson and binomial models the mean determines the variance, so there is no unknown σ² to be estimated.

To test a full vs. reduced model for a glm we look at
$$\mathrm{Dev}_{red} - \mathrm{Dev}_{full},$$
rather than an F test. Under the null hypothesis, the difference in deviances is distributed χ²_{df1}. If df1 = 1, then an alternative test is the Wald test, β̂j/se(β̂j). The Wald test is the natural analog of the t-test in linear models. The deviance test and the Wald test will give similar, but not identical, results.

11 Generalized Linear Models

We can write the logistic regression model as

Yi ∼ Bernoulli(µi)

g(µi) = XTi β

where g(z) = logit(z). The function g is an example of a link function and the Bernoulli is an example of an exponential family, which we explain below. Any model in which Y has a distribution in the exponential family and in which some function of its mean is linear in a set of predictors is called a generalized linear model.

A probability function (or probability density function) is said to be in the exponential family if there are functions η(θ), B(θ), T(y) and h(y) such that
$$f(y;\theta) = h(y)\,e^{\eta(\theta)T(y) - B(\theta)}.$$

11.1 Example. Let Y ∼ Poisson(θ). Then
$$f(y;\theta) = \frac{\theta^y e^{-\theta}}{y!} = \frac{1}{y!}\,e^{y\log\theta - \theta}$$
and hence this is an exponential family with η(θ) = log θ, B(θ) = θ, T(y) = y, h(y) = 1/y!.

11.2 Example. Let Y ∼ Binomial(n, θ). Then
$$f(y;\theta) = \binom{n}{y}\theta^y(1-\theta)^{n-y} = \binom{n}{y}\exp\left\{ y\log\left(\frac{\theta}{1-\theta}\right) + n\log(1-\theta)\right\}.$$
In this case,
$$\eta(\theta) = \log\left(\frac{\theta}{1-\theta}\right), \qquad B(\theta) = -n\log(1-\theta)$$
and
$$T(y) = y, \qquad h(y) = \binom{n}{y}.$$

If θ = (θ1, . . . , θk) is a vector, then we say that f(y; θ) has exponential family form if
$$f(y;\theta) = h(y)\exp\left\{\sum_{j=1}^k \eta_j(\theta)T_j(y) - B(\theta)\right\}.$$

11.3 Example. Consider the Normal family with θ = (µ, σ). Now,
$$f(y;\theta) = \exp\left\{\frac{\mu}{\sigma^2}y - \frac{y^2}{2\sigma^2} - \frac{1}{2}\left(\frac{\mu^2}{\sigma^2} + \log(2\pi\sigma^2)\right)\right\}.$$
This is exponential with
$$\eta_1(\theta) = \frac{\mu}{\sigma^2}, \quad T_1(y) = y, \qquad
\eta_2(\theta) = -\frac{1}{2\sigma^2}, \quad T_2(y) = y^2,$$
$$B(\theta) = \frac{1}{2}\left(\frac{\mu^2}{\sigma^2} + \log(2\pi\sigma^2)\right), \quad h(y) = 1.$$

Now consider independent random variables Y1, . . . , Yn, each from the same exponential family distribution. Let µi = E(Yi) and suppose that
$$g(\mu_i) = X_i^T\beta.$$
This is a generalized linear model with link g. Notice that the regression equation
$$E(Y_i) = g^{-1}(X_i^T\beta)$$
is based on the inverse of the link function.

11.4 Example (Normal Regression). Here, Yi ∼ N(µi, σ²) and the link g(µi) = µi is the identity function.

11.5 Example (Logistic Regression). Here, Yi ∼ Bernoulli(µi) and g(µi) = logit(µi).

11.6 Example (Poisson Regression). This is often used when the outcomes are counts. Here, Yi ∼ Poisson(µi) and the usual link function is g(µi) = log(µi).


Although many link functions could be used, there are default link functions that are standard for each family. Here they are (from Table 12.5 in Weisberg):

Distribution   Link       Link function       Inverse link (regression function)
Normal         Identity   g(µ) = µ            µ = x^T β
Poisson        Log        g(µ) = log(µ)       µ = e^{x^T β}
Bernoulli      Logit      g(µ) = logit(µ)     µ = e^{x^T β}/(1 + e^{x^T β})
Gamma          Inverse    g(µ) = 1/µ          µ = 1/(x^T β)

In R you type:

glm(y ~ x, family = xxxx)

where xxxx is gaussian, binomial, poisson, etc. R will assume the default link.

11.7 Example. This is a famous data set collected by Sir Richard Doll in the 1950's. I am following example 9.2.1 in Dobson. The data are on smoking and number of deaths due to coronary heart disease. Here are the data:

        Smokers                 Non-smokers
Age     Deaths  Person-years    Deaths  Person-years
35-44       32         52407         2         18790
45-54      104         43248        12         10673
55-64      206         28612        28          5710
65-74      186         12663        28          2585
75-84      102          5317        31          1462

Poisson regression is also appropriate for rate data, where the rate is a count of events occurring in a particular unit of observation, divided by some measure of that unit's exposure. For example, biologists might count the number of tree species in a forest, and the rate would be the number of species per square kilometre. Demographers may model death rates in geographic areas as the count of deaths divided by person-years. It is important to note that event rates can be calculated as events per units of varying size. For instance, in these data, person-years vary: as the people get older there are fewer at risk, and at the time the data were collected most people smoked, so there were fewer person-years for non-smokers. In these examples the exposure is, respectively, area, person-years, or time. In Poisson regression this is handled as an offset, where the exposure variable enters on the right-hand side of the equation but with its coefficient (for log(exposure)) constrained to 1.


$$\log(E[Y\mid X = x]) = x^T\beta + \log(\text{exposure}).$$
This makes sense because we believe that
$$\log(E[Y\mid X = x]/\text{exposure}) = x^T\beta.$$
Thus we pull log(exposure) to the right-hand side of the equation and fit a Poisson regression model. The offset option allows us to include log(exposure) in the model without estimating a β coefficient.

A plot of deaths by age exhibits an obvious increasing relationship with age which shows somehint of nonlinearity. The increase may differ between smokers and non-smokers so we will includean interaction term. We took the midpoint of each age group as the age.

> ### page 155 dobson
> deaths = c(32,104,206,186,102,2,12,28,28,31)
> age = c(40,50,60,70,80,40,50,60,70,80)
> py = c(52407,43248,28612,12663,5317,
+        18790,10673,5710,2585,1462)
> smoke = c(1,1,1,1,1,0,0,0,0,0)
> agesq = age*age
> sm.age = smoke*age
#
# Notice the use of the offset for person-years below
#
> out = glm(deaths ~ smoke + age + agesq + sm.age, offset=log(py), family=poisson)
> summary(out)

Deviance Residuals:
       1        2        3        4        5        6        7        8        9       10
 0.43820 -0.27329 -0.15265  0.23393 -0.05700 -0.83049  0.13404  0.64107 -0.41058 -0.01275

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.970e+01  1.253e+00 -15.717  < 2e-16 ***
smoke        2.364e+00  6.562e-01   3.602 0.000316 ***
age          3.563e-01  3.632e-02   9.810  < 2e-16 ***
agesq       -1.977e-03  2.737e-04  -7.223 5.08e-13 ***
sm.age      -3.075e-02  9.704e-03  -3.169 0.001528 **
---

    Null deviance: 935.0673 on 9 degrees of freedom
Residual deviance:   1.6354 on 5 degrees of freedom

AIC: 66.703

Based on the p-values from the Wald tests above, smoking appears to be quite important (but keep the usual causal caveats in mind).

Suppose we want to compare smokers to non-smokers for 40 year olds. The estimated model is
$$\hat E(Y\mid x) = \exp\{\hat\beta_0 + \hat\beta_1\,\text{smoke} + \hat\beta_2\,\text{age} + \hat\beta_3\,\text{age}^2 + \hat\beta_4\,\text{smoke}\cdot\text{age} + \log\text{PY}\}$$
and hence
$$\frac{\hat E(Y\mid \text{smoker, age}=40)}{\hat E(Y\mid \text{non-smoker, age}=40)}
= \frac{\exp\{\hat\beta_0 + \hat\beta_1 + 40\hat\beta_2 + 1600\hat\beta_3 + 40\hat\beta_4 + \log(52407)\}}{\exp\{\hat\beta_0 + 40\hat\beta_2 + 1600\hat\beta_3 + \log(18790)\}}
= e^{\hat\beta_1 + 40\hat\beta_4 + \log(52407) - \log(18790)}.$$
This gives us the ratio of the expected number of deaths in the two populations. But we are interested in the rate parameter, so we now drop the person-years terms. The ratio of rates is
$$e^{\hat\beta_1 + 40\hat\beta_4} = 3.1,$$
suggesting that smokers in this group have a death rate due to coronary heart disease that is 3.1 times higher than non-smokers.

Let's get a confidence interval for this. First, set
$$\gamma = \beta_1 + 40\beta_4 = \ell^T\beta$$
and
$$\hat\gamma = \hat\beta_1 + 40\hat\beta_4 = \ell^T\hat\beta$$
where
$$\ell^T = (0, 1, 0, 0, 40).$$
Then,
$$\hat{\mathbb{V}}(\hat\gamma) = \ell^T\hat V\ell$$
where $\hat V = \hat{\mathbb{V}}(\hat\beta)$. An approximate 95 percent confidence interval for γ is
$$(a, b) = \left(\hat\gamma - 2\sqrt{\hat{\mathbb{V}}(\hat\gamma)},\ \hat\gamma + 2\sqrt{\hat{\mathbb{V}}(\hat\gamma)}\right).$$
We are interested in ψ = e^γ. The confidence interval is
$$(e^a, e^b).$$
In R:


> summ = summary(out)
> v = summ$dispersion * summ$cov.unscaled
# summ$dispersion is 1 unless we allow "over dispersion"
# relative to the model. This is a topic I skipped over.
> print(v)
              (Intercept)         smoke           age         agesq        sm.age
(Intercept)  1.5711934366 -4.351992e-01 -4.392812e-02  2.998070e-04  6.445856e-03
smoke       -0.4351992084  4.306356e-01  7.424636e-03 -1.601373e-05 -6.280941e-03
age         -0.0439281178  7.424636e-03  1.318857e-03 -9.633853e-06 -1.144205e-04
agesq        0.0002998070 -1.601373e-05 -9.633853e-06  7.489759e-08  2.700594e-07
sm.age       0.0064458558 -6.280941e-03 -1.144205e-04  2.700594e-07  9.416983e-05
> ell = c(0,1,0,0,40)
> gam = sum(ell*out$coef)
> print(exp(gam))
[1] 3.106274
> se = sqrt(ell %*% v %*% ell)
> ci = exp(c(gam - 2*se, gam + 2*se))
> print(round(ci,2))
[1] 1.77 5.45

The result is that the rate is 2 to 5 times higher for smokers than non-smokers at age 40. Since heart disease is much more common than lung cancer, the risk from smoking has a bigger public health impact through heart disease than through lung cancer.

There is a formal way to test the model for goodness of fit. Let's look at the fit of the model. As with logistic regression we can compute the deviances. Recall that the log likelihood for a Poisson is of the form
$$\ell(\theta) = y\log(\theta) - \theta.$$
The deviance residuals are defined as
$$d_i = \mathrm{sign}(Y_i - \hat Y_i)\sqrt{2[\ell(Y_i) - \ell(\hat Y_i)]}
= \mathrm{sign}(Y_i - \hat Y_i)\sqrt{2[Y_i\log(Y_i/\hat Y_i) - (Y_i - \hat Y_i)]}.$$
The deviance is defined as
$$D = \sum_i d_i^2.$$
This statistic is approximately distributed χ²_{n−p−1} where p is the number of covariates. If D is larger than expected (i.e., the p-value is small), this means that the Poisson model with the covariates included is not sufficient to explain the data.

For these data the model appears to fit well.

> print(1-pchisq(out$deviance, df=5))
[1] 0.8969393

Figure 41: Regression with measurement error. X is not observed. W is a noisy version of X. If you regress Y on W, you will get an inconsistent estimate of β1.

12 Measurement Error

Suppose we are interested in regressing the outcome Y on a covariate X but we cannot observe X directly. Rather, we observe X plus noise U. The observed data are (Y1, W1), . . . , (Yn, Wn) where

Yi = β0 + β1Xi + εi

Wi = Xi + Ui

and E(Ui) = 0. This is called a measurement error problem or an errors-in-variables problem. The model is illustrated by the directed graph in Figure 41. It is tempting to ignore the error and just regress Y on W. If the goal is just to predict Y from W then there is no problem. But if the goal is to estimate β1, regressing Y on W leads to inconsistent estimates.

Let σ²_x = V(X), and assume that ε is independent of X, with mean 0 and variance σ²_ε. Also assume that U is independent of X, with mean 0 and variance σ²_u. Let β̂1 be the least squares estimator of β1 obtained by regressing the Yi's on the Wi's. It can be shown that
$$\hat\beta_1 \overset{as}{\longrightarrow} \lambda\beta_1 \qquad (89)$$
where
$$\lambda = \frac{\sigma_x^2}{\sigma_x^2 + \sigma_u^2} < 1. \qquad (90)$$
Thus, the effect of the measurement error is to bias the estimated slope towards 0, an effect that is usually called attenuation bias. Let us give a heuristic explanation of why (89) is true. For simplicity, assume that β0 = 0 and that E(X) = 0. So Ȳ ≈ 0, W̄ ≈ 0 and
$$\hat\beta_1 = \frac{\sum_i (Y_i - \bar Y)(W_i - \bar W)}{\sum_i (W_i - \bar W)^2} \approx \frac{\tfrac{1}{n}\sum_i Y_i W_i}{\tfrac{1}{n}\sum_i W_i^2}.$$

Now,
$$\frac{1}{n}\sum_i Y_i W_i = \frac{1}{n}\sum_i (\beta_1 X_i + \varepsilon_i)(X_i + U_i)
= \frac{\beta_1}{n}\sum_i X_i^2 + \frac{\beta_1}{n}\sum_i X_i U_i + \frac{1}{n}\sum_i \varepsilon_i X_i + \frac{1}{n}\sum_i \varepsilon_i U_i
\approx \beta_1\sigma_x^2.$$
(Note: Xi, Ui and εi are all uncorrelated, so by the law of large numbers the cross-product sums are approximately zero for large n.) Also,

$$\frac{1}{n}\sum_i W_i^2 = \frac{1}{n}\sum_i (X_i + U_i)^2
= \frac{1}{n}\sum_i X_i^2 + \frac{1}{n}\sum_i U_i^2 + \frac{2}{n}\sum_i X_i U_i
\approx \sigma_x^2 + \sigma_u^2,$$

which yields (89).

If there are several observed values of W for each X, then σ²_u can be estimated. Otherwise, σ²_u must be estimated by external means, such as through background knowledge of the noise mechanism. For our purposes, we will assume that σ²_u is known. Since σ²_w = σ²_x + σ²_u, we can estimate σ²_x by
$$\hat\sigma_x^2 = \hat\sigma_w^2 - \sigma_u^2 \qquad (91)$$
where σ̂²_w is the sample variance of the Wi's. Plugging these estimates into (90), we get an estimate λ̂ = (σ̂²_w − σ²_u)/σ̂²_w of λ. An estimate of β1 is
$$\tilde\beta_1 = \frac{\hat\beta_1}{\hat\lambda} = \frac{\hat\sigma_w^2}{\hat\sigma_w^2 - \sigma_u^2}\,\hat\beta_1. \qquad (92)$$
This is called the method of moments estimator. This estimator makes little sense if σ̂²_w − σ²_u ≤ 0. In such cases, one might reasonably conclude that the sample size is simply not large enough to estimate β1.
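A minimal simulation sketch (assumed, not from the notes) illustrating the attenuation and the method of moments correction (92), treating σ²_u as known:

set.seed(1)
n = 1000; beta1 = 2; sigma.u = 0.8
x = rnorm(n)                          # true covariate (unobserved in practice)
w = x + rnorm(n, 0, sigma.u)          # noisy measurement
y = beta1*x + rnorm(n)
b1.naive = coef(lm(y ~ w))[2]         # attenuated toward 0
lambda.hat = (var(w) - sigma.u^2)/var(w)
b1.naive/lambda.hat                   # corrected estimate, roughly beta1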

Another method for correcting the attenuation bias is SIMEX, which stands for simulation extrapolation and is due to Cook and Stefanski. Recall that the least squares estimate β̂1 is a consistent estimate of
$$\frac{\beta_1\sigma_x^2}{\sigma_x^2 + \sigma_u^2}.$$

Generate new random variables
$$\tilde W_i = W_i + \sqrt{\rho}\,U_i$$
where Ui ∼ N(0, σ²_u). The least squares estimate obtained by regressing the Yi's on the W̃i's is a consistent estimate of
$$\Omega(\rho) = \frac{\beta_1\sigma_x^2}{\sigma_x^2 + (1+\rho)\sigma_u^2}. \qquad (93)$$
Repeat this process B times (where B is large) and denote the resulting estimators by β̂_{1,1}(ρ), . . . , β̂_{1,B}(ρ). Then define
$$\hat\Omega(\rho) = \frac{1}{B}\sum_{b=1}^B \hat\beta_{1,b}(\rho).$$

Figure 42: In the SIMEX method we extrapolate Ω̂(ρ) back to ρ = −1.

Now comes some clever sleight of hand. Setting ρ = −1 in (93) we see that Ω(−1) = β1, which is the quantity we want to estimate. The idea is to compute Ω̂(ρ) for a range of values of ρ such as 0, 0.5, 1.0, 1.5, 2.0. We then extrapolate the curve Ω̂(ρ) back to ρ = −1; see Figure 42. To do the extrapolation, we fit the values Ω̂(ρj) to the curve
$$G(\rho;\gamma_1,\gamma_2,\gamma_3) = \gamma_1 + \frac{\gamma_2}{\gamma_3 + \rho} \qquad (94)$$
using standard nonlinear regression. Once we have estimates of the γ's, we take
$$\hat\beta_1 = G(-1;\hat\gamma_1,\hat\gamma_2,\hat\gamma_3) \qquad (95)$$
as our corrected estimate of β1. Fitting the nonlinear regression (94) is inconvenient; it often suffices to approximate G(ρ) with a quadratic. Thus, we fit the Ω̂(ρj)'s to the curve
$$Q(\rho;\gamma_1,\gamma_2,\gamma_3) = \gamma_1 + \gamma_2\rho + \gamma_3\rho^2$$
and the corrected estimate of β1 is
$$\hat\beta_1 = Q(-1;\hat\gamma_1,\hat\gamma_2,\hat\gamma_3) = \hat\gamma_1 - \hat\gamma_2 + \hat\gamma_3.$$

An advantage of SIMEX is that it extends readily to nonlinear and nonparametric regression.
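Here is a minimal SIMEX sketch with the quadratic extrapolation (an assumed illustration; it reuses the simulated y, w and sigma.u from the sketch above):

rho = c(0, 0.5, 1, 1.5, 2)
B = 200
omega.hat = sapply(rho, function(r) {
  mean(replicate(B, {
    w.tilde = w + sqrt(r)*rnorm(length(w), 0, sigma.u)  # add extra measurement noise
    coef(lm(y ~ w.tilde))[2]
  }))
})
quad = lm(omega.hat ~ rho + I(rho^2))   # fit the quadratic Q(rho)
sum(coef(quad) * c(1, -1, 1))           # Q(-1) = gamma1 - gamma2 + gamma3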

Figure 43: CMB data. The horizontal axis is the multipole moment, essentially the frequency of fluctuations in the temperature field of the CMB. The vertical axis is the power or strength of the fluctuations at each frequency. The top plot shows the full data set. The bottom plot shows the first 400 data points. The first peak, around x ≈ 200, is obvious. There may be a second and third peak further to the right.

13 Nonparametric Regression

Now we will study nonparametric regression, also known as "learning a function" in the jargon of machine learning. We are given n pairs of observations (X1, Y1), . . . , (Xn, Yn) where

Yi = r(Xi) + εi, i = 1, . . . , n (96)

andr(x) = E(Y |X = x). (97)

13.1 Example (CMB data). Figure 43 shows data on the cosmic microwave background (CMB).The first plot shows 899 data points over the whole range while the second plot shows the first 400data points. We have noisy measurements Yi of r(Xi) so the data are of the form (96). Our goal isto estimate r. It is believed that r may have three peaks over the range of the data. The first peak isobvious from the second plot. The presence of a second or third peak is much less obvious; carefulinferences are required to assess the significance of these peaks.


The simplest nonparametric estimator is the regressogram. Suppose the Xi's are in the interval [a, b]. Divide the interval into m bins of equal length. Thus each bin has length h = (b − a)/m. Denote the bins by B1, . . . , Bm. Let kj be the number of observations in bin Bj and let Ȳj be the mean of the Yi's in bin Bj. Define
$$\hat r_n(x) = \frac{1}{k_j}\sum_{i:\,X_i\in B_j} Y_i = \bar Y_j \quad\text{for } x\in B_j. \qquad (98)$$
We can rewrite the estimator as
$$\hat r_n(x) = \sum_{i=1}^n \ell_i(x)Y_i$$
where ℓi(x) = 1/kj if x, Xi ∈ Bj and ℓi(x) = 0 otherwise. Thus,
$$\ell(x) = \left(0, 0, \ldots, 0, \frac{1}{k_j}, \ldots, \frac{1}{k_j}, 0, \ldots, 0\right)^T.$$

In other words, the estimate rn is a step function obtained by averaging the Yis over each bin.
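A minimal regressogram sketch (assumed, not from the notes), using cut() to form the bins and tapply() to average within them:

regressogram = function(x, y, a = min(x), b = max(x), m = 10) {
  breaks = seq(a, b, length = m + 1)
  bins = cut(x, breaks = breaks, include.lowest = TRUE)
  means = tapply(y, bins, mean)            # Ybar_j for each bin
  list(breaks = breaks, means = means,
       fitted = as.vector(means[bins]))    # step-function fit at each x
}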

13.2 Example (LIDAR). These are data from a light detection and ranging (LIDAR) experiment. LIDAR is used to monitor pollutants. Figure 44 shows 221 observations. The response is the log of the ratio of light received from two lasers. The frequency of one laser is the resonance frequency of mercury while the second has a different frequency. The estimates shown here are regressograms. The smoothing parameter h is the width of the bins. As the binsize h decreases, the estimated regression function r̂n goes from oversmoothing to undersmoothing.

Let us now compute the bias and variance of the estimator. For simplicity, suppose that [a, b] = [0, 1] and further suppose that the Xi's are equally spaced so that each bin has k = n/m observations. Let us focus on r̂n(0). The mean (conditional on the Xi's) is
$$E(\hat r_n(0)) = \frac{1}{k}\sum_{i\in B_1} E(Y_i) = \frac{1}{k}\sum_{i\in B_1} r(X_i).$$
By Taylor's theorem, r(Xi) ≈ r(0) + Xi r′(0). So,
$$E(\hat r_n(0)) \approx r(0) + \frac{r'(0)}{k}\sum_{i\in B_1} X_i.$$
The largest that Xi can be in bin B1 is the length of the bin, h = 1/m. So the absolute value of the bias is
$$\left|\frac{r'(0)}{k}\sum_{i\in B_1} X_i\right| \le h\,|r'(0)|.$$
The variance is
$$\frac{\sigma^2}{k} = \frac{m\sigma^2}{n} = \frac{\sigma^2}{nh}.$$

Figure 44: The LIDAR data from Example 13.2. The estimates are regressograms, obtained by averaging the Yi's over bins. As we decrease the binwidth h, the estimator becomes less smooth.


The mean squared error is the squared bias plus the variance:
$$\mathrm{MSE} = h^2(r'(0))^2 + \frac{\sigma^2}{nh}.$$
Large bins cause large bias. Small bins cause large variance. The MSE is minimized at
$$h = \left(\frac{\sigma^2}{2(r'(0))^2 n}\right)^{1/3} = \frac{c}{n^{1/3}}$$
for some c. With this optimal value of h, the risk (or MSE) is of the order n^{−2/3}.

Another simple estimator is the local average, defined by

$$\hat r_n(x) = \frac{1}{k_x}\sum_{i:\,|X_i - x|\le h} Y_i. \qquad (99)$$
The smoothing parameter is h. We can rewrite the estimator as
$$\hat r_n(x) = \frac{\sum_{i=1}^n Y_i K((x-X_i)/h)}{\sum_{i=1}^n K((x-X_i)/h)} \qquad (100)$$
where K(z) = 1 if |z| ≤ 1 and K(z) = 0 if |z| > 1. We can further rewrite the estimator as
$$\hat r_n(x) = \sum_{i=1}^n Y_i\,\ell_i(x)$$
where
$$\ell_i(x) = K((x-X_i)/h)\Big/\sum_{t=1}^n K((x-X_t)/h).$$

We shall see later that this estimator has risk of order n^{−4/5}, which is better than n^{−2/3}.

Notice that both estimators so far have the form r̂n(x) = Σᵢ ℓi(x)Yi. In fact, most of the estimators we consider have this form. An estimator r̂n of r is a linear smoother if, for each x, there exists a vector ℓ(x) = (ℓ1(x), . . . , ℓn(x))^T such that
$$\hat r_n(x) = \sum_{i=1}^n \ell_i(x)Y_i. \qquad (101)$$

Define the vector of fitted values
$$\hat Y = (\hat r_n(x_1), \ldots, \hat r_n(x_n))^T \qquad (102)$$
where Y = (Y1, . . . , Yn)^T. It then follows that
$$\hat Y = LY \qquad (103)$$


where L is an n × n matrix whose ith row is ℓ(Xi)^T; thus, Lij = ℓj(Xi). The entries of the ith row show the weights given to each Yj in forming the estimate r̂n(Xi). The matrix L is called the smoothing matrix or the hat matrix. The ith row of L is called the effective kernel for estimating r(Xi). We define the effective degrees of freedom by
$$\nu = \mathrm{tr}(L). \qquad (104)$$
Compare with linear regression, where ν = p. The larger ν, the more complex the model. A smaller ν yields a smoother regression function.

13.3 Example (Regressogram). Recall that for x ∈ Bj, ℓi(x) = 1/kj if Xi ∈ Bj and ℓi(x) = 0 otherwise. Thus, r̂n(x) = Σᵢ Yi ℓi(x). The vector of weights ℓ(x) looks like this:
$$\ell(x)^T = \left(0, 0, \ldots, 0, \frac{1}{k_j}, \ldots, \frac{1}{k_j}, 0, \ldots, 0\right).$$
To see what the smoothing matrix L looks like, suppose that n = 9, m = 3 and k1 = k2 = k3 = 3. Then,
$$L = \frac{1}{3}\begin{pmatrix}
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0\\
0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0\\
0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1\\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1\\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1
\end{pmatrix}.$$

In general, it is easy to see that there are ν = tr(L) = m effective degrees of freedom. The binwidth h = (b − a)/m controls how smooth the estimate is.

13.4 Example (Local averages). The local average estimator of r(x) is a special case of the kernel estimator discussed shortly. In this case, r̂n(x) = Σᵢ Yi ℓi(x) where ℓi(x) = 1/kx if |Xi − x| ≤ h and ℓi(x) = 0 otherwise. As a simple example, suppose that n = 9, Xi = i/9 and h = 1/9. Then,
$$L = \begin{pmatrix}
1/2 & 1/2 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
1/3 & 1/3 & 1/3 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 1/3 & 1/3 & 1/3 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 1/3 & 1/3 & 1/3 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 1/3 & 1/3 & 1/3 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 1/3 & 1/3 & 1/3 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 1/3 & 1/3 & 1/3 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 1/3 & 1/3 & 1/3\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1/2 & 1/2
\end{pmatrix}.$$


13.5 Example (Linear Regression). We have Ŷ = HY where H = X(X^TX)^{-1}X^T. We can write
$$\hat r(x) = x^T\hat\beta = x^T(X^TX)^{-1}X^TY = \sum_i \ell_i(x)Y_i.$$

13.1 Choosing the Smoothing Parameter

The smoothers depend on some smoothing parameter h and we will need some way of choosing h. Recall from our discussion of variable selection that the predictive risk is
$$E(Y - \hat r_n(X))^2 = \sigma^2 + E(r(X) - \hat r_n(X))^2 = \sigma^2 + \mathrm{MSE}$$
where MSE means mean squared error. Also,
$$\mathrm{MSE} = \int \mathrm{bias}^2(x)p(x)\,dx + \int \mathrm{var}(x)p(x)\,dx$$

where
$$\mathrm{bias}(x) = E(\hat r_n(x)) - r(x)$$
is the bias of r̂n(x) and
$$\mathrm{var}(x) = \mathbb{V}(\hat r_n(x))$$
is the variance. When the data are oversmoothed, the bias term is large and the variance is small. When the data are undersmoothed the opposite is true; see Figure 45. This is called the bias–variance tradeoff. Minimizing risk corresponds to balancing bias and variance.

Ideally, we would like to choose h to minimize R(h), but R(h) depends on the unknown function r(x). Instead, we will minimize an estimate R̂(h) of R(h). As a first guess, we might use the average residual sum of squares, also called the training error,
$$\frac{1}{n}\sum_{i=1}^n (Y_i - \hat r_n(X_i))^2 \qquad (105)$$
to estimate R(h). This turns out to be a poor estimate of R(h): it is biased downwards and typically leads to undersmoothing (overfitting). The reason is that we are using the data twice: to estimate the function and to estimate the risk. The function estimate is chosen to make $\sum_{i=1}^n (Y_i - \hat r_n(X_i))^2$ small, so this will tend to underestimate the risk. We will estimate the risk using the leave-one-out cross-validation score, which is defined as follows.

The leave-one-out cross-validation score is defined by
$$\mathrm{CV} = \hat R(h) = \frac{1}{n}\sum_{i=1}^n (Y_i - \hat r_{(-i)}(X_i))^2 \qquad (106)$$
where r̂_{(−i)} is the estimator obtained by omitting the ith pair (Xi, Yi).

Figure 45: The bias–variance tradeoff. The bias increases and the variance decreases with the amount of smoothing. The optimal amount of smoothing, indicated by the vertical line, minimizes the risk = bias² + variance.

The intuition for cross-validation is as follows. Note that
$$E(Y_i - \hat r_{(-i)}(X_i))^2 = E(Y_i - r(X_i) + r(X_i) - \hat r_{(-i)}(X_i))^2
= \sigma^2 + E(r(X_i) - \hat r_{(-i)}(X_i))^2
\approx \sigma^2 + E(r(X_i) - \hat r_n(X_i))^2$$
and hence,
$$E(\hat R) \approx \text{predictive risk}. \qquad (107)$$
Thus the cross-validation score is a nearly unbiased estimate of the risk. There is a shortcut formula for computing R̂, just like in linear regression.

13.6 Theorem. Let r̂n be a linear smoother. Then the leave-one-out cross-validation score R̂(h) can be written as
$$\hat R(h) = \frac{1}{n}\sum_{i=1}^n \left(\frac{Y_i - \hat r_n(X_i)}{1 - L_{ii}}\right)^2 \qquad (108)$$
where Lii = ℓi(Xi) is the ith diagonal element of the smoothing matrix L.

The smoothing parameter h can then be chosen by minimizing R̂(h).

Rather than minimizing the cross-validation score, an alternative is to use generalized cross-validation, in which each Lii in equation (108) is replaced with its average n^{-1}Σᵢ Lii = ν/n, where ν = tr(L) is the effective degrees of freedom. Thus, we would minimize
$$\mathrm{GCV}(h) = \frac{1}{n}\sum_{i=1}^n \left(\frac{Y_i - \hat r_n(X_i)}{1 - \nu/n}\right)^2 = \frac{\hat R_{\text{training}}}{(1-\nu/n)^2}. \qquad (109)$$


Usually, the bandwidth that minimizes the generalized cross-validation score is close to the band-width that minimizes the cross-validation score.

Using the approximation (1 − x)^{-2} ≈ 1 + 2x, we see that
$$\mathrm{GCV}(h) \approx \frac{1}{n}\sum_{i=1}^n (Y_i - \hat r_n(X_i))^2 + \frac{2\nu\hat\sigma^2}{n} \equiv C_p \qquad (110)$$
where σ̂² = n^{-1}Σᵢ(Yi − r̂n(Xi))². Equation (110) is just like the Cp statistic.
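As a concrete sketch (assumed, not from the notes), the shortcut formula (108) and GCV (109) can be computed for any linear smoother directly from its smoothing matrix L:

loocv.linear.smoother = function(y, L) {
  fitted = L %*% y
  Lii = diag(L)
  nu = sum(Lii)                                    # effective degrees of freedom
  cv  = mean(((y - fitted)/(1 - Lii))^2)           # leave-one-out shortcut (108)
  gcv = mean(((y - fitted)/(1 - nu/length(y)))^2)  # GCV (109)
  c(cv = cv, gcv = gcv, df = nu)
}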

13.2 Kernel Regression

We will often use the word "kernel." For our purposes, the word kernel refers to any smooth function K such that K(x) ≥ 0 and
$$\int K(x)\,dx = 1, \qquad \int xK(x)\,dx = 0 \qquad\text{and}\qquad \sigma_K^2 \equiv \int x^2K(x)\,dx > 0. \qquad (111)$$

Some commonly used kernels are the following:

$$\begin{aligned}
\text{the boxcar kernel:} &\quad K(x) = \tfrac{1}{2}I(x),\\
\text{the Gaussian kernel:} &\quad K(x) = \tfrac{1}{\sqrt{2\pi}}e^{-x^2/2},\\
\text{the Epanechnikov kernel:} &\quad K(x) = \tfrac{3}{4}(1-x^2)I(x),\\
\text{the tricube kernel:} &\quad K(x) = \tfrac{70}{81}(1-|x|^3)^3 I(x),
\end{aligned}$$
where
$$I(x) = \begin{cases} 1 & \text{if } |x|\le 1\\ 0 & \text{if } |x| > 1.\end{cases}$$

These kernels are plotted in Figure 46.

Figure 46: Examples of kernels: boxcar (top left), Gaussian (top right), Epanechnikov (bottom left), and tricube (bottom right).

Let h > 0 be a positive number, called the bandwidth. The Nadaraya–Watson kernel estimator is defined by
$$\hat r_n(x) = \sum_{i=1}^n \ell_i(x)Y_i \qquad (112)$$
where K is a kernel and the weights ℓi(x) are given by
$$\ell_i(x) = \frac{K\!\left(\frac{x - X_i}{h}\right)}{\sum_{j=1}^n K\!\left(\frac{x - X_j}{h}\right)}. \qquad (113)$$

13.7 Remark. The local average estimator in Example 13.4 is a kernel estimator based on theboxcar kernel.
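A direct implementation of (112)–(113) takes only a few lines; here is a minimal sketch (assumed, not from the notes) using a Gaussian kernel:

nw.kernel = function(x0, x, y, h) {
  # Nadaraya-Watson estimate at each point of x0 with bandwidth h
  sapply(x0, function(u) {
    w = dnorm((u - x)/h)     # kernel weights K((u - X_i)/h)
    sum(w*y)/sum(w)
  })
}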

R-code. In R, I suggest using the loess command or using the locfit library. (You need to download locfit.) For loess:

plot(x,y)
out = loess(y ~ x, span=.25, degree=0)
lines(x,fitted(out))

The span option is the bandwidth. To compute GCV, you will need the effective number of parameters. You get this by typing:

out$enp

The command for kernel regression in locfit is:

out = locfit(y ~ x, deg=0, alpha=c(0,h))

where h is the bandwidth you want to use. The alpha=c(0,h) part looks strange. There are two ways to specify the smoothing parameter. The first way is as a percentage of the data; for example, alpha=c(.25,0) makes the bandwidth big enough so that one quarter of the data falls in the kernel. To smooth with a specific value for the bandwidth (as we are doing) we use alpha=c(0,h). The meaning of deg=0 will be explained later. Now try

names(out)
print(out)
summary(out)
plot(out)
plot(x,fitted(out))
plot(x,residuals(out))
help(locfit)


To do cross-validation, create a vector of bandwidths h = (h1, . . . , hk). alpha then needs to be a matrix.

h = c( ... put your values here ... )
k = length(h)
zero = rep(0,k)
H = cbind(zero,h)
out = gcvplot(y ~ x, deg=0, alpha=H)
plot(out$df, out$values)

13.8 Example (CMB data). Recall the CMB data from Figure 43. Figure 47 shows four differ-ent kernel regression fits (using just the first 400 data points) based on increasing bandwidths. Thetop two plots are based on small bandwidths and the fits are too rough. The bottom right plot isbased on large bandwidth and the fit is too smooth. The bottom left plot is just right. The bottomright plot also shows the presence of bias near the boundaries. As we shall see, this is a generalfeature of kernel regression. The bottom plot in Figure 48 shows a kernel fit to all the data points.The bandwidth was chosen by cross-validation.

The choice of kernel K is not too important. Estimates obtained by using different kernels areusually numerically very similar. This observation is confirmed by theoretical calculations whichshow that the risk is very insensitive to the choice of kernel. What does matter much more is thechoice of bandwidth hwhich controls the amount of smoothing. Small bandwidths give very roughestimates while larger bandwidths give smoother estimates. In general, we will let the bandwidthdepend on the sample size so we sometimes write hn.

The following theorem shows how the bandwidth affects the estimator. To state these results weneed to make some assumption about the behavior of x1, . . . , xn as n increases. For the purposesof the theorem, we will assume that these are random draws from some density f .

13.9 Theorem. The risk (using integrated squared error loss) of the Nadaraya–Watson kernel estimator is
$$R(h_n) = \frac{h_n^4}{4}\left(\int x^2K(x)\,dx\right)^2 \int\left(r''(x) + 2r'(x)\frac{f'(x)}{f(x)}\right)^2 dx
+ \frac{\sigma^2\int K^2(x)\,dx}{nh_n}\int\frac{1}{f(x)}\,dx + o\!\left(\frac{1}{nh_n}\right) + o(h_n^4) \qquad (114)$$
as hn → 0 and nhn → ∞.

The first term in (114) is the squared bias and the second term is the variance. What is especially notable is the presence of the term
$$2r'(x)\frac{f'(x)}{f(x)} \qquad (115)$$
in the bias. We call (115) the design bias since it depends on the design, that is, the distribution of the Xi's. This means that the bias is sensitive to the position of the Xi's. Furthermore, it can be

Page 134: Linear Regression

0 200 400 0 200 400

0 200 400 0 200 400

Figure 47: Four kernel regressions for the CMB data using just the first 400 data points. Thebandwidths used were h = 1 (top left), h = 10 (top right), h = 50 (bottom left), h = 200 (bottomright). As the bandwidth h increases, the estimated function goes from being too rough to toosmooth.

134

Page 135: Linear Regression

shown that kernel estimators also have high bias near the boundaries. This is known as boundary bias. We will see that we can reduce these biases by using a refinement called local polynomial regression.

If we differentiate (114) and set the result equal to 0, we find that the optimal bandwidth h∗ is
$$h_* = \left(\frac{1}{n}\right)^{1/5}
\left(\frac{\sigma^2\int K^2(x)\,dx\int dx/f(x)}
{\left(\int x^2K(x)\,dx\right)^2\int\left(r''(x) + 2r'(x)\frac{f'(x)}{f(x)}\right)^2 dx}\right)^{1/5}. \qquad (116)$$

Thus, h∗ = O(n^{−1/5}). Plugging h∗ back into (114) we see that the risk decreases at rate O(n^{−4/5}). In (most) parametric models, the risk of the maximum likelihood estimator decreases to 0 at rate 1/n. The slower rate n^{−4/5} is the price of using nonparametric methods. In practice, we cannot use the bandwidth given in (116) since h∗ depends on the unknown function r. Instead, we use leave-one-out cross-validation as described in Theorem 13.6.

13.10 Example. Figure 48 shows the cross-validation score for the CMB example as a functionof the effective degrees of freedom. The optimal smoothing parameter was chosen to minimizethis score. The resulting fit is also shown in the figure. Note that the fit gets quite variable to theright.

13.3 Local Polynomials

Kernel estimators suffer from boundary bias and design bias. These problems can be alleviated by using a generalization of kernel regression called local polynomial regression.

To motivate this estimator, first consider choosing an estimator a ≡ r̂n(x) to minimize the sum of squares $\sum_{i=1}^n (Y_i - a)^2$. The solution is the constant function r̂n(x) = Ȳ, which is obviously not a good estimator of r(x). Now define the weight function wi(x) = K((Xi − x)/h) and choose a ≡ r̂n(x) to minimize the weighted sum of squares
$$\sum_{i=1}^n w_i(x)(Y_i - a)^2. \qquad (117)$$
From elementary calculus, we see that the solution is
$$\hat r_n(x) \equiv \frac{\sum_{i=1}^n w_i(x)Y_i}{\sum_{i=1}^n w_i(x)},$$
which is exactly the kernel regression estimator. This gives us an interesting interpretation of the kernel estimator: it is a locally constant estimator, obtained from locally weighted least squares.

Px(u; a) = a0 + a1(u− x) +a2

2!(u− x)2 + · · ·+ ap

p!(u− x)p. (118)

Figure 48: Top: The cross-validation (CV) score as a function of the effective degrees of freedom. Bottom: the kernel fit using the bandwidth that minimizes the cross-validation score.

We can approximate a smooth regression function r(u) in a neighborhood of the target value x by the polynomial:
$$r(u) \approx P_x(u; a). \qquad (119)$$
We estimate a = (a0, . . . , ap)^T by choosing â = (â0, . . . , âp)^T to minimize the locally weighted sum of squares
$$\sum_{i=1}^n w_i(x)\,(Y_i - P_x(X_i; a))^2. \qquad (120)$$
The estimator â depends on the target value x, so we write â(x) = (â0(x), . . . , âp(x))^T if we want to make this dependence explicit. The local estimate of r is
$$\hat r_n(u) = P_x(u; \hat a).$$
In particular, at the target value u = x we have
$$\hat r_n(x) = P_x(x; \hat a) = \hat a_0(x). \qquad (121)$$

Warning! Although r̂n(x) only depends on â0(x), this is not equivalent to simply fitting a local constant.

Setting p = 0 gives back the kernel estimator. The special case where p = 1 is called local linear regression, and this is the version we recommend as a default choice. As we shall see, local polynomial estimators, and in particular local linear estimators, have some remarkable properties.

To find â(x), it is helpful to re-express the problem in vector notation. Let
$$X_x = \begin{pmatrix}
1 & x_1 - x & \cdots & \frac{(x_1-x)^p}{p!}\\
1 & x_2 - x & \cdots & \frac{(x_2-x)^p}{p!}\\
\vdots & \vdots & \ddots & \vdots\\
1 & x_n - x & \cdots & \frac{(x_n-x)^p}{p!}
\end{pmatrix} \qquad (122)$$
and let Wx be the n × n diagonal matrix whose (i, i) component is wi(x). We can rewrite (120) as
$$(Y - X_x a)^T W_x (Y - X_x a). \qquad (123)$$
Minimizing (123) gives the weighted least squares estimator
$$\hat a(x) = (X_x^T W_x X_x)^{-1} X_x^T W_x Y. \qquad (124)$$
In particular, r̂n(x) = â0(x) is the inner product of the first row of (X_x^T W_x X_x)^{-1} X_x^T W_x with Y. Thus we have: the local polynomial regression estimate is
$$\hat r_n(x) = \sum_{i=1}^n \ell_i(x)Y_i \qquad (125)$$


where ℓ(x)^T = (ℓ1(x), . . . , ℓn(x)),
$$\ell(x)^T = e_1^T(X_x^T W_x X_x)^{-1}X_x^T W_x,$$
e1 = (1, 0, . . . , 0)^T, and Xx and Wx are defined in (122). Once again, our estimate is a linear smoother and we can choose the bandwidth by minimizing the cross-validation formula given in Theorem 13.6.

R-code. The R code is the same except we use deg = 1 for local linear, deg = 2 for local quadratic, etc. Thus, for local linear regression:

loess(y ~ x, degree=1, span=h)
locfit(y ~ x, deg=1, alpha=c(0,h))
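Equivalently, the local linear estimate (124)–(125) at a single target point x0 is just a weighted least squares fit; a minimal sketch (assumed, not from the notes):

loclin.at = function(x0, x, y, h) {
  w = dnorm((x - x0)/h)                  # kernel weights w_i(x0)
  fit = lm(y ~ I(x - x0), weights = w)   # local linear fit centered at x0
  unname(coef(fit)[1])                   # a0-hat = estimate of r(x0)
}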

13.11 Example (LIDAR). These data were introduced in Example 13.2. Figure 49 shows the 221observations. The top left plot shows the data and the fitted function using local linear regression.The cross-validation curve (not shown) has a well-defined minimum at h ≈ 37 corresponding to 9effective degrees of freedom. The fitted function uses this bandwidth. The top right plot shows theresiduals. There is clear heteroscedasticity (nonconstant variance). The bottom left plot shows theestimate of σ(x) using the method described later. Next we compute 95 percent confidence bands(explained later). The resulting bands are shown in the lower right plot. As expected, there is muchgreater uncertainty for larger values of the covariate.

Local Linear Smoothing

13.12 Theorem. When p = 1, rn(x) = ∑_{i=1}^n ℓi(x) Yi where

ℓi(x) = bi(x) / ∑_{j=1}^n bj(x),

bi(x) = K( (Xi − x)/h ) ( Sn,2(x) − (Xi − x) Sn,1(x) )   (126)

and

Sn,j(x) = ∑_{i=1}^n K( (Xi − x)/h ) (Xi − x)^j,   j = 1, 2.

13.13 Example. Figure 50 shows the local regression for the CMB data for p = 0 and p = 1. The bottom plots zoom in on the left boundary. Note that for p = 0 (the kernel estimator), the fit is poor near the boundaries due to boundary bias.



Figure 49: The LIDAR data from Example 13.11. Top left: data and the fitted function using local linear regression with h ≈ 37 (chosen by cross-validation). Top right: the residuals. Bottom left: estimate of σ(x). Bottom right: 95 percent confidence bands.



Figure 50: Locally weighted regressions using local polynomials of order p = 0 (top left) and p = 1 (top right). The bottom plots show the left boundary in more detail (p = 0 bottom left and p = 1 bottom right). Notice that the boundary bias is reduced by using local linear estimation (p = 1).



Figure 51: The Doppler function estimated by local linear regression. The function (top left), the data (top right), the cross-validation score versus effective degrees of freedom (bottom left), and the fitted function (bottom right).

13.14 Example (Doppler function). Let

r(x) = √(x(1 − x)) sin( 2.1π / (x + .05) ),   0 ≤ x ≤ 1   (127)

which is called the Doppler function. This function is difficult to estimate and provides a good test case for nonparametric regression methods. The function is spatially inhomogeneous, which means that its smoothness (second derivative) varies over x. The function is plotted in the top left plot of Figure 51. The top right plot shows 1000 data points simulated from Yi = r(i/n) + σεi with σ = .1 and εi ∼ N(0, 1). The bottom left plot shows the cross-validation score versus the effective degrees of freedom using local linear regression. The minimum occurred at 166 degrees of freedom corresponding to a bandwidth of .005. The fitted function is shown in the bottom right plot. The fit has high effective degrees of freedom and hence the fitted function is very wiggly. This is because the estimate is trying to fit the rapid fluctuations of the function near x = 0. If we used more smoothing, the right-hand side of the fit would look better, at the cost of missing the structure near x = 0. This is always a problem when estimating spatially inhomogeneous functions. We'll discuss that further later.

The following theorem gives the large sample behavior of the risk of the local linear estimator


and shows why local linear regression is better than kernel regression.

13.15 Theorem. Let Yi = r(Xi) + σ(Xi)εi for i = 1, . . . , n and a ≤ Xi ≤ b. Assume that X1, . . . , Xn are a sample from a distribution with density f and that (i) f(x) > 0, (ii) f, r″ and σ2 are continuous in a neighborhood of x, and (iii) hn → 0 and nhn → ∞. Let x ∈ (a, b). Given X1, . . . , Xn, we have the following: the local linear estimator and the kernel estimator both have variance

( σ2(x) / (f(x) n hn) ) ∫ K^2(u) du + oP( 1/(n hn) ).   (128)

The Nadaraya–Watson kernel estimator has bias

hn^2 ( (1/2) r″(x) + r′(x) f′(x) / f(x) ) ∫ u^2 K(u) du + oP(h^2)   (129)

whereas the local linear estimator has asymptotic bias

hn^2 (1/2) r″(x) ∫ u^2 K(u) du + oP(h^2).   (130)

Thus, the local linear estimator is free from design bias. At the boundary points a and b, the Nadaraya–Watson kernel estimator has asymptotic bias of order hn while the local linear estimator has bias of order hn^2. In this sense, local linear estimation eliminates boundary bias.

13.16 Remark. The above result holds more generally for local polynomials of order p. Generally, taking p odd reduces design bias and boundary bias without increasing variance.

An alternative to locfit is loess.

out = loess(y ~ x, span=.1, degree=1)
plot(x, fitted(out))
out$trace.hat    ### effective degrees of freedom

13.4 Penalized Regression, Regularization and Splines

Before introducing splines, consider polynomial regression.

Y = ∑_{j=0}^p βj x^j + ε

or

r(x) = ∑_{j=0}^p βj x^j.


In other words, we have design matrix

X =
[ 1   x1   x1^2   . . .   x1^p
  1   x2   x2^2   . . .   x2^p
  ...
  1   xn   xn^2   . . .   xn^p ].

Least squares minimizes

(Y − Xβ)^T (Y − Xβ),

which implies β̂ = (X^T X)^{−1} X^T Y. If we introduce a ridge regression penalty and aim to minimize

(Y − Xβ)^T (Y − Xβ) + λ β^T I β,

then β̂ = (X^T X + λ I)^{−1} X^T Y. Spline regression follows a similar pattern, except that we replace X with the B-spline basis matrix B (see below).
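As a quick illustration (not part of the notes), the ridge solution can be computed directly from this closed form; the simulated polynomial design and the value of λ below are arbitrary assumptions.

## Ridge regression via the closed form (X'X + lambda I)^{-1} X'Y  (illustrative sketch).
set.seed(1)
n <- 100
x <- runif(n, -1, 1)
y <- 1 + 2 * x - 3 * x^2 + rnorm(n, sd = 0.5)
X <- cbind(1, x, x^2, x^3, x^4)                                      # polynomial design matrix
lambda <- 0.1                                                        # assumed penalty parameter

beta_ls    <- solve(t(X) %*% X, t(X) %*% y)                          # least squares
beta_ridge <- solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y) # ridge
cbind(beta_ls, beta_ridge)                                           # compare the coefficients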

Consider once again the regression model

Yi = r(Xi) + εi

and suppose we estimate r by choosing rn(x) to minimize the sums of squares

∑_{i=1}^n (Yi − rn(Xi))^2,

over a class of functions. Minimizing over all linear functions (i.e., functions of the form β0 + β1x) yields the least squares estimator. Minimizing over all functions yields a function that interpolates the data. In the previous section we avoided these two extreme solutions by replacing the sums of squares with a locally weighted sums of squares. An alternative way to get solutions in between these extremes is to minimize the penalized sums of squares

M(λ) = ∑_i (Yi − rn(Xi))^2 + λ J(r)   (131)

where

J(r) = ∫ (r″(x))^2 dx   (132)

is a roughness penalty. This penalty leads to a solution that favors smoother functions. Adding a penalty term to the criterion we are optimizing is sometimes called regularization. The parameter λ controls the trade-off between fit (the first term of (131)) and the penalty. Let rn denote the function that minimizes M(λ). When λ = 0, the solution is the interpolating function. When λ → ∞, rn converges to the least squares line. The parameter λ controls the amount of smoothing. What does rn look like for 0 < λ < ∞? To answer this question, we need to define splines.

A spline is a special piecewise polynomial. The most commonly used splines are piecewise cubic splines. Let ξ1 < ξ2 < · · · < ξk be a set of ordered points—called knots—contained in some



Figure 52: Cubic B-spline basis using nine equally spaced knots on (0,1).

interval (a, b). A cubic spline is a continuous function r such that (i) r is a cubic polynomial over (ξ1, ξ2), . . . and (ii) r has continuous first and second derivatives at the knots. A spline that is linear beyond the boundary knots is called a natural spline. Cubic splines are the most common splines used in practice. They arise naturally in the penalized regression framework as the following theorem shows.

13.17 Theorem. The function rn(x) that minimizes M(λ) with penalty (132) is a natural cubic spline with knots at the data points. The estimator rn is called a smoothing spline.

In other words, for a fitted vector Ŷ, the penalty term is minimized by a cubic spline that goes through the points Ŷ.

The theorem above does not give an explicit form for rn. To do so, we will construct a basis for the set of splines. The most commonly used basis for splines is the cubic B-spline. Rather than write out a bunch of polynomials to show their form, I suggest that you explore B-spline bases in R. Figure 52 shows the cubic B-spline basis using nine equally spaced knots on (0,1). B-spline basis functions have compact support, which makes it possible to speed up calculations. Without a penalty, using the B-spline basis one can interpolate the data so as to provide a perfect fit to the data. Alternatively, with a penalty, one can provide a nice smooth curve that is useful for prediction.
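For instance, here is a minimal sketch of how one might explore the basis, using the bs function from the splines package that ships with R; the particular knot placement is an assumption, chosen to mimic Figure 52.

## Plot a cubic B-spline basis, similar in spirit to Figure 52.
library(splines)
x     <- seq(0, 1, length = 200)
knots <- seq(0.1, 0.9, length = 9)            # nine equally spaced interior knots on (0,1)
B     <- bs(x, knots = knots, degree = 3, intercept = TRUE)
matplot(x, B, type = "l", lty = 1, ylab = "B-spline basis functions")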

We are now in a position to describe the spline estimator in more detail. According to Theorem 13.17, rn(x) is a natural cubic spline. Hence, we can write

rn(x) = ∑_{j=1}^N βj Bj(x)   (133)


where Bj(x), j = 1, . . . , N, are the B-spline basis functions and N = n + 4. (Note the basis is determined by the observed values of the xj's.) We follow the pattern of polynomial regression, but replace X with B, where

B =
[ B1(x1)   B2(x1)   . . .   BN(x1)
  B1(x2)   B2(x2)   . . .   BN(x2)
  ...
  B1(xn)   B2(xn)   . . .   BN(xn) ]

To find our regression estimator we only need to find the coefficients β = (β1, . . . , βN)^T. By expanding r in the basis and calculating the second derivatives, we can now rewrite the minimization as follows:

minimize : (Y −Bβ)T (Y −Bβ) + λβTΩβ (134)

where Bij = Bj(Xi) and Ωjk = ∫ B″j(x) B″k(x) dx.

Following the pattern we saw for ridge regression, the value of β that minimizes (134) is

β = (BTB + λΩ)−1BTY. (135)

Splines are another example of linear smoothers:

L = B(BTB + λΩ)−1BT .

So Ŷ = LY.

If we had done ordinary linear regression of Y with basis B, the hat matrix would be L = B(BTB)−1BT and the fitted values would interpolate the observed data. The effect of the term λΩ in the penalty is to shrink the regression coefficients towards a subspace, which results in a smoother fit. As before, we define the effective degrees of freedom by ν = tr(L) and we choose the smoothing parameter λ by minimizing either the cross-validation score (108) or the generalized cross-validation score (109).

In R:

out = smooth.spline(x,y,df=10,cv=TRUE)   ### df is the effective degrees of freedom
plot(x,y)
lines(x,out$y)      ### NOTE: the fitted values are in out$y NOT out$fit!!
out$cv.crit         ### print the cross-validation score

You need to do a loop to try many values of df and then use cross-validation to choose df. df must be between 2 and n. For example:

cv = rep(0,50)
df = seq(2,n,length=50)
for(i in 1:50) cv[i] = smooth.spline(x,y,df=df[i],cv=TRUE)$cv.crit
plot(df,cv,type="l")
df[cv == min(cv)]



Figure 53: Smoothing spline for the CMB data. The smoothing parameter was chosen by cross-validation.

13.18 Example. Figure 53 shows the smoothing spline with cross-validation for the CMB data. The effective number of degrees of freedom is 8.8. The fit is smoother than the local regression estimator. This is certainly visually more appealing, but the difference between the two fits is small compared to the width of the confidence bands that we will compute later.

Spline estimates rn(x) are approximately kernel estimates in the sense that

ℓi(x) ≈ (1 / (f(Xi) h(Xi))) K( (Xi − x) / h(Xi) )

where f(x) is the density of the covariate (treated here as random),

h(x) = ( λ / (n f(x)) )^{1/4}

and

K(t) = (1/2) exp(−|t|/√2) sin( |t|/√2 + π/4 ).

Another nonparametric method that uses splines is called the regression spline method. Rather than placing a knot at each data point, we instead use fewer knots. We then do ordinary linear regression on the basis matrix B with no regularization. The fitted values for this estimator are Ŷ = LY with L = B(BTB)−1BT. The difference between this estimate and smoothing splines is that the basis matrix B is based on fewer knots and there is no shrinkage factor λΩ. The amount of smoothing is instead controlled by the choice of the number (and placement) of the knots. By using fewer knots, one can save computation time.
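A regression spline of this kind can be fit by ordinary least squares on a B-spline basis; here is a minimal sketch (the simulated data and the number of basis functions, set through df, are assumptions).

## Regression spline: ordinary least squares on a B-spline basis with a modest number of knots.
library(splines)
set.seed(1)
n <- 200
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

fit <- lm(y ~ bs(x, df = 10))     # df controls the number of basis functions (hence knots)
plot(x, y)
lines(x, fitted(fit), lwd = 2)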


13.5 Smoothing Using Orthogonal Functions

Let L2(a, b) denote all functions defined on the interval [a, b] such that ∫_a^b f(x)^2 dx < ∞:

L2(a, b) = { f : [a, b] → R : ∫_a^b f(x)^2 dx < ∞ }.   (136)

We sometimes write L2 instead of L2(a, b). The inner product between two functions f, g ∈ L2 is defined by ∫ f(x)g(x) dx. The norm of f is

||f|| = ( ∫ f(x)^2 dx )^{1/2}.   (137)

Two functions are orthogonal if ∫ f(x)g(x) dx = 0.

A sequence of functions φ1, φ2, φ3, φ4, . . . is orthonormal if ∫ φj^2(x) dx = 1 for each j and ∫ φi(x)φj(x) dx = 0 for i ≠ j. An orthonormal sequence is complete if the only function that is orthogonal to each φj is the zero function. A complete orthonormal set is called an orthonormal basis.

Any f ∈ L2 can be written as

f(x) = ∑_{j=1}^∞ βj φj(x),   where   βj = ∫_a^b f(x) φj(x) dx.   (138)

Also, we have Parseval’s relation:

||f||^2 ≡ ∫ f^2(x) dx = ∑_{j=1}^∞ βj^2 ≡ ||β||^2   (139)

where β = (β1, β2, . . .).

Note: the equality in the displayed equation means that ∫ (f(x) − fn(x))^2 dx → 0, where fn(x) = ∑_{j=1}^n βj φj(x).

13.19 Example. An example of an orthonormal basis for L2(0, 1) is the cosine basis defined as follows. Let φ0(x) = 1 and for j ≥ 1 define

φj(x) = √2 cos(jπx).   (140)


Figure 54: Approximating the doppler function with its expansion in the cosine basis. The function f (top left) and its approximation fJ(x) = ∑_{j=1}^J βj φj(x) with J equal to 5 (top right), 20 (bottom left), and 200 (bottom right). The coefficients βj = ∫_0^1 f(x) φj(x) dx were computed numerically.

13.20 Example. Let

f(x) = √(x(1 − x)) sin( 2.1π / (x + .05) )

which is called the “doppler function.” Figure 54 shows f (top left) and its approximation

fJ(x) = ∑_{j=1}^J βj φj(x)

with J equal to 5 (top right), 20 (bottom left), and 200 (bottom right). As J increases we see that fJ(x) gets closer to f(x). The coefficients βj = ∫_0^1 f(x) φj(x) dx were computed numerically.

13.21 Example. The Legendre polynomials on [−1, 1] are defined by

Pj(x) = (1 / (2^j j!)) (d^j / dx^j) (x^2 − 1)^j,   j = 0, 1, 2, . . .   (141)

It can be shown that these functions are complete and orthogonal and that

∫_{−1}^{1} Pj^2(x) dx = 2 / (2j + 1).   (142)

It follows that the functions φj(x) = √((2j + 1)/2) Pj(x), j = 0, 1, . . . form an orthonormal basis for L2(−1, 1). The first few Legendre polynomials are:

P0(x) = 1,   P1(x) = x,   P2(x) = (3x^2 − 1)/2,   P3(x) = (5x^3 − 3x)/2,   . . .


These polynomials may be constructed explicitly using the following recursive relation:

P_{j+1}(x) = ( (2j + 1) x Pj(x) − j P_{j−1}(x) ) / (j + 1).   (143)

The coefficients β1, β2, . . . are related to the smoothness of the function f. To see why, note that if f is smooth, then its derivatives will be finite. Thus we expect that, for some k, ∫_0^1 (f^{(k)}(x))^2 dx < ∞ where f^{(k)} is the kth derivative of f. Now consider the cosine basis (140) and let f(x) = ∑_{j=0}^∞ βj φj(x). Then,

∫_0^1 (f^{(k)}(x))^2 dx = 2 ∑_{j=1}^∞ βj^2 (πj)^{2k}.

The only way that ∑_{j=1}^∞ βj^2 (πj)^{2k} can be finite is if the βj's get small when j gets large. To summarize:

If the function f is smooth, then the coefficients βj will be small when j is large.

Return to the regression model

Yi = r(Xi) + εi, i = 1, . . . , n. (144)

Now we write r(x) = ∑_{j=1}^∞ βj φj(x). We will approximate r by

rJ(x) = ∑_{j=1}^J βj φj(x).

The number of terms J will be our smoothing parameter. Our estimate is

r(x) = ∑_{j=1}^J β̂j φj(x).

To find rn, let U denote the n × J matrix whose columns are the basis functions evaluated at the data:

U =
[ φ1(X1)   φ2(X1)   . . .   φJ(X1)
  φ1(X2)   φ2(X2)   . . .   φJ(X2)
  ...
  φ1(Xn)   φ2(Xn)   . . .   φJ(Xn) ].

Then

β̂ = (UTU)−1UTY

and

Ŷ = SY


Figure 55: Data from the doppler test function and the estimated function. See Example 13.22.

where S = U(UTU)−1UT is the hat matrix. The matrix S is projecting into the space spanned by the first J basis functions.

We can choose J by cross-validation. Note that trace(S) = J, so the GCV score takes the following simple form:

GCV(J) = (RSS / n) × 1 / (1 − J/n)^2.
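Here is a minimal sketch (not from the notes) of series regression in the cosine basis with J chosen by this GCV score; the simulated data and the grid of J values are assumptions, and the constant function is included in the basis so the effective degrees of freedom are J + 1.

## Orthogonal series (cosine basis) regression with J chosen by GCV (illustrative sketch).
set.seed(1)
n <- 500
x <- sort(runif(n))
y <- sqrt(x * (1 - x)) * sin(2.1 * pi / (x + 0.05)) + rnorm(n, sd = 0.1)

cosfit <- function(J) {
  U   <- cbind(1, sapply(1:J, function(j) sqrt(2) * cos(j * pi * x)))   # basis matrix U
  fit <- lm(y ~ U - 1)
  rss <- sum(resid(fit)^2)
  gcv <- (rss / n) / (1 - ncol(U) / n)^2       # GCV with trace(S) = J + 1
  list(fitted = fitted(fit), gcv = gcv)
}

Js   <- 1:50
gcvs <- sapply(Js, function(J) cosfit(J)$gcv)
Jopt <- Js[which.min(gcvs)]
plot(x, y); lines(x, cosfit(Jopt)$fitted, lwd = 2)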

13.22 Example. Figure 55 shows the doppler function f and n = 2,048 observations generated from the model

Yi = r(Xi) + εi

where Xi = i/n, εi ∼ N(0, (.1)^2). The figure shows the data and the estimated function. The estimate was based on J = 234 terms.

Here is another example. The fit is in Figure 56 and the smoothing matrix is in Figure 57. Notice that the rows of the smoothing matrix look like kernels. In fact, smoothing with a series is approximately the same as kernel regression with the kernel K(x, y) = ∑_{j=1}^J φj(x) φj(y).

Cosine basis smoothers have boundary bias. This can be fixed by adding the functions t and t^2

to the basis. In other words, use the design matrix

U =
[ 1   X1   X1^2   φ2(X1)   . . .   φJ(X1)
  1   X2   X2^2   φ2(X2)   . . .   φJ(X2)
  ...
  1   Xn   Xn^2   φ2(Xn)   . . .   φJ(Xn) ].

This is called the polynomial-cosine basis.



Figure 56: Cosine Regression



Figure 57: Rows of the smoothing matrix for the cosine regression.


13.6 Summary

• A linear smoother has the form rn(x) = ∑_i ℓi(x) Yi.

  – Ŷ = LY.
  – The effective df equal trace(L).

• For a kernel smoother,

  ℓi(x) = K((x − Xi)/h) / ∑_i K((x − Xi)/h).

  – The kernel smoother is a weighted average of the Yi's in the neighborhood of x.
  – A local polynomial is a slight variation of the kernel smoother that fits a weighted polynomial rather than a weighted average of the Yi's in the neighborhood of x.
  – The key choice is the smoothing parameter h.

• A smoothing spline is like linear regression with the Xij's replaced by the basis functions Bj(Xi):

  rn(x) = ∑_j βj Bj(x).

  – The basis functions are obtained using B-splines determined by the observed Xi's.
  – The key choice is the smoothing parameter λ, which penalizes the function for excess curvature.
  – β is obtained via least squares (with the penalty).
  – An alternative to smoothing splines are splines with fewer basis functions. In this setting there is no penalty function. The key choice is the placement and number of B-splines.

• Orthogonal functions. The basis is constructed using orthogonal functions:

  rn(x) = ∑_{j=1}^J βj φj(x).

  – The key choice is J, the number of terms in the expansion.
  – β is obtained via least squares.


13.7 Variance Estimation

Next we consider several methods for estimating σ2. For linear smoothers, there is a simple, nearly unbiased estimate of σ2.

13.23 Theorem. Let rn(x) be a linear smoother. Let

σ̂2 = ∑_{i=1}^n (Yi − rn(Xi))^2 / (n − 2ν + ν̃)   (145)

where

ν = tr(L),   ν̃ = tr(LTL) = ∑_{i=1}^n ||ℓ(Xi)||^2.

If r is sufficiently smooth, ν = o(n) and ν̃ = o(n), then σ̂2 is a consistent estimator of σ2.

We will now outline the proof of this result. Recall that if Y is a random vector and Q is a symmetric matrix then Y^T Q Y is called a quadratic form and it is well known that

E(Y TQY ) = tr(QV ) + µTQµ (146)

where V = V(Y ) is the covariance matrix of Y and µ = E(Y ) is the mean vector. Now,

Y − Ŷ = Y − LY = (I − L)Y

and so

σ̂2 = Y^T Λ Y / tr(Λ)   (147)

where Λ = (I − L)T (I − L). Hence,

E(σ̂2) = E(Y^T Λ Y) / tr(Λ) = σ2 + r^T Λ r / (n − 2ν + ν̃),

where r = (r(X1), . . . , r(Xn))^T is the mean vector of Y.

Assuming that ν and ν̃ do not grow too quickly, and that r is smooth, the last term is small for large n and hence E(σ̂2) ≈ σ2. Similarly, one can show that V(σ̂2) → 0.

Here is another estimator. Suppose that the Xi's are ordered. Define

σ̂2 = (1 / (2(n − 1))) ∑_{i=1}^{n−1} (Y_{i+1} − Yi)^2.   (148)

The motivation for this estimator is as follows. Assuming r(x) is smooth, we have r(x_{i+1}) − r(x_i) ≈ 0 and hence

Yi+1 − Yi = [r(xi+1) + εi+1]− [r(xi) + εi] ≈ εi+1 − εi


and hence (Y_{i+1} − Yi)^2 ≈ ε_{i+1}^2 + ε_i^2 − 2 ε_{i+1} ε_i. Therefore,

E(Y_{i+1} − Yi)^2 ≈ E(ε_{i+1}^2) + E(ε_i^2) − 2 E(ε_{i+1}) E(ε_i)
                  = E(ε_{i+1}^2) + E(ε_i^2) = 2σ2.   (149)

Thus, E(σ̂2) ≈ σ2. A variation of this estimator is

σ̂2 = (1 / (n − 2)) ∑_{i=2}^{n−1} c_i^2 δ_i^2   (150)

where δ_i = a_i Y_{i−1} + b_i Y_{i+1} − Y_i, a_i = (x_{i+1} − x_i)/(x_{i+1} − x_{i−1}), b_i = (x_i − x_{i−1})/(x_{i+1} − x_{i−1}), and c_i^2 = (a_i^2 + b_i^2 + 1)^{−1}.

The intuition of this estimator is that it is the average of the residuals that result from fitting a line to the first and third point of each consecutive triple of design points.
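A tiny sketch (not from the notes) of the difference-based estimator (148) on simulated data with a known σ:

## Difference-based estimate of sigma^2 as in (148)  (illustrative sketch).
set.seed(1)
n <- 500
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.5)      # true sigma^2 = 0.25

sigma2.hat <- sum(diff(y)^2) / (2 * (n - 1))   # sum of squared first differences / (2(n-1))
sigma2.hat                                     # should be close to 0.25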

13.24 Example. The variance looks roughly constant for the first 400 observations of the CMB data. Using a local linear fit, we applied the two variance estimators. Equation (145) yields σ̂2 = 408.29 while equation (148) yields σ̂2 = 394.55.

So far we have assumed homoscedasticity, meaning that σ2 = V(εi) does not vary with x. In the CMB example this is blatantly false. Clearly, σ2 increases with x so the data are heteroscedastic. The function estimate rn(x) is relatively insensitive to heteroscedasticity. However, when it comes to making confidence bands for r(x), we must take into account the nonconstant variance.

We will take the following approach. Suppose that

Yi = r(Xi) + σ(Xi)εi. (151)

Let Zi = log(Yi − r(Xi))^2 and δi = log ε_i^2. Then,

Zi = log(σ2(Xi)) + δi. (152)

This suggests estimating log σ2(x) by regressing the log squared residuals on x. We proceed as follows.

Variance Function Estimation

1. Estimate r(x) with any nonparametric method to get an estimate rn(x).

2. Define Zi = log(Yi − rn(Xi))2.

3. Regress the Zi’s on the Xi’s (again using any nonparametric method) to get an estimate q(x)of log σ2(x) and let

σ2(x) = ebq(x). (153)



Figure 58: The dots are the log squared residuals. The solid line shows the log of the estimated variance σ̂2(x) as a function of x. The dotted line shows the log of the true σ2(x), which is known (to reasonable accuracy) through prior knowledge.

13.25 Example. The solid line in Figure 58 shows the log of σ̂2(x) for the CMB example. I used local linear estimation and I used cross-validation to choose the bandwidth. The estimated optimal bandwidth for rn was h = 42 while the estimated optimal bandwidth for the log variance was h = 160. In this example, there turns out to be an independent estimate of σ(x). Specifically, because the physics of the measurement process is well understood, physicists can compute a reasonably accurate approximation to σ2(x). The log of this function is the dotted line on the plot.

A drawback of this approach is that the log of a very small residual will be a large outlier. An alternative is to directly smooth the squared residuals on x.


13.8 Confidence Bands

In this section we will construct confidence bands for r(x). Typically these bands are of the form

rn(x)± c se(x) (154)

where se(x) is an estimate of the standard deviation of rn(x) and c > 0 is some constant. Before we proceed, we discuss a pernicious problem that arises whenever we do smoothing, namely, the bias problem.

THE BIAS PROBLEM. Confidence bands like those in (154) are not really confidence bands for r(x); rather, they are confidence bands for r̄n(x) = E(rn(x)), which you can think of as a smoothed version of r(x). Getting a confidence set for the true function r(x) is complicated for reasons we now explain.

Denote the mean and standard deviation of rn(x) by r̄n(x) and sn(x). Then,

(rn(x) − r(x)) / sn(x) = (rn(x) − r̄n(x)) / sn(x) + (r̄n(x) − r(x)) / sn(x)
                       = Zn(x) + bias(rn(x)) / √variance(rn(x))

where Zn(x) = (rn(x) − r̄n(x)) / sn(x). Typically, the first term Zn(x) converges to a standard Normal from which one derives confidence bands. The second term is the bias divided by the standard deviation. In parametric inference, the bias is usually smaller than the standard deviation of the estimator so this term goes to zero as the sample size increases. In nonparametric inference, we have seen that optimal smoothing corresponds to balancing the bias and the standard deviation. The second term does not vanish even with large sample sizes.

The presence of this second, nonvanishing term introduces a bias into the Normal limit. The result is that the confidence interval will not be centered around the true function r due to the smoothing bias r̄n(x) − r(x).

There are several things we can do about this problem. The first is: live with it. In other words, just accept the fact that the confidence band is for r̄n, not r. There is nothing wrong with this as long as we are careful when we report the results to make it clear that the inferences are for r̄n, not r. A second approach is to estimate the bias function r̄n(x) − r(x). This is difficult to do. Indeed, the leading term of the bias is r″(x) and estimating the second derivative of r is much harder than estimating r. This requires introducing extra smoothness conditions which then bring into question the original estimator that did not use this extra smoothness. This has a certain unpleasant circularity to it.

13.26 Example. To understand the implications of estimating r̄n instead of r, consider the following example. Let

r(x) = φ(x; 2, 1) + φ(x; 4, 0.5) + φ(x; 6, 0.1) + φ(x; 8, 0.05)



Figure 59: The true function (top left), an estimate rn (top right) based on 100 observations, the function r̄n(x) = E(rn(x)) (bottom left) and the difference r(x) − r̄n(x) (bottom right).

where φ(x; m, s) denotes a Normal density function with mean m and variance s2. Figure 59 shows the true function (top left), a locally linear estimate rn (top right) based on 100 observations Yi = r(i/10) + .2N(0, 1), i = 1, . . . , 100, with bandwidth h = 0.27, the function r̄n(x) = E(rn(x)) (bottom left) and the difference r(x) − r̄n(x) (bottom right). We see that r̄n (dashed line) smooths out the peaks. Comparing the top right and bottom left plot, it is clear that rn(x) is actually estimating r̄n, not r(x). Overall, r̄n is quite similar to r(x) except that r̄n omits some of the fine details of r.

CONSTRUCTING CONFIDENCE BANDS. Assume that rn(x) is a linear smoother, so that rn(x) = ∑_{i=1}^n Yi ℓi(x). Then,

r̄n(x) = E(rn(x)) = ∑_{i=1}^n ℓi(x) r(Xi).

Also, because we condition on the xi’s,

V(rn(x)) = V( ∑_{i=1}^n ℓi(x) Yi )


= ∑_{i=1}^n ℓi^2(x) V(Yi | Xi)
= ∑_{i=1}^n ℓi^2(x) σ2(Xi).

When σ2(x) = σ2 this simplifies to

V(rn(x)) = σ2||`(x)||2.

Notice that the variance depends on x and on the sampling of the data. For x near many Xi's, the contribution ℓi(x) from each of the neighboring points will be small, so that the variance is small. If x is far from most Xi's, then a small number of terms will contribute, each with a bigger weight. Consequently the variance will be bigger.

We will consider a confidence band for rn(x) of the form

I(x) = (rn(x)− c s(x), rn(x) + c s(x)) (155)

for some c > 0 where

s(x) = ( ∑_{i=1}^n σ̂2(Xi) ℓi^2(x) )^{1/2}.

At one fixed value of x we can just take

rn(x)± zα/2s(x).

If we want a band over an interval a ≤ x ≤ b, we need a constant c larger than zα/2 to account for the fact that we are trying to get coverage at many points. To guarantee coverage at all the Xi's we can use the Bonferroni correction and take

rn(x)± zα/(2n)s(x).

There is a more refined approach which is used in locfit.

R Code. In locfit you can get confidence bands as follows.

out = locfit(y ~ x, alpha=c(0,h))   ### fit the regression
crit(out) = kappa0(out,cov=.95)     ### make locfit find kappa0 and c
plot(out,band="local")              ### plots the fit and the bands

To actually extract the bands, proceed as follows:

tmp = preplot.locfit(out,band="local",where="data")
r.hat = tmp$fit
critval = tmp$critval$crit.val
se = tmp$se.fit
upper = r.hat + critval*se
lower = r.hat - critval*se



Figure 60: Local linear fit with simultaneous 95 percent confidence bands. The band in the top plot assumes constant variance σ2. The band in the bottom plot allows for nonconstant variance σ2(x).

Now suppose that σ(x) is a function of x. Then, we use rn(x)± cs(x).

13.27 Example. Figure 60 shows simultaneous 95 percent confidence bands for the CMB data using a local linear fit. The bandwidth was chosen using cross-validation. We find that κ0 = 38.85 and c = 3.33. In the top plot, we assumed a constant variance when constructing the band. In the bottom plot, we did not assume a constant variance when constructing the band. We see that if we do not take into account the nonconstant variance, we overestimate the uncertainty for small x and we underestimate the uncertainty for large x.

It seems like a good time to summarize the steps needed to construct the estimate rn and a confidence band.

Summary of Linear Smoothing

1. Choose a smoothing method such as local polynomial, spline, etc. This amounts to choosing the form of the weights ℓ(x) = (ℓ1(x), . . . , ℓn(x))^T. A good default choice is local linear smoothing as described in Theorem 13.12.

2. Choose the bandwidth h by cross-validation using (108).


3. Estimate the variance function σ2(x) as described in Section 13.7.

4. An approximate 1 − α confidence band for r̄n(x) = E(rn(x)) is

rn(x)± c s(x). (156)
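The following R sketch (not from the notes) walks through these steps by hand for a simple kernel smoother: the weights ℓi(x) are computed at the data points, σ2 is estimated with (145) assuming constant variance, and a Bonferroni constant is used for c. The Gaussian kernel, bandwidth and simulated data are assumptions.

## Kernel smoother with a rough simultaneous band at the data points (illustrative sketch).
set.seed(1)
n <- 300
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
h <- 0.05
alpha <- 0.05

ell  <- function(x0) { w <- dnorm((x0 - x) / h); w / sum(w) }   # kernel smoother weights
L    <- t(sapply(x, ell))                         # smoothing matrix at the data points
rhat <- L %*% y
nu   <- sum(diag(L)); nu.tilde <- sum(L^2)
sigma2 <- sum((y - rhat)^2) / (n - 2 * nu + nu.tilde)           # variance estimate (145)
s    <- sqrt(sigma2 * rowSums(L^2))                             # s(x), constant variance case
crit <- qnorm(1 - alpha / (2 * n))                              # Bonferroni constant

plot(x, y)
lines(x, rhat, lwd = 2)
lines(x, rhat + crit * s, lty = 2); lines(x, rhat - crit * s, lty = 2)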

13.28 Example (LIDAR). Recall the LIDAR data from Example 13.2 and Example 13.11. We find that κ0 ≈ 30 and c ≈ 3.25. The resulting bands are shown in the lower right plot. As expected, there is much greater uncertainty for larger values of the covariate.

13.9 Local Likelihood and Exponential Families

If Y is not real valued or ε is not Gaussian, then the basic regression model we have been using might not be appropriate. For example, if Y ∈ {0, 1} then it seems natural to use a Bernoulli model. In this section we discuss nonparametric regression for more general models. Before proceeding, we should point out that the basic model often does work well even in cases where Y is not real valued or ε is not Gaussian. This is because the asymptotic theory does not really depend on ε being Gaussian. Thus, at least for large samples, it is worth considering using the tools we have already developed for these cases.

13.29 Example. The BPD data. The outcome Y is presence or absence of BPD and the covariate is x = birth weight. The estimated logistic regression function (solid line) r(x; β0, β1) together with the data are shown in Figure 61. Also shown are two nonparametric estimates. The dashed line is the local likelihood estimator. The dotted line is the local linear estimator which ignores the binary nature of the Yi's. Again we see that there is not a dramatic difference between the local logistic model and the local linear model.
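To see the same comparison on simulated binary data, here is a minimal sketch (not the BPD analysis itself): a parametric logistic fit next to a local linear fit that ignores that Y is 0/1. The data-generating model and the span are assumptions; a local likelihood fit could be obtained with locfit's family = "binomial" option.

## Logistic regression versus a local linear fit for binary Y (illustrative sketch).
set.seed(1)
n <- 500
x <- runif(n, 400, 1600)                          # e.g. birth weight in grams (assumed)
p <- 1 / (1 + exp(0.006 * (x - 900)))             # true P(Y = 1 | x), decreasing in x
y <- rbinom(n, 1, p)

glm.fit   <- glm(y ~ x, family = binomial)        # parametric logistic regression
loess.fit <- loess(y ~ x, degree = 1, span = 0.5) # local linear, ignores that Y is binary

ord <- order(x)
plot(x, y, pch = "|")
lines(x[ord], fitted(glm.fit)[ord], lwd = 2)
lines(x[ord], fitted(loess.fit)[ord], lty = 3)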

13.10 Multiple Nonparametric Regression

Suppose now that the covariate is d-dimensional,

Xi = (Xi1, . . . , Xid)T .

The regression equation takes the form

Y = r(X1, . . . , Xd) + ε. (157)

In principle, all the methods we have discussed carry over to this case easily. Unfortunately, the risk of a nonparametric regression estimator increases rapidly with the dimension d. This is called the curse of dimensionality. The risk of a nonparametric estimator behaves like n^{−4/5} if r is



Figure 61: The BPD data. The data are shown with small vertical lines. The estimates are from logistic regression (solid line), local likelihood (dashed line) and local linear regression (dotted line).

assumed to have an integrable second derivative. In d dimensions the risk behaves like n^{−4/(4+d)}. To make the risk equal to a small number δ we have

δ = 1 / n^{4/(4+d)}   (158)

which implies that

n = (1/δ)^{(d+4)/4}.   (159)

Thus:

To maintain a given degree of accuracy of an estimator, the sample size must increase exponentially with the dimension d.

So you might need n = 30000 points when d = 5 to get the same accuracy as n = 300 when d = 1.

To get some intuition into why this is true, suppose the data fall into a d-dimensional unit cube. Let x be a point in the cube and let Nh be a cubical neighborhood around x where the cube has sides of length h. Suppose we want to choose h = h(π) so that a fraction π of the data falls into Nh. The expected fraction of points in Nh is h^d. Setting h^d = π we see that h(π) = π^{1/d} = e^{−(1/d) log(1/π)}. Thus h(π) → 1 as d grows. In high dimensions, we need huge neighborhoods to capture any reasonable fraction of the data.


With this warning in mind, let us press on and see how we might estimate the regression function.

LOCAL REGRESSION. Consider local linear regression. The kernel function K is now a function of d variables. Given a nonsingular positive definite d × d bandwidth matrix H, we define

KH(x) = (1 / |H|^{1/2}) K(H^{−1/2} x).

Often, one scales each covariate to have the same mean and variance and then we use the kernel

h−dK(||x||/h)

where K is any one-dimensional kernel. Then there is a single bandwidth parameter h. This is equivalent to using a bandwidth matrix of the form H = h^2 I. At a target value x = (x1, . . . , xd)^T, the local sum of squares is given by

∑_{i=1}^n wi(x) ( Yi − a0 − ∑_{j=1}^d aj (Xij − xj) )^2   (160)

where wi(x) = K(||Xi − x|| / h).

The estimator is

rn(x) = â0   (161)

where â = (â0, . . . , âd)^T is the value of a = (a0, . . . , ad)^T that minimizes the weighted sums of squares. The solution is

â = (Xx^T Wx Xx)^{−1} Xx^T Wx Y   (162)

where

Xx =
[ 1   X11 − x1   · · ·   X1d − xd
  1   X21 − x1   · · ·   X2d − xd
  ...
  1   Xn1 − x1   · · ·   Xnd − xd ]

and Wx is the diagonal matrix whose (i, i) element is wi(x). This is what locfit does. In other words, if you type

locfit(y ~ x1 + x2 + x3)

then locfit fits Y = r(x1, x2, x3) + ε using one bandwidth. So it is important to rescale your variables.


13.30 Theorem (Ruppert and Wand, 1994). Let rn be the multivariate local linear estimator with bandwidth matrix H. The (asymptotic) bias of rn(x) is

(1/2) μ2(K) trace(H Hr)   (163)

where Hr is the matrix of second partial derivatives of r evaluated at x and μ2(K) is the scalar defined by the equation ∫ u u^T K(u) du = μ2(K) I. The (asymptotic) variance of rn(x) is

σ2(x) ∫ K(u)^2 du / ( n |H|^{1/2} f(x) ).   (164)

Also, the bias at the boundary is the same order as in the interior.

Thus we see that in higher dimensions, local linear regression still avoids excessive boundary bias and design bias. Suppose that H = h^2 I. Then, using the above result, the MSE is

(h^4 / 4) ( μ2(K) ∑_{j=1}^d r_{jj}(x) )^2 + σ2(x) ∫ K(u)^2 du / (n h^d f(x)) = c1 h^4 + c2 / (n h^d)   (165)

which is minimized at h = cn−1/(d+4) giving MSE of size n−4/(4+d).

ADDITIVE MODELS. Interpreting and visualizing a high-dimensional fit is difficult. As the number of covariates increases, the computational burden becomes prohibitive. Sometimes, a more fruitful approach is to use an additive model. An additive model is a model of the form

Y = α + ∑_{j=1}^d rj(Xj) + ε   (166)

where r1, . . . , rd are smooth functions. The model (166) is not identifiable since we can add any constant to α and subtract the same constant from one of the rj's without changing the regression function. This problem can be fixed in a number of ways, perhaps the easiest being to set α̂ = Ȳ and then regard the rj's as deviations from Ȳ. In this case we require that ∑_{i=1}^n rj(Xi) = 0 for each j.

The additive model is clearly not as general as fitting r(x1, . . . , xd) but it is much simpler to compute and to interpret, and so it is often a good starting point. There is a simple algorithm for turning any one-dimensional regression smoother into a method for fitting additive models. It is called backfitting.

The Backfitting Algorithm

Initialization: set α̂ = Ȳ and set initial guesses for r1, . . . , rd.
Iterate until convergence: for j = 1, . . . , d:


• Compute the partial residuals Ỹi = Yi − α̂ − ∑_{k ≠ j} rk(Xik), i = 1, . . . , n.

• Apply a one-dimensional smoother to the Ỹi's on the jth covariate to obtain rj.

• Set rj(x) equal to rj(x) − n^{−1} ∑_{i=1}^n rj(Xi).

The idea of backfitting is that for each feature, you fit a regression model to the residuals after fitting all the other features. The algorithm applies to any type of smoother – kernel, splines, etc. You can write your own function to fit an additive model. R has a preprogrammed backfitting function, gam, for generalized additive models. This function fits a smoother for each feature.
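Here is a minimal hand-rolled backfitting loop (not from the notes), using smooth.spline as the one-dimensional smoother; the simulated data, the fixed number of iterations and the df value are assumptions.

## Backfitting an additive model with smooth.spline as the 1-d smoother (illustrative sketch).
set.seed(1)
n <- 300
X <- cbind(runif(n, -2, 2), runif(n, -2, 2), runif(n, -2, 2))
y <- sin(X[, 1]) + X[, 2]^2 + 0.5 * X[, 3] + rnorm(n, sd = 0.3)

d     <- ncol(X)
alpha <- mean(y)
r     <- matrix(0, n, d)                      # r[, j] holds the current estimate of r_j(X_ij)
for (iter in 1:20) {
  for (j in 1:d) {
    resid.j <- y - alpha - rowSums(r[, -j, drop = FALSE])   # partial residuals
    fit     <- smooth.spline(X[, j], resid.j, df = 6)
    r[, j]  <- predict(fit, X[, j])$y
    r[, j]  <- r[, j] - mean(r[, j])          # center so the fit is identifiable
  }
}
plot(X[, 2], r[, 2])                          # estimated r_2; compare with a centered X2^2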

13.31 Example. Here is an example involving three covariates and one response variable. The data are plotted in Figure 62. The data are 48 rock samples from a petroleum reservoir, the response is permeability (in milli-Darcies) and the covariates are: the area of pores (in pixels out of 256 by 256), perimeter in pixels and shape (perimeter/√area). The goal is to predict permeability from the three covariates. First we fit the additive model

permeability = r1(area) + r2(perimeter) + r3(shape) + ε.

We scale each covariate to have the same variance and then use a common bandwidth for each covariate. The estimates of r1, r2 and r3 are shown in Figure 62 (bottom). Ȳ was added to each function before plotting it. Next consider a three-dimensional local linear fit (161). After scaling each covariate to have mean 0 and variance 1, we found that the bandwidth h ≈ 3.2 minimized the cross-validation score. The residuals from the additive model and the full three-dimensional local linear fit are shown in Figure 63. Apparently, the fitted values are quite similar, suggesting that the generalized additive model is adequate.

REGRESSION TREES. A regression tree is a model of the form

r(x) = ∑_{m=1}^M cm I(x ∈ Rm)   (167)

where c1, . . . , cM are constants and R1, . . . , RM are disjoint rectangles that partition the space of covariates. The model is fitted in a recursive manner that can be represented as a tree; hence the name.

Denote a generic covariate value by x = (x1, . . . , xj, . . . , xd). The covariate for the ith observation is Xi = (xi1, . . . , xij, . . . , xid). Given a covariate j and a split point s we define the rectangles R1 = R1(j, s) = {x : xj ≤ s} and R2 = R2(j, s) = {x : xj > s} where, in this expression, xj refers to the jth covariate, not the jth observation. Then we take c1 to be the average of all the Yi's such that Xi ∈ R1 and c2 to be the average of all the Yi's such that Xi ∈ R2. Notice that c1 and c2 minimize the sums of squares ∑_{Xi ∈ R1} (Yi − c1)^2 and ∑_{Xi ∈ R2} (Yi − c2)^2. The choice



Figure 62: Top: the rock data. Bottom: the plots show r1, r2, and r3 for the additive model Y = r1(x1) + r2(x2) + r3(x3) + ε.



Figure 63: The residuals for the rock data. Top left: residuals from the additive model. Top right: qq-plot of the residuals from the additive model. Bottom left: residuals from the multivariate local linear model. Bottom right: residuals from the two fits plotted against each other.



Figure 64: A regression tree for two covariates x1 and x2. The function estimate is r(x) = c1 I(x ∈ R1) + c2 I(x ∈ R2) + c3 I(x ∈ R3) where R1, R2 and R3 are the rectangles shown in the lower plot.

of which covariate xj to split on and which split point s to use is based on minimizing the residual sum of squares. The splitting process is then repeated on each rectangle R1 and R2.

Figure 64 shows a simple example of a regression tree; also shown are the corresponding rectangles. The function estimate r is constant over the rectangles.

Generally one grows a very large tree, then the tree is pruned to form a subtree by collapsing regions together. The size of the tree is chosen by cross-validation. Usually, we use ten-fold cross-validation since leave-one-out is too expensive. Thus we divide the data into ten blocks, remove each block one at a time, fit the model on the remaining blocks and compute the prediction error for the observations in the left-out block. This is repeated for each block and the prediction error is averaged over the ten replications.

Here are the R commands:

library(tree)                   ### load the library
out = tree(y ~ x1 + x2 + x3)    ### fit the tree
plot(out)                       ### plot the tree
text(out)                       ### add labels to plot



Figure 65: Regression tree for the rock data.

print(out)                           ### print the tree
cv = cv.tree(out)                    ### prune the tree and compute
                                     ### the cross-validation score
plot(cv$size,cv$dev)                 ### plot the CV score versus tree size
m = cv$size[cv$dev == min(cv$dev)]   ### find the best size tree
new = prune.tree(out,best=m)         ### fit the best size tree
plot(new)
text(new)

13.32 Example. Figure 65 shows a tree for the rock data. Notice that the variable shape does not appear in the tree. This means that the shape variable was never the optimal covariate to split on in the algorithm. The result is that the tree only depends on area and peri. This illustrates an important feature of tree regression: it automatically performs variable selection in the sense that a covariate xj will not appear in the tree if the algorithm finds that the variable is not important.


14 Density Estimation

A problem closely related to nonparametric regression is nonparametric density estimation. Let

X1, . . . , Xn ∼ f

where f is some probability density. We want to estimate f.

14.1 Example (Bart Simpson). The top left plot in Figure 66 shows the density

f(x) = (1/2) φ(x; 0, 1) + (1/10) ∑_{j=0}^4 φ(x; (j/2) − 1, 1/10)   (168)

where φ(x; µ, σ) denotes a Normal density with mean µ and standard deviation σ. Based on 1000 draws from f, I computed a kernel density estimator, described later. The top right plot is based on a small bandwidth h which leads to undersmoothing. The bottom right plot is based on a large bandwidth h which leads to oversmoothing. The bottom left plot is based on a bandwidth h which was chosen to minimize estimated risk. This leads to a much more reasonable density estimate.

We will evaluate the quality of an estimator fn with the risk, or integrated mean squared error, R = E(L) where

L = ∫ (fn(x) − f(x))^2 dx

is the integrated squared error loss function. The estimators will depend on some smoothing parameter h and we will choose h to minimize an estimate of the risk. The usual method for estimating risk is leave-one-out cross-validation. The details are different for density estimation than for regression. In the regression case, the cross-validation score was defined as ∑_{i=1}^n (Yi − r_{(−i)}(Xi))^2, but in density estimation there is no response variable Y. Instead, we proceed as follows. The loss function, which we now write as a function of h (since fn will depend on some smoothing parameter h), is

L(h) = ∫ (fn(x) − f(x))^2 dx
     = ∫ fn^2(x) dx − 2 ∫ fn(x) f(x) dx + ∫ f^2(x) dx.

The last term does not depend on h, so minimizing the loss is equivalent to minimizing the expected value of

J(h) = ∫ fn^2(x) dx − 2 ∫ fn(x) f(x) dx.   (169)

We shall refer to E(J(h)) as the risk, although it differs from the true risk by the constant term ∫ f^2(x) dx.

The cross-validation estimator of risk is

Ĵ(h) = ∫ ( fn(x) )^2 dx − (2/n) ∑_{i=1}^n f_{(−i)}(Xi)   (170)



Figure 66: The Bart Simpson density from Example 14.1. Top left: true density. The other plots are kernel estimators based on n = 1000 draws. Bottom left: bandwidth h = 0.05 chosen by leave-one-out cross-validation. Top right: bandwidth h/10. Bottom right: bandwidth 10h.


where f_{(−i)} is the density estimator obtained after removing the ith observation. We refer to Ĵ(h) as the cross-validation score or estimated risk.

Perhaps the simplest nonparametric density estimator is the histogram. Suppose f has its support on some interval which, without loss of generality, we take to be [0, 1]. Let m be an integer and define bins

B1 = [0, 1/m),   B2 = [1/m, 2/m),   . . . ,   Bm = [(m − 1)/m, 1].   (171)

Define the binwidth h = 1/m, let Yj be the number of observations in Bj, let p̂j = Yj/n and let pj = ∫_{Bj} f(u) du.

The histogram estimator is defined by

fn(x) = ∑_{j=1}^m (p̂j / h) I(x ∈ Bj).   (172)

To understand the motivation for this estimator, note that, for x ∈ Bj and h small,

E(fn(x)) = E(p̂j)/h = pj/h = ∫_{Bj} f(u) du / h ≈ f(x) h / h = f(x).

14.2 Example. Figure 67 shows three different histograms based on n = 1,266 data points from an astronomical sky survey. Each data point represents a "redshift," roughly speaking, the distance from us to a galaxy. Choosing the right number of bins involves finding a good tradeoff between bias and variance. We shall see later that the top left histogram has too many bins, resulting in undersmoothing and too much variance. The bottom left histogram has too few bins, resulting in oversmoothing and too much bias. The top right histogram is based on 308 bins (chosen by cross-validation). The histogram reveals the presence of clusters of galaxies.

Consider fixed x and fixed m, and let Bj be the bin containing x. Then,

E(fn(x)) = pj / h   and   V(fn(x)) = pj(1 − pj) / (n h^2).   (173)

The risk satisfies

R(fn, f) ≈ (h^2 / 12) ∫ (f′(u))^2 du + 1/(n h).   (174)

The value h∗ that minimizes (174) is

h∗ = (1 / n^{1/3}) ( 6 / ∫ (f′(u))^2 du )^{1/3}.   (175)

With this choice of binwidth,

R(fn, f) ∼ C / n^{2/3}.   (176)

We see that with an optimally chosen binwidth, the risk decreases to 0 at rate n^{−2/3}. We will see shortly that kernel estimators converge at the faster rate n^{−4/5}.



Figure 67: Three versions of a histogram for the astronomy data. The top left histogram has too many bins. The bottom left histogram has too few bins. The top right histogram uses 308 bins (chosen by cross-validation). The lower right plot shows the estimated risk versus the number of bins.


14.3 Theorem. The following identity holds:

Ĵ(h) = 2 / (h(n − 1)) − (n + 1) / (h(n − 1)) ∑_{j=1}^m p̂_j^2.   (177)
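A minimal sketch (not from the notes) of how one might use (177) to choose the number of bins for data supported on [0, 1]; the simulated Beta data and the range of bin counts are assumptions.

## Cross-validation score (177) for histograms over a grid of bin counts (illustrative sketch).
set.seed(1)
n <- 1000
x <- rbeta(n, 2, 5)                             # data supported on [0, 1]

Jhat <- function(m) {
  h    <- 1 / m
  phat <- table(cut(x, breaks = seq(0, 1, length = m + 1), include.lowest = TRUE)) / n
  2 / (h * (n - 1)) - (n + 1) / (h * (n - 1)) * sum(phat^2)
}

ms    <- 1:100
J     <- sapply(ms, Jhat)
m.opt <- ms[which.min(J)]                       # number of bins minimizing the CV score
hist(x, breaks = seq(0, 1, length = m.opt + 1))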

14.4 Example. We used cross-validation in the astronomy example. We find that m = 308 is an approximate minimizer. The histogram in the top right plot in Figure 67 was constructed using m = 308 bins. The bottom right plot shows the estimated risk, or more precisely Ĵ, plotted versus the number of bins.

Histograms are not smooth. Now we discuss kernel density estimators, which are smoother and which converge to the true density faster.

Given a kernel K and a positive number h, called the bandwidth, the kernel density estimator is defined to be

fn(x) = (1/n) ∑_{i=1}^n (1/h) K( (x − Xi) / h ).   (178)

This amounts to placing a smoothed out lump of mass of size 1/n over each data point Xi; see Figure 68.

In R use: density(x, bw=h) where h is the bandwidth.

As with kernel regression, the choice of kernel K is not crucial, but the choice of bandwidth h is important. Figure 69 shows density estimates with several different bandwidths. Look also at Figure 66. We see how sensitive the estimate fn is to the choice of h. Small bandwidths give very rough estimates while larger bandwidths give smoother estimates. In general we will let the bandwidth depend on the sample size so we write hn. Here are some properties of fn.

The risk is

R ≈ (1/4) σK^4 hn^4 ∫ (f″(x))^2 dx + ∫ K^2(x) dx / (n h)   (179)

where σK^2 = ∫ x^2 K(x) dx.

If we differentiate (179) with respect to h and set it equal to 0, we see that the asymptotically optimal bandwidth is

h∗ = ( c2 / (c1^2 A(f) n) )^{1/5}   (180)

where c1 = ∫ x^2 K(x) dx, c2 = ∫ K(x)^2 dx and A(f) = ∫ (f″(x))^2 dx. This is informative because it tells us that the best bandwidth decreases at rate n^{−1/5}. Plugging h∗ into (179), we see that if the optimal bandwidth is used then R = O(n^{−4/5}). As we saw, histograms converge at rate O(n^{−2/3}), showing that kernel estimators are superior in rate to histograms.

In practice, the bandwidth can be chosen by cross-validation, but first we describe another method which is sometimes used when f is thought to be very smooth. Specifically, we compute h∗ from (180) under the idealized assumption that f is Normal. This yields h∗ = 1.06 σ n^{−1/5}. Usually, σ is estimated by min{s, Q/1.34} where s is the sample standard deviation and Q is the



Figure 68: A kernel density estimator fn. At each point x, fn(x) is the average of the kernels centered over the data points Xi. The data points are indicated by short vertical bars. The kernels are not drawn to scale.



Figure 69: Kernel density estimators and estimated risk for the astronomy data. Top left: oversmoothed. Top right: just right (bandwidth chosen by cross-validation). Bottom left: undersmoothed. Bottom right: cross-validation curve as a function of bandwidth h. The bandwidth was chosen to be the value of h where the curve is a minimum.


interquartile range.1 This choice of h∗ works well if the true density is very smooth and is called the Normal reference rule.

The Normal Reference Rule

For smooth densities and a Normal kernel, use the bandwidth

hn = 1.06 σ̂ / n^{1/5}

where

σ̂ = min{ s, Q/1.34 }.

Since we don’t want to necessarily assume that f is very smooth, it is usually better to estimateh using cross-validation. Recall that the cross-validation score is

Ĵ(h) = ∫ fn^2(x) dx − (2/n) ∑_{i=1}^n f_{−i}(Xi)   (181)

where f_{−i} denotes the kernel estimator obtained by omitting Xi.

R code. Use the bw.ucv function to do cross-validation:

h = bw.ucv(x)                ### unbiased (leave-one-out) cross-validation bandwidth
plot(density(x,bw=h))

The bandwidth for the density estimator in the upper right panel of Figure 69 is based on cross-validation. In this case it worked well, but of course there are lots of examples where there are problems. Do not assume that, if the estimator f is wiggly, then cross-validation has let you down. The eye is not a good judge of risk.

Constructing confidence bands for kernel density estimators is similar to regression. Note that fn(x) is just a sample average: fn(x) = n^{−1} ∑_{i=1}^n Zi(x) where

Zi(x) = (1/h) K( (x − Xi) / h ).

So the standard error is se(x) = s(x)/√n where s(x) is the standard deviation of the Zi(x)′s:

s(x) = ( (1/n) ∑_{i=1}^n (Zi(x) − fn(x))^2 )^{1/2}.   (182)

Then we use

fn(x) ± z_{α/(2n)} se(x).
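A minimal sketch (not from the notes) of this construction on a grid of evaluation points: Zi(x) is computed for a Gaussian kernel, the estimate is the row mean, and the band uses the Bonferroni-style constant above. The simulated mixture data, the grid and the use of the sample standard deviation for s(x) are assumptions.

## Kernel density estimate with variability bands as described above (illustrative sketch).
set.seed(1)
n <- 1000
x <- c(rnorm(n / 2, -1, 0.5), rnorm(n / 2, 1, 0.5))
h <- bw.ucv(x)                                 # cross-validation bandwidth
alpha <- 0.05

grid <- seq(-3, 3, length = 200)
Z    <- outer(grid, x, function(g, xi) dnorm((g - xi) / h) / h)   # Z_i(x) at each grid point
fhat <- rowMeans(Z)                            # kernel density estimate
se   <- apply(Z, 1, sd) / sqrt(n)              # estimated standard error of fhat
crit <- qnorm(1 - alpha / (2 * n))             # Bonferroni-style constant

plot(grid, fhat, type = "l")
lines(grid, fhat + crit * se, lty = 2)
lines(grid, fhat - crit * se, lty = 2)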


[Figure: kernel density estimates with confidence bands for the two examples in Example 14.5, using the cross-validation bandwidth and the Normal reference rule.]


14.5 Example. Figure 14 shows two examples. The first is data from N(0, 1) and the second from (1/2)N(−1, .1) + (1/2)N(1, .1). In both cases, n = 1000. We show the estimates using cross-validation and the Normal reference rule together with bands. The true curve is also shown. That's the curve outside the bands in the last plot.

Suppose now that the data are d-dimensional so that Xi = (Xi1, . . . , Xid). The kernel estimator can easily be generalized to d dimensions. Most often, we use the product kernel

fn(x) = (1 / (n h1 · · · hd)) ∑_{i=1}^n ∏_{j=1}^d K( (xj − Xij) / hj ).   (183)

To further simplify, we can rescale the variables to have the same variance and then use only one bandwidth.

A LINK BETWEEN REGRESSION AND DENSITY ESTIMATION. Consider regression again. Recall that

r(x) = E(Y | X = x) = ∫ y f(y|x) dy = ∫ y f(x, y) dy / f(x)   (184)
                    = ∫ y f(x, y) dy / ∫ f(x, y) dy.   (185)

Suppose we compute a bivariate kernel density estimator

f̂(x, y) = (1/n) ∑_{i=1}^n (1/h1) K( (x − Xi)/h1 ) (1/h2) K( (y − Yi)/h2 )   (186)

and we insert this into (185). Assuming that ∫ u K(u) du = 0, we see that

∫ y (1/h2) K( (y − Yi)/h2 ) dy = ∫ (h2 u + Yi) K(u) du   (187)
                               = h2 ∫ u K(u) du + Yi ∫ K(u) du   (188)
                               = Yi.   (189)

Hence,

∫ y f̂(x, y) dy = (1/n) ∑_{i=1}^n ∫ y (1/h1) K( (x − Xi)/h1 ) (1/h2) K( (y − Yi)/h2 ) dy   (190)
               = (1/n) ∑_{i=1}^n (1/h1) K( (x − Xi)/h1 ) ∫ y (1/h2) K( (y − Yi)/h2 ) dy   (191)
               = (1/n) ∑_{i=1}^n Yi (1/h1) K( (x − Xi)/h1 ).   (192)

1 Recall that the interquartile range is the 75th percentile minus the 25th percentile. The reason for dividing by 1.34 is that Q/1.34 is a consistent estimate of σ if the data are from a N(µ, σ2).


Also,

∫ f̂(x, y) dy = (1/n) ∑_{i=1}^n ∫ (1/h1) K( (x − Xi)/h1 ) (1/h2) K( (y − Yi)/h2 ) dy   (193)
             = (1/n) ∑_{i=1}^n (1/h1) K( (x − Xi)/h1 ).   (194)

Therefore,

r̂(x) = ∫ y f̂(x, y) dy / ∫ f̂(x, y) dy   (195)
      = ( (1/n) ∑_{i=1}^n Yi (1/h1) K( (x − Xi)/h1 ) ) / ( (1/n) ∑_{i=1}^n (1/h1) K( (x − Xi)/h1 ) )   (196)
      = ∑_{i=1}^n Yi K( (x − Xi)/h1 ) / ∑_{i=1}^n K( (x − Xi)/h1 )   (197)

which is the kernel regression estimator. In other words, the kernel regression estimator can be derived from kernel density estimation.


15 Classification

REFERENCES:

1. Hastie, Tibshirani and Friedman (2001). The Elements of Statistical Learning.

2. Devroye, Gyorfi and Lugosi (1996). A Probabilistic Theory of Pattern Recognition.

3. Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning.

The problem of predicting a discrete random variable Y from another random variable X is called classification, supervised learning, discrimination, or pattern recognition.

Consider IID data (X1, Y1), . . . , (Xn, Yn) where

Xi = (Xi1, . . . , Xid)T ∈ X ⊂ Rd

is a d-dimensional vector and Yi takes values in {0, 1}. Often, the covariates X are also called features. The goal is to predict Y given a new X. This is the same as binary regression except that the focus is on good prediction rather than estimating the regression function.

A classification rule is a function h : X → {0, 1}. When we observe a new X, we predict Y to be h(X). The classification risk (or error rate) of h is

R(h) = P(Y ≠ h(X)).   (198)

EXAMPLES:

1. The Coronary Risk-Factor Study (CORIS) data. There are 462 males between the ages of 15 and 64 from three rural areas in South Africa. The outcome Y is the presence (Y = 1) or absence (Y = 0) of coronary heart disease and there are 9 covariates: systolic blood pressure, cumulative tobacco (kg), ldl (low density lipoprotein cholesterol), adiposity, famhist (family history of heart disease), typea (type-A behavior), obesity, alcohol (current alcohol consumption), and age. The goal is to predict Y from all the covariates.

2. Predict if a stock will go up or down based on past performance. Here X is the past price and Y is the future price.

3. Predict if an email message is spam or real.

4. Identify whether glass fragments in a criminal investigation are from a window or not, based on chemical composition.


Figure 70: Zip code data.



Figure 71: Two covariates and a linear decision boundary. Triangles mean Y = 1; squares mean Y = 0. These two groups are perfectly separated by the linear decision boundary.

5. Identify handwritten digits from images. Each Y is a digit from 0 to 9. There are 256 covariates x1, . . . , x256 corresponding to the intensity values from the pixels of the 16 × 16 image. See Figure 70.

15.1 Example. Figure 71 shows 100 data points. The covariate X = (X1, X2) is 2-dimensional and the outcome Y ∈ Y = {0, 1}. The Y values are indicated on the plot with the triangles representing Y = 1 and the squares representing Y = 0. Also shown is a linear classification rule represented by the solid line. This is a rule of the form

h(x) = { 1 if a + b1x1 + b2x2 > 0
         0 otherwise.

Everything above the line is classified as a 0 and everything below the line is classified as a 1.


15.1 Error Rates, The Bayes Classifier and Regression

The true error rate (or classification risk) of a classifier h is

R(h) = P(h(X) ≠ Y)   (199)

and the empirical error rate or training error rate is

R̂n(h) = (1/n) ∑_{i=1}^n I(h(Xi) ≠ Yi).   (200)

Now we will relate classification to regression. Let

r(x) = E(Y |X = x) = P(Y = 1|X = x)

denote the regression function. We have the following important result.

The rule h that minimizes R(h) is

h∗(x) = { 1 if r(x) > 1/2
          0 otherwise.   (201)

The rule h∗ is called the Bayes' rule. The risk R∗ = R(h∗) of the Bayes rule is called the Bayes' risk. The set

D(h) = {x : r(x) = 1/2}   (202)

is called the decision boundary.

PROOF. We will show that R(h) − R(h∗) ≥ 0. Note that

R(h) = P(Y ≠ h(X)) = ∫ P(Y ≠ h(X) | X = x) f(x) dx.

It suffices to show that

P(Y ≠ h(X)|X = x) − P(Y ≠ h∗(X)|X = x) ≥ 0 for all x.    (203)

Now,

P(Y ≠ h(X)|X = x) = 1 − P(Y = h(X)|X = x)
 = 1 − [ P(Y = 1, h(X) = 1|X = x) + P(Y = 0, h(X) = 0|X = x) ]
 = 1 − [ I(h(x) = 1)P(Y = 1|X = x) + I(h(x) = 0)P(Y = 0|X = x) ]
 = 1 − [ I(h(x) = 1)r(x) + I(h(x) = 0)(1 − r(x)) ]
 = 1 − [ I(x)r(x) + (1 − I(x))(1 − r(x)) ]

where I(x) = I(h(x) = 1). Hence,

P(Y ≠ h(X)|X = x) − P(Y ≠ h∗(X)|X = x)
 = [ I∗(x)r(x) + (1 − I∗(x))(1 − r(x)) ] − [ I(x)r(x) + (1 − I(x))(1 − r(x)) ]
 = (2r(x) − 1)(I∗(x) − I(x))
 = 2( r(x) − 1/2 )( I∗(x) − I(x) )    (204)

where I∗(x) = I(h∗(x) = 1).

When r(x) ≥ 1/2, h∗(x) = 1 so (204) is non-negative. When r(x) < 1/2, h∗(x) = 0 so both terms are nonpositive and hence (204) is again non-negative. This proves (203).

To summarize, if h is any classifier, then R(h) ≥ R∗.

15.2 Classification is Easier Than Regression

Let r∗(x) = E(Y |X = x) be the true regression function and let h∗(x) denote the corresponding Bayes rule. Let r̂(x) be an estimate of r∗(x) and define the plug-in rule:

ĥ(x) = 1 if r̂(x) > 1/2, and 0 otherwise.

In the previous proof we showed that

P(Y ≠ ĥ(X)|X = x) − P(Y ≠ h∗(X)|X = x) = (2r∗(x) − 1)( I(h∗(x) = 1) − I(ĥ(x) = 1) )
 = |2r∗(x) − 1| I(h∗(x) ≠ ĥ(x))
 = 2|r∗(x) − 1/2| I(h∗(x) ≠ ĥ(x)).

Now, when h∗(x) ≠ ĥ(x) we have that |r̂(x) − r∗(x)| ≥ |r∗(x) − 1/2|. Therefore,

P(ĥ(X) ≠ Y ) − P(h∗(X) ≠ Y ) = 2 ∫ |r∗(x) − 1/2| I(h∗(x) ≠ ĥ(x)) f(x)dx
 ≤ 2 ∫ |r̂(x) − r∗(x)| I(h∗(x) ≠ ĥ(x)) f(x)dx
 ≤ 2 ∫ |r̂(x) − r∗(x)| f(x)dx
 = 2 E|r̂(X) − r∗(X)|.

This means that if r̂(x) is close to r∗(x) then the classification risk will be close to the Bayes risk. The converse is not true: it is possible for r̂(x) to be far from r∗(x) and still lead to a good classifier. As long as r̂(x) and r∗(x) are on the same side of 1/2 they yield the same classifier.

15.3 The Bayes' Rule and the Class Densities

We can rewrite h∗ in a different way. From Bayes' theorem we have that

r(x) = P(Y = 1|X = x)
 = f(x|Y = 1)P(Y = 1) / [ f(x|Y = 1)P(Y = 1) + f(x|Y = 0)P(Y = 0) ]
 = πf1(x) / [ πf1(x) + (1 − π)f0(x) ]    (205)

where

f0(x) = f(x|Y = 0)

f1(x) = f(x|Y = 1)

π = P(Y = 1).

We call f0 and f1 the class densities. Thus we have:

The Bayes’ rule can be written as:

h∗(x) = 1 if f1(x)/f0(x) > (1 − π)/π, and 0 otherwise.    (206)

15.4 How to Find a Good Classifier

The Bayes rule depends on unknown quantities so we need to use the data to find some approximation to the Bayes rule. There are three main approaches:

1. Empirical Risk Minimization. Choose a set of classifiers H and find ĥ ∈ H that minimizes some estimate of the risk R(h).

2. Regression (Plug-in Classifiers). Find an estimate r̂ of the regression function r and define

ĥ(x) = 1 if r̂(x) > 1/2, and 0 otherwise.


3. Density Estimation. Estimate f̂0 from the Xi's for which Yi = 0, estimate f̂1 from the Xi's for which Yi = 1, and let π̂ = (1/n) ∑_{i=1}^n Yi. Define

   r̂(x) = π̂ f̂1(x) / [ π̂ f̂1(x) + (1 − π̂) f̂0(x) ]

   and

   ĥ(x) = 1 if r̂(x) > 1/2, and 0 otherwise.

15.5 Empirical Risk Minimization: The Finite Case

Let H be a finite set of classifiers. Empirical risk minimization means choosing the classifier ĥ ∈ H to minimize the training error R̂n(h), also called the empirical risk. Thus,

ĥ = argmin_{h∈H} R̂n(h) = argmin_{h∈H} (1/n) ∑_i I(h(Xi) ≠ Yi).    (207)

Let h∗ be the best classifier in H, that is, R(h∗) = min_{h∈H} R(h). How good is ĥ compared to h∗? We know that R(h∗) ≤ R(ĥ). We will now show that, with high probability, R(ĥ) ≤ R(h∗) + ε for some small ε > 0.

Our main tool for this analysis is Hoeffding's inequality. This inequality is very fundamental and is used in many places in statistics and machine learning.

Hoeffding's Inequality. If X1, . . . , Xn ∼ Bernoulli(p), then, for any ε > 0,

P(|p̂ − p| > ε) ≤ 2 e^{−2nε²}    (208)

where p̂ = (1/n) ∑_{i=1}^n Xi.

Another basic fact we need is the union bound: if Z1, . . . , Zm are random variables then

P(max_j Zj ≥ c) ≤ ∑_j P(Zj ≥ c).

This follows since

P(max_j Zj ≥ c) = P(Z1 ≥ c or Z2 ≥ c or · · · or Zm ≥ c) ≤ ∑_j P(Zj ≥ c).

Recall that H = {h1, . . . , hm} consists of finitely many classifiers. Now we see that:

P( max_{h∈H} |R̂n(h) − R(h)| > ε ) ≤ ∑_{h∈H} P( |R̂n(h) − R(h)| > ε ) ≤ 2m e^{−2nε²}.


Fix α and let

εn = √( (2/n) log(2m/α) ).

Then

P( max_{h∈H} |R̂n(h) − R(h)| > εn ) ≤ α.

Hence, with probability at least 1− α, the following is true:

R(ĥ) ≤ R̂n(ĥ) + εn ≤ R̂n(h∗) + εn ≤ R(h∗) + 2εn.

Summarizing:

P( R(ĥ) > R(h∗) + √( (8/n) log(2m/α) ) ) ≤ α.

We might extend our analysis to infinite H later.
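To get a feel for how the bound behaves, here is a minimal R sketch (the values of n, m and α are hypothetical, not taken from the notes) that computes εn for several sizes of the finite class H:

# Sketch: the value of epsilon_n in the finite-class bound.
eps.n = function(n, m, alpha) sqrt((2/n) * log(2*m/alpha))

n = 1000                      # sample size (assumed)
alpha = 0.05                  # confidence level (assumed)
m = c(10, 100, 1000, 10^6)    # candidate sizes of H
round(sapply(m, function(mm) eps.n(n, mm, alpha)), 3)
# epsilon_n grows only logarithmically in m, so even a very large finite class
# costs little; 2*eps.n bounds R(h.hat) - R(h.star) with probability 1 - alpha.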

15.6 Parametric Methods I: Linear and Logistic Regression

One approach to classification is to estimate the regression function r(x) = E(Y |X = x) = P(Y = 1|X = x) and, once we have an estimate r̂, use the classification rule

ĥ(x) = 1 if r̂(x) > 1/2, and 0 otherwise.    (209)

The linear regression model

Y = r(x) + ε = β0 + ∑_{j=1}^d βj Xj + ε    (210)

can't be correct since it does not force Y = 0 or 1. Nonetheless, it can sometimes lead to a good classifier. An alternative is to use logistic regression:

r(x) = P(Y = 1|X = x) = e^{β0 + ∑_j βj xj} / (1 + e^{β0 + ∑_j βj xj}).    (211)

15.2 Example. Let us return to the South African heart disease data.

> print(names(sa.data))
 [1] "sbp"       "tobacco"   "ldl"       "adiposity" "famhist"   "typea"
 [7] "obesity"   "alcohol"   "age"       "chd"
> n = nrow(sa.data)
>
> ### linear
> out = lm(chd ~ . , data=sa.data)
> tmp = predict(out)
> yhat = rep(0,n)
> yhat[tmp > .5] = 1
> print(table(chd,yhat))
   yhat
chd   0   1
  0 260  42
  1  76  84
> print(sum(chd != yhat)/n)
[1] 0.2554113
>
> ### logistic
> out = glm(chd ~ . , data=sa.data, family=binomial)
> tmp = predict(out, type="response")
> yhat = rep(0,n)
> yhat[tmp > .5] = 1
> print(table(chd,yhat))
   yhat
chd   0   1
  0 256  46
  1  77  83
> print(sum(chd != yhat)/n)
[1] 0.2662338

15.3 Example. For the digits example, let's restrict ourselves only to Y = 0 and Y = 1. Here is what we get:

> ### linear
> out = lm(ytrain ~ ., data=as.data.frame(xtrain))
> tmp = predict(out)
> n = length(ytrain)
> yhat = rep(0,n)
> yhat[tmp > .5] = 1
> b = table(ytrain,yhat)
> print(b)
      yhat
ytrain   0   1
     0 600   0
     1   0 500
> print((b[1,2]+b[2,1])/sum(b))   ### training error
[1] 0
> tmp = predict(out, newdata=as.data.frame(xtest))
Warning message:
prediction from a rank-deficient fit may be misleading in:
  predict.lm(out, newdata = as.data.frame(xtest))
> n = length(ytest)
> yhat = rep(0,n)
> yhat[tmp > .5] = 1
> b = table(ytest,yhat)
> print(b)
     yhat
ytest   0   1
    0 590   4
    1   0 505
> print((b[1,2]+b[2,1])/sum(b))   ### testing error
[1] 0.003639672

15.7 Parametric Methods II: Gaussian and Linear Classifiers

Suppose that f0(x) = f(x|Y = 0) and f1(x) = f(x|Y = 1) are both multivariate Gaussians:

fk(x) = (2π)^{−d/2} |Σk|^{−1/2} exp{ −(1/2)(x − µk)^T Σk^{−1} (x − µk) },   k = 0, 1.

Thus, X|Y = 0 ∼ N(µ0, Σ0) and X|Y = 1 ∼ N(µ1, Σ1).

15.4 Theorem. If X|Y = 0 ∼ N(µ0,Σ0) and X|Y = 1 ∼ N(µ1,Σ1), then the Bayes rule is

h∗(x) = 1 if r1² < r0² + 2 log(π1/π0) + log( |Σ0|/|Σ1| ), and 0 otherwise    (212)

where

ri² = (x − µi)^T Σi^{−1} (x − µi),   i = 0, 1    (213)

is the Mahalanobis distance. An equivalent way of expressing the Bayes' rule is

h∗(x) = argmax_{k ∈ {0,1}} δk(x)

where

δk(x) = −(1/2) log |Σk| − (1/2)(x − µk)^T Σk^{−1} (x − µk) + log πk    (214)

and |A| denotes the determinant of a matrix A.

The decision boundary of the above classifier is quadratic so this procedure is called quadratic discriminant analysis (QDA). In practice, we use sample estimates of π0, π1, µ0, µ1, Σ0, Σ1 in place of the true values, namely:

π̂0 = (1/n) ∑_{i=1}^n (1 − Yi),    π̂1 = (1/n) ∑_{i=1}^n Yi

µ̂0 = (1/n0) ∑_{i: Yi=0} Xi,    µ̂1 = (1/n1) ∑_{i: Yi=1} Xi

S0 = (1/n0) ∑_{i: Yi=0} (Xi − µ̂0)(Xi − µ̂0)^T,    S1 = (1/n1) ∑_{i: Yi=1} (Xi − µ̂1)(Xi − µ̂1)^T


where n0 = ∑_i (1 − Yi) and n1 = ∑_i Yi.

A simplification occurs if we assume that Σ0 = Σ1 = Σ. In that case, the Bayes rule is

h∗(x) = argmax_k δk(x)    (215)

where now

δk(x) = x^T Σ^{−1} µk − (1/2) µk^T Σ^{−1} µk + log πk.    (216)

The parameters are estimated as before, except that the MLE of Σ is

S = (n0 S0 + n1 S1) / (n0 + n1).

The classification rule is

h∗(x) = 1 if δ1(x) > δ0(x), and 0 otherwise    (217)

where

δj(x) = x^T S^{−1} µ̂j − (1/2) µ̂j^T S^{−1} µ̂j + log π̂j

is called the discriminant function. The decision boundary {x : δ0(x) = δ1(x)} is linear so this method is called linear discriminant analysis (LDA).

15.5 Example. Let us return to the South African heart disease data. In R use:

library(MASS)
out = lda(x, y)     ### or qda(x, y) for quadratic
yhat = predict(out)$class

The error rate of LDA is .25. For QDA we get .24. In this example, there is little advantage to QDA over LDA.
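To see how the discriminant functions in (216)-(217) are actually computed, here is a minimal hand-rolled sketch (not part of the original notes), assuming x is a numeric matrix of covariates and y is a 0/1 vector as in the example above; it should agree closely with lda(), up to the choice of covariance estimator.

# Sketch: LDA "by hand" from the pooled covariance estimate S.
lda.by.hand = function(x, y) {
  x = as.matrix(x)
  n0 = sum(y == 0); n1 = sum(y == 1)
  pi0 = n0/(n0 + n1); pi1 = n1/(n0 + n1)
  mu0 = colMeans(x[y == 0, , drop = FALSE])
  mu1 = colMeans(x[y == 1, , drop = FALSE])
  S0 = cov(x[y == 0, , drop = FALSE]) * (n0 - 1)/n0   # MLE form, as in the notes
  S1 = cov(x[y == 1, , drop = FALSE]) * (n1 - 1)/n1
  S = (n0*S0 + n1*S1)/(n0 + n1)                       # pooled covariance
  Sinv = solve(S)
  delta = function(mu, pik)                           # discriminant function (216)
    x %*% Sinv %*% mu - 0.5*drop(t(mu) %*% Sinv %*% mu) + log(pik)
  ifelse(delta(mu1, pi1) > delta(mu0, pi0), 1, 0)     # rule (217)
}
# yhat = lda.by.hand(x, y); mean(y != yhat)   # should be close to lda()'s .25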

Now we generalize to the case where Y takes on more than two values.

15.6 Theorem. Suppose that Y ∈ {1, . . . , K}. If fk(x) = f(x|Y = k) is Gaussian, the Bayes rule is

h(x) = argmax_k δk(x)

where

δk(x) = −(1/2) log |Σk| − (1/2)(x − µk)^T Σk^{−1} (x − µk) + log πk.    (218)

If the variances of the Gaussians are equal, then

δk(x) = x^T Σ^{−1} µk − (1/2) µk^T Σ^{−1} µk + log πk.    (219)


Figure 72: from Hastie et al. 2001


Figure 73: from Hastie et al. 2001


In Figure 72 the left panel shows LDA applied to data that happen to represent 3 classes. The linear decision boundaries separate the 3 groups fairly well using the two variables plotted on the horizontal and vertical axes (X1, X2). In the right hand panel LDA is applied to 5 variables, obtained by adding the quadratic and interaction terms (X1², X2², X1X2). The linear boundaries in the 5-dimensional space correspond to curves in the 2-dimensional space. In Figure 73 the right panel shows QDA applied to the same data. QDA nearly matches LDA applied with the quadratic terms.

There is another version of linear discriminant analysis due to Fisher. The idea is to first reduce the dimension of the covariates to one by projecting the data onto a line. Algebraically, this means replacing the covariate X = (X1, . . . , Xd) with a linear combination U = w^T X = ∑_{j=1}^d wj Xj. The goal is to choose the vector w = (w1, . . . , wd) that "best separates the data." Then we perform classification with the one-dimensional covariate U instead of X. Fisher's rule is the same as the Bayes linear classifier in equation (216) when π = 1/2.

15.8 Relationship Between Logistic Regression and LDA

LDA and logistic regression are almost the same thing. If we assume that each group is Gaussian with the same covariance matrix, then we saw earlier that

log( P(Y = 1|X = x) / P(Y = 0|X = x) ) = log(π1/π0) − (1/2)(µ0 + µ1)^T Σ^{−1} (µ1 − µ0) + x^T Σ^{−1} (µ1 − µ0)
 ≡ α0 + α^T x.

On the other hand, the logistic model is, by assumption,

log( P(Y = 1|X = x) / P(Y = 0|X = x) ) = β0 + β^T x.

These are the same model since they both lead to classification rules that are linear in x. The difference is in how we estimate the parameters.

The joint density of a single observation is f(x, y) = f(x|y)f(y) = f(y|x)f(x). In LDA we estimated the whole joint distribution by maximizing the likelihood

∏_i f(Xi, yi) = ∏_i f(Xi|yi) × ∏_i f(yi),    (220)

where the first factor is Gaussian and the second factor is Bernoulli.

In logistic regression we maximized the conditional likelihood ∏_i f(yi|Xi) but we ignored the second term f(Xi):

∏_i f(Xi, yi) = ∏_i f(yi|Xi) × ∏_i f(Xi),    (221)

where the first factor is the logistic model and the second factor is ignored.


Since classification only requires knowing f(y|x), we don't really need to estimate the whole joint distribution. Logistic regression leaves the marginal distribution f(x) unspecified so it is more nonparametric than LDA. This is an advantage of the logistic regression approach over LDA.

To summarize: LDA and logistic regression both lead to a linear classification rule. In LDA we estimate the entire joint distribution f(x, y) = f(x|y)f(y). In logistic regression we only estimate f(y|x) and we don't bother estimating f(x).

15.9 Training and Testing Data: Model Validation

How do we choose a good classifier? We would like to have a classifier h with a low prediction error rate. Usually, we can't use the training error rate as an estimate of the true error rate because it is biased downward.

15.7 Example. Consider the heart disease data again. Suppose we fit a sequence of logistic regression models. In the first model we include one covariate. In the second model we include two covariates, and so on. The ninth model includes all the covariates. We can go even further. Let's also fit a tenth model that includes all nine covariates plus the first covariate squared. Then we fit an eleventh model that includes all nine covariates plus the first covariate squared and the second covariate squared. Continuing this way we will get a sequence of 18 classifiers of increasing complexity. The solid line in Figure 74 shows the observed classification error, which steadily decreases as we make the model more complex. If we keep going, we can make a model with zero observed classification error. The dashed line shows the 10-fold cross-validation estimate of the error rate (to be explained shortly), which is a better estimate of the true error rate than the observed classification error. The estimated error decreases for a while then increases. This is essentially the bias–variance tradeoff phenomenon we have seen before.

How can we learn whether our model is good at classifying new data? The answer involves a trick we've used previously: cross-validation. No analysis of prediction models is complete without evaluating the performance of the model using this technique.

Cross-Validation. The basic idea of cross-validation, which we have already encountered in curve estimation, is to leave out some of the data when fitting a model. The simplest version of cross-validation involves randomly splitting the data into two pieces: the training set T and the validation set V. Often, about 10 per cent of the data might be set aside as the validation set. The classifier ĥ is constructed from the training set. We then estimate the error by

L̂(ĥ) = (1/m) ∑_{Xi ∈ V} I(ĥ(Xi) ≠ Yi),    (222)

where m is the size of the validation set. See Figure 75.

In an ideal world we would have so much data that we could split the data into two portions, use the first portion to select a model (i.e., training) and then the second portion to test our model. In this way we could obtain an unbiased estimate of how well we predict future data. In reality we seldom have enough data to spare any of it. As a compromise we use K-fold cross-validation (KCV) to evaluate our approach. K-fold cross-validation is obtained from the following algorithm.



Figure 74: The solid line is the observed error rate and the dashed line is the cross-validation estimate of the true error rate.



Figure 75: Cross-validation. The data are divided into two groups: the training data T and the validation data V. The training data are used to produce an estimated classifier ĥ. Then, ĥ is applied to the validation data to obtain an estimate L̂ of the error rate of ĥ.


K-fold cross-validation.

1. Randomly divide the data into K chunks of approximately equal size. A common choice is K = 10.

2. For k = 1 to K, do the following:

(a) Delete chunk k from the data.

(b) Compute the classifier ĥ(k) from the rest of the data.

(c) Use ĥ(k) to predict the data in chunk k. Let L̂(k) denote the observed error rate.

3. Let

   L̂(h) = (1/K) ∑_{k=1}^K L̂(k).    (223)

If tuning parameters are chosen using cross-validation, then KCV still underestimates the error. Nevertheless, KCV helps us to evaluate model performance much better than the training error rate.
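The algorithm above is easy to code directly. Here is a minimal sketch of K-fold CV for a logistic regression classifier, written for a generic data frame dat with a 0/1 response in a column named y (both names are placeholders); the cv.glm function used in the next example automates the same computation.

# Sketch: K-fold cross-validated misclassification rate for a logistic model.
kfold.cv = function(dat, K = 10) {
  n = nrow(dat)
  folds = sample(rep(1:K, length.out = n))    # random chunk assignment
  L = rep(0, K)
  for (k in 1:K) {
    train = dat[folds != k, ]
    test  = dat[folds == k, ]
    fit = glm(y ~ ., data = train, family = binomial)
    p = predict(fit, newdata = test, type = "response")
    yhat = as.numeric(p > 0.5)
    L[k] = mean(yhat != test$y)               # observed error rate in chunk k
  }
  mean(L)                                     # equation (223)
}
# kfold.cv(dat)   # returns the estimated prediction error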

15.8 Example. Simple CV and KCV in GLM

This simple example shows how to use the cv.glm function to do leave-one-out cross-validation and K-fold cross-validation when fitting a generalized linear model.

First we'll try the simplest scenario: leave-one-out and 6-fold cross-validation prediction error for data that are appropriate for a linear model.

> library(boot)
> data(mammals, package="MASS")
> mammals.glm = glm(log(brain) ~ log(body), data=mammals)
> cv.err = cv.glm(mammals, mammals.glm)
> cv.err$delta
        1         1
0.4918650 0.4916571

This is the leave-one-out cross-validation. delta reports the computed error for whatever "cost" function was chosen. By default the cost is average squared error. (The second delta entry adjusts for the bias of KCV relative to leave-one-out.)


# Try 6-fold cross-validation
> cv.err.6 = cv.glm(mammals, mammals.glm, K=6)
> cv.err.6$delta
        1         1
0.5000575 0.4969159

Notice that 6-fold cross-validation yields a similar, but not identical, estimate of the error. Neither is clearly preferable in terms of performance; 6-fold is computationally faster.

As this is a linear model we could have calculated the leave-one-out cross-validation estimate without any extra model-fitting using the diagonals of the hat matrix. The function glm.diag gives the diagonals of H.

muhat = mammals.glm$fitted
mammals.diag = glm.diag(mammals.glm)   # to get diagonals of H
cv.err = mean((mammals.glm$y - muhat)^2/(1 - mammals.diag$h)^2)
> cv.err
[1] 0.491865

Notice it matches the leave-one-out CV entry above.

Next we try a logistic model to obtain leave-one-out and 11-fold cross-validation prediction error for the nodal data set. Since the response is a binary variable we don't want to use the default cost function, which is squared error. First we need to define a function, which we call cost. An appropriate cost function is our usual fraction of misclassified subjects in the 2x2 confusion matrix. The function below computes this quantity, where y is the binary outcome and pi is the fitted probability.

> cost = function(y, pi) {
+   err = mean(abs(y - pi) > 0.5)
+   return(err)
+ }
>
> nodal.glm = glm(r ~ stage + xray + acid, binomial, data=nodal)
> # for leave-one-out CV
> cv.err = cv.glm(nodal, nodal.glm, cost=cost, K=nrow(nodal))$delta
> cv.err
        1         1
0.1886792 0.1886792
> # for 11-fold CV
> cv.11.err <- cv.glm(nodal, nodal.glm, cost=cost, K=11)$delta
> cv.11.err
        1         1
0.2264151 0.2192951


There are CV functions for each of the methods of prediction, though they vary in convenience. For regression trees, the library tree provides a function called cv.tree. For the method we cover next, k-th nearest neighbors, the library class provides knn.cv. For LDA analysis the relevant library needs to be downloaded manually; the function is called lda.cv.

15.9 Example. Diabetes in Pima Indians

A population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix, Arizona, was tested for diabetes.

These data frames contain the following columns:

• npreg = number of pregnancies.

• glu = plasma glucose concentration in an oral glucose tolerance test.

• bp = diastolic blood pressure (mm Hg).

• skin = triceps skin fold thickness (mm).

• bmi = body mass index (weight in kg/(height in m)²).

• ped = diabetes pedigree function.

• age = age in years.

• type = Yes or No, for diabetic according to WHO criteria.

> library(MASS)
> data(Pima.tr)
> data(Pima.te)
> Pima <- rbind(Pima.tr, Pima.te)
> Pima$type <- ifelse(Pima$type == "Yes", 1, 0)
> library(boot)

We try 4 models of various complexity and compare their error rates using KCV. M2 is the model chosen by AIC using stepwise regression.

> M1 <- glm(type ~ npreg + glu + bp + skin + bmi +
+           age, data = Pima, family = binomial)
> M2 <- glm(type ~ npreg + glu + bmi + age, data = Pima,
+           family = binomial)
> M3 <- glm(type ~ 1, data = Pima, family = binomial)
> M4 <- glm(type ~ (npreg + glu + bp + skin + bmi + age)^2,
+           data = Pima, family = binomial)
> F1 <- cv.glm(data = Pima, cost=cost, M1)$delta[2]
> F2 <- cv.glm(data = Pima, cost=cost, M2)$delta[2]
> F3 <- cv.glm(data = Pima, cost=cost, M3)$delta[2]
> F4 <- cv.glm(data = Pima, cost=cost, M4)$delta[2]
> F <- c(F1 = F1, F2 = F2, F3 = F3, F4 = F4)
> F
     F1.1      F2.1      F3.1      F4.1
0.2247442 0.2245428 0.3327068 0.2401068

Based on this analysis, we conclude that Model M2 has slightly better predictive power.

15.10 Nearest Neighbors

The k-nearest neighbor rule is

h(x) = 1 if ∑_{i=1}^n wi(x) I(Yi = 1) > ∑_{i=1}^n wi(x) I(Yi = 0), and 0 otherwise,    (224)

where wi(x) = 1 if Xi is one of the k nearest neighbors of x, and wi(x) = 0 otherwise. "Nearest" depends on how you define the distance. Often we use Euclidean distance ||x − Xi||. In that case you should standardize the variables first, as sketched below.
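For example, one might standardize each covariate to have mean 0 and variance 1 before calling knn (a small sketch, assuming xtrain and xtest are numeric matrices with no constant columns, and ytrain is as in the digits example below; the choice k = 5 is arbitrary):

# Sketch: standardize the covariates, then apply k-nearest neighbors.
library(class)
xtrain.s = scale(xtrain)                                    # center and scale
xtest.s  = scale(xtest, center = attr(xtrain.s, "scaled:center"),
                 scale  = attr(xtrain.s, "scaled:scale"))   # reuse training scales
yhat = knn(train = xtrain.s, test = xtest.s, cl = ytrain, k = 5)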

15.10 Example. Digits again.

> ### knn
> library(class)
> yhat = knn(train = xtrain, cl = ytrain, test = xtest, k = 1)
> b = table(ytest, yhat)
> print(b)
     yhat
ytest   0   1
    0 594   0
    1   0 505
> print((b[1,2]+b[2,1])/sum(b))
[1] 0
>
> yhat = knn.cv(train = xtrain, cl = ytrain, k = 1)
> b = table(ytrain, yhat)
> print(b)
      yhat
ytrain   0   1
     0 599   1
     1   0 500
> print((b[1,2]+b[2,1])/sum(b))
[1] 0.0009090909


An important part of this method is to choose a good value of k. For this we can use cross-validation.

15.11 Example. South African heart disease data again.

library(class)
m = 50
error = rep(0,m)
for(i in 1:m) {
  out = knn.cv(train=x, cl=y, k=i)
  error[i] = sum(y != out)/n
}
postscript("knn.sa.ps")
plot(1:m, error, type="l", lwd=3, xlab="k", ylab="error")

See Figure 76.

15.12 Example. Figure 77 compares the decision boundaries in a two-dimensional example. The boundaries are from (i) linear regression, (ii) quadratic regression, (iii) k-nearest neighbors (k = 1), (iv) k-nearest neighbors (k = 50), and (v) k-nearest neighbors (k = 200). The logistic regression boundary (not shown) is also linear.

Some Theoretical Properties. Let h1 be the nearest neighbor classifier with k = 1. Cover and Hart (1967) showed that, under very weak assumptions,

R∗ ≤ lim_{n→∞} R(h1) ≤ 2R∗    (225)

where R∗ is the Bayes risk. For k > 1 we have

R∗ ≤ lim_{n→∞} R(hk) ≤ R∗ + 1/√(ke).    (226)

15.11 Density Estimation and Naive Bayes

The Bayes rule can be written as

h∗(x) = 1 if f1(x)/f0(x) > (1 − π)/π, and 0 otherwise.    (227)

We can estimate π by

π̂ = (1/n) ∑_{i=1}^n Yi.



Figure 76: Cross-validated error of knn, as a function of k, for the South African heart disease data.



Figure 77: Comparison of decision boundaries: the data, linear regression, quadratic regression, and k-nearest neighbors with k = 1, k = 50 and k = 200.


We can estimate f0 and f1 using density estimation. For example, we could apply kernel density estimation to D0 = {Xi : Yi = 0} to get f̂0 and to D1 = {Xi : Yi = 1} to get f̂1. Then we estimate h∗ with

ĥ(x) = 1 if f̂1(x)/f̂0(x) > (1 − π̂)/π̂, and 0 otherwise.    (228)

But if x = (x1, . . . , xd) is high-dimensional, nonparametric density estimation is not very reliable. This problem is ameliorated if we assume that X1, . . . , Xd are independent, for then,

f0(x1, . . . , xd) = ∏_{j=1}^d f0j(xj)    (229)

f1(x1, . . . , xd) = ∏_{j=1}^d f1j(xj).    (230)

We can then use one-dimensional density estimators and multiply them:

f̂0(x1, . . . , xd) = ∏_{j=1}^d f̂0j(xj)    (231)

f̂1(x1, . . . , xd) = ∏_{j=1}^d f̂1j(xj).    (232)

The resulting classifier is called the naive Bayes classifier. The assumption that the components of X are independent is usually wrong yet the resulting classifier might still be accurate. Here is a summary of the steps in the naive Bayes classifier:

The Naive Bayes Classifier

1. For each group k = 0, 1, compute an estimate f̂kj of the density fkj for Xj, using the data for which Yi = k.

2. Let

   f̂k(x) = f̂k(x1, . . . , xd) = ∏_{j=1}^d f̂kj(xj).

3. Let

   π̂ = (1/n) ∑_{i=1}^n Yi.

4. Define ĥ as in (228).
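Here is a minimal sketch of these steps for continuous covariates, using one-dimensional kernel density estimates (density plus linear interpolation); x is a numeric matrix and y a 0/1 vector, both placeholders not taken from the notes.

# Sketch: naive Bayes with univariate kernel density estimates.
naive.bayes = function(x, y, xnew) {
  x = as.matrix(x); xnew = as.matrix(xnew)
  pihat = mean(y)                              # estimate of P(Y = 1)
  loglik = function(xcol, newcol) {            # log fhat_kj evaluated at newcol
    d = density(xcol)
    log(pmax(approx(d$x, d$y, xout = newcol, rule = 2)$y, 1e-12))
  }
  score = log(pihat) - log(1 - pihat)          # log prior odds
  for (j in 1:ncol(x)) {
    score = score + loglik(x[y == 1, j], xnew[, j]) -
                    loglik(x[y == 0, j], xnew[, j])
  }
  as.numeric(score > 0)   # 1 exactly when fhat1(x)/fhat0(x) > (1 - pihat)/pihat
}
# yhat = naive.bayes(x, y, x); mean(y != yhat)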


Naive Bayes is closely related to generalized additive models. Under the naive Bayes model,

log( P(Y = 1|X) / P(Y = 0|X) ) = log( πf1(X) / ((1 − π)f0(X)) )    (233)
 = log( π ∏_{j=1}^d f1j(Xj) / ((1 − π) ∏_{j=1}^d f0j(Xj)) )    (234)
 = log( π/(1 − π) ) + ∑_{j=1}^d log( f1j(Xj)/f0j(Xj) )    (235)
 = β0 + ∑_{j=1}^d gj(Xj)    (236)

which has the form of a generalized additive model. Thus we expect similar performance using naive Bayes or generalized additive models.

15.13 Example. For the SA data. Note the use of the gam package.

n = nrow(sa.data)
y = chd
x = sa.data[,1:9]

library(gam)
s = .25
out = gam(y ~ lo(sbp,span=.25,degree=1) + lo(tobacco,span=.25,degree=1) +
              lo(ldl,span=.25,degree=1) + lo(adiposity,span=.25,degree=1) +
              famhist + lo(typea,span=.25,degree=1) +
              lo(obesity,span=.25,degree=1) + lo(alcohol,span=.25,degree=1) +
              lo(age,span=.25,degree=1))
tmp = fitted(out)
yhat = rep(0,n)
yhat[tmp > .5] = 1
print(table(y,yhat))
   yhat
y     0   1
  0 256  46
  1  77  83
print(mean(y != yhat))
[1] 0.2662338

15.14 Example. Figure 78 (top) shows an artificial data set with two covariates x1 and x2. Figure 78 (middle) shows kernel density estimators f̂1(x1), f̂1(x2), f̂0(x1), f̂0(x2). The top left plot shows the resulting naive Bayes decision boundary. The bottom left plot shows the predictions from a gam model. Clearly, this is similar to the naive Bayes model. The gam model has an error rate of 0.03. In contrast, a linear model yields a classifier with error rate of 0.78.



Figure 78: Top: artificial data. Middle: kernel density estimates for each covariate and class. Bottom: naive Bayes and GAM classifiers.



Figure 79: A simple classification tree.

15.12 Trees

Trees are classification methods that partition the covariate space X into disjoint pieces and then classify the observations according to which partition element they fall in. As the name implies, the classifier can be represented as a tree.

For illustration, suppose there are two covariates, X1 = age and X2 = blood pressure. Figure 79 shows a classification tree using these variables.

The tree is used in the following way. If a subject has Age ≥ 50 then we classify him as Y = 1. If a subject has Age < 50 then we check his blood pressure. If systolic blood pressure is < 100 then we classify him as Y = 1, otherwise we classify him as Y = 0. Figure 80 shows the same classifier as a partition of the covariate space.

Here is how a tree is constructed. First, suppose that y ∈ Y = {0, 1} and that there is only a single covariate X. We choose a split point t that divides the real line into two sets A1 = (−∞, t] and A2 = (t, ∞). Let ps(j) be the proportion of observations in As such that Yi = j:

ps(j) = ∑_{i=1}^n I(Yi = j, Xi ∈ As) / ∑_{i=1}^n I(Xi ∈ As)    (237)

for s = 1, 2 and j = 0, 1. The impurity of the split t is defined to be

I(t) = ∑_{s=1}^2 γs    (238)



Figure 80: Partition representation of classification tree.

where

γs = 1 − ∑_{j=0}^1 ps(j)².    (239)

This particular measure of impurity is known as the Gini index. If a partition element As contains all 0's or all 1's, then γs = 0. Otherwise, γs > 0. We choose the split point t to minimize the impurity. (Other indices of impurity can be used besides the Gini index.)

When there are several covariates, we choose whichever covariate and split leads to the lowest impurity. This process is continued until some stopping criterion is met. For example, we might stop when every partition element has fewer than n0 data points, where n0 is some fixed number. The bottom nodes of the tree are called the leaves. Each leaf is assigned a 0 or 1 depending on whether there are more data points with Y = 0 or Y = 1 in that partition element.

This procedure is easily generalized to the case where Y ∈ {1, . . . , K}. We simply define the impurity by

γs = 1 − ∑_{j=1}^K ps(j)²    (240)

where ps(j) is the proportion of observations in partition element As for which Y = j.

15.15 Example. Heart disease data.


X = scan("sa.data",skip=1,sep=",")>Read 5082 itemsX = matrix(X,ncol=11,byrow=T)chd = X[,11]n = length(chd)X = X[,-c(1,11)]names = c("sbp","tobacco","ldl","adiposity","famhist","typea",+ "obesity","alcohol","age")for(i in 1:9)

assign(names[i],X[,i])

famhist = as.factor(famhist)formula = paste(names,sep="",collapse="+")formula = paste("chd ˜ ",formula)formula = as.formula(formula)print(formula)

> chd ˜ sbp + tobacco + ldl + adiposity + famhist + typea + obesity +> alcohol + age

chd = as.factor(chd)d = data.frame(chd,sbp,tobacco,ldl,adiposity,famhist,typea,+ obesity,alcohol,age)

library(tree)postscript("south.africa.tree.plot1.ps")out = tree(formula,data=d)print(summary(out))

>Classification tree:>tree(formula = formula, data = d)>Variables actually used in tree construction:>[1] "age" "tobacco" "alcohol" "typea" "famhist" "adiposity">[7] "ldl">Number of terminal nodes: 15>Residual mean deviance: 0.8733 = 390.3 / 447>Misclassification error rate: 0.2078 = 96 / 462

plot(out,type="u",lwd=3)text(out)cv = cv.tree(out,method="misclass")plot(cv,lwd=3)newtree = prune.tree(out,best=6,method="misclass")print(summary(newtree))



Figure 81: The full classification tree (15 terminal nodes) for the South African heart disease data.

> Classification tree:
> snip.tree(tree = out, nodes = c(2, 28, 29, 15))
> Variables actually used in tree construction:
> [1] "age"     "typea"   "famhist" "tobacco"
> Number of terminal nodes:  6
> Residual mean deviance:  1.042 = 475.2 / 456
> Misclassification error rate: 0.2294 = 106 / 462

plot(newtree, lwd=3)
text(newtree, cex=2)

See Figures 81, 82, 83.



Figure 82: Cross-validation (cv.tree) misclassification error as a function of tree size.


Figure 83: The pruned tree with six terminal nodes.


15.16 Example. CV for Trees

Figure 84: The 40 observations and the tree regression splits.

To illustrate the concept of ten-fold cross-validation in the context of tree regression, we begin with a dataset of 40 observations (Fig. 84). Within this dataset there are two groups: 26 observations from Group 1 (blue circles), and 14 observations from Group 2 (orange squares).

The 40 observations are partitioned via tree regression into three sections. In Figure 85, we see the proportion of points that are blue circles within each region. For classification, points falling within a region with pblue > 0.5 will be classified as belonging to Group 1, and points falling in a region with pblue < 0.5 will be classified as belonging to Group 2. Consequently, points in the left and bottom-right regions would be classified as belonging to Group 1.

Figure 85: Proportion of blue circles within each of the three regions (0.79, 0.27, and 0.80).

For ten-fold cross-validation, 10% of the data is removed as testing data, and a regression model is created using the remaining 90%. Prediction is made for the testing data based on the model created from the training data, and prediction rates are computed. This process is repeated for each 10% chunk of the data, and the prediction rates across the ten 10% chunks are averaged to form the ten-fold cross-validation score.

In Figure 86, we illustrate this process for a single 10% chunk of this dataset. 4 of the 40 observations are removed as a testing dataset, and the remaining 36 observations are used to create a tree regression model.

Figure 86: The four observations removed as a testing set and the 36 remaining training observations.

The new model's split lines are shown in Figure 87. The four removed points will be classified using the newly created tree regression model. The two orange squares that were removed now fall into the new left bin, and will be classified incorrectly. The two blue points that were removed still lie in predominantly blue regions, and are classified correctly. Thus, the prediction rate for these four observations is 50%. To find the ten-fold cross-validation score, this process would be repeated 9 more times, each time on a different group of 4 observations.


Figure 87: The old split lines and the new split lines from the model refit to the 36 training observations.

15.13 Support Vector Machines

In this section we consider a class of linear classifiers called support vector machines. It will be convenient to label the outcomes as −1 and +1 instead of 0 and 1. A linear classifier can then be written as

h(x) = sign( H(x) )

where x = (x1, . . . , xd) and

H(x) = a0 + ∑_{j=1}^d aj xj


and

sign(z) = −1 if z < 0, 0 if z = 0, and 1 if z > 0.

Note that:

classifier correct ⟹ Yi H(Xi) ≥ 0
classifier incorrect ⟹ Yi H(Xi) ≤ 0.

The classification risk is R = P(Y ≠ h(X)) = P(Y H(X) ≤ 0) = E( L(Y H(X)) ) where the loss function L is L(a) = 1 if a < 0 and L(a) = 0 if a ≥ 0.

Suppose that the data are linearly separable, that is, there exists a hyperplane that perfectly separates the two classes. How can we find a separating hyperplane? LDA is not guaranteed to find it. A separating hyperplane will minimize

− ∑_{misclassified} Yi H(Xi).

Rosenblatt's perceptron algorithm takes starting values and updates them:

(β, β0) ← (β, β0) + ρ (Yi Xi, Yi).

However, there are many separating hyperplanes. The particular separating hyperplane that this algorithm converges to depends on the starting values.
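A minimal sketch of the perceptron updates, assuming X is a numeric matrix and Y a vector of ±1 labels (placeholders), and that the data really are linearly separable (otherwise the loop simply stops at maxit):

# Sketch: Rosenblatt's perceptron algorithm.
perceptron = function(X, Y, rho = 1, maxit = 1000) {
  X = as.matrix(X)
  beta = rep(0, ncol(X)); beta0 = 0
  for (it in 1:maxit) {
    H = drop(X %*% beta) + beta0
    wrong = which(Y * H <= 0)                  # misclassified points
    if (length(wrong) == 0) break              # a separating hyperplane was found
    i = wrong[1]
    beta  = beta  + rho * Y[i] * X[i, ]        # the update displayed above
    beta0 = beta0 + rho * Y[i]
  }
  list(beta0 = beta0, beta = beta)
}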

Intuitively, it seems reasonable to choose the hyperplane "furthest" from the data in the sense that it separates the +1s and −1s and maximizes the distance to the closest point. This hyperplane is called the maximum margin hyperplane. The margin is the distance from the hyperplane to the nearest point. Points on the boundary of the margin are called support vectors. See Figure 88.

15.17 Lemma. The data can be separated by some hyperplane if and only if there exists a hyperplane H(x) = a0 + ∑_{j=1}^d aj xj such that

Yi H(Xi) ≥ 1,   i = 1, . . . , n.    (241)

PROOF. Suppose the data can be separated by a hyperplane W(x) = b0 + ∑_{j=1}^d bj xj. It follows that there exists some constant c such that Yi = 1 implies W(Xi) ≥ c and Yi = −1 implies W(Xi) ≤ −c. Therefore, Yi W(Xi) ≥ c for all i. Let H(x) = a0 + ∑_{j=1}^d aj xj where aj = bj/c. Then Yi H(Xi) ≥ 1 for all i. The reverse direction is straightforward.

The goal, then, is to maximize the margin, subject to (241). Given two vectors a and b let ⟨a, b⟩ = a^T b = ∑_j aj bj denote the inner product of a and b.

15.18 Theorem. Let H(x) = a0 + ∑_{j=1}^d aj xj denote the optimal (largest margin) hyperplane. Then, for j = 1, . . . , d,

aj = ∑_{i=1}^n αi Yi Xj(i)



Figure 88: The hyperplane H(x) = a0 + a^T x = 0 has the largest margin of all hyperplanes that separate the two classes.


where Xj(i) is the value of the covariate Xj for the ith data point, and α = (α1, . . . , αn) is the vector that maximizes

∑_{i=1}^n αi − (1/2) ∑_{i=1}^n ∑_{k=1}^n αi αk Yi Yk ⟨Xi, Xk⟩    (242)

subject to

αi ≥ 0   and   0 = ∑_i αi Yi.

The points Xi for which αi ≠ 0 are called support vectors. a0 can be found by solving

αi ( Yi(Xi^T a + a0) − 1 ) = 0

for any support point Xi. H may be written as

H(x) = a0 + ∑_{i=1}^n αi Yi ⟨x, Xi⟩.

There are many software packages that will solve this problem quickly. If there is no perfect linear classifier, then one allows overlap between the groups by replacing the condition (241) with

Yi H(Xi) ≥ 1 − ξi,   ξi ≥ 0,   i = 1, . . . , n.    (243)

The variables ξ1, . . . , ξn are called slack variables. We now maximize (242) subject to

0 ≤ αi ≤ c,   i = 1, . . . , n,

and

∑_{i=1}^n αi Yi = 0.

The constant c is a tuning parameter that controls the amount of overlap. In R we can use the package e1071.

15.19 Example. The iris data.

library(e1071)
data(iris)
x = iris[51:150,]
a = x[,5]
x = x[,-5]


attributes(a)
$levels
[1] "setosa"     "versicolor" "virginica"

$class
[1] "factor"

n = length(a)
y = rep(0,n)
y[a == "versicolor"] = 1
y = as.factor(y)
out = svm(x, y)
print(out)

Call:
svm.default(x = x, y = y)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.25

Number of Support Vectors:  33

summary(out)

Call:
svm.default(x = x, y = y)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.25

Number of Support Vectors:  33

 ( 17 16 )

Number of Classes:  2

Levels:
 0 1

## test with train data
pred = predict(out, x)
table(pred, y)
    y
pred  0  1
   0 49  2
   1  1 48

Let's have a look at what is happening with these data and svm. In order to make a 2-dimensional plot of the 4-dimensional data, we will plot the first 2 principal components of the distance matrix. The support vectors are circled. On the whole the two species of iris are separated and the support vectors mostly fall near the dividing line. Some don't, because these are 4-dimensional data shown in 2 dimensions.

M = cmdscale(dist(x))
plot(M, col = as.integer(y)+1, pch = as.integer(y)+1)
## support vectors
I = 1:n %in% out$index
points(M[I,], lwd=2)

See Figure 89.

Here is another (easier) way to think about the SVM. The SVM hyperplane H(x) = β0 + x^T β can be obtained by minimizing

∑_{i=1}^n (1 − Yi H(Xi))_+ + λ ||β||².

Figure 90 compares the svm loss (the hinge loss), the squared loss, the classification error and the logistic loss log(1 + e^{−yH(x)}).


Figure 89: The first two principal coordinates (cmdscale of the distance matrix) of the iris data; support vectors are circled.


Figure 90: The hinge loss as a function of yH(x), compared with the squared, classification-error and logistic losses.


15.13.1 Kernelization

There is a trick called kernelization for improving a computationally simple classifier h. The idea is to map the covariate X, which takes values in X, into a higher-dimensional space Z and apply the classifier in the bigger space Z. This can yield a more flexible classifier while retaining computational simplicity.

The standard example of this idea is illustrated in Figure 91. The covariate is x = (x1, x2). The Yi's can be separated into two groups using an ellipse. Define a mapping φ by

z = (z1, z2, z3) = φ(x) = (x1², √2 x1x2, x2²).

Thus, φ maps X = R² into Z = R³. In the higher-dimensional space Z, the Yi's are separable by a linear decision boundary. In other words,

a linear classifier in a higher-dimensional space corresponds to a non-linear classifier in the original space.

The point is that to get a richer set of classifiers we do not need to give up the convenience of linear classifiers. We simply map the covariates to a higher-dimensional space. This is akin to making linear regression more flexible by using polynomials.
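A minimal sketch of the map φ for the elliptical example, using a simulated data set (all names and values are hypothetical): a linear classifier on (z1, z2, z3) separates points that no linear rule on (x1, x2) can.

# Sketch: kernelization via the explicit feature map phi(x) = (x1^2, sqrt(2) x1 x2, x2^2).
set.seed(1)
n = 200
x1 = runif(n, -1, 1); x2 = runif(n, -1, 1)
y = as.numeric(x1^2 + 2*x2^2 < 0.5)            # inside an ellipse -> Y = 1
z = cbind(z1 = x1^2, z2 = sqrt(2)*x1*x2, z3 = x2^2)

lin = glm(y ~ x1 + x2, family = binomial)      # linear in the original space
ker = glm(y ~ z, family = binomial)            # linear in the mapped space
mean((fitted(lin) > .5) != y)                  # substantial error: no linear separation
mean((fitted(ker) > .5) != y)                  # near zero: the boundary is linear in z
# (perfect separation in z may trigger a glm warning; the classifications are still fine)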

15.14 Other Classifiers

There are many other classifiers and space precludes a full discussion of all of them. Let us briefly mention a few.

Bagging is a method for reducing the variability of a classifier. It is most helpful for highly nonlinear classifiers such as trees. We draw B bootstrap samples from the data. The bth bootstrap sample yields a classifier ĥb. The final classifier is

ĥ(x) = 1 if (1/B) ∑_{b=1}^B ĥb(x) ≥ 1/2, and 0 otherwise.

Boosting is a method for starting with a simple classifier and gradually improving it by refitting the data giving higher weight to misclassified samples. Suppose that H is a collection of classifiers, for example, trees with only one split. Assume that Yi ∈ {−1, 1} and that each h is such that h(x) ∈ {−1, 1}. We usually give equal weight to all data points in the methods we have discussed. But one can incorporate unequal weights quite easily in most algorithms. For example, in constructing a tree, we could replace the impurity measure with a weighted impurity measure. The original version of boosting, called AdaBoost, is as follows.



Figure 91: Kernelization. Mapping the covariates into a higher-dimensional space can turn a complicated decision boundary into a simpler decision boundary.


1. Set the weights wi = 1/n, i = 1, . . . , n.

2. For j = 1, . . . , J, do the following steps:

   (a) Construct a classifier hj from the data using the weights w1, . . . , wn.

   (b) Compute the weighted error estimate:

       L̂j = ∑_{i=1}^n wi I(Yi ≠ hj(Xi)) / ∑_{i=1}^n wi.

   (c) Let αj = log((1 − L̂j)/L̂j).

   (d) Update the weights:

       wi ← wi e^{αj I(Yi ≠ hj(Xi))}.

3. The final classifier is

   h(x) = sign( ∑_{j=1}^J αj hj(x) ).

There is now an enormous literature trying to explain and improve on boosting. Whereas bagging is a variance reduction technique, boosting can be thought of as a bias reduction technique. We start with a simple, and hence highly biased, classifier, and we gradually reduce the bias. The disadvantage of boosting is that the final classifier is quite complicated.

To understand what boosting is doing, consider the following modified algorithm (Friedman, Hastie and Tibshirani (2000), Annals of Statistics, p. 337-407):


1. Set the weights wi = 1/n, i = 1, . . . , n.

2. For j = 1, . . . , J, do the following steps:

   (a) Construct a weighted binary regression estimate

       p̂j(x) = P̂(Y = 1|X = x).

   (b) Let

       fj(x) = (1/2) log( p̂j(x) / (1 − p̂j(x)) ).

   (c) Set wi ← wi e^{−Yi fj(Xi)}, then normalize the weights to sum to one.

3. The final classifier is

   h(x) = sign( ∑_{j=1}^J fj(x) ).

Consider the risk function

J(F) = E( e^{−Y F(X)} ).

This is minimized by

F(x) = (1/2) log( P(Y = 1|X = x) / P(Y = −1|X = x) ).

Thus,

P(Y = 1|X = x) = e^{2F(x)} / (1 + e^{2F(x)}).

Friedman, Hastie and Tibshirani show that stagewise regression, applied to the loss J(F) = E(e^{−Y F(X)}), yields the boosting algorithm. Moreover, this is essentially logistic regression. To see this, let Y ∗ = (Y + 1)/2 so that Y ∗ ∈ {0, 1}. The logistic log-likelihood is

ℓ = Y ∗ log p(x) + (1 − Y ∗) log(1 − p(x)).

Insert Y ∗ = (Y + 1)/2 and p = e^{2F}/(1 + e^{2F}) and then

ℓ(F) = − log(1 + e^{−2Y F(X)}).

Now do a second order Taylor series expansion around F = 0 to conclude that

−ℓ(F) ≈ J(F) + constant.

Hence, boosting is essentially stagewise logistic regression.


Neural Networks are regression models of the form²

Y = β0 + ∑_{j=1}^p βj σ(α0j + αj^T X)

where σ is a smooth function, often taken to be σ(v) = e^v/(1 + e^v). This is really nothing more than a nonlinear regression model. Neural nets were fashionable for some time but they pose great computational difficulties. In particular, one often encounters multiple minima when trying to find the least squares estimates of the parameters. Also, the number of terms p is essentially a smoothing parameter and there is the usual problem of trying to choose p to find a good balance between bias and variance.

²This is the simplest version of a neural net. There are more complex versions of the model.
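In R, a single-hidden-layer network of this form can be fit with the nnet package (a sketch only; it reuses the Pima data frame from Example 15.9, and the size and decay values are arbitrary choices):

# Sketch: a single-hidden-layer neural network classifier via nnet.
library(nnet)
set.seed(1)
fit = nnet(factor(type) ~ npreg + glu + bp + skin + bmi + age,
           data = Pima, size = 3, decay = 0.1, maxit = 500, trace = FALSE)
yhat = as.numeric(predict(fit, Pima, type = "class"))
mean(yhat != Pima$type)   # training error; a fair comparison with the logistic
                          # models above would use cross-validation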
