
Homework 7

36-350, Data Mining, Fall 2009

SOLUTIONS

1. (a) What is the conditional expectation function (r(x) = E[Y | X = x])?
Answer: r(x) = tanh(5x):

E[Y | X = x] = E[tanh(5X) + ε | X = x] = tanh(5x) + E[ε | X = x] = tanh(5x)
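(A quick Monte Carlo sanity check of that calculation, not required by the problem: at any fixed x, averaging many draws of tanh(5x) + ε should recover tanh(5x).)

# Sketch: check E[Y|X=x] = tanh(5x) at one arbitrary x by simulation
x <- 0.3
mean(tanh(5*x) + rnorm(1e5,0,0.1)) # simulated conditional mean
tanh(5*x)                          # claimed value; the two should agree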

(b) Plot the conditional expectation function. Include your code. (Hint: try using the curve function.)
Answer: Take the hint (Figure 1).

(c) Write a function which randomly generates n (X, Y) pairs and returns them in an n × 2 matrix or data frame.
Answer: Code Example 1. This generates the X values first, then the Y values. Mathematically, there's no reason not to go the other way, X = (1/5) tanh⁻¹(Y − ε), but then we'd need the marginal distribution of Y, and who wants to work that out?
Now some basic tests:

> z = runitanh(10)
> dim(z)
[1] 10  2
> is.matrix(z)
[1] TRUE
> summary(z)

       Y                  X
 Min.   :-0.9780   Min.   :-0.9616
 1st Qu.:-0.9433   1st Qu.:-0.4801
 Median :-0.9225   Median :-0.3347
 Mean   :-0.1693   Mean   :-0.1341
 3rd Qu.: 0.9310   3rd Qu.: 0.3003
 Max.   : 1.1394   Max.   : 0.6877

This should produce a 10 × 2 matrix, and it does; and the numerical values don't look bad at first glance. (X should be entirely inside [−1, 1], and it is; Y can be outside it a bit, and it is.)

(d) Test your simulation function by checking the marginal distribution of X values it gives.


curve(tanh(5*x),from=-1,to=1)

Figure 1: The true regression curve for this problem set.

# Simulate from X ~ Unif(-1,1), Y|X ~ tanh(5X) + N(0,(0.1)^2)
# Inputs: number of points to simulate
# Outputs: two-column matrix, Y in first column, X in second
runitanh <- function(n) {
  X = runif(n,-1,1)
  Y = tanh(5*X) + rnorm(n,0,0.1)
  Z = matrix(c(Y,X),ncol=2,byrow=FALSE)
  # Filling by columns is the default, but let's remind
  # ourselves of that
  colnames(Z) = c("Y","X")
  return(Z)
}

Code Example 1: Data generator.


plot(ecdf(bigz[,"X"]))

Figure 2: Empirical cumulative distribution function of the X values from runitanh. What should the CDF be?

Answer: Formally, we can use the Kolmogorov-Smirnov test.[1]

> bigz = runitanh(1e4)
> ks.test(bigz[,"X"],punif,min=-1,max=1)

One-sample Kolmogorov-Smirnov test

data:  bigz[, "X"]
D = 0.0066, p-value = 0.7776
alternative hypothesis: two-sided

The difference between the empirical CDF of the random sample, and what it should be, is very small, and entirely compatible with sheer chance.
Less formally, we can plot the data's empirical cumulative distribution function, which should be very nearly a straight line from −1 to 1. (Why?)
Less formally, but still sufficiently for this problem, we can just do a histogram.

[1] As you learned in baby stats: Let F̂n(a) be the empirical cumulative distribution function of x1, x2, ..., xn, i.e., the function which says what fraction of the xi ≤ a. If the data are IID with common CDF F, then Dn = max_x |F̂n(x) − F(x)| → 0. Kolmogorov and Smirnov showed that √n·Dn then tends to a limiting distribution which does not depend on F. The p-value in their test is the probability of getting Dn as big or bigger than the observed departure from F simply through sampling fluctuations. Thus big p-values indicate conformity to F.
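(To connect the formula to the software, a sketch in base R: Dn can be computed from the definition, remembering that the empirical CDF attains its extreme deviations just above and just below each sorted data point; it should agree with the D that ks.test reported above, up to rounding.)

# Sketch: the KS statistic D_n computed directly from its definition
x.sorted <- sort(bigz[,"X"])
n <- length(x.sorted)
F.true <- punif(x.sorted,min=-1,max=1) # theoretical CDF at the data
max(abs((1:n)/n - F.true),             # ECDF just at/above each point
    abs((0:(n-1))/n - F.true))         # ECDF just below each point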


plot(hist(bigz[,"X"],n=100))
abline(h=100)
abline(h=100+sqrt(10000*0.01*0.99),lty=2)
abline(h=100-sqrt(10000*0.01*0.99),lty=2)

Figure 3: Histogram of X values from runitanh. How many bins does the histogram have? What are the horizontal lines, and why are they set where they are?
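(A hint for those questions, as a sketch: with 100 equal-width bins on [−1, 1], each of the 10^4 uniform draws lands in any given bin with probability 0.01, so each bin's count is Binomial(10^4, 0.01), whose mean and standard deviation are exactly what the abline calls above use.)

# Why the horizontal lines sit where they do
1e4 * 0.01              # expected count per bin: the solid line at 100
sqrt(1e4 * 0.01 * 0.99) # sd of a bin's count, ~31.5: the dashed offsets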


(e) Test your simulation function by making scatter-plots of the values it gives and checking that they fall near the conditional expectation curve r(x).
Answer: I've already got a big sample, so I'll just use it. Things look pretty good. It's not easy to tell here whether the variance is steady (does it maybe shrink in the middle?), or whether there are systematic deviations from the regression curve, but it doesn't "make your eyes bleed", as the saying goes.

(f) Test your simulation function by plotting the residuals Y − r(X) against X, and checking that the mean is near zero and the standard deviation is near 0.1 everywhere. (You should use a large n for this part.)
Answer: First, the plot. It's reasonable to look at Figure 5 and conclude that we don't have a problem, but we can be a little more quantitative. The next figure decorates the deviations with a local-linear smoothing line, which, as an estimate of the conditional mean, ought to be very close to zero everywhere. Then we have the trend line for the squared deviations, which ought to be very close to (0.1)² = 0.01 everywhere.

2. Cross-validation for polynomial regression

(a) Write a function which takes as arguments an integer n and a proportion p, and returns np distinct numbers between 1 and n inclusive. If np is not an integer, round it down to the nearest integer, unless that would give zero, in which case set it to 1. Test that all the returned numbers are distinct (e.g., by using the unique function). Hint: try sample.
Answer: Take the hint (Code Example 2).
Test for distinctness:

> all(replicate(1000,length(unique(select.rows(1000,0.5))) == 500))
[1] TRUE

Test for always giving at least one sample:

> all(replicate(1000,length(unique(select.rows(1000,1e-4))) == 1))
[1] TRUE

Test for rejecting crazy inputs (not, strictly, part of the problem statement):

> select.rows(100,1.1)
Error: p < 1 is not TRUE
> select.rows(100,-0.1)
Error: p > 0 is not TRUE
> select.rows(100,0)
Error: p > 0 is not TRUE


plot(bigz[,"X"],bigz[,"Y"],cex=0.2,xlab="x",ylab="y")curve(tanh(5*x),add=TRUE,col="grey",lwd=4)

Figure 4: Simulated data (points) plus the true regression curve (grey, exaggerated thickness for clarity).


bigz.deviations = bigz[,"Y"]-tanh(5*bigz[,"X"])
plot(bigz[,"X"],bigz.deviations,cex=0.2,xlab="x",
  ylab=expression(Y-tanh(5*X)))
abline(h=0,lty=2,lwd=4)
abline(h=0.1,lty=2,lwd=4)
abline(h=-0.1,lty=2,lwd=4)

Figure 5: Deviations of runitanh's Y values from the true regression function. Dashed horizontal lines at 0 (what the mean should be) and ±0.1 (what the standard deviation should be).


plot(bigz[,"X"],bigz.deviations,cex=0.2,xlab="x",ylab=expression(Y-tanh(5*X)))lines(lowess(bigz[,"X"],bigz.deviations,f=1/10,iter=0),lwd=4,col="grey")

Figure 6: As in Figure 5, but with the addition of a local linear smoother line (lowess). Because we have so much data, I set the bandwidth to be comparatively small (f = 1/10, i.e., 1/10 of the data contributing to the fit at each point). Even with this small bandwidth, the line is manifestly basically flat around zero; with the default f = 2/3, it is flat. By default, lowess runs iteratively, where in each iteration the weight of a data point declines based on how far it is from the previous trend curve; this "robustifies" the curve against outliers, but I don't want it here, so I set iter=0, rather than the default of 3.


plot(bigz[,"X"],bigz.deviations^2,cex=0.2,xlab="x",ylab=expression((Y-tanh(5*X))^2))

lines(lowess(bigz[,"X"],bigz.deviations^2,f=1/20,iter=0),lwd=4,col="grey")abline(h=(0.1)^2,col="red")

Figure 7: Squared deviations from the true regression curve, and their lowess smoothing curve. The latter should be constant around (0.1)² if the simulation is working properly (why?); and it is.


# Select a fraction p of integers 1:n, without repetition
# (named with an eye to using it for cross-validation)
# Inputs: positive integer n, positive fraction p
# Outputs: vector of numbers between 1 and n
select.rows <- function(n,p) {
  # Sanity-check inputs
  stopifnot(length(n)==1, length(p)==1)
  stopifnot(p > 0, p < 1)
  stopifnot(is.numeric(n), n > 0, n==round(n))
  # you'd think is.integer(n) would handle the last test, but
  # "integer" actually has a special meaning in R
  # see ?is.integer
  pn = floor(n*p) # round down
  pn = max(pn,1)  # make sure we give at least 1
  rows = sample(n,pn,replace=FALSE)
  return(rows)
}

Code Example 2: Selecting rows for cross-validation.

> select.rows(100.1,1)
Error: p < 1 is not TRUE
> select.rows(100.1,0.9)
Error: n == round(n) is not TRUE

Visual test for uniformity of the selection: Figure 8.

(b) Write a function to do one cross-validation split on a data frame, fit a polynomial to the training data, and return the mean squared error on the testing data. The function should take as its arguments the data frame, the degree of the polynomial, and the fraction of the data p to use for training. Include an option to also return the numbers of the rows used for the training data; test that your function works by comparing its output to manually fitting the same polynomial to the same training rows and predicting the same testing rows. Hint: look at the examples of polynomial fitting in the code accompanying Lecture 19.
Answer: The problem statement is deliberately vague about how to decide which column of the data frame contains the response variable, and whether to include all the other columns as predictors, or just selected ones. It's adequate, for this problem set, to simply assume that there are two columns, named Y and X, where the former is the response and the latter is the input.
Now a test:

> smallz = runitanh(100)
> cv.out <- cv_1fold_poly(smallz,10,0.2,rows=TRUE)


# Data-splitting MSE for univariate polynomial regression
# Inputs: data frame, degree of polynomial, training fraction,
#   flag for returning rows used as the training sample
# Presumes: data has columns named "Y" and "X", and we want Y~X
# Calls: select.rows
# Output: list, giving MSE and, if desired, the vector of
#   training row indices
cv_1fold_poly <- function(data, degree, p, rows=FALSE) {
  stopifnot(length(degree)==1, degree >= 0, degree==round(degree))
  if(!is.data.frame(data)) {
    data = as.data.frame(data)
  }
  n = nrow(data)
  train.rows = select.rows(n,p)
  # poly() won't take degree zero, handle that specially
  if (degree > 0) {
    fit <- lm(Y~poly(X,degree),data=data,subset=train.rows)
  } else {
    fit <- lm(Y~1,data=data,subset=train.rows) # fit a constant
  }
  predictions = predict(fit,newdata=data[-train.rows,])
  mse = mean((predictions - data[-train.rows,"Y"])^2)
  if (rows) {
    return(list(mse=mse,train.rows=train.rows))
  } else {
    return(list(mse=mse))
  }
}

Code Example 3: One-fold CV for univariate polynomial regression. Notice that the return value always has a component named mse, even if that name isn't necessary to distinguish it from the training rows; uniform interfaces like this make it easier to split your programming problem into many small parts, and to treat functions like interchangeable parts.


plot(hist(select.rows(2e4,0.5),n=100))

Figure 8: Distribution of row indices; approximately uniform, as they should be.

> cv.out
$mse
[1] 5.213292

$train.rows
 [1] 43  2 98 81 95 66  6 11 51 85 88 91 32 24 77 45 38 20 18 29

> manual.fit = lm(Y~poly(X,10),data=as.data.frame(smallz),
    subset=cv.out$train.rows)
> manual.predict = predict(manual.fit,
    newdata=as.data.frame(smallz[-cv.out$train.rows,]))
> mean((manual.predict - smallz[-cv.out$train.rows,"Y"])^2)
[1] 5.213292

So it is, indeed, doing what it should: fitting to the training rows, then predicting on the non-training rows.

(c) Write a function to estimate the generalization error of a polynomial by k-fold cross-validation. It should take as arguments a data frame, the degree of the polynomial, the training fraction p, and the number of folds k. It should return the average of the k testing MSEs.
Answer: We use the one-fold CV function as a building block, of course.

(d) Generate a data set of size n = 100 from the model in problem 1. For polynomials of orders d = 0 through 10, calculate the in-sample MSE, and the cross-validated MSE (with p = 0.9, k = 10: standard 10-fold cross-validation). Plot both sets of errors together as a function of d. What order polynomial should you use?


# k-fold cross-validated MSE for univariate polynomial regression
# Inputs: data, degree of polynomial, training fraction, num. folds
# Calls: cv_1fold_poly
# Outputs: averaged MSEs over the k folds
cv_kfold_poly <- function(data,degree,p,k) {
  stopifnot(k > 0, k==round(k)) # must be a natural number!
  if(!is.data.frame(data)) {
    data = as.data.frame(data)
  }
  # Leave other input checking to called functions
  k_fold_mses <- replicate(k,cv_1fold_poly(data,degree,p)$mse)
  return(mean(k_fold_mses))
}

Code Example 4: Averaging the cross-validated MSEs over multiple training/testing splits.

Answer: I've already made the data set (smallz). The in-sample errors:

inMSE.1 <- function(degree, data=as.data.frame(smallz)) {
  if (degree > 0) {
    mse <- mean((residuals(lm(Y~poly(X,degree),data=data)))^2)
  } else {
    mse <- var(data[,"Y"])
  }
  return(mse)
}

inMSEs = sapply(0:10,inMSE.1)

The cross-validated errors:

cvMSEs = sapply(0:10,cv_kfold_poly,data=smallz,p=0.9,k=10)

The plot (Figure 9) suggests that we're not seeing much difference after about fifth-degree polynomials, but that in-sample and cross-validated errors are still tracking each other reasonably well.

(e) Write a function to select which order of polynomial to use, up to a maximum which you give as an argument, by k-fold cross-validation. Which order does it select when run on the data from the previous part, with standard 10-fold cross-validation?
Answer: Use the handy which.min function, as in Code Example 5. (Why subtract 1 from its return value?) Here's what I get when I run it:

> select_poly_cv(smallz,10,0.9,10)
[1] 7


plot(0:10,cvMSEs,type="l",xlab="degree",ylab="MSE")
lines(0:10,inMSEs,lty=2)

Figure 9: MSE for polynomials of different orders on 100 data points. Solid line, ten-fold cross-validation; dashed line, in-sample.


# Find the best-order univariate polynomial by k-fold CV
# Inputs: data, max. degree to use, training fraction, num. folds
# Calls: cv_kfold_poly
# Output: the selected degree
select_poly_cv <- function(data,max.degree,p,k) {
  # should sanity-check max.degree
  cv_scores <- sapply(0:max.degree,cv_kfold_poly,data=data,p=p,k=k)
  best.order <- which.min(cv_scores) - 1
  return(best.order)
}

Code Example 5: Polynomial order selection by k-fold CV.

Now, because cross-validation is a stochastic method, and the differences between different high-order polynomials are evidently small here (see Figure 9), we should expect some variability in the selected order, which we get:

> summary(replicate(100,select_poly_cv(smallz,10,0.9,10)))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   7.00    9.00    9.00    9.39   10.00   10.00

I'd go with nine here.

(f) Fit a polynomial of the order selected in the previous part to the whole data. Plot the estimated regression function r̂(x) together with the data.
Answer: See Figure 10. The selected curve captures a lot of the large-scale features of the data: fairly small variation at the far left and far right but at different levels, and smooth bends into the nearly-linear transition region between them. But it's wigglier at either end than the real curve, and if I extrapolated beyond the fitting region it'd zoom off to ±∞.

(g) Generate a new sample of size n = 10^4. What order of polynomial is selected by cross-validation? Should the selected order approach some limit as n → ∞?
Answer: I already have the big sample. I'll up the maximum order I'm willing to consider to 25.

> select_poly_cv(bigz,25,0.9,10)
[1] 20

When I replicate this four times, I get 17, 19, 16, 17. So I'll say 17. The systematic errors are clearly much reduced (Figure 11), as (in particular) there's much less wiggling at extreme values of x, though still some oscillation.

As n → ∞, the selected order should increase without limit. tanh is a transcendental function, so it can be exactly represented only by an


polyfit <- lm(Y~poly(X,9),data=smallz)
x.ord = order(smallz[,"X"])
plot(smallz[,"X"],smallz[,"Y"],cex=0.2,xlab="x",ylab="y")
lines(smallz[x.ord,"X"],fitted(polyfit)[x.ord],lwd=4)
curve(tanh(5*x),lwd=2,col="grey",add=TRUE)

Figure 10: Dots, actual data from simulation; grey curve, actual regression function; black curve, CV-selected polynomial fit (order 9).


# Data-splitting MSE of fixed-bandwidth univariate kernel regression
# Uses npreg-specific tricks
# Inputs: data frame, bandwidth, training fraction
# Presumes: data has columns named "Y" and "X", and we want Y~X
# Output: list giving mean-squared error on test part of the data
cv_1fold_npreg <- function(data,h,p) {
  require(np)
  n = nrow(data)
  train.set = select.rows(n,p)
  train.data = data[train.set,]
  test.data = data[-train.set,]
  mse <- npreg(bws=h,txdat=train.data[,"X"],tydat=train.data[,"Y"],
               exdat=test.data[,"X"],eydat=test.data[,"Y"])$MSE
  return(list(mse=mse))
}

Code Example 6: Do one-fold evaluation of a nonparametric regression.

infinite-order polynomial. As we get more data, we can reliably estimate higher and higher order terms in this series, which will generalize better and better, so the selected polynomial order should grow with n, without limit (or with limit ∞, if you prefer).
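(To isolate the approximation half of this claim, a sketch in base R with no noise at all: the least-squares polynomial fit to tanh(5x) on a fine grid keeps improving as the degree grows, so there is always something left for a higher-order term to do.)

# Sketch: worst-case error of polynomial fits to tanh(5x) itself
x <- seq(-1,1,length.out=1000)
y <- tanh(5*x)
sapply(c(1,5,9,13,17), function(d) {
  max(abs(y - fitted(lm(y ~ poly(x,d))))) # shrinks as d grows
})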

3. Cross-validation for kernel regression

(a) Write a function to split a data frame into training and testing sets, fit a Gaussian kernel regression with a given bandwidth to the training set, and return the mean squared error on the testing set. The function should take as arguments the data frame, the bandwidth h, and the fraction of data p to go into the training set.
Answer: The code uses some special arguments to npreg, which are explained in its help file.
Notice that this function has the same format of output as cv_1fold_poly. This isn't strictly necessary (I could just have it return a simple numerical value) but they do such similar jobs it's a good idea to have them work as similarly as possible.

(b) Write a function to calculate the k-fold cross-validated MSE of a given bandwidth. The function should take as arguments the data frame, the bandwidth h, the training fraction p, and the number of folds k, and return the MSE.
Answer: See Code Example 7.
This is extremely similar to cv_kfold_poly, so much so that it makes sense to re-factor them into a single function, which takes one of the cv_1fold functions as an argument. The advantages are that, first, if we need to debug the k-fold part we only have to do that once,


polyfit2 <- lm(Y~poly(X,17),data=as.data.frame(bigz))
x.ord = order(bigz[,"X"])
plot(bigz[,"X"],bigz[,"Y"],cex=0.2,xlab="x",ylab="y")
lines(bigz[x.ord,"X"],fitted(polyfit2)[x.ord],lwd=4,col="white")
curve(tanh(5*x),lwd=2,col="grey",add=TRUE)

Figure 11: Black dots, samples from the model; thick white curve, the estimated polynomial (of order 17); grey curve, the true regression function. (The dots are thick enough that overlaying a white curve on them works!)


# k-fold cross-validated MSE for univariate kernel regression
# Inputs: data set, bandwidth, training fraction, num. folds
# Calls: cv_1fold_npreg
# Outputs: average over folds of test-set errors
cv_kfold_npreg <- function(data,h,p,k) {
  stopifnot(k > 0, k==round(k))
  if(!is.data.frame(data)) {
    data = as.data.frame(data)
  }
  # Leave other input checking to called functions
  k_fold_mses <- replicate(k,cv_1fold_npreg(data,h,p)$mse)
  return(mean(k_fold_mses))
}

Code Example 7: Do k-fold evaluation of a non-parametric regression.

and, second, if we want to add a different kind of regression, we just have to write its one-fold function, and make sure it has the same interfaces as the others. In fact, our old cv_kfold functions can become wrappers for the more generic one.
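(To illustrate, here is a hypothetical extra one-fold function, not part of the assignment: a data-splitting MSE for smoothing splines, with spar as the control setting. Because it matches the interface of cv_1fold_poly and cv_1fold_npreg, the generic cv_kfold_model below could drive it unchanged.)

# Hypothetical: one-fold MSE for a smoothing spline, same interface
# as cv_1fold_poly and cv_1fold_npreg; control setting is spar
cv_1fold_spline <- function(data,spar,p) {
  n = nrow(data)
  train.rows = select.rows(n,p)
  fit <- smooth.spline(x=data[train.rows,"X"],y=data[train.rows,"Y"],
                       spar=spar)
  predictions <- predict(fit,x=data[-train.rows,"X"])$y
  mse = mean((predictions - data[-train.rows,"Y"])^2)
  return(list(mse=mse))
}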

(c) For the n = 100 data set from the previous problem, plot the in-sample and ten-fold cross-validated MSEs for h = 0.05, 0.1, 0.15, ..., 1.0.
Answer: See Figure 12. The CV error does pick up at very small bandwidths (< 0.05).

(d) Write a function to select the best bandwidth by cross-validation. It should take the same arguments as before, except that it needs a vector of bandwidths and not a single bandwidth; return the best bandwidth (and not its error or its position in the vector of options). Test that when given the same bandwidths as in the previous part, it selects the best.
Answer: I'll just give the more generic version. This would work for the polynomials, too.

> select_cv(smallz,1:20/20,0.9,10,cv_1fold_npreg)
[1] 0.1
> (1:20/20)[which.min(cvMSEs)]
[1] 0.1

In other words, the function matches what we know is the best bandwidth.

(e) Plot the estimated regression function r̂(x) from the best bandwidth along with the data. How does this differ from the selected polynomial?
Answer: Figure 13 shows the estimated function. Compared to the polynomial fit, it seems to oscillate less at the edges, though also to


# More generic k-fold evaluation of a model
# Inputs: data, control setting, training fraction, num. folds,
#   function to do data-set splitting, returning an mse attribute
# Outputs: average of mses over the folds
cv_kfold_model <- function(data,control,p,k,folder) {
  stopifnot(k > 0, k==round(k))
  if(!is.data.frame(data)) {
    data = as.data.frame(data)
  }
  k_fold_mses = replicate(k,folder(data,control,p)$mse)
  return(mean(k_fold_mses))
}

cv_kfold_poly.2 = function(...) {
  cv_kfold_model(...,folder=cv_1fold_poly)
}

cv_kfold_npreg.2 = function(...) {
  cv_kfold_model(...,folder=cv_1fold_npreg)
}

Code Example 8: More flexible function for k-fold cross-validation, taking lower-level functions as arguments.

# Select the best value of a control setting by cross-validation
# Inputs: data frame, vector of control settings, training
#   fraction, number of folds, function to compute one fold
# Calls: cv_kfold_model
# Returns: best value of the control setting
select_cv <- function(data,controls,p,k,folder) {
  cv <- sapply(controls,cv_kfold_model,data=data,p=p,k=k,
               folder=folder)
  return(controls[which.min(cv)])
}

}

Code Example 9: Control-setting selection by k-fold cross-validation.


inMSEs = sapply(1:20/20,function(h){npreg(bws=h,Y~X,data=smallz)$MSE})
cvMSEs = sapply(1:20/20,
  function(h){cv_kfold_npreg(data=smallz,h=h,p=0.9,k=10)})
plot(1:20/20,cvMSEs,type="l",xlab="bandwidth",ylab="MSE")
lines(1:20/20,inMSEs,lty=2)
abline(h=0.01,col="grey")

Figure 12: Solid line, MSE under ten-fold cross-validation as a function of bandwidth. Dashed line, in-sample MSE. Grey line, actual noise variance.


be systematically low at the upper shoulder. Figure 14 compares the two estimated functions directly to the true regression curve (which of course we wouldn't know with real data...). They seem to make similar errors in a few places (around 0 and 0.7), but overall the kernel regression's errors appear smaller.

(f) Repeat the cross-validated bandwidth selection with the n = 10^4 sample. (Allow at least 20 seconds of computing time per fold per bandwidth.) Does the selected bandwidth change? Plot the regression function with the selected bandwidth. How does this differ from the selected polynomial? From the true r(x)?
Answer: As it says in the handouts, the best bandwidth ought to shrink as we get more data. Since each evaluation is so expensive, I won't waste time on really big bandwidths, which were hopeless at n = 100 and will be pitiful at n = 10^4.

select_cv(bigz,1:20/100,0.9,10,cv_1fold_npreg)

(Start new pot of coffee, reply to e-mail, ... aha!)

> select_cv(bigz,1:20/100,0.9,10,cv_1fold_npreg)
[1] 0.01

So, the bandwidth has definitely changed; it's a tenth of what it was. Plotting it (Figure 15), it looks great: slightly less steep in the transition region than the true curve, but not by all that much, and very good at either end. Figure 16 compares both fits to the true function, as in Figure 14. This confirms the impression from squinting at Figure 15: the kernel regression has less high-frequency wiggling than the polynomial fit, but a bigger systematic error in the transition region where it gets the slope wrong.
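(As a rough directional check, a sketch assuming the classic O(n^(-1/5)) bandwidth rate for kernel regression from the handouts: the asymptotic rate predicts a milder shrinkage than the factor of ten CV delivered here, which is not alarming, since the rate is only asymptotic.)

# Under h = O(n^(-1/5)), going from n = 100 to n = 10^4 should
# shrink the bandwidth by about (1e4/1e2)^(-1/5), a factor of ~0.4
0.1 * (1e4/1e2)^(-1/5) # ~0.04 predicted; CV actually picked 0.01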


npfit <- npreg(Y~X,bws=0.1,data=smallz)
x.ord = order(smallz[,"X"])
plot(smallz[,"X"],smallz[,"Y"],cex=0.2,xlab="x",ylab="y")
lines(smallz[x.ord,"X"],fitted(npfit)[x.ord],lwd=4)
curve(tanh(5*x),lwd=2,col="grey",add=TRUE)

Figure 13: As in Figure 10, but the black curve is the Gaussian kernel regression.


tx = seq(from=-1,to=1,by=0.01)
plot(tx,predict(npfit,exdat=tx)-tanh(5*tx),type="l",ylim=c(-0.1,0.1),
  xlab="x",ylab="Deviation of estimated functions from truth")
lines(tx,predict(polyfit,newdata=data.frame(X=tx)) - tanh(5*tx),lty=2)

Figure 14: Solid line, r̂_npreg(x) − r(x); dashed line, r̂_poly(x) − r(x).


npfit2 <- npreg(Y~X,bws=0.01,data=as.data.frame(bigz))
x.ord = order(bigz[,"X"])
plot(bigz[,"X"],bigz[,"Y"],cex=0.2,xlab="x",ylab="y")
lines(bigz[x.ord,"X"],fitted(npfit2)[x.ord],lwd=4,col="white")
curve(tanh(5*x),lwd=2,col="grey",add=TRUE)

Figure 15: As Figure 11, but with the kernel regression instead of the polynomial.


plot(tx,predict(npfit2,exdat=tx)-tanh(5*tx),type="l",ylim=c(-0.1,0.1),
  xlab="x",ylab="Deviation of estimated functions from truth")
lines(tx,predict(polyfit2,newdata=data.frame(X=tx)) - tanh(5*tx),lty=2)

Figure 16: As in Figure 14, but with the fits to the larger data set.
