
4.7 Extensions

The approach above can be readily extended to the case where we are interested in inverse prediction or regulation about one predictor given the other predictors are known (for inverse prediction) or set at fixed values (for regulation).

For example, suppose there are two predictors $x_1$ and $x_2$ and the model for the mean is $\beta_0 + \beta_1 x_1 + \beta_2 x_2$. For inverse prediction suppose there is a new unit with known $x_2$ value, say $x_{20}$, but an unknown value of $x_1$, say $x_{10}$. We would estimate the unknown $x_{10}$ via

$$\hat{x}_{10} = (Y_0 - \hat\beta_0 - \hat\beta_2 x_{20})/\hat\beta_1.$$

Similarly for regulation, we could ask: for what $x_1$ is the expected value of $Y$ equal to a specified constant $c$, when the second predictor is set at $x_{20}$? The result is $\rho = (c - \beta_0 - \beta_2 x_{20})/\beta_1$, which would be estimated by

$$\hat\rho = (c - \hat\beta_0 - \hat\beta_2 x_{20})/\hat\beta_1.$$

Both of these problems are in the form of a ratio and we can apply Fieller's result or the delta method with appropriate definitions of $\sigma_{11}$, $\sigma_{22}$, and $\sigma_{12}$.

The problem of doing inverse prediction or regulation for multiple x's is more complicated, but notice that you can always attack this by trying to invert prediction or confidence intervals.
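To make this concrete for the regulation problem, here is a sketch of one delta method layout (the identification of $\sigma_{11}$, $\sigma_{22}$ and $\sigma_{12}$ below is our reading of "appropriate definition," writing the ratio as $\hat\rho = \hat N/\hat D$ with $\hat N = c - \hat\beta_0 - \hat\beta_2 x_{20}$ and $\hat D = \hat\beta_1$):

$$\operatorname{Var}(\hat\rho) \approx \frac{1}{\beta_1^2}\left(\sigma_{11} - 2\rho\,\sigma_{12} + \rho^2\,\sigma_{22}\right),$$

where $\sigma_{11} = \operatorname{Var}(\hat N) = \operatorname{Var}(\hat\beta_0) + x_{20}^2\operatorname{Var}(\hat\beta_2) + 2x_{20}\operatorname{Cov}(\hat\beta_0, \hat\beta_2)$, $\sigma_{22} = \operatorname{Var}(\hat\beta_1)$, and $\sigma_{12} = \operatorname{Cov}(\hat N, \hat D) = -\operatorname{Cov}(\hat\beta_0, \hat\beta_1) - x_{20}\operatorname{Cov}(\hat\beta_2, \hat\beta_1)$. The same layout applies to $\hat x_{10}$ with $c$ replaced by $Y_0$ (adding $\operatorname{Var}(Y_0)$ to $\sigma_{11}$, since $Y_0$ is a new observation independent of the fit).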

4.8 References

• Kutner et al., Section 4.6. (inverse prediction/calibration)

• Greene (p. 61), Mood, Graybill and Boes (p. 181), Casella and Berger (p. 240) for approximations for nonlinear functions.

• Fieller's Theorem. Buonaccorsi (1998, 2001).

• Graybill and Iyer. Section 6.4 covers inverse prediction and regulation in simple linear regression (although they give a different way to compute them, the results are the same as ours), while Sections 6.6 and 6.7 give two other situations where a ratio of linear combinations of coefficients is of interest.

5 Bootstrapping

Bootstrapping has become a popular way to carry out statistical inferences. The basic idea is to mimic the original sampling method to try to generate the sampling distribution of some estimator or test statistic, and then get estimates of bias, standard error or confidence intervals and tests based on these results. The "simulating" which is done to mimic this sampling makes use of the data to create a population or model from which to sample. There is a huge literature and there has been a plethora of recent books on the subject; see the references. While the bootstrap is relatively simple to describe, there are many subtle and complex issues around the performance of bootstrap-based confidence intervals and tests that are beyond the scope of this text. Our objective here is to describe the basic ideas and illustrate their application in regression contexts. The fundamental concepts are most easily motivated in the context of a single random sample.

5.1 Bootstrapping for random samples (the i.i.d. case)

5.1.1 The univariate, single parameter case.

Let $W_1, \ldots, W_n$ be a random sample from some distribution $F$. That is, the $W_i$ are independent, where each $W_i$ has the same distribution defined by the CDF $F$. Let $\theta$ be any parameter of interest associated with the population distribution $F$. Examples include the mean, the standard deviation, the median, some percentile, etc. The parameter $\theta$ will be estimated by $\hat\theta = g(W_1, \ldots, W_n) = g(\mathbf{W})$, where $\mathbf{W}$ contains $W_1, \ldots, W_n$.

(Somewhat confusingly, but as is standard, we use $\hat\theta$ to denote either the estimator $g(\mathbf{W})$, which is a random variable, or the estimate $g(\mathbf{w})$, which is the actual number observed. It should be clear from the context how it is being used.)

The population CDF is $F(w) = P(W \le w)$, while the empirical CDF $\hat F$ is defined by

$$\hat F(w) = \frac{\#\ \text{of}\ w_i \le w}{n}.$$

Often, the parameter $\theta$ can be viewed as some function of $F$, say $t(F)$. The plug-in estimator of $\theta$ is $t(\hat F)$. That is, if $\hat\theta$ is the plug-in estimator of $\theta$, it is calculated in the same way $\theta$ would be determined from $F$, but using $\hat F$ as if it were $F$. For example, $\bar W = \sum_i W_i/n$ is the plug-in estimator of the population mean $\mu$, while the plug-in estimator of the population standard deviation $\sigma$ is

$$\hat\sigma = \Big[\sum_i (W_i - \bar W)^2/n\Big]^{1/2}. \qquad (6)$$

Note the slight difference from the usual sample standard deviation $s = [\sum_i (W_i - \bar W)^2/(n-1)]^{1/2}$ because of the division by $n$ rather than $n-1$.

Often, we evaluate how good an estimator is via its bias $= E(\hat\theta - \theta)$ and its standard deviation/standard error, denoted $\sigma_{\hat\theta}$. In many problems, the bias and standard error can be written explicitly (often as functions of parameters) and can be estimated directly from the data. In addition, exact or approximate confidence intervals and tests can often be obtained using some exact distributional results or via some large sample arguments. For example, with $\theta = \mu$ and $\hat\theta = \bar W$, we know $E(\bar W) = \mu$, so the bias is 0, and the standard error of $\bar W$ is exactly $\sigma_{\bar W} = \sigma/n^{1/2}$. This is typically estimated via $s/n^{1/2}$. Under normality, exact confidence intervals for $\mu$ are found based on the t distribution, while for large sample sizes approximate confidence intervals are based on the normal. In many cases where there are no exact expressions for the bias, standard error or the sampling distribution, some asymptotic or approximate result is used. For example, we previously used a Taylor series based method to approximate the bias and standard error of nonlinear functions; see Section 4.3. Approximate confidence intervals are often found by assuming that $(\hat\theta - \theta)/\hat\sigma_{\hat\theta}$ is approximately standard normal. The validity of approximations for the bias and standard error, and of the normality assumption for estimators, is always in question.

Can we find a way to estimate the bias and standard error of $\hat\theta$, and more generally the sampling distribution of $\hat\theta$, that does not depend on assuming $F$ is of a particular type (normal, exponential, etc.) and does not require an analytical expression for the bias or standard error?

The bootstrap estimate of standard error.

The standard error of $\hat\theta$, $\sigma_{\hat\theta}$, is itself some function of $F$, say $\sigma_{\hat\theta} = h(F)$. The definition of the bootstrap estimate of the standard error is $\hat\sigma_{\hat\theta} = h(\hat F)$. If we know explicitly what the function $h(F)$ is, then we can calculate the bootstrap estimate of standard error directly. For example, with $\theta = \mu$ and $\hat\theta = \bar W$, we noted above that $\sigma_{\bar W} = \sigma/n^{1/2}$. This can be viewed as $h(F)$, as it is a function of $F$ via $\sigma$, which is the population standard deviation associated with $F$. So the bootstrap estimate of $\sigma_{\bar W}$ is $\hat\sigma/n^{1/2}$, where $\hat\sigma$ is given in (6).

Usually we do not have an analytical expression for how $\sigma_{\hat\theta}$ is a function of $F$, in which case the definition of the bootstrap estimate of standard error above is not useful. We can, however, calculate the bootstrap estimate of standard error via simulation without knowing what $h$ is. The algorithm for a random sample with observed data $w_1, \ldots, w_n$ proceeds as follows:

For b = 1 to B, where B is a large value:

1. Take a random sample of size $n$ WITH REPLACEMENT from the values $w_1, \ldots, w_n$. Denote this bootstrap sample by $w_{b1}, \ldots, w_{bn}$. Notice that some of the original values in the sample will typically occur more than once in the bootstrap sample.

2. Compute your estimate in the same way as you did for the original data, now using the bootstrap sample:

$$\hat\theta^*_b = g(w_{b1}, \ldots, w_{bn}).$$

This leads to $B$ bootstrap estimates $\hat\theta^*_1, \ldots, \hat\theta^*_B$. The bootstrap estimate of standard error is calculated as

$$\hat\sigma_{B,\hat\theta} = \left[\frac{\sum_{b=1}^{B}(\hat\theta^*_b - \bar\theta^*)^2}{B-1}\right]^{1/2},$$

where $\bar\theta^* = \sum_{b=1}^{B}\hat\theta^*_b/B$. (Technically the bootstrap estimate of standard error is the limit of the above as $B \to \infty$, but it is common to use the term in this way.)

The bootstrap estimate of bias is

$$\bar\theta^* - t(\hat F).$$

REMARKS

• Note that even if the original estimator $\hat\theta$ is not the plug-in estimator, the bias is calculated using the plug-in estimator $t(\hat F)$. If we don't know $t(\cdot)$, we cannot calculate the bootstrap estimate of bias.

• The mean of the bootstrap values should NOT be taken as the estimate of θ.

The bootstrap estimate of the distribution of $\hat\theta$ is simply the distribution of the $B$ values $\hat\theta^*_1, \ldots, \hat\theta^*_B$. We'll call this the Empirical Bootstrap Distribution and denote its empirical CDF by $\hat G$; note that this is an estimate of the CDF of $\hat\theta$. Typically we will use a histogram, smoothed histogram or stem and leaf plot to represent this distribution rather than give it in CDF form.
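A minimal IML sketch of this loop, using the sample mean of a small made-up data vector as $\hat\theta$ (the full program applied to real data is in Section 7; the data and variable names here are ours):

proc iml;
/* minimal sketch: bootstrap standard error and bias of the sample mean
   on made-up data; see Section 7 for the full program */
w = {3, 1, 4, 1, 5, 9, 2, 6};      /* toy data (hypothetical) */
n = nrow(w);
B = 1000;                          /* number of bootstrap samples */
thetastar = j(B,1,0);              /* holds the B bootstrap estimates */
wb = w;                            /* initialize bootstrap sample */
do b = 1 to B;
  do i = 1 to n;
    k = int(uniform(0)*n + 1);     /* discrete uniform on 1..n */
    wb[i] = w[k];                  /* sample WITH replacement */
  end;
  thetastar[b] = sum(wb)/n;        /* recompute the estimate */
end;
thetabar = sum(thetastar)/B;
seboot = sqrt(ssq(thetastar - thetabar)/(B-1));  /* bootstrap SE */
biasboot = thetabar - sum(w)/n;    /* bias: plug-in t(Fhat) is the sample mean */
print seboot biasboot;
quit;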

5.1.2 General multivariate and/or multiparameter

The univariate single parameter case is easily generalized to cases of a random sample where there are multiple parameters of interest and/or each observation is multivariate. Let $\mathbf{W}_1, \ldots, \mathbf{W}_n$ be i.i.d. with some distribution $F$, where the $\mathbf{W}_i$ can now be vector valued. Let $\boldsymbol\theta$ be a collection of $q$ parameters of interest, with estimator $\hat{\boldsymbol\theta}$; $q$ could be 1, as when we are interested in a single parameter even if the data is multivariate. We now resample with replacement as before from the observed $\mathbf{w}_1, \ldots, \mathbf{w}_n$ and write $\hat{\boldsymbol\theta}^*_b$ for the estimate of $\boldsymbol\theta$ from the $b$th bootstrap sample. The bootstrap mean is

$$\bar{\boldsymbol\theta}^* = \sum_b \hat{\boldsymbol\theta}^*_b/B$$

and the bootstrap covariance matrix is

$$S_B = \frac{\sum_b (\hat{\boldsymbol\theta}^*_b - \bar{\boldsymbol\theta}^*)(\hat{\boldsymbol\theta}^*_b - \bar{\boldsymbol\theta}^*)'}{B-1}.$$

$S_B$ is the bootstrap estimate of $\mathrm{Cov}(\hat{\boldsymbol\theta})$, the covariance matrix of $\hat{\boldsymbol\theta}$. The square roots of the diagonal elements are the bootstrap standard errors of the individual components.
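In IML, with the bootstrap estimates stacked into a $B \times q$ matrix (one row per bootstrap sample; the toy numbers and names here are ours), $S_B$ can be formed directly:

proc iml;
/* sketch: bootstrap covariance matrix S_B from a B x q matrix
   thetastar whose bth row is the bth bootstrap estimate */
thetastar = {1.1 2.0, 0.9 2.2, 1.0 1.9, 1.2 2.1};  /* toy B=4, q=2 */
B = nrow(thetastar);
one = j(B,1,1);
thetabar = t(thetastar)*one/B;        /* q x 1 bootstrap mean */
dev = thetastar - one*t(thetabar);    /* center each row */
SB = t(dev)*dev/(B-1);                /* q x q bootstrap covariance */
se = sqrt(vecdiag(SB));               /* bootstrap standard errors */
print SB se;
quit;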


5.2 Bootstrap Confidence Intervals

As noted earlier, a common method of obtaining approximate confidence intervals is to assume $(\hat\theta - \theta)/\hat\sigma_{\hat\theta}$ is approximately standard normal, leading to an approximate confidence interval for $\theta$ of the form $\hat\theta \pm z(1-\alpha/2)\hat\sigma_{\hat\theta}$, where $\hat\sigma_{\hat\theta}$ is the estimated standard error of $\hat\theta$.

Nonparametric confidence intervals can be found using the bootstrap in a variety of ways. There is plenty of discussion of the different methods in the cited literature. The two most commonly used ones that have emerged in practice are the percentile method and the BCa method.

The Percentile Method:

Consider $\alpha_1$ and $\alpha_2$ with $\alpha_1 + \alpha_2 = \alpha$, where usually $\alpha_1 = \alpha_2 = \alpha/2$. The percentile confidence interval for $\theta$ is $[L, U]$, where $L = \hat G^{-1}(\alpha_1)$ and $U = \hat G^{-1}(1-\alpha_2)$. That is, $100\alpha_1\%$ of the bootstrap values are less than or equal to $L$ and $100(1-\alpha_2)\%$ of the bootstrap values are less than or equal to $U$. Note that if $\hat G$ is normal with mean at $\hat\theta$, then the bootstrap percentile interval would agree with $\hat\theta \pm z(1-\alpha/2)\hat\sigma_{B,\hat\theta}$.

Notice that with the percentile method, if $[L, U]$ is the percentile interval for $\theta$, then the percentile interval for any monotone increasing function of $\theta$, say $\phi = g(\theta)$, is simply $[g(L), g(U)]$. This is a nice property that does not hold for the delta method interval, where confidence intervals for nonlinear functions do not simply transform in this way. It is not transparent why the percentile method works. It can be shown that it works well if there is some transformation $g$ and a constant $c$ for which $(g(\hat\theta) - g(\theta))/c$ follows approximately a standard normal distribution. (It is not necessary to know what $g$ is.)
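Continuing the small sketch from Section 5.1.1 (thetastar holding the $B$ bootstrap values), the endpoints can be read off in IML roughly as follows; the sort idiom and the crude quantile indexing mirror the Section 7 program, and a more careful quantile definition could be substituted:

/* sketch: percentile interval from the B bootstrap values in thetastar */
tmp = thetastar;
thetastar[rank(thetastar)] = tmp;      /* sort ascending */
alpha = 0.10;
L = thetastar[int(B*alpha/2) + 1];     /* rough 100(alpha/2) percentile */
U = thetastar[int(B*(1-alpha/2))];     /* rough 100(1-alpha/2) percentile */
print L U;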

The percentile method, which is very easy to calculate, has been found to work well in many cases, but it can encounter problems. In practice an improved bootstrap method called the BCa method (bias corrected accelerated method) is often preferred. The BCa bootstrap interval is $[L, U]$ with $L = \hat G^{-1}(\alpha^*_1)$ and $U = \hat G^{-1}(1-\alpha^*_2)$. This resembles the percentile method but uses modified quantities $\alpha^*_1$ and $\alpha^*_2$. These are somewhat involved to calculate and depend on two quantities: one is a bias correction term and the other is an acceleration term. The first accounts for potential bias (actually median biasedness) in $\hat\theta$, while the "acceleration term" addresses the fact that the standard error of $\hat\theta$ may itself be a function of $\theta$. See for example Efron and Tibshirani for details (but there are some notational differences with our treatment).
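For reference, one standard way to compute these (following Efron and Tibshirani (1993, Ch. 14), translated into our notation; the jackknife form of the acceleration is their suggestion, and other versions exist):

$$\hat z_0 = \Phi^{-1}\left(\frac{\#\{\hat\theta^*_b < \hat\theta\}}{B}\right), \qquad \hat a = \frac{\sum_{i=1}^{n}(\hat\theta_{(\cdot)} - \hat\theta_{(i)})^3}{6\left[\sum_{i=1}^{n}(\hat\theta_{(\cdot)} - \hat\theta_{(i)})^2\right]^{3/2}},$$

where $\hat\theta_{(i)}$ is the estimate with the $i$th observation deleted and $\hat\theta_{(\cdot)}$ is the average of the $\hat\theta_{(i)}$. Then

$$\alpha^*_1 = \Phi\left(\hat z_0 + \frac{\hat z_0 + z(\alpha_1)}{1 - \hat a(\hat z_0 + z(\alpha_1))}\right), \qquad 1 - \alpha^*_2 = \Phi\left(\hat z_0 + \frac{\hat z_0 + z(1-\alpha_2)}{1 - \hat a(\hat z_0 + z(1-\alpha_2))}\right),$$

with $z(\cdot)$ the standard normal quantile. When $\hat z_0 = \hat a = 0$ these reduce to $\alpha^*_1 = \alpha_1$ and $\alpha^*_2 = \alpha_2$, i.e., the percentile interval.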

5.3 Bootstrapping in Regression Models

The methods of using bootstrap estimates for bias, standard error and confidence intervals are the same as in the random sample setting. What changes is the manner in which the bootstrap samples are generated. We continue to assume, as we have done to this point, that the observations on different units are independent.

5.3.1 Random Regressors: bootstrapping (Y,X) together

Suppose all the regressors are random quantities, so when we choose the $i$th unit in the sample we obtain the response and all of the predictors, collected in

$$\mathbf{W}_i = \begin{pmatrix} Y_i \\ X_{i1} \\ \vdots \\ X_{i,p-1} \end{pmatrix}.$$

A regression model is specified for $E(Y|\mathbf{x})$. A variance model may be specified, possibly with heteroscedasticity depending on the X's, but need not be. If the $\mathbf{W}_1, \ldots, \mathbf{W}_n$ arise from a random sample of individual units (i.e., can be treated as i.i.d.), then we can proceed to use the bootstrap as in Section 5.1.2 for any estimator or collection of estimators of interest. This could be the coefficients, variance parameters, correlations, or any functions of these, including nonlinear ones.

We illustrate first with $\boldsymbol\beta$, which is estimated by $\hat{\boldsymbol\beta}$, either through least squares or some type of weighted least squares (possibly iterative). In the $b$th bootstrap sample the estimator is $\hat{\boldsymbol\beta}^*_b$, and the bootstrap estimate of the covariance of $\hat{\boldsymbol\beta}$ is $S_B$. The square roots of the diagonal elements of $S_B$ give the bootstrap standard errors for the individual coefficients. For any one of the coefficients the bootstrap estimates can be used to get the bootstrap estimate of bias and standard error and nonparametric confidence intervals.

For estimating a linear combination of interest, say $\theta = \mathbf{c}'\boldsymbol\beta$, because of the linearity the bootstrap standard error can be calculated via $(\mathbf{c}'S_B\mathbf{c})^{1/2}$. To get the empirical distribution and confidence intervals, though, we need to obtain each $\hat\theta^*_b = \mathbf{c}'\hat{\boldsymbol\beta}^*_b$.

If the interest is in some nonlinear function of $\boldsymbol\beta$, say $\theta = g(\boldsymbol\beta)$, then one calculates $\hat\theta^*_b = g(\hat{\boldsymbol\beta}^*_b)$ for $b = 1$ to $B$.

REMARKS:

• Since under constant variance we know the exact covariance of the least squares estimator and how to get an unbiased estimate of it, there is no need to use the bootstrap for this purpose. However, the bootstrap is useful for getting nonparametric confidence intervals for the coefficients and functions of them. These provide an alternative to the usual t based intervals or the delta method type intervals for nonlinear functions.

• Since the homogeneity of variance assumption does not have to hold in this case, if we are using least squares the bootstrap estimate $S_B$ provides an alternative to White's robust estimate of the covariance (Section 3.2).

5.3.2 Bootstrapping residuals

Consider fixing the x values and suppose the model is

$$Y_i = m(\mathbf{x}_i, \boldsymbol\beta) + \epsilon_i \qquad (7)$$

where the $\epsilon_i$ are assumed to be independent and identically distributed (i.i.d.) with mean 0 from some distribution $F$. (Note that this implies the errors have constant variance $\sigma^2$ and are uncorrelated.) The $i$th residual is $r_i = Y_i - \hat Y_i$.

The $b$th bootstrap sample is generated by setting

$$Y_{bi} = m(\mathbf{x}_i, \hat{\boldsymbol\beta}) + r_{bi}, \quad i = 1 \text{ to } n,$$

where $r_{b1}, \ldots, r_{bn}$ are bootstrapped residuals obtained by sampling with replacement from some collection of values that reflect the distribution of the original $\epsilon_i$. Recall that this distribution is assumed to have mean 0 and variance $\sigma^2$. One possibility is to just sample with replacement from the residuals $r_1, \ldots, r_n$. Notice that when we sample with replacement from the residuals, the variance of the resulting value is $\sum_i r_i^2/n = (n-p)MSE/n$, where $MSE$ is our unbiased estimator of $\sigma^2$. This suggests a modification, namely to sample from the modified residuals $(n/(n-p))^{1/2}r_i$; when we sample from these the variance equals $MSE$. This means that $r_{bi}$ is generated by sampling one of the residuals, and if it is the $k$th residual that is selected then $r_{bi} = (n/(n-p))^{1/2}r_k$. We will always use this modification, although it is clearly not very important as $n$ gets large.

For the linear case with an intercept in the model, $\bar r = \sum_i r_i/n = 0$, so sampling from the residuals or modified residuals is sampling from a distribution with mean 0. Without an intercept $\bar r$ is not zero, so we should resample from the centered residuals $r_i - \bar r$.


The $b$th bootstrap sample consists of $(y_{b1}, \mathbf{x}_1), \ldots, (y_{bn}, \mathbf{x}_n)$, from which we get $\hat{\boldsymbol\beta}^*_b$ and any other estimate of interest. It can be shown that using the modified residuals, as $B \to \infty$, $S_B$ (the sample covariance among the bootstrap estimates $\hat{\boldsymbol\beta}^*_1, \ldots, \hat{\boldsymbol\beta}^*_B$) converges to $MSE(\mathbf{X}'\mathbf{X})^{-1}$. So bootstrapping from the modified residuals is doing exactly the right thing in terms of estimating the covariance of $\hat{\boldsymbol\beta}$. As noted before, the bootstrap is not needed in order to estimate $\Sigma_{\hat\beta}$ under constant variance, since $MSE(\mathbf{X}'\mathbf{X})^{-1}$ provides an unbiased estimator. The usefulness of the bootstrap here is that it allows us to construct confidence intervals for the coefficients and linear combinations of them that are not dependent on normality of $\hat{\boldsymbol\beta}$. We can also do other things such as carry out inferences for $\sigma^2$ (which under the normality assumption are based on a chi-square distribution with $n-p$ degrees of freedom) or handle inferences for nonlinear functions of the parameters.

5.3.3 Bootstrap Prediction Intervals

Suppose we want to predict $Y_0$ at $\mathbf{x}_0$. It is the distribution of $\hat Y_0 - Y_0$ that we need, where $\hat Y_0 = \mathbf{x}_0'\hat{\boldsymbol\beta}$. If the CDF of this distribution is denoted $H$, then $P(H^{-1}(\alpha/2) \le \hat Y_0 - Y_0 \le H^{-1}(1-\alpha/2)) = 1-\alpha$, implying $P(\hat Y_0 - H^{-1}(1-\alpha/2) \le Y_0 \le \hat Y_0 - H^{-1}(\alpha/2)) = 1-\alpha$. This means that

$$[\hat Y_0 - H^{-1}(1-\alpha/2),\ \hat Y_0 - H^{-1}(\alpha/2)]$$

is a $100(1-\alpha)\%$ prediction interval for $Y_0$. Since we don't know $H$, we estimate it via the bootstrap as follows:

For $b = 1$ to $B$:

i) generate $\hat{\boldsymbol\beta}^*_b$ as in Section 5.3.2 and construct $\hat Y_{0b} = \mathbf{x}_0'\hat{\boldsymbol\beta}^*_b$.

ii) generate $Y_{0b} = \hat Y_0 + r_{0b}$, where $r_{0b}$ is a newly generated bootstrap residual.

iii) construct $D_b = \hat Y_{0b} - Y_{0b}$.

Consider the empirical distribution of $D_1, \ldots, D_B$ and denote its percentiles by $\hat H^{-1}(\alpha/2)$ (the value with $\alpha/2$ of the $D_b$ to the left) and $\hat H^{-1}(1-\alpha/2)$. The prediction interval for $Y_0$ (based on the percentile method) is $(\hat Y_0 - \hat H^{-1}(1-\alpha/2),\ \hat Y_0 - \hat H^{-1}(\alpha/2))$.

5.3.4 Bootstrapping the residuals for the heteroscedastic case

Consider the variance model $V(\epsilon_i) = \sigma^2 a_i^2$, where $a_i$ is known. Equivalently, we can view $\epsilon_i$ as $a_i\delta_i$, where the $\delta_i$ are assumed to be independent and identically distributed from some distribution $F$ (having mean 0 and variance $\sigma^2$). So $\delta_i = (w_i)^{1/2}\epsilon_i$, where $w_i = 1/a_i^2$ is the weight. We could estimate the distribution of the residuals by using the modified weighted residuals $\hat\delta_i = (n/(n-p))^{1/2}(w_i)^{1/2}(Y_i - m(\mathbf{x}_i, \hat{\boldsymbol\beta}))$. The weighted residuals do not necessarily add to zero, however, since there isn't an intercept in the transformed model. With $r_i = (w_i)^{1/2}(Y_i - m(\mathbf{x}_i, \hat{\boldsymbol\beta}))$, in general it is better to resample from $\hat\delta_i = (n/(n-p))^{1/2}(r_i - \bar r)$ rather than $\hat\delta_i = (n/(n-p))^{1/2}r_i$. All this really does is change the intercept used in the bootstrapping. In practice the mean of these weighted residuals is often small, so it doesn't make much of a difference.

The bootstrap sample is generated by

$$Y_{bi} = m(\mathbf{x}_i, \hat{\boldsymbol\beta}) + a_i d_{bi},$$

where the $d_{bi}$ are sampled with replacement from $\hat\delta_1, \ldots, \hat\delta_n$. Notice that the $a_i$ is fixed to the $i$th position in the sample, while the $d_{bi}$ comes from selecting one of the weighted residuals. The $\hat{\boldsymbol\beta}$ could be from either ordinary or weighted least squares, depending on which is being used.

If there is a parametric model for the variance, $a_i^2 = v_i(\boldsymbol\beta, \boldsymbol\lambda)$, then $\boldsymbol\lambda$ needs to be estimated in order to do the bootstrapping. This leads to the use of $\hat\delta_i = (n/(n-p))^{1/2}(\hat w_i)^{1/2}(Y_i - m(\mathbf{x}_i, \hat{\boldsymbol\beta}))$, where $\hat w_i = 1/v_i(\hat{\boldsymbol\beta}, \hat{\boldsymbol\lambda})$, and

$$Y_{bi} = m(\mathbf{x}_i, \hat{\boldsymbol\beta}) + (v_i(\hat{\boldsymbol\beta}, \hat{\boldsymbol\lambda}))^{1/2} d_{bi}.$$

If the $\hat{\boldsymbol\beta}$ being used in the original analysis is from two-stage or iteratively reweighted least squares, then within each bootstrap sample you would carry out this same procedure; that is, do two-stage or iteratively reweighted least squares on each bootstrap sample.

5.3.5 Bootstrapping with replication

Suppose there are $k$ distinct collections of regressors, say $\mathbf{x}^*_1, \ldots, \mathbf{x}^*_k$, with $n_j$ observations at $\mathbf{x}^*_j$, so $\sum_j n_j = n$.

In this case, with the x's treated as fixed, we can still resample in a way which allows for changing variances. One option is to use the fitted model (which means we believe we have the right function for the mean); then when we generate an observation at $\mathbf{x}^*_j$, we resample from just the residuals at $\mathbf{x}^*_j$. If the residuals at $\mathbf{x}^*_j$ are denoted $r_{j1}, \ldots, r_{jn_j}$, then we would create modified residuals

$$\tilde r_{jm} = \left[\frac{n_j}{n_j - 1}\right]^{1/2}(r_{jm} - \bar r_j),$$

where $\bar r_j$ is the mean of $r_{j1}, \ldots, r_{jn_j}$. The resampling is from these modified residuals.
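As a quick check on the factor (our derivation, paralleling the $(n/(n-p))^{1/2}$ argument above): resampling uniformly from the modified residuals gives second moment

$$\frac{n_j}{n_j - 1}\cdot\frac{1}{n_j}\sum_{m=1}^{n_j}(r_{jm} - \bar r_j)^2 = \frac{\sum_{m}(r_{jm} - \bar r_j)^2}{n_j - 1},$$

the usual unbiased estimate of the variance at $\mathbf{x}^*_j$.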

Another option is to generate $n_j$ responses at $\mathbf{x}^*_j$ by resampling $n_j$ times with replacement from the original $n_j$ responses at $\mathbf{x}^*_j$. If there are only a few replicates at each distinct $\mathbf{x}^*$, this may not work very well.

5.4 Bootstrap Hypothesis Testing

Hypothesis testing is a popular way of carrying out inferences. There are a couple of ways to carry out bootstrap tests of hypotheses.

Method 1: Try to generate the null distribution.

One way is to emulate carrying out a test based on some test statistic. Suppose the test is based on a test statistic $Q$ and rejects $H_0$ if $Q$ is large (almost all tests can be put into this form). The bootstrap approach simulates the null distribution of the test statistic $Q$. With this approach, the bootstrap samples must be generated under the null model (the model incorporating the null hypothesis). Then for each bootstrap sample the test statistic is calculated, and the empirical distribution of these test statistics over the $B$ bootstrap samples is an estimate of the null distribution. Suppose the test statistic being used has an observed value $Q_{obs}$ and the bootstrap values are $Q_1, \ldots, Q_B$ (so $Q_b$ was the value of the test statistic generated in the $b$th bootstrap sample). The bootstrap P-value of the test is

$$P_{boot} = \frac{\#\{b : Q_b \ge Q_{obs}\}}{B}.$$

The null hypothesis is rejected if $P_{boot}$ is less than $\alpha$, the desired level of the test (e.g., .05 for a test at the 5% level).

This method of bootstrapping mimics our usual approach to testing based on the null distribution and is a popular way to approach testing. In more complicated situations it can be difficult to figure out how to properly resample under the null. Even when you can do so, this method can have problems. Oftentimes there are parameters involved in the model which must be estimated, and the distribution of the test statistic under the null hypothesis may depend on these parameters. That is, the test statistic is not what is known as a pivotal quantity. For regression problems, in the case of uncorrelated errors and constant variance, where the bootstrap is being used to protect against non-normal errors, this approach does fairly well. More serious problems can arise in more complicated models, including heteroscedasticity or correlated errors. This issue has not received the attention it deserves in practice, and further work is needed on the magnitude of the problem.
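As a small illustration of Method 1 in the regression setting (a sketch on made-up data; the statistic, names and null resampling scheme are our choices): to test $H_0: \beta_1 = 0$ under constant variance, resample residuals from the fitted null (intercept-only) model and recompute the squared t statistic each time.

proc iml;
/* sketch: bootstrap test of H0: beta1 = 0 by resampling under the null */
y = {1, 3, 2, 5, 4};                 /* toy data */
x = {1 1, 1 2, 1 3, 1 4, 1 5};
n = nrow(x);  p = ncol(x);
xpxinv = inv(t(x)*x);
start qstat(yy) global(x, xpxinv, n, p);
  b = xpxinv*t(x)*yy;
  mse = ssq(yy - x*b)/(n-p);
  return( b[2]##2/(mse*xpxinv[2,2]) );   /* squared t for the slope */
finish;
qobs = qstat(y);
ybar = sum(y)/n;                     /* null (intercept-only) fit */
r0 = y - ybar;                       /* null residuals */
B = 1000;  count = 0;  yb = y;
do bi = 1 to B;
  do i = 1 to n;
    k = int(uniform(0)*n + 1);
    yb[i] = ybar + r0[k];            /* generate data under H0 */
  end;
  if qstat(yb) >= qobs then count = count + 1;
end;
pboot = count/B;                     /* bootstrap P-value */
print qobs pboot;
quit;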

Method 2: Invert a confidence interval.

For a single parameter, hypothesis testing can also be carried out using a bootstrap confidence interval. We illustrate for the two-sided test of $H_0: \theta = \theta_0$ versus $H_A: \theta \ne \theta_0$. An approximate test of size $\alpha$ is to reject $H_0$ if the $100(1-\alpha)\%$ confidence interval for $\theta$ does not contain $\theta_0$. A test at level .05 would use a 95% confidence interval, a level .10 test a 90% confidence interval, etc. A P-value for the test can be obtained by finding the smallest level at which the null hypothesis is rejected. Equivalently, this means finding the largest value $C$ for which a $100C\%$ confidence interval contains $\theta_0$ and then taking the P-value as $1 - C$. For one-sided tests, a similar approach can be taken using one-sided confidence intervals.

Because of the potential problems mentioned earlier with obtaining a P-value by directly bootstrapping the test statistic under the null, we recommend carrying out the test via confidence intervals when possible.

5.4.1 Summary

Briefly, the main points in this section are:

• Bootstrapping the residuals requires that we have a model for the mean and the variance, except when we have replication.

• If we have a random sample of units, then you can either bootstrap the $(y_i, \mathbf{x}_i)$ sets or you can bootstrap the residuals if you believe you have the right model for $Y|\mathbf{x}$. Bootstrapping the $(y_i, \mathbf{x}_i)$ sets does not require you to have a model for the variance, and it allows heteroscedasticity.

• If the x's are fixed then in general you should bootstrap the residuals, although there are some cases where bootstrapping the sets works okay.

• If you have random sampling of units that is not a single random sample (i.e., it comes from something like a stratified or multi-stage sampling scheme), then you can bootstrap the residuals or bootstrap the $(y, \mathbf{x})$ sets in a way which reflects the original sampling scheme.

5.5 References

Kutner et al. (Section 11.5).
Efron and Tibshirani (1993). A comprehensive look but not too technical. Chapter 9 handles regression in detail.
Diaconis and Efron (1983), Efron and Tibshirani (1986), Leger et al. (1992, p. 378), Davison and Hinkley (1997), Manly (2000).


7 Univariate bootstrap - food expenditures

This example illustrates the use of the bootstrap in a single sample where we use the sample mean, variance, standard deviation, coefficient of variation (standard deviation/mean), median and the 90th percentile for illustration. The data consist of food expenditures in dollars for 40 randomly sampled households in Ohio. (Data from "Exploring Statistics" by Kitchens, originally from the Bureau of Labor Statistics.) B = 1000 bootstrap samples were used.

The original data has sample mean $\bar x = 3609.6$, standard deviation $s = 1509.3$, median 3165 and 90th percentile 5683.5.

• The sample mean is known to be an unbiased estimator of the population mean $\mu$, with exact standard error $\sigma/n^{1/2}$, usually estimated by $s/n^{1/2} = 238.6412$ (Std Error Mean). An approximate confidence interval for $\mu$, based on the approximate normality of the sample mean, is found using $\bar x \pm t(1-\alpha/2, n-1)s/n^{1/2}$. (We usually use the t rather than a normal value, even when we think the population is not normal but the sample size is "large", in order to be conservative. There is little difference between the t and z values, though, for even moderate $n$.) The 90% interval here is (3207.52, 4011.68).

The mean of the 1000 bootstrap sample means is 3617.708. As we know the sample mean is unbiased, there is no need to estimate bias. If we do, the bootstrap estimate of bias is 3617.708 − 3609.6 = 8.1. This is small relative to the estimate, and the only reason it is not 0 is the use of a finite number (1000) of bootstrap samples. As you increase B this will go to 0. The bootstrap estimate of the standard error of $\bar x$, from the 1000 samples, is 239.0997 (the standard deviation of the 1000 sample means). The theoretical bootstrap estimate of the standard error, what it converges to as $B \to \infty$, is $\hat\sigma/n^{1/2}$ (with $\hat\sigma$ as in (6)), which can be computed directly and differs modestly from $s/n^{1/2}$.

The bootstrap is not needed for assessing bias or getting the standard error of $\bar x$ and is done here for illustration and to show agreement with the usual results. For a "nonparametric" confidence interval, the 90% percentile interval uses the 5th and 95th percentiles as endpoints. This yields (3256.8, 4034.2). This is not that different from the earlier interval, which is not surprising given the normality of $\bar x$ as demonstrated by the empirical bootstrap distribution of the sample mean.

• There are analytical procedures (often approximate) for treating the variance, standard deviation or coefficient of variation.

Note that the sample variance is unbiased for $\sigma^2$ but $s$ is biased for $\sigma$. The bootstrap estimate of the bias of $s$ as an estimator of $\sigma$ is $1476.5 - 1509.3(39/40)^{1/2} = -13.8143$.

For the variance, if we assume the population is normal then an exact confidence interval is available based on the chi-square with $n-1$ degrees of freedom, and taking square roots gives a confidence interval for $\sigma$. The 90% intervals are given from proc univariate under the normality assumption and are (1627961, 3457486) for the variance and (1276, 1859) for the standard deviation. Without the normality assumption, there are large sample results that can be used, based on the approximate normality of the sample variance or sample standard deviation. These involve an analytical expression for the asymptotic standard error. Note that if you use the approximate normality of the sample variance and then separately work with the standard deviation, the interval for the standard deviation is not just the square root of the interval for the variance. The bootstrap percentile intervals, though, transform directly.

The 90% bootstrap percentile confidence interval for σ is (1069.64, 1840.1).

The sample coefficient of variation is $s/\bar x = .418$. As an estimator, $s/\bar x$ is a ratio, and we cannot get an exact expression for its expected value and hence its bias (although we can approximate it using our earlier methods). The bootstrap estimate of bias is $.407 - .418(39/40)^{1/2} = -.00574$, indicating bias is not a serious issue. It is possible, using a multivariate central limit theorem and the delta method, to determine that for large sample size the sample coefficient of variation is approximately normal with mean equal to the population coefficient of variation and some standard deviation that depends on a number of parameters that must be estimated. This is one way to approach the problem, but it relies on approximations and estimation of unknowns in the approximate standard error. Using the bootstrap, the 90% bootstrap percentile confidence interval for the population CV is (.32, .48).


Notice from the empirical distributions that the sampling distributions are approximately normal (as large sample theory tells us they will be). In these cases, the intervals that come out of the bootstrap percentile method will be close to what is obtained using a normal approximation and intervals of the form estimate $\pm z(1-\alpha/2)SE$, where $SE$ is the bootstrap standard error. For the standard deviation, which has the least normal looking of the sampling distributions, the resulting interval is (1092.3, 1860.6), which is not too different from the percentile method.

Proc univariate gives approximate confidence intervals for population percentiles, both under normality assumptions and without normality. The distribution free intervals are based on using the order statistics. These are described in the SAS online documentation.

• The median can be addressed in a similar manner. The bootstrap estimate of bias (3182.78 − 3165) is relatively small. The 90% bootstrap percentile interval for the median is (2837.5, 3679).

• The 90th percentile shows some difficulty with employing the bootstrap. There are only some values that typically end up as the 90th percentile in the bootstrap sample, leading to a very discrete distribution. While this is not particularly problematic in estimating the bias or standard deviation (though it could be in small sample sizes), it poses problems with the confidence intervals. Notice that the 95th and 99th percentiles of the empirical distribution are the same. A 90% percentile interval is (4367, 7580) while a 98% one is (3970, 7580), which is a bit unsatisfactory with the upper point staying the same. One way to deal with this problem, and a general strategy that can be employed in bootstrapping, is to smooth the data before resampling, so the resampling is from a continuous distribution rather than from a set of points.

The UNIVARIATE Procedure

Variable: expend

Moments

N 40 Sum Weights 40

Mean 3609.6 Sum Observations 144384

Std Deviation 1509.29978 Variance 2277985.84

Skewness 1.4033352 Kurtosis 1.9356254

Uncorrected SS 610009934 Corrected SS 88841447.6

Coeff Variation 41.8134913 Std Error Mean 238.641249

Basic Statistical Measures

Location Variability

Mean 3609.600 Std Deviation 1509

Median 3165.000 Variance 2277986

Mode 2648.000 Range 6967

Interquartile Range 1462

Basic Confidence Limits Assuming Normality

Parameter Estimate 90% Confidence Limits

Mean 3610 3208 4012

Std Deviation 1509 1276 1859

Variance 2277986 1627961 3457486

Tests for Location: Mu0=0

Test -Statistic- -----p Value------

Student’s t t 15.12563 Pr > |t| <.0001

Sign M 20 Pr >= |M| <.0001

Signed Rank S 410 Pr >= |S| <.0001


Quantiles (Definition 5)

90% Confidence Limits

Quantile Estimate Assuming Normality

100% Max 8147.0

99% 8147.0 6471.642 8048.363

95% 7225.0 5566.512 6817.607

90% 5683.5 5073.978 6171.152

75% Q3 4110.0 4222.654 5117.472

50% Median 3165.0 3207.519 4011.681

25% Q1 2648.0 2101.728 2996.546

10% 2263.5 1048.048 2145.222

5% 2031.0 401.593 1652.688

1% 1180.0 -829.163 747.558

0% Min 1180.0

90% Confidence Limits -------Order Statistics-------

Quantile Distribution Free LCL Rank UCL Rank Coverage

100% Max

99% 6870 8147 38 40 32.35

95% 5587 8147 36 40 82.35

90% 4670 8147 33 40 94.33

75% Q3 3679 5176 26 35 90.23

50% Median 2830 3679 15 26 91.93

25% Q1 2416 2830 6 15 90.23

10% 1180 2534 1 8 94.33

5% 1180 2352 1 5 82.35

1% 1180 2086 1 3 32.35

0% Min

Variable: expend

Stem Leaf # Boxplot

8 1 1 0

7 6 1 0

7

6 9 1 0

6

5 68 2 |

5 02 2 |

4 7 1 |

4 024 3 +-----+

3 67789 5 | + |

3 122234 6 *-----*

2 556667888889 12 +-----+

2 01244 5 |

1 |

1 2 1 |

----+----+----+----+

Multiply Stem.Leaf by 10**+3


Variable=SMEAN

Moments

N 1000 Sum Wgts 1000

Mean 3617.708 Sum 3617708

Std Dev 239.0997 Variance 57168.68

Quantiles(Def=5)

100% Max 4391.55 99% 4210.488

75% Q3 3771.675 95% 4034.175

50% Med 3609.85 90% 3934.488

25% Q1 3445.788 10% 3323.463

0% Min 2893.725 5% 3256.838

1% 3117.375

Variable=SMEAN

Histogram # Boxplot

4350+* 4 0

.** 8 0

.***** 18 |

4050+********** 37 |

.**************** 63 |

.*********************** 92 |

3750+****************************** 118 +-----+

.******************************************** 176 *--+--*

.************************************** 149 | |

3450+*************************************** 154 +-----+

.************************* 99 |

.************** 56 |

3150+***** 17 |

.** 6 |

.* 2 |

2850+* 1 0

----+----+----+----+----+----+----+----+----

* may represent up to 4 counts

** ELIMINATED OUTPUT ON VARIANCE


Variable=SD

N 1000 Sum Wgts 1000

Mean 1476.469 Sum 1476469

Std Dev 233.542 Variance 54541.85

Skewness -0.27602 Kurtosis -0.15722

Quantiles(Def=5)

100% Max 2049.384 99% 1959.248

75% Q3 1650.457 95% 1840.111

50% Med 1486.777 90% 1772.416

25% Q1 1327.667 10% 1161.406

0% Min 734.4266 5% 1069.64

1% 894.3367

Histogram # Boxplot

2050+** 6 |

.***** 17 |

.************** 54 |

.*********************** 91 |

.************************************* 147 +-----+

.****************************************** 166 | |

.**************************************** 157 *--+--*

.************************************ 142 +-----+

.*********************** 92 |

.****************** 71 |

.******** 29 |

.***** 17 |

.*** 9 0

750+* 2 0

----+----+----+----+----+----+----+----+--

* may represent up to 4 counts


Variable=CV

N 1000 Sum Wgts 1000

Mean 0.406827 Sum 406.8272

Std Dev 0.050113 Variance 0.002511

Quantiles(Def=5)

100% Max 0.562396 99% 0.514871

75% Q3 0.440757 95% 0.482886

50% Med 0.410499 90% 0.466752

25% Q1 0.376334 10% 0.34338

0% Min 0.233583 5% 0.318583

1% 0.273199

Variable=CV

Histogram # Boxplot

0.57+* 1 0

.* 2 0

.* 4 |

.*** 11 |

.********** 38 |

.******************** 78 |

.******************************* 122 +-----+

.***************************************** 163 | |

.****************************************** 167 *--+--*

.************************************* 146 | |

.************************ 95 +-----+

.********************* 83 |

.********** 37 |

.****** 21 |

.***** 17 |

.*** 9 0

.** 5 0

0.23+* 1 0

----+----+----+----+----+----+----+----+--

* may represent up to 4 counts


Variable=MEDIAN

Moments

N 1000 Sum Wgts 1000

Mean 3182.78 Sum 3182780

Std Dev 247.4359 Variance 61224.53

Quantiles(Def=5)

100% Max 3869 99% 3781.5

75% Q3 3320 95% 3679

50% Med 3165 90% 3530

25% Q1 2999.5 10% 2847

0% Min 2734 5% 2837.5

1% 2774.5

Variable=MEDIAN

Histogram # Boxplot

3875+* 1 0

.** 8 0

.** 7 |

.****** 22 |

.******* 26 |

.****** 23 |

.* 1 |

.******** 32 |

.** 8 |

.******** 32 |

.********** 37 |

.*************** 60 +-----+

.*************** 60 | |

.********************************* 129 | |

.************************************* 145 *--+--*

.************************* 100 | |

.* 2 | |

.*********** 41 | |

.****************** 69 +-----+

. |

.********************* 84 |

.*********************** 90 |

.***** 20 |

2725+* 3 |

----+----+----+----+----+----+----+--

* may represent up to 4 counts


Variable=P90

N 1000 Sum Wgts 1000

Mean 5609.665 Sum 5609665

Std Dev 846.6281 Variance 716779.1

Quantiles(Def=5)

100% Max 8147 99% 7580

75% Q3 5780 95% 7580

50% Med 5587 90% 6870

25% Q1 4978 10% 4670

0% Min 3743 5% 4367

1% 3970

Variable=P90

Histogram # Boxplot

8100+* 3 0

.

.

.********** 48 0

.

.

.***************************** 142 |

. |

. |

. |

. |

5900+ |

.************************************** 187 +--+--+

.***************************************** 203 *-----*

. | |

.********************************* 161 | |

.*********************** 115 +-----+

.************ 59 |

. |

.************** 67 |

. |

.*** 14 |

3700+* 1 0

----+----+----+----+----+----+----+----+-

* may represent up to 5 counts


title ’Bootstrap with a single sample’;

options pagesize=60 linesize=80;

/* THIS IS A PROGRAM TO BOOTSTRAP WITH A SINGLE

RANDOM SAMPLE USING THE MEAN, VARIANCE, STANDARD

DEVIATION, COEFFICIENT OF VARIATION, MEDIAN, 90TH PERCENTILE */

filename bb ’boot.out’;

/* READ IN ORIGINAL DATA INTO INTERNAL SAS FILE values */

data values;

infile ’food.dat’;

input expend;

run;

title ’descriptive statistics on original sample’;

proc univariate cibasic cipctlnormal cipctldf plot alpha=.10;

run;

/* START INTO IML WHERE BOOTSTRAPPING WILL BE DONE */

proc iml;

/* put data into vector x */

use values;

read all var{expend} into x;

close values;

n=nrow(x); /* = sample size */

xb=x; /* initializes xb to be same size as x */

nboot = 1000; /* specify number of bootstrap replicates */

do j=1 to nboot;

/* get the n samples with replacement. i indicates

the sampling within bootstrap replicate j. The generated

k is a discrete uniform over 1 to n; the function int

takes the integer part */

do i= 1 to n;

uv=uniform(0);

k=int(uv*n+1);

xb[i]=x[k];

end;

/* xb contains the n values in bootstrap sample */

/*compute statistics of interest, sum and ssq are matrix functions

that do sum and sum of squares */

smean = sum(xb)/n; /* get sample mean */

svar=(ssq(xb) - (n*(smean**2)))/(n-1); /* sample variance */

sd = sqrt(svar); /* sample s.dev. */

cv = sd/smean; /* coefficient of variation */


/*compute median and 90th percentile */

b=xb; /* initializes b*/

xb[rank(xb)] = b; /* xb has ranked values */

c1=int(n/2);

c2=c1+1;

median=(xb[c1]+xb[c2])/2; /* use if n is even */

diff = c1 - (n/2);

if diff <0 then median = xb[c2]; /* if n is odd */

d= int(.9*n);

p90= xb[d]; /* rough 90th percentile. Can be refined */

/* the next two commands put the results to

file bb which is aliased with external file boot.out

through the filename statement at beginning. The

+1 in the put statement says to skip one space. */

file bb;

put smean +1 svar +1 sd +1 cv +1 median +1 p90;

end;

quit;

run;

/* Get descriptive statistics via proc univariate */

data new;

infile ’boot.out’;

input smean svar sd cv median p90;

run;

proc univariate plot;

run;


8 Bootstrap Regression Sets - Esterase Assay

Here we demonstrate the bootstrap where it is assumed that the n units in the study are a random sample (independent and identically distributed) so we can resample the $(Y, \mathbf{x})$ sets. We will demonstrate using the Esterase Assay data in order to compare it to our earlier results. This assumes there is a sample of 106 individuals, and for each individual we get the true esterase concentration via some exact method and at the same time get a binding count from running the radioimmunoassay. This would not be the right way to proceed if this is a designed experiment using standards with known concentrations. That would involve bootstrapping residuals.

Here we will use the bootstrap to get an estimated covariance matrix for the least squares estimates of the coefficients and to get confidence intervals. Using least squares does not require that we model the variance, but the usual estimated covariance is known to be wrong. Analytically, one option was to use White's robust estimator. See Example 4.1 for the least squares results and the robust estimate of covariance (labeled consistent covariance of estimates). The least squares estimates are unbiased (assuming the linear model is right) so we don't need the bootstrap to assess bias. The bootstrap estimates of standard error are 21.1 for the intercept and 1.33 for the slope; these are the square roots of the diagonal elements of the bootstrap estimate of $\Sigma_{\hat\beta}$, which is labeled Covariance Matrix below. Notice the similarity between this and White's robust estimator. The empirical bootstrap distributions demonstrate the normality of the estimators, and intervals based on the normal approximation using the bootstrap standard errors should be reasonable. The 90% percentile intervals are (−53.1, 15.6) for $\beta_0$ and (15.0, 19.3) for $\beta_1$.

BHAT MSE

-15.99447 10414.091

17.041412

Covariance Matrix B0 B1

B0 445.4256684 -26.5742032

B1 -26.5742032 1.7741409

Variable=B0 Mean -17.072 Std Dev 21.10511

Quantiles(Def=5)

100% Max 53.45373 99% 27.74107

75% Q3 -2.20834 95% 15.56278

50% Med -16.3072 90% 8.888825

25% Q1 -30.7712 10% -44.3327

0% Min -105.704 5% -53.0877

1% -68.0202

Histogram # Boxplot

55+* 1 0

.

.** 6 |

.***** 24 |

.************ 58 |

.*********************** 115 |

.************************************ 179 +-----+

.**************************************** 197 *--+--*

-25+********************************* 161 | |

.************************ 119 +-----+

.*************** 75 |

.******** 39 |

.**** 17 |

.* 4 0

.* 2 0

.* 2 0

-105+* 1 0


Variable=B1

N 1000 Sum Wgts 1000

Mean 17.08731 Sum 17087.31

Std Dev 1.331969 Variance 1.774141

Quantiles(Def=5)

100% Max 22.39998 99% 20.61564

75% Q3 17.9565 95% 19.31641

50% Med 17.05589 90% 18.81349

25% Q1 16.19792 10% 15.38463

0% Min 12.68837 5% 14.96907

1% 14.18281

Variable=B1

Histogram # Boxplot

22.25+* 1 0

.* 1 0

.* 2 0

.*** 9 0

.** 7 |

.***** 19 |

.********** 40 |

.************** 55 |

.*************************** 105 |

.******************************* 124 +-----+

.************************************** 152 *--+--*

.************************************ 143 | |

.*********************************** 139 +-----+

.********************** 87 |

.**************** 64 |

.********* 35 |

.**** 13 |

.* 2 |

.* 1 0

12.75+* 1 0

----+----+----+----+----+----+----+---

* may represent up to 4 counts


title ’Esterase Hormone data subset -bootstrap’;

options pagesize=60 linesize=80;

/* THIS IS A PROGRAM TO BOOTSTRAP IN REGRESSION WITH

RESAMPLING OF THE X AND Y’S TOGETHER ,

NEXT LINE SETS UP CORRESPONDENCE BETWEEN FILE bb INSIDE SAS

AND THE EXTERNAL FILE bhat.out */

filename bb ’bhat.out’;

/* READ IN ORIGINAL DATA INTO INTERNAL SAS FILE a */

data a;

infile ’ester.dat’;

input ester count;

con=1.0;

run;

proc iml;

/* put data into y vector and x matrix */

use a;

read all var {count} into y;

read all var {con ester} into x;

close a;

/* need to do next two lines just to define the vector

yb and matrix xb that will be used in bootstrap */

yb=y;

xb=x;

/* get usual least squares estimators and

mean squared error using matrix forms

bhat is beta(hat) and MSE is the mean square error

t(x) stands for transpose of the matrix x

r is vector of residuals and ssq is a function which

gets the sums of squares of the vector in argument */

xpxinv=inv(t(x)*x);

bhat=xpxinv*(t(x)*y);

yhat=x*bhat;

r=y-yhat;

sse=ssq(r);

df=nrow(x)-ncol(x);

mse=sse/df;

n=nrow(x);

print bhat mse;

/* j indexes the number of bootstrap replicates */

do j=1 to 1000; /* number of bootstrap replicates (matches N=1000 in output) */


/* get the n samples with replacement. i indicates

the sampling within bootstrap replicate j; the generated

k is a discrete uniform over 1 to n; the function int

takes the integer part */

do i= 1 to n;

uv=uniform(0);

k=int(uv*n+1);

yb[i]=y[k];

xb[i,1]=x[k,1];

xb[i,2]=x[k,2];

end;

/* now do least squares with xb the new x matrix and

yb the new response vector; bhatb and mseb hold

the results (the b at the end stands for bootstrap) */

xpxinvb=inv(t(xb)*xb);

bhatb=xpxinvb*(t(xb)*yb);

yhatb=xb*bhatb;

rb=yb-yhatb;

sseb=ssq(rb);

mseb=sseb/df;

b0=bhatb[1];

b1=bhatb[2];

/* the next two commands put the results to

file bb which is aliased with external file bhat.out

through the filename statement at beginning. The

+1 in the put statement says to skip one space. If

you don’t do this, there are no blanks between variables. */

file bb;

put b0 +1 b1 +1;

end;

quit;

run;

/* Now go and get descriptive statistics through proc corr

and proc univariate. I ran proc corr so can get the

estimated covariance of beta(hat). This is the sample

covariance of the bhatb’s over the bootstrap samples; this

is obtained with the cov option */

data new;

infile ’bhat.out’;

input b0 b1;

run;

proc corr cov;

run;

proc univariate plot;

run;


9 Bootstrap Regression: Residuals - Esterase Assay/weighted

Here we demonstrate the bootstrap by resampling the residuals. We allow for specifying fixed weights to use for weighted least squares. We demonstrate using the Esterase Assay data, where it is assumed that $V(\epsilon_i) = x_i^2\sigma^2$.

We fit this model and got estimated standard errors in Example 4.2. There is no need for the bootstrap for these purposes, but the bootstrap will be useful for assessing the distribution and getting confidence intervals. In addition to working with the coefficients, we also estimate the mean value at $x = 20$, called M20 in the output.

NOTE: In the SAS code, the $i$th component of ynew is $Y^*_i = w_i^{1/2}Y_i$ and the $i$th row of xb is $w_i^{1/2}\mathbf{x}_i'$. This means the $i$th component of yhatw is $w_i^{1/2}\mathbf{x}_i'\hat{\boldsymbol\beta}$ and the $i$th component of rw is $Y^*_i - w_i^{1/2}\mathbf{x}_i'\hat{\boldsymbol\beta} = w_i^{1/2}(Y_i - \mathbf{x}_i'\hat{\boldsymbol\beta})$. If we multiply this by $(n/(n-p))^{1/2}$, this is what is called $\hat\delta_i$ in Section 5.3.4 of the notes.

The estimated coefficients and MSE agree with the weighted analysis in Example 4.2. The estimated covariance matrix of the coefficients (and associated standard errors) differs modestly from the covariance matrix and standard errors in Example 4.2, in part due to the use of B = 1000.

The estimators are all approximately normal, and the normal based confidence intervals will be very similar to the percentile intervals.

BHATW MSE

-39.69989 18.929723

18.329729

Covariance Matrix DF = 999

B0 B1

B0 173.7401744 -10.9221134

B1 -10.9221134 0.8742081

Variable=B0 N 1000 Sum Wgts 1000

Mean -40.4334 Sum -40433.4

Std Dev 13.18105 Variance 173.7402

Quantiles(Def=5)

100% Max 1.447395 99% -7.49063

75% Q3 -31.84 95% -18.3454

50% Med -40.4421 90% -23.6088

25% Q1 -49.6165 10% -57.8415

0% Min -81.9774 5% -61.7578

1% -70.0547

Histogram # Boxplot

2.5+* 1 0

.** 5 0

.*** 9 |

.*** 11 |

.*********** 41 |

.************** 55 |

.********************* 83 |

.******************************* 121 +-----+

.**************************************** 160 | |

.*************************************** 153 *--+--*

.****************************** 118 +-----+

.**************************** 109 |

.***************** 68 |

.********** 40 |

.**** 16 |

.** 6 |

.* 3 0

-82.5+* 1 0


Variable=B1 N 1000 Sum Wgts 1000

Mean 18.37718 Sum 18377.18

Std Dev 0.934991 Variance 0.874208

Quantiles(Def=5)

100% Max 21.39577 99% 20.44042

75% Q3 19.01847 95% 19.84456

50% Med 18.36889 90% 19.62005

25% Q1 17.77287 10% 17.18073

0% Min 15.33568 5% 16.73356

1% 16.13866

Histogram # Boxplot

21.25+* 1 0

.** 7 |

.****** 27 |

.****************** 89 |

.**************************** 136 +-----+

.************************************** 187 | |

18.25+******************************************* 213 *--+--*

.************************************ 179 +-----+

.****************** 86 |

.********* 45 |

.***** 22 |

.** 7 0

15.25+* 1 0

----+----+----+----+----+----+----+----+---

* may represent up to 5 counts

Variable=M20 Mean 327.1102 Sum 327110.2

Std Dev 9.302627 Variance 86.53886

100% Max 349.3078 99% 346.8507

75% Q3 333.6212 95% 342.377

50% Med 327.4056 90% 339.2269

25% Q1 320.7747 10% 314.7616

0% Min 297.3015 5% 311.4262

1% 304.0914

Histogram # Boxplot

347.5+***** 25 |

.************** 66 |

.********************** 106 |

.**************************************** 199 +-----+

.**************************************** 199 *--+--*

322.5+************************************* 184 +-----+

.*********************** 115 |

.*************** 72 |

.***** 21 |

.*** 11 0

297.5+* 2 0

----+----+----+----+----+----+----+----+

* may represent up to 5 counts


title ’Esterase Hormone data -bootstrap’;

options pagesize=60 linesize=80;

/* THIS IS A PROGRAM TO BOOTSTRAP IN REGRESSION WITH

BOOTSRAPPING ON THE RESIDUALS.

- VARIANCE IS ASSUMED OF THE FORM SIGMA^2*a_i^2.

- USES WEIGHTED LEAST SQUARES.

- IF EQUAL VARIANCE, SET a = 1 AND LEAVE REST THE SAME. */

filename bb ’bhat2.out’;

/* DOING ESTERASE ASSAY EXAMPLE WITH VARIANCE PROPORTIONAL TO X SQUARED.*/

data a;

infile ’ester.dat’;

input ester count;

con=1.0;

a2 = ester**2;

wt=1/a2;

ystar=count*sqrt(wt);

x1star=1*sqrt(wt);

x2star=ester*sqrt(wt);

run;

proc iml;

/* put transformed data into ynew vector and xnew matrix */

use a;

read all var {ystar} into ynew;

read all var {x1star x2star} into xnew;

close a;

yb=ynew; /* initialize */

xb=xnew;

/* gets weighted least squares estimators and MSE */

xpxinv=inv(t(xnew)*xnew);

bhatw=xpxinv*(t(xnew)*ynew);

yhatw=xnew*bhatw;

rw=ynew-yhatw;

sse=ssq(rw);

df=nrow(xnew)-ncol(xnew);

mse=sse/df;

n=nrow(xnew);

print bhatw mse;

nboot = 1000;

do j=1 to nboot;

/* Resample from residuals and add to fitted value

note that already has weighting built in and modifies residual.*/

do i= 1 to n;


uv=uniform(0);

k=int(uv*n+1);

yb[i]=yhatw[i] + sqrt(n/(n-2))*rw[k];

end;

/* Weighted least squares on bootstrap sample. */

xpxinvb=inv(t(xb)*xb);

bhatb=xpxinvb*(t(xb)*yb);

yhatb=xb*bhatb;

rb=yb-yhatb;

sseb=ssq(rb);

mseb=sseb/df;

b0=bhatb[1];

b1=bhatb[2];

m20=bhatb[1]+bhatb[2]*20; /* estimate of mean at x = 20 */

/* the next two commands put the results to

file bb which is aliased with external file bhat2.out

through the filename statement at beginning. The

+1 in the put statement says to skip one space. If

you don’t do this, there are no blanks between variables. */

file bb;

put b0 +1 b1 +1 m20;

end;

quit;

run;

data new;

infile ’bhat2.out’;

input b0 b1 m20;

run;

proc corr cov;

var b0 b1;

run;

proc univariate plot;

run;


10 Estimating a ratio - optimal pricing of hamburgers

This example demonstrates the estimation of a ratio in an economic example. The problem and data are from Griffiths et al. ("Learning and Practicing Econometrics", p. 314). $Y$ = number of hamburgers sold, $p$ = price of a hamburger (in rubles) and $a$ = advertising expenditures (in 100s of rubles). The model fit is $Y_i = \beta_0 + \beta_1 p_i + \beta_2 a_i + \beta_3 a_i^2 + \epsilon_i$, where it is assumed that the $\epsilon_i$ are uncorrelated with mean 0 and constant variance. The residual plots indicate no serious problems with these assumptions, but as the individual observations are for consecutive months we should be worried about whether the errors are correlated over time. We will assume they are not and examine the potential for serial correlation later. The expected profit at price $p$ and advertising $a$ is $E(Y)p - E(Y)c - 100a = E(Y)(p-c) - 100a = (\beta_0 + \beta_1 p + \beta_2 a + \beta_3 a^2)(p-c) - 100a$, where $c$ is the cost of producing a hamburger. At fixed advertising $a$, differentiating with respect to $p$ and setting the derivative equal to 0 yields an optimal price (the one maximizing expected profit) of $\theta = (\beta_1 c - \beta_0 - \beta_2 a - \beta_3 a^2)/(2\beta_1)$.
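Spelling out the differentiation step, with expected profit $\pi(p) = (\beta_0 + \beta_1 p + \beta_2 a + \beta_3 a^2)(p - c) - 100a$:

$$\frac{d\pi}{dp} = \beta_1(p - c) + (\beta_0 + \beta_1 p + \beta_2 a + \beta_3 a^2) = 2\beta_1 p + \beta_0 + \beta_2 a + \beta_3 a^2 - \beta_1 c = 0,$$

so $p = (\beta_1 c - \beta_0 - \beta_2 a - \beta_3 a^2)/(2\beta_1) = \theta$, a ratio of linear combinations of the coefficients, as in Section 4.7.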

The first part of the analysis gives a least squares fit using proc reg and examines the residuals. There is no indication that the model is wrong, and although there is some indication that the variance may be changing, it does not appear to be dramatic. The robust estimate of $\mathrm{Cov}(\hat{\boldsymbol\beta})$ via White's method is quite different from our usual estimate under constant variance. One issue here is that with a moderate number of observations White's estimator can be erratic even when the constant variance assumption is met.

Next, using IML we estimate the optimal price at c = 1 and a = 2.8 (280 rubles). The estimated optimal price is 5.32. The Fieller and delta method 90% confidence intervals for θ are (4.76, 6.42) and (4.58, 6.06), respectively. The delta method estimate of the standard error is 0.425.
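As a check, plugging the least squares estimates reported below into the formula with c = 1 and a = 2.8 gives θ̂ = (−127.925(1) − 1035.384 − 162.360(2.8) − (−32.686)(2.8^2))/(2(−127.925)) = −1361.66/(−255.85) ≈ 5.322, agreeing with the IML output.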

The final part of the analysis runs a bootstrap to analyze θ and get a nonparametric confidence interval. The mean of the bootstrap sample is 5.406 and the bootstrap estimate of the standard error is 0.501. The empirical bootstrap distribution shows some skewness, and the 90% percentile interval, formed from the 5th and 95th percentiles of the bootstrap values (see the univariate output below), is (4.79, 6.29), in general agreement with the Fieller and delta method intervals.

OBS NUMBER PRICE AD AD2 CON

1 425 4.92 4.79 22.9441 1

2 467 5.50 3.61 13.0321 1

3 296 5.54 5.49 30.1401 1

4 626 5.11 2.78 7.7284 1

5 165 5.62 5.74 32.9476 1

6 515 5.24 1.34 1.7956 1

7 270 4.15 5.81 33.7561 1

8 689 4.02 3.39 11.4921 1

9 413 5.77 3.74 13.9876 1

10 561 4.57 3.59 12.8881 1

11 307 5.67 5.19 26.9361 1

12 508 5.92 3.27 10.6929 1

13 299 5.97 4.69 21.9961 1

14 531 5.59 3.79 14.3641 1

15 445 5.50 4.29 18.4041 1

16 412 5.86 2.71 7.3441 1

17 845 4.09 2.21 4.8841 1

18 471 5.08 3.09 9.5481 1

19 439 5.36 4.65 21.6225 1

20 520 5.22 1.97 3.8809 1

Parameter Standard T for H0:

Variable DF Estimate Error Parameter=0 Prob > |T|

INTERCEP 1 1035.383898 154.98855053 6.680 0.0001

PRICE 1 -127.924996 24.12982087 -5.302 0.0001

AD 1 162.360399 65.19387129 2.490 0.0241

AD2 1 -32.685857 8.52154015 -3.836 0.0015


Covariance of Estimates

COVB INTERCEP PRICE AD AD2

INTERCEP 24021.450794 -2509.31243 -5966.682871 750.59759468

PRICE -2509.31243 582.24825526 -277.1356729 32.229888596

AD -5966.682871 -277.1356729 4250.240854 -547.0408592

AD2 750.59759468 32.229888596 -547.0408592 72.616646567

Consistent Covariance of Estimates

ACOV INTERCEP PRICE AD AD2

INTERCEP 32017.432715 -4435.20003 -4198.419527 511.52763284

PRICE -4435.20003 689.00724899 363.74452602 -43.48959035

AD -4198.419527 363.74452602 1432.0284844 -198.6088433

AD2 511.52763284 -43.48959035 -198.6088433 29.364321064

[proc plot output: RESIDUAL versus Predicted Value of NUMBER]

[proc plot output: RESIDUAL versus PRICE]

[proc plot output: RESIDUAL versus AD]


                          BHAT        MSE

least squares estimation  1035.3839   3880.8309

                          -127.925

                          162.3604

                          -32.68586

XHAT

estimated optimal price is 5.3221064

XHAT R1 R2

estimated value 5.3221064 Fieller Interval 4.7615724 to 6.422214

SE

approximate standard error of xhat 0.4246941

LOW UP

Delta Method Interval 4.58064 to 6.0635729

** BOOTSTRAP ANALYSIS

Variable N Mean Std Dev Minimum Maximum

----------------------------------------------------------------------

B0 1000 1038.10 157.7894603 543.7993800 1541.51

B1 1000 -128.2003186 23.8604919 -209.8480000 -53.1294500

B2 1000 161.8234649 65.7316481 -39.3363500 369.7672600

B3 1000 -32.6224384 8.5118622 -60.3828600 -6.6781000

Variable=XHATB

Mean 5.405612 Sum 5405.612

Std Dev 0.501297 Variance 0.251299

100% Max 8.496416 99% 7.146886

75% Q3 5.635321 95% 6.28675

50% Med 5.314994 90% 6.022039

25% Q1 5.072022 10% 4.893085

0% Min 4.471948 5% 4.794123

Histogram # Boxplot

8.5+* 2 *

.

8.1+* 2 *

.

7.7+

.

7.3+* 5 0

.* 4 0

6.9+** 7 0

.** 6 0

6.5+*** 12 0

.**** 19 |

6.1+********** 49 |

.************** 68 |

5.7+******************** 96 +-----+

.****************************** 150 | + |

5.3+************************************** 186 *-----*

.******************************************* 211 +-----+

4.9+*************************** 131 |

.********* 44 |

4.5+** 8 |

----+----+----+----+----+----+----+----+---


title 'hamburger example ';

options pagesize=60 linesize=80;

filename bb 'boot.out';

data a;

infile 'ham.dat';

input number price ad;

ad2=ad*ad;

con=1.0;

proc print;

run;

proc reg;

model number = price ad ad2/covb acov;

output out=result r=resid p = yhat;

run;

proc plot data=result vpercent=30;

plot resid*yhat;

plot resid*price;

plot resid*ad;

run;

proc iml;

use a;

read all var {number} into ynew;

read all var {con price ad ad2} into xnew;

close a;

yb=ynew;

xb=xnew;

xpxinv=inv(t(xnew)*xnew);

bhat=xpxinv*(t(xnew)*ynew);

yhat=xnew*bhat;

r=ynew-yhat;

sse=ssq(r);

df=nrow(xnew)-ncol(xnew);

mse=sse/df;

n=nrow(xnew);

covb=mse*xpxinv;

print 'least squares estimation ' bhat mse;

/* estimate the price at which profit is maximized, with cost c per unit

and overall advertising expenditure of 280 rubles (a = 2.8) */

c = 1; /* cost of production per unit */

tota = 2.8; /* total advertising cost */

l1 = J(4,1,0);

l1[1]=-1;

l1[2]=c;

l1[3]=-tota;

l1[4] = -(tota*tota);

t1hat = (t(l1)*bhat);

l2={0 , 2, 0, 0};

t2hat = t(l2)*bhat;

xhat = t1hat/t2hat;
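/* t1hat = t(l1)*bhat = b1*c - b0 - b2*a - b3*a^2 is the numerator of theta
and t2hat = t(l2)*bhat = 2*b1 the denominator, so xhat = t1hat/t2hat
estimates the optimal price theta */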

print 'estimated optimal price is' xhat;

tval = tinv(.95,df); /* upper 5% t percentile, used for 90% confidence intervals */

sigma11= t(l1)*covb*l1;

sigma22= t(l2)*covb*l2;


sigma12=t(l1)*covb*l2;
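/* Fieller: the confidence set is the set of x with
(t1hat - x*t2hat)**2 <= (tval**2)*(sigma11 - 2*x*sigma12 + (x**2)*sigma22),
i.e. f2*x**2 - 2*f1*x + f0 <= 0 with f0, f1, f2 defined below */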

f0=t1hat**2-(tval**2)*sigma11;

f1=t1hat*t2hat-(tval**2)*sigma12;

f2=t2hat**2-(tval**2)*sigma22;

/* the set is a finite interval only when f2 > 0, so check before taking roots */

if f2<0 then return;

D=f1**2-f0*f2;

r1=(f1-sqrt(D))/f2;

r2=(f1+sqrt(D))/f2;

print ' estimated value ' xhat 'Fieller Interval ' r1 ' to ' r2 ;

/* GET INTERVAL VIA DELTA METHOD */
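/* the delta method approximates Var(t1hat/t2hat) by
(sigma11 + (xhat**2)*sigma22 - 2*xhat*sigma12)/(t2hat**2),
the variance of (t1hat - xhat*t2hat)/t2hat with xhat treated as fixed */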

se = sqrt((sigma11 + (xhat**2)*sigma22 - (2*xhat*sigma12))/(t2hat**2));

low = xhat-(tval*se);

up=xhat+(tval*se);

print ' approximate standard error of xhat' se;

print 'Delta Method Interval ' low ' to ' up ;

/* RUN BOOTSTRAP */

nboot=1000;

do j=1 to nboot;

do i= 1 to n;

uv=uniform(0);

k=int(uv*n+1);

yb[i]=yhat[i] + (sqrt(n/df))*r[k];
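/* as in the esterase program, residuals are inflated by sqrt(n/df),
here df = n - 4, so their average variance matches sigma^2 */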

end;

xpxinvb=inv(t(xb)*xb);

bhatb=xpxinvb*(t(xb)*yb);

yhatb=xb*bhatb;

rb=yb-yhatb;

sseb=ssq(rb);

mseb=sseb/df;

b0=bhatb[1];

b1=bhatb[2];

b2=bhatb[3];

b3=bhatb[4];

t1hatb = t(l1)*bhatb;


t2hatb = t(l2)*bhatb;

xhatb = t1hatb/t2hatb;

file bb;

put b0 +1 b1 +1 b2 +1 b3 +1 xhatb;

end;

quit;

run;

data new;

infile 'boot.out';

input b0 b1 b2 b3 xhatb;

run;

proc means;

var b0 b1 b2 b3;

run;

proc univariate plot;

var xhatb;

run;