
Lecture on Bootstrap

Yu-Chin Hsu (許育進)

Academia Sinica

December 4, 2014


This lecture is based on Xiaoxia Shi’s lecture note and my understanding of the bootstrap.

This is an introductory lecture to the bootstrap method, in that I won’t provide any proofs.

For further reading, please see Horowitz, J.L. (2001), “The Bootstrap,” in J.J. Heckman and E. Leamer, eds., Handbook of Econometrics, vol. 5, Elsevier Science B.V., pp. 3159–3228, and the references therein.

Xiaoxia Shi’s lecture note is available at http://www.ssc.wisc.edu/~xshi/econ715/Lecture_10_bootstrap.pdf


Introduction

Let W = (W1, . . . , Wn) denote an i.i.d. sample with distribution function F.

Let θ0 be the parameter of interest and θn(W) denote the estimator based on W.

For example, θ0 can be the mean of W, E[W], and θn(W) = (1/n) ∑i Wi.

To make inference or to construct a confidence interval (CI) for θ0, in general we need to know the exact (or limiting) distribution of √n(θn − θ0).

Most of the time, √n(θn − θ0) →d N(0, σ²).

Then, given the availability of a consistent estimator σn for σ, we can make inference and construct CIs.


Introduction (cont’d)

However, sometimes (1) the exact form of σ is hard to obtain, or (2) it is very complicated to construct a consistent estimator σn for σ.

(1) can happen when the derivatives of the objective function of a maximum likelihood model are complicated.

(2) can happen when σ involves nonparametric components. For example, suppose θ0 is the median of F and θn(W) is the sample median. Then σ² will be 1/(4f(θ0)²), where f(·) is the pdf of F. To estimate σ, one then needs f(θn), which is a nonparametric estimator.

Therefore, it is hard to make inference and to construct a CI for θ0.

The bootstrap can serve as an alternative method for this purpose.


What is bootstrap?

Shi: “Bootstrap is an alternative to asymptotic approximation for carrying out inference. The idea is to mimic the variation from drawing different samples from a population by the variation from redrawing samples from a sample.”

Horowitz: “The bootstrap is a method for estimating the distribution of an estimator or test statistic by resampling one’s data or a model estimated from the data.”

Shi: “The name comes from the common English phrase “bootstrap”, which alludes to “pulling oneself over the fence by pulling on one’s own bootstrap”, and means solving a problem without external help.”


How does bootstrap work?

Let W∗ = (W∗1, . . . , W∗n) denote an i.i.d. sample with distribution function F∗.

Let θ∗0 be the parameter under F∗ and θn(W∗) denote the estimator based on W∗.

Basic Idea:

When F∗ is close to F, the distribution of √n(θ∗n − θ∗0) should be close to that of √n(θn − θ0).

Therefore, if we can find an F∗ (known to us) that is close to F (unknown), then we can approximate the distribution of √n(θn − θ0) (unknown) by that of √n(θ∗n − θ∗0) (known).

A natural choice of F∗ is the empirical cdf Fn, since we can show that Fn is consistent for F when the sample size is large enough. This leads to the “nonparametric bootstrap”.
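To make this concrete, here is a minimal sketch of a single nonparametric bootstrap draw in Python (NumPy assumed; the data `w`, the sample size, and the seed are illustrative placeholders, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, illustration only

w = rng.normal(size=100)                # placeholder i.i.d. sample W_1, ..., W_n
n = len(w)

# Drawing from the empirical cdf F_n is the same as resampling the
# observed points with replacement, each with probability 1/n.
w_star = rng.choice(w, size=n, replace=True)

theta_n = w.mean()                      # estimator on the original sample
theta_star = w_star.mean()              # the same estimator on the bootstrap sample
```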


Confidence Interval

What is the concept of a confidence interval (from a frequentist point of view)?

Suppose √n(θn − θ0) →d N(0, σ²) and σn² →p σ².

Then the two-sided 95% confidence interval for θ0 is (θn − 1.96σn/√n, θn + 1.96σn/√n).

Why?

P(−1.96 < √n(θn − θ0)/σn < 1.96) ≈ 95%

⇒ P(−1.96σn < √n(θn − θ0) < 1.96σn) ≈ 95%

⇒ P(−1.96σn/√n < θn − θ0 < 1.96σn/√n) ≈ 95%

⇒ P(−1.96σn/√n < θ0 − θn < 1.96σn/√n) ≈ 95%

⇒ P(θn − 1.96σn/√n < θ0 < θn + 1.96σn/√n) ≈ 95%


Bootstrap Confidence Interval

Suppose that the limiting distribution of √n(θ∗n − θn) is close to that of √n(θn − θ0).

Let’s pretend for now that we know the 2.5% and 97.5% quantiles of √n(θ∗n − θn), denoted q∗2.5 and q∗97.5.

Then the 95% CI is (θn − q∗97.5/√n, θn − q∗2.5/√n).

Note that

P(q∗2.5 < √n(θ∗n − θn) < q∗97.5) = 95%

⇒ P(q∗2.5 < √n(θn − θ0) < q∗97.5) ≈ 95%

⇒ P(q∗2.5/√n < θn − θ0 < q∗97.5/√n) ≈ 95%

⇒ P(−q∗97.5/√n < θ0 − θn < −q∗2.5/√n) ≈ 95%

⇒ P(θn − q∗97.5/√n < θ0 < θn − q∗2.5/√n) ≈ 95%


Remarks:

In this example, we know that the limiting distribution of √n(θn − θ0) is symmetric. Let α∗95 be such that

P(√n|θ∗n − θn| < α∗95) = 95%.

Then P(θn − α∗95/√n < θ0 < θn + α∗95/√n) ≈ 95%.

Both CI’s are asymptotically valid.

In general, the second one can have a higher-order improvement, in that the coverage rate of this CI converges to 95% at a faster rate than that of the first one. (Why?)

In general, if the finite-sample distribution of √n(θn − θ0) is known to be skewed, then the first one might be the better one to use.


How to obtain those quantiles?

So far, we have pretended that q∗2.5, q∗97.5, and α∗95 are known.

√n(θ∗n − θn) is known, as we pointed out, because we know the W∗i are drawn from Fn, the empirical CDF.

Of course, the closed form of the CDF of √n(θ∗n − θn) is still hard to get!

This is where the well-known bootstrap simulations come into play.


How to obtain those quantiles? (Cont’d)

We know that the W∗i’s are drawn from Fn, which is equivalent to randomly drawing each W∗i from {W1, . . . , Wn} with probability 1/n.

Therefore, a bootstrap sample {W∗1, . . . , W∗n} is formed by n random draws with replacement.

This step can be done by a computer: generate U[0, 1] random variables; letting u be a realization, set the index to k if (k − 1)/n < u ≤ k/n. (See the sketch below.)
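A sketch of this inverse-cdf draw (the direct shortcut with a discrete uniform is equivalent; `n` and the seed are placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100

u = rng.uniform()                       # a realization u of U[0, 1]
k = max(1, int(np.ceil(n * u)))         # index k with (k - 1)/n < u <= k/n
# Equivalent shortcut: k = rng.integers(1, n + 1)
```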


Bootstrap simulations:

1. Use a computer to draw {W∗1,b, . . . , W∗n,b} for b = 1, . . . , B and obtain θ∗b.

2. The distribution of √n(θ∗n − θn) can then be further approximated by the empirical distribution of √n(θ∗b − θn), b = 1, . . . , B.

3. Rank the √n(θ∗(b) − θn) in ascending order, so that √n(θ∗(1) − θn) ≤ √n(θ∗(2) − θn) ≤ . . . ≤ √n(θ∗(B) − θn).

4. q∗2.5 and q∗97.5 can be approximated by q∗2.5 = √n(θ∗(⌊0.025B⌋) − θn) and q∗97.5 = √n(θ∗(⌊0.975B⌋) − θn), respectively, where ⌊c⌋ denotes the largest integer a such that a ≤ c.

5. That is, if B = 1000, then q∗2.5 = √n(θ∗(25) − θn) and q∗97.5 = √n(θ∗(975) − θn).

6. α∗95 is defined similarly, except that the ranking is based on √n|θ∗(b) − θn|. (See the sketch below.)
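Putting steps 1–6 together, here is a minimal sketch for the sample mean, following the B = 1000 convention of the slides (the data `w` and the seed are placeholders):

```python
import numpy as np

rng = np.random.default_rng(3)
w = rng.exponential(size=200)           # placeholder sample
n, B = len(w), 1000

theta_n = w.mean()
theta_star = np.array([rng.choice(w, size=n, replace=True).mean()
                       for _ in range(B)])

roots = np.sort(np.sqrt(n) * (theta_star - theta_n))   # sorted sqrt(n)(theta*_(b) - theta_n)

q_025 = roots[int(np.floor(0.025 * B)) - 1]   # 25th order statistic (0-based index 24)
q_975 = roots[int(np.floor(0.975 * B)) - 1]   # 975th order statistic (0-based index 974)

# Equal-tailed 95% CI: (theta_n - q*_97.5/sqrt(n), theta_n - q*_2.5/sqrt(n))
ci_equal = (theta_n - q_975 / np.sqrt(n), theta_n - q_025 / np.sqrt(n))

# Symmetric 95% CI based on sqrt(n)|theta*_b - theta_n|
alpha_95 = np.sort(np.abs(np.sqrt(n) * (theta_star - theta_n)))[int(np.floor(0.95 * B)) - 1]
ci_symm = (theta_n - alpha_95 / np.sqrt(n), theta_n + alpha_95 / np.sqrt(n))
```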


How to obtain those quantiles? (Cont’d)

Note that this approximation can be as accurate as you please by setting B large enough.

When B is too large, it might take too much time to compute. Therefore, there is a trade-off between accuracy and time.

In general, with B = 700 ∼ 1000, the approximation can be good.

Note that q∗97.5 = √n(θ∗(975) − θn). Therefore, the lower bound of the CI can be rewritten as

θn − q∗97.5/√n = θn − (θ∗(975) − θn) = 2θn − θ∗(975).

Similarly, the upper bound can be rewritten as θn − (θ∗(25) − θn) = 2θn − θ∗(25).


Hypothesis testing

Let Wi ∼ N(µ, 1). We want to test H0 : µ = 1 vs. H1 : µ ≠ 1 at the 5% significance level.

Test statistic: √n(µn − 1), where µn is the sample average.

We would reject H0 when |√n(µn − 1)| > 1.96.

Under H0 : µ = 1, we will falsely reject the null hypothesis 5% of the time.

Under H1 : µ ≠ 1, we will reject the null hypothesis with probability 1 asymptotically (as n → ∞).


A wrong bootstrap procedure!

The following procedure is WRONG!

1. Generate bootstrap samples {W∗1,b, . . . , W∗n,b} for b = 1, . . . , B, say B = 1000.

2. Calculate √n(µ∗b − 1) and obtain q∗2.5 = √n(µ∗(25) − 1) and q∗97.5 = √n(µ∗(975) − 1).

3. Reject H0 when √n(µn − 1) < q∗2.5 or √n(µn − 1) > q∗97.5.

To see why:


Note that

P(√n(µn − 1) < q∗2.5)

= P(√n(µn − 1) < √n(µ∗(25) − 1))

= P(0 < √n(µ∗(25) − µn)) → 0.

Similarly,

P(√n(µn − 1) > q∗97.5)

= P(√n(µn − 1) > √n(µ∗(975) − 1))

= P(0 > √n(µ∗(975) − µn)) → 0.

(The last probabilities vanish because √n(µ∗(25) − µn) and √n(µ∗(975) − µn) approximate the 2.5% and 97.5% quantiles of the limiting normal distribution, which are negative and positive, respectively.)

Note that the previous two results hold no matter what the true parameter is.

Therefore, whether under the null or under the alternative, the size and the power of such a test are zero.


The right way to do it!

The following procedure is correct!

1. Generate bootstrap samples {W∗1,b, . . . , W∗n,b} for b = 1, . . . , B, say B = 1000.

2. Calculate √n(µ∗b − µn) and obtain q∗2.5 = √n(µ∗(25) − µn) and q∗97.5 = √n(µ∗(975) − µn).

3. Reject H0 when √n(µn − 1) < q∗2.5 or √n(µn − 1) > q∗97.5.

Why is this a valid procedure? (A sketch of the procedure in code follows below.)
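Here is a sketch of the correct procedure, with the crucial recentering at µn rather than at the hypothesized value (the data are placeholders, generated here under H0 so the rejection rate should be near 5%):

```python
import numpy as np

rng = np.random.default_rng(4)
w = rng.normal(loc=1.0, scale=1.0, size=200)   # placeholder sample; H0: mu = 1 holds
n, B = len(w), 1000

mu_n = w.mean()
t_stat = np.sqrt(n) * (mu_n - 1.0)             # test statistic for H0: mu = 1

# Approximate the *null* distribution: recenter the bootstrap root at mu_n, not at 1.
roots = np.sort([np.sqrt(n) * (rng.choice(w, size=n, replace=True).mean() - mu_n)
                 for _ in range(B)])

q_025, q_975 = roots[24], roots[974]           # 2.5% and 97.5% quantiles for B = 1000
reject = (t_stat < q_025) or (t_stat > q_975)
```

Replacing `mu_n` with `1.0` inside the bootstrap loop reproduces the wrong procedure from the previous slide.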


Under the null hypothesis H0 : µ = 1,

P(√n(µn − 1) < q∗2.5 or √n(µn − 1) > q∗97.5)

= 1 − P(q∗2.5 < √n(µn − 1) < q∗97.5)

= 1 − P(q∗2.5 < √n(µn − µ) < q∗97.5) ≈ 0.05.

Under the alternative H1 : µ ≠ 1, we have √n(µn − 1) → ±∞. Also, q∗2.5 and q∗97.5 are bounded in probability. Therefore,

P(√n(µn − 1) < q∗2.5 or √n(µn − 1) > q∗97.5) → 1.


Remarks

The key is to approximate the “null distribution”, no matter whether we are under the null or under the alternative.

In this case, the null distribution is that of √n(µn − µ), no matter what the value of the true parameter is.

Therefore, we cannot just plug the value that we want to test into the bootstrap repetitions.


Other uses of Bootstrap

Standard Error

We can use the bootstrap method to approximate the asymptotic standard error of an estimator, σ.

As we mentioned, when constructing CIs or conducting hypothesis tests, we need a consistent estimator for σ.

We can use the bootstrap to obtain a consistent estimator:

σ∗n² = (n/B) ∑b=1,…,B (θ∗b − θ̄∗)²,

where θ̄∗ is the sample average of the θ∗b’s. (The factor n is there because σ² is the variance of the limit of √n(θn − θ0).)

Then we can replace σ with σ∗n in the previous cases.

Shi’s remark: To use the bootstrap for standard errors, the estimator under consideration must be asymptotically normal. Otherwise, the use of a standard error is itself misguided.
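A sketch for the sample median, the case where the plug-in estimator of σ was awkward (NumPy assumed; the data, B, and the n-scaling convention matching σ² = Var of the limit of √n(θn − θ0) are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
w = rng.normal(size=200)                # placeholder sample
n, B = len(w), 1000

theta_star = np.array([np.median(rng.choice(w, size=n, replace=True))
                       for _ in range(B)])

# Bootstrap estimate of sigma^2, the asymptotic variance of sqrt(n)(theta_n - theta_0)
sigma2_star = n * np.mean((theta_star - theta_star.mean()) ** 2)

se_theta_n = np.sqrt(sigma2_star / n)   # implied standard error of theta_n itself
```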


Other uses of Bootstrap

Bias Correction

We can use the bootstrap to correct the bias of an estimator.

The exact bias is Bias(θn, θ0) = E[θn] − θ0 and is unknown.

The bootstrap estimator of the bias is

Bias∗(θn, θ0) = (1/B) ∑b=1,…,B θ∗b − θn.

Then the bootstrap bias-corrected estimator for θ0 is

θBC,n = θn − Bias∗(θn, θ0) = 2θn − (1/B) ∑b=1,…,B θ∗b.

Shi’s remark: Bias correction usually increases the variance because the bias is estimated. (This causes a trade-off between bias and variance.) Therefore it should not be used indiscriminately.
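A sketch using the divide-by-n sample variance, whose bias is known in closed form and so makes a convenient check (the data and seed are placeholders):

```python
import numpy as np

rng = np.random.default_rng(6)
w = rng.normal(size=50)                    # placeholder sample
n, B = len(w), 1000

theta_n = w.var()                          # biased estimator: divides by n, bias -sigma^2/n
theta_star = np.array([rng.choice(w, size=n, replace=True).var()
                       for _ in range(B)])

bias_star = theta_star.mean() - theta_n    # bootstrap bias estimate
theta_bc = theta_n - bias_star             # = 2*theta_n - mean(theta*_b)
```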


Higher-order improvements of the Bootstrap

This part is rather theoretical, so we will skip it. Please see Shi’s note for more discussion.


Bootstrap for Regression Models

The regression model we consider is

Yi = X′iβ + Ui, for i = 1, . . . , n,

where Wi = (Yi, X′i)′ is i.i.d. with distribution function F.

Let βn denote the OLS estimator for β, so that

βn = ((1/n) ∑i XiX′i)⁻¹ (1/n) ∑i XiYi.

Under regularity conditions, √n(βn − β) →d N(0, V), where

V = E[XX′]⁻¹ E[U²XX′] E[XX′]⁻¹.


Bootstrap for Regression Models (Cont’d)

The nonparametric bootstrap works here.

Bootstrap samples are formed from the pairs Wi = (Yi, X′i)′.

That is, a bootstrap sample {(Y∗1, X∗1), . . . , (Y∗n, X∗n)} is a random sample with replacement from {(Y1, X1), . . . , (Yn, Xn)}.

β∗n is calculated in the same way as βn:

β∗n = ((1/n) ∑i X∗iX∗i′)⁻¹ (1/n) ∑i X∗iY∗i.

Then, results similar to those we discussed before would hold in this case under suitable conditions.
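A sketch of this pairs bootstrap for OLS (the design, coefficients, and error distribution are placeholders):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
x = np.column_stack([np.ones(n), rng.normal(size=n)])   # regressors incl. intercept
y = x @ np.array([1.0, 2.0]) + rng.normal(size=n)       # placeholder DGP

beta_n = np.linalg.solve(x.T @ x, x.T @ y)              # OLS on the original sample

B = 1000
beta_star = np.empty((B, x.shape[1]))
for b in range(B):
    idx = rng.integers(0, n, size=n)       # resample the pairs (Y_i, X_i) with replacement
    xb, yb = x[idx], y[idx]
    beta_star[b] = np.linalg.solve(xb.T @ xb, xb.T @ yb)

se = beta_star.std(axis=0, ddof=1)         # e.g., bootstrap standard errors
```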


Wild Bootstrap for Regression Models

In OLS, we have Yi = X′iβn + ei, where the ei’s are the residuals.

Let the Ub,i’s denote i.i.d. pseudo-random variables with mean 0 and variance 1.

Let the b-th bootstrap sample be

Y∗b,i = X′iβn + ei · Ub,i,

and the regressors are the Xi’s.

Then β∗b is

β∗b = ((1/n) ∑i XiX′i)⁻¹ (1/n) ∑i XiY∗b,i = βn + ((1/n) ∑i XiX′i)⁻¹ (1/n) ∑i Xi ei Ub,i.

Then we can show that √n(β∗b − βn) can approximate √n(βn − β) well.

This is a residual-based bootstrap, and it only works for OLS.
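A sketch of the wild bootstrap; the slides only require the weights to have mean 0 and variance 1, so the Rademacher weights below (±1 with probability 1/2 each) are one common choice, not the only one (data are placeholders):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200
x = np.column_stack([np.ones(n), rng.normal(size=n)])
u = rng.normal(size=n) * (1.0 + np.abs(x[:, 1]))     # placeholder heteroskedastic errors
y = x @ np.array([1.0, 2.0]) + u

beta_n = np.linalg.solve(x.T @ x, x.T @ y)
resid = y - x @ beta_n                               # OLS residuals e_i

B = 1000
beta_star = np.empty((B, x.shape[1]))
for b in range(B):
    v = rng.choice([-1.0, 1.0], size=n)              # Rademacher weights: mean 0, variance 1
    y_star = x @ beta_n + resid * v                  # Y*_{b,i} = X_i' beta_n + e_i * U_{b,i}
    beta_star[b] = np.linalg.solve(x.T @ x, x.T @ y_star)
```

Note that the regressors are held fixed across bootstrap repetitions; only the errors are perturbed.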


Bootstrap method for weakly dependent data

The bootstrap methods we discussed above only work in an i.i.d. framework.

For weakly dependent data, the dependence among observations plays an important role in the asymptotics.

Applying the nonparametric bootstrap above will not work, because it breaks down the dependence.

We need a method that can mimic the dependence structure.


Blockwise Bootstrap

Instead of resampling one observation at a time, we resample a group of observations together.

To be specific, let the block size be k and the sample size be T. Then we have T − k + 1 overlapping blocks:

(W1,W2, . . . ,Wk)

(W2,W3, . . . ,Wk+1)

...

(WT−k+1, . . . ,WT ).

To form a bootstrap sample (see the sketch below):

1. randomly select m blocks (with replacement) such that m · k ≥ T and (m − 1) · k < T;

2. lay them end-to-end in the order sampled;

3. drop the last m · k − T observations from the last sampled block, so that the bootstrap sample size equals T.

Blockwise Bootstrap (Cont’d)

For this method to work asymptotically, we require the block size k → ∞ but k/T → 0 at a suitable rate.

Why would this method work? Why k → ∞ but k/T → 0?

Intuition: each block preserves the dependence structure within it, so growing blocks (k → ∞) capture more of the dependence; at the same time, k/T → 0 guarantees many roughly independent blocks, so the resampled series has enough variation to work with.


Conclusion!
