Transcript of Statistical Data Analysis 2011/2012, M. de Gunst, Lecture 3 (32 slides).

Page 1

Statistical Data Analysis
2011/2012
M. de Gunst
Lecture 3

Page 2

Statistical Data Analysis: Introduction

Topics:
- Summarizing data
- Exploring distributions (continued)
- Bootstrap (first part)
- Robust methods
- Nonparametric tests
- Analysis of categorical data
- Multiple linear regression

Page 3

Today's topics:
- Exploring distributions (Chapter 3: 3.5.2, 3.5.3)
- Bootstrap (Chapter 4: 4.1, 4.2)

Exploring distributions
3.5. Tests for goodness of fit
  3.5.1. Shapiro-Wilk test for normal distribution (last week)
  3.5.2. Kolmogorov-Smirnov test for general distribution
  3.5.3. Chi-square test for goodness of fit for general distribution

Bootstrap
4.1. Simulation (read yourself)
4.2. Bootstrap estimators for distribution
  - Parametric bootstrap estimators
  - Empirical bootstrap estimators

Page 4

3.5. Exploring distributions: reminder

Testing
Ingredients of a test?
- Hypotheses H0 and H1
- Test statistic T
- Distribution of T under H0, and knowledge of how it is changed/shifted under H1
- Rule for when H0 will be rejected:

Rejection rule either based on critical region or on p-value

How to perform a test?
1. Describe the above
2. Choose significance level α
3. Calculate and report the value t of T
4. Report whether t is in the critical region, or whether p-value < α
5. Formulate the conclusion of the test: "H0 rejected" or "H0 not rejected"
6. If possible, translate the conclusion to the practical context

NB. When asked to perform a test, you have to do all 6 steps!

Page 5

Tests for goodness of fit: for (one) general distribution

Situation: x1, …, xn independent realizations from unknown distribution F.

Now: H0: F = F0, with F0 one specific distribution; H1: F ≠ F0.

Which statistic gives information about distribution F?

Page 6

3.5.2. Kolmogorov-Smirnov test (1)

X1, …, Xn independent realizations from unknown distribution F.

Idea: use the empirical distribution function F̂n, with F̂n(x) = (number of Xi ≤ x)/n.

Makes sense: nF̂n(x) is a r.v., nF̂n(x) ~ binom(n, F(x)),

so that for n → ∞, F̂n(x) → F(x) (law of large numbers).

Then also under H0, for n → ∞, F̂n(x) → F0(x).

Base the test on the distance between F̂n and F0.
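This comparison is easy to make visually in R (a minimal sketch; x stands for any numeric data vector and the standard normal plays the role of F0):

R:
> Fn = ecdf(x)                  # empirical distribution function F̂n
> plot(Fn)                      # F̂n as a step function
> curve(pnorm(x), add = TRUE)   # hypothesized F0 = N(0,1) drawn on top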

Page 7

Kolmogorov-Smirnov test (2)

Test statistic: Dn = sup_x |F̂n(x) − F0(x)|.

Distribution of Dn under H0: the same for all continuous F0:
- Dn is distribution free over the class of continuous distribution functions
- the K-S test is a nonparametric test

Because for continuous F0 the variables F0(X1), …, F0(Xn) are under H0 uniformly distributed on (0,1), so that Dn has the same distribution as the K-S distance for a uniform sample, whatever F0 is.

When is H0 rejected? For large values of Dn
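Dn can also be computed directly from the order statistics, since the supremum is attained at a jump of F̂n (a sketch of what ks.test does internally; F0 = N(0,1) for illustration):

R:
> n = length(x)
> u = pnorm(sort(x))                             # F0 at the order statistics
> Dn = max(pmax((1:n)/n - u, u - (0:(n-1))/n))   # sup_x |F̂n(x) - F0(x)|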

Page 8

Kolmogorov-Smirnov test (3)

Test statistic: Dn = sup_x |F̂n(x) − F0(x)|.

p-values from tables or computer package.

Note: the standard K-S test with these p-values is not suitable for a composite H0. Then use an adjusted K-S test with adjusted p-values.

Example: for "H0: F is normal" the adjusted test statistic for the K-S test is

Dn_adj = sup_x |F̂n(x) − Φ((x − X̄)/S)|,

with X̄ the sample mean and S the sample standard deviation.

What is the difference? Additional stochasticity: the hypothesized distribution function itself now depends on the data.

Page 9

Kolmogorov-Smirnov test (4)

Example (data x). H0: F = N(0,1); H1: F ≠ N(0,1).

Test statistic: Dn.

R:
> ks.test(x, pnorm)

        One-sample Kolmogorov-Smirnov test

data:  x
D = 0.1163, p-value = 0.4735
alternative hypothesis: two-sided

H0 rejected?


Page 10

Kolmogorov-Smirnov test (5)

Example (data y).
H0: F is normal ← composite null hypothesis
H1: F is not normal

Test statistic: Dn_adj.

R:
> ks.test(y, pnorm)
D = 0.6922, p-value = 6.661e-16

> ks.test(y, pnorm, mean = mean(y), sd = sd(y))
D = 0.1081, p-value = 0.5655
> mean(y)
[1] 3.62158
> sd(y)
[1] 3.043356

Correct? No, incorrect: the first call is a test for H0: F = N(0,1) vs H1: F ≠ N(0,1).

Correct? No, incorrect: the second call is a test for H0: F = N(3.62158, (3.043356)²) vs H1: F ≠ N(3.62158, (3.043356)²).

We have not used Dn_adj! The p-value should be 0.126 (next week); a preview is sketched below.
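How that adjusted p-value can be obtained: simulate the distribution of Dn_adj under the fitted normal, re-estimating the parameters for each simulated sample (a sketch in the spirit of the parametric bootstrap introduced later in this lecture; next week gives the details):

R:
> d.obs = ks.test(y, pnorm, mean = mean(y), sd = sd(y))$statistic
> B = 1000
> dstar = numeric(B)
> for (b in 1:B) {
+   ystar = rnorm(length(y), mean(y), sd(y))    # sample under the fitted H0
+   dstar[b] = ks.test(ystar, pnorm, mean = mean(ystar), sd = sd(ystar))$statistic
+ }
> mean(dstar >= d.obs)                          # simulated p-value for Dn_adj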

Page 11

3.5.3. Chi-square test for goodness of fit (1)

X1, …, Xn independent realizations from unknown distribution F.

Idea: use the empirical distribution in a different way: divide the real line into intervals I1, …, Ik and compare the numbers of data points in the intervals with the expected numbers in the intervals under H0.

Ni = number of observations in Ii
pi = probability of an observation in Ii under F0
Then n·pi = expected number of observations in Ii under H0.

Test statistic: X² = Σ_{i=1}^k (Ni − n·pi)² / (n·pi).

Page 12

Chi-square test for goodness of fit (2)

Test statistic: X² = Σ_{i=1}^k (Ni − n·pi)² / (n·pi).

Distribution of X² under H0: different for different F0, but for n → ∞ the distribution of X² under H0 is chi-square with k−1 df, the same for all F0.

For large enough n, X² is (approximately) distribution free, so the chi-square test is a nonparametric test.

When is H0 rejected? For large values of X2

Page 13

Chi-square test for goodness of fit (3)

Test statistic: X² = Σ_{i=1}^k (Ni − n·pi)² / (n·pi).

How to choose intervals I1, … ,Ik ?

How many?
- More is better, but not too many.
- Rule of thumb: at least 5 observations expected in each interval under H0.

Where?
- About the same number expected in each interval under H0; see the sketch below.
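One way to get about the same expected number in each interval is to put the breaks at quantiles of F0 (a sketch for the example on the next slide, where F0 = N(4,9) and k = 8):

R:
> k = 8
> b = qnorm((0:k)/k, mean = 4, sd = 3)   # breaks at the j/k quantiles of F0
# each interval (b[j], b[j+1]] now has probability 1/k under F0,
# so each has expected count n/k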

Page 14

Chi-square test for goodness of fit (4)

Example (data y). H0: F = N(4,9); H1: F ≠ N(4,9).

Test statistic: X².

R:
> chisquare(y, pnorm, k = 8, lb = 0, ub = 16, mean = 4, sd = 3)
$chisquare
[1] 13.11222
$pr
[1] 0.06942085
$N
 (0,2]  (2,4]  (4,6]  (6,8] (8,10] (10,12] (12,14] (14,16]
    14     13      9      4      5       0       1       0
$np
[1]  8 12 12  8  3  0  0  0

# Expected numbers under H0 do not satisfy the rule of thumb.
# Better: choose a suitable vector b of 'breaks':
> chisquare(y, pnorm, breaks = b, mean = 4, sd = 3)
# (chisquare is a local helper; a sketch of it follows below)
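chisquare is not a standard R function but a local helper, like the bootstrap function later in this lecture. A minimal sketch of what it might look like, assuming the argument names from the call above (the course's own version may differ in details):

R:
> chisquare = function(x, dist, k = 8, lb = min(x), ub = max(x), breaks = NULL, ...) {
+   # chi-square goodness-of-fit statistic for data x against the cdf dist
+   if (is.null(breaks)) breaks = seq(lb, ub, length = k + 1)
+   N = table(cut(x, breaks))            # observed counts per interval
+   p = diff(dist(breaks, ...))          # cell probabilities under H0
+   np = length(x) * p                   # expected counts under H0
+   X2 = sum((N - np)^2 / np)            # test statistic (assumes all np > 0)
+   pr = pchisq(X2, df = length(p) - 1, lower.tail = FALSE)
+   list(chisquare = X2, pr = pr, N = N, np = round(np))
+ }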

Page 15

Chi-square test for goodness of fit (5)

Test statistic: X² = Σ_{i=1}^k (Ni − n·pi)² / (n·pi), under H0 approximately χ²_{k−1}.

The standard chi-square test is not suitable for a composite H0. Then use an adjusted chi-square test with an adjusted chi-square distribution.

Example: for "H0: F is normal" the adjusted chi-square test statistic is X² with pi replaced by the estimated cell probabilities p̂i under N(μ̂, σ̂²); under H0 it is approximately χ²_{k−m−1}, with m the number of estimated parameters, but only for one specific type of estimators. A sketch follows below.
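With estimated parameters only the degrees of freedom change in the p-value computation (a sketch; x2 stands for the value of the adjusted statistic, m = 2 parameters are estimated for the normal family, and the estimators are assumed to be of the required type):

R:
> m = 2                                           # estimated parameters (mean, variance)
> pchisq(x2, df = k - m - 1, lower.tail = FALSE)  # adjusted p-value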

Page 16

Recap

Exploring distributions
3.5. Tests for goodness of fit
  3.5.2. Kolmogorov-Smirnov test for general distribution
  3.5.3. Chi-square test for goodness of fit for general distribution

Page 17

4. Bootstrap

Page 18

Bootstrap: Introduction (1)

Example. Data: 59 melting temperatures of beewax.

P = unknown true underlying distribution of the beewax data.
Estimator of the location of P? Tn = (sample) Mean.
Estimate of the location of P? tn = mean(beewax) = 63.589.

How accurate is the estimate? How good is the estimator? Distribution of Tn: broad/narrow?

Main question: how to estimate the unknown distribution of the estimator Tn. Notation: QP.

R:
> beewax
 [1] 63.78 63.34 63.36 63.51 ….
> mean(beewax)
[1] 63.58881
> sd(beewax)
[1] 0.3472209
> var(beewax)
[1] 0.1205624

Page 19

Bootstrap: Introduction (2) (continued)

Simple case: assume P = N(μ, σ²); Tn = (sample) Mean.

What is the distribution QP of Tn? We estimate: N(63.589, 0.121/59).

How did we find this?
i) Estimator of P: N((sample) Mean, (sample) Variance)
ii) Estimate: N(63.589, 0.121)
iii) QP is the distribution of the Mean of 59 independent observations from P
iv) Estimator of QP: N((sample) Mean, (sample) Variance/59)
v) Estimate of QP: N(63.589, 0.121/59)
In R these are one-liners; see below.
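A minimal sketch of steps ii) and v), using only functions already shown:

R:
> mean(beewax)                    # 63.58881: estimated mean
> var(beewax) / length(beewax)    # 0.1205624/59 ≈ 0.0020: estimated variance of the Mean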

R:
> beewax
 [1] 63.78 63.34 63.36 63.51 ….
> mean(beewax)
[1] 63.58881
> sd(beewax)
[1] 0.3472209
> var(beewax)
[1] 0.1205624
> length(beewax)
[1] 59

Page 20

Bootstrap: Introduction (3) (continued)

Other case: assume P = N(μ, σ²); now Tn = (sample) Median. What is the distribution QP of Tn?

How to proceed now?
i) Estimator of P: N((sample) Mean, (sample) Variance)
ii) Estimate: N(63.589, 0.121)
iii) QP is the distribution of the Median of 59 independent observations from P
iv) Estimator of QP: ?
v) Estimate of QP: ?

This is what the bootstrap is about: estimate the distribution QP of a function Tn of n independent observations from unknown P.

R:
> beewax
 [1] 63.78 63.34 63.36 63.51 ….
> mean(beewax)
[1] 63.58881
> sd(beewax)
[1] 0.3472209
> var(beewax)
[1] 0.1205624
> length(beewax)
[1] 59

Page 21

Bootstrap: Introduction (4) (continued)

Again another case: no assumption about P; Tn = (sample) Mean. What is the distribution QP of Tn?

How to proceed now?
i) Estimator of P: ?
ii) Estimate: ?
iii) QP is the distribution of the Mean of 59 independent observations from P
iv) Estimator of QP: ?
v) Estimate of QP: ?

This is what the bootstrap is about: estimate the distribution QP of a function Tn of n independent observations from unknown P.

R:
> beewax
 [1] 63.78 63.34 63.36 63.51 ….
> mean(beewax)
[1] 63.58881
> sd(beewax)
[1] 0.3472209
> var(beewax)
[1] 0.1205624
> length(beewax)
[1] 59

Page 22

4.2. Bootstrap estimators for a distribution

This is what the bootstrap is about: estimate the distribution QP of a function Tn of n independent observations from unknown P.

Situation: realizations x1, …, xn of X1, …, Xn, independent, with unknown distribution P.

Goal: estimate the distribution QP of an estimator Tn = Tn(X1, …, Xn).

Cases:
1. Assume P is some parametric distribution with unknown parameters
2. Assume nothing about P

Page 23

Bootstrap estimators for a distribution; Case 1: parametric bootstrap estimator (1)

Example (beewax; case 1)

Case 1: assume P = N(μ, σ²); Tn = (sample) Median. What is the distribution QP of Tn?

How to proceed?
i) Estimator of P: N(X̄, S²) = P̂
ii) Estimate: N(63.589, 0.121)
iii) QP is the distribution of the Median of 59 independent observations from P
iv) Estimator of QP: the distribution of the Median of 59 independent observations from N(X̄, S²) = P̂ (which distribution is this?)
v) Estimate of QP: the distribution of the Median of 59 independent observations from N(63.589, 0.121)

Unknown: use the computer to generate realizations from the estimate of QP.

The empirical distribution of the generated set is the parametric bootstrap estimate of QP.


Page 24

Bootstrap estimators for a distribution; Case 1: parametric bootstrap estimator (2)

Example (continued; case 1)

How to generate realizations from the estimate of QP, i.e. from the distribution of the Median of 59 independent observations from N(63.589, 0.121)?

# 1. Generate one bootstrap sample:
> xstar = rnorm(59, 63.589, sqrt(0.121))
# Check:
> xstar
 [1] 63.84819 62.88915 63.71705 64.06793 …..
[57] 63.56481 64.03403 63.75276
# Note: xstar is of the same length as beewax.

# 2. Now compute one bootstrap value tstar from xstar:
> tstar = median(xstar)
> tstar
[1] 63.70498

# 3. Do 1 and 2 B times. The B values tstar are generated realizations from the estimate of QP; see the sketch below.
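Steps 1 to 3 together (a minimal sketch; the choice B = 1000 is illustrative, any large B will do):

R:
> B = 1000
> tstar = numeric(B)
> for (b in 1:B) {
+   xstar = rnorm(59, 63.589, sqrt(0.121))   # step 1: sample from the estimate of P
+   tstar[b] = median(xstar)                 # step 2: one bootstrap value
+ }
> var(tstar)   # ≈ 0.0038, the parametric bootstrap variance estimate of the next slide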


Page 25

Bootstrap estimators for a distribution; Case 1: parametric bootstrap estimator (3)

Example (continued; case 1)

The B values tstar are generated realizations from the estimate of QP,

i.e. from the distribution of the Median of 59 independent observations from N(63.589, 0.121).

Recall: the empirical distribution of the generated set is the parametric bootstrap estimate of QP.

Also: the sample variance 0.0038 of the generated set is the parametric bootstrap estimate of the variance of QP.


Page 26

Bootstrap estimators for a distribution; Case 2: empirical bootstrap estimator (1)

Example (beewax; case 2)

Case 2: assume nothing about P; Tn = (sample) Mean. What is the distribution QP of Tn?

How to proceed?
i) Estimator of P: the empirical distribution of the data = P̂n
ii) Estimate: the empirical distribution of the beewax data
iii) QP is the distribution of the Mean of 59 independent observations from P
iv) Estimator of QP: the distribution of the Mean of 59 independent observations from the empirical distribution of the data (which distribution is this?)
v) Estimate of QP: the distribution of the Mean of 59 independent observations from the empirical distribution of the beewax data

Unknown: use the computer to generate realizations from the estimate of QP.

The empirical distribution of the generated set is the empirical bootstrap estimate of QP.


Page 27

Bootstrap estimators for a distribution; Case 2: empirical bootstrap estimator (2)

Example (continued; case 2)

How to generate realizations from the estimate of QP, i.e. from the distribution of the Mean of 59 independent observations from the empirical distribution of the beewax data?

# 1. Generate one bootstrap sample:
> xstar = sample(beewax, replace = TRUE)
# Check:
> xstar
 [1] 63.69 64.42 63.30 63.03 63.13 63.13 63.08 63.27 63.08 64.12 64.21 63.43 …..
# Note: xstar is of the same length as beewax and consists of values sampled from the set of
# beewax values.

# 2. Now compute one bootstrap value tstar from xstar:
> tstar = mean(xstar)
> tstar
[1] 63.60271

# 3. Do 1 and 2 B times. The B values tstar are generated realizations from the estimate of QP; a one-line version is below.
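Step 3 can be written in one line with replicate (a sketch; B = 1000 again an arbitrary choice):

R:
> tstar = replicate(1000, mean(sample(beewax, replace = TRUE)))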


Page 28

Bootstrap estimators for a distribution; Case 2: empirical bootstrap estimator (3)

Example (continued; case 2)

The B values tstar are generated realizations from the estimate of QP,

i.e. from the distribution of the Mean of 59 independent observations from the empirical distribution of the beewax data.

Recall: the empirical distribution of this generated set is the empirical bootstrap estimate of QP.

Also: the sample variance 0.00193 of this generated set is the empirical bootstrap estimate of the variance of QP.

Note: this value is comparable to the value 0.121/59 = 0.0020 of the estimate of the variance of QP under the normality assumption for P.


Page 29

Empirical bootstrap with R

# Can be done in one go with the local R function bootstrap:
> bootstrap = function(x, statistic, B = 100, ...) {
+   # returns a vector of B bootstrap values of a real-valued statistic.
+   # statistic(x) should be an R function; further arguments of
+   # statistic can be passed via ...
+   # resampling is done from the empirical distribution of x
+   y <- numeric(B)
+   for (j in 1:B)
+     y[j] <- statistic(sample(x, replace = TRUE), ...)
+   y
+ }

# Compute 1000 bootstrap values tstar:
> tstarvector = bootstrap(beewax, mean, B = 1000)
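From tstarvector the bootstrap estimates follow directly (a minimal sketch; the histogram is one way to picture the estimated QP):

R:
> var(tstarvector)    # empirical bootstrap estimate of the variance of QP
> hist(tstarvector)   # picture of the empirical bootstrap estimate of QP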

Page 30

Bootstrap: two errors

Recall the goal: to estimate the distribution QP of a function Tn of n independent observations from unknown P. Note: in the bootstrap estimation procedure two types of "errors" are made. Which ones?

Given the data:
- Estimate of QP: the distribution of the function Tn of n independent observations from the estimate of P ← first error
- which is estimated in turn by the empirical distribution of computer-generated realizations from this distribution ← second error

How can we make these errors small?
- The size of the first error depends on the quality of the estimator of P.
- The size of the second error can be made small by taking B large; see the sketch below.
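To see the second error shrink, one can repeat the estimation with growing B using the bootstrap function of the previous slide (a sketch; the exact values vary from run to run):

R:
> for (B in c(100, 1000, 10000))
+   print(var(bootstrap(beewax, mean, B = B)))
# the printed variances fluctuate around the same number, less and less as B grows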

Page 31

Recap

Bootstrap
4.1. Simulation (read yourself)
4.2. Bootstrap estimators for distribution
  - Parametric bootstrap estimators
  - Empirical bootstrap estimators

Page 32

Exploring distributions/Bootstrap

The end