STAT373/STAT814_STAT714 Week 10 (2019): Stratified Sampling


STAT373/STAT814_STAT714

Week 10

2019 1

1

Week 10: STRATIFIED SAMPLING

LGA example

• We return to the problem of estimating the mean number of overseas-born people per NSW LGA (1996).

• It seems plausible that overseas-born people would be more likely to settle in urban rather than rural areas.

• So perhaps a stratification based on broad geographical groupings of LGAs would be sensible.

2

Statistical Divisions (SD)

SD id   SD name           Number of LGAs
5       Sydney            46
10      Hunter            14
15      Illawarra         4
20      Richmond-Tweed    7
25      Mid-North Coast   11
30      Northern          20
35      North Western     14
40      Central West      14
45      South Eastern     19
50      Murrumbidgee      14
55      Murray            16
60      Far West          3

NOTE: LGAs (182 of them) are grouped into 12 Statistical Divisions (SDs). These become our strata.

3

Descriptive statistics of OS born by SD_id

Descriptive Statistics

Variable   SD_id   N    Mean    Median   TrMean   StDev
OSBorn P   5       46   26426   20093    24728    19966
           10      14   9731    1192     4070     23028
           15      4    15311   7439     15311    19457
           20      7    3048    3263     3048     3000
           25      11   2113    1281     1823     2193
           30      20   1084    350      554      2554
           35      14   476     239      388      555
           40      14   913     407      796      1001
           45      19   1420    879      1267     1498
           50      14   644     264      427      994
           55      16   511     262      297      947
           60      3    1486    972      1486     1333

4

[Figure: image labelled with the 12 SD ids (5 to 60); not recoverable from the extracted text]

5

Number of OS born in NSW LGAs by SD

[Figure: plot of OSBorn P (vertical axis, 0 to 100000) against SD_id (horizontal axis, 5 to 60)]

6

A reasonable stratification strategy for sampling LGAs would be to have the following three strata:

• Stratum 1: 5 (Sydney), 15 (Illawarra)

• Stratum 2: 10 (Hunter), 20 (Richmond-Tweed), 25 (Mid-North Coast), 45 (South Eastern)

• Stratum 3: rest of NSW

Note: Alternatively, Hunter (10), Sydney (5) or Illawarra (15) may each be treated as a separate stratum, because their variability differs from that of the other divisions.


7

Descriptive statistics of the three strata

Descriptive Statistics

Variable   stratum   N    Mean    Median   TrMean   StDev
OSBorn P   1         50   25537   19435    23429    19964
           2         51   4074    1211     1933     12384
           3         81   775     322      549      1488

Variable   stratum   SE Mean   Minimum   Maximum   Q1     Q3
OSBorn P   1         2823      2216      97203     9695   33953
           2         1734      102       87264     377    3522
           3         165       18        11607     159    651

8

Now let's draw a simple random sample of 10 LGAs from each of the three strata.

9

Sample from Stratum 1

LGA              OS_born
Pittwater        11177
Baulkham Hills   30267
Marrickville     33538
Leichhardt       16556
Manly            9759
Blacktown        72350
Botany           16002
Willoughby       19180
Waverley         24006
Hunter's Hill    2690

Variable   N    Mean    Median   TrMean   StDev   SE Mean
OS_born    10   23552   17868    20061    19522   6173

Variable   Minimum   Maximum   Q1      Q3
OS_born    2690      72350     10823   31085

10

Sample from Stratum 2

LGA             OS_born
Merriwa         150
Eurobodalla     3996
Newcastle       16266
Muswellbrook    930
Yarrowlumla     1449
Scone           543
Gunning         203
Singleton       1455
Cntrl Darling   126
Indigo          1086

Variable   N    Mean   Median   TrMean   StDev   SE Mean
OS_born    10   2620   1008     1227     4927    1558

Variable   Minimum   Maximum   Q1    Q3
OS_born    126       16266     190   2090

11

Sample from Stratum 3

LGA          OS_born
Cobar        300
Tamworth     1888
Parkes       808
Holbrook     137
Walgett      940
Jerilderie   127
Cessnock     2999
Warren       109
Evans        420
Parry        642

Variable   N    Mean   Median   TrMean   StDev   SE Mean
OS_born    10   837    531      658      932     295

Variable   Minimum   Maximum   Q1    Q3
OS_born    109       2999      135   1177

12

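Estimation of $\mu$

We have

$$N_1 = 50,\quad n_1 = 10,\quad \bar{y}_1 = 23{,}552,\quad s_1 = 19{,}522$$

$$N_2 = 51,\quad n_2 = 10,\quad \bar{y}_2 = 2{,}620,\quad s_2 = 4{,}927$$

$$N_3 = 81,\quad n_3 = 10,\quad \bar{y}_3 = 837,\quad s_3 = 932$$

Total sample size, $n = n_1 + n_2 + n_3 = 30$.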


13

Therefore, the estimated mean number of overseas-born people per NSW LGA is:

$$\bar{y}_{ST} = \frac{50}{182}(23{,}552) + \frac{51}{182}(2{,}620) + \frac{81}{182}(837) = 7{,}577$$

14

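Estimated variance and standard error

$$\widehat{\mathrm{Var}}(\bar{y}_{ST}) = \sum_{i=1}^{3}\left(\frac{N_i}{N}\right)^2 (1-f_i)\,\frac{s_i^2}{n_i}$$

$$= \left(\frac{50}{182}\right)^2\left(1-\frac{10}{50}\right)\frac{19{,}522^2}{10} + \left(\frac{51}{182}\right)^2\left(1-\frac{10}{51}\right)\frac{4{,}927^2}{10} + \left(\frac{81}{182}\right)^2\left(1-\frac{10}{81}\right)\frac{932^2}{10} = 2{,}469{,}424$$

$$\widehat{SE}(\bar{y}_{ST}) = \sqrt{2{,}469{,}424} = 1{,}571$$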

15

Comparison of SRS with stratified sampling results

Recall: Last week we estimated the mean number of OS-born people per LGA, based on a simple random sample of size n=30.

We obtained

$$\bar{y} = 6{,}527 \quad\text{with}\quad SE(\bar{y}) = 9{,}713\sqrt{(1-0.165)/30} = 1{,}620.$$

16

With stratified sampling, for the same sample size n=30, we have estimated $\mu$ with increased precision (i.e., smaller standard error/variance, as shown on Slide 14):

$$SE(\bar{y}_{ST}) = \sqrt{\widehat{\mathrm{Var}}(\bar{y}_{ST})} = 1{,}571 < s(\bar{y}) = 1{,}620.$$

17

Design Effect (Lohr §7.5)

The Design Effect is defined as

$$\mathrm{deff} = \frac{\mathrm{Var}(\text{estimate under current sampling plan})}{\mathrm{Var}(\text{estimate under SRS with same sample size})}$$

This quantifies the effect on the sampling variance of using the current sampling scheme (e.g. stratified sampling) rather than SRS.

We have here

$$\mathrm{deff} = \frac{1{,}571^2}{1{,}620^2} = 0.94$$

Note: Usually the design effect for a stratified sample will be less than one (i.e., higher precision), unless all the stratum means are equal.
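The calculations on Slides 12 to 17 can be reproduced with a few lines of Python. This is a minimal sketch rather than part of the lecture material: the function names `stratified_mean` and `stratified_var` are hypothetical, and the inputs are the rounded stratum summaries quoted on the slides, so the last digit of a result may differ from Minitab output based on the raw data.

```python
import math

# Stratum summaries from the equal-allocation sample (Slides 9 to 12).
N_h = [50, 51, 81]             # stratum sizes N_i
n_h = [10, 10, 10]             # stratum sample sizes n_i
ybar_h = [23552, 2620, 837]    # stratum sample means
s_h = [19522, 4927, 932]       # stratum sample standard deviations


def stratified_mean(N_h, ybar_h):
    """Stratified estimate of the mean: sum of (N_i / N) * ybar_i."""
    N = sum(N_h)
    return sum(Ni * yi for Ni, yi in zip(N_h, ybar_h)) / N


def stratified_var(N_h, n_h, sd_h):
    """Variance of the stratified mean: sum of (N_i/N)^2 (1 - f_i) sd_i^2 / n_i."""
    N = sum(N_h)
    return sum((Ni / N) ** 2 * (1 - ni / Ni) * sd ** 2 / ni
               for Ni, ni, sd in zip(N_h, n_h, sd_h))


ybar_st = stratified_mean(N_h, ybar_h)
se_st = math.sqrt(stratified_var(N_h, n_h, s_h))

se_srs = 1620                       # SRS standard error quoted on Slide 15
deff = se_st ** 2 / se_srs ** 2     # design effect relative to SRS with n = 30

print(round(ybar_st), round(se_st), round(deff, 2))
```

Passing the population standard deviations from Slide 7 instead of the sample values gives the true variance of the estimator rather than its estimate.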

18

Estimation of the population total

Recall the population total is

$$\tau = N\mu$$

and its sample estimator (based on a SRS) is

$$y_T = N\bar{y}.$$

For a stratified sample,

$$y_{T,ST} = N\bar{y}_{ST}.$$


19

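$y_{T,ST}$ is an unbiased estimator of $\tau$. (Easy to show.)

$$\mathrm{Var}(y_{T,ST}) = \mathrm{Var}(N\bar{y}_{ST}) = N^2\,\mathrm{Var}(\bar{y}_{ST}) = N^2\sum_{i=1}^{L} W_i^2 (1-f_i)\frac{\sigma_i^2}{n_i} = \sum_{i=1}^{L} N_i^2 (1-f_i)\frac{\sigma_i^2}{n_i}$$

where $W_i = N_i/N$ is the stratum weight.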

20

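Confidence intervals for $\mu$ and $\tau$

We have the usual normal approximations:

$$\bar{y}_{ST} \sim N\big(\mu,\ \mathrm{Var}(\bar{y}_{ST})\big)$$

and

$$y_{T,ST} \sim N\big(\tau,\ \mathrm{Var}(y_{T,ST})\big)$$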

21

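Approximate $100(1-\alpha)\%$ confidence intervals are given by

$$\bar{y}_{ST} \pm z_{\alpha/2}\,SE(\bar{y}_{ST}) \quad\text{for } \mu$$

and

$$y_{T,ST} \pm z_{\alpha/2}\,SE(y_{T,ST}) = N\bar{y}_{ST} \pm z_{\alpha/2}\,N\,SE(\bar{y}_{ST}) \quad\text{for } \tau.$$

Why not a t distribution? The number of degrees of freedom is unclear, so we use z instead.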

22

OS-born example

Population mean $\mu$:

$$\bar{y}_{ST} \pm z_{\alpha/2}\,SE(\bar{y}_{ST}) = 7{,}577 \pm 1.96 \times 1{,}571 = 7{,}577 \pm 3{,}079 = (4{,}498,\ 10{,}656)$$

Includes $\mu = 8{,}502$.

23

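Population total $\tau$:

$$y_{T,ST} = N\bar{y}_{ST} = 182 \times 7{,}577 = 1{,}379{,}014$$

95% CI:

$$N\big(\bar{y}_{ST} \pm z_{\alpha/2}\,SE(\bar{y}_{ST})\big) = 182 \times (4{,}498;\ 10{,}656) = (818{,}636;\ 1{,}939{,}392)$$

Includes $\tau = 1{,}547{,}364$.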
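The two intervals can be checked numerically as follows. This is a small sketch with a hypothetical helper name (`normal_ci`). It scales the unrounded limits for the mean by N, whereas the slide multiplies the limits after rounding them, so the limits for the total differ from the slide values by a few tens of people.

```python
N = 182          # number of LGAs in the population
ybar_st = 7577   # stratified estimate of the mean (Slide 13)
se_st = 1571     # its estimated standard error (Slide 14)


def normal_ci(estimate, se, z=1.96):
    """Approximate confidence interval: estimate +/- z * SE (z = 1.96 for 95%)."""
    half_width = z * se
    return (estimate - half_width, estimate + half_width)


mean_ci = normal_ci(ybar_st, se_st)
# The total is N times the mean, so its limits are N times the mean's limits.
total_ci = tuple(N * limit for limit in mean_ci)

print([round(x) for x in mean_ci])
print([round(x) for x in total_ci])

# Both intervals cover the true population values quoted on the slides.
print(mean_ci[0] < 8502 < mean_ci[1], total_ci[0] < 1547364 < total_ci[1])
```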

24

Choice of stratum sample sizes ni

1. Proportional allocation

Sometimes the $N_i$'s and $n$ are known, but we need to work out the $n_i$'s (where $n = n_1 + n_2 + \cdots + n_L$).

One approach is to insist on sampling the same proportion of each stratum, i.e.

$$\frac{n_i}{N_i} = f, \quad i = 1, \dots, L.$$


25

Then

$$n = \sum_{i=1}^{L} n_i = \sum_{i=1}^{L} f N_i = fN,$$

i.e.

$$f = \frac{n}{N}, \quad\text{or}\quad n_i = fN_i = n\,\frac{N_i}{N}.$$

This scheme is known as proportional allocation.

26

LGA example

Recall we had taken a sample of size 30 as follows:

$$\frac{n_1}{N_1} = \frac{10}{50}, \quad \frac{n_2}{N_2} = \frac{10}{51}, \quad \frac{n_3}{N_3} = \frac{10}{81}, \quad \frac{n}{N} = \frac{30}{182}$$

27

If we are still going to sample a total of 30 units, but using the proportional allocation method, we would have:

$$f = \frac{n}{N} = \frac{30}{182} = 0.165$$

with

$$n_1 = 0.165 \times 50 = 8.24, \quad n_2 = 0.165 \times 51 = 8.41, \quad n_3 = 0.165 \times 81 = 13.35$$

Thus, we would take $n_1 = 8$, $n_2 = 9$, $n_3 = 13$.

Note: The numbers were sensibly rounded to keep the total sample size at 30.
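One way to make the rounding step explicit is a largest-remainder rule: round every $n_i$ down, then hand the leftover units to the strata with the largest fractional parts. The slides only say the numbers were "sensibly rounded", so this rule is an assumption, and the function name `proportional_allocation` is hypothetical. For the LGA strata it reproduces the allocation 8, 9, 13.

```python
import math


def proportional_allocation(N_h, n):
    """Return (exact, rounded) proportional allocations n_i = n * N_i / N.

    Rounding uses a largest-remainder rule (an assumption, not taken from
    the slides) so that the rounded stratum sizes still sum to n.
    """
    N = sum(N_h)
    exact = [n * Ni / N for Ni in N_h]
    alloc = [math.floor(x) for x in exact]
    shortfall = n - sum(alloc)
    # Give the remaining units to the strata with the largest fractional parts.
    by_remainder = sorted(range(len(N_h)),
                          key=lambda i: exact[i] - alloc[i],
                          reverse=True)
    for i in by_remainder[:shortfall]:
        alloc[i] += 1
    return exact, alloc


exact, alloc = proportional_allocation([50, 51, 81], 30)
print([round(x, 2) for x in exact])
print(alloc)
```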

28

Choice of stratum sample sizes ni

2. Optimal allocation

• An important objective of sample survey design is to provide estimates with small variances at the lowest possible cost.

• The best allocation scheme will depend on:

– The number of elements in each stratum ($N_i$)

– The variability of observations within each stratum ($\sigma_i^2$)

– The cost of obtaining an observation from each stratum ($c_i$).

29

• Let’s consider the question of how to choose the sample size n to satisfy certain precision and cost requirements.

• This will also involve choice of the ni.

30

Model for cost structure:

Overhead = cost of administering the survey = $c_0$

Cost of a single observation from stratum $i$ = $c_i$

Total cost of sampling across all $L$ strata:

$$C = c_0 + \sum_{i=1}^{L} c_i n_i$$


31

What allocation of stratum sample sizes $n_1, n_2, \dots, n_L$ should be chosen to

• Minimise $\mathrm{Var}(\bar{y}_{ST})$ for a given total cost $C$?

• Minimise the total cost $C$, for a given value of $\mathrm{Var}(\bar{y}_{ST})$?

32

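Minimise variance for fixed cost, C

We choose $n_1, \dots, n_L$ to minimise

$$\mathrm{Var}(\bar{y}_{ST}) = \sum_{i=1}^{L} W_i^2 (1-f_i)\frac{\sigma_i^2}{n_i} = \sum_{i=1}^{L} \frac{W_i^2\sigma_i^2}{n_i} - \frac{1}{N}\sum_{i=1}^{L} W_i\sigma_i^2$$

subject to the constraint

$$\sum_{i=1}^{L} c_i n_i = C - c_0.$$

Solution is by the method of Lagrange multipliers.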

33

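Minimise cost for fixed variance, V

We choose $n_1, \dots, n_L$ to minimise

$$C = c_0 + \sum_{i=1}^{L} c_i n_i$$

subject to the constraint

$$\mathrm{Var}(\bar{y}_{ST}) = \sum_{i=1}^{L} \frac{W_i^2\sigma_i^2}{n_i} - \frac{1}{N}\sum_{i=1}^{L} W_i\sigma_i^2 = V.$$

Solution is again by the method of Lagrange multipliers.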

34

Solution to optimal allocation problem

The allocation for minimising cost for fixed variance turns out to be the same as that for minimising variance for fixed cost.

35

The optimal allocation has $n_i$ proportional to $N_i\sigma_i/\sqrt{c_i}$.

Thus the optimal sample size in stratum $i$ is

$$n_i = n\,\frac{N_i\sigma_i/\sqrt{c_i}}{\sum_{j=1}^{L} N_j\sigma_j/\sqrt{c_j}}$$
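The Lagrange-multiplier step behind this result can be sketched as follows, for the fixed-cost problem. This is the standard argument written in the notation of the slides. The multiplier $\lambda$ and the function $\mathcal{L}$ are symbols introduced here for illustration, and the $n_i$ are treated as continuous.

$$\mathcal{L}(n_1, \dots, n_L, \lambda) = \sum_{i=1}^{L} \frac{W_i^2\sigma_i^2}{n_i} - \frac{1}{N}\sum_{i=1}^{L} W_i\sigma_i^2 + \lambda\left(\sum_{i=1}^{L} c_i n_i - (C - c_0)\right)$$

$$\frac{\partial\mathcal{L}}{\partial n_i} = -\frac{W_i^2\sigma_i^2}{n_i^2} + \lambda c_i = 0 \quad\Longrightarrow\quad n_i = \frac{W_i\sigma_i}{\sqrt{\lambda c_i}} \propto \frac{N_i\sigma_i}{\sqrt{c_i}}$$

Summing over strata gives $n = \sum_j n_j$, and dividing $n_i$ by this sum removes $\lambda$:

$$\frac{n_i}{n} = \frac{N_i\sigma_i/\sqrt{c_i}}{\sum_{j=1}^{L} N_j\sigma_j/\sqrt{c_j}}$$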

36

Thus sampling effort is directed to strata that:

• account for a large proportion of the population;

• have a large variance;

• have a low sampling cost.


37

When sampling costs over the strata are equal, i.e. $c_1 = c_2 = \cdots = c_L$, this reduces to

$$n_i = n\,\frac{N_i\sigma_i}{\sum_{j=1}^{L} N_j\sigma_j}.$$

This is known as Neyman allocation.

38

Further, if in addition the stratum variances are equal, i.e. $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_L^2$, then the allocation formula becomes

$$n_i = n\,\frac{N_i}{N} = fN_i,$$

which is proportional allocation.

Thus proportional allocation can be seen as a special case of Neyman allocation when all stratum variances are equal.

39

In order to compute the ni according to the Neyman allocation for sample size, we need to know:

• the stratum variances $\sigma_i^2$; and

• the total sample size n.

40

Stratum variances $\sigma_i^2$

These have to be estimated from previous surveys or other knowledge about the population.

41

Total sample size n

The determination of n depends on whether:

• Variance is being minimised for a fixed cost; or

• Total cost is being minimised for a fixed variance.

42

Total sample size required for a fixed total cost, C

Substitute the $n_i$ expression on Slide 35 into the total cost constraint on Slide 32, $\sum_{i=1}^{L} c_i n_i = C - c_0$, and work out the expression for $n$:

$$n = \frac{(C - c_0)\sum_{i=1}^{L} N_i\sigma_i/\sqrt{c_i}}{\sum_{j=1}^{L} N_j\sigma_j\sqrt{c_j}}$$


43

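Total sample size required for a fixed $\mathrm{Var}(\bar{y}_{ST})$, $V$

Here we determine the total sample size $n$ that satisfies the requirement

$$\Pr\big(|\bar{y}_{ST} - \mu| < B\big) = 1 - \alpha, \quad\text{where } B = z_{\alpha/2}\sqrt{V(\bar{y}_{ST})}.$$

Based on the constraint below,

$$\mathrm{Var}(\bar{y}_{ST}) = \sum_{i=1}^{L} \frac{W_i^2\sigma_i^2}{n_i} - \frac{1}{N}\sum_{i=1}^{L} W_i\sigma_i^2 = V = \frac{B^2}{z_{\alpha/2}^2},$$

and (contd.)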

44

following the usual logic we get

$$n = \frac{\sum_{i=1}^{L} N_i^2\sigma_i^2/w_i}{N^2 B^2/z_{\alpha/2}^2 + \sum_{i=1}^{L} N_i\sigma_i^2}$$

where $w_i = n_i/n$ is the proportion of observations allocated to stratum $i$.

45

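Substituting

$$w_i = \frac{n_i}{n} = \frac{N_i\sigma_i/\sqrt{c_i}}{\sum_{j=1}^{L} N_j\sigma_j/\sqrt{c_j}}$$

we get

$$n = \frac{\left(\sum_{k=1}^{L} N_k\sigma_k\sqrt{c_k}\right)\left(\sum_{i=1}^{L} N_i\sigma_i/\sqrt{c_i}\right)}{N^2 V + \sum_{i=1}^{L} N_i\sigma_i^2}$$

where $\mathrm{Var}(\bar{y}_{ST})$ is fixed at $V$, i.e., $V = B^2/z_{\alpha/2}^2$.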
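As a hedged illustration of this formula, the sketch below evaluates $n$ for the three LGA strata with equal costs. The bound `B = 3000` is a hypothetical value chosen only for illustration (it does not appear in the slides), and the function name `total_sample_size` is likewise an assumption.

```python
import math


def total_sample_size(N_h, sigma_h, c_h, B, z=1.96):
    """Total n under optimal allocation so that Var(ybar_ST) = V = (B / z)^2."""
    N = sum(N_h)
    V = (B / z) ** 2
    # sum of N_k * sigma_k * sqrt(c_k)
    sum_times = sum(Ni * si * math.sqrt(ci)
                    for Ni, si, ci in zip(N_h, sigma_h, c_h))
    # sum of N_i * sigma_i / sqrt(c_i)
    sum_over = sum(Ni * si / math.sqrt(ci)
                   for Ni, si, ci in zip(N_h, sigma_h, c_h))
    # sum of N_i * sigma_i^2 (the finite-population term in the denominator)
    fpc_term = sum(Ni * si ** 2 for Ni, si in zip(N_h, sigma_h))
    return sum_times * sum_over / (N ** 2 * V + fpc_term)


N_h = [50, 51, 81]                # stratum sizes (Slide 7)
sigma_h = [19964, 12384, 1488]    # stratum standard deviations (Slide 7)
c_h = [1, 1, 1]                   # equal sampling costs (assumption)
B = 3000                          # hypothetical bound on the error of estimation

n_required = total_sample_size(N_h, sigma_h, c_h, B)
print(math.ceil(n_required))      # round up to a whole number of LGAs
```

With equal costs the allocation is the Neyman allocation, so the resulting $n$ would then be split across strata in proportion to $N_i\sigma_i$.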

46

Optimal allocation: LGA example

• We would like to sample a total of 30 ($n = 30$) LGAs.

• We assume that sampling costs in the strata are equal, i.e. $c_1 = c_2 = c_3$.

• We know the values of $\sigma_i$ from Slide 7: $\sigma_1 = 19{,}964$; $\sigma_2 = 12{,}384$; $\sigma_3 = 1{,}488$.

• Usually we would not know the values of $\sigma_i$. In that case we would estimate $\sigma_i$ from the previous census.

47

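We use the Neyman allocation formula:

$$n_i = n\,\frac{N_i\sigma_i}{\sum_{j=1}^{L} N_j\sigma_j}$$

$$n_1 = 30 \times \frac{50 \times 19{,}964}{50 \times 19{,}964 + 51 \times 12{,}384 + 81 \times 1{,}488} = 30 \times \frac{998{,}200}{1{,}750{,}312} = 17.1$$

$$n_2 = 30 \times \frac{51 \times 12{,}384}{1{,}750{,}312} = 10.8$$

$$n_3 = 30 \times \frac{81 \times 1{,}488}{1{,}750{,}312} = 2.1$$

We take $n_1 = 17$, $n_2 = 11$, $n_3 = 2$.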
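The same arithmetic in Python, as a minimal sketch. The function name `optimal_allocation` is hypothetical. The optional `c_h` argument implements the general rule $n_i \propto N_i\sigma_i/\sqrt{c_i}$ from Slide 35; leaving it out assumes equal costs, which gives the Neyman allocation used here.

```python
import math


def optimal_allocation(N_h, sigma_h, n, c_h=None):
    """Exact (unrounded) allocation n_i proportional to N_i * sigma_i / sqrt(c_i).

    With c_h omitted, all costs are taken as equal (Neyman allocation).
    """
    if c_h is None:
        c_h = [1.0] * len(N_h)
    weights = [Ni * si / math.sqrt(ci)
               for Ni, si, ci in zip(N_h, sigma_h, c_h)]
    total = sum(weights)
    return [n * w / total for w in weights]


exact = optimal_allocation([50, 51, 81], [19964, 12384, 1488], 30)
rounded = [round(x) for x in exact]

print([round(x, 1) for x in exact])
print(rounded, sum(rounded))
```

Simple rounding happens to preserve the total of 30 here; in general the rounded values may need a small adjustment so that they still sum to $n$.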

48

• Recall that the proportional allocation was n1=8, n2=9, n3=13 (shown on Slide 27), where Stratum 3 received a large share of the observations (sampling units) because this stratum accounts for 81/182 = 45% of LGAs.

• However, in the Neyman allocation we take into account the fact that Stratum 3 has the lowest variance across all three strata, and accordingly allocate it very few units to be sampled.


49

Results: Neyman allocation sample

Stratum   Variable   N    Mean    Median   TrMean   StDev   SE Mean
1         OSBorn s   17   26788   26358    25357    18428   4469
2         OSBorn s   11   2661    1455     2235     2764    833
3         OSBorn s   2    294.0   294.0    294.0    118.8   84.0

50

$$\bar{y}_{ST} = \frac{50}{182}(26{,}788) + \frac{51}{182}(2{,}661) + \frac{81}{182}(294.0) = 8{,}235.9$$

51

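$$\widehat{\mathrm{Var}}(\bar{y}_{ST}) = \sum_{i=1}^{3}\left(\frac{N_i}{N}\right)^2 (1-f_i)\,\frac{s_i^2}{n_i}$$

$$= \left(\frac{50}{182}\right)^2\left(1-\frac{17}{50}\right)\frac{18{,}428^2}{17} + \left(\frac{51}{182}\right)^2\left(1-\frac{11}{51}\right)\frac{2{,}764^2}{11} + \left(\frac{81}{182}\right)^2\left(1-\frac{2}{81}\right)\frac{118.8^2}{2} = 1{,}039{,}195$$

$$\widehat{SE}(\bar{y}_{ST}) = \sqrt{1{,}039{,}195} = 1{,}019$$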

52

True variances of the estimators

• Because we have complete population information, we are in fact in a position to compute the actual variances ($\sigma_{\bar{y}}^2$, i.e., $\mathrm{Var}(\bar{y})$) of the estimators of $\mu$.

• The variances we have computed until now have been estimates ($s_{\bar{y}}^2$, i.e., $\widehat{\mathrm{Var}}(\bar{y})$), based on the samples that we drew.

53

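For a SRS (n=30):

$$\mathrm{Var}(\bar{y}) = \sigma_{\bar{y}}^2 = (1-f)\frac{\sigma^2}{n} = \left(1-\frac{30}{182}\right)\frac{16{,}237^2}{30} = 7{,}339{,}433$$

$$SE(\bar{y}) = \sqrt{7{,}339{,}433} = 2{,}709$$

($\sigma = 16{,}237$ is from Slide 7 last week.)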

54

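For a stratified sample of size 30 with equal allocation:

$$\mathrm{Var}(\bar{y}_{ST}) = \sigma_{\bar{y}_{st}}^2 = \sum_{i=1}^{3}\left(\frac{N_i}{N}\right)^2 (1-f_i)\,\frac{\sigma_i^2}{n_i}$$

$$= \left(\frac{50}{182}\right)^2\left(1-\frac{10}{50}\right)\frac{19{,}964^2}{10} + \left(\frac{51}{182}\right)^2\left(1-\frac{10}{51}\right)\frac{12{,}384^2}{10} + \left(\frac{81}{182}\right)^2\left(1-\frac{10}{81}\right)\frac{1{,}488^2}{10} = 3{,}413{,}051$$

$$SE(\bar{y}_{ST}) = \sqrt{3{,}413{,}051} = 1{,}847$$

($\sigma_i$'s from Slide 7.)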


55

Stratified, proportional allocation (n=30)

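$$\mathrm{Var}(\bar{y}_{ST}) = \sigma_{\bar{y}_{st}}^2 = \sum_{i=1}^{3}\left(\frac{N_i}{N}\right)^2 (1-f_i)\,\frac{\sigma_i^2}{n_i}$$

$$= \left(\frac{50}{182}\right)^2\left(1-\frac{8}{50}\right)\frac{19{,}964^2}{8} + \left(\frac{51}{182}\right)^2\left(1-\frac{9}{51}\right)\frac{12{,}384^2}{9} + \left(\frac{81}{182}\right)^2\left(1-\frac{13}{81}\right)\frac{1{,}488^2}{13} = 4{,}288{,}762$$

$$SE(\bar{y}_{ST}) = \sqrt{4{,}288{,}762} = 2{,}071$$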

56

Stratified, Neyman allocation (n=30)

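$$\mathrm{Var}(\bar{y}_{ST}) = \sum_{i=1}^{3}\left(\frac{N_i}{N}\right)^2 (1-f_i)\,\frac{\sigma_i^2}{n_i}$$

$$= \left(\frac{50}{182}\right)^2\left(1-\frac{17}{50}\right)\frac{19{,}964^2}{17} + \left(\frac{51}{182}\right)^2\left(1-\frac{11}{51}\right)\frac{12{,}384^2}{11} + \left(\frac{81}{182}\right)^2\left(1-\frac{2}{81}\right)\frac{1{,}488^2}{2} = 2{,}240{,}369$$

$$SE(\bar{y}_{ST}) = \sqrt{2{,}240{,}369} = 1{,}497$$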

57

Variances of the estimators: summary

Sampling scheme     SE      deff
SRS                 2,709   -
Stratified:
  Equal             1,847   0.46
  Proportional      2,071   0.58
  Neyman            1,497   0.31
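The summary table can be recomputed from the population values on the slides. The sketch below uses hypothetical function names (`var_srs`, `var_stratified`) and takes the $\sigma$ values as quoted, which are themselves rounded. It computes deff from unrounded variances, so a value may differ from the table in the last digit (the table appears to use the rounded standard errors).

```python
import math

N_h = [50, 51, 81]                 # stratum sizes
sigma_h = [19964, 12384, 1488]     # stratum standard deviations (Slide 7)
sigma = 16237                      # population standard deviation (Slide 53)
n = 30                             # total sample size for every scheme


def var_srs(N, n, sigma):
    """True variance of the sample mean under SRS: (1 - f) sigma^2 / n."""
    return (1 - n / N) * sigma ** 2 / n


def var_stratified(N_h, n_h, sigma_h):
    """True variance of ybar_ST: sum of (N_i/N)^2 (1 - f_i) sigma_i^2 / n_i."""
    N = sum(N_h)
    return sum((Ni / N) ** 2 * (1 - ni / Ni) * si ** 2 / ni
               for Ni, ni, si in zip(N_h, n_h, sigma_h))


allocations = {
    "Equal": [10, 10, 10],
    "Proportional": [8, 9, 13],
    "Neyman": [17, 11, 2],
}

v_srs = var_srs(sum(N_h), n, sigma)
se_srs = math.sqrt(v_srs)

# For each scheme: (standard error, design effect relative to SRS).
results = {}
for name, n_h in allocations.items():
    v = var_stratified(N_h, n_h, sigma_h)
    results[name] = (math.sqrt(v), v / v_srs)

print(f"SRS           SE={se_srs:8.0f}")
for name, (se, deff) in results.items():
    print(f"{name:13s} SE={se:8.0f}  deff={deff:.3f}")
```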

58

Comparison of SRS mean and stratified sample mean

The point of stratifying is the expectation that $\bar{y}_{ST}$ will be a more precise estimator of $\mu$ than $\bar{y}$ from a SRS, i.e.,

$$\mathrm{Var}(\bar{y}_{ST}) < \mathrm{Var}(\bar{y}).$$

We can see, for the LGA example, that stratification has achieved this, for all three allocation methods that we tried.

59

It can be shown that

$$\mathrm{Var}(\bar{y}_{ST}) < \mathrm{Var}(\bar{y})$$

if variation between stratum means is large compared with within-stratum variation.

60

This means that there will be a benefit to stratification (in terms of increased precision of the estimates) if

• within-stratum variances are small; and

• there are large differences between the stratum means


61

• With stratification we therefore aim to subdivide the population into homogeneous layers/groups/sections. (See next slide)

• In the LGA example, another reasonable stratification strategy would have been:

– urban

– rural

These are two fairly homogeneous subsections of the population.

62

Boxplots of OS Born by the three strata:

[Figure: boxplots of OSBorn P (vertical axis, 0 to 100000) by stratum (horizontal axis, 1 to 3)]

63

Comments on the stratification

• We can see that we have achieved small within-stratum variances in strata 2 and 3, but stratum 1 still has a large variance.

• We have successfully separated out the section of the population with a much larger mean.

• The sampling scheme could be further improved by a finer stratification within Sydney.

64

Estimation of other parameters

Estimation of the population total $\tau$ and proportion $p$ on the basis of a stratified sample follows the same lines as estimation of the population mean $\mu$.

65

Population total

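$$\hat{\tau}_{ST} = N\bar{y}_{ST} = \sum_{i=1}^{L} N_i\bar{y}_i$$

$$\widehat{\mathrm{Var}}(\hat{\tau}_{ST}) = N^2\,\widehat{\mathrm{Var}}(\bar{y}_{ST}) = \sum_{i=1}^{L} N_i^2 (1-f_i)\frac{s_i^2}{n_i}$$

Note: For sample size calculations ($n$ or $n_i$), you may use the formulas for estimating the mean, $\mu$.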

66

Population proportion p

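$$\hat{p}_{ST} = \frac{1}{N}\sum_{i=1}^{L} N_i\hat{p}_i$$

$$\widehat{\mathrm{Var}}(\hat{p}_{ST}) = \sum_{i=1}^{L}\left(\frac{N_i}{N}\right)^2 \widehat{\mathrm{Var}}(\hat{p}_i) = \sum_{i=1}^{L}\left(\frac{N_i}{N}\right)^2 (1-f_i)\,\frac{\hat{p}_i\hat{q}_i}{n_i - 1}$$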


67

Sample Size

• The total sample size, $n$, required to estimate $P$ within “d” percentage points for fixed $\mathrm{Var}(\hat{p}_{ST})$ is

$$n = \frac{\sum_{i=1}^{L}\{[N_i^2\,p_i(1-p_i)]/w_i\}}{N^2 d^2/z_{\alpha/2}^2 + \sum_{i=1}^{L}\{N_i[p_i(1-p_i)]\}}$$

where $w_i = n_i/n$ for stratum $i$, and the stratum sample size, $n_i$, depends on the type of allocation used.

68

Obtain a stratified sample in Minitab

E.g.: Let's draw a stratified sample that consists of a SRS of size n=10 from each of the three broad LGA strata (to get a total sample size of 30), as described on Slide 8.

Run Minitab, and open the LGA.mtw file;

Code the SD_id variable in the file into a stratum variable, say named ST_id, coded as 1, 2, or 3 according to Slide 6, as follows:

– Click the Data button on the top menu, choose Code and then Numeric to Numeric, and the dialog window will appear as shown on the next slide.

69

Code – Numeric to Numeric window:

70

Obtain a stratified sample in Minitab (Contd.):

– Click the OK button on the Code window;

– Click the Data button on the top menu, and choose Split Worksheet;

– On the dialog window, select the column containing the stratum code, then click the OK button;

Draw a SRS of size 10 from each stratum worksheet, following the procedure given in the Week 9 lecture.

Finally, you have a stratified sample consisting of the three (stratum) samples of size 10.
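For readers without Minitab, the same three steps (code the strata, split the frame, draw a SRS within each stratum) can be sketched in Python with the standard library. This is an illustrative alternative, not the procedure from the lecture. The LGA labels are hypothetical placeholders built from the SD counts on Slide 2, because the contents of LGA.mtw are not reproduced here, and the seed is arbitrary.

```python
import random

# Number of LGAs in each Statistical Division (Slide 2).
SD_COUNTS = {5: 46, 10: 14, 15: 4, 20: 7, 25: 11, 30: 20,
             35: 14, 40: 14, 45: 19, 50: 14, 55: 16, 60: 3}


def stratum_of(sd_id):
    """Code SD_id into the stratum id (1, 2 or 3) defined on Slide 6."""
    if sd_id in (5, 15):
        return 1
    if sd_id in (10, 20, 25, 45):
        return 2
    return 3


# Placeholder sampling frame: hypothetical labels stand in for real LGA names.
frame = [(f"SD{sd}-LGA{k}", sd)
         for sd, count in SD_COUNTS.items()
         for k in range(1, count + 1)]

# Split the frame by stratum (the analogue of Split Worksheet).
strata = {1: [], 2: [], 3: []}
for label, sd in frame:
    strata[stratum_of(sd)].append(label)

# Draw a SRS without replacement of size 10 from each stratum.
rng = random.Random(2019)   # arbitrary seed, fixed only for reproducibility
sample = {h: rng.sample(units, 10) for h, units in strata.items()}

for h in sorted(sample):
    print(h, len(strata[h]), sample[h][:3])
```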

71

SGTA exercises Week 10

(Try completing all exercises listed below before your SGTA class in Week 11.)

Lohr 3.9 Exercises:

• 9

• 6

• 7

Note: Questions and data file are available on the unit iLearn.