STAT373/STAT814_STAT714
Week 10
2019 1
1
Week 10: STRATIFIED SAMPLING
LGA example
• We return to the problem of estimating the mean number of overseas-born people per NSW LGA (1996).
• It seems plausible that overseas-born people would be more likely to settle in urban rather than rural areas.
• So perhaps a stratification based on broad geographical groupings of LGAs would be sensible.
2
Statistical Divisions (SD)
SD id  SD name          Number of LGAs
    5  Sydney           46
   10  Hunter           14
   15  Illawarra         4
   20  Richmond-Tweed    7
   25  Mid-North Coast  11
   30  Northern         20
   35  North Western    14
   40  Central West     14
   45  South Eastern    19
   50  Murrumbidgee     14
   55  Murray           16
   60  Far West          3
NOTE: The 182 LGAs are grouped into 12 Statistical Divisions (SDs). These become our strata.
3
Descriptive statistics of OS born by SD_id

Descriptive Statistics
Variable  SD_id   N   Mean  Median  TrMean  StDev
OSBorn P      5  46  26426   20093   24728  19966
             10  14   9731    1192    4070  23028
             15   4  15311    7439   15311  19457
             20   7   3048    3263    3048   3000
             25  11   2113    1281    1823   2193
             30  20   1084     350     554   2554
             35  14    476     239     388    555
             40  14    913     407     796   1001
             45  19   1420     879    1267   1498
             50  14    644     264     427    994
             55  16    511     262     297    947
             60   3   1486     972    1486   1333
4
[Figure: the 12 Statistical Divisions, labelled by SD id (5 to 60)]
5
[Figure: Number of OS born in NSW LGAs by SD. x-axis: SD_id (5 to 60); y-axis: OSBorn P (0 to 100000)]
6
A reasonable stratification strategy for sampling LGAs would be to have the following three strata:
• Stratum 1: 5 (Sydney), 15 (Illawarra)
• Stratum 2: 10 (Hunter), 20 (Richmond-Tweed), 25 (Mid-North Coast), 45 (South Eastern)
• Stratum 3: rest of NSW
Note: Alternatively, Hunter (10), Sydney (5) or Illawarra (15) could each be treated as a separate stratum, because their variability differs from that of the other divisions.
7
Descriptive statistics of the three strata

Descriptive Statistics
Variable  stratum   N   Mean  Median  TrMean  StDev
OSBorn P        1  50  25537   19435   23429  19964
                2  51   4074    1211    1933  12384
                3  81    775     322     549   1488

Variable  stratum  SE Mean  Minimum  Maximum    Q1     Q3
OSBorn P        1     2823     2216    97203  9695  33953
                2     1734      102    87264   377   3522
                3      165       18    11607   159    651
8
Now let’s draw a simple random sample of size 10 LGAs from each of the three strata.
9
Sample from Stratum 1
Pittwater 11177
Baulkham Hills 30267
Marrickville 33538
Leichhardt 16556
Manly 9759
Blacktown 72350
Botany 16002
Willoughby 19180
Waverley 24006
Hunter's Hill 2690
Variable N Mean Median TrMean StDev SE Mean
OS_born 10 23552 17868 20061 19522 6173
Variable Minimum Maximum Q1 Q3
OS_born 2690 72350 10823 31085
10
Sample from Stratum 2
Merriwa 150
Eurobodalla 3996
Newcastle 16266
Muswellbrook 930
Yarrowlumla 1449
Scone 543
Gunning 203
Singleton 1455
Cntrl Darling 126
Indigo 1086
Variable N Mean Median TrMean StDev SE Mean
OS_born 10 2620 1008 1227 4927 1558
Variable Minimum Maximum Q1 Q3
OS_born 126 16266 190 2090
11
Sample from Stratum 3
Cobar 300
Tamworth 1888
Parkes 808
Holbrook 137
Walgett 940
Jerilderie 127
Cessnock 2999
Warren 109
Evans 420
Parry 642
Variable N Mean Median TrMean StDev SE Mean
OS_born 10 837 531 658 932 295
Variable Minimum Maximum Q1 Q3
OS_born 109 2999 135 1177
12
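Estimation of $\mu$
We have
$$N_1 = 50,\quad n_1 = 10,\quad \bar{y}_1 = 23{,}552,\quad s_1 = 19{,}522$$
$$N_2 = 51,\quad n_2 = 10,\quad \bar{y}_2 = 2{,}620,\quad s_2 = 4{,}927$$
$$N_3 = 81,\quad n_3 = 10,\quad \bar{y}_3 = 837,\quad s_3 = 932$$
Total sample size, $n = n_1 + n_2 + n_3 = 30$.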
13
Therefore, the estimated mean number of overseas-born people per NSW LGA is:
$$\bar{y}_{ST} = \frac{50}{182}(23{,}552) + \frac{51}{182}(2{,}620) + \frac{81}{182}(837) = 7{,}577$$
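The weighted-mean calculation above can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and the stratum sizes and sample means are the ones quoted on the slides.

```python
# Stratum sizes N_i and sample means from the three samples of size 10.
N_i = [50, 51, 81]
ybar_i = [23552, 2620, 837]

N = sum(N_i)  # 182 LGAs in total

# Stratified mean: each stratum mean is weighted by W_i = N_i / N.
ybar_st = sum(Ni / N * yi for Ni, yi in zip(N_i, ybar_i))
print(round(ybar_st))
```

The weights N_i / N sum to one, so the estimate is a weighted average of the three stratum sample means.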
14
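Estimated variance and standard error
$$\widehat{\mathrm{Var}}(\bar{y}_{ST}) = \sum_{i=1}^{3} \left(\frac{N_i}{N}\right)^2 (1 - f_i)\,\frac{s_i^2}{n_i}$$
$$= \left(\frac{50}{182}\right)^2 \left(1 - \frac{10}{50}\right) \frac{19{,}522^2}{10} + \left(\frac{51}{182}\right)^2 \left(1 - \frac{10}{51}\right) \frac{4{,}927^2}{10} + \left(\frac{81}{182}\right)^2 \left(1 - \frac{10}{81}\right) \frac{932^2}{10}$$
$$= 2{,}469{,}424$$
$$\widehat{\mathrm{SE}}(\bar{y}_{ST}) = \sqrt{2{,}469{,}424} = 1{,}571.$$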
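As a check on the arithmetic, here is a minimal Python sketch of the estimated variance formula. The function name `stratified_var` is an assumption made for illustration; the inputs are the slide values. The result may differ from the slide figure in the last few digits because of rounding in the inputs.

```python
from math import sqrt

N_i = [50, 51, 81]        # stratum sizes
n_i = [10, 10, 10]        # stratum sample sizes
s_i = [19522, 4927, 932]  # stratum sample standard deviations


def stratified_var(N_i, n_i, sd_i):
    """Sum over strata of (N_i/N)^2 * (1 - n_i/N_i) * sd_i^2 / n_i."""
    N = sum(N_i)
    return sum((Ni / N) ** 2 * (1 - ni / Ni) * sd ** 2 / ni
               for Ni, ni, sd in zip(N_i, n_i, sd_i))


var_hat = stratified_var(N_i, n_i, s_i)
se_hat = sqrt(var_hat)
print(round(var_hat), round(se_hat))
```

Almost all of the estimated variance comes from Stratum 1, which has by far the largest sample standard deviation.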
15
Comparison of SRS with stratified sampling results
Recall: Last week we estimated the mean number of OS-born people per LGA, based on a simple random sample of size n=30.
We obtained
$$\bar{y} = 6{,}527 \quad\text{with}\quad \mathrm{SE}(\bar{y}) = 9{,}713\sqrt{(1 - 0.165)/30} = 1{,}620.$$
16
With stratified sampling, for the same sample size n=30, we have estimated $\mu$ with increased precision (ie, smaller standard error/variance, as shown in Slide 14):
$$\mathrm{SE}(\bar{y}_{ST}) = \sqrt{\widehat{\mathrm{Var}}(\bar{y}_{ST})} = 1{,}571 < s(\bar{y}) = 1{,}620.$$
17
Design Effect (Lohr §7.5)
The Design Effect is defined as
$$\mathrm{deff} = \frac{\mathrm{Var}(\text{estimate under current sampling plan})}{\mathrm{Var}(\text{estimate under SRS with same sample size})}$$
This quantifies the effect on the sampling variance of using the current sampling scheme (e.g. stratified sampling) rather than SRS.
We have here
$$\mathrm{deff} = \frac{1{,}571^2}{1{,}620^2} = 0.94$$
Note: Usually the design effect for a stratified sample will be less than one (ie, higher precision), unless all the stratum means are equal.
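The design effect is just a ratio of two variances, as in this short Python sketch. It uses the two estimated standard errors quoted on the slides; the variable names are illustrative.

```python
se_stratified = 1571  # estimated SE under stratified sampling (Slide 14)
se_srs = 1620         # estimated SE under SRS with the same n = 30

# Design effect: ratio of the variances (squared standard errors).
deff = se_stratified ** 2 / se_srs ** 2
print(round(deff, 2))
```

A value below one means the stratified design gave a smaller estimated variance than SRS for the same sample size.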
18
Estimation of the population total
Recall the population total is
$$\tau = N\mu$$
and its sample estimator (based on a SRS) is
$$y_T = N\bar{y}.$$
For a stratified sample,
$$y_{T,ST} = N\bar{y}_{ST}.$$
19
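$y_{T,ST}$ is an unbiased estimator of $\tau$. (Easy to show.)
$$\mathrm{Var}(y_{T,ST}) = \mathrm{Var}(N\bar{y}_{ST}) = N^2\,\mathrm{Var}(\bar{y}_{ST}) = N^2 \sum_{i=1}^{L} W_i^2 (1 - f_i)\frac{\sigma_i^2}{n_i} = \sum_{i=1}^{L} N_i^2 (1 - f_i)\frac{\sigma_i^2}{n_i},$$
where $W_i = N_i/N$.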
20
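Confidence intervals for $\mu$ and $\tau$
We have the usual normal approximations:
$$\bar{y}_{ST} \sim N\left(\mu,\ \mathrm{Var}(\bar{y}_{ST})\right)$$
and
$$y_{T,ST} \sim N\left(\tau,\ \mathrm{Var}(y_{T,ST})\right).$$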
21
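Approximate $100(1-\alpha)\%$ confidence intervals are given by
$$\bar{y}_{ST} \pm z_{\alpha/2}\,\mathrm{SE}(\bar{y}_{ST}) \quad\text{for } \mu$$
and
$$y_{T,ST} \pm z_{\alpha/2}\,\mathrm{SE}(y_{T,ST}) = N\bar{y}_{ST} \pm z_{\alpha/2}\,N\,\mathrm{SE}(\bar{y}_{ST}) \quad\text{for } \tau.$$
Why not use the t distribution? The number of degrees of freedom is unclear, so we use z instead.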
22
OS-born example
Population mean $\mu$:
$$\bar{y}_{ST} \pm z_{\alpha/2}\,\mathrm{SE}(\bar{y}_{ST}) = 7{,}577 \pm 1.96 \times 1{,}571 = 7{,}577 \pm 3{,}079 = (4{,}498,\ 10{,}656)$$
Includes $\mu$ = 8,502
23
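Population total $\tau$:
$$y_{T,ST} = N\bar{y}_{ST} = 182 \times 7{,}577 = 1{,}379{,}014$$
95% CI:
$$N\left(\bar{y}_{ST} \pm z_{\alpha/2}\,\mathrm{SE}(\bar{y}_{ST})\right) = 182 \times (4{,}498;\ 10{,}656) = (818{,}636;\ 1{,}939{,}392)$$
Includes $\tau$ = 1,547,364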
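The two intervals can be reproduced with the following Python sketch, which takes the point estimate and standard error from the earlier slides and assumes z = 1.96 for a 95% interval. The slides round the interval for the mean before multiplying by N, so the limits for the total computed here can differ slightly from the slide figures.

```python
N = 182          # number of LGAs
ybar_st = 7577   # stratified estimate of the mean
se_st = 1571     # its estimated standard error
z = 1.96         # z_{alpha/2} for alpha = 0.05

half_width = z * se_st
mean_ci = (ybar_st - half_width, ybar_st + half_width)

# The interval for the total is N times the interval for the mean.
total_ci = (N * mean_ci[0], N * mean_ci[1])

print([round(x) for x in mean_ci])
print([round(x) for x in total_ci])
```

Both intervals contain the true population values quoted on the slides (8,502 for the mean and 1,547,364 for the total).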
24
Choice of stratum sample sizes $n_i$
1. Proportional allocation
Sometimes the $N_i$'s and $n$ are known, but we need to work out the $n_i$'s ($n = n_1 + n_2 + \cdots + n_L$).
One approach is to insist on sampling the same proportion of each stratum, i.e.
$$\frac{n_i}{N_i} = f, \quad i = 1, \dots, L.$$
25
Then
$$n = \sum_{i=1}^{L} n_i = \sum_{i=1}^{L} f N_i = fN,$$
i.e.
$$f = \frac{n}{N}, \quad\text{or}\quad n_i = fN_i = n\,\frac{N_i}{N}.$$
This scheme is known as proportional allocation.
26
LGA example
Recall we had taken a sample of size 30 as follows:
$$\frac{n_1}{N_1} = \frac{10}{50}, \quad \frac{n_2}{N_2} = \frac{10}{51}, \quad \frac{n_3}{N_3} = \frac{10}{81}, \quad \frac{n}{N} = \frac{30}{182}.$$
27
If we are still going to sample a total of 30 units, but using the proportional allocation method, we would have:
$$f = \frac{n}{N} = \frac{30}{182} = 0.165$$
with
$$n_1 = 0.165 \times 50 = 8.24$$
$$n_2 = 0.165 \times 51 = 8.41$$
$$n_3 = 0.165 \times 81 = 13.35$$
Thus, we would take $n_1 = 8$, $n_2 = 9$, $n_3 = 13$.
Note: The numbers were sensibly rounded to keep the total size at 30.
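The proportional allocation above can be sketched in Python as follows. The slides only say the numbers were "sensibly rounded"; the largest-remainder rule used here (round down, then give leftover units to the strata with the largest fractional parts) is an assumption chosen because it reproduces the slide allocation. The function name is illustrative.

```python
from math import floor


def proportional_allocation(N_i, n):
    """Return (exact, rounded) proportional allocations that sum to n."""
    N = sum(N_i)
    exact = [n * Ni / N for Ni in N_i]
    alloc = [floor(x) for x in exact]
    # Assumed rounding rule: hand leftover units to the largest remainders.
    leftover = n - sum(alloc)
    by_remainder = sorted(range(len(N_i)),
                          key=lambda i: exact[i] - alloc[i],
                          reverse=True)
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return exact, alloc


exact, alloc = proportional_allocation([50, 51, 81], 30)
print([round(x, 2) for x in exact], alloc)
```

Simple rounding to the nearest integer would give 8, 8 and 13 here, which sums to only 29, so some rule for the leftover unit is needed.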
28
Choice of stratum sample sizes $n_i$
2. Optimal allocation
• An important objective of sample survey design is to provide estimates with small variances at the lowest possible cost.
• The best allocation scheme will depend on:
– The number of elements in each stratum ($N_i$)
– The variability of observations within each stratum ($\sigma_i^2$)
– The cost of obtaining an observation from each stratum ($c_i$).
29
• Let’s consider the question of how to choose the sample size n to satisfy certain precision and cost requirements.
• This will also involve choice of the ni.
30
Model for cost structure:
Overhead = cost of administering the survey = $c_0$
Cost of a single observation from stratum $i$ = $c_i$
Total cost of sampling across all $L$ strata:
$$C = c_0 + \sum_{i=1}^{L} c_i n_i$$
31
What allocation of stratum sample sizes $n_1, n_2, \dots, n_L$ should be chosen to
• Minimise $\mathrm{Var}(\bar{y}_{ST})$ for a given total cost $C$?
• Minimise the total cost $C$, for a given value of $\mathrm{Var}(\bar{y}_{ST})$?
32
Minimise variance for fixed cost, C
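We choose $n_1, \dots, n_L$ to minimise
$$\mathrm{Var}(\bar{y}_{ST}) = \sum_{i=1}^{L} W_i^2 (1 - f_i)\frac{\sigma_i^2}{n_i} = \sum_{i=1}^{L} W_i^2 \sigma_i^2 \left(\frac{1}{n_i} - \frac{1}{N_i}\right)$$
subject to the constraint
$$\sum_{i=1}^{L} c_i n_i = C - c_0.$$
Solution is by the method of Lagrange multipliers.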
33
Minimise cost for fixed variance, V
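We choose $n_1, \dots, n_L$ to minimise
$$c_0 + \sum_{i=1}^{L} c_i n_i$$
subject to the constraint
$$\mathrm{Var}(\bar{y}_{ST}) = \sum_{i=1}^{L} W_i^2 \sigma_i^2 \left(\frac{1}{n_i} - \frac{1}{N_i}\right) = V.$$
Solution is again by the method of Lagrange multipliers.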
34
Solution to optimal allocation problem
The allocation for minimising cost for fixed variance turns out to be the same as that for minimising variance for fixed cost.
35
The optimal allocation has $n_i$ proportional to
$$\frac{N_i \sigma_i}{\sqrt{c_i}}.$$
Thus the optimal sample size in stratum $i$ is
$$n_i = n\,\frac{N_i \sigma_i / \sqrt{c_i}}{\sum_{j=1}^{L} N_j \sigma_j / \sqrt{c_j}}.$$
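For readers who want to see where the proportionality comes from, here is a sketch of the Lagrange-multiplier argument for the fixed-cost problem. It treats the $n_i$ as continuous, and the multiplier symbol $\lambda$ is introduced here for illustration only.

$$\mathcal{L}(n_1, \dots, n_L, \lambda) = \sum_{i=1}^{L} W_i^2 \sigma_i^2 \left(\frac{1}{n_i} - \frac{1}{N_i}\right) + \lambda\left(\sum_{i=1}^{L} c_i n_i - (C - c_0)\right)$$

$$\frac{\partial \mathcal{L}}{\partial n_i} = -\frac{W_i^2 \sigma_i^2}{n_i^2} + \lambda c_i = 0 \quad\Longrightarrow\quad n_i = \frac{W_i \sigma_i}{\sqrt{\lambda c_i}} \propto \frac{N_i \sigma_i}{\sqrt{c_i}}$$

Dividing by $\sum_j n_j = n$ removes $\lambda$ and gives the expression for $n_i$ stated above.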
36
Thus sampling effort is directed to strata that:
• account for a large proportion of the population;
• have a large variance;
• have a low sampling cost.
37
When sampling costs are equal over the strata, i.e. $c_1 = c_2 = \cdots = c_L$, this reduces to
$$n_i = n\,\frac{N_i \sigma_i}{\sum_{j=1}^{L} N_j \sigma_j}.$$
This is known as Neyman allocation.
38
Further, if in addition the stratum variances are equal, i.e. $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_L^2$, then the allocation formula becomes
$$n_i = n\,\frac{N_i}{N} = fN_i,$$
which is proportional allocation.
Thus proportional allocation can be seen as a special case of Neyman allocation when all stratum variances are equal.
39
In order to compute the ni according to the Neyman allocation for sample size, we need to know:
• the stratum variances $\sigma_i^2$; and
• the total sample size n.
40
Stratum variances $\sigma_i^2$
These have to be estimated from previous surveys or other knowledge about the population.
41
Total sample size n
The determination of n depends on whether:
• the variance is being minimised for a fixed cost; or
• the total cost is being minimised for a fixed variance.
42
Total sample size required for a fixed total cost, C
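Substitute the $n_i$ expression on Slide 35 into the total cost constraint on Slide 32,
$$\sum_{i=1}^{L} c_i n_i = C - c_0,$$
and work out the expression for $n$:
$$n = \frac{(C - c_0) \sum_{i=1}^{L} N_i \sigma_i / \sqrt{c_i}}{\sum_{j=1}^{L} N_j \sigma_j \sqrt{c_j}}$$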
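A minimal Python sketch of this calculation follows. The stratum sizes and standard deviations are the LGA values from Slide 7, but the unit costs, overhead and budget are hypothetical numbers invented purely for illustration, as are the function names.

```python
from math import sqrt

N_i = [50, 51, 81]
sigma_i = [19964, 12384, 1488]
c_i = [4.0, 9.0, 16.0]  # hypothetical cost per observation in each stratum
c0 = 100.0              # hypothetical overhead cost
C = 500.0               # hypothetical total budget


def n_for_fixed_cost(N_i, sigma_i, c_i, C, c0):
    """Total n under optimal allocation when the total cost is fixed at C."""
    num = sum(Ni * s / sqrt(c) for Ni, s, c in zip(N_i, sigma_i, c_i))
    den = sum(Ni * s * sqrt(c) for Ni, s, c in zip(N_i, sigma_i, c_i))
    return (C - c0) * num / den


def optimal_allocation(N_i, sigma_i, c_i, n):
    """Split n across strata in proportion to N_i * sigma_i / sqrt(c_i)."""
    a = [Ni * s / sqrt(c) for Ni, s, c in zip(N_i, sigma_i, c_i)]
    return [n * ai / sum(a) for ai in a]


n = n_for_fixed_cost(N_i, sigma_i, c_i, C, c0)
n_alloc = optimal_allocation(N_i, sigma_i, c_i, n)
spent = sum(c * ni for c, ni in zip(c_i, n_alloc))
print(round(n, 1), [round(x, 1) for x in n_alloc], round(spent, 2))
```

Before rounding, the allocation spends exactly the sampling budget C - c0; in practice the n_i would be rounded to integers, so the cost would be met only approximately.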
43
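Total sample size required for a fixed $\mathrm{Var}(\bar{y}_{ST})$, $V$
Here we determine the total sample size $n$ that satisfies the requirement
$$\Pr\left(|\bar{y}_{ST} - \mu| < B\right) = 1 - \alpha, \quad\text{where } B = z_{\alpha/2}\sqrt{V(\bar{y}_{ST})}.$$
Based on this constraint,
$$\mathrm{Var}(\bar{y}_{ST}) = \sum_{i=1}^{L} W_i^2 \sigma_i^2 \left(\frac{1}{n_i} - \frac{1}{N_i}\right) = V = \frac{B^2}{z_{\alpha/2}^2},$$
and (contd.)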
44
following the usual logic we get
$$n = \frac{\sum_{i=1}^{L} N_i^2 \sigma_i^2 / w_i}{N^2 B^2 / z_{\alpha/2}^2 + \sum_{i=1}^{L} N_i \sigma_i^2}$$
where $w_i = n_i / n$ is the proportion of observations allocated to stratum $i$.
45
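Substituting
$$w_i = \frac{n_i}{n} = \frac{N_i \sigma_i / \sqrt{c_i}}{\sum_{j=1}^{L} N_j \sigma_j / \sqrt{c_j}}$$
we get
$$n = \frac{\left(\sum_{k=1}^{L} N_k \sigma_k / \sqrt{c_k}\right)\left(\sum_{i=1}^{L} N_i \sigma_i \sqrt{c_i}\right)}{N^2 V + \sum_{i=1}^{L} N_i \sigma_i^2}$$
where $\mathrm{Var}(\bar{y}_{ST})$ is fixed at $V$, ie, $V = B^2 / z_{\alpha/2}^2$.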
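The formula can be sanity-checked numerically. The Python sketch below uses the LGA stratum sizes and standard deviations from Slide 7 with equal costs, and a hypothetical bound B = 3,000 chosen only for illustration. It then confirms that the optimal allocation of the resulting n gives back the target variance V. Function names are illustrative.

```python
from math import sqrt

N_i = [50, 51, 81]
sigma_i = [19964, 12384, 1488]
c_i = [1.0, 1.0, 1.0]  # equal costs, so this is Neyman allocation
B = 3000.0             # hypothetical bound on the error of estimation
z = 1.96               # z_{alpha/2} for alpha = 0.05
V = (B / z) ** 2       # target variance


def n_for_fixed_variance(N_i, sigma_i, c_i, V):
    """Total n under optimal allocation when Var(ybar_ST) is fixed at V."""
    N = sum(N_i)
    a = sum(Ni * s / sqrt(c) for Ni, s, c in zip(N_i, sigma_i, c_i))
    b = sum(Ni * s * sqrt(c) for Ni, s, c in zip(N_i, sigma_i, c_i))
    return a * b / (N ** 2 * V + sum(Ni * s ** 2 for Ni, s in zip(N_i, sigma_i)))


def var_st(N_i, sigma_i, n_i):
    """Var(ybar_ST) = sum of W_i^2 * sigma_i^2 * (1/n_i - 1/N_i)."""
    N = sum(N_i)
    return sum((Ni / N) ** 2 * s ** 2 * (1 / ni - 1 / Ni)
               for Ni, s, ni in zip(N_i, sigma_i, n_i))


n = n_for_fixed_variance(N_i, sigma_i, c_i, V)
weights = [Ni * s / sqrt(c) for Ni, s, c in zip(N_i, sigma_i, c_i)]
n_alloc = [n * w / sum(weights) for w in weights]
achieved = var_st(N_i, sigma_i, n_alloc)
print(round(n, 1), round(sqrt(achieved)))
```

The computed n is not an integer; in practice it would be rounded up and the n_i rounded sensibly, so the achieved variance would be slightly different from V.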
46
Optimal allocation: LGA example
• We would like to sample a total of 30 ($n = 30$) LGAs.
• We assume that sampling costs in the strata are equal, i.e. $c_1 = c_2 = c_3$.
• We know the values of $\sigma_i$: from Slide 7, $\sigma_1 = 19{,}964$; $\sigma_2 = 12{,}384$; $\sigma_3 = 1{,}488$.
• Usually we would not know the values of $\sigma_i$. In that case we would estimate $\sigma_i$ from the previous census.
47
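We use the Neyman allocation formula:
$$n_i = n\,\frac{N_i \sigma_i}{\sum_{j=1}^{L} N_j \sigma_j}$$
$$n_1 = 30 \times \frac{50 \times 19{,}964}{50 \times 19{,}964 + 51 \times 12{,}384 + 81 \times 1{,}488} = 30 \times \frac{998{,}200}{1{,}750{,}312} = 17.1$$
$$n_2 = 30 \times \frac{51 \times 12{,}384}{1{,}750{,}312} = 10.8$$
$$n_3 = 30 \times \frac{81 \times 1{,}488}{1{,}750{,}312} = 2.1$$
We take $n_1 = 17$, $n_2 = 11$, $n_3 = 2$.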
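The Neyman allocation for the LGA example can be reproduced with this short Python sketch. The function name is illustrative, and rounding each value to the nearest integer is an assumption that happens to keep the total at 30 here; in general the rounded values may need adjusting.

```python
def neyman_allocation(N_i, sigma_i, n):
    """Allocate n across strata in proportion to N_i * sigma_i."""
    weights = [Ni * s for Ni, s in zip(N_i, sigma_i)]
    total = sum(weights)
    return [n * w / total for w in weights]


N_i = [50, 51, 81]
sigma_i = [19964, 12384, 1488]  # stratum standard deviations from Slide 7

exact = neyman_allocation(N_i, sigma_i, 30)
rounded = [round(x) for x in exact]
print([round(x, 1) for x in exact], rounded)
```

Stratum 3 holds 81 of the 182 LGAs but receives only 2 of the 30 units, because its standard deviation is much smaller than in the other two strata.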
48
• Recall that the proportional allocation was $n_1 = 8$, $n_2 = 9$, $n_3 = 13$ (shown on Slide 27), where Stratum 3 received a high allocation of observations (sampling units) because this stratum accounts for 81/182 = 45% of LGAs.
• However, in the Neyman allocation we take into account the fact that Stratum 3 has the lowest variance across all three strata, and accordingly allocate it very few units to be sampled.
49
Results: Neyman allocation sample

Stratum 1
Variable   N   Mean  Median  TrMean  StDev  SE Mean
OSBorn s  17  26788   26358   25357  18428     4469

Stratum 2
Variable   N   Mean  Median  TrMean  StDev  SE Mean
OSBorn s  11   2661    1455    2235   2764      833

Stratum 3
Variable   N   Mean  Median  TrMean  StDev  SE Mean
OSBorn s   2  294.0   294.0   294.0  118.8     84.0
50
$$\bar{y}_{ST} = \frac{50}{182}(26{,}788) + \frac{51}{182}(2{,}661) + \frac{81}{182}(294.0) = 8{,}235.9$$
51
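$$\widehat{\mathrm{Var}}(\bar{y}_{ST}) = \sum_{i=1}^{3} \left(\frac{N_i}{N}\right)^2 (1 - f_i)\,\frac{s_i^2}{n_i}$$
$$= \left(\frac{50}{182}\right)^2 \left(1 - \frac{17}{50}\right) \frac{18{,}428^2}{17} + \left(\frac{51}{182}\right)^2 \left(1 - \frac{11}{51}\right) \frac{2{,}764^2}{11} + \left(\frac{81}{182}\right)^2 \left(1 - \frac{2}{81}\right) \frac{118.8^2}{2}$$
$$= 1{,}039{,}195$$
$$\widehat{\mathrm{SE}}(\bar{y}_{ST}) = \sqrt{1{,}039{,}195} = 1{,}019.$$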
52
True variances of the estimators
• Because we have complete population information, we are in fact in a position to compute the actual variances ($\sigma^2_{\bar{y}}$, ie, $\mathrm{Var}(\bar{y})$) of the estimators of $\mu$.
• The variances we have computed until now have been estimates ($s^2_{\bar{y}}$, ie, $\widehat{\mathrm{Var}}(\bar{y})$), based on the samples that we drew.
53
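For a SRS (n=30):
$$\mathrm{Var}(\bar{y}) = \sigma^2_{\bar{y}} = (1 - f)\frac{\sigma^2}{n} = \left(1 - \frac{30}{182}\right) \frac{16{,}237^2}{30} = 7{,}339{,}433$$
$$\mathrm{SE}(\bar{y}) = \sqrt{7{,}339{,}433} = 2{,}709.$$
($\sigma = 16{,}237$ is from Slide 7 last week.)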
54
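For a stratified sample of size 30 with equal allocation:
$$\mathrm{Var}(\bar{y}_{ST}) = \sigma^2_{\bar{y}_{ST}} = \sum_{i=1}^{3} \left(\frac{N_i}{N}\right)^2 (1 - f_i)\,\frac{\sigma_i^2}{n_i}$$
$$= \left(\frac{50}{182}\right)^2 \left(1 - \frac{10}{50}\right) \frac{19{,}964^2}{10} + \left(\frac{51}{182}\right)^2 \left(1 - \frac{10}{51}\right) \frac{12{,}384^2}{10} + \left(\frac{81}{182}\right)^2 \left(1 - \frac{10}{81}\right) \frac{1{,}488^2}{10}$$
$$= 3{,}413{,}051$$
$$\mathrm{SE}(\bar{y}_{ST}) = \sqrt{3{,}413{,}051} = 1{,}847.$$
($\sigma_i$'s from Slide 7.)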
55
Stratified, proportional allocation (n=30)
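$$\mathrm{Var}(\bar{y}_{ST}) = \sigma^2_{\bar{y}_{ST}} = \sum_{i=1}^{3} \left(\frac{N_i}{N}\right)^2 (1 - f_i)\,\frac{\sigma_i^2}{n_i}$$
$$= \left(\frac{50}{182}\right)^2 \left(1 - \frac{8}{50}\right) \frac{19{,}964^2}{8} + \left(\frac{51}{182}\right)^2 \left(1 - \frac{9}{51}\right) \frac{12{,}384^2}{9} + \left(\frac{81}{182}\right)^2 \left(1 - \frac{13}{81}\right) \frac{1{,}488^2}{13}$$
$$= 4{,}288{,}762$$
$$\mathrm{SE}(\bar{y}_{ST}) = \sqrt{4{,}288{,}762} = 2{,}071.$$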
56
Stratified, Neyman allocation (n=30)
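$$\mathrm{Var}(\bar{y}_{ST}) = \sum_{i=1}^{3} \left(\frac{N_i}{N}\right)^2 (1 - f_i)\,\frac{\sigma_i^2}{n_i}$$
$$= \left(\frac{50}{182}\right)^2 \left(1 - \frac{17}{50}\right) \frac{19{,}964^2}{17} + \left(\frac{51}{182}\right)^2 \left(1 - \frac{11}{51}\right) \frac{12{,}384^2}{11} + \left(\frac{81}{182}\right)^2 \left(1 - \frac{2}{81}\right) \frac{1{,}488^2}{2}$$
$$= 2{,}240{,}369$$
$$\mathrm{SE}(\bar{y}_{ST}) = \sqrt{2{,}240{,}369} = 1{,}497.$$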
57
Variances of the estimators: summary
Sampling scheme      SE      deff
SRS                  2,709   -
Stratified:
  Equal              1,847   0.46
  Proportional       2,071   0.58
  Neyman             1,497   0.31
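The whole summary table can be regenerated from the population values, as in this Python sketch. It uses the stratum standard deviations from Slide 7 and the overall population standard deviation of 16,237 quoted from last week; the names are illustrative. The design effects are computed here from unrounded variances, so they can differ from the table in the second decimal place.

```python
from math import sqrt

N_i = [50, 51, 81]
sigma_i = [19964, 12384, 1488]  # true stratum standard deviations
sigma = 16237                   # true population standard deviation
n = 30
N = sum(N_i)

# True variance of the sample mean under SRS.
var_srs = (1 - n / N) * sigma ** 2 / n


def var_st(n_i):
    """True variance of the stratified mean for a given allocation n_i."""
    return sum((Ni / N) ** 2 * (1 - ni / Ni) * s ** 2 / ni
               for Ni, s, ni in zip(N_i, sigma_i, n_i))


allocations = {
    "Equal": [10, 10, 10],
    "Proportional": [8, 9, 13],
    "Neyman": [17, 11, 2],
}

results = {"SRS": (sqrt(var_srs), None)}
for name, n_i in allocations.items():
    v = var_st(n_i)
    results[name] = (sqrt(v), v / var_srs)

for name, (se, deff) in results.items():
    print(name, round(se), "-" if deff is None else round(deff, 2))
```

Proportional allocation does worse than equal allocation in this example because it moves units away from Stratum 1, which has the largest variance.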
58
Comparison of SRS mean and stratified sample mean
The point of stratifying is the expectation that $\bar{y}_{ST}$ will be a more precise estimator of $\mu$ than $\bar{y}$ from a SRS, ie,
$$\mathrm{Var}(\bar{y}_{ST}) < \mathrm{Var}(\bar{y}).$$
We can see, for the LGA example, that stratification has achieved this, for all three allocation methods that we tried.
59
It can be shown that
$$\mathrm{Var}(\bar{y}_{ST}) < \mathrm{Var}(\bar{y})$$
if variation between stratum means is large compared with within-stratum variation.
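A sketch of why this holds, for the special case of proportional allocation and under the assumption that the $N_i$ are large enough that $N_i - 1 \approx N_i$ and $N - 1 \approx N$. The symbol $\mu_i$ for the mean of stratum $i$ is introduced here for illustration. With $n_i = nN_i/N$ we have $f_i = f$, so

$$\mathrm{Var}(\bar{y}_{ST}) = \frac{1 - f}{n} \sum_{i=1}^{L} W_i \sigma_i^2,$$

while the population variance splits approximately into within-stratum and between-stratum parts,

$$\sigma^2 \approx \sum_{i=1}^{L} W_i \sigma_i^2 + \sum_{i=1}^{L} W_i (\mu_i - \mu)^2.$$

Hence

$$\mathrm{Var}(\bar{y}) - \mathrm{Var}(\bar{y}_{ST}) \approx \frac{1 - f}{n} \sum_{i=1}^{L} W_i (\mu_i - \mu)^2 \ge 0,$$

so the gain from stratifying grows with the spread of the stratum means.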
60
This means that there will be a benefit to stratification (in terms of increased precision of the estimates) if
• within-stratum variances are small; and
• there are large differences between the stratum means
61
• With stratification we therefore aim to subdivide the population into homogeneous layers/groups/sections. (See next slide)
• In the LGA example, another reasonable stratification strategy would have been
– urban
– rural
These are two fairly homogeneous subsections of the population.
62
[Figure: Boxplots of OS Born by the three strata. x-axis: stratum (1, 2, 3); y-axis: OSBorn P (0 to 100000)]
63
Comments on the stratification
• We can see that we have achieved small within-stratum variances in strata 2 and 3, but stratum 1 still has a large variance.
• We have successfully separated out the section of the population with a much larger mean.
• The sampling scheme could be further improved by a finer stratification within Sydney.
64
Estimation of other parameters
Estimation of the population total $\tau$ and proportion $p$ on the basis of a stratified sample follows the same lines as estimation of the population mean $\mu$.
65
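Population total $\tau$
$$\hat{\tau}_{ST} = N\bar{y}_{ST} = \sum_{i=1}^{L} N_i \bar{y}_i$$
$$\widehat{\mathrm{Var}}(\hat{\tau}_{ST}) = N^2\,\widehat{\mathrm{Var}}(\bar{y}_{ST}) = \sum_{i=1}^{L} N_i^2 (1 - f_i)\frac{s_i^2}{n_i}$$
Note: For sample size calculations ($n$ or $n_i$), you may use the formulas for estimating the mean, $\mu$.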
66
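Population proportion $p$
$$\hat{p}_{ST} = \sum_{i=1}^{L} \frac{N_i}{N}\,\hat{p}_i$$
$$\widehat{\mathrm{Var}}(\hat{p}_{ST}) = \sum_{i=1}^{L} \left(\frac{N_i}{N}\right)^2 \widehat{\mathrm{Var}}(\hat{p}_i) = \sum_{i=1}^{L} \left(\frac{N_i}{N}\right)^2 (1 - f_i)\frac{\hat{p}_i \hat{q}_i}{n_i - 1}$$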
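A small Python sketch of the stratified proportion estimator follows. The stratum sizes are those of the LGA example, but the sample proportions are hypothetical values invented for illustration, and the function name is an assumption. The variance uses the $n_i - 1$ divisor.

```python
from math import sqrt


def stratified_proportion(N_i, n_i, p_i):
    """Return (p_ST, estimated SE) from stratum sample proportions p_i."""
    N = sum(N_i)
    p_st = sum(Ni / N * p for Ni, p in zip(N_i, p_i))
    var = sum((Ni / N) ** 2 * (1 - ni / Ni) * p * (1 - p) / (ni - 1)
              for Ni, ni, p in zip(N_i, n_i, p_i))
    return p_st, sqrt(var)


N_i = [50, 51, 81]
n_i = [10, 10, 10]
p_i = [0.6, 0.3, 0.1]  # hypothetical sample proportions in each stratum

p_st, se_p = stratified_proportion(N_i, n_i, p_i)
print(round(p_st, 3), round(se_p, 3))
```

As with the mean, the estimate is a weighted average of the stratum sample proportions with weights N_i / N.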
67
Sample Size
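• The total sample size, $n$, required to estimate $P$ within "$d$" percentage points for fixed $\mathrm{Var}(\hat{p}_{ST})$ is
$$n = \frac{\sum_{i=1}^{L} \left\{ N_i^2\, p_i (1 - p_i) / w_i \right\}}{N^2 d^2 / z_{\alpha/2}^2 + \sum_{i=1}^{L} \left\{ N_i\, p_i (1 - p_i) \right\}}$$
where $w_i = n_i / n$ for stratum $i$, and the stratum sample size, $n_i$, depends on the type of allocation used.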
68
Obtain a stratified sample in Minitab
Eg: Let's draw a stratified sample that consists of a SRS of size n=10 from each of the three broad LGA strata (to get a total sample size of 30) as described on Slide 8.
Run Minitab, and open the LGA.mtw file;
Code the SD_id variable in the file into a stratum variable, say named ST_id, coded as 1, 2, or 3 according to Slide 6, as follows:
‾ Click the Data button on the top menu, choose Code and then Numeric to Numeric; the dialog window will appear as shown on the next slide.
69
Code – Numeric to Numeric window:
70
Obtain a stratified sample in Minitab (Contd.):
‾ Click OK button on the Code window;
‾ Click Data button on top menu, and choose Split Worksheet;
‾ On the dialog window, select the column containing the stratum code, then click the OK button;
Draw a SRS of size 10 from each stratum worksheet, following the procedures given in Week 9 lecture.
Finally you have a stratified sample consisting of those three (stratum) samples of size 10 obtained.
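For readers without Minitab, the same steps (recode SD_id into a stratum, split by stratum, draw a SRS of 10 from each) can be sketched in Python using only the standard library. The LGA.mtw file is not reproduced here, so the sampling frame below is built from the SD counts on Slide 2 with placeholder LGA labels; those labels and the random seed are assumptions for illustration.

```python
import random

# Number of LGAs in each Statistical Division (from the SD table).
sd_sizes = {5: 46, 10: 14, 15: 4, 20: 7, 25: 11, 30: 20,
            35: 14, 40: 14, 45: 19, 50: 14, 55: 16, 60: 3}

# Recode SD_id to stratum as on Slide 6; any SD not listed is Stratum 3.
stratum_of_sd = {5: 1, 15: 1, 10: 2, 20: 2, 25: 2, 45: 2}

# Placeholder frame: one (label, SD_id) pair per LGA.
frame = [(f"LGA_{sd}_{k}", sd)
         for sd, size in sd_sizes.items()
         for k in range(1, size + 1)]

# Split the frame by stratum (the "Split Worksheet" step).
strata = {1: [], 2: [], 3: []}
for lga, sd in frame:
    strata[stratum_of_sd.get(sd, 3)].append(lga)

# Draw a SRS of size 10 without replacement from each stratum.
rng = random.Random(2019)  # arbitrary seed, for reproducibility only
sample = {h: rng.sample(units, 10) for h, units in strata.items()}

for h in sorted(sample):
    print(h, len(strata[h]), sample[h][:3])
```

The three stratum samples together form the stratified sample of size 30.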
71
SGTA exercises Week 10
(Try completing all exercises listed below before your SGTA class in Week 11.)
Lohr 3.9 Exercises:
• 9
• 6
• 7
Note: Questions and data file are available on the unit iLearn.