STA304H1F/1003HF Summer 2015: Lecture 10
Lecture 10, June 16, 2015

We should learn:
- When is systematic sampling not appropriate?
- When is systematic sampling nearly equivalent to SRS?
- Why is repeated systematic sampling used?
- What is one-stage cluster sampling?
- What is the trade-off in cluster sampling?
- What is two-stage cluster sampling?
Cluster Sampling (Ch. 8)
- What is a cluster?
- What is cluster sampling?
- psu: primary sampling units
- ssu: secondary sampling units
- Types of cluster samples:
  - one-stage cluster sampling
  - two-stage cluster sampling
- Why?
Cluster Sampling (Ch. 8)
- A cluster is a natural contiguous grouping of population elements.
- Cluster sampling: a probability sample (e.g. an SRS) of clusters.
- psu: the clusters of population elements
- ssu: the elements of interest
- Types of cluster samples:
  - one-stage: all ssu's in a randomly selected psu are in the sample
  - two-stage: ssu's are themselves randomly sampled within each selected psu
- Why? It is easier to get a sampling frame of clusters than of the elements of interest, and it is cheaper/more convenient to sample contiguous units.
Notation (§8.3)
- m_i = size of cluster i
- n = number of clusters in the sample
- N = number of clusters in the population
- M = ∑_{i=1}^N m_i = population size
- Therefore:
  - M̄ = M/N = (∑_{i=1}^N m_i)/N = average cluster size for all clusters
  - m̄ = (∑_{i=1}^n m_i)/n = average cluster size for the sample of clusters
Example: Lohr §5.2.1
"A student wants to estimate the average GPA in his dormitory. Instead of obtaining a listing of all students in the dormitory and conducting an SRS, he notices that the dorm contains 100 suites, each with four students; he chooses 5 of those suites at random and asks every person in the 5 suites what her or his GPA is. The results are as follows:"

Suite   Person 1   Person 2   Person 3   Person 4   Total
1       3.08       2.60       3.44       3.04       12.16
...
5       2.68       1.92       3.28       3.20       11.08

Thus, n = 5, N = 100, m_i = 4, M = 400.
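As a quick check on the notation, the ratio estimate of the mean GPA can be computed from the five suite totals (taken from the full table shown later in this lecture); a minimal Python sketch, not part of the original slides:

```python
# One-stage cluster sample: n = 5 suites (clusters) of m = 4 students,
# from a dorm with N = 100 suites. Suite totals y_i. from the example.
suite_totals = [12.16, 11.36, 8.96, 12.96, 11.08]
m_i = [4] * 5                      # cluster sizes (all equal here)

n, N = 5, 100
M = sum(m_i) / n * N               # population size: 4 * 100 = 400 students

# Ratio estimator of the mean GPA: ybar = sum of y_i. / sum of m_i
ybar = sum(suite_totals) / sum(m_i)
print(round(ybar, 3))              # 2.826
```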
Estimation (§8.3, 8.6): Ratio vs unbiased

(Here y_i is the total of the response over sampled cluster i, and a_i is a count within cluster i.)

       Ratio estimator                          Unbiased estimator
μ:     ȳ = ∑_{i=1}^n y_i / ∑_{i=1}^n m_i
τ:     M ȳ                                      N ȳ_t = N ∑_{i=1}^n y_i / n
p:     p̂ = ∑_{i=1}^n a_i / ∑_{i=1}^n m_i

- For ratio estimation, the auxiliary variable is cluster size, m_i.
- We need M for the ratio estimator of τ.
- The unbiased estimator does not use m_i; it may be less precise than M ȳ.
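For the GPA example the two estimators of the total τ coincide, because all cluster sizes are equal; a sketch (suite totals from the table later in the lecture):

```python
# Ratio vs unbiased estimators of the population total tau,
# dorm GPA example: equal cluster sizes, so the two agree.
suite_totals = [12.16, 11.36, 8.96, 12.96, 11.08]   # cluster totals y_i
m_i = [4] * 5
n, N, M = 5, 100, 400

tau_ratio = M * sum(suite_totals) / sum(m_i)   # M * ybar
tau_unbiased = N * sum(suite_totals) / n       # N * ybar_t
print(round(tau_ratio, 1), round(tau_unbiased, 1))   # 1130.4 1130.4
```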
Cluster sampling: proportions (§8.6 and Example 8.9)
- See Table 8.2: cluster 1: m_1 = 8, a_1 = 4; cluster 2: m_2 = 12, a_2 = 7; etc.
- a_i is the number of residents in cluster i renting their homes.
- Ratio estimate of the population proportion of renters:
  p̂ = ∑_{i=1}^n a_i / ∑_{i=1}^n m_i = 72/151 = 0.48
- Variance estimate? p̂ is just a ratio estimator:
  V̂(p̂) = (1 − n/N) s_p² / (n M̄²),  where  s_p² = ∑_{i=1}^n (a_i − p̂ m_i)² / (n − 1)
- Why not use p̂_i(1 − p̂_i)? It would be too small, because of the cluster sampling.
- Omit §8.7 and §8.8.
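The slide lists only the first two clusters of Table 8.2, so the sketch below uses hypothetical counts (clusters 1–2 as given, a made-up cluster 3, and a made-up N) purely to illustrate the formulas:

```python
# Ratio estimate of a proportion from a one-stage cluster sample.
# m[i]: residents in sampled cluster i; a[i]: renters among them.
# Cluster 3 and N are hypothetical, for illustration only.
m = [8, 12, 10]
a = [4, 7, 5]
n, N = len(m), 100

p_hat = sum(a) / sum(m)                         # 16/30
# s_p^2 = sum (a_i - p_hat * m_i)^2 / (n - 1)
s2_p = sum((ai - p_hat * mi) ** 2 for ai, mi in zip(a, m)) / (n - 1)
m_bar = sum(m) / n                              # sample estimate of M-bar
v_hat = (1 - n / N) * s2_p / (n * m_bar ** 2)   # V-hat(p_hat)
print(round(p_hat, 3), round(s2_p, 4))
```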
Cluster sampling: how many clusters? (§8.5)
- As always, a bound B = 2√V̂ is used to determine the sample size.
- For the ratio estimator, this depends on σ_r², M̄, N, and n.
- When the first three are known, can be guessed, or are available from a preliminary study, solve 2√V = B for n.
- There is a trade-off between n and m_i.
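Solving 2√V = B for n can be done in closed form once the variance is written out; this sketch uses a variance of the form V(ȳ) = (1 − n/N)σ_r²/(nM̄²), matching the structure of the ratio-estimator variance on the previous slide, and entirely hypothetical values of N, M̄, σ_r², and B:

```python
# Number of clusters n from the bound B = 2*sqrt(V),
# with V(ybar) = (1 - n/N) * sigma_r2 / (n * Mbar^2).
# All four inputs below are hypothetical.
import math

N = 100         # clusters in the population
Mbar = 10       # average cluster size
sigma_r2 = 400  # guessed from a preliminary study
B = 0.5         # desired bound on the error of estimation

# Algebra: B^2/4 = (1 - n/N) sigma_r2 / (n Mbar^2)
#   =>  n = N sigma_r2 / (N D + sigma_r2),  D = B^2 Mbar^2 / 4
D = B ** 2 * Mbar ** 2 / 4
n = N * sigma_r2 / (N * D + sigma_r2)
print(math.ceil(n))   # 40 (round up to whole clusters)
```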
... cluster sizes equal
- So how do we assess the trade-off?
- Special case: all clusters of the same size, m, and
  - there are n clusters in the sample,
  - M = Nm elements in the population, and
  - the total sample size is nm.

              elements (ssu's)
              1     2     ...   m      total
cluster   1   y11   y12   ...   y1m    y1.
(psu)     2   y21   y22   ...   y2m    y2.
          ⋮    ⋮     ⋮           ⋮      ⋮
          n   yn1   yn2   ...   ynm    yn.
... cluster sizes equal
- Equivalence: the two estimates of the population total agree: M ȳ = N ȳ_t.
- The estimator of the mean and its estimated variance are
  ȳ = ȳ.. = ∑_{i=1}^n ȳ_i. / n = ∑_{i,j} y_ij / (mn),
  V̂(ȳ) = (1 − n/N) (1/m²) (s_r²/n)
  where
  s_r² = (1/(n−1)) ∑_{i=1}^n (m ȳ_i. − m ȳ..)² = (m²/(n−1)) ∑_{i=1}^n (ȳ_i. − ȳ..)²
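Applying these equal-size formulas to the GPA example (suite totals from the table shown just below) gives, as a sketch:

```python
# ybar and V-hat(ybar) for equal cluster sizes, dorm GPA example:
# n = 5 suites sampled from N = 100, each of size m = 4.
suite_totals = [12.16, 11.36, 8.96, 12.96, 11.08]
n, N, m = 5, 100, 4

ybar_i = [t / m for t in suite_totals]    # cluster means ybar_i.
ybar = sum(ybar_i) / n                    # overall mean ybar..

# s_r^2 = m^2/(n-1) * sum (ybar_i. - ybar..)^2
s2_r = m ** 2 / (n - 1) * sum((y - ybar) ** 2 for y in ybar_i)
v_hat = (1 - n / N) * (1 / m ** 2) * (s2_r / n)
print(round(ybar, 3), round(s2_r, 4), round(v_hat, 4))
```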
... cluster sizes equal (p. 261, §8.4)

∑_{i=1}^n ∑_{j=1}^m (y_ij − ȳ..)² = ∑_{i=1}^n ∑_{j=1}^m (y_ij − ȳ_i.)² + ∑_{i=1}^n ∑_{j=1}^m (ȳ_i. − ȳ..)²
                                  = ∑_{i=1}^n ∑_{j=1}^m (y_ij − ȳ_i.)² + m ∑_{i=1}^n (ȳ_i. − ȳ..)²

SST = SSW + SSB
SST = n(m − 1) MSW + (n − 1) MSB
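The decomposition can be checked numerically on the GPA data from this example; a sketch:

```python
# Check SST = SSW + SSB on the dorm GPA data (5 suites of 4).
gpa = [
    [3.08, 2.60, 3.44, 3.04],
    [2.36, 3.04, 3.28, 2.68],
    [2.00, 2.56, 2.52, 1.88],
    [3.00, 2.88, 3.44, 3.64],
    [2.68, 1.92, 3.28, 3.20],
]
n, m = len(gpa), len(gpa[0])

grand = sum(sum(row) for row in gpa) / (n * m)          # ybar..
means = [sum(row) / m for row in gpa]                   # ybar_i.

sst = sum((y - grand) ** 2 for row in gpa for y in row)
ssw = sum((y - means[i]) ** 2 for i, row in enumerate(gpa) for y in row)
ssb = m * sum((mu - grand) ** 2 for mu in means)
print(round(sst, 4), round(ssw, 4), round(ssb, 4))      # 5.0313 2.7756 2.2557
```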
... Example (Lohr, §5.2.1)

Suite (Cluster)   Person 1   Person 2   Person 3   Person 4   Total
1                 3.08       2.60       3.44       3.04       12.16
2                 2.36       3.04       3.28       2.68       11.36
3                 2.00       2.56       2.52       1.88       8.96
4                 3.00       2.88       3.44       3.64       12.96
5                 2.68       1.92       3.28       3.20       11.08
(mean cluster total: 11.304)

The variation between clusters and within clusters can be described in an ANALYSIS OF VARIANCE table.

Source           degrees of   Sum of    Mean
                 freedom      Squares   Square
Between Suites    4           2.2557    0.56392
Within Suites    15           2.7756    0.18504
Total            19           5.0313    0.26480
... Example, using R
> GPA = scan()
1: 308 260 344 304 236 304 328 268 200 256
11: 252 188 300 288 344 364 268 192 328 320
21:
Read 20 items
> GPA = GPA/100
> suite = factor(rep(1:5,each=4))
> suite
[1] 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5
> anova(aov(GPA ~ suite))
Analysis of Variance Table
Response: GPA
Df Sum Sq Mean Sq F value Pr(>F)
suite 4 2.2557 0.56392 3.0476 0.05039 .
Residuals 15 2.7756 0.18504
---
... cluster sizes equal
- The text compares this variance estimate to that from an SRS,
- for which we need an estimate of the population variance, σ².
- Sadly, SSTotal/(nm − 1) is not a good estimate (because of the clustering).
- Instead, we use
  s² = [N(m − 1) MSW + (N − 1) MSB] / (Nm − 1) ≈ (1/m){(m − 1) MSW + MSB}
- For the GPA example s² = 0.279, and the efficiency of cluster sampling ≈ 0.5.
- See Example 8.5 and p. 263 of the new edition for a different example, where cluster sampling happens to be more efficient than SRS (unusual).
- HW: Exercises 8.2, 3, 4, 5, 6, 16, 20, 21, 22.
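A sketch reproducing these numbers from the ANOVA table; "efficiency" here is read as the ratio of the estimated SRS variance to the cluster-sampling variance for the same total sample size nm (my interpretation of the slide):

```python
# s^2 and the efficiency of cluster sampling, GPA example.
n, N, m = 5, 100, 4
MSW, MSB = 0.18504, 0.56392        # from the ANOVA table

# Estimated population variance (not SSTotal/(nm - 1)):
s2 = (N * (m - 1) * MSW + (N - 1) * MSB) / (N * m - 1)

s2_r = m * MSB                     # since MSB = m/(n-1) * sum(ybar_i - ybar)^2
v_cluster = (1 - n / N) * s2_r / (n * m ** 2)
v_srs = (1 - n * m / (N * m)) * s2 / (n * m)   # SRS of nm = 20 students
efficiency = v_srs / v_cluster
print(round(s2, 3), round(efficiency, 2))      # 0.279 0.49 (i.e. about 0.5)
```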
One-stage cluster sampling summary
- There are two types of estimation methods in cluster sampling: unbiased and ratio.
- We aim to have high within-cluster variability and, on the other hand, low between-cluster variability.
- There is a trade-off between the number of clusters (n) and the size of the clusters (m_i).
- When cluster sizes are equal, the unbiased and ratio estimators are the same. The results can be displayed in an analysis of variance table.
Two-stage cluster sampling (Ch. 9)
- Select a sample of clusters, called primary sampling units (psu's),
  - usually by SRS.
- Select a random sample of units within each cluster, called secondary sampling units (ssu's),
  - also often by SRS,
  - but stratified random sampling could be used at any step,
  - or even more complicated probability sampling methods.
- As with one-stage cluster sampling: easier/cheaper/safer, due to the (often) geographic proximity of elements within a cluster.
- Can be used without a list of all population elements; we just need a list of population clusters, and then a list of ssu's for each sampled cluster.
- Example: sample universities (psu's); sample students at the chosen universities (ssu's).
... two-stage cluster sampling
- As with one-stage cluster sampling, we hope that the clusters represent the population well,
- so are quite variable within each cluster, but similar between clusters,
- although the opposite usually happens.
- The sample data have the structure:

  y11, y12, ..., y1m1    cluster 1
  y21, y22, ..., y2m2    cluster 2
  ⋮
  yn1, yn2, ..., ynmn    cluster n
Estimation in two-stage cluster sampling (§9.4)
- We no longer know the cluster totals y_i. = ∑_{j=1}^{m_i} y_ij,
- so now we estimate the cluster totals first, by M_i ȳ_i.,
- and then use these estimates to estimate the population total and mean:
  μ̂ = (1/M̄) ∑_{i=1}^n M_i ȳ_i. / n
- This assumes we know the size of each cluster, M_i, as well as the sample size m_i,
- and the average cluster size in the whole population, M̄ = ∑_{i=1}^N M_i / N.
- We can also use ratio estimation, as in Ch. 8.
... UNBIASED estimation, two-stage cluster sampling

μ̂ = (1/M̄) ∑_{i=1}^n M_i ȳ_i. / n

V̂(μ̂) = (1/M̄²) { (1 − n/N) s_b²/n + (1/(nN)) ∑_{i=1}^n M_i² (1 − m_i/M_i) s_i²/m_i }

where

s_b² = ∑_{i=1}^n (M_i ȳ_i. − M̄ μ̂)² / (n − 1)

and

s_i² = ∑_{j=1}^{m_i} (y_ij − ȳ_i.)² / (m_i − 1)
... RATIO estimation, two-stage cluster sampling

μ̂_r = ∑_{i=1}^n M_i ȳ_i. / ∑_{i=1}^n M_i

V̂(μ̂_r) = (1/M̄²) { (1 − n/N) s_r²/n + (1/(nN)) ∑_{i=1}^n M_i² (1 − m_i/M_i) s_i²/m_i }

where

s_r² = ∑_{i=1}^n (M_i ȳ_i. − M_i μ̂_r)² / (n − 1)

and

s_i² = ∑_{j=1}^{m_i} (y_ij − ȳ_i.)² / (m_i − 1)

- As in Ch. 8, if the M_i are all equal, then the ratio and unbiased estimates are the same.
Example
Exercises 9.2, 9.3: "A nurseryman wants to estimate the average height of seedlings in a large field..."

       Number of   Number of seedlings   Heights of seedlings        ∑_{j=1}^{m_i} y_ij
Plot   seedlings   sampled               (in inches)
1      52          5                     12, 11, 12, 10, 13          58
2      56          6                     10, 9, 7, 9, 8, 10          53
3      60          6                     6, 5, 7, 5, 6, 4            33
4      46          5                     7, 8, 7, 7, 6               35
5      49          5                     10, 11, 13, 12, 12          58
6      51          5                     14, 15, 13, 12, 13          67
7      50          5                     6, 7, 6, 8, 7               34
8      61          6                     9, 10, 8, 9, 9, 10          55
9      60          6                     7, 10, 8, 9, 9, 10          53
10     45          6                     12, 11, 12, 13, 12, 12      72
... example
- N = 50 plots are the primary sampling units; n = 10 are sampled.
- M_i seedlings in each plot; m_i ≈ 10% of M_i are sampled.
- y_ij = height of the jth sampled seedling in the ith plot.
- ∑_{i=1}^N M_i is unknown, so we use ratio estimation:
  μ̂_r = ∑_{i=1}^n M_i ȳ_i. / ∑_{i=1}^n M_i = 4970.833/530 = 9.38
- Exercise 9.3: assume ∑_{i=1}^N M_i is known to be 2600. Then M̄ = 2600/50 = 52 and
  μ̂ = (1/M̄) ∑_{i=1}^n M_i ȳ_i. / n = (1/52)(4970.833/10) = 9.56
- HW: Exercise 9.6
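The two estimates on this slide can be reproduced directly from the table; a sketch:

```python
# Two-stage cluster sampling: ratio and unbiased estimates of the mean
# seedling height (Exercises 9.2 and 9.3 data from the table above).
M_i = [52, 56, 60, 46, 49, 51, 50, 61, 60, 45]    # seedlings per sampled plot
totals = [58, 53, 33, 35, 58, 67, 34, 55, 53, 72] # sum of sampled heights
m_i = [5, 6, 6, 5, 5, 5, 5, 6, 6, 6]              # seedlings sampled per plot
n, N = 10, 50

ybar_i = [t / m for t, m in zip(totals, m_i)]     # within-plot means ybar_i.
est_totals = [M * y for M, y in zip(M_i, ybar_i)] # estimated totals M_i*ybar_i.

mu_ratio = sum(est_totals) / sum(M_i)             # 4970.833/530 ≈ 9.38

M_bar = 2600 / N                                  # = 52, given in Exercise 9.3
mu_unbiased = sum(est_totals) / n / M_bar         # ≈ 9.56
print(round(mu_ratio, 2), round(mu_unbiased, 2))  # 9.38 9.56
```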