[Lecture Notes in Statistics] Topics in Survey Sampling Volume 153 || Estimation of a Finite...



Page 1: [Lecture Notes in Statistics] Topics in Survey Sampling Volume 153 || Estimation of a Finite Population Distribution Function

Chapter 6

Estimation of a Finite Population Distribution Function

6.1 INTRODUCTION

Estimation of a finite population distribution function has attracted considerable attention from survey statisticians over the last two decades. Our problem here is to estimate the finite population distribution function (d.f.)

$$F_N(t) = \frac{1}{N}\sum_{i=1}^{N}\Delta(t - y_i) \tag{6.1.1}$$

where $\Delta(z)$ is a step function with

$$\Delta(z) = 1\,(0) \ \text{if } z \ge 0 \ (\text{otherwise}),$$

on the basis of a sample $s$ selected according to a sampling design $p$ with selection probability $p(s)$ and observations of the data.

The distribution function $F_N(t) = F(t)$ denotes the proportion of units in $P$ for which the value of $y$ does not exceed $t$. Such functions are of considerable interest in estimating functions like the Lorenz ratio, where $y$ is the income and the units are individuals, or in an establishment survey, where $y$ may be the value added by manufacture and the units are the factories.
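As a small illustration of this definition, the sketch below evaluates $F(t)$ as the proportion of population values not exceeding $t$, using the step function $\Delta(z)$ that equals 1 when $z \ge 0$. The function names `step` and `population_df` and the income figures are hypothetical, introduced only for this example.

```python
def step(z):
    """Step function Delta(z): 1 if z >= 0, otherwise 0."""
    return 1 if z >= 0 else 0


def population_df(y, t):
    """Finite population d.f.: share of the N values y_i with y_i <= t."""
    return sum(step(t - yi) for yi in y) / len(y)


# Hypothetical population of N = 8 incomes (illustrative values only).
incomes = [12, 7, 30, 18, 7, 45, 22, 9]
print(population_df(incomes, 18))
```

In practice only a sample of the $y$-values is observed, which is why the estimators of the following sections are needed.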

P. Mukhopadhyay, Topics in Survey Sampling. © Springer-Verlag New York, Inc. 2001


As usual, under the superpopulation model approach, we will consider $y = (y_1, \ldots, y_N)$ as the realisation of a random vector $Y = (Y_1, \ldots, Y_N)$ having a joint distribution $\xi$ ($y_i$ being a realised value of the random variable $Y_i$). In this case, our problem will be to predict $F_N(t)$, i.e. to estimate $\mathcal{E}(F_N(t))$ on the basis of the observed data and the assumed superpopulation model $\xi$, estimating any unknown parameter involved in the model in the process. We shall, as before, for simplicity, use the same symbol $y_i$ to denote the random variable $Y_i$ as well as its realised value; the actual meaning will be clear from the context.

The following superpopulation models will often be used. Assume that $x_i$ is the value of an auxiliary variable $x$, closely related to the main variable $y$, and that the values $x_i$ $(i = 1, \ldots, N)$ are known.

(a)

$$y_i = \beta x_i + u_i v(x_i) \tag{6.1.2}$$

where the $u_i$ are iid random variables with mean 0 and variance $\sigma^2$, $v_i = v(x_i)$ is a known positive function of $x_i$ and $\beta$ is an unknown constant.

(b)

$$y_i = \alpha + \beta x_i + \epsilon_i \tag{6.1.3}$$

where $\alpha, \beta$ are unknown constants and the $\epsilon_i$ are iid random variables with mean zero.

6.2 DESIGN-BASED ESTIMATORS

A general class of design-based estimators of F(t) is

where the weight $d_{js}$ may be a function of $(j, s)$ but is independent of $y$ ($d_{js} = 0$ if $j \notin s$). The weights should satisfy the unbiasedness condition

$$\sum_{s \ni j} d_{js}\,p(s) = 1, \quad j = 1, \ldots, N.$$

Taking $d_{js} = 1/\pi_j$, one gets the conventional design-based estimator of $F(t)$, which is the Hájek-type estimator

$$\hat F_0(t) = \frac{\sum_{j\in s}\Delta(t-y_j)/\pi_j}{\sum_{j\in s} 1/\pi_j} \tag{6.2.1}$$


Under srswor of size $n$, $\hat F_0(t)$ reduces to the sample empirical distribution function

$$F_{s_n}(t) = \frac{1}{n}\sum_{j\in s}\Delta(t - y_j) \tag{6.2.2}$$
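The following minimal sketch illustrates a Hájek-type estimator of $F(t)$, assuming the usual form in which the indicators of $y_j \le t$ are weighted by $1/\pi_j$ and normalised by the sum of the weights. The function name `hajek_df` and the data are hypothetical. With equal inclusion probabilities the result coincides with the sample empirical d.f.

```python
def hajek_df(y_s, pi_s, t):
    """Hajek-type estimate of F(t) from sample values y_s and
    first-order inclusion probabilities pi_s (assumed all positive)."""
    weighted_hits = sum((1.0 if yj <= t else 0.0) / pj
                        for yj, pj in zip(y_s, pi_s))
    weight_total = sum(1.0 / pj for pj in pi_s)
    return weighted_hits / weight_total


# Hypothetical sample of four units with unequal inclusion probabilities.
y_sample = [3, 8, 5, 10]
pi_sample = [0.5, 0.25, 0.5, 0.25]
print(hajek_df(y_sample, pi_sample, 5))

# Equal probabilities (as under srswor): reduces to the sample proportion.
print(hajek_df(y_sample, [0.4] * 4, 5))
```

Normalising by the estimated population size rather than the known $N$ guarantees that the estimate reaches one at the largest sampled value.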

For small sample sizes it is always advantageous to smooth out (6.2.1). The design-based ratio estimator of $F(t)$ is obtained by treating $\Delta(t - y)$ as the main variable and $\Delta(t - \hat R x_i)$ as the auxiliary variable, where

$$\hat R = \frac{\sum_s d_{js}\,y_j}{\sum_s d_{js}\,x_j} \tag{6.2.3}$$

In particular, taking

$$\hat R = \frac{\sum_s y_k/\pi_k}{\sum_s x_k/\pi_k}$$

one gets a ratio predictor of $F(t)$ corresponding to $\hat F_0(t)$ as

$$\hat F_r(t) = \frac{1}{N}\,\frac{\sum_{j\in s}\Delta(t-y_j)/\pi_j}{\sum_{j\in s}\Delta(t-\hat R x_j)/\pi_j}\,\sum_{i=1}^N \Delta(t - \hat R x_i) \tag{6.2.4}$$

When $y \propto x$, $\hat F_r(t)$ reduces to $F(t)$. This property will be called the ratio estimator property of $\hat F_r(t)$. It suggests that $\hat F_r(t)$ is expected to be more efficient than $\hat F_0(t)$ when $y$ is approximately proportional to $x$.

A design-based difference estimator of $F(t)$ is

$$\hat F_d(t) = \frac{1}{N}\Big[\sum_{j\in s}\frac{\Delta(t-y_j)}{\pi_j} + d\Big\{\sum_{i=1}^N \Delta(t - \hat R x_i) - \sum_{j\in s}\frac{\Delta(t-\hat R x_j)}{\pi_j}\Big\}\Big] \tag{6.2.5}$$

where $d$ is a known constant. Clearly, $\hat F_d(t)$ is design-unbiased for $F(t)$. The optimum value of $d$ is obtained by minimising the variance of $\hat F_d(t)$ with respect to $d$ and is given, for srswor, by

$$d^* = \rho_\Delta\,\frac{S_{\Delta y}}{S_{\Delta x}} \tag{6.2.6}$$

where

$$S_{\Delta x}^2 = \frac{1}{N-1}\sum_{i=1}^N\Big\{\Delta(t - \hat R x_i) - \frac{1}{N}\sum_{i=1}^N \Delta(t - \hat R x_i)\Big\}^2$$


(6.2.8)


and $\rho_\Delta$ is the finite population correlation coefficient between $\Delta(t - y)$ and $\Delta(t - Rx)$.

In general, the correlation between $\Delta(t - y)$ and $\Delta(t - Rx)$ is likely to be weaker than the correlation between $y$ and $x$, where $R = Y/X$. Consequently, the gain in efficiency of $\hat F_r(t)$ and $\hat F_d(t)$ over $\hat F_0(t)$ is likely to be smaller than that achieved by the customary ratio and regression estimators of the population mean $\bar y$. The estimators $\hat F_0(t)$, $\hat F_r(t)$, $\hat F_d(t)$ are design-consistent under appropriate regularity conditions.
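The ratio and difference estimators discussed above can be sketched as follows. This is only an illustration under stated assumptions: the ratio estimator is taken to be the Horvitz-Thompson total of $\Delta(t - y_j)$ divided by that of $\Delta(t - \hat R x_j)$, multiplied by the known population count of $\Delta(t - \hat R x_i)$ and divided by $N$, and the difference estimator follows (6.2.5). The names `ind` and `ratio_and_difference_df` and the data are hypothetical.

```python
def ind(condition):
    """Indicator: 1.0 if the condition holds, else 0.0."""
    return 1.0 if condition else 0.0


def ratio_and_difference_df(y_s, x_s, pi_s, x_pop, t, d=1.0):
    """Return (ratio estimate, difference estimate) of F(t).

    y_s, x_s, pi_s: sample values and inclusion probabilities;
    x_pop: auxiliary values for all N population units; d: known constant.
    """
    N = len(x_pop)
    # R-hat: ratio of the Horvitz-Thompson totals of y and x.
    R = (sum(y / p for y, p in zip(y_s, pi_s))
         / sum(x / p for x, p in zip(x_s, pi_s)))
    ht_y = sum(ind(y <= t) / p for y, p in zip(y_s, pi_s))
    ht_x = sum(ind(R * x <= t) / p for x, p in zip(x_s, pi_s))
    pop_x = sum(ind(R * x <= t) for x in x_pop)
    f_ratio = (ht_y / ht_x) * pop_x / N if ht_x > 0 else float("nan")
    f_diff = (ht_y + d * (pop_x - ht_x)) / N
    return f_ratio, f_diff


# Hypothetical population with y exactly proportional to x (y = 2x).
x_population = [1, 2, 3, 4, 5, 6]
x_sample = [1, 3, 6]
y_sample = [2, 6, 12]
pi_sample = [0.5, 0.5, 0.5]
print(ratio_and_difference_df(y_sample, x_sample, pi_sample, x_population, 7))
```

Because $y$ is exactly proportional to $x$ in this toy population, both estimates equal the true value $F(7) = 0.5$, which illustrates the ratio estimator property.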

If there are $p$ auxiliary variables $x_1, \ldots, x_p$ with known values $x_{ji}$ on unit $i$ $(j = 1, \ldots, p;\ i = 1, \ldots, N)$, the multivariate design-based ratio estimator is

where

and $W_k\,(> 0)$ $(\sum W_k = 1)$ are constants to be suitably determined.

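Similarly, the multivariate design-based difference estimator is

$$\frac{1}{N}\Big[\sum_{j\in s}\frac{\Delta(t-y_j)}{\pi_j} + \sum_{k=1}^p d_k\Big\{\sum_{i=1}^N\Delta(t-\hat R_k x_{ki}) - \sum_{j\in s}\frac{\Delta(t-\hat R_k x_{kj})}{\pi_j}\Big\}\Big]$$

where the constants $d_k$ are to be optimally chosen. The estimators (6.2.4), (6.2.5) are asymptotically design-unbiased but not m-unbiased under model (6.1.2). This is because

$$\mathcal{E}[\Delta(t - y_i)] \ne \Delta(t - \beta x_i)$$

Similar results hold for the multivariate ratio and difference estimators under the multiple regression models.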

Silva and Skinner (1995) defined the following post-stratified estimator corresponding to $\hat F_0(t)$. Let $L$ be the number of post-strata $P_1, \ldots, P_L$ $(\bigcup_{g=1}^{L} P_g = P)$. A unit $i \in P_g$ if $x_{(g-1)} < x_i \le x_{(g)}$, where $x_{(0)} = -\infty < x_{(1)} < \cdots < x_{(L)} = \infty$. Let $s_1, \ldots, s_L$ be the corresponding partitioning of $s$ so that

$$s_g = s \cap P_g$$

Let $N_g$ be the size of $P_g$ and let

$$\hat N_g = \sum_{j\in s_g}\frac{1}{\pi_j}, \quad g = 1, \ldots, L \tag{6.2.9}$$

The post-stratified estimator is

$$\hat F_{ps}(t) = \frac{1}{N}\sum_{g=1}^L \frac{N_g}{\hat N_g}\sum_{j\in s_g}\frac{\Delta(t-y_j)}{\pi_j} = \frac{1}{N}\sum_{g=1}^L N_g \hat F_{0g}(t) \ \text{(say)} \tag{6.2.10}$$

It is desirable to define the post-strata such that the probability that $s_g$ is empty is very small. In practice, any post-strata with $\hat N_g = 0$ are pooled with adjacent post-strata until all $\hat N_g$ are positive.

The predictor $\hat F_{ps}$ is exactly m-unbiased under a model for which $y_i$ has a common mean within each post-stratum. It may, however, be m-biased under model (6.1.2).
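A minimal sketch of the post-stratified estimator follows, assuming post-strata of the form $x_{(g-1)} < x \le x_{(g)}$ and known post-stratum sizes $N_g$. The function name `post_stratified_df`, the argument `cuts` (the interior boundaries) and the data are hypothetical. The sketch raises an error when a post-stratum has no sampled units, which is the situation the pooling rule above is meant to handle.

```python
import bisect


def post_stratified_df(y_s, x_s, pi_s, cuts, N_g, t):
    """Post-stratified estimate of F(t).

    cuts: increasing interior boundaries x_(1) < ... < x_(L-1);
    N_g:  known sizes of the L post-strata.
    """
    L = len(N_g)
    hits = [0.0] * L     # weighted count of y_j <= t in each post-stratum
    n_hat = [0.0] * L    # estimated post-stratum sizes (sum of 1/pi_j)
    for y, x, p in zip(y_s, x_s, pi_s):
        g = bisect.bisect_left(cuts, x)  # index g with cuts[g-1] < x <= cuts[g]
        n_hat[g] += 1.0 / p
        if y <= t:
            hits[g] += 1.0 / p
    if any(nh == 0.0 for nh in n_hat):
        raise ValueError("empty post-stratum: pool with an adjacent one")
    return sum(Ng * h / nh for Ng, h, nh in zip(N_g, hits, n_hat)) / sum(N_g)


# Hypothetical example: two post-strata split at x = 10, sizes 6 and 4.
print(post_stratified_df([1, 4, 7, 9], [5, 10, 20, 30],
                         [0.5, 0.5, 0.5, 0.5], [10], [6, 4], 5))
```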

Kuk (1988) considered homogeneous linear unbiased estimators of $F(t)$,

$$\hat F_L(t) = \frac{1}{N}\hat H(t) = \frac{1}{N}\sum_{i=1}^N d_{is}\,\Delta(t - y_i) \tag{6.2.11}$$

where $d_{is}$ has been defined earlier in this section. For any arbitrary sampling design, the choice $d_{is} = 1/\pi_i\ (0)$ if $i \in s$ (otherwise) gives the HT-estimator

$$\hat F_{HT}(t) = \frac{1}{N}\sum_{i\in s}\frac{\Delta(t-y_i)}{\pi_i}$$

For the probability proportional to aggregate sample size (ppas) sampling design (s.d.),

$$d_{is} = \frac{1}{\sum_{j\in s} p_j}\ (0) \ \text{if } i \in s \ (\text{otherwise}) \tag{6.2.12}$$


where $p_j = x_j/X$. Define the complementary function of $F(t)$ as

$$S(t) = 1 - F(t) = \frac{1}{N}\sum_{i=1}^N \Delta(y_i - t) \tag{6.2.13}$$

Its estimator is, following (6.2.11),

$$\hat S(t) = \frac{1}{N}\sum_{i=1}^N d_{is}\,\Delta(y_i - t) \tag{6.2.14}$$

An estimator of $F(t)$ is, therefore,

$$\hat F_R(t) = 1 - \hat S(t) = 1 - \frac{1}{N}\sum_{i=1}^N d_{is} + \hat F_L(t) \tag{6.2.15}$$

Again, neither $\hat F_L(t)$ nor $\hat F_R(t)$ is a distribution function, since their maximum values are not equal to one. A natural remedy is to divide $\hat F_L(t)$ or $\hat F_R(t)$ by its maximum value. The normalised version of $\hat F_L(t)$,

$$\hat F_V(t) = \frac{\sum_{i\in s} d_{is}\,\Delta(t-y_i)}{\sum_{i\in s} d_{is}} \tag{6.2.16}$$

is, however, not unbiased. The mse of $\hat F_L(t)$ is

$$MSE(\hat F_L(t)) = \frac{1}{N^2}E\Big(\sum_{i=1}^N (d_{is}-1)\Delta(t-y_i)\Big)^2 = \frac{1}{N^2}\sum_{i=1}^N\sum_{j=1}^N \Delta(t-y_i)\Delta(t-y_j)\,a_{ij} \tag{6.2.17}$$

where

$$a_{ij} = E\{(d_{is}-1)(d_{js}-1)\}$$

Similarly,

$$MSE(\hat F_R(t)) = \frac{1}{N^2}\sum_i\sum_j \Delta(y_i-t)\Delta(y_j-t)\,a_{ij} \tag{6.2.18}$$

Now,

$$MSE(\hat F_R(t)) < MSE(\hat F_L(t)) \iff \sum_{i=1}^N b_i > 2\sum_{i=1}^N \Delta(y_i - t)\,b_i \tag{6.2.19}$$

where

$$b_i = \sum_{j=1}^N a_{ij}$$

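Since $\hat F_L(t)$, $\hat F_R(t)$, $\hat F_V(t)$ are all step functions, we need to compare them only at $t = y_1, \ldots, y_N$ and at a value $t$ such that $F(t) = 0$. Assume no ties and let

$$y_{(1)} \le y_{(2)} \le \cdots \le y_{(N)} \tag{6.2.20}$$

be the ordered y-values. Let $y_{(0)}$ be a value less than $y_{(1)}$. Let $D(i)$ be the anti-ranks so that

$$y_{(i)} = y_{D(i)}, \quad i = 1, \ldots, N \tag{6.2.21}$$

The condition (6.2.19) implies that

$$MSE(\hat F_R(y_{(l)})) < MSE(\hat F_L(y_{(l)}))$$

if

$$\sum_{i=1}^N b_i > 2\sum_{i=1}^N \Delta(y_{(i)} - y_{(l)})\,b_{D(i)} = 2\sum_{i=l+1}^N b_{D(i)} \tag{6.2.22}$$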

Since $\hat F_V(t)$ is a ratio estimator, its mse is approximately its variance and is given by

$$\begin{aligned}
MSE(\hat F_V(t)) &\approx E\left(\frac{\sum_{i=1}^N d_{is}\,\Delta(t-y_i)}{\sum_{i=1}^N d_{is}} - F(t)\right)^2 \\
&\approx \frac{E\big(\sum_{i=1}^N d_{is}\,\Delta(t-y_i) - \sum_{i=1}^N d_{is}\,F(t)\big)^2}{\big\{E\big(\sum_{i=1}^N d_{is}\big)\big\}^2} \\
&= \frac{1}{N^2}\sum_{i=1}^N\sum_{i'=1}^N(\Delta(t-y_i)-F(t))(\Delta(t-y_{i'})-F(t))\,E(d_{is}d_{i's}) \\
&= \frac{1}{N^2}\sum_{i=1}^N\sum_{i'=1}^N(\Delta(t-y_i)-F(t))(\Delta(t-y_{i'})-F(t))\,a_{ii'}
\end{aligned} \tag{6.2.23}$$

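From (6.2.17) and (6.2.23), denoting asymptotic mse as AMSE,

$$\begin{aligned}
MSE(\hat F_L(t)) - AMSE(\hat F_V(t)) &= \frac{F(t)}{N^2}\Big[2\sum_i \Delta(t-y_i)\,b_i - F(t)\sum_i b_i\Big] \\
&= \frac{l}{N^3}\Big[2\sum_{i=1}^l b_{D(i)} - l\,\bar b\Big] \quad \text{for } t = y_{(l)}
\end{aligned} \tag{6.2.24}$$

which is $\ge 0$ if

$$\sum_{i=1}^l b_{D(i)} \ge \frac{1}{2}\,l\,\bar b \tag{6.2.25}$$

where $\bar b = \sum_{i=1}^N b_i/N$. From (6.2.18) and (6.2.23) we conclude that

$$MSE(\hat F_R(y_{(l)})) \le AMSE(\hat F_V(y_{(l)}))$$

if

$$\sum_{i=l+1}^N b_{D(i)} \le \frac{1}{2}(N-l)\,\bar b \tag{6.2.26}$$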

Conditions (6.2.19), (6.2.22), (6.2.25) are not useful in practice since $D(1), \ldots, D(N)$ are not known. Assume that the ordering of the y-values agrees with that of the x-values so that

$$x_{D(1)} \le x_{D(2)} \le \cdots \le x_{D(N)} \tag{6.2.27}$$

In this case, we can compute $b_{D(i)}$ $(i = 1, \ldots, N)$ and hence check the conditions (6.2.19), (6.2.22) and (6.2.25). The condition (6.2.27) implies

$$b_{D(1)} \ge b_{D(2)} \ge \cdots \ge b_{D(N)} \tag{6.2.28}$$

for a number of sampling designs including Poisson, modified Poisson (Ogus and Clark, 1971), collocated sampling (Brewer et al, 1972) and ppas sampling designs using $x$ as the size measure.

If (6.2.28) holds, (6.2.25) holds in general, implying that $\hat F_V(t)$ is preferable to $\hat F_L(t)$. From (6.2.22) and (6.2.28) it follows that there is an $l_1 \in [1, Q]$, where $Q = [N/2]$, such that $MSE(\hat F_R(y_{(l)})) \le MSE(\hat F_L(y_{(l)}))$ $\forall\ l \ge l_1$. Generally, $l_1$ is sufficiently smaller than $Q$ so that, for estimating the population median $\theta$, $\hat F_R(t)$ is preferable to $\hat F_L(t)$.

It follows from (6.2.22), (6.2.25) and (6.2.26) that $\hat F_L(t)$ is inferior to both $\hat F_V(t)$ and $\hat F_R(t)$. From (6.2.26) and (6.2.28) it follows that there is an $l_2$ such that $MSE(\hat F_R(t)) \le AMSE(\hat F_V(t))$ for $t \ge y_{(l_2)}$.

The empirical studies considered by Kuk (1988), using $n = 30$ for three populations, Dwellings ($N = 270$; Kish, 1965, p. 624), Villages ($N = 250$; Murthy, 1967, p. 127) and Metropolitan ($x$ = 1970 population, $y$ = 1980 population for 250 metropolitan statistical areas in the US), confirmed the above findings.
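The three estimators compared in this discussion can be computed side by side, as in the sketch below. It assumes that the weights $d_{is}$ of the sampled units are supplied directly, and that the complementary indicator is taken as $y_i > t$ so that the two indicators add to one for every unit. The function name `kuk_estimators` and the data are hypothetical.

```python
def kuk_estimators(y_s, d_s, N, t):
    """Return (F_L, F_R, F_V) at t for sampled values y_s with weights d_s.

    F_L: homogeneous linear estimator, (1/N) * sum of d over y <= t.
    F_R: one minus the linear estimator of the complementary function.
    F_V: F_L normalised by the sum of the weights.
    """
    below = sum(d for y, d in zip(y_s, d_s) if y <= t)
    above = sum(d for y, d in zip(y_s, d_s) if y > t)
    f_l = below / N
    f_r = 1.0 - above / N
    f_v = below / sum(d_s)
    return f_l, f_r, f_v


# Hypothetical sample of three units from a population of N = 10.
f_l, f_r, f_v = kuk_estimators([2, 5, 9], [4.0, 3.0, 2.0], 10, 5)
print(f_l, f_r, f_v)
```

In this example the weights sum to 9 rather than 10, so `f_l` tops out below one and `f_r` starts above zero, while `f_v` is a proper distribution function.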

Kuk and Mak (1989) considered the following cross-classified estimator. For any value of $t$, let $F_1(t)$ denote the proportion, among those units in the sample with $x$ values $\le M_x$ (the population median of $x$), that have $y$ values $\le t$. Similarly, let $F_2(t)$ denote the proportion among those units with $x$ values $> M_x$. Let $N_x$ denote the number of units in the population with $x$ values $\le M_x$. Then $F(t)$ can be estimated as

$$\hat F_{KM}(t) = \frac{1}{N}\,[N_x F_1(t) + (N - N_x)F_2(t)] \approx \frac{1}{2}\,(F_1(t) + F_2(t)) \tag{6.2.29}$$
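A small sketch of the cross-classified estimator is given below, assuming that all population $x$-values are available so that $M_x$ and $N_x$ can be computed. The function name `kuk_mak_df` and the data are hypothetical, and the sketch requires at least one sampled unit on each side of the median.

```python
import statistics


def kuk_mak_df(y_s, x_s, x_pop, t):
    """Cross-classified estimate of F(t) using the population median of x."""
    N = len(x_pop)
    m_x = statistics.median(x_pop)
    n_x = sum(1 for x in x_pop if x <= m_x)
    low = [y for y, x in zip(y_s, x_s) if x <= m_x]
    high = [y for y, x in zip(y_s, x_s) if x > m_x]
    if not low or not high:
        raise ValueError("need sampled units on both sides of the median")
    f1 = sum(1 for y in low if y <= t) / len(low)
    f2 = sum(1 for y in high if y <= t) / len(high)
    return (n_x * f1 + (N - n_x) * f2) / N


# Hypothetical population of 8 x-values and a sample of 4 units.
print(kuk_mak_df([3, 9, 5, 12], [2, 4, 6, 8], [1, 2, 3, 4, 5, 6, 7, 8], 9))
```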

Mukhopadhyay (2000c) considered calibration estimation of the finite population d.f. under a multiple regression model.

6.3 MODEL-BASED PREDICTORS

Following Royall (1970), Royall and Herson (1973) and Rodrigues et al (1985), we consider in this section model-based optimal predictors of $F(t)$. After a sample has been selected, we may write

(6.3.1)

where

and

where

$$F_r(t) = \frac{1}{N-n}\sum_{i\in r}\Delta(t - y_i), \tag{6.3.2}$$

$F_N(t) = F(t)$ and the other symbols have their usual meanings. Hence a predictor of $F(t)$ is of the form

(6.3.3)

where $\hat\theta_{sr}$ is a predictor of $\theta_{sr}$.

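DEFINITION 6.3.1 A predictor $\hat F(t)$ is a model (m)-unbiased predictor of $F(t)$ with respect to the model (6.1.2) if

$$E_\psi[\hat F(t) - F(t)] = 0 \quad \forall\ \psi = (\beta, \sigma^2) \in \Psi \ \text{and} \ \forall\ s : p(s) > 0 \tag{6.3.4}$$

where $\Psi$ is the parameter space.

Chambers and Dunstan (CD) (1986), therefore, suggested an m-unbiased predictor of $F(t)$,

$$\hat F(t) = \frac{1}{N}\Big[\sum_{j\in s}\Delta(t-y_j) + \hat W\Big] \tag{6.3.5}$$

where $\hat W$ is an m-unbiased predictor of $\sum_{i\in r}\Delta(t-y_i)$, i.e.

$$E(\hat W) = E\Big(\sum_{j\in r}\Delta(t-y_j)\Big)$$

Now

$$E(\Delta(t-y_j)) = G\Big(\frac{t-\beta x_j}{v(x_j)}\Big) \tag{6.3.6}$$

where $G(z) = P(u_j \le z)$ is the distribution function of $u_j$. An empirical estimator of $G\big(\frac{t-\beta x_i}{v(x_i)}\big)$ is, therefore,

$$\hat G_n\Big(\frac{t-b_n x_i}{v(x_i)}\Big) = \frac{1}{n}\sum_{j\in s}\Delta\Big(\frac{t-b_n x_i}{v(x_i)} - \hat u_{nj}\Big) \tag{6.3.7}$$

where

$$\hat u_{nj} = \frac{y_j - b_n x_j}{v(x_j)}, \qquad b_n = \hat\beta = \sum_{j\in s}\frac{x_j y_j}{v^2(x_j)}\Big/\sum_{j\in s}\frac{x_j^2}{v^2(x_j)} \tag{6.3.8}$$

Hence, an approximately m-unbiased predictor of $F(t)$ is

$$\hat F_{CD}(t) = \frac{1}{N}\Big[\sum_{j\in s}\Delta(t-y_j) + \frac{1}{n}\sum_{i\in r}\sum_{j\in s}\Delta\Big(\frac{t-b_n x_i}{v(x_i)} - \hat u_{nj}\Big)\Big] \tag{6.3.9}$$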
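The predictor (6.3.9) can be sketched in a few lines. The sketch assumes that $b_n$ is the weighted least squares slope with weights $1/v^2(x_j)$, takes $v(x) = \sqrt{x}$ as a default purely for illustration, and requires the $x$-values of all non-sampled units. The function name `chambers_dunstan_df` and the data are hypothetical.

```python
import math


def chambers_dunstan_df(y_s, x_s, x_r, t, v=math.sqrt):
    """Chambers-Dunstan type prediction of F(t).

    y_s, x_s: sample values; x_r: auxiliary values of non-sampled units;
    v: known positive variance function of the working model.
    """
    n = len(y_s)
    N = n + len(x_r)
    # Weighted least squares slope b_n.
    b = (sum(x * y / v(x) ** 2 for x, y in zip(x_s, y_s))
         / sum(x * x / v(x) ** 2 for x in x_s))
    # Standardised sample residuals.
    resid = [(y - b * x) / v(x) for x, y in zip(x_s, y_s)]
    observed = sum(1 for y in y_s if y <= t)
    predicted = 0.0
    for xi in x_r:
        z = (t - b * xi) / v(xi)
        # Empirical d.f. of the residuals evaluated at z.
        predicted += sum(1 for u in resid if u <= z) / n
    return (observed + predicted) / N


# Hypothetical data with y = 2x exactly, so all residuals are zero.
print(chambers_dunstan_df([2, 8, 18], [1, 4, 9], [2, 3, 16], 7))
```

With $y$ exactly proportional to $x$, the prediction reproduces the true population value $F(7) = 0.5$ for this toy population.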


However, $\hat F_{CD}$ is not design-unbiased under repeated sampling. For small sample sizes it may be desirable to replace $\hat u_{nj}$ by its studentised equivalent under (6.1.2). Also, one could replace $\hat G_n(t)$ in (6.3.7) by a smoother estimator of $G$, e.g. a kernel estimator of this function, obtained by integrating a kernel density estimator (Hill, 1985). Dorfman (1993) extended the CD-estimator to the multiple regression model.

Dunstan and Chambers (1989) extended the CD-estimator to the case where only summary information is available for the auxiliary size variable $x$. We assume that only histogram-type information on $x$ is available, enabling the population to be split up into $H$ strata, defined by the end-points $x_{hL}, x_{hU}$ $(h = 1, \ldots, H)$. Also, the strata sizes $N_h$ and strata means $\bar x_h$ are known. In this case, the double summation in (6.3.9) can be written as

(6.3.10)

Assuming $x_{hi}$ to be an independent realisation of a random variable $X_h$ with distribution function $C_h$,

$$E\Big\{\Delta\Big(\frac{t-b_n X_h}{v(X_h)} - z\Big)\Big\} = 1 - \Gamma_{ht}(z)$$

($E$ denoting expectation with respect to the d.f. $C_h$) where $\Gamma_{ht}$ is the distribution function of the transformed variable $(t - b_n X_h)/v(X_h)$. Therefore, the expectation of the expression in (6.3.10) is

$$\sum_h (N_h - n_h)\Big[1 - n^{-1}\sum_{j\in s}\Gamma_{ht}(\hat u_{nj})\Big]$$

The actual form of $\Gamma_{ht}$ will depend on $C_h$ and the form of the variance function $v(x)$. For example, when $v(x) = \sqrt{x}$, assuming $b_n > 0$, $t > 0$,

If an approximation $\hat C_h$ to $C_h$, and hence $\hat\Gamma_{ht}$ to $\Gamma_{ht}$, are available from survey data, a limited information estimator corresponding to $\hat F_{CD}(t)$ is

The authors derive an estimator of the asymptotic prediction variance of this limited information estimator by obtaining the limited information approximation, as above, to the asymptotic variance of $\hat F_{CD}(t)$ derived in Theorem 6.5.1.


Model-dependent strategies can perform poorly in large samples under model-misspecification (Hansen, Madow and Tepping, 1983). Rao, Kovar and Mantel (RKM) (1990) noticed a similar poor performance of the model-dependent estimator $\hat F_{CD}(t)$ under model-misspecification and, therefore, considered a model-assisted approach. In this approach one considers design-consistent estimators, $\hat F_{dm}(t)$ (say), that are also model-unbiased (at least asymptotically) under the assumed model. Estimators of the model-variance $V(\hat F_{dm} - F)$ that are design-consistent and at the same time model-unbiased (at least asymptotically) can be obtained following Särndal et al (1989) and Kott (1990). The resulting pivot $[\hat F_{dm}(t) - F(t)]/\sqrt{v(\hat F_{dm}(t) - F(t))}$ provides valid inference under the assumed model and at the same time protects against model misspecification, in the sense of providing valid design-based inference under model failure.

Rao, Kovar and Mantel (RKM) (1990) considered the model (6.1.2) with $v(x) = \sqrt{x}$. Considering

$$G_i = \frac{1}{N}\sum_{j=1}^N\Delta\Big(\frac{t-Rx_i}{\sqrt{x_i}} - v_{nj}\Big) \tag{6.3.12}$$

where

$$v_{nj} = \frac{y_j - Rx_j}{\sqrt{x_j}}, \quad R = \frac{Y}{X},$$

as the value of an auxiliary variable, they defined a difference estimator

(6.3.13)

This estimator is both design-unbiased and asymptotically m-unbiased.

Now, in $G_i$ of $\sum_{i=1}^N G_i$, $v_{nj}$ will not be known for all $j$. Thus $G_i$ needs to be estimated. A design-based estimator of $G_i$ in $\sum_{i=1}^N G_i$ is

where

(6.3.14)


(6.3.15)

Similarly, $G_j$ in $\sum_{j\in s} G_j/\pi_j$ needs to be estimated. $G_j$ is estimated by

(6.3.16)

The estimator $\hat G_i$ is asymptotically design-unbiased for $G_i$, while $\hat G_{jc}$ is asymptotically conditionally design-unbiased for $G_j$ given $j \in s$.

The alternative model-assisted estimator is, therefore,

(6.3.17)

which is asymptotically both design-unbiased and model-unbiased.

Godambe (1990) derived (6.3.17) with slight modifications on the basis of optimal estimating functions.

Under srswor and $v_i = \sqrt{x_i}$,

(6.3.18)

and Oje = Gj.

Dorfman (1993), therefore, proposed a model-based generalisation of $\hat F_{RKM}(t)$ as

(6.3.19)

regarding $\pi_j$ as reflecting the proportion of sampled units near the data point $x_j$, not necessarily the inclusion probabilities. $\hat F^*_{RKM}(t)$ is, therefore, free from the second-order inclusion probabilities, which may be difficult to estimate. Godambe's (1989) estimator also shares this property.

Rao and Liu (1992) proposed a model-assisted estimator for the general weights $d_{js}$ satisfying the design-unbiasedness condition. Assume first that $G_i$ is known for all $i$. A model-assisted estimator is given by

$$\tilde F_{RL}(t) = \frac{1}{N}\Big[\sum_{j\in s}d_{js}\,\Delta(t-y_j) + \Big\{\sum_{i=1}^N G_i - \sum_{j\in s}d_{js}\,G_j\Big\}\Big] \tag{6.3.20}$$

Now, replace $G_i$ in $\sum_{i=1}^N G_i$ by

where

$$\hat v_{nj} = \frac{y_j - \hat R x_j}{\sqrt{x_j}}$$

Similarly, $G_j$ in $\sum_{j\in s} d_{js}\,G_j$ is replaced by

where the weights $d_{ks|j}$ satisfy $\sum_{s \ni (j,k)} d_{ks|j}\,p(s) = \pi_j$. The final model-assisted estimator of Rao and Liu (1992) is

(6.3.21)

which is asymptotically both design-unbiased and model-unbiased.

Godambe's (1990) estimator based on estimating function theory is

(6.3.22)

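Wang and Dorfman (1996) combined the CD-estimator and the RKM estimator based on the model (6.1.3). The CD estimator is

$$\hat F_{CD}(t) = \frac{1}{N}\Big[\sum_{j\in s}\Delta(t-y_j) + \sum_{i\in r}\hat H(t - \hat\alpha - \hat\beta x_i)\Big] \tag{6.3.23}$$

where $\hat H(z) = \frac{1}{n}\sum_{j\in s}\Delta(z - \hat\epsilon_j)$ is an estimate of $H(z) = \text{Prob}(\epsilon \le z)$, and $\hat\epsilon_j = y_j - \hat\alpha - \hat\beta x_j$, $\hat\alpha$, $\hat\beta$ being the least squares estimates of $\alpha$, $\beta$ respectively.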


The Rao et al (1990) estimator for srswor corresponding to the model (6.1.3) is

$$\hat F_{RKM}(t) = \frac{1}{n}\sum_{j\in s}\Delta(t-y_j) + \frac{1}{N}\sum_{i=1}^N\hat H(t-\hat\alpha-\hat\beta x_i) - \frac{1}{n}\sum_{i\in s}\hat H(t-\hat\alpha-\hat\beta x_i) \tag{6.3.24}$$

Noting that both $\hat F_{CD}$ and $\hat F_{RKM}$ have desirable properties, and deficiencies in certain situations, Wang and Dorfman (1996) considered a new estimator which is their convex combination,

$$\begin{aligned}
\hat F_{WD}(t) &= w\,\hat F_{CD}(t) + (1-w)\,\hat F_{RKM}(t) \\
&= \frac{1}{N}\sum_{j\in s}\Delta(t-y_j) + (1-w)\Big(\frac{1}{n} - \frac{1}{N}\Big)\sum_{j\in s}\{\Delta(t-y_j) - \hat H(t-\hat\alpha-\hat\beta x_j)\} + \frac{1}{N}\sum_{i\in r}\hat H(t - \hat\alpha - \hat\beta x_i)
\end{aligned} \tag{6.3.25}$$

where $0 < w < 1$ depends on $t$ and is optimally estimated by minimising $MSE\{\hat F_{WD}(t) - F(t)\}$ under the assumption that both $n$ and $N$ increase to infinity such that $n/N \to f \in (0, 1)$ and the sample and non-sample design points have a common asymptotic density.

Mukhopadhyay (1998d) considered the design-model unbiased optimal prediction of the finite population distribution function of a random vector following a simple location model and a linear regression model with one auxiliary variable under measurement errors. This will be considered in the next chapter.

6.4 CONDITIONAL APPROACH

Consider the estimator $\tilde F_{RL}(t)$ in (6.3.20). Under srs, $\tilde F_{RL}$ reduces to

$$\tilde F_{RL}(t) = \bar h(t) + (\bar G - \bar g) \tag{6.4.1}$$

where $\bar h(t) = \sum_{i\in s} h(t, y_i)/n = F_{s_n}(t)$, $h(t, y_i) = \Delta(t - y_i)$, $\bar G = \sum_{i=1}^N G_i/N$ and $\bar g = \sum_{i=1}^n g_i/n$. The asymptotic conditional bias of $\tilde F_{RL}(t)$ is

(6.4.2)

where

$$B^* = \frac{\text{Cov}(\bar h, \bar x) - \text{Cov}(\bar g, \bar x)}{V(\bar x)} = \frac{S_{xh} - S_{xG}}{S_x^2} \tag{6.4.3}$$

where $S_{xh}$ and $S_{xG}$ are, respectively, the population covariances between $x$ and $h$ and between $x$ and $G$. A bias-adjusted estimator is, therefore, given by

$$\hat F_{RLa}(t) = \tilde F_{RL}(t) + s_x^{-2}(s_{xh} - s_{xG})(\bar X - \bar x) \tag{6.4.4}$$

where $s_{xh}$ and $s_{xG}$ are the sample covariances. The conditional bias of $\hat F_{RLa}(t)$ is $O_p(n^{-1})$ and consequently $\hat F_{RLa}(t)$ provides conditionally valid inference in large samples. $\hat F_{RLa}(t)$ is also model-unbiased since $E(B^*) = 0$ under (6.1.2).

In practice, one replaces $G_i$ by $\hat G_i$ to get

(6.4.5)

where $s_{x\hat G}$ is the sample covariance between $x_i$ and $\hat G_i$. If only the population mean $\bar X$ is known, $\bar x$, an estimate of $\bar X$, is an approximate ancillary statistic. The estimator $\hat F_G(t)$ in (6.3.22) or $\hat F_{RLa}(t)$ cannot be used in this case since they require the knowledge of $x_j$ $(j = 1, \ldots, N)$. We, therefore, find the conditional bias of $\bar h = F_{s_n}(t)$ to obtain a bias-adjusted estimator. The conditional asymptotic bias of $F_{s_n}(t)$ is

(6.4.6)

where $B = \text{Cov}(\bar h, \bar x)/V(\bar x) = S_{xh}/S_x^2$. A bias-adjusted estimator is, therefore, given by

$$\hat F_a(t) = F_{s_n}(t) + (s_{xh}/s_x^2)(\bar X - \bar x) \tag{6.4.7}$$

The conditional bias of $\hat F_a(t)$ is $O_p(n^{-1})$ and, as such, $\hat F_a(t)$ provides conditionally valid inferences in large samples. However, $\hat F_a(t)$ is model-biased under model (6.1.2).
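The bias-adjusted estimator (6.4.7) needs only the sample and the population mean of $x$, as the following sketch shows. It assumes srswor and uses sample covariances with divisor $n - 1$ (the divisor cancels in the ratio $s_{xh}/s_x^2$). The function name `bias_adjusted_df` and the data are hypothetical.

```python
def bias_adjusted_df(y_s, x_s, x_pop_mean, t):
    """Bias-adjusted estimate: F_sn(t) + (s_xh / s_x^2) * (X_bar - x_bar)."""
    n = len(y_s)
    h = [1.0 if y <= t else 0.0 for y in y_s]   # h(t, y_i) = Delta(t - y_i)
    h_bar = sum(h) / n                          # sample empirical d.f. at t
    x_bar = sum(x_s) / n
    s_xh = sum((x - x_bar) * (hi - h_bar) for x, hi in zip(x_s, h)) / (n - 1)
    s_xx = sum((x - x_bar) ** 2 for x in x_s) / (n - 1)
    return h_bar + (s_xh / s_xx) * (x_pop_mean - x_bar)


# Hypothetical sample whose x-mean (2.5) lies below the population mean (3.0).
print(bias_adjusted_df([1, 2, 3, 4], [1, 2, 3, 4], 3.0, 2))
```

Here the sample under-represents large $x$ (and hence large $y$), so the unadjusted proportion 0.5 is pulled down to 0.3. When the sample mean of $x$ equals the population mean, no adjustment is made.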

Quin and Chen (1991) used the empirical likelihood method to obtain a maximum likelihood estimator of $F(t)$ which has the same asymptotic variance as $\hat F_a(t)$.

6.5 ASYMPTOTIC PROPERTIES OF THE ESTIMATORS

We first recall a result due to Randles (1982).

Consider random variables which would have been U-statistics were it not for the fact that they contain an estimator. Let $X_1, \ldots, X_n$ be a random sample from some population. Let $h(x_1, \ldots, x_r; \gamma)$ be a symmetric kernel of order $r$ with expected value

(6.5.1)

where $\lambda$ denotes a $p \times 1$ vector. Here $\gamma$ is a mathematical symbol whose one particular value may be $\hat\lambda$, a consistent estimator of $\lambda$. Both the kernel and its expected value may depend on $\gamma$.

The U-statistic corresponding to (6.5.1) is

$$U_n(\gamma) = \binom{n}{r}^{-1}\sum_{a\in A} h(X_{a_1}, \ldots, X_{a_r}; \gamma) \tag{6.5.2}$$

where $A$ denotes the collection of all subsets of size $r$ from the integers $\{1, \ldots, n\}$.

LEMMA 6.5.1 Under certain regularity conditions,

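provided $\tau^2 > 0$, where $\tau^2$ is given by either

(a)

$D' = \big(1, \frac{\partial\theta(\cdot)}{\partial\gamma_1}, \ldots, \frac{\partial\theta(\cdot)}{\partial\gamma_p}\big)$, $\gamma = (\gamma_1, \ldots, \gamma_p)'$, $\Sigma$ is the covariance matrix of

or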

(6.5.3)

(6.5.4)

(6.5.5)

THEOREM 6.5.1 Assume the following regularity conditions:

• (1) As both $N$ and $n$ increase, the sampling fraction $n/N \to f \in (0, 1)$.

• (2) The d.f. $G(t)$ of the random variable $u_i = \frac{y_i - \beta x_i}{v(x_i)}$ is differentiable with derivative $g(t) > 0$.

• (3) The quantities $x_i$ and $v(x_i)$ are bounded.

• (4) For arbitrary $b$ define

$$F_r^*(t, b) = \frac{1}{n(N-n)}\sum_{i\in r}\sum_{j\in s}\Delta\Big(\frac{t-bx_i}{v(x_i)} - \frac{y_j-bx_j}{v(x_j)}\Big) \tag{6.5.6}$$

Assume that, as both $n$, $N$ increase, the mean and variance of $F_r^*(t, b)$ tend to a limit in $(0, 1)$.

• (5) The estimator $b_n$ (defined in (6.3.8)) is asymptotically normally distributed under model (6.1.2).

Let

where

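$$h_s = \frac{1}{n}\sum_{j\in s}\frac{1}{N-n}\sum_{i\in r}\Big\{\frac{x_j}{v(x_j)} - \frac{x_i}{v(x_i)}\Big\}\,g\Big(\frac{t-\beta x_i}{v(x_i)}\Big)$$

$V_r^*(t, \beta)$ = covariance matrix of $\big(F_r^*(t,\beta) - E\{F_r^*(t,\beta)\},\ b_n - \beta\big)$

Define

$$W_r^*(t,\beta) = D_r(t,\beta)'\,V_r^*(t,\beta)\,D_r(t,\beta)$$

$$W_r(t,\beta) = \frac{1}{(N-n)^2}\sum_{i\in r}G\Big\{\frac{t-\beta x_i}{v(x_i)}\Big\}\Big[1 - G\Big\{\frac{t-\beta x_i}{v(x_i)}\Big\}\Big]$$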

(6.5.7)

(6.5.8)

(6.5.9)

(6.5.10)

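$$\{\hat F_{CD}(t) - F(t)\}\Big/\Big[\Big(1-\frac{n}{N}\Big)^2\{W_r^*(t,\beta) + W_r(t,\beta)\}\Big]^{1/2} \xrightarrow{\ L\ } N(0,1) \tag{6.5.11}$$

Proof. When $b = \beta$, $F_r^*(t, \beta)$ is a U-statistic. Hence, by Randles' theorem,

$$\sqrt{n}\,[F_r^*(t, b_n) - E\{F_r^*(t,\beta)\}] \xrightarrow{\ L\ } N(0, W_r^*(t,\beta))$$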

where

Now,

$= E_r(t, \beta)$ (say)

(6.5.12)

Hence,

$$\frac{\partial}{\partial b}E\{F_r^*(t,b)\}\Big|_{b=\beta} = h_s$$

Therefore, for large $n$, $N$, $F_r^*(t, b_n) \sim AN(E_r(t,\beta), W_r^*(t,\beta))$, where we write AN to denote asymptotically normal. Now,

$$\hat F_{CD}(t) - F(t) = \frac{N-n}{N}\,[F_r^*(t,b_n) - F_r(t)]$$

where $F_r(t)$ is as defined in (6.3.2). Also $F_r(t)$ is independent of $F_r^*(t,b_n)$.

Again, $\text{Var}[F_r(t)] = W_r(t, \beta)$. Hence the result (6.5.11).

Note 6.5.1

Suppose (6.1.2) holds but with variance function $a(x) \ne v(x)$. It can be shown that $\hat F_{CD}(t) - F(t)$ is still asymptotically normally distributed, but with mean given by

$$(N-n)^{-1}\sum_{i\in r}\Big(\Big[n^{-1}\sum_{j\in s}G\Big\{h_{ij}\,\frac{t-\beta x_i}{a(x_i)}\Big\}\Big] - G\Big\{\frac{t-\beta x_i}{a(x_i)}\Big\}\Big)$$

where

$$h_{ij} = \frac{v(x_j)\,a(x_i)}{a(x_j)\,v(x_i)}$$

The asymptotic bias is approximately zero if the sample is such that $h_{ij} \approx 1$ $\forall\ i$.

Using Lemma 6.5.1 and denoting the variance and variance-estimator of $\hat Y = \sum_s y_i/\pi_i$ by $V(y_i)$ and $v(y_i)$, respectively, Rao et al (1990) showed that

$$v\{\hat F_d(t)\} \approx N^{-2}\,v\{\Delta(t-y_i) - \Delta(t-\hat R x_i)\}$$

$$V\{\hat F_r(t)\} \approx N^{-2}\,V\Big\{\Delta(t-y_i) - \frac{F(t)}{F_x(t/R)}\,\Delta(t-Rx_i)\Big\}$$

$$v\{\hat F_r(t)\} \approx N^{-2}\,v\Big\{\Delta(t-y_i) - \frac{\hat F_0(t)}{\hat F_{0x}(t/\hat R)}\,\Delta(t-\hat R x_i)\Big\}$$

where $\hat F_{0x}(t)$ is the customary design-based estimator of $F_x(t)$, defined similarly to (6.2.1), and $V$, $v$ denote, respectively, the design-variance and the estimator of the design-variance.


The predictor $\hat F_{RKM}(t)$ as well as $\hat F^*_{RKM}(t)$ is asymptotically model-unbiased with respect to (6.1.2). The asymptotic design-variance of $\hat F^*_{RKM}(t)$, which is the same as that of $\hat F_{RKM}(t)$, is given by

Similarly,

$$V(\hat F_{ps}(t)) \approx N^{-2}\,V(\Delta(t - y_i) - F_{h(i)}(t))$$

where $h(i)$ is the post-stratum to which $i$ belongs and

(6.5.13)

(6.5.14)

(6.5.15)

A variance estimator with possibly superior conditional properties is obtained following Rao (1985) and Särndal et al (1989) by replacing $\Delta(t - y_i) - \hat F_{h(i)}(t)$ by $N_{h(i)}\{\Delta(t - y_i) - \hat F_{h(i)}(t)\}/\hat N_{h(i)}$.

Chambers et al (1992) examined the consistency and asymptotic mse of $\hat F_{CD}(t)$ and $\hat F_{RKM}(t)$ based on the model (6.1.3) under the assumption that the sampling is by srswor and the assumptions that (i) $n/N \to f \in (0,1)$ and (ii) the sampled and non-sampled design points have a common asymptotic density $d$, i.e.

$$\frac{1}{n}\sum_{i\in s}\Delta(x - x_i) \to \int_{-\infty}^{x} d(y)\,dy$$

$$\frac{1}{N-n}\sum_{i\in r}\Delta(x - x_i) \to \int_{-\infty}^{x} d(y)\,dy$$

(6.5.15)

We shall call these assumptions A. It then follows that the model-bias of both $\hat F_{CD}(t)$ and $\hat F_{RKM}(t)$ is of order $O(1/n)$ and the s.e. is of order $O(1/\sqrt{n})$, so that the mse is approximately equal to the variance of the estimator. It is found that

$$ASV\{F_{s_n}(t) - F(t)\} = ASV\{\hat F_{RKM}(t) - F(t)\}$$

where $F_{s_n}(t)$ has been defined in (6.2.2) and ASV denotes asymptotic variance. The $ASV\{\hat F_{CD}(t)\}$ is found to be lower than that of $\hat F_{RKM}(t)$ in general when the model (6.1.3) holds. However, this result does not hold in certain situations. The authors simulated conditions under which $ASV(\hat F_{CD})$ would be greater than that of $\hat F_{RKM}$ or even $F_{s_n}(t)$, even when (6.1.3) holds. Two artificial populations, each of size $N = 550$ and with $\alpha = \beta = 1$, were employed. In the first population, the $\epsilon_k$'s were generated


from a standard exponential distribution and the $x_k$'s according to a double exponential, truncated on the left and shifted to the right to give positively skewed values. For the second population, $\epsilon_k$ and $x_k$ were shifted from a mean-centred standard gamma distribution with slope parameter 0.1. In addition, a small bump was put in the extreme right to widen the gap between the mean and mode. For each population, 500 simple random samples of size $n = 100$ were taken and $\hat F_{CD}(t)$, $\hat F_{RKM}(t)$ and $F_{s_n}(t)$ were calculated for certain values of $t\ (= t_0)$ and for the population medians. For the first population and $t_0$, all the estimators were found to be approximately unbiased (average error approximately zero), $\hat F_{CD}$ having minimum variance among the three, followed by $\hat F_{RKM}$. For the second population and $t_0$, $\hat F_{CD}$ performed worst both with respect to average error and average standard error. However, for $t$ = population median, the poor performance of $\hat F_{CD}$ was not reflected.

Wang and Dorfman (1996) found the asymptotic variance of $\hat F_{WD}$ and its estimator under the assumptions A. Kuk (1993) proved the pointwise consistency of $\hat F_K(t)$ (defined in (6.6.2)) under assumption (i) of A and the assumption that the finite population values $(x_i, y_i)$ $(i = 1, \ldots, N)$ are realisations of $N$ independent random vectors having a continuous bivariate distribution.

6.6 NON-PARAMETRIC KERNEL ESTIMATORS

The last two estimators to be considered are the nonparametric kernel estimators proposed by Kuo (1988) and Kuk (1993), given, respectively, by

(6.6.1)

where

iEs jEr iEs

$$\hat F_K(t) = N^{-1}\sum_{j=1}^{N}\hat R_j$$

(6.6.2)

(6.6.3)

are weights for Kuo's estimator, $K(z) = e^{-z^2/2}$ is a standard normal density (kernel),

(6.6.4)

where

$$U_{ji} = w[(x_j - x_i)/b]\,W[(t - y_i)/b], \qquad V_{ji} = w[(x_j - x_i)/b] \tag{6.6.5}$$

and $W(z) = \frac{e^z}{1+e^z}$ is the standard logistic distribution function with density $w(z) = \frac{e^z}{(1+e^z)^2}$, and $b$ is the bandwidth parameter used to control the amount of smoothing.

$$V(\hat F_K) = \frac{1}{N^2}\Big[\sum_{i=1}^N V(\hat R_i) + 2\sum\sum_{i<j=1}^{N}\text{Cov}(\hat R_i, \hat R_j)\Big] \tag{6.6.6}$$

Since $\hat R_j$ is a ratio estimator, one can estimate $V(\hat R_j)$ and $\text{Cov}(\hat R_i, \hat R_j)$ by standard methods. Under srswor,

and

$$\bar v_j = \frac{1}{n}\sum_{i\in s} V_{ji}, \qquad \bar v_k = \frac{1}{n}\sum_{i\in s} V_{ki}$$

Substituting these in (6.6.6), one gets an unbiased estimator of $V(\hat F_K)$. The estimator of $V(\hat F_K)$ is almost always non-negative and the corresponding confidence interval has good coverage properties.

The estimator $\hat F_{KO}$ has been improved upon by Chambers et al (1992). Chambers, Dorfman and Wehrly (1993) considered a nonparametric calibration estimator of $F(t)$.

6.7 DESIRABLE PROPERTIES OF AN ESTIMATOR

Kuk (1993) and Silva and Skinner (1995) listed the following as desirable properties of an estimator $\hat F(t)$ of $F(t)$.

• (i) $\hat F(t)$ should have the properties of a distribution function, i.e. it should be monotonically increasing with $\hat F(-\infty) = 0$ and $\hat F(\infty) = 1$. This property holds for $\hat F_0(t)$, $\hat F_{ps}(t)$, $\hat F_{CD}(t)$, $\hat F_{KO}(t)$ and $\hat F_K(t)$. However, as noted by Kuk (1993), none of the estimators $\hat F_r(t)$, $\hat F_d(t)$, $\hat F_{RKM}(t)$ is monotonically increasing in general.

• (ii) It is desirable that as $y$ approaches $x$, the value of an auxiliary variable, $\hat F(t)$ should approach $F(t)$. In particular, if $y = x$, $\hat F(t)$ should equal $F(t)$. This property does not hold for $\hat F_0(t)$ as it makes no use of $x$-values. This property holds for each of $\hat F_{CD}$, $\hat F_r$, $\hat F_d$, $\hat F_{RKM}$ but not in general for $\hat F_K$, $\hat F_{KO}$. If $y_i = x_i \;\forall\; i$, then $\hat F_{ps}(t) = F(t)$ for $t = x_h$, $h = 1, \ldots, L$. For other values of $t$, equality will not hold in general.

• (iii) The estimator should make efficient and flexible use of the auxiliary information. Often the values of $x$ on all the units in the population are not available, but some summary information on these values is, e.g. in the case of a continuous variable like age, the number of persons $N_g$ in an age-group in lieu of the age of each individual. $\hat F_{CD}(t)$ was suggested with this aim in view. $\hat F_{ps}(t)$ can also be used with this limited information. The other estimators require individual $x$-values and cannot be used in these cases. As noted before, each of $\hat F_{CD}$, $\hat F_d$, $\hat F_r$, $\hat F_{RKM}$ can be extended to multiple regression model. The extensions of $\hat F_{KO}$ and $\hat F_K$ do not seem evident.

• (iv) Simplicity of estimators. Computations of estimators are particularly simple if

$$\hat F(t) = \sum_{i\in s} w_i\,\Delta(t - y_i)$$

where the weights $w_i$ depend only on the label $i$. This is particularly suitable for surveys with multiple characteristics. $\hat F_0$, $\hat F_{ps}$, $\hat F_{KO}$ possess this property. The estimators $\hat F_{CD}$, $\hat F_{RKM}$, $\hat F_K$ require intensive calculations.

• (v) Uniqueness in the definition. The expressions for $\hat F_0$, $\hat F_d$ $(d = 1)$, $\hat F_r$ are unique. $\hat F_{ps}$ depends on choice of the strata. $\hat F_{CD}$, $\hat F_{RKM}$


depend on the choice of the model. $\hat F_{KO}$, $\hat F_K$ require the specification of the bandwidth $b$, $\hat F_K$ also requiring appropriate scaling of the response variable.

• (vi) Availability of the variance-estimators. All the above estimators possess variance-estimators.

• (vii) The estimators should have good conditional properties. In particular, they should remain approximately unbiased over variations in values of the auxiliary variable $x$.

6.8 EMPIRICAL STUDIES

We first consider two populations employed by different authors.

(i) Chambers and Dunstan (1986) (CD) population: The population consisted of 330 sugarcane farms covered in the survey of Queensland sugarcane industry, Australia, 1982. The main variables were: $y_{(1)}$ (total cane harvested); $y_{(2)}$ (gross value of cane); $y_{(3)}$ (total farm expenditure). The auxiliary variable was $x$ (area assigned for cane planting). This population with $y = y_{(2)}$ obeys model (6.1.2) fairly well.

(ii) Beef cattle population of Chambers et al (1993): The population consisted of 430 farms with 50 or more beef cattle covered in the Australian Agriculture and Grazing Industries Surveys conducted by the Australian Bureau of Agriculture and Resource Economics in 1988 with $y$ as the income from beef and $x$ as the number of beef cattle in each farm. The true model for this population is a quadratic mean function with $v(x) \propto (x + 20)^{3/4}$ in (6.1.2).

We denote by $\theta(\alpha)$ the $\alpha$th population quantile,

$$\theta(\alpha) = \inf\{t : F(t) \ge \alpha\} \tag{6.8.1}$$
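As an illustration of definition (6.8.1), the following sketch computes $\theta(\alpha)$ for a fully observed finite population. The function name `population_quantile` and the small numerical tolerance are assumptions made for this example only.

```python
import numpy as np

def population_quantile(y, alpha):
    """theta(alpha) = inf{t : F(t) >= alpha} for a finite population y, 0 < alpha <= 1."""
    y = np.sort(np.asarray(y, dtype=float))
    N = len(y)
    F = np.arange(1, N + 1) / N          # lower bound k/N for F at the k-th order statistic
    # first order statistic whose k/N reaches alpha (tolerance guards rounding)
    k = np.searchsorted(F, alpha - 1e-12)
    return y[k]
```

For the population (5, 1, 3, 2, 4) the call `population_quantile([5, 1, 3, 2, 4], 0.5)` returns 3.0, the smallest value at which the distribution function reaches one half.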

Sometimes estimators are calculated for different quantiles $\theta_\alpha$, $\alpha = 1, \ldots, m$. Also, let there be $A$ samples and $\hat F^s$ denote the value of $\hat F$ on the sample $s$. Different measures of performance of $\hat F(t)$ are:

• (i) Relative Mean Error (RME)

$$= \frac{1}{A}\sum_{s=1}^{A}\big[\hat F^s(t) - F(t)\big]/F(t)$$


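• (ii) Relative Root Mean Square Error (RRMSE)

$$= \sqrt{\frac{1}{A}\sum_{s=1}^{A}\big[\hat F^s(t) - F(t)\big]^2/F(t)^2}$$

• (iii) Average Absolute Bias (AAB)

$$= \frac{1}{m}\sum_{\alpha=1}^{m}\big|\mathrm{Bias}\{\hat F(\theta_\alpha)\}\big|$$

where

$$\mathrm{Bias}\{\hat F(t)\} = \frac{1}{A}\sum_{s=1}^{A}\big[\hat F^s(t) - F(t)\big]$$

• (iv) Average Root Mean Square Error (ARMSE)

$$= \frac{1}{m}\sum_{\alpha=1}^{m}\mathrm{RMSE}\{\hat F(\theta_\alpha)\}$$

where

$$\mathrm{RMSE}(\hat F(t)) = \sqrt{\frac{1}{A}\sum_{s=1}^{A}\big(\hat F^s(t) - F(t)\big)^2}$$

• (v) Maximum Absolute Deviation of $\hat F(t)$ for a given sample $s$ (MAD(s))

$$= \max_{\alpha}\big|\hat F^s(\theta_\alpha) - \alpha\big|$$

One should consider an estimator $\hat F_1$ optimal if

$$\mathrm{MAD}(s)(\hat F_1) = \min_{\hat F}\,\mathrm{MAD}(s)(\hat F),$$

$\hat F, \hat F_1 \in \mathcal{F}$, a class of estimators.

• (vi) Average Minimum Absolute Error (AMAE)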


We shall write $t_1 \succ t_2$ with respect to $a$ to denote that the estimator $t_1$ is better than $t_2$ with respect to the property $a$.
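Some of the performance measures listed above can be computed from simulation output along the following lines. This is only a sketch: it assumes that RME is the signed mean relative error, that ARMSE averages the RMSE over the $m$ quantiles considered, and that the estimates are stored with one row per sample; all function names are hypothetical.

```python
import numpy as np

def rme(est, F_t):
    """Relative mean error of A estimates of F(t) (signed, assumed definition)."""
    est = np.asarray(est, dtype=float)
    return np.mean(est - F_t) / F_t

def rrmse(est, F_t):
    """Relative root mean square error of A estimates of F(t)."""
    est = np.asarray(est, dtype=float)
    return np.sqrt(np.mean((est - F_t) ** 2)) / F_t

def armse(est_matrix, F_vals):
    """Average over m quantiles of the RMSE; est_matrix has shape (A, m)."""
    est_matrix = np.asarray(est_matrix, dtype=float)
    F_vals = np.asarray(F_vals, dtype=float)
    rmse = np.sqrt(np.mean((est_matrix - F_vals) ** 2, axis=0))
    return rmse.mean()

def mad(est_row, alphas):
    """Maximum absolute deviation |F^s(theta_alpha) - alpha| for one sample."""
    est_row = np.asarray(est_row, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    return np.max(np.abs(est_row - alphas))
```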

Chambers and Dunstan (1986) compared the performances of $\hat F_{CD}$ and $\hat F_0$ through a simulation study based on samples drawn from the CD-population using the following sampling designs: (a) srs (b) stratified random sampling with two strata and proportional allocation (c) same as in (b) but with optimum allocation. Strata boundaries were such as to make the strata sizes (total of $x$-values) constant over strata. For each sampling design, 1000 samples each of size $n = 30$ were drawn. From each sample, estimates of $F(t)$ for the quantiles $t = \theta_N(1/4), \theta_N(1/2), \theta_N(3/4)$ (where $\theta_N(\alpha) = \inf\{t : F(t) \ge \alpha\}$) were calculated for each of the study variables. $\hat F_{CD}(t)$ was found to be better than $\hat F_0(t)$ in terms of RRMSE. However, $\hat F_{CD}$ was slightly more biased than $\hat F_0$.

The 1000 samples were ordered by their $x$-sample means, split into 20 groups of size 50 each and RME of the estimators were calculated for each group. The RME of $\hat F_{CD}$ remained approximately unaffected over variations in $x$ whereas $\hat F_0$ showed a linear decreasing trend.
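The conditional analysis just described (ordering samples by their $x$-means, grouping them and computing RME within each group) can be sketched as below. The function name `conditional_rme`, its arguments and the use of a signed RME are assumptions for illustration.

```python
import numpy as np

def conditional_rme(xbar, est, F_t, n_groups=20):
    """Group samples by ordered sample mean of x and compute RME within each group.

    xbar : sample means of x, one per simulated sample
    est  : estimates of F(t), one per simulated sample
    Returns (group mean of xbar, group RME), each of length n_groups.
    """
    xbar = np.asarray(xbar, dtype=float)
    est = np.asarray(est, dtype=float)
    order = np.argsort(xbar)                       # samples ordered by x-mean
    groups = np.array_split(order, n_groups)       # consecutive groups of samples
    group_x = np.array([xbar[g].mean() for g in groups])
    group_rme = np.array([np.mean(est[g] - F_t) / F_t for g in groups])
    return group_x, group_rme
```

Plotting `group_rme` against `group_x` then shows whether an estimator's relative error drifts with the sample mean of $x$.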

Rao et al (1990) compared RME and RMSE of $\hat F_0$, $\hat F_r$, $\hat F_{RKM}$ and $\hat F_{CD}$ based on (6.1.2) with $v(x_i) = \sqrt{x_i}$ on the basis of CD-population with $y$ as $y_{(2)}$. Sampling designs, number of samples, sample sizes and quantiles $\theta(\alpha)$ were the same as in the CD-simulation study. It was found that

$$\hat F_d, \hat F_{RKM} \succ \hat F_r \ (\text{specially, for small } d) \succ \hat F_{CD}$$

with respect to RME. Also,

$$\hat F_{RKM} \succ \hat F_d \succ F_{sn} \ \text{for srs}$$

$$\hat F_d \succ \hat F_r$$

with respect to RMSE. The model-based estimator $\hat F_{CD}$ was found to be significantly more efficient than the design-based $\hat F_0 (= F_{sn}$ for srs), $\hat F_d$, $\hat F_r$, $\hat F_{RKM}$, since the data seemed to obey the model (6.1.2). The conditional performance of the estimators was studied as in Chambers and Dunstan (1986). It was found that the RME of $\hat F_d$, $\hat F_{CD}$, $\hat F_{RKM}$, $\hat F_r$ remained more or less stable over variations in $x$ (RME $(\hat F_{RKM}) \in [-.03, .03]$, RME $(\hat F_r)$, RME $(\hat F_d)$ varied over $[-.09, .05]$, RME $(\hat F_{CD}) \approx -.04$), while that of $\hat F_0$ showed a linear trend (RME $(F_{sn}) \in [-.2, .2]$). The authors also studied the performance of these estimators with respect to Hansen, Madow and Tepping (1983) population.

Silva and Skinner (1995) considered Monte Carlo comparison of $\hat F_{CD}$, $\hat F_d$, $\hat F_r$, $\hat F_{RKM}$, $\hat F_K$, $\hat F_{KK}$, $\hat F_{ps}$ by selecting 1000 samples of sizes 30 and 50 by


srswor from each of CD-population with $y$ as income and Chambers et al (1993) beef population. Three alternative schemes of post-stratification were used:

(a) The choice $x_{(1)} < \ldots < x_{(L-1)}$ such that $N_h = N/L \;\;\forall\; h = 1, \ldots, L$

(b) The choice for which

$$\sum_{i\in P_h}\sqrt{x_i} = \sum_{i=1}^{N}\sqrt{x_i}/L \quad \forall\; h$$

(c) The choice for which

$$\sum_{i\in P_h}x_i = \sum_{i=1}^{N}x_i/L \quad \forall\; h$$

For each sample estimates were calculated for 11 different quantiles $\theta(\alpha)$, $\alpha = \frac{1}{12}, \ldots, \frac{11}{12}$. The numerical study indicated there was considerable gain in efficiency for $\hat F_{ps}$ over $\hat F_0$. For these populations four seemed to be the optimum number of strata. Also AAB$(\hat F_{ps})$ was small. $\hat F_{ps}$ was found to be better than $\hat F_r$, $\hat F_d$, $\hat F_K$ and worse than $\hat F_{CD}$, $\hat F_{RKM}$ from the point of view of ARMSE.

Kuk (1993) compared $\hat F_0$, $\hat F_{RKM}$, $\hat F_{CD}$ (both based on model (6.1.2) with $v(x) = \sqrt{x}$) and $\hat F_K$ on the basis of CD-population with $y = y_{(2)}$ and beef population. Samples of size $n = 30$ were drawn using srs, stratified random sampling with $x$-stratification and proportional allocation, and ppswr from CD-data. For each sample $\hat F(\theta_\alpha)$ was computed for $\alpha = \frac{1}{12}, \ldots, \frac{11}{12}$. The criteria of comparison were average bias, ARMSE and AMAE. It followed that $\hat F_K \succ \hat F_0, \hat F_{RKM}$ with respect to ARMSE and AMAE. However, $\hat F_K$ was not as efficient as $\hat F_{CD}$. The conditional behaviour of the estimators was studied by splitting the 200 ordered (according to $x$-values) samples into 10 groups of equal size and then calculating the conditional bias for each group. The conditional relative bias of $\hat F_K$ was found to be small and invariant while $\hat F_0$ exhibited trend over the variation of $x$. Since the data were highly skewed, transformed variables $x' = x^{1/4}$ and $y' = (y/100)^{1/4}$ were used. Samples, 200 in number, each of sizes 30, 60 and 90 were drawn by pps and $\hat F_{CD}$, $\hat F_{RKM}$ (both based on (6.1.2) with $v(x) = x$) were calculated for $\alpha = \frac{1}{2}, \frac{3}{4}$. It was found that for $\alpha = \frac{3}{4}$, $\hat F_{CD} \succ \hat F_K \succ \hat F_{RKM}$ with respect to RMSE. For $\alpha = \frac{1}{2}$, $\hat F_{CD}$ was inferior to $\hat F_K$ and $\hat F_{RKM}$. For $\alpha = \frac{1}{2}$, the relative bias of $\hat F_{CD}$ was close to 20% with root mse much larger than that of $\hat F_K$ which was best. The bias of $\hat F_{CD}$ remained constant over


changes in sample sizes, meaning $\hat F_{CD}$ was not asymptotically m-unbiased with respect to the data. In terms of MAE, $\hat F_K$ was comparable to $\hat F_{CD}$ for $n = 30$ and was better than $\hat F_0$, $\hat F_{CD}$, $\hat F_{RKM}$ for $n = 90$. With respect to conditional relative bias $\hat F_K$ was better than each of $\hat F_{CD}$, $\hat F_{RKM}$ and $\hat F_0$.

Dorfman (1993) made empirical comparison of $\hat M_{CD}$, $\hat M_{RKM*}$ (based on a quadratic model, $y_i = \alpha + \beta x_i + \gamma x_i^2 + \sqrt{x_i}\,\epsilon_i$, $E(\epsilon_i) = 0$, $V(\epsilon_i) = \sigma^2$), $\hat M_{RKM}$ and $\hat M_0$ where $M(t) = \frac{1}{N-n}\sum_{i\in r}\Delta(t - y_i)$ and $\hat M_{\cdot}$ is obtained from $\hat F_{\cdot}$ (for example, $\hat M_{RKM*}$ is obtained from $\hat F_{RKM*}$ in (6.3.19)). Simple random samples, 1000 in number, each of size $n = N/10$ were selected from each of five populations based on data collected in beef cattle population and the above estimators were calculated for each sample at each quartile of the population. It was found that $\hat M_0$ was invariably less efficient than $\hat M_{RKM}$ or $\hat M_{RKM*}$, which were very close, indicating that the last two estimators were stable under changes of the model. $\hat M_{CD}$ was far better than the other estimators in three populations while in two other populations $\hat M_{RKM}$ fared better than $\hat M_{CD}$ particularly for $\alpha = 1/4$. The same trend was found for $n = N/5$.

Dunstan and Chambers (1989) compared the performance of $\hat F_{CD}(t)$, $\hat F^{(L)}_{CD}(t)$ and $\hat F_0(t)$ as well as their variance estimators $v(\hat F_{CD})$, $v(\hat F^{(L)}_{CD})$, $v(\hat F_0)$ (with $v(x) = \sqrt{x}$ in (6.1.2) everywhere) on the basis of 1000 samples each of size 30 drawn independently from CD-population using the following sampling designs: (a) srswor with post-stratification into (i) six post-strata, suitably defined (ii) three post-strata formed by collapsing six post-strata into pairs (b) stratified random sampling with proportional allocation using the six post-strata in (a i) above as strata. The estimates for each of $y_{(1)}, y_{(2)}, y_{(3)}$ were calculated for each sample for all the quantiles along with their variance estimates. The criterion was repeated sampling average, $\bar F(t) = \frac{1}{1000}\sum_{s=1}^{1000}\hat F^s(t)$ and RMSE. It was found that $\hat F_{CD}$ and $\hat F^{(L)}_{CD}$ had very similar performance, both having slight repeated sampling bias and having RMSE smaller than $\hat F_0$. The performance of $v(\hat F)$ was assessed by comparing $\frac{1}{1000}\sum_{s=1}^{1000}\sqrt{v_s(\hat F)}$ with RMSE$(\hat F)$ where $v_s(\hat F)$ is the value of $v(\hat F)$ for the $s$th sample and by checking the coverage property of the confidence intervals generated by these strategies. The coverage was closest to its nominal value for all the three estimators at $\alpha = 1/2$.


6.9 BEST UNBIASED PREDICTION (BUP) UNDER GAUSSIAN SUPERPOPULATION MODEL

Bolfarine and Sandoval (1993) considered best unbiased prediction (BUP) of $F(t)$ under multiple regression model with errors having a Gaussian distribution. Consider the model

$$y = X\beta + e, \qquad e \sim N(0, \sigma^2 W) \tag{6.9.1}$$

where $y = (y_1, \ldots, y_N)'$, $e = (e_1, \ldots, e_N)'$, $X = ((x_{ij}, i = 1, \ldots, N; j = 1, \ldots, p))_{N\times p}$, $x_{ij}$ being the value of variable $x_j$ on unit $i$, $\beta = (\beta_1, \ldots, \beta_p)'$, a vector of $p$ regression coefficients, $\sigma^2$ a known constant and $W$ a $N \times N$ known diagonal matrix. We shall denote by $y_s, X_s, e_s, W_s$ the parts of $y, X, e, W$, respectively, corresponding to sample $s$; the same symbols with $r$ in place of $s$ will denote the parts corresponding to non-sampled units. If the sample is drawn by srs, then the model (6.9.1) holds for the sampled elements as well.

Rodrigues et al (1985) considered the following definition and proved Theorem 6.9.1 in the case of survey sampling.

DEFINITION 6.9.1 Complete and Totally Sufficient Statistics. A statistic $S = S(y_s)$ is said to be totally sufficient for the family $\{\xi_\theta, \theta \in \Theta\}$ where $\xi_\theta$ is the pdf of $y_s$ depending on some unknown parameter $\theta$, if (i) the conditional distribution of $y_s$ given $S$ is independent of $\theta$ (ii) $y_s$, $y_r$ are conditionally independent given $S$.

A totally sufficient statistic $S$ is said to be complete if the induced family of sampling distributions of $S$ is complete.

Condition (ii) means that $S$ contains all information contained in $y_s$ about $y_r$. In case $y_s$ and $y_r$ are independent (ii) always holds.

THEOREM 6.9.1 Let $S = S(y_s)$ be a complete and sufficient statistic for $\theta$ which is also totally sufficient. If $\hat B(y_s) = B(y_s) + \hat B_{rs}(S)$ is $\xi$-unbiased for $B(y) = B(y_s) + B_{rs}(y)$ then $\hat B(y_s)$ is the unique best unbiased predictor (BUP) (in the sense of having minimum $\xi$-variance among all $\xi$-unbiased predictors) for $B(y)$.

Clearly, if we have an m-unbiased predictor of $F(t)$ which depends on $y_s$ only through $S$, then it is the BUP of $F(t)$. Under model (6.9.1),

$$\hat\beta_s = (X_s'W_s^{-1}X_s)^{-1}X_s'W_s^{-1}y_s \tag{6.9.2}$$

is complete and sufficient for $\beta$ and since $W$ is diagonal it is also totally sufficient and complete for $\beta$. The BUP of $F(t)$ is then obtained using Theorem 6.9.1 and some results about estimation of $F(t)$ in infinite population due to Olkin and Ghurye (1969).

THEOREM 6.9.2 Under model (6.9.1) the BUP of $F(t)$ is given by

$$\hat F_{BU}(t) = \frac{n}{N}F_{sn}(t) + \frac{1}{N}\sum_{i\in r}\Phi\left(\frac{t - x_i'\hat\beta_s}{\sigma\sqrt{w_i}\sqrt{1 - w_i^{-1}x_i'(X_s'W_s^{-1}X_s)^{-1}x_i}}\right) \tag{6.9.3}$$

provided

$$1 - x_i'(X_s'W_s^{-1}X_s)^{-1}x_i/w_i > 0 \quad \forall\; i \in r$$

where $\Phi$ is the distribution function of the standard normal deviate and $W = \mathrm{Diag}(w_1, \ldots, w_N)$ and $x_i = (x_{i1}, \ldots, x_{ip})'$.

PROOF. When $W = I_N$, according to Olkin and Ghurye (1969),

$$E\left[\Phi\left(\frac{t - x_i'\hat\beta_s}{\sigma\sqrt{1 - x_i'(X_s'X_s)^{-1}x_i}}\right)\right] = E[\Delta(t - y_i)]$$

which shows that if $x_i'(X_s'X_s)^{-1}x_i < 1 \;\forall\; i \in r$, $\hat F_{BU}(t)$ is an unbiased estimator of $F(t)$ by (6.3.1). Since $\hat F_{BU}(t)$ is a function of sufficient statistic the result follows by Theorem 6.9.1. The result for $W = \mathrm{Diag}(w_1, \ldots, w_N)$ follows from the case $W = I_N$ by making the transformation $y_i^* = y_i/\sqrt{w_i}$, $x_i^* = x_i/\sqrt{w_i}$, $e_i^* = e_i/\sqrt{w_i}$.

EXAMPLE 6.9.1

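Suppose $X = 1_N$, $W = I_N$ in model (6.9.1). The complete and totally sufficient statistic is $\hat\beta_s = \bar y_s$; in this case

$$\hat F_{BU}(t) = \frac{n}{N}F_{sn}(t) + \Big(1 - \frac{n}{N}\Big)\Phi\Big(\sqrt{\frac{n}{n-1}}\;\frac{t - \bar y_s}{\sigma}\Big) = F^*(t) \ \text{(say)} \tag{6.9.4}$$

NOTE 6.9.1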


Under model (6.9.1) with $X = 1_N$, $W = I_N$, $F_{sn}(t)$ is an m-unbiased predictor of $F(t)$. If $\xi$ is a family of continuous distributions (not necessarily normal), $F_{sn}(t)$ being a symmetric function of order statistic $y_{(s)}$ (order statistic corresponding to $y_s$) is a totally sufficient statistic and hence, by Theorem 6.9.1 is BUP for $F(t)$.
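The predictor $F^*(t)$ of Example 6.9.1 (common mean, unit weights, known $\sigma$) can be evaluated as in the sketch below. It assumes the form $\frac{n}{N}F_{sn}(t) + (1-\frac{n}{N})\Phi\big(\sqrt{n/(n-1)}\,(t-\bar y_s)/\sigma\big)$ obtained by setting $x_i = 1$, $w_i = 1$ in Theorem 6.9.2; the function names are hypothetical.

```python
import numpy as np
from math import erf, sqrt

def std_normal_cdf(z):
    # Phi(z) through the error function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def f_star(t, y_s, N, sigma):
    """BUP of F(t) under y_i = beta + e_i, e_i ~ N(0, sigma^2), sigma known, n >= 2."""
    y_s = np.asarray(y_s, dtype=float)
    n = len(y_s)
    f_sn = np.mean(y_s <= t)                         # sample d.f. at t
    z = sqrt(n / (n - 1.0)) * (t - y_s.mean()) / sigma
    return (n / N) * f_sn + (1.0 - n / N) * std_normal_cdf(z)
```

At $t = \bar y_s$ the normal term equals one half, so the prediction is a weighted average of the sample d.f. and 1/2 with weights $n/N$ and $1 - n/N$.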

Considering the model of Example 6.9.1 under assumptions that the finite population sequence $P_\nu$ of size $N_\nu$ is an increasing sequence such that as $\nu \to \infty$, $N_\nu - n_\nu \to \infty$ with $n_\nu/N_\nu \to f$, $f \in [0, 1]$, and applying Lindeberg-Levy CLT,

and

$$\ldots + \frac{f}{1-f}\,\Phi\Big(\frac{t-\beta}{\sigma}\Big)\Big[1 - \Phi\Big(\frac{t-\beta}{\sigma}\Big)\Big]$$

(6.9.5)

(6.9.6)

(6.9.7)

where $\phi(\cdot)$ is the density function of a standard Normal distribution, $F^*(t)$ is obtainable from (6.9.4) and $F_{N_\nu}(t)$, $F_{s_\nu}(t)$ are, respectively, population d.f. and sample d.f. for $P_\nu$. From (6.9.5) and (6.9.6), asymptotic relative efficiency of $F_{s_\nu}(t)$ with respect to $\hat F_{BU}(t)$ is

$$\mathrm{ARE}\,(F_{s_\nu}(t) : \hat F_{BU}(t)) = \frac{\mathrm{APV}(\hat F_{BU}(t))}{\mathrm{APV}(F_{s_\nu}(t))} = f + \frac{(1-f)\,\phi^2((t-\beta)/\sigma)}{\Phi\big(\frac{t-\beta}{\sigma}\big)\big(1 - \Phi\big(\frac{t-\beta}{\sigma}\big)\big)}$$

where APV denotes asymptotic prediction variance. For the case $t = \beta$ this reduces to

$$\mathrm{ARE} = f + .637(1 - f)$$

Again, ARE is a decreasing function of $|t - \beta|$ and as $|t - \beta| \to \infty$, ARE $\to f$.
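The constant .637 arises because at $t = \beta$ one has $\phi^2(0) = 1/(2\pi)$ and $\Phi(0)\{1 - \Phi(0)\} = 1/4$, so the ratio equals $2/\pi \approx .637$. The short sketch below evaluates the ARE expression numerically; the function name `are` is hypothetical and the code assumes $|t - \beta|/\sigma$ is moderate so that $1 - \Phi$ does not underflow.

```python
from math import erf, exp, pi, sqrt

def are(t, beta, sigma, f):
    """ARE = f + (1 - f) * phi(z)^2 / (Phi(z) * (1 - Phi(z))), z = (t - beta) / sigma."""
    z = (t - beta) / sigma
    phi = exp(-0.5 * z * z) / sqrt(2.0 * pi)      # standard normal density
    Phi = 0.5 * (1.0 + erf(z / sqrt(2.0)))        # standard normal d.f.
    return f + (1.0 - f) * phi ** 2 / (Phi * (1.0 - Phi))
```

Evaluating the function at increasing values of $|t - \beta|$ illustrates the monotone decrease towards $f$ noted in the text.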

6.9.1 EMPIRICAL STUDY

Considering model (6.9.1) with $X = (x_1, \ldots, x_N)'$, $W = \mathrm{Diag}(w_1, \ldots, w_N)$ and a Gaussian distribution of error, CD-type estimator, obtained by using Royall's approach is

which is closely related to BUP $\hat F_{BU}(t)$, especially if $n$ is large.

Bolfarine and Sandoval compared $\hat F_{BUP}$, $\hat F_{CD}$, $F_{sn}$ and $\hat F_r$ on the basis of 1000 srs each of size $n = 10$ drawn from a population of size $N = 1000$ generated according to the model

$$y_i = 3x_i + e_i \quad (i = 1, \ldots, 100)$$

$e_i \sim N(0, 8^2x_i)$, the $x_i$'s being generated according to Uniform (10, 200). For each sample estimates of $F(t)$ were calculated at the quartiles $t = \theta(1/4)$, $t = \theta(1/2)$, $t = \theta(3/4)$. The estimates were compared with respect to repeated sampling mse. It was found that

$$\hat F_{BUP}, \hat F_{CD} \succ\succ F_{sn} \succ \hat F_r$$

$$\hat F_d, \hat F_{RKM} \succ F_{sn}, \hat F_r$$

The estimator $\hat F_{CD}$ performed closely with $\hat F_{BUP}$; performance of $\hat F_{CD}$ was poor for $\alpha = 1/4$. One may, therefore, conclude that under normal superpopulation models, the model-based predictors provide improvement over design-based predictors, specially, for small values of $\alpha$.

As in Chambers and Dunstan (1986), 1000 samples were ordered according to $\bar x_s$ values and divided into 20 groups of 50 samples each. The average bias

$$\frac{1}{50}\sum_{s=1}^{50}\big(\hat F^s(t) - F(t)\big)$$

was plotted against the $\bar x_s$ values. It was found that $\hat F_r$ was more affected by variation in $\bar x_s$-values than were $\hat F_{BUP}$ and $\hat F_{CD}$.

Similar studies with large sample prediction variance (as in (6.9.7)) showed that variance decreased as $\bar x_s$ increased. The optimum sampling design is, therefore, to choose a sample with the largest $\bar x_s$ values with probability one.

6.10 ESTIMATION OF MEDIAN

Since many real life populations are highly skewed, the estimation of median is often of interest. Kuk and Mak (1989) suggested the following method for estimating the finite population median $M_y = M = \theta(1/2)$. In the absence of auxiliary information $x$, a natural estimator of $M$ is sample median,

$$\hat M_{sn} = m_y \tag{6.10.1}$$

When the values of the auxiliary variable $x$ are available, the ratio estimator of $M_y$ is

$$\hat M_r = \frac{m_y}{m_x}M_x \tag{6.10.2}$$

Let $y_{(1)} \le \ldots \le y_{(n)}$ be the ordered values of $y$ in $s$. Let $i_0$ be an integer such that

$$y_{(i_0)} \le M_y \le y_{(i_0+1)}$$

and $p = i_0/n$. Thus $M_y$ is approximated by the $p$th sample quantile $Z_p$. Since $M_y$ is unknown $p$ is unobservable. If $\hat p$ is a guessed value of $p$, an estimate of $M$ is

$$\hat M(\hat p) = Z_{\hat p}$$

Let $n_x$ be the number of units in the sample with $x$ values $\le M_x$. Let $P_{11}$ be the proportion of population values with $y$-values $\le M_y$, $x$-values $\le M_x$; $P_{12}$ the same with $y$-values $\le M_y$, $x$-values $> M_x$; $P_{21}$ the same with $y$-values $> M_y$ and $x$-values $\le M_x$ and $P_{22} = 1 - P_{11} - P_{12} - P_{21}$. If $P_{ij}$'s are known, an estimate of $p$ is

$$\hat p \approx \frac{2}{n}\Big[n_xP_{11} + (n - n_x)\Big(\frac{1}{2} - P_{11}\Big)\Big] \tag{6.10.3}$$

since $P_{\cdot 1} \approx \frac{1}{2}$, $P_{1\cdot} \approx \frac{1}{2}$. In practice, the $P_{ij}$'s are usually unknown and are estimated by the sample proportions $p_{ij}$ obtained by similar cross-classification of the values in the sample against the sample medians $m_y$ and $m_x$. Therefore, from (6.10.3), a sample-based estimate of $p$ is

$$\hat p_1 = \frac{2}{n}\Big[n_xp_{11} + (n - n_x)\Big(\frac{1}{2} - p_{11}\Big)\Big] \tag{6.10.4}$$

and an estimator of $M_y$ is

$$\hat M_p = \hat M_{yp} = Z_{\hat p_1}$$

and is referred to as the 'position estimator'.
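The ratio and position estimators of the median can be sketched as follows. The code is illustrative only: it assumes $\hat M_r = m_yM_x/m_x$, takes the sample-based $\hat p_1$ as (6.10.3) with $P_{11}$ replaced by the sample proportion $p_{11}$, uses `numpy.median` for the sample medians, and defines the $p$th sample quantile as the $\lceil np\rceil$th order statistic; the function names are hypothetical.

```python
import numpy as np

def sample_quantile(y_sorted, p):
    """p-th sample quantile taken as the ceil(n*p)-th order statistic (assumed convention)."""
    n = len(y_sorted)
    k = min(max(int(np.ceil(p * n - 1e-12)), 1), n)
    return y_sorted[k - 1]

def ratio_median(x_s, y_s, M_x):
    """Ratio estimator m_y * M_x / m_x of the population median of y."""
    return np.median(y_s) * M_x / np.median(x_s)

def position_median(x_s, y_s, M_x):
    """Position estimator Z_{p_hat}, with p_hat built from n_x and the sample p_11."""
    x_s = np.asarray(x_s, dtype=float)
    y_s = np.asarray(y_s, dtype=float)
    n = len(y_s)
    m_x, m_y = np.median(x_s), np.median(y_s)
    n_x = np.sum(x_s <= M_x)                          # sample units with x <= M_x
    p11 = np.mean((x_s <= m_x) & (y_s <= m_y))        # sample cross-classification
    p_hat = 2.0 / n * (n_x * p11 + (n - n_x) * (0.5 - p11))
    return sample_quantile(np.sort(y_s), p_hat)
```

With $x_s = (1,2,3,4)$, $y_s = (2,4,6,8)$ and a known $M_x = 3.5$ larger than the sample median of $x$, both estimators move above the sample median 5 of $y$ (to 7 and 6 respectively), which is the intended correction.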

Another estimator of $M_y$ is derived from Kuk and Mak (1989) estimator of d.f. as

$$\hat M_{(KM)} = \inf\{t : \hat F_{KM}(t) \ge 1/2\}.$$


We consider now asymptotic properties of the estimates. Assume that as $N \to \infty$, $n/N \to f \in [0, 1]$ and the distribution of $(X, Y)$ approaches a bivariate continuous distribution with marginal densities $f_x(x)$ and $f_y(y)$, respectively, and that $f_x(M_x) > 0$, $f_y(M_y) > 0$. Under these conditions, the sample median $m_y$ is consistent and asymptotically normal with mean $M_y$ and variance

$$\frac{1-f}{4n}\,\frac{1}{\{f_y(M_y)\}^2} = \sigma_y^2$$

(Gross, 1980). It follows that the asymptotic distribution of $(m_x - M_x, m_y - M_y)$ is bivariate normal with mean zeroes and variances $\sigma_x^2$, $\sigma_y^2$ (defined similarly) and covariance

Now,

(6.10.5)

Since $m_x/M_x \to 1$, $\hat M_r - M_y$ has the same distribution as

Thus, $\hat M_r - M_y$ is asymptotically normal with mean 0 and variance

Consequently, $\hat M_r$ is asymptotically more efficient than $m_y$ if

where $\rho_c = 4(P_{11} - \frac{1}{4}) \in [-1, 1]$ as $P_{11} \in [0, \frac{1}{2}]$. The quantity $P_{11}$ can be regarded as a measure of concordance between $x$ and $y$. Similarly, the authors considered asymptotic distribution of $\hat M_p$ and $\hat M_{(KM)}$ both of which are found to be more efficient than $\hat M_{sn}$.

In an empirical study the authors show that for populations showing a strong linear relationship between $x$ and $y$, $\hat M_r$, $\hat M_p$, $\hat M_{(KM)}$ perform considerably better than $m_y$. However, if the correlation coefficient between $x$ and $y$ is weak ($P_{11}$ small), $\hat M_r$ performs very poorly while $\hat M_p$, $\hat M_{(KM)}$ retain their superiority relative to $m_y$.


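Two estimators of $\xi$ due to Kuk (1988) are

Again,

$$\hat F_\lambda(t) = \lambda\hat F_L(t) + (1 - \lambda)\hat F_n(t), \quad 0 < \lambda < 1$$

is also an estimator of $F(t)$. An estimator of $\xi$ is, therefore,

$$\hat\xi_\lambda = \inf\{t : \hat F_\lambda(t) \ge 1/2\} \tag{6.10.6}$$

Behaviour of $\hat\xi_\lambda$ depends largely on the behaviour of $\hat F_\lambda$ near $\xi$. Now,

$$V\{\hat F_\lambda(\xi)\} = \lambda^2V\{\hat F_L(\xi)\} + (1 - \lambda)^2V\{\hat F_n(\xi)\} + 2\lambda(1 - \lambda)\,\mathrm{Cov}\{\hat F_L(\xi), \hat F_n(\xi)\}$$

The optimal value of $\lambda$ is

$$\lambda^* = \sum_{i=1}^{N}b_i\Delta(\xi - y_i)\Big/\sum_{i=1}^{N}b_i \tag{6.10.7}$$

($b_i$ has been defined in (6.2.19)). Assuming that the ordering of $y$-values agrees with that of $x$, an estimate of $\lambda^*$ is

$$\hat\lambda^* = \sum_{i=1}^{N}b_i\Delta(\eta - x_i)\Big/\sum_{i=1}^{N}b_i \tag{6.10.8}$$

where $\eta$ is the median of $x$. Therefore,

(6.10.9)

If $a_L(t)$ and $a_n(t)$ denote the mse's of $\hat F_L(t)$ and $\hat F_n(t)$, respectively, then it can be shown that

Hence,

$$\frac{b_L(\xi) - b_n(\xi)}{b_L(\infty)} = 1 - 2\lambda^*$$

where $b_L(t)$, $b_n(t)$ denote the mse's of $G_L(t)$ and $G_n(t)$, respectively, $G_L(t) = \hat F_L(t)|_{y_i = x_i}$ and similarly for $G_n(t)$.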


Empirical studies reported that $\hat\xi_n$ is considerably better than $\hat\xi_L$ and $\hat\xi_V$ (in conformity with the result $\hat F_n$ is always better than $\hat F_L$ and $\hat F_V$, defined in (6.2.11)). The performance of $\hat\xi_\lambda$ and $\hat\xi_{\hat\lambda}$ are usually at least as good as that of $\hat\xi_n$.

Since $\hat F_{CD}(t)$ is a monotonically non-decreasing function of $t$, Chambers and Dunstan (1986) obtained estimation of $\theta_N(\alpha)$ as

$$\hat\theta_{N;CD}(\alpha) = \inf\{t;\ \hat F_{CD}(t) \ge \alpha\} \tag{6.10.10}$$

Since $\hat F_{CD}(t)$ is asymptotically unbiased under (6.1.2), $\hat\theta_{N;CD}$ is also so. From Serfling (1980, Theorem 2.5.1) one can note the Bahadur representation of $\theta_N(\alpha)$ as

$$\theta_N(\alpha) = \theta(\alpha) + [\alpha - F_N\{\theta(\alpha)\}]/e_N\{\theta(\alpha)\} + o_p(N^{-1/2})$$

where $\theta(\alpha)$ is defined by $E[F_N(\theta(\alpha))] = \alpha$ and $e_N(t) = \frac{d}{dt}E\{F_N(t)\}$. Assuming a similar representation for $\hat\theta_{N;CD}(\alpha)$ for $N$, $n$ large, asymptotic variance of $\hat\theta_{N;CD}(\alpha) - \theta(\alpha)$, following Theorem 6.5.1, is

(6.10.11)

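Rao et al (1990) obtained ratio estimator of $\theta_N(\alpha)$ as

$$\hat\theta_r(\alpha) = \frac{\hat\theta_{N(y)}(\alpha)}{\hat\theta_{N(x)}(\alpha)}\,\theta_x(\alpha) \tag{6.10.12}$$

where

$$\hat\theta_{N(y)}(\alpha) = \inf\{t;\ \hat F_y(t) \ge \alpha\}, \qquad \hat\theta_{N(x)}(\alpha) = \inf\{t;\ \hat F_x(t) \ge \alpha\},$$

and $\theta_x(\alpha) = \inf\{t;\ F_x(t) \ge \alpha\}$ is the known finite population $\alpha$-quantile for $x$. Similarly, a difference estimator for $\theta(\alpha)$ is

$$\hat\theta_d(\alpha) = \hat\theta_{N(y)}(\alpha) + \hat R\,\{\theta_x(\alpha) - \hat\theta_{N(x)}(\alpha)\} \tag{6.10.13}$$

where $\hat R$ is defined in (6.2.3). Both $\hat\theta_r(\alpha)$ and $\hat\theta_d(\alpha)$ have ratio estimation property.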
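A small numerical sketch of the ratio and difference quantile estimators is given below. It assumes simple random sampling, so that the design-based d.f. estimators reduce to empirical d.f.'s, and it takes $\hat R$ to be the ratio of sample totals $\sum_s y_i/\sum_s x_i$, which is an assumption here since (6.2.3) is not reproduced in this section; all function names are hypothetical.

```python
import numpy as np

def sample_alpha_quantile(v, alpha):
    """inf{t : F_s(t) >= alpha} for the empirical d.f. of the sample values v."""
    v = np.sort(np.asarray(v, dtype=float))
    n = len(v)
    F = np.arange(1, n + 1) / n
    return v[np.searchsorted(F, alpha - 1e-12)]

def ratio_quantile(x_s, y_s, alpha, theta_x):
    """Ratio estimator: (sample y-quantile / sample x-quantile) * known x-quantile."""
    q_y = sample_alpha_quantile(y_s, alpha)
    q_x = sample_alpha_quantile(x_s, alpha)
    return q_y / q_x * theta_x

def difference_quantile(x_s, y_s, alpha, theta_x):
    """Difference estimator with R_hat assumed to be the ratio of sample totals."""
    q_y = sample_alpha_quantile(y_s, alpha)
    q_x = sample_alpha_quantile(x_s, alpha)
    r_hat = np.sum(y_s) / np.sum(x_s)
    return q_y + r_hat * (theta_x - q_x)
```

When $y_i = 2x_i$ exactly, both functions return $2\theta_x(\alpha)$, which illustrates the ratio estimation property mentioned above.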

Rao et al (1990) compared the RME and RRMSE of $\hat\theta_0(\alpha)$, $\hat\theta_d(\alpha)$, $\hat\theta_r(\alpha)$ for $\alpha = 1/4, 1/2$ and $3/4$ on the basis of samples drawn from


CD-population by (i) simple random sampling and (ii) stratified random sampling with $x$-stratification and proportional allocation. The relative bias of all the estimators was found to be small. For simple random sampling, $\hat\theta_r(\alpha)$ and $\hat\theta_d(\alpha)$ were found to be considerably more efficient than $\hat\theta_0(\alpha)$ with respect to RRMSE while their performance was almost identical for stratified random sampling as above. The conditional relative mean error of $\hat\theta_r$ and $\hat\theta_d$ remained more or less stable for variations in $x$ while that of $\hat\theta_0$ showed linear trends for $\alpha = 1/2$. Rao et al also considered variance estimates of these estimates. Sedransk and Meyer (1978) considered confidence intervals for the quantiles of a finite population under simple random sampling and stratified random sampling. Some other references on estimation of quantiles are McCarthy (1965), Loynes (1966), Meyer (1972), Sedransk and Meyer (1978), David (1981), Sedransk and Smith (1983), Meeden (1985), Francisco and Fuller (1991) and Bessel et al (1994).