Optimal Allocation in the Multi-way Stratification Design for Business Surveys (*)

Paolo Righi, Piero Demetrio Falorsi

Italian National Statistical Institute

(*) Research of National Interest n. 2007RHFBB3 (PRIN) “Efficient use of auxiliary information at the design and at the estimation stage of complex surveys: methodological aspects and applications for producing official statistics”

Outline

Statement of the problem
Multi-way Sampling Design
Multi-way optimal allocation algorithm
Monte Carlo simulation

Statement of the problem

Large-scale surveys in Official Statistics usually produce estimates for a set of parameters over a very large number of highly detailed estimation domains.

These domains generally define non-nested partitions of the target population.

When the domain indicator variables are available at the sampling frame level, we may plan a sample that covers each domain.

Fixing the domain sample sizes:
helps to control the sampling errors of the main estimates;
when direct estimators are not reliable (small area problem), having sampled units in the domains makes it possible to bound the bias of small area indirect estimators and to use models with specific small area effects.


The standard solution for fixing the sample sizes is to stratify the sample, with strata given by the cross-classification of the variables defining the different partitions (cross-classified or one-way stratified design).

Main drawback: the stratification is too detailed:
risk of sample size explosion;
inefficient sample allocation (constraint of 2 units per stratum);
risk of statistical burden (e.g. repeated business surveys).


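Domain of interest: parameter of interest and estimator

Multivariate (r = 1, …, R) and multidomain (d = 1, …, D) context. The parameter of interest is the domain total, estimated by the Horvitz-Thompson estimator:

$$t_{(dr)} = \sum_{k \in U} y_{rk}\,\gamma_{dk}, \qquad \hat{t}_{(dr)} = \sum_{k \in s} \frac{y_{rk}\,\gamma_{dk}}{\pi_k}$$

being $\gamma_{dk} = 1$ if $k \in U_d$ and $\gamma_{dk} = 0$ otherwise.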


The sampling strategy proposed here bases each domain estimate on a planned sample size. We consider a general random sampling design where the subpopulations $U_h$ (h = 1, …, H), of size $N_h$, define minimal planned subpopulations. We assume two cases: $U_d = U_h$, or $U_d = \bigcup_{h \in H_d} U_h$, where $H_d$ is a subset of $\{1, \ldots, H\}$.


Example: three domain types $T_l$ (l = 1, …, 3): NACE four-digit; NACE three-digit by size; NACE two-digit by geography. Each domain type defines a partition of the population of cardinality $D_l$, being $D = D_1 + D_2 + D_3$. Different sampling designs allow the sample size of the domains of interest to be planned:

the standard approach defines the $U_h$'s by combining the populations of the three domain types. Then $H = D_1 \times D_2 \times D_3$ and the $\boldsymbol{\delta}_k$ are defined as (0, …, 1, …, 0) vectors. We denote this design as cross-classified or one-way stratified design;

the $U_h$'s are defined by combining all the pairs of domain types. Then $H = (D_1 \times D_2) + (D_1 \times D_3) + (D_2 \times D_3)$;

some $U_h$'s agree with the domains of one population partition (for instance $T_1$) and the other $U_h$'s are defined by combining pairs of the remaining domain types ($T_2$ and $T_3$). Then $H = D_1 + (D_2 \times D_3)$;

the $U_h$'s agree with the domains of interest. Then $H = D_1 + D_2 + D_3$.

Sampling designs defining the $U_h$'s as in the last three points are denoted as multi-way (or incomplete) stratification.
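To give a feel for how quickly the number of planned subpopulations grows, the four values of H in the example can be computed for given partition sizes. This is a minimal sketch: the function name `planned_subpopulations` and the cardinalities 50, 30 and 20 are hypothetical, and the counts include every cell of a cross-classification, even those that might be empty in a real frame.

```python
from itertools import combinations
from math import prod

def planned_subpopulations(D):
    """Number of planned subpopulations H for the four designs of the example.

    D is the list [D1, D2, D3] of partition cardinalities (illustrative values).
    """
    D1, D2, D3 = D
    return {
        "cross_classified": prod(D),                                   # D1 x D2 x D3
        "all_pairs": sum(a * b for a, b in combinations(D, 2)),       # sum of pairwise products
        "one_plus_pair": D1 + D2 * D3,                                 # D1 + (D2 x D3)
        "domains_only": sum(D),                                        # D1 + D2 + D3
    }

sizes = planned_subpopulations([50, 30, 20])
for design, H in sizes.items():
    print(f"{design}: H = {H}")
```

With these illustrative sizes the cross-classified design needs far more planned subpopulations than any of the multi-way alternatives, which is the sample size explosion risk mentioned earlier.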

Multi-way Sampling Design

Main problem of the multi-way design (MWD): defining a procedure for random selection.

We propose to use the Cube method (Deville and Tillé, 2004):
it selects random samples for multi-way stratified designs;
it works for large populations and many domains.


ITACOSM 2011 - 27-29 June 2011, Pisa, Italy

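The cube algorithm selects a sample s respecting the following general balancing equations

$$\sum_{k \in s} \frac{\mathbf{x}_k}{\pi_k} = \sum_{k \in U} \mathbf{x}_k$$

Setting $\mathbf{x}_k = \pi_k \boldsymbol{\delta}_k$ with $\boldsymbol{\delta}_k = (\delta_{1k}, \ldots, \delta_{hk}, \ldots, \delta_{Hk})'$, being $\delta_{hk} = 1$ if $k \in U_h$ and $\delta_{hk} = 0$ otherwise, we have

$$\sum_{k \in s} \delta_{hk} = n_h$$

fixed for each sample selection.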
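The effect of this choice of balancing variables can be checked numerically. The sketch below does not implement the cube method; it only verifies, on a hypothetical population of 12 units with two non-nested partitions and a hand-picked sample, that the balancing equations reduce to fixed sample counts in every planned subpopulation. All names and values are illustrative assumptions.

```python
import numpy as np

def membership_matrix(partitions):
    """Stack one 0/1 indicator column per category of each partition (N x H)."""
    columns = []
    for labels in partitions:
        labels = np.asarray(labels)
        for category in np.unique(labels):
            columns.append((labels == category).astype(float))
    return np.column_stack(columns)

# Hypothetical population of 12 units and two non-nested partitions.
part_a = np.repeat([0, 1, 2], 4)   # 3 subpopulations of 4 units each
part_b = np.tile([0, 1], 6)        # 2 subpopulations of 6 units each
delta = membership_matrix([part_a, part_b])
pi = np.full(12, 0.5)              # assumed inclusion probabilities

x = pi[:, None] * delta            # balancing variables x_k = pi_k * delta_k
planned_sizes = x.sum(axis=0)      # right-hand side: sum over U of x_k

# Hand-picked sample with 2 units per part_a class and 3 units per part_b class.
sample = np.array([0, 1, 4, 5, 8, 9])
ht_totals = (x[sample] / pi[sample, None]).sum(axis=0)   # left-hand side
realised_sizes = delta[sample].sum(axis=0)               # units sampled in each U_h

print(planned_sizes, ht_totals, realised_sizes)
```

Because $\mathbf{x}_k / \pi_k = \boldsymbol{\delta}_k$, the left-hand side of the balancing equations is simply the vector of sample counts, so any balanced sample has exactly the planned number of units in each $U_h$.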

Optimal allocation algorithm


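Deville and Tillé (2005) proposed an approximated expression of the variance

$$V_p(\hat{t}_{(dr)} \mid \boldsymbol{\pi}) \cong f \sum_{k \in U} (1/\pi_k - 1)\, \eta_{(dr)k}^2$$

where $f = N/(N-H)$, $\eta_{(dr)k} = y_{rk}\,\gamma_{dk} - \pi_k\, g_{(dr)k}$ and $g_{(dr)k} = \boldsymbol{\delta}_k' \mathbf{B}_{(dr)}$, being

$$\mathbf{B}_{(dr)} = \Big[\sum_{k \in U} \boldsymbol{\delta}_k \boldsymbol{\delta}_k'\, \pi_k^2 (1/\pi_k - 1)\Big]^{-1} \sum_{k \in U} \pi_k (1/\pi_k - 1)\, \boldsymbol{\delta}_k\, \gamma_{dk}\, y_{rk} \qquad (2)$$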
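A minimal sketch of how this variance approximation could be evaluated is given below. The function names are hypothetical. As an assumption of this sketch, $\mathbf{B}_{(dr)}$ is obtained through an equivalent weighted least-squares fit (weights $\pi_k(1-\pi_k)$, response $y_{rk}\gamma_{dk}/\pi_k$) solved with `numpy.linalg.lstsq`, so that the fitted values are still defined when the indicator columns are linearly dependent, as happens with several complete partitions.

```python
import numpy as np

def fit_g(target, delta, pi):
    """Fitted values delta_k' B, with B solving the weighted normal equations (2)."""
    root_w = np.sqrt(pi * (1.0 - pi))            # pi^2 (1/pi - 1) = pi (1 - pi)
    B = np.linalg.lstsq(delta * root_w[:, None], root_w * target / pi, rcond=None)[0]
    return delta @ B

def dt_variance(y, gamma_d, delta, pi):
    """Approximate variance f * sum_k (1/pi_k - 1) * eta_k^2 for one domain and variable."""
    N, H = delta.shape
    f = N / (N - H)
    eta = y * gamma_d - pi * fit_g(y * gamma_d, delta, pi)
    return f * np.sum((1.0 / pi - 1.0) * eta**2)

# Sanity check on a case with a known answer: one planned subpopulation (H = 1),
# equal inclusion probabilities n/N and a domain equal to the whole population.
rng = np.random.default_rng(1)
N, n = 20, 5
y = rng.normal(10.0, 2.0, size=N)
v_approx = dt_variance(y, np.ones(N), np.ones((N, 1)), np.full(N, n / N))
v_srs = N**2 * (1 - n / N) * y.var(ddof=1) / n   # textbook variance of the expanded total
print(v_approx, v_srs)
```

In this special case the residual reduces to $y_k - \bar{y}$ and the approximation coincides with the usual variance of the estimated total under simple random sampling without replacement.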


We propose an algorithm for the definition of an optimal inclusion probability vector $\boldsymbol{\pi}^*$ according to the following optimality criterion:

$$\min \sum_{k \in U} \pi_k^*$$

subject to

(a) $V_p(\hat{t}_{(dr)} \mid \boldsymbol{\pi}^*) \le \bar{V}_{(dr)}$ (d = 1, …, D; r = 1, …, R)

(b) $0 < \pi_k^* \le 1$

being $\bar{V}_{(dr)}$ a fixed variance threshold for the domain $U_d$ on the r-th variable of interest.


We note that the constraints (a) depend on the unknown values of the variables of interest. In practice only model-predicted values can be used. In the paper we sketch the main phases of the algorithm in this operational context.

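We consider a general prediction model M

$$y_{rk} = \tilde{y}_{rk} + u_{rk}; \qquad E_M(u_{rk}) = 0 \ \ \forall k; \qquad E_M(u_{rk}^2) = \sigma_{rk}^2; \qquad E_M(u_{rk}\, u_{rl}) = 0 \ \ \forall k \ne l.$$

We suppose the $\sigma_{rk}^2$ values are known or can be predicted.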


To take into account the model uncertainty, the sampling variance is replaced by the Anticipated Variance (Isaki and Fuller, 1982). An upward approximation of the anticipated variances for the proposed strategy is

$$AV(\hat{t}_{(dr)}) = E_M E_p\big[(\hat{t}_{(dr)} - t_{(dr)})^2 \mid \boldsymbol{\pi}^*\big] \cong f \Big[\sum_{k \in U} (1/\pi_k^* - 1)\, \tilde{\eta}_{(dr)k}^2 + \sum_{k \in U} (1/\pi_k^* - 1)\, \sigma_{rk}^2\, \gamma_{dk}\Big]$$

where $\tilde{\eta}_{(dr)k}^2$ is computed by means of the model predicted value $\tilde{y}_{rk}$. The approximation neglects a residual term that we do not show for the sake of brevity. However, the optimization procedure does not change if the corrected anticipated variance is taken into account.


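The optimization problem is defined as

$$\min \sum_{k \in U} \pi_k^*$$

subject to

$$AV(\hat{t}_{(dr)}) \le \bar{V}_{(dr)} \quad (d = 1, \ldots, D;\ r = 1, \ldots, R) \qquad (5)$$

$$0 < \pi_k^* \le 1$$

To obtain a solution of the optimization problem we formulate constraints (5) as

$$f \sum_{k \in U} \frac{(\tilde{y}_{rk}^2 + \sigma_{rk}^2)\, \gamma_{dk}}{\pi_k^*} \le \bar{V}_{(dr)} + f \sum_{k \in U} (\tilde{y}_{rk}^2 + \sigma_{rk}^2)\, \gamma_{dk} + C_{(dr)}(\boldsymbol{\pi}^*, \tilde{\mathbf{g}}_d)$$

being

$$C_{(dr)} = f \Big[2 \sum_{k \in U} (1 - \pi_k^*)\, \tilde{y}_{rk}\, \gamma_{dk}\, \tilde{g}_{dk} - \sum_{k \in U} \pi_k^* (1 - \pi_k^*)\, \tilde{g}_{dk}^2\Big],$$

$\tilde{\mathbf{g}}_d = (\tilde{g}_{d1}, \ldots, \tilde{g}_{dk}, \ldots, \tilde{g}_{dN})'$ and $\tilde{g}_{dk} = \boldsymbol{\delta}_k' \tilde{\mathbf{B}}_{(dr)}$, with $\tilde{\mathbf{B}}_{(dr)}$ given by (2) replacing $y_{rk}$ with $\tilde{y}_{rk}$.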
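The step from the anticipated variance to the reformulated constraint is an algebraic identity: expanding $(1/\pi_k - 1)(\tilde{y}_{rk}\gamma_{dk} - \pi_k \tilde{g}_{dk})^2$ and collecting the terms in $1/\pi_k$ gives $AV = \sum_k a_k/\pi_k - \sum_k a_k - C_{(dr)}$, with $a_k = f(\tilde{y}_{rk}^2 + \sigma_{rk}^2)\gamma_{dk}$. The sketch below checks this numerically on hypothetical data; the function names, the two partitions, the Bernoulli-type variances and the random inclusion probabilities are illustrative assumptions.

```python
import numpy as np

def fit_g(target, delta, pi):
    """Fitted values delta_k' B~ from the weighted least-squares form of (2)."""
    root_w = np.sqrt(pi * (1.0 - pi))
    B = np.linalg.lstsq(delta * root_w[:, None], root_w * target / pi, rcond=None)[0]
    return delta @ B

def anticipated_variance(y_pred, sigma2, gamma_d, delta, pi):
    """Upward approximation of the anticipated variance for one (d, r) pair."""
    N, H = delta.shape
    f = N / (N - H)
    eta = y_pred * gamma_d - pi * fit_g(y_pred * gamma_d, delta, pi)
    return f * np.sum((1.0 / pi - 1.0) * (eta**2 + sigma2 * gamma_d))

def constraint_terms(y_pred, sigma2, gamma_d, delta, pi):
    """Coefficients a_k of 1/pi_k and the term C_(dr) of the reformulated constraint."""
    N, H = delta.shape
    f = N / (N - H)
    g = fit_g(y_pred * gamma_d, delta, pi)
    a = f * (y_pred**2 + sigma2) * gamma_d
    C = f * (2.0 * np.sum((1.0 - pi) * y_pred * gamma_d * g)
             - np.sum(pi * (1.0 - pi) * g**2))
    return a, C

# Hypothetical population: 30 units, two non-nested partitions (3 and 2 classes).
rng = np.random.default_rng(2)
N = 30
part_a, part_b = np.repeat([0, 1, 2], 10), np.tile([0, 1], 15)
delta = np.column_stack([part_a == c for c in range(3)]
                        + [part_b == c for c in range(2)]).astype(float)
gamma_d = delta[:, 0]                      # domain of interest: first class of part_a
y_pred = rng.uniform(0.2, 0.8, N)
sigma2 = y_pred * (1.0 - y_pred)
pi = rng.uniform(0.1, 0.9, N)

av = anticipated_variance(y_pred, sigma2, gamma_d, delta, pi)
a, C = constraint_terms(y_pred, sigma2, gamma_d, delta, pi)
reformulated = np.sum(a / pi) - np.sum(a) - C
print(av, reformulated)
```

The two printed values agree up to rounding, so requiring $AV \le \bar{V}_{(dr)}$ is the same as requiring $\sum_k a_k/\pi_k \le \bar{V}_{(dr)} + \sum_k a_k + C_{(dr)}$.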


The algorithm consists of two nested calculation loops.

Let $G^{(\alpha)}$ and $G^{(\alpha,\tau)}$ respectively denote the generic quantity G as calculated at iteration α (α = 0, 1, 2, …) of the first loop (outer process) and at iteration τ (τ = 0, 1, 2, …) of the second one (inner process).

For a given value of the vector of inclusion probabilities $\boldsymbol{\pi}^{(\alpha)} = (\pi_1^{(\alpha)}, \ldots, \pi_k^{(\alpha)}, \ldots, \pi_N^{(\alpha)})'$, the first calculation loop calculates the (D × R) terms $C_{(dr)}(\boldsymbol{\pi}^{(\alpha)}, \tilde{\mathbf{g}}_d^{(\alpha)})$.

For given values of $C_{(dr)}(\boldsymbol{\pi}^{(\alpha)}, \tilde{\mathbf{g}}_d^{(\alpha)})$, the second calculation loop finds the inclusion probabilities $\boldsymbol{\pi}^{(\alpha,\tau)} = (\pi_1^{(\alpha,\tau)}, \ldots, \pi_k^{(\alpha,\tau)}, \ldots, \pi_N^{(\alpha,\tau)})'$ that solve the optimization problem, by a slight modification of the Chromy algorithm.
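The two-loop structure can be illustrated on a deliberately simplified case. The sketch below is not the authors' implementation and does not use the Chromy algorithm: it assumes a single constraint (one variable, one domain equal to the whole population, so $\gamma_{dk} = 1$), for which the inner problem has the closed-form solution $\pi_k = \sqrt{a_k}\,\sum_j \sqrt{a_j} / b$. Function names, the starting value 0.5, the data and the 10% CV threshold are all illustrative assumptions.

```python
import numpy as np

def fit_g(y_pred, delta, pi):
    """Fitted values delta_k' B~ from the weighted least-squares form of (2)."""
    root_w = np.sqrt(pi * (1.0 - pi))
    B = np.linalg.lstsq(delta * root_w[:, None], root_w * y_pred / pi, rcond=None)[0]
    return delta @ B

def anticipated_variance(y_pred, sigma2, delta, pi):
    """Upward approximation of the AV when the domain is the whole population."""
    N, H = delta.shape
    eta = y_pred - pi * fit_g(y_pred, delta, pi)
    return N / (N - H) * np.sum((1.0 / pi - 1.0) * (eta**2 + sigma2))

def solve_inner(a, b):
    """Minimise sum(pi) subject to sum(a / pi) <= b (single constraint), capped at 1."""
    root = np.sqrt(a)
    return np.minimum(1.0, root * root.sum() / b)

def allocate(y_pred, sigma2, delta, v_bar, max_iter=200, tol=1e-10):
    """Outer loop: update C at the current pi, then re-solve the inner problem."""
    N, H = delta.shape
    f = N / (N - H)
    a = f * (y_pred**2 + sigma2)
    pi = np.full(N, 0.5)                               # arbitrary starting point
    for iteration in range(1, max_iter + 1):
        g = fit_g(y_pred, delta, pi)
        C = f * (2.0 * np.sum((1.0 - pi) * y_pred * g) - np.sum(pi * (1.0 - pi) * g**2))
        new_pi = solve_inner(a, v_bar + a.sum() + C)   # inner step with C held fixed
        converged = np.max(np.abs(new_pi - pi)) < tol
        pi = new_pi
        if converged:
            break
    return pi, iteration

# Hypothetical population: 200 units, two non-nested partitions (4 and 2 classes).
rng = np.random.default_rng(3)
N = 200
part_a, part_b = np.repeat(np.arange(4), 50), np.tile(np.arange(2), 100)
delta = np.column_stack([part_a == c for c in range(4)]
                        + [part_b == c for c in range(2)]).astype(float)
y_pred = rng.uniform(0.2, 0.8, N)
sigma2 = y_pred * (1.0 - y_pred)
v_bar = (0.10 * y_pred.sum()) ** 2                     # threshold from a 10% CV on the total
pi_opt, n_iter = allocate(y_pred, sigma2, delta, v_bar)
print(n_iter, pi_opt.sum(), anticipated_variance(y_pred, sigma2, delta, pi_opt))
```

With several domains and variables the inner step has no closed form, which is where an iterative multi-constraint solver such as the modified Chromy algorithm is needed. The simple cap at 1 used here is also a shortcut: a complete treatment would fix the capped units at 1 and re-solve for the others.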

Monte Carlo simulation

Objectives of the simulation:

Test the convergence of the optimization algorithm (optimization step)

Comparison between the expected AV and the Monte Carlo empirical AV

Comparison with the standard cross-classified stratified design



Data: subpopulation of the Istat Italian Graduates’ Career Survey (3,427 units);

Driving allocation variables: employment status (yes/no); actively seeking work (yes/no).

We generate the values of the two variables by means of a logistic additive model (prediction model);

Explanatory variables: degree mark, sex, age class and aggregated subject area of the degree

The parameters are estimated from the data of the previous survey
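For binary variables such as these, a logistic model supplies both inputs required by the anticipated variance: under a Bernoulli model with success probability $p_k$, the expected value is $p_k$ and the variance is $p_k(1 - p_k)$. The sketch below assumes that these are used as $\tilde{y}_{rk}$ and $\sigma_{rk}^2$, which the slides do not state explicitly. The design matrix and the coefficients are hypothetical, not the values estimated from the previous survey.

```python
import numpy as np

def logistic_predictions(X, beta):
    """Predicted probabilities p_k and Bernoulli variances p_k * (1 - p_k)."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return p, p * (1.0 - p)

# Hypothetical design matrix: intercept, standardised degree mark, sex indicator.
X = np.array([[1.0,  0.0, 0.0],
              [1.0,  1.2, 1.0],
              [1.0, -0.8, 0.0]])
beta = np.array([0.0, 0.5, -0.3])      # illustrative coefficients only
y_pred, sigma2 = logistic_predictions(X, beta)
print(y_pred, sigma2)
```

A convenient consequence is that $\tilde{y}_{rk}^2 + \sigma_{rk}^2 = p_k$, so the coefficient of $1/\pi_k$ in the reformulated constraint is proportional to the predicted probability itself.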



Survey target estimates: two partitions define the most disaggregated domains:

First partition: university by subject area degree (9 classes);
Second partition: degree by sex;

Domains in the real survey: 448+94; strata: 2,981 (university, degree, sex);

In the simulation: domains 20+15; strata 91.

Error thresholds fixed in terms of CV(%)



Results:

Assuming known values: iterations (outer process): 6; optimal sample size 171 (after calibration 182).

Assuming predicted values: iterations (outer process): 3; optimal sample size 699 (after calibration 707).



Analysis of the allocation with the predicted values:

The sample allocation procedure uses an approximation of the AV.

The simulation confirms that the input AV is an upward approximation of the real AV.

Average of expected anticipated CV(%)

Partition    y1     y2
1            8.1    17.8
2            9.2    19.1

Average of empirical (10,000 Monte Carlo simulations) anticipated CV(%)

Partition    y1     y2
1            6.7    14.7
2            7.4    15.5


Comparison with the standard approach:

The implicit model (one-way stratification model) is similar to the model used in our approach;

The allocation differences depend on the constraint of a minimum number of units (2) in each stratum;

The sample size is 751 units (+7.4%);

Taking into account the domains with small population strata (<10 units on average per stratum), the standard approach produces a +14.4% sample size.
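The +7.4% figure can be reproduced from the sample sizes reported in the results. The small check below suggests that the baseline is the multi-way optimal size obtained with predicted values before calibration (699 units); this reading is an inference from the arithmetic, not something the slides state. The function name is hypothetical.

```python
def relative_increase(n_standard, n_multiway):
    """Percentage increase of the standard design sample size over the multi-way one."""
    return 100.0 * (n_standard - n_multiway) / n_multiway

increase_vs_optimal = relative_increase(751, 699)      # before calibration
increase_vs_calibrated = relative_increase(751, 707)   # after calibration
print(round(increase_vs_optimal, 1), round(increase_vs_calibrated, 1))
```

Against the calibrated size of 707 units the increase would be about 6.2%, so only the uncalibrated baseline matches the reported value.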


References

Bethel J. (1989) Sample Allocation in Multivariate Surveys, Survey Methodology, 15, 47-57.

Chromy J. (1987) Design Optimization with Multiple Objectives, Proceedings of the Survey Research Methods Section, American Statistical Association, 194-199.

Deville J.-C., Tillé Y. (2004) Efficient Balanced Sampling: the Cube Method, Biometrika, 91, 893-912.

Deville J.-C., Tillé Y. (2005) Variance approximation under balanced sampling, Journal of Statistical Planning and Inference, 128, 569-591.

Falorsi P. D., Righi P. (2008) A Balanced Sampling Approach for Multi-way Stratification Designs for Small Area Estimation, Survey Methodology, 34, 223-234.

Falorsi P. D., Orsini D., Righi P. (2006) Balanced and Coordinated Sampling Designs for Small Domain Estimation, Statistics in Transition, 7, 1173-1198.

Isaki C.T., Fuller W.A. (1982) Survey design under a regression superpopulation model, Journal of the American Statistical Association, 77, 89-96.

