Eurostat

Statistical matching when samples are drawn according to complex survey designs

Training Course «Statistical Matching»

Rome, 6-8 November 2013

Mauro Scanu
Dept. Integration, Quality, Research and Production Networks Development, Istat
scanu [at] istat.it

Outline

• Renssen: calibration
  – What does CIA mean?
  – Estimates under the CIA
    • Macro approach
    • Micro approach
  – Auxiliary information: file C
    • Incomplete two-way stratification
    • Synthetic two-way stratification
• Rubin (1986): File concatenation
• Weight-split algorithm

The problem

The presence of survey weights is usually a problem in the statistical matching context: should we use survey weights or not? The answer is: yes!

Survey weights can, however, be included in a statistical matching procedure in different ways. There are essentially two approaches:

Renssen (1998): survey matching is obtained by making the two samples as homogeneous as possible in their statistical content. This approach is mainly based on calibration procedures.

Rubin (1986): this approach is more traditional, in the sense that it reconstructs a unique sample A ∪ B with a unique system of survey weights.

Let's start with Renssen's approach, which is easily comparable with the techniques already shown for i.i.d. samples in the last two days.

The CIA in a finite population context: the case of two data archives

Let A and B be two archives:
• on the same population, consisting of N units
• observing some common variables X and some specific variables, Y in A and Z in B
• the records in A and B have no identifiers (PIN), and the common variables X cannot be considered as unit identifiers

This is still a statistical matching problem (examples are in DeGroot et al. (1971)).

Notation

Let s=1,…,N denote the units in the population.
Assume that X, Y, and Z are categorical, with I, J, and K categories respectively.
The categories taken by each unit in the population are described by vectors, one per variable.

Notation

The contingency tables can be computed by matrix computations on these vectors.
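The notation can be sketched numerically as follows. This is only an illustration under assumptions not stated in the text: each unit is coded by a 0/1 indicator vector, categories are coded 0, 1, …, and the names `indicator_matrix`, `N_XX` and `N_XY` are hypothetical.

```python
import numpy as np

def indicator_matrix(codes, n_categories):
    """Return an (N x n_categories) 0/1 matrix with one row per unit."""
    codes = np.asarray(codes)
    out = np.zeros((codes.size, n_categories), dtype=int)
    out[np.arange(codes.size), codes] = 1
    return out

# Hypothetical population of N = 6 units.
x = [0, 0, 1, 1, 1, 0]   # X has I = 2 categories
y = [0, 1, 1, 2, 2, 0]   # Y has J = 3 categories

X = indicator_matrix(x, 2)
Y = indicator_matrix(y, 3)

N_XX = X.T @ X   # diagonal matrix with the counts of each X category
N_XY = X.T @ Y   # I x J contingency table of (X, Y)
```

With this coding every contingency table is a cross-product of two indicator matrices, so the same line gives the (X, Z) or (Y, Z) table when the corresponding matrices are available.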

The statistical matching problem

As in the i.i.d. context, statistical matching can have a micro or a macro purpose.

MACRO APPROACH: The objective is the estimation of the (Y, Z) contingency matrix N_YZ.

MICRO APPROACH: The objective is the imputation of the missing Z values in A or, equivalently, of the missing Y values in B.

The conditional independence assumption

Let Y and Z be linearly dependent on X.

Note: this assumption seems strange for categorical variables, but it has some useful properties.

Regression parameters are obtained by means of the normal equations.

Linear dependence: properties

One property, which follows from the normal equations, is that the marginal distributions are preserved.
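The normal equations and the preservation of the marginal distribution can be checked numerically. The sketch below assumes 0/1 indicator matrices `X` and `Y` (hypothetical data, one row per unit) and solves (X'X) β = X'Y; the name `beta_YX` is an illustrative choice.

```python
import numpy as np

# Hypothetical indicator matrices for N = 6 units.
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1], [1, 0]], dtype=float)
Y = np.array([[1, 0, 0], [0, 1, 0], [0, 1, 0],
              [0, 0, 1], [0, 0, 1], [1, 0, 0]], dtype=float)

# Normal equations: (X'X) beta = X'Y.
beta_YX = np.linalg.solve(X.T @ X, X.T @ Y)

# Fitted values: each unit receives the row of beta_YX for its X category.
fitted = X @ beta_YX

# Column totals of the fitted values versus the observed Y counts.
fitted_margin = fitted.sum(axis=0)
observed_margin = Y.sum(axis=0)
```

Each row of `beta_YX` is the distribution of Y within one category of X, which is why the linear model is less strange than it looks for categorical variables, and why `fitted_margin` equals `observed_margin`.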

The conditional independence assumption

The (Y, Z) contingency table under the conditional independence assumption (CIA) is:

The true, but unknown, contingency table would be the CIA table plus a residual matrix.

The residual matrix is null when Y and X, or Z and X, are perfectly correlated.

Note that the CIA table also preserves the observed marginal distributions.
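A worked version of these statements, under assumptions that are not spelled out in the text: x_s, y_s, z_s are 0/1 indicator vectors, N_XY = Σ_s x_s y_s' (and similarly for the other tables), and the regression coefficients solve the normal equations. The symbols are illustrative choices.

```latex
\begin{aligned}
\beta_{YX} &= N_{XX}^{-1} N_{XY}, \qquad \beta_{ZX} = N_{XX}^{-1} N_{XZ},\\
N_{YZ}^{CIA} &= \beta_{YX}^{\top} N_{XX}\, \beta_{ZX} = N_{YX} N_{XX}^{-1} N_{XZ},\\
N_{YZ} &= N_{YZ}^{CIA} + \sum_{s=1}^{N} \bigl(y_s - \beta_{YX}^{\top} x_s\bigr)\bigl(z_s - \beta_{ZX}^{\top} x_s\bigr)^{\top},\\
N_{YZ}^{CIA}\,\mathbf{1}_K &= N_{YX} N_{XX}^{-1} N_{XZ}\,\mathbf{1}_K
  = N_{YX} N_{XX}^{-1} N_{XX}\,\mathbf{1}_I = N_{YX}\,\mathbf{1}_I = N_{Y}.
\end{aligned}
```

The last line uses the fact that N_XZ 1_K is the vector of X counts, which is also N_XX 1_I because N_XX is diagonal. The same argument applied to the columns gives the Z margin.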

From archives to samples

Let A and B be two samples drawn from the same finite population according to a complex survey design with the following first and second order inclusion probabilities

Let X be defined by two different kinds of variables:
• U: variables for which N_U is known
• V: variables for which N_V is unknown

X corresponds to the categorical variable whose categories are defined by the Cartesian product of all the common variables.
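As a small illustration of how X is built from the common variables, the sketch below crosses two hypothetical variables (`sex` and `age_class`, with made-up categories) and assigns one integer code to each combination.

```python
from itertools import product

# Hypothetical common variables and their categories (illustrative only).
sex = ["F", "M"]
age_class = ["15-34", "35-64", "65+"]

# Each category of X is one combination of the common variables.
x_categories = list(product(sex, age_class))
x_code = {category: i for i, category in enumerate(x_categories)}

def encode_x(record):
    """Return the code of X for a record holding the two common variables."""
    return x_code[(record["sex"], record["age_class"])]
```

With two and three categories respectively, X has I = 6 categories. The number of categories grows quickly with the number of common variables, so some cells may be empty in a sample.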

Estimates under the CIA

1. The survey weights of A and B are calibrated a first time, giving new weights for each sample, under the constraint that they reproduce the known totals N_U.

2. These weights are used for a preliminary estimate for the variable V.
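A minimal sketch of a first calibration and of a preliminary estimate, for one sample only. It assumes the simplest calibration, post-stratification on a single categorical variable with known totals, and that every category is present in the sample; the text does not say which calibration method is used. All names and data are hypothetical.

```python
import numpy as np

def poststratify(weights, u_codes, known_totals):
    """Rescale weights so that the weighted counts of U equal known_totals."""
    weights = np.asarray(weights, dtype=float)
    u_codes = np.asarray(u_codes)
    known_totals = np.asarray(known_totals, dtype=float)
    current = np.bincount(u_codes, weights=weights, minlength=known_totals.size)
    return weights * known_totals[u_codes] / current[u_codes]

def weighted_totals(weights, codes, n_categories):
    """Weighted count of each category."""
    return np.bincount(np.asarray(codes), weights=weights, minlength=n_categories)

# Hypothetical sample A: initial weights, a variable U with known totals,
# and a variable V whose totals are unknown.
w_A = [10.0, 10.0, 20.0, 20.0]
u_A = [0, 0, 1, 1]
v_A = [0, 1, 1, 1]
N_U = [30.0, 50.0]

w1_A = poststratify(w_A, u_A, N_U)          # calibrated weights
N_V_from_A = weighted_totals(w1_A, v_A, 2)  # preliminary estimate from A
```

The same two calls would be repeated on B, which in general gives a different preliminary estimate for the variable with unknown totals.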

Estimates under the CIA

3. The preliminary estimates for V obtained from A and B can differ because of sampling variability. In this step we seek a unique estimate for the distribution of V.

4. The weights of A and B are calibrated a second time in order to reproduce N_U and the unique estimate of N_V. The resulting weights are the final calibrated weights.
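Steps 3 and 4 can be sketched as follows. Two choices here are assumptions made for illustration, not taken from the text: the unique estimate is a convex combination with weight n_A/(n_A+n_B), and the second calibration is done by raking (iterative proportional fitting of the weights). Names and data are hypothetical.

```python
import numpy as np

def pool(est_A, est_B, n_A, n_B):
    """Convex combination of two estimates (the choice of lambda is an assumption)."""
    lam = n_A / (n_A + n_B)
    return lam * np.asarray(est_A, dtype=float) + (1 - lam) * np.asarray(est_B, dtype=float)

def rake(weights, codes_list, targets_list, n_iter=100):
    """Rescale weights in turn so that each set of target totals is matched."""
    w = np.asarray(weights, dtype=float).copy()
    for _ in range(n_iter):
        for codes, target in zip(codes_list, targets_list):
            codes = np.asarray(codes)
            target = np.asarray(target, dtype=float)
            current = np.bincount(codes, weights=w, minlength=target.size)
            w = w * target[codes] / current[codes]
    return w

N_U = [30.0, 50.0]                                            # known totals
N_V_pooled = pool([15.0, 65.0], [25.0, 55.0], n_A=4, n_B=4)   # -> [20, 60]

# Second calibration of the weights of a hypothetical sample A.
w_final_A = rake([15.0, 15.0, 25.0, 25.0],
                 codes_list=[[0, 0, 1, 1], [0, 1, 1, 1]],
                 targets_list=[N_U, N_V_pooled])
```

Applying `rake` to B with the same targets gives two sets of final weights that agree on the totals of both kinds of common variables.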

Estimates under the CIA: macro approach

5. Estimate the distribution of X, combining the estimates obtained from A and B with their final weights.

6. The regression coefficients of Y on X and of Z on X are estimated from A and B respectively:

Estimates under the CIA: macro approach

7. The estimate of the (Y, Z) contingency table under the CIA is:
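A minimal sketch of the macro estimate from two weighted samples, assuming the CIA table has the form N_YX N_XX^{-1} N_XZ with N_XY estimated from A and N_XZ from B. The helper names and the data are hypothetical, and the weights are chosen so that A and B already agree on the totals of X.

```python
import numpy as np

def weighted_table(weights, row_codes, col_codes, n_rows, n_cols):
    """Weighted two-way table: sum of the weights in each (row, column) cell."""
    table = np.zeros((n_rows, n_cols))
    np.add.at(table, (np.asarray(row_codes), np.asarray(col_codes)),
              np.asarray(weights, dtype=float))
    return table

def cia_table(N_XX, N_XY, N_XZ):
    """(Y, Z) table under the CIA: N_YX N_XX^{-1} N_XZ."""
    return N_XY.T @ np.linalg.solve(N_XX, N_XZ)

# Hypothetical sample A (X, Y observed) and sample B (X, Z observed).
w_A, x_A, y_A = [10.0, 20.0, 30.0, 40.0], [0, 0, 1, 1], [0, 1, 0, 1]
w_B, x_B, z_B = [15.0, 15.0, 35.0, 35.0], [0, 0, 1, 1], [0, 1, 1, 1]

N_XY = weighted_table(w_A, x_A, y_A, 2, 2)
N_XZ = weighted_table(w_B, x_B, z_B, 2, 2)
N_XX = np.diag(N_XY.sum(axis=1))   # here A and B give the same X totals

N_YZ_cia = cia_table(N_XX, N_XY, N_XZ)
```

The row totals of `N_YZ_cia` are the Y totals estimated from A and its column totals are the Z totals estimated from B.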

Estimates under the CIA: micro approach

8. Assuming A as the recipient, a preliminary imputed value for the missing Z is obtained through the estimated regression function.

9. As we already know, this predicted value is not a live value and can be unrealistic. In this case, a live value can be obtained through an additional hot deck procedure (hence, a mixed procedure is used).

Note that, since step 8 yields a complete data set, we can use a distance hot deck procedure with a distance computed on (X, Y, Z) or (Y, Z).
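The mixed procedure can be sketched as follows. The sketch rests on several assumptions made only for illustration: the regression coefficients are given, B is completed with predicted Y values in the same way as A is completed with predicted Z values, the distance is Euclidean on (Y, Z), and survey weights are ignored in the donor search. All names and numbers are hypothetical.

```python
import numpy as np

def nearest_donor(recipient_rows, donor_rows):
    """For each recipient row, index of the closest donor row (Euclidean distance)."""
    diff = recipient_rows[:, None, :] - donor_rows[None, :, :]
    return np.argmin((diff ** 2).sum(axis=2), axis=1)

# Hypothetical estimated coefficients (rows: X categories).
beta_YX = np.array([[0.8, 0.2], [0.3, 0.7]])   # estimated from A
beta_ZX = np.array([[0.9, 0.1], [0.2, 0.8]])   # estimated from B

# Hypothetical indicator matrices for the two files.
X_A = np.array([[1, 0], [0, 1]]); Y_A = np.array([[1, 0], [0, 1]])
X_B = np.array([[1, 0], [0, 1], [0, 1]]); Z_B = np.array([[1, 0], [0, 1], [1, 0]])

Z_hat_A = X_A @ beta_ZX   # step 8: preliminary, non-live values for A
Y_hat_B = X_B @ beta_YX   # assumed counterpart used to complete B

# Step 9: distance hot deck on (Y, Z), using observed and predicted values.
A_complete = np.hstack([Y_A, Z_hat_A])
B_complete = np.hstack([Y_hat_B, Z_B])
donors = nearest_donor(A_complete, B_complete)
Z_imputed_A = Z_B[donors]   # live values taken from the donors
```

The imputed values are rows of `Z_B`, so they are always values actually observed in B, unlike the fractional predictions in `Z_hat_A`.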

Auxiliary information: presence of an additional file C

Let C be a sample of size n_C, with observations of Y and Z and with survey weights, for the units c=1,…, n_C.

This file can be used to improve the estimates and to discard the conditional independence assumption.

These procedures use part of the procedure already explained for the estimates under the CIA, i.e. the creation of the final calibrated weights of A and B obtained in step 4.

Renssen (1998) defines two macro methods:
• the incomplete two-way stratification
• the synthetic two-way stratification

Incomplete two-way stratification

5. Calibrate the initial survey weights of C into new weights, under the set of constraints given by the estimates obtained from A and B.

6. The table N_YZ is estimated directly from C.

A and B are used only in the set of constraints in step 5. C is able to reproduce the partial estimates on (X, Y) and (X, Z) obtained up to step 4.
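A minimal sketch of this method, under simplifying assumptions chosen for illustration: the constraints are the Y totals estimated from A and the Z totals estimated from B, and the calibration is done by raking. Constraints on (X, Y) and (X, Z) cells could be handled the same way by raking on the crossed codes. Names and data are hypothetical.

```python
import numpy as np

def rake_two_margins(weights, y_codes, z_codes, N_Y, N_Z, n_iter=200):
    """Rescale the weights of C until they reproduce the Y and the Z totals."""
    w = np.asarray(weights, dtype=float).copy()
    y_codes, z_codes = np.asarray(y_codes), np.asarray(z_codes)
    N_Y, N_Z = np.asarray(N_Y, dtype=float), np.asarray(N_Z, dtype=float)
    for _ in range(n_iter):
        w = w * N_Y[y_codes] / np.bincount(y_codes, weights=w, minlength=N_Y.size)[y_codes]
        w = w * N_Z[z_codes] / np.bincount(z_codes, weights=w, minlength=N_Z.size)[z_codes]
    return w

# Hypothetical file C with its initial weights.
w_C = [30.0, 10.0, 10.0, 30.0]
y_C = np.array([0, 0, 1, 1])
z_C = np.array([0, 1, 0, 1])

# Step 5: calibrate C to totals assumed to come from A (for Y) and B (for Z).
w_new = rake_two_margins(w_C, y_C, z_C, N_Y=[40.0, 60.0], N_Z=[30.0, 70.0])

# Step 6: the (Y, Z) table is the weighted table of C with the new weights.
N_YZ = np.zeros((2, 2))
np.add.at(N_YZ, (y_C, z_C), w_new)
```

The association between Y and Z comes entirely from C: raking rescales rows and columns, so the odds ratio of the initial weighted table of C is kept while the margins move to the values estimated from A and B.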

Synthetic two-way stratification

Let us start from the relationship already seen between the true contingency table and the CIA table.

5. Estimate N_YZ under the CIA.

6. Calibrate the weights in C to new weights, under the set of constraints

Synthetic two-way stratification

7. The synthetic two-way estimate is

This method uses C only to correct what was estimated via A and B under the CIA.
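A sketch of a correction of this kind, assuming the synthetic estimate adds to the CIA table the weighted cross-product of the regression residuals computed on C, and assuming C also carries X. The function name and the data are hypothetical. As a sanity check, the population itself is used as file C with unit weights, in which case the correction should recover the true table.

```python
import numpy as np

def synthetic_table(N_cia, w_C, X_C, Y_C, Z_C, beta_YX, beta_ZX):
    """CIA table plus the weighted cross-product of the residuals observed in C."""
    res_Y = Y_C - X_C @ beta_YX
    res_Z = Z_C - X_C @ beta_ZX
    return N_cia + res_Y.T @ (np.asarray(w_C, dtype=float)[:, None] * res_Z)

# Hypothetical population of 4 units, coded by indicator matrices.
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
Y = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype=float)
Z = np.array([[1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)

N_XX = X.T @ X
beta_YX = np.linalg.solve(N_XX, X.T @ Y)
beta_ZX = np.linalg.solve(N_XX, X.T @ Z)
N_cia = (X.T @ Y).T @ beta_ZX          # N_YX N_XX^{-1} N_XZ

# File C = the whole population with unit weights.
N_syn = synthetic_table(N_cia, np.ones(4), X, Y, Z, beta_YX, beta_ZX)
N_true = Y.T @ Z
```

In this example `N_cia` differs from `N_true`, and the residual term is exactly what closes the gap. With a real file C the residual term is only an estimate, so the correction is approximate.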

Rubin (1986): file concatenation

The idea of file concatenation consists in the construction of a unique sample given by the union of the two observed samples A and B. This approach needs new survey weights to be assigned to the units observed in A ∪ B.

The probability that the concatenated sample contains a unit s is:

Assuming that the probability that a unit is included in both samples is negligible, we get:

Rubin (1986): file concatenation

The new weights become
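The weights of the concatenated file can be sketched as follows, assuming the inclusion probability in the union is the sum of the two inclusion probabilities minus the probability of being in both samples, and that the weight is its inverse. Function names and numbers are hypothetical.

```python
def concatenated_weight(pi_A, pi_B, pi_both=0.0):
    """Weight of a unit in the concatenated sample.

    pi_A, pi_B: inclusion probabilities of the unit under the designs of A and B.
    pi_both: probability of being in both samples (0 when assumed negligible).
    """
    return 1.0 / (pi_A + pi_B - pi_both)

def concatenated_weight_from_weights(w_A, w_B):
    """Same quantity written with the weights w = 1/pi, for negligible overlap."""
    return 1.0 / (1.0 / w_A + 1.0 / w_B)

# A unit with weight 100 under the design of A and 25 under the design of B.
w = concatenated_weight(0.01, 0.04)
```

Both arguments are needed for every unit, including the inclusion probability under the design of the survey in which the unit was not actually observed.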

This approach can be difficult to apply, for different reasons.

File concatenation: comments

• It is necessary to compute the survey weights that the units a, a=1,…,n_A, would have had if they had been sampled in file B, i.e. under the survey design that characterizes B.
• It is necessary to compute the survey weights that the units b, b=1,…,n_B, would have had if they had been sampled in file A, i.e. under the survey design that characterizes A.

• This approach produces a sample of the kind we have used with i.i.d. samples.
• It does not help in reducing the effects of the CIA unless additional information can be introduced.
• Hopefully, the estimate of the marginal distribution of X is better because it is produced on a larger sample.
• This approach is appropriate for the estimation of the statistical matching uncertainty.

Hot deck and complex survey designs

Let A be the recipient file and B the donor file.
We seek imputed records of this nature:

If a distance hot deck method is applied without any modification of the survey weights of the records a, a=1,…,n_A, in A, the imputed distribution of Z would not be coherent with the one actually observed in B.

In Canada, a specific constrained distance hot deck method has been considered, whose objective is that the new weights for the pair (a, b) (i.e. the recipient-donor record) are such that:
1. the overall weighted distance between recipients and donors is minimized
2. under the constraints that the original weights of the records in A and in B are reproduced

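The weight-split algorithm

For simplicity assume that the survey weights in A and in B sum to the same total.

Compute the cumulated survey weights in A and in B.

The method consists of these three steps: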

1. Impute z_b to those records in A such that

2. For those b that do not have ties, i.e. for which there is no record a such that

also impute z_b to the first record in A such that

In this way we have considered all the pairs that would have been taken into consideration by the rank hot deck method.

Instead of a data set with n_A records, we have a data set with n_A+n_B-T records, where T is the number of ties.


3. Reorder the synthetic records obtained in the first two steps according to their cumulated survey weight

where a and b are the recipient and donor records of the i-th ordered synthetic record.

The weight of the i-th record is

where the cumulated weight before the first record is set to 0.
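The three steps can be sketched as a single merge of the two cumulated weight distributions. This is an illustrative reading of the algorithm, not a transcription of it: it assumes the records of A and B are already sorted by the ranking variable, that the two sets of weights have the same total, and that a tie means equal cumulated weights in A and B. The function name and the data are hypothetical.

```python
def weight_split(w_A, w_B, tol=1e-12):
    """Pair sorted recipients and donors by walking the two cumulated weights.

    Returns a list of (recipient index, donor index, weight) in the order
    of the cumulated weight of the synthetic records.
    """
    if abs(sum(w_A) - sum(w_B)) > 1e-9:
        raise ValueError("the two sets of weights must have the same total")
    pairs = []
    i = j = 0
    rem_a, rem_b = w_A[0], w_B[0]      # weight still to be allocated
    while i < len(w_A) and j < len(w_B):
        w = min(rem_a, rem_b)          # increment of the cumulated weight
        pairs.append((i, j, w))
        rem_a -= w
        rem_b -= w
        if rem_a <= tol:               # recipient i is fully allocated
            i += 1
            if i < len(w_A):
                rem_a = w_A[i]
        if rem_b <= tol:               # donor j is fully used
            j += 1
            if j < len(w_B):
                rem_b = w_B[j]
    return pairs

# Hypothetical weights: cumulated weights 2, 5, 10 in A and 4, 10 in B.
w_A = [2.0, 3.0, 5.0]
w_B = [4.0, 6.0]
pairs = weight_split(w_A, w_B)
```

Here there is one tie (at the cumulated weight 10), so the result has 3 + 2 - 1 = 4 synthetic records. The weights attached to each recipient sum to its weight in A, and the weights attached to each donor sum to its weight in B.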


The weight-split algorithm: properties

• The marginal and joint distributions for (X, Y) are those observed in A.
• The marginal distribution of Z is that observed in B.

Selected references

DeGroot M H, Feder P I, Goel P K (1971) "Matchmaking", The Annals of Mathematical Statistics, 42, 578–593

Renssen R H (1998) "Use of Statistical Matching Techniques in Calibration Estimation", Survey Methodology, 24, 171–183

Rubin D B (1986) "Statistical Matching Using File Concatenation with Adjusted Weights and Multiple Imputations", Journal of Business and Economic Statistics, 4, 87–94

Liu T P, Kovacevic M S (1994) "Statistical matching of survey datafiles: a simulation study", Proceedings of the Section on Survey Research Methods of the American Statistical Association, 479–484
