Eurostat
Statistical matching when samples are drawn according to complex survey
designs
Training Course «Statistical Matching»
Rome, 6-8 November 2013
Mauro Scanu
Dept. Integration, Quality, Research and Production Networks Development, Istat
scanu [at] istat.it
Outline
• Renssen: calibration
  – What does CIA mean?
  – Estimates under the CIA
    • Macro approach
    • Micro approach
  – Auxiliary information: file C
    • Incomplete two-way stratification
    • Synthetic two-way stratification
• Rubin (1986): File concatenation
• Weight-split algorithm
The problem
The presence of survey weights is usually a problem in the statistical matching context: should we use survey weights or not? The answer is: yes!
However, survey weights can be included in a statistical matching procedure in different ways. There are essentially two approaches:
Renssen (1998): survey matching is obtained by making the two samples as homogeneous as possible in their statistical content. This approach is mainly based on calibration procedures.
Rubin (1986): this approach is more traditional, in the sense that it reconstructs a unique sample AB with a unique system of survey weights.
Let's start from Renssen's approach, which is easily comparable with the techniques already shown for i.i.d. samples in the last two days.
Let A and B be two archives:
• on the same population, consisting of N units
• observing some common variables X and some specific variables, Y in A and Z in B
• the records in A and B have no identifiers (PIN), and the common variables X cannot be considered as unit identifiers
This is still a statistical matching problem (examples are in DeGroot et al. (1971)).
The CIA in a finite population context: the case of two data archives
Let s=1,…,N denote the units in the population.
Assume that X, Y, and Z are categorical, with I, J, and K categories respectively.
The variable categories assumed by each unit in the population are described by these vectors
Notation
The contingency tables can be computed by these matrix computations
Notation
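As a minimal sketch of the matrix computation, assume each unit's category is stored as an integer code (an assumption for illustration; the slides describe indicator vectors). Cross-tabulating the codes gives the same table as multiplying the transposed indicator matrix of X by the indicator matrix of Y. The function name `contingency` and the toy data are hypothetical.

```python
def contingency(rows, cols, n_rows, n_cols, weights=None):
    """Cross-tabulate two lists of category codes.

    Equivalent to R^T diag(w) C, where R and C are the 0/1 indicator
    matrices of the two variables (one row per unit).
    """
    if weights is None:
        weights = [1] * len(rows)  # archive case: every unit counts once
    table = [[0] * n_cols for _ in range(n_rows)]
    for r, c, w in zip(rows, cols, weights):
        table[r][c] += w
    return table


# Toy population of N = 5 units (hypothetical data).
x = [0, 0, 1, 1, 1]  # X with I = 2 categories
y = [0, 1, 0, 1, 1]  # Y with J = 2 categories
n_xy = contingency(x, y, 2, 2)
```

Passing survey weights through `weights` turns the same function into a weighted estimate of the table from a sample.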
As in the i.i.d. context, statistical matching can have a micro or macro purpose
MACRO APPROACH: The objective is the estimation of the contingency matrix
The statistical matching problem
MICRO APPROACH: The objective is the imputation of the missing Z values in A or, equivalently, of the missing Y values in B
Let be linearly dependent
Note: this assumption seems strange for categorical variables, but it has some useful properties.
Regression parameters are obtained by means of the normal equations
The conditional independence assumption
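A hedged reconstruction of the regression setup, assuming that x_s and y_s are the indicator vectors of the categories of X and Y for unit s, and writing N_XX for the sum of x_s x_s' and N_XY for the sum of x_s y_s' over the population. The symbol names are assumptions and may differ from those on the original slide.

```latex
\begin{align*}
y_s &= \beta_{YX}^{\top} x_s + e_s, \qquad s = 1, \dots, N,\\
N_{XX}\,\beta_{YX} &= N_{XY}
\quad\Longrightarrow\quad
\beta_{YX} = N_{XX}^{-1} N_{XY}.
\end{align*}
```

Since N_XX is diagonal when x_s is an indicator vector, row i of the coefficient matrix holds the proportions of the Y categories within category i of X. The same reasoning applies to Z.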
One property is that the marginal distributions are preserved
From the normal equations
Linear dependence: properties
Linear dependence: properties
The (Y, Z) contingency table under the conditional independence assumption (CIA) is
The true, but unknown, contingency table would be
The residual matrix is null when Y and X or Z and X are perfectly correlated.
Note that the table obtained under the CIA also preserves the observed marginal distributions
The conditional independence assumption
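A small sketch of the (Y, Z) table under the CIA, assuming the standard form in which cell (j, k) sums N_XY(i, j) N_XZ(i, k) / N_X(i) over the categories i of X, and assuming every category of X is non-empty. The function name `cia_table` and the toy tables are hypothetical; exact fractions are used so that the preserved margins can be checked without rounding.

```python
from fractions import Fraction


def cia_table(n_xy, n_xz):
    """(Y, Z) table under conditional independence of Y and Z given X."""
    n_x = [sum(row) for row in n_xy]  # marginal counts of X
    n_j, n_k = len(n_xy[0]), len(n_xz[0])
    table = [[Fraction(0)] * n_k for _ in range(n_j)]
    for i, total in enumerate(n_x):
        for j in range(n_j):
            for k in range(n_k):
                table[j][k] += Fraction(n_xy[i][j] * n_xz[i][k], total)
    return table


# Toy (X, Y) and (X, Z) tables sharing the same X margin (2, 3).
n_xy = [[1, 1], [1, 2]]
n_xz = [[1, 1], [1, 2]]
n_yz = cia_table(n_xy, n_xz)

y_margin = [sum(row) for row in n_yz]        # equals the observed Y margin
z_margin = [sum(col) for col in zip(*n_yz)]  # equals the observed Z margin
```

In this example the cells are fractional, which illustrates why the CIA table is an estimate of the joint distribution rather than an observed cross-tabulation.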
Let A and B be 2 samples drawn from the same finite population according to a complex survey design with the following first and second order inclusion probabilities
Let X be defined by two different kinds of variables:
• U: variables for which NU is known
• V: variables for which NV is unknown
X corresponds to the categorical variable whose categories are defined by the Cartesian product of all the common variables
From archives to samples
1. The survey weights of A and B are calibrated a first time, giving new weights for A and B respectively, under the constraint that each sample reproduces the known totals NU
2. These weights are used for a preliminary estimate of the distribution of V, one from A and one from B
Estimates under the CIA
3. The preliminary estimates for V can differ because of sampling variability. In this step we seek a unique estimate for the distribution of V
4. The weights of A and B are calibrated a second time in order to reproduce NU and the unique estimate of NV. The results are the final calibrated weights for A and B.
Estimates under the CIA
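The four calibration steps can be sketched as follows, under several assumptions that are not in the slides: there is one common variable with known population totals (`u`) and one without (`v`), calibration is done by post-stratification and raking (one simple form of calibration), and the two preliminary estimates are pooled with a weight `lam` proportional to sample size. All names and data are hypothetical.

```python
def weighted_counts(weights, codes, n_cat):
    """Weighted totals per category."""
    counts = [0.0] * n_cat
    for w, c in zip(weights, codes):
        counts[c] += w
    return counts


def poststratify(weights, codes, targets):
    """Rescale weights so the weighted totals per category hit the targets."""
    current = weighted_counts(weights, codes, len(targets))
    return [w * targets[c] / current[c] for w, c in zip(weights, codes)]


def rake(weights, u, u_totals, v, v_totals, n_iter=100):
    """Alternate post-stratification on u and v until both margins match."""
    for _ in range(n_iter):
        weights = poststratify(weights, u, u_totals)
        weights = poststratify(weights, v, v_totals)
    return weights


known_u = [40.0, 60.0]  # known population totals for u
u_a, v_a, w_a = [0, 0, 1, 1], [0, 1, 0, 1], [20.0, 30.0, 25.0, 25.0]
u_b, v_b, w_b = [0, 1, 1, 0, 1], [0, 0, 1, 1, 1], [20.0] * 5

# Step 1: first calibration to the known totals.
w_a1 = poststratify(w_a, u_a, known_u)
w_b1 = poststratify(w_b, u_b, known_u)
# Step 2: preliminary estimates for the variable without known totals.
v_hat_a = weighted_counts(w_a1, v_a, 2)
v_hat_b = weighted_counts(w_b1, v_b, 2)
# Step 3: a unique pooled estimate (the choice of lam is an assumption).
lam = len(w_a) / (len(w_a) + len(w_b))
v_pooled = [lam * a + (1 - lam) * b for a, b in zip(v_hat_a, v_hat_b)]
# Step 4: second calibration, reproducing both sets of totals.
w_a2 = rake(w_a1, u_a, known_u, v_a, v_pooled)
w_b2 = rake(w_b1, u_b, known_u, v_b, v_pooled)
```

After step 4 the two samples give the same estimates for both common variables, which is the sense in which they have been made homogeneous.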
5. Estimate
combining the estimates obtained from A and B with their final weights
6. The regression coefficients are estimated respectively from A and B:
Estimates under the CIA: macro approach
7. The estimate of the contingency table under the CIA, i.e.
is:
Estimates under the CIA: macro approach
8. Assuming A as the recipient, a preliminary imputed value for the missing Z is obtained through the estimated regression function
9. As we already know, this value is not a live value and can be unrealistic. In this case, a live value can be obtained through the use of an additional hot deck procedure (hence, a mixed procedure is used).
Note that, given that in step 8 we obtained a complete data set, we can use a distance hot deck procedure with a distance applied on (X, Y, Z) or (Y, Z)
Estimates under the CIA: micro approach
Let C be a sample of size nC, with observations and , and survey weights , c=1,…, nC.
This file can be used in order to improve the estimates and discard the conditional independence assumption.
These procedures use part of the procedure already explained for the estimates under the CIA, i.e. the creation of the final calibrated weights for A and B obtained in step 4.
Renssen (1998) defines two macro methods:
• The incomplete two-way stratification
• The synthetic two-way stratification
Auxiliary information: presence of an additional file C
5. Calibrate the initial C survey weights into the new weights , under the set of constraints
6. The table NYZ is estimated straightforwardly from C
A and B are used only in the set of constraints in step 5. C is able to reproduce the partial estimates on (X, Y) and (X, Z) obtained up to step 4.
Incomplete two-way stratification
Let us start from the already seen relationship
5. Estimate under the CIA
6. Calibrate the weights in C to the new weights , under the set of constraints
Synthetic two-way stratification
7. The synthetic two-way estimate is
This method uses C only to correct what was estimated via A and B under the CIA
Synthetic two-way stratification
The idea of file concatenation consists in the construction of a unique sample given by the union of the two observed samples A and B. This approach needs new survey weights to be assigned to the units observed in AB.
The probability that the concatenated sample contains a unit s is:
Assuming that the probability that a unit is included in both samples is negligible, we get:
Rubin (1986): file concatenation
The new weights become
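A minimal sketch of the concatenated weight, assuming that each survey weight is the inverse of the corresponding inclusion probability and that inclusion in both samples is negligible, so that the inclusion probability in the concatenated file is approximately the sum of the two. The function name `concatenated_weight` is hypothetical.

```python
def concatenated_weight(w_own, w_other):
    """Weight of a unit in the concatenated file A u B.

    w_own:   weight of the unit under the design it was actually sampled with
    w_other: weight it would have had under the design of the other sample
    Uses pi = 1 / w for each design and pi_AB ~ pi_A + pi_B.
    """
    return 1.0 / (1.0 / w_own + 1.0 / w_other)


# A unit with weight 10 in its own sample and weight 40 under the other design.
w_ab = concatenated_weight(10.0, 40.0)
```

The new weight is always smaller than both input weights, because the unit has more chances of being selected once the two samples are pooled.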
Rubin (1986): file concatenation
This approach can be difficult to apply, for several reasons:
It is necessary to compute the survey weights that the units a, a=1,…,nA, would have had in case these units had been sampled in file B, i.e. under the survey design that characterizes B.
It is necessary to compute the survey weights that the units b, b=1,…,nB, would have had in case these units had been sampled in file A, i.e. under the survey design that characterizes A
This approach produces the sample that we have used with i.i.d. samples.
It does not help in reducing the effects of the CIA unless additional information can be introduced.
Hopefully, the estimate of the marginal distribution of X is better because it is produced on a larger sample.
This approach is appropriate for the estimation of the statistical matching uncertainty.
File concatenation: comments
Let A be the recipient and B the donor file.
We seek imputed records of this nature:
If a distance hot deck method is applied without any modification of the A survey weights, a=1,…,nA, the imputed distribution of Z would not be coherent with the one actually observed in B.
In Canada they have considered a specific constrained distance hot deck method whose objective is that the new weights for the pair (a,b) (i.e. the recipient-donor record) are such that:
1. is minimized
2. under the constraints
Hot deck and complex survey designs
For simplicity assume that
Compute
The method consists of these three steps
The weight-split algorithm
1. Impute zb to those records in A such that
2. For those b that do not have ties, i.e. a record a such that
also impute zb to the first record in A such that
In this way we have considered all the pairs that would have been taken into consideration by the rank hot deck method.
Instead of a data set with nA records, we have a data set with nA+nB-T records, where T is the number of ties.
The weight-split algorithm
3. Reorder the synthetic records obtained in the first two steps according to their cumulated survey weight
where a and b are the recipient and donor records of the i-th ordered synthetic record
The weight of the i-th record is
Where =0
The weight-split algorithm
The marginal and joint distribution for (X, Y) are those observed in A
The marginal distribution of Z is that observed in B
The weight-split algorithm: properties
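One way to read the three steps is as a merge of the two cumulated weight sequences. The sketch below is not necessarily the authors' exact procedure. It assumes that the records of A and B are already sorted by the ranking variable and that the two sets of weights have the same total. The function name `weight_split` and the toy weights are hypothetical.

```python
def weight_split(w_a, w_b, tol=1e-12):
    """Return synthetic records (a, b, weight) from two sorted weight lists.

    Each synthetic record pairs recipient a with donor b over the stretch
    where their cumulated-weight intervals overlap. Its weight is the length
    of that overlap, i.e. the difference between consecutive cumulated weights.
    """
    records = []
    i = j = 0
    rem_a, rem_b = w_a[0], w_b[0]  # weight of a and b not yet assigned
    while i < len(w_a) and j < len(w_b):
        w = min(rem_a, rem_b)
        records.append((i, j, w))
        rem_a -= w
        rem_b -= w
        if rem_a <= tol:  # recipient a is used up: move to the next one
            i += 1
            if i < len(w_a):
                rem_a = w_a[i]
        if rem_b <= tol:  # donor b is used up: move to the next one
            j += 1
            if j < len(w_b):
                rem_b = w_b[j]
    return records


w_a = [2, 3, 5]  # recipient weights, total 10
w_b = [4, 6]     # donor weights, total 10
records = weight_split(w_a, w_b)
```

Here the cumulated weights are 2, 5, 10 in A and 4, 10 in B, so there is T = 1 tie and the result has nA+nB-T = 4 synthetic records. Summing the synthetic weights by recipient gives back the A weights, and summing by donor gives back the B weights, which is why the (X, Y) distribution of A and the Z distribution of B are both preserved.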
Selected references
DeGroot M H, Feder P I, Goel P K (1971) “Matchmaking”, The Annals of Mathematical Statistics, 42(2), 578–593
Renssen R H (1998) “Use of Statistical Matching Techniques in Calibration Estimation”, Survey Methodology, 24, 171–183
Rubin D B (1986) “Statistical Matching Using File Concatenation with Adjusted Weights and Multiple Imputations”, Journal of Business and Economic Statistics, 4, 87–94
Liu T P, Kovacevic M S (1994) “Statistical matching of survey datafiles: a simulation study”, Proceedings of the Section on Survey Research Methods of the American Statistical Association, 479–484