Eurostat Accuracy of Results of Statistical Matching Training Course «Statistical Matching» Rome, 6-8 November 2013 Marcello D’Orazio Dept. National Accounts and Economic Statistics, Istat madorazi [at] istat.it


Eurostat

Accuracy of Results of Statistical Matching

Training Course «Statistical Matching»

Rome, 6-8 November 2013

Marcello D’Orazio
Dept. National Accounts and Economic Statistics, Istat
madorazi [at] istat.it


Outline

• Problems in evaluation
• Evaluation in the macro SM applications
• Evaluation in the micro SM applications


Steps in a SM application

1) Choice of the target variables, i.e. of the variables observed distinctly in two sample surveys.

2) Identification of all the common variables in the two sources. Not all can be used due to lack of harmonization, different definitions, etc.

3) Choice of the matching variables: only those common variables that are able to predict the target variables.

4) Application of the chosen SM technique

5) Evaluation of the results of the SM

For further details see Scanu (2008).


Problems in Evaluation

(i) The general objective of SM is to study the relationship between phenomena that are not jointly observed, unless an additional auxiliary data source is available.

(ii) SM can provide different outputs: a synthetic data set in the micro case; one or more estimates (e.g. a correlation coefficient, a regression coefficient, probabilities in a contingency table, etc.) in the macro case.

(iii) There are two or more data sources, which may have different quality “levels” (sampling design, sample size, data processing steps, etc.).


Problems in Evaluation: (i) Phenomena Not Jointly Observed

The fact that the phenomena are not jointly observed is the major source of uncertainty concerning the matching results.

This lack of information has to be filled in by:

- making some assumptions (e.g. the conditional independence of the target variables given the matching variables)

- using additional auxiliary information (an external estimate of the parameters of interest, an additional data source, etc.).

An alternative approach consists in evaluating just the uncertainty due to this lack of joint information.


Problems in Evaluation: (i) Phenomena Not Jointly Observed (cont.)

The results of the SM will necessarily reflect the underlying assumptions/information being used:

- Results of a matching application based on the conditional independence (CI) assumption will reflect it; they will be unreliable if CI does not hold.

- If auxiliary information is used (CI assumption avoided), the results of the SM are expected to reflect such input. If the input information is not reliable, the results of the SM will be unreliable.

In this setting a researcher “knows” what to expect, but still has to check whether the chosen matching method has been applied correctly, avoiding the introduction of additional noise or bias.


Evaluation: Checks in the Macro Case

The outputs are estimates of parameters.

It may be easy to check whether there is some additional noise: under the CI assumption it is in some cases possible to derive analytic estimation formulas for the parameters of interest. Examples:

Correlation coefficient:

$$\hat{\rho}^{(CI)}_{YZ} = \hat{\rho}_{YX}\,\hat{\rho}_{ZX}$$

Cell probabilities:

$$\hat{P}^{(CI)}_{Y=j,\,Z=k} = \sum_{i=1}^{I} \hat{P}_{X=i}\,\hat{P}_{Y=j|X=i}\,\hat{P}_{Z=k|X=i}$$
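The two CI formulas on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the list-based layout of the inputs are assumptions, and the inputs are taken to be already estimated from the two samples.

```python
def ci_correlation(rho_yx, rho_zx):
    """Correlation of Y and Z implied by conditional independence given X."""
    return rho_yx * rho_zx


def ci_cell_probabilities(p_x, p_y_given_x, p_z_given_x):
    """Cell probabilities of the Y-by-Z table under conditional independence.

    p_x[i]            : estimated P(X = i)
    p_y_given_x[i][j] : estimated P(Y = j | X = i)
    p_z_given_x[i][k] : estimated P(Z = k | X = i)
    Returns table[j][k], the estimate of P(Y = j, Z = k).
    """
    n_y = len(p_y_given_x[0])
    n_z = len(p_z_given_x[0])
    table = [[0.0] * n_z for _ in range(n_y)]
    for i, px in enumerate(p_x):
        for j in range(n_y):
            for k in range(n_z):
                # Sum over the categories of X of P(X) * P(Y|X) * P(Z|X)
                table[j][k] += px * p_y_given_x[i][j] * p_z_given_x[i][k]
    return table
```

Comparing the output of a matching procedure run under CI with these analytic values is one way to spot noise introduced by the procedure itself.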


Evaluation: Checks in the Macro Case (cont.)

When auxiliary information has been used:

- If it consists of an external estimate of the target parameter, it is possible to compare it with the final estimate obtained at the end of the SM.

- If it consists of an estimate of a parameter other than the target one, it is necessary to understand the relationship between the two parameters.

- If it consists of an additional data source, it is necessary to understand how it has been used in the whole SM estimation process.


Evaluation: Checks in the Macro Case (cont.)

Some suggestions for complex situations:

a) Test the SM in a small pilot study in which it is easy to control the whole process.

b) Carry out a sensitivity analysis (check how the output changes when one or more of the input parameters are changed).

c) Carry out a series of simulations: replicate the matching application a large number of times for a given set of inputs (sometimes just a small, controlled amount of randomness is introduced).
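A sensitivity analysis of the kind suggested in (b) might look like the sketch below. It rests on assumptions that are not in the slides: the variables are treated as jointly normal, and the input parameter being varied is the partial correlation of Y and Z given X, which cannot be estimated from the two samples. The function names are hypothetical.

```python
import math


def rho_yz(rho_yx, rho_zx, rho_yz_given_x):
    """Correlation of Y and Z implied by a chosen partial correlation given X.

    With rho_yz_given_x = 0 this reduces to the CI value rho_yx * rho_zx.
    """
    scale = math.sqrt((1.0 - rho_yx ** 2) * (1.0 - rho_zx ** 2))
    return rho_yx * rho_zx + rho_yz_given_x * scale


def sensitivity(rho_yx, rho_zx, grid):
    """Return (input, output) pairs for each assumed partial correlation."""
    return [(r, rho_yz(rho_yx, rho_zx, r)) for r in grid]


# Example: how much does the output move when the assumed input changes?
results = sensitivity(0.8, 0.5, [-0.5, -0.25, 0.0, 0.25, 0.5])
```

A wide spread of outputs over a plausible grid signals that the final estimate depends heavily on the assumption or auxiliary input.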


Evaluation: Checks in the Micro Case

The output of SM is a synthetic file with all the needed variables.

It should be checked whether it can be considered a representative sample (in a wide sense, considering the relationships between variables too).

This can be done only partially because of the lack of joint information concerning Y and Z, unless some auxiliary information is available.


Evaluation: Checks in the Micro Case (cont.)

Rässler (2002) suggests looking at the “validity” of the SM procedure by analyzing how well the synthetic data set:

a) preserves the marginal distribution of the imputed variable (the reference is the one in the donor data set);

b) preserves the joint distribution of the imputed variable with the matching variables (the reference is the one in the donor data set).

To compare marginal or joint distributions of the variables in the synthetic data set with those in the donor, it is possible to use statistical tests and descriptive measures.


Evaluation: Micro Case, Comparing Distributions

Statistical tests should be used to compare the distribution of the Z variable imputed in the synthetic data set (A) with the one in the donor (B, the reference), e.g. Chi-squared, Kolmogorov-Smirnov, etc.:

$$\chi^2 = n_A \sum_{g=1}^{G} \frac{\left(\hat{P}^{A}_{g} - \hat{P}^{B}_{g}\right)^2}{\hat{P}^{B}_{g}}$$

$$D = \max_{z} \left| \hat{F}^{A}(z) - \hat{F}^{B}(z) \right|$$

Ad hoc modified tests are available to deal with data from complex sample surveys (for modified Chi-squared tests cf. Särndal et al., 1992, pp. 500-513).

The modified tests require additional information (estimates of the sampling variance or of the design effect), which in some cases may not be available.

Relatively few modified tests are available (e.g. a Kolmogorov-Smirnov test accounting for a complex sampling design does not exist).
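As an illustration, the two test statistics can be computed as in the following sketch. It covers only the plain, unweighted versions (no correction for a complex sampling design), returns the statistics without p-values, and uses hypothetical function names.

```python
def chi_square_statistic(p_a, p_b, n_a):
    """Chi-squared statistic comparing the proportions in A with the
    reference proportions in B; n_a is the number of units in A.
    Assumes every reference proportion in p_b is strictly positive."""
    return n_a * sum((a - b) ** 2 / b for a, b in zip(p_a, p_b))


def ks_statistic(z_a, z_b):
    """Kolmogorov-Smirnov statistic: largest absolute difference between
    the two (unweighted) empirical distribution functions."""
    def ecdf(sample, z):
        return sum(1 for v in sample if v <= z) / len(sample)

    # The maximum is attained at one of the observed values.
    points = sorted(set(z_a) | set(z_b))
    return max(abs(ecdf(z_a, z) - ecdf(z_b, z)) for z in points)
```

With survey data, these unweighted statistics are only a rough indication, since they ignore weights and design effects.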


Evaluation: Micro Case, Similarity/diss. Between Distributions

When the statistical tests are not applicable, it is possible to adopt an “empirical approach”, which consists in comparing the marginal distributions estimated from the two surveys using similarity/dissimilarity measures.

The dissimilarity index, or total variation distance, between distributions is:

$$\hat{\Delta} = \frac{1}{2} \sum_{g=1}^{G} \left| \hat{P}^{A}_{g} - \hat{P}^{B}_{g} \right|$$

where the relative frequencies are estimated using the weights $w_k$:

$$\hat{P}_{g} = \frac{\sum_{k=1}^{n} w_k\, I(X_k = g)}{\sum_{k=1}^{n} w_k}$$

$0 \le \hat{\Delta} \le 1$; 0 means that the distributions are equal.

It can be interpreted as the smallest fraction of units in A that would need to be re-classified in order to make the distribution equal to that of B.

Rule of thumb (Agresti, 2002, pp. 329-330): $\hat{\Delta} \le 0.02$ or $0.03$ denotes that the data in A follow the distribution in B quite closely, even though not perfectly.
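A minimal sketch of the weighted proportions and of the dissimilarity index follows. The function names and the way categories are passed in are assumptions made for illustration.

```python
def weighted_proportions(values, weights, categories):
    """Weighted relative frequency of each category:
    sum of the weights of the units in the category over the total weight."""
    total = sum(weights)
    return [
        sum(w for v, w in zip(values, weights) if v == g) / total
        for g in categories
    ]


def dissimilarity(p_a, p_b):
    """Dissimilarity index (total variation distance) between two
    distributions defined over the same ordered list of categories."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p_a, p_b))


# Toy data: a categorical variable observed in A and in B, with weights.
cats = ["low", "mid", "high"]
p_a = weighted_proportions(["low", "mid", "mid", "high"], [1.0, 2.0, 1.0, 1.0], cats)
p_b = weighted_proportions(["low", "low", "mid", "high"], [1.0, 1.0, 2.0, 1.0], cats)
delta = dissimilarity(p_a, p_b)
```

The resulting value can then be compared with the rule-of-thumb threshold quoted on the slide.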


Evaluation: Micro Case, Similarity/diss. Between Distributions (cont.)

The overlap between two distributions is:

$$OV = \sum_{g=1}^{G} \min\left( \hat{P}^{A}_{g}, \hat{P}^{B}_{g} \right)$$

$0 \le OV \le 1$; 1 means that the distributions are equal.

It is strictly related to the dissimilarity index:

$$\hat{\Delta} = 1 - OV$$

Using Agresti’s rule of thumb (2002, pp. 329-330), $OV \ge 0.97$ denotes that the data in A follow the distribution in B quite closely, even though not perfectly.
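The relation between the overlap and the dissimilarity index can be checked numerically, as in this small sketch. The example proportions are invented for illustration, and the identity relies on both distributions summing to 1.

```python
def overlap(p_a, p_b):
    """Overlap: sum over categories of the smaller of the two proportions."""
    return sum(min(a, b) for a, b in zip(p_a, p_b))


def dissimilarity(p_a, p_b):
    """Dissimilarity index: half the sum of absolute differences."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p_a, p_b))


# Invented proportions over three categories (each list sums to 1).
p_a = [0.30, 0.45, 0.25]
p_b = [0.28, 0.46, 0.26]

ov = overlap(p_a, p_b)
delta = dissimilarity(p_a, p_b)

# The dissimilarity index equals one minus the overlap (up to rounding).
assert abs(delta - (1.0 - ov)) < 1e-12
```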


Evaluation: Micro Case, Similarity/diss. Between Distributions (cont.)

A distance between the two distributions can be computed by means of the Hellinger distance:

$$d_H = \sqrt{1 - B} = \sqrt{1 - \sum_{g=1}^{G} \sqrt{\hat{P}^{A}_{g}\, \hat{P}^{B}_{g}}}$$

It satisfies the properties of a distance measure: symmetry, triangle inequality, and $0 \le d_H \le 1$ (0 means that the distributions are equal).

B is the Bhattacharyya coefficient ($0 \le B \le 1$).

Rule of thumb: $d_H \le 0.05$ means that the distributions are close (few references in the literature).

It is related to the dissimilarity index:

$$d_H^2 \le \hat{\Delta} \le d_H \sqrt{2}$$

Example: if $d_H = 0.02$ then $0.0004 \le \hat{\Delta} \le 0.0283$.
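The Hellinger distance and the bounds it places on the dissimilarity index can be sketched as follows. The function names are hypothetical, and the guard against a slightly negative value under the square root is an implementation detail added here to absorb floating-point rounding.

```python
import math


def bhattacharyya(p_a, p_b):
    """Bhattacharyya coefficient: sum of the geometric means of the proportions."""
    return sum(math.sqrt(a * b) for a, b in zip(p_a, p_b))


def hellinger(p_a, p_b):
    """Hellinger distance: square root of one minus the Bhattacharyya coefficient."""
    return math.sqrt(max(0.0, 1.0 - bhattacharyya(p_a, p_b)))


def dissimilarity_bounds(d_h):
    """Lower and upper bounds on the dissimilarity index implied by d_h."""
    return d_h ** 2, d_h * math.sqrt(2.0)


# Bounds for the example on the slide (d_H = 0.02).
lower, upper = dissimilarity_bounds(0.02)
```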


Evaluation: Micro Case, Similarity/diss. Between Distributions (cont.)

Similarity/dissimilarity measures can be used when dealing with categorical (nominal or ordered) variables.

When dealing with continuous variables they cannot be applied unless the variables are categorized (into equal-width bins or according to the percentiles of the reference variable; see for instance the rules used to determine the number of classes when drawing histograms).
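Categorizing a continuous variable according to the percentiles of the reference variable might be done as in the sketch below. It uses unweighted quantiles from Python's standard library, which is a simplification for survey data; the function names and the choice of four classes in the example are assumptions.

```python
import bisect
import statistics


def percentile_cut_points(reference, n_bins):
    """Cut points splitting the reference variable into n_bins classes
    with roughly equal frequency (returns n_bins - 1 values)."""
    return statistics.quantiles(reference, n=n_bins)


def bin_proportions(values, cut_points):
    """Relative frequency of each class defined by the cut points."""
    counts = [0] * (len(cut_points) + 1)
    for v in values:
        # Index of the class the value falls into
        counts[bisect.bisect_right(cut_points, v)] += 1
    return [c / len(values) for c in counts]


# Toy reference variable (donor) and imputed variable (synthetic file).
reference = list(range(1, 101))
imputed = [v + 5 for v in range(1, 101)]

cuts = percentile_cut_points(reference, 4)
p_ref = bin_proportions(reference, cuts)
p_imp = bin_proportions(imputed, cuts)
```

The two vectors of class proportions can then be fed to any of the similarity/dissimilarity measures for categorical variables.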


Selected references

Agresti, A (2002) Categorical Data Analysis, 2nd Edition. Wiley, Chichester.

D’Orazio, M (2011b) “Statistical Matching and Imputation of Survey Data with the Package StatMatch for the R Environment”. R package vignette. http://www.cros-portal.eu/sites/default/files//Statistical_Matching_with_StatMatch.pdf

D’Orazio, M and Di Zio, M and Scanu, M (2006) Statistical Matching: Theory and Practice. Wiley, Chichester

Särndal, CE and Swensson, B and Wretman, J (1992) Model Assisted Survey Sampling. Springer-Verlag, New York.

Scanu, M (2008) “The practical aspects to be considered for statistical matching”, In Eurostat Report of WP2: Recommendations on the use of methodologies for the integration of surveys and administrative data, ESSnet Statistical Methodology Project on Integration of Survey and Administrative Data, pp. 34-35. http://cenex-isad.istat.it