When adjusting for bias due to linkage errors: a sensitivity analysis
Q2014
Tiziana Tuoto
05/06/2014
Joint work with Loredana Di Consiglio
Outline of the talk
1. Motivations
2. Linkage errors and total survey error
3. Methodologies for analyses on linked data
4. A sensitivity analysis
5. Concluding remarks and future work
Adjusting for bias due to linkage errors, Tiziana Tuoto – Vienna, June 5° 2014
Why linking and why linkage errors?
• Integration of different sources (surveys, administrative lists, registers)
has acquired a preeminent role
• The considerable effort spent on linking data is not the final aim of the
statistical process
• Whatever statistical analysis is performed on integrated data, it should
take into account that the record linkage process behind the data is
subject to two types of errors:
1. erroneous acceptance of false links
2. rejection of true matches (missed links)
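The two error types can be made concrete with a toy example. The sketch below uses hypothetical record identifiers (not data from the talk) and simply compares the set of declared links with the set of true matches.

```python
# Toy illustration of the two linkage error types.
# Each pair is (id in list 1, id in list 2); all identifiers are hypothetical.
true_matches = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
declared_links = {("a1", "b1"), ("a2", "b2"), ("a3", "b4")}

# Type 1: links that were accepted but are not true matches.
false_links = declared_links - true_matches
# Type 2: true matches that the linkage procedure rejected (missed links).
missed_links = true_matches - declared_links
# Links that are both declared and true.
correct_links = declared_links & true_matches

print("false links: ", sorted(false_links))
print("missed links:", sorted(missed_links))
print("correct links:", sorted(correct_links))
```

Note that a single wrong link, here ("a3", "b4"), can produce one false link and two missed matches at the same time.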
Linkage Errors and Total Survey Error
Biemer 2010
Linkage Errors and Total Survey Error
Zhang 2012
Methodologies for analyses on linked data
• 1965 : Neter, Maynes and Ramanathan
• 1993-1997 : Scheuren and Winkler
• 2000 : Lahiri and Larsen
• 2009 : Chambers Regression analysis of probability-linked data, Official
Statistics Research Series, Vol. 4.
• 2011 : Chipperfield, Bishop and Campbell
Chambers (2009) gives a systematic overview of regression analysis of
linked data, describes the approaches developed by Neter et al., Scheuren and
Winkler, and Lahiri and Larsen, and proposes his own bias-corrected
estimators of regression parameters
Methodologies for analyses on linked data
These methods work under strong assumptions
• Exchangeable linkage errors model
• Linking sets of equal size (or the smaller set contained in the larger one)
• Linkage under a 1:1 constraint
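To make the exchangeability assumption concrete: in the form we understand from Chambers (2009), every record has the same probability of being correctly linked, and a wrong link is equally likely to point to any other record. The sketch below builds the resulting expected linkage matrix. The function name and the symbols lam (probability of a correct link) and gamma (probability of any specific wrong link) are our own notation, not taken from the slides.

```python
def exchangeable_linkage_matrix(m, lam):
    """Expected linkage matrix under exchangeable errors (hypothetical helper).

    m   : number of linked records (must be at least 2)
    lam : probability that a record is correctly linked
    Diagonal entries are lam; every off-diagonal entry is
    gamma = (1 - lam) / (m - 1), so each row sums to 1.
    """
    if m < 2:
        raise ValueError("need at least two records")
    gamma = (1.0 - lam) / (m - 1)
    return [[lam if i == j else gamma for j in range(m)] for i in range(m)]


E = exchangeable_linkage_matrix(4, 0.9)
row_sums = [sum(row) for row in E]
for row in E:
    print(["%.4f" % v for v in row])
```

A single parameter describes the whole error structure, which is what makes the model tractable and also what makes it a strong assumption.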
A sensitivity analysis
Winkler (2014) notes
• «Scheuren and Winkler (1997) observed that, if linkage error is below 1%,
then can perform statistical analysis without adjustment.
• Most ‘good’ matching situations have overall linkage error above 10%.
• Even ‘high match scores’ sets of pairs may have linkage error in range 1-5%.
• The current models may adjust the ‘observed’ matched pairs to having
linkage error down from 10% to 7.5%»
Experimental data
Random Sample of 1000 units from the fictitious population census data in
the ESSnet DI (2011).
Linear model (as in Chambers, 2009): Y = Xβ + ε
with X ~ [1, Uniform(0,1)], β = [1, 5]
ε ~ Norm(0,1)
Logistic model: X~Bernoulli(0.75)
Y~Multinom(0.7,0.05,0.2,0.05) dependent on X.
Two lists L1 and L2 were generated
L1 = [Xs, 942 units]
L2 = [Ys, 921 units]
Units in common (the true matches): 868; true non-matches: 127
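The linear set-up can be sketched in a few lines. This is a minimal simulation under the stated model (intercept 1, slope 5, Uniform(0,1) covariate, standard normal error); the seed, the function names and the use of simple least squares are our own choices and do not reproduce the authors' actual data.

```python
import random


def simulate_linear(n=1000, beta0=1.0, beta1=5.0, seed=2014):
    """Draw (x, y) pairs from y = beta0 + beta1 * x + eps, eps ~ N(0, 1)."""
    rng = random.Random(seed)  # arbitrary seed, chosen for reproducibility
    xs = [rng.random() for _ in range(n)]  # Uniform(0, 1) covariate
    ys = [beta0 + beta1 * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


def ols_simple(xs, ys):
    """Ordinary least squares for one covariate; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


xs, ys = simulate_linear()
intercept, slope = ols_simple(xs, ys)
print("intercept = %.3f, slope = %.3f" % (intercept, slope))
```

With 1000 units the estimates vary around (1, 5) from sample to sample, which is why a finite-population "true value" can differ slightly from the generating parameters.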
The three Linkage scenarios
Probabilistic record linkage procedures (Fellegi and Sunter, 1969) with the
software RELAIS (2011).
• Gold scenario: Name, Surname, Complete date of birth
• Silver scenario: Name, Surname, Year of Birth
• Bronze scenario: Day of birth, Month of birth, Address.
Scenario  Declared  False matches  Prob. missing  Prob. false  False matches
          matches   in declared    true matches   matches      rate
Gold      826       0              0.048          0            0
Silver    752       11             0.146          0.087        0.015
Bronze    786       30             0.129          0.236        0.038
Table 1 – Results of linkage procedures for the three Scenarios
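The three rates in Table 1 can be recovered from the counts given in the talk (868 true matches, 127 true non-matches). The definitions below are inferred because they reproduce the published figures; the slides themselves do not spell out the formulas, so treat them as an assumption.

```python
TRUE_MATCHES = 868      # units in common between L1 and L2
TRUE_NON_MATCHES = 127  # units present in only one of the two lists

# (declared matches, false matches among the declared), from Table 1
scenarios = {"Gold": (826, 0), "Silver": (752, 11), "Bronze": (786, 30)}


def linkage_rates(declared, false_links):
    """Return (prob. missing true matches, prob. false matches, false match rate).

    Inferred definitions (assumptions, not stated on the slides):
      prob. missing = missed true matches / true matches
      prob. false   = false matches / true non-matches
      false rate    = false matches / declared matches
    """
    correct = declared - false_links
    p_missed = (TRUE_MATCHES - correct) / TRUE_MATCHES
    p_false = false_links / TRUE_NON_MATCHES
    false_rate = false_links / declared
    return round(p_missed, 3), round(p_false, 3), round(false_rate, 3)


rates = {name: linkage_rates(*counts) for name, counts in scenarios.items()}
for name, values in rates.items():
    print(name, values)
```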
Linear Model – Naive Estimator and Linkage error bias adjusted estimators
Linkage scenario  Estimator                    Beta           Standard Error
Population        True Value                   0.886 - 5.155  0.064 - 0.112
Perfect Linkage   Naïve                        0.907 - 5.093  0.069 - 0.121
Gold Linkage      Naïve                        0.927 - 5.085  0.071 - 0.123
Silver Linkage    Naïve                        0.988 - 4.976  0.079 - 0.138
                  Ratio – ModOLS – Predictive  0.952 - 5.050  0.080 - 0.141
                  Eb_CUE                       0.949 - 5.055  0.080 - 0.141
Bronze Linkage    Naïve                        1.045 - 4.876  0.078 - 0.135
                  Ratio – ModOLS – Predictive  0.949 - 5.070  0.081 - 0.144
                  Eb_CUE                       0.947 - 5.075  0.081 - 0.144
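The size of the correction on the slope (the second Beta value) can be checked by hand. Under an exchangeable errors model with an intercept in the regression, the naive slope is attenuated by the factor (lam - gamma), where lam is the probability of a correct link and gamma = (1 - lam) / (M - 1); a ratio-type correction divides by that factor. The sketch below assumes lam equals one minus the false match rate of Table 1 and M equals the number of declared matches. Both are our assumptions, and this is not necessarily the computation behind the table, although it lands close to the reported ratio-type values.

```python
def corrected_slope(naive_slope, declared, false_links):
    """Ratio-type slope correction under exchangeable linkage errors.

    Assumptions (ours, not stated on the slides):
      lam   = 1 - false_links / declared   (probability of a correct link)
      gamma = (1 - lam) / (declared - 1)   (probability of a specific wrong link)
    The naive slope is divided by the attenuation factor (lam - gamma).
    """
    lam = 1.0 - false_links / declared
    gamma = (1.0 - lam) / (declared - 1)
    return naive_slope / (lam - gamma)


# Naive slopes and linkage counts as reported in the talk.
silver = corrected_slope(4.976, declared=752, false_links=11)
bronze = corrected_slope(4.876, declared=786, false_links=30)
print("Silver: %.3f  Bronze: %.3f" % (silver, bronze))
```

The correction only addresses false links; it does nothing about missed matches, so the adjusted slopes still sit below the population value.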
Logistic Model – Naive and Adjusted estimators
Linkage scenario  Estimator      Beta    Standard Error
Population        True Value     -1.680  0.087
Perfect Linkage   Naïve          -1.744  0.096
Gold Linkage      Naïve          -1.762  0.100
Silver Linkage    Naïve          -1.795  0.106
                  Est. Equ. ML   -1.798  0.106
                  LL             -1.803  0.107
                  Est. Equ. Ch.  -1.817  0.107
Bronze Linkage    Naïve          -1.734  0.101
                  Est. Equ. ML   -1.741  0.102
                  LL             -1.755  0.102
                  Est. Equ. Ch.  -1.789  0.104
Remarks
• Missed matches need to be taken into account to completely remove the
effect of linkage errors on the bias of the estimates.
• The naïve estimators under perfect linkage and the Gold scenario are still
biased due to missing true matches.
• Again, in the logistic regression, the naïve estimate is less biased under
the Bronze scenario than under the Silver one, because the missed matches
component is lower there (0.129 versus 0.146 in Table 1).
• The correction for bias is effective in the linear case (achieving a bias
reduction of about 10% for the Silver scenario and higher for the Bronze
one), but more work is needed for the logistic case, where the naïve
estimator performs slightly better.
Future work
• Further work is needed to investigate the effects of linkage errors on the
variance component.
• Further analyses are needed to assess the trade-off between adjusting for
bias and the expected increase in variance.
• A more flexible framework, as in Chipperfield et al. (2011), where
exchangeability of linkage errors is not required and missed matches are
explicitly considered.
• Finally, the probability of being correctly linked and the probability of
erroneous missed matches are assumed here to be known, whereas evaluating
linkage errors is not a straightforward task.
Bibliography
Biemer, P. P. (2010). Total survey error: design, implementation, and evaluation. Public Opinion Quarterly, Vol. 74, No. 5.
Chambers, R. (2009). Regression analysis of probability-linked data. Official Statistics Research Series, Vol. 4.
Chipperfield, J. O., Bishop, G. R. and Campbell, P. (2011). Maximum likelihood estimation for contingency tables and logistic regression with incorrectly linked data. Survey Methodology, Vol. 37, No. 1.
Fellegi, I. P. and Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64, 1183-1210.
Lahiri, P. and Larsen, M. D. (2000). Model based analysis of records linked using mixture models. Proc. of the Section on Survey Research Methods, ASA, 11-19.
Lahiri, P. and Larsen, M. D. (2005). Regression analysis with linked data. Journal of the American Statistical Association, 100, 222-230.
McLeod, Heasman and Forbes (2011). Simulated data for the on the job training. ESSnet DI. http://www.cros-portal.eu/content/job-training
Neter, J., Maynes, S. and Ramanathan, R. (1965). The effect of mismatching on the measurement of response errors. Journal of the American Statistical Association.
RELAIS (2011). User’s guide version 2.2, available at http://joinup.ec.europa.eu/software/relais/release/22
Scheuren, F. and Winkler, W. E. (1993). Regression analysis of data files that are computer matched. Survey Methodology, 39-58.
Scheuren, F. and Winkler, W. E. (1997). Regression analysis of data files that are computer matched, part II. Survey Methodology, 157-165.
Winkler, W. E. (2014). Quality and Analysis of National Files - Computational Methods for Censuses and Surveys. Presentation, January 9, 2014.
Zhang, L.-C. (2012). Topics of statistical theory for register-based statistics and data integration. Statistica Neerlandica, 66.