Record Linkage As a Statistical Procedure

26
Record Linkage As a Statistical Procedure: Some History, Formal Frameworks, Applications, and Challenges Stephen E. Fienberg Department of Statistics, Machine Learning Department, Living Analytics Research Centre Carnegie Mellon University SAMSI Program on Statistical and Computational Methods in the Social Sciences August 18, 2013 1 / 26

Transcript of Record Linkage As a Statistical Procedure

Page 1: Record Linkage As a Statistical Procedure

Record Linkage As a Statistical Procedure:Some History, Formal Frameworks,

Applications, and Challenges

Stephen E. FienbergDepartment of Statistics, Machine Learning Department,

Living Analytics Research CentreCarnegie Mellon University

SAMSI Program on Statistical and Computational Methods in theSocial Sciences

August 18, 2013

1 / 26

Page 2: Record Linkage As a Statistical Procedure

Introduction

Why Record Linkage?Where it BeganThe Fellegi-Sunter FrameworkRecord Linkage as Missing DataSome Other ApproachesChallenges

2 / 26

Page 3: Record Linkage As a Statistical Procedure

Record Linkage By Other Names

Some alternative names for record linkage:MatchingDe-duping (duplicate detection)Merge-purgeDatabase hardeningIdentity uncertaintyCo-reference resolution

Used by statistical agencies and data warehouses, and indatabase management, digital libraries, fraud detection,law enforcement, natural language processing, anddatamining.Privacy-preserving dataminingPrimary mechanism used as part of database intruderattackExact vs. statistical matching

3 / 26

Page 4: Record Linkage As a Statistical Procedure

De-Duping Citations From Google Scholar

Are the following Google Scholar listings referring to the samebook?

4 / 26

Page 5: Record Linkage As a Statistical Procedure

De-Duping Citations From Google Scholar (cont.)

Are the following Google Scholar listings referring to the samebook?

5 / 26

Page 6: Record Linkage As a Statistical Procedure

Exact vs. Statistical Matching

Exact matching: Link (X ,Z ) with (Y ,Z ):Updates to Social Security Administration Master EarningsFile (MEF) and Numident file.Electronic medical records.

Statistical matching: Link (X ,Z ) with (Y ,Z ′) where Z ′ is anoisy version of Z or vice versa:

Duplicate/misspelled names:

Misspellings: Steve, Steven and Stephen; Fienberg,Feinberg, Fineberg, Fienburg, Feinburg, Steinberg, etc.Basically noisy matching data.

6 / 26

Page 7: Record Linkage As a Statistical Procedure

Where It Began: Foundational Work

Ideas surfaced in multiple contexts with the rise ofcomputational infrastructure in the 1950s:

Post-WWII welfare state and taxation system led to newadministrative record systemsNew computer technology

Three key papers:1 H.B. Newcombe, J.M. Kennedy, S.J. Axford, and A.P. James

(1959). “Automatic Linkage of Vital Records,” Science, 130(3381), 954–959.

2 B.J. Tepping (1968). “A Model for Optimum Linkage of Records,”J. Amer. Statist. Assoc., 63 (324), 1321–1332.

3 I.P. Fellegi and A.B. Sunter (1969). “A Theory for Record Linkage,”J. Amer. Statist. Assoc., 64 (328), 1183–1210.

Public response: threat to individual privacyR. Kraus (2013). “Statistical Deja Vu: The National Data CenterProposal of 1965 and Its Descendants,” J. Privacy andConfidentiality, Vol. 5, No. 1.

7 / 26

Page 8: Record Linkage As a Statistical Procedure

The Fellegi-Sunter Framework

Represent every pair of records using vector of featuresthat describe similarity between individual record fields.Place feature vectors for record pairs into three classes:matches (M), nonmatches (U), and possible matches.

Let P(γ|M)and P(γ|U) are probabilities of observing thatfeature vector for a matched and nonmatched pair,respectively.Perform record-pair classification by calculating the ratio(P(γ|M))/(P(γ|U)) for each candidate record pair, where γis a feature vector for pair.Establish two thresholds based on desired error levels—Tµ

and Tλ—to optimally separate the ratio values forequivalent, possibly equivalent, and nonequivalent recordpairs.

Because most record pairs are clearly nonmatches,blocking databases so that only records in blocks arecompared significantly improves efficiency.1-1 linkage assumption often drives accuracy.

8 / 26

Page 9: Record Linkage As a Statistical Procedure

The Fellegi-Sunter Framework: II

L. Gu and R. Baxter and D. Vickers and C. Rainsford 2003 Record Linkage:Current Practice and Future Directions. CSIRO

9 / 26

Page 10: Record Linkage As a Statistical Procedure

The Fellegi-Sunter Framework: III

Possible matches often go to clerical review in statisticalagency context.Can do this parametrically using logistic regression orsome other GLM, or non-parametrically.Supervised learning (with training data) vs.non-supervised learning

J.B. Copas and F.J. Hilton (1990). “Record Linkage: StatisticalModels for Matching Computer Records,” J. Roy. Statist. Soc. (A),153, 287–320.S. Ventura, R. Nugent, and E. Fuchs (2013). Methods Matter:Rethinking Inventor Disambiguation Algorithms with ClassificationModels and Labeled Inventor Records.

Use string metrics and edit-distances for names andstrings of numbers.

M. Bilenko, R. Mooney, W.W. Cohen, P. Ravikumar, and S.E.Fienberg (2003) “Adaptive Name-Matching in InformationIntegration,” IEEE Intelligent Systems 18 (5), 16–23.

10 / 26

Page 11: Record Linkage As a Statistical Procedure

String-Comparator Illustration (Source: Winkler)

11 / 26

Page 12: Record Linkage As a Statistical Procedure

Fellegi-Sunter As a Missing Data Problem

True match status unknown, and so impute.Latent variable representation.Bayesian and non-Bayesian formulations:

D.B. Rubin (1986). “Statistical matching using file concatenationwith adjusted weights and multiple imputations.” J. Bus. & Econ.Statist., 4 (1), 87–94.W.E. Winkler (1988). “Using the EM Algorithm for WeightComputation in the Fellegi-Sunter Model of Record Linkage,” Proc.Section on Survey Research Methods, Amer. Statist. Assoc.,667–671.T.N. Herzog, F.J. Scheuren, and W.E. Winkler (2007). Data Qualityand Record Linkage. New York: Springer.

Much subsequent work by Belin, Larsen, Zaslavsky andothers.

12 / 26

Page 13: Record Linkage As a Statistical Procedure

Record Linkage With Multiple Lists

Traditional approach used pairwise list record linkage.Problem: Non-transitive links: A (list 1) links to B (list 2)and B (list 2) links to (C (list 3), BUT A does not link to C!Lack of transitivity issue becomes greater as number oflists, K grows.Possible transitive linkage patterns with three lists:

Generalized version of FS with logistic structure and EMgives principled transitive solution.

M. Sadinle and S.E. Fienberg (2013). “A GeneralizedFellegi-Sunter Framework for Multiple Record Linkage WithApplication to Homicide Record Systems.” J. Amer. Statist.Assoc., 108 (502), 385–397.Integration of 3 datafiles: 67 homicides recorded by ColumbiaCensus Bureau, 62 by National Police, and 33 Forensics Institute.

13 / 26

Page 14: Record Linkage As a Statistical Procedure

Post-Enumeration Surveys and Undercount Estimation

Census subject to enumeration errors (EE) and omissions.Post-Emumeration Survey takes sample of blocksnationwide and essentially redoes census in them.For each block we can think about linking Census andPES: First correct for EE , then focus on those with“sufficient information for matching” and use dual systemsestimator:

DSE = DD × CR × 1MR

(1)

where DD is no. people with sufficient information,CR = CE

CE+EE , and MR is the match rate.Undercount debate is over DSE assumptions, and how tostratify, aggregate, smooth, and project to rest of the blocksin nation.

14 / 26

Page 15: Record Linkage As a Statistical Procedure

Post-Enumeration Survey Matching Using Big Match

15 / 26

Page 16: Record Linkage As a Statistical Procedure

Post-Enumeration Surveys and Undercount EstimationIII

But what do we do about the errors associated with recordlinkage?Need to propagate these into the estimation process.In context of census-adjustment methododology, this wasthe thrust of the Bayesian multiple-imputation methods of

T.R. Belin and D.B.Rubin (1995). “A Method for CalibratingFalse-Match Rates in Record Linkage,” J. Amer. Statist. Assoc.,90, 694–707.M.D. Larsen and D.B. Rubin (2001). “Iterative Automated RecordLinkage Using Mixture Models,” J. Amer. Statist. Assoc., 96,32–41.

16 / 26

Page 17: Record Linkage As a Statistical Procedure

Major Record Linkage Projects at U.S. CensusBureau: 2000 Census

StARS/AREX program with 6 record systems in two areasof the country. Top-down matching versus bottom-up(linking each to MAF).

B.V. Bye and D.H. Judson (2004). “Results From theAdministrative Records Experiment in 2000.” Census 2000Testing, Experimentation, and Evaluation Program SynthesisReport No. 16, TR-16. U. S. Census Bureau, Washington, DC.

17 / 26

Page 18: Record Linkage As a Statistical Procedure

Major Record Linkage Projects at U.S. CensusBureau: 2010 Census

2012 study evaluating data sources used in StARS,additional federal data sources, and commercial data.Matches addresses and persons to 2010 Census toevaluate the quality and coverage of administrative data.

131.7 million addresses in tCensus and 151.3 millionaddresses in administrative records: 92.6% matched; 29.3million administrative records addresses not in Census; and9.7 million Census addresses not in administrative records.Person matching more complex. Also person-addresspairs.S. Rastogi, A. O’Hara, and others (2012). 2010 “Census MatchStudy.” 2010 Census Program for Evaluations and Experiments.U. S. Census Bureau, Washington, DC.

18 / 26

Page 19: Record Linkage As a Statistical Procedure

Record Linkage and Multiple Systems Estimation

Three critical components:1 “putting the lists together,” or record linkage.

From that component one can estimate the number ofdistinct individuals observed in all of the lists combined. Ifrecord linkage were perfect this would not be an estimate, itwould be a list of all of the observed casualties.But since there is error in matching no matter how welldone statistically, there could be over- or under-estimation.

2 Multiple systems estimation where one generalizes fromthe observed individuals to those missed on all of the lists.

Many different kinds of models available for the task:log-linear, Rasch, mixed-membership or in combination.Also covariates.Need to consider how to incorporate model uncertainty intosome form of overall population estimates.

3 Need to propagate uncertainty from the record linkage asan added component of uncertainty in multiple-systemsestimation.

19 / 26

Page 20: Record Linkage As a Statistical Procedure

Privacy-Preserving Record Linkage and Datamining I

20 / 26

Page 21: Record Linkage As a Statistical Procedure

Privacy-Preserving Record Linkage and Datamining II

21 / 26

Page 22: Record Linkage As a Statistical Procedure

Privacy-Preserving Record Linkage and Datamining III

For an example of PPDM, see:Y. Nardi, S.E. Fienberg, and R.J. Hall (2012). “AchievingBoth Valid and Secure Logistic Regression Analysis onAggregated Data from Different Private Sources,” Journal ofPrivacy and Confidentiality, Vol. 4: Iss. 1, Article 9.Available at: http://repository.cmu.edu/jpc/vol4/iss1/9

But we assume that records can be linked by an “honestthird party.”Need to put ideas together.

22 / 26

Page 23: Record Linkage As a Statistical Procedure

Record Linkage and Graphs

1 Non-parametric approach using bi-partite graph structure.Propagates error into a confidence interval for statisticalcomputation, e.g., regression, or MSE.R. Hall and S.E. Fienberg (2012). “Valid statistical inference onautomatically matched files,” In Privacy in Statistical Databases,PSD 2010 (J. Domingo-Ferrer and I. Tinnirello, eds.), LectureNotes in Computer Science Vol. 7556, Berlin: Springer, 131–142.

2 Network block-models: Idea of a partition of aconcatenated set of records.

M. Sadinle, in preparationR. Steorts, R. Hall, and S.E. Fienberg, in preparation

23 / 26

Page 24: Record Linkage As a Statistical Procedure

Some Challenges I

Developing RL methods that scale:No. of comparisons explodes with increasing K :n1 × n2 × . . .× nK .Innovative uses of blocking and approximate stratification.

Propagating linkage error into uncertainty for subsequentmodeling.String-distance metrics in non-English languages.

Arabic names and places in HRDAG Syria killings data.Record linkage on the fly:

For assigning individual online census forms to correctaddress and households.Deal with duplicate separately or in real time.

24 / 26

Page 25: Record Linkage As a Statistical Procedure

Some Challenges II

RL using objects in lieu of data elements, e.g., pictures.Drivers licenses and other ID have photos.Facebook images:

PittPat for matching faces, used in Acquisti privacy study.

Using advances in RL to support new approaches toprivacy protection as alternative to current methods, e.g.,differential privacy.

25 / 26

Page 26: Record Linkage As a Statistical Procedure

Some Related CMU Research at Poster Session

Mauricio Sadinle: “Duplicate Record Detection Using aBayesian Partitioning Model”Rebecca Steorts: “Will the Real Steve Fienberg PleaseStand Up? Estimating a Population from Incomplete DataFiles”Sam Ventura: “Identifying Duplicated Text Records inLarge-Scale Databases with Classifier Aggregation andBlocked Hierarchical Clustering”

26 / 26