Record linkage methods applied to population data deduplication
Universidad de La Laguna
Perez-Gonzalez, Carlos J. (1), Gonzalez Yanes, Jesus A. (2) and Martín Morales, Noelia (2)
(1) Statistics and Operations Research Dept., University of La Laguna, Spain; Member of BIOSTATNET network (email: [email protected])
(2) Statistical Institute of Canary (ISTAC), Spain.
Introduction
Figure 1: Population census
Record linkage is a method for comparing records from several files or databases (population censuses, administrative records, etc.) and deciding whether they match (i.e. represent the same entity) or must be considered different.
When the process is applied to a single data set, the problem is known as deduplication. An interesting application of record linkage is the study of the database of population records in the Canary Islands across successive periods. In this case, the records often contain duplicated data due to mistakes made when registering the inhabitants of the municipalities. The linkage methods are applied to remove these duplicates from the census records.
Similarity scores: inexact matching
To compare a pair of records, A and B, we need to define a similarity score. It is obtained as follows. First, an individual similarity measure, Simil(·, ·), is computed for each pair of corresponding variables from the two records; then, the scores for each field are added up to get a total score for the pair of records:
$$\mathrm{Score}(A,B) = \mathrm{Simil}_{\mathrm{sex}}(\mathrm{Sex}_A, \mathrm{Sex}_B) + \dots + \mathrm{Simil}_{\mathrm{name}}(\mathrm{Name}_A, \mathrm{Name}_B) + \mathrm{Simil}_{\mathrm{surn1}}(\mathrm{Surname1}_A, \mathrm{Surname1}_B) + \mathrm{Simil}_{\mathrm{surn2}}(\mathrm{Surname2}_A, \mathrm{Surname2}_B).$$
The similarity of a variable can be defined as an exact match, but the linkage method can be extended by allowing inexact matching. In particular, to compare two string variables (name, surnames, ...) we can use edit-distance metrics, which count the number of operations needed to transform one string into the other.
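Figure 2: Similarity scores of strings using an edit metric (Levenshtein distance). The figure compares the strings JENNIFER, JENIFER and YENI, with distances 1 (JENNIFER to JENIFER), 4 (JENIFER to YENI) and 5 (JENNIFER to YENI).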
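
As an illustration of the edit-distance idea, the following minimal Python sketch computes the Levenshtein distance with the standard dynamic-programming recurrence. The function names and the normalisation used in `string_similarity` are assumptions made for this example; the poster does not say how distances are converted into similarity scores.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn s into t."""
    # prev[j] is the distance between the first i-1 characters of s
    # and the first j characters of t.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def string_similarity(s: str, t: str) -> float:
    """Hypothetical normalisation (not from the poster): 1.0 for identical
    strings, 0.0 when the distance equals the longer string's length."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(s, t) / longest


if __name__ == "__main__":
    for a, b in [("JENNIFER", "JENIFER"), ("JENIFER", "YENI"), ("JENNIFER", "YENI")]:
        print(a, b, levenshtein(a, b), round(string_similarity(a, b), 3))
```

Applied to the three strings of Figure 2, the function gives the distances 1, 4 and 5.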
Comparing pairs of records
The deduplication process requires comparing every record in the data set with every other record in the same data set (this is called a "cartesian join"). Because it is not necessary to compare B with A once we have already compared A with B, the number of comparisons is N(N − 1)/2, where N is the total number of records. When N is large (as is typical of censal data sets), the computation is extremely demanding in terms of memory and time.
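
To give a feel for how quickly N(N − 1)/2 grows, here is a small Python sketch. The record labels and the data set size of one million are hypothetical values chosen for illustration, not figures from the poster.

```python
from itertools import combinations


def pair_count(n: int) -> int:
    """Number of unordered pairs among n records: N(N - 1)/2."""
    return n * (n - 1) // 2


# Enumerating unordered pairs of a tiny, hypothetical data set:
# combinations() yields (A, B) but never the mirrored (B, A).
records = ["A", "B", "C", "D"]
pairs = list(combinations(records, 2))

if __name__ == "__main__":
    print(len(pairs), pair_count(len(records)))
    # A hypothetical data set of one million records.
    print(pair_count(1_000_000))
```

For four records this gives 6 pairs; for one million records the count is already close to 5 × 10^11, which is why the search space has to be reduced before any comparison takes place.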
[Diagram: data volume, CPU time and computer memory (Big Data)]
We can optimize the performance of the comparison process using several techniques:
• Early thresholding: We rule out pairs of records whose score is lower than a preset threshold. Dropping the pairs that would probably be considered non-matches reduces the disk space needed.
• Indexing: Database indexes (Oracle, MySQL, PostgreSQL) or hashing methods (SAS) provide a fast way to find observations.
• Blocking: Only records that agree on certain variables are compared. For example, we compare each record only with those that have the same values on given variables (e.g. the variables concerning birth date).
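
Blocking and early thresholding can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are a few rows adapted from Table 1 plus one invented row (observation 9), the blocking key is the full birth date, the threshold of 0.8 is arbitrary, and `difflib.SequenceMatcher` from the standard library stands in for the edit-distance similarity. The dictionary used for grouping plays the role of the hash-based lookup mentioned under indexing.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# (obs_num, birth_year, birth_month, birth_day, name)
# Rows adapted from Table 1; observation 9 is a hypothetical extra row.
RECORDS = [
    (2, 2000, 5, 3, "CLAUDIA"),
    (4, 2000, 5, 3, "CLAUDIA"),
    (5, 1989, 7, 5, "VANESA"),
    (8, 1989, 7, 5, "VANESSA"),
    (3, 1995, 11, 4, "ARIADNA"),
    (9, 1989, 7, 5, "PEDRO"),
]


def block_key(record):
    """Blocking key: records are compared only if they share a birth date."""
    _, year, month, day, _ = record
    return (year, month, day)


def candidate_pairs(records):
    """Yield pairs of records that fall into the same block."""
    blocks = defaultdict(list)  # hash-based grouping by blocking key
    for record in records:
        blocks[block_key(record)].append(record)
    for members in blocks.values():
        yield from combinations(members, 2)


def name_similarity(a, b):
    """Stand-in similarity in [0, 1]; not the poster's edit-distance score."""
    return SequenceMatcher(None, a, b).ratio()


def likely_duplicates(records, threshold=0.8):
    """Early thresholding: keep only the pairs scoring at least threshold."""
    kept = []
    for left, right in candidate_pairs(records):
        score = name_similarity(left[4], right[4])
        if score >= threshold:
            kept.append((left[0], right[0], round(score, 3)))
    return kept


if __name__ == "__main__":
    print(len(list(candidate_pairs(RECORDS))), "candidate pairs after blocking")
    print(likely_duplicates(RECORDS))
```

With these six records, blocking leaves 4 candidate pairs instead of the 15 of the full comparison, and thresholding keeps only the pairs (2, 4) and (5, 8).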
Funding
This research was supported by the "Transnational Cooperation Programme Madeira-Azores-Canarias (MAC) 2007-2013", co-funded by the European Regional Development Fund (ERDF).
Censal data of population registers
The following table represents an example of part of the censal information available for the Canary Islands population, with some duplicate records marked in color.
Table 1: Censal variables of population data records
| OBS NUM | IDENT | SEX | BIRTH PLACE | BIRTH YEAR | BIRTH MONTH | BIRTH DAY | NAME | SURNAME1 | SURNAME2 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 43640367 | 1 | 35013 | 1959 | 06 | 04 | JOSE DE LOS ANGELES | LUJAN | MARTIN |
| 2 | 00000000 | 6 | 35016 | 2000 | 05 | 03 | CLAUDIA | TABOADA | TAMBE |
| 3 | 00000000 | 6 | 35026 | 1995 | 11 | 04 | ARIADNA | RAMIREZ | QUINTERO |
| 4 | 00000000 | 6 | 35004 | 2000 | 05 | 03 | CLAUDIA | TABOADA | LAMBE |
| 4 | 54163991 | 1 | 35026 | 2004 | 09 | 22 | ADRIEL JOSE | SANTANA | SANCHEZ |
| 5 | 45776089 | 6 | 35016 | 1989 | 07 | 05 | VANESA | PEREZ | RAMOS |
| 6 | 44316985 | 6 | 35016 | 1978 | 07 | 13 | NAYRA | RODRIGUEZ | ALEMAN |
| 7 | 45368106 | 1 | 35016 | 2001 | 07 | 29 | VICTOR | BASTON | QUINTANA |
| 8 | 00000000 | 6 | 35016 | 1989 | 07 | 05 | VANESSA | PEREZ | RAMOS |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
The example shows the variables used to perform the deduplication analysis. The variable IDENT is the identification number of the individual; it is often unknown because this information was not available when the individual data were originally collected. For this reason, in practice the duplicates are determined using the rest of the information.
Parallelizing the comparison
The Hadoop Distributed File System (HDFS) is used to build a scalable, parallel storage system for the massive censal data. Then, we apply a programming model for distributed computation in cluster environments to solve the deduplication problem within a reasonable time.
Figure 3: Deduplication problem
We implement a map-reduce process in which the mapper determines blocking keys for the records and outputs the elements {block key, record}, whereas the reducer compares the records that belong to the same block.
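
The division of work between mapper and reducer can be illustrated with a small in-memory simulation in Python. This is not Hadoop code and not the authors' implementation: the `shuffle` function only imitates the grouping by key that a map-reduce framework performs between the two phases, the birth date is assumed as the blocking key, and an exact match on the first surname is a placeholder for the real comparison. The sample records are adapted from Table 1.

```python
from collections import defaultdict
from itertools import combinations

# Sample records adapted from Table 1 (only the fields used here).
RECORDS = [
    {"obs": 2, "birth": "2000-05-03", "surname1": "TABOADA"},
    {"obs": 4, "birth": "2000-05-03", "surname1": "TABOADA"},
    {"obs": 5, "birth": "1989-07-05", "surname1": "PEREZ"},
    {"obs": 7, "birth": "2001-07-29", "surname1": "BASTON"},
    {"obs": 8, "birth": "1989-07-05", "surname1": "PEREZ"},
]


def mapper(record):
    """Emit (block key, record); the birth date is the assumed blocking key."""
    yield (record["birth"], record)


def shuffle(mapped):
    """Group mapper output by key, imitating the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups


def reducer(key, records):
    """Compare every pair of records inside one block.
    The exact surname match is a placeholder comparison rule."""
    for left, right in combinations(records, 2):
        if left["surname1"] == right["surname1"]:
            yield (key, left["obs"], right["obs"])


def run(records):
    """Run the map, shuffle and reduce phases sequentially in one process."""
    mapped = (pair for record in records for pair in mapper(record))
    results = []
    for key, members in shuffle(mapped).items():
        results.extend(reducer(key, members))
    return results


if __name__ == "__main__":
    print(run(RECORDS))
```

Because each block is reduced independently of the others, the blocks can be handed to different nodes of a cluster, which is what makes the comparison parallelizable.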
Figure 4: Map and reduce scheme (the data are stored as DFS blocks on the compute cluster; map tasks process the blocks and a reduce task combines their outputs into the results)
Combining this method with the performance optimization techniques (blocking, thresholding, ...) makes it possible to reduce the search space in which the records are compared with each other.
References
[1] Fellegi, I.; Sunter, A. (1969). A Theory for Record Linkage. Journal of the American Statistical Association, 64 (328), pp. 1183-1210.
[2] Wright, G.; Hulett, D. (2010). Transitive Record Linkage in SAS using Hash Objects. Proceedings of Western Users of SAS Software 2010 (http://www.wuss.org/proceedings10).
[3] Wright, G. (2011). Probabilistic Record Linkage in SAS. Proceedings of Western Users of SAS Software 2011 (http://www.wuss.org/proceedings11).
[4] Winkler, W.E. (1990). String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage. Proceedings of the Section on Survey Research Methods (American Statistical Association), pp. 354-359.
Probabilistic record linkage
The core of a record linkage process is the choice of a decision model that defines the rules used to determine whether a pair of records is classified as a match, a non-match or a possible match.
Figure 5: Record linkage decision model (the reduced search space A×B is passed to the decision model, which classifies each pair as a match, a possible match or a non-match)
Fellegi and Sunter (1969) formalized a probabilistic model for record linkage. Consider the cartesian join space, A × B, of two files to be matched (in a deduplication process, A = B). The space is theoretically partitioned into two unknown sets: the set of true matches
$$M = \{(a, b) : a = b,\ a \in A,\ b \in B\},$$
and the set of non-matches
$$U = \{(a, b) : a \neq b,\ a \in A,\ b \in B\}.$$
Then, under this model, we assign a probabilistic (similarity) score to each pair of records, r = (a, b), as a monotonically increasing function of the following ratio
$$R = \frac{P(\gamma \in \Gamma \mid (a, b) \in M)}{P(\gamma \in \Gamma \mid (a, b) \in U)},$$
where γ is an arbitrary agreement pattern in a comparison space Γ. For example, an agreement pattern could be defined as agreement on the birth date fields (year, month and day) and a value lower than 0.75 in the edit-distance similarity scores for the name and both surname fields.
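
One common way to turn the ratio R into a score is to take its base-2 logarithm and, assuming that the fields agree or disagree independently given the match status, to sum one weight per field. The Python sketch below illustrates this. The conditional independence assumption, the m- and u-probabilities (probability of agreement among matches and among non-matches) and the two thresholds are all hypothetical choices for this example; none of them are taken from the poster.

```python
import math

# Hypothetical probabilities per field: (m, u) where
#   m = P(field agrees | pair is a match)
#   u = P(field agrees | pair is a non-match)
M_U = {
    "birth_date": (0.95, 0.01),
    "name": (0.90, 0.05),
    "surname1": (0.90, 0.02),
}


def match_weight(agreement):
    """log2 of the ratio R for an agreement pattern given as
    {field: True/False}, assuming conditionally independent fields."""
    total = 0.0
    for field, agrees in agreement.items():
        m, u = M_U[field]
        if agrees:
            total += math.log2(m / u)              # agreement raises the score
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return total


def classify(weight, lower=0.0, upper=8.0):
    """Three-way decision with hypothetical thresholds."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"


if __name__ == "__main__":
    patterns = [
        {"birth_date": True, "name": True, "surname1": True},
        {"birth_date": True, "name": True, "surname1": False},
        {"birth_date": False, "name": False, "surname1": False},
    ]
    for pattern in patterns:
        weight = match_weight(pattern)
        print(round(weight, 2), classify(weight))
```

With these illustrative values, full agreement is classified as a match, full disagreement as a non-match, and agreement on birth date and name only falls into the possible-match region that would be left for manual review.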
Results and conclusions
Figure 6: Distribution of matches and non-matches
The results obtained in this work illustrate that search space reduction significantly improves the performance of deterministic models (based on exact and inexact matching rules) and probabilistic models (following the Fellegi and Sunter approach). The results shown in Figure 6 and Table 2 allow us to evaluate how well the comparison process finds record matches in the censal population data. Although the ability of the method to detect duplicates depends on the kind of variables and their accuracy, the findings obtained can be considered satisfactory.
Table 2: Record matches with high score
| ID PAIR | SCORE | OBS NUM | IDENT | SEX | B PLACE | B YEAR | B MONTH | B DAY | NAME | SURNAME1 | SURNAME2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 428 | 3.60 | 10693 | 00000000 | 1 | 35016 | 2003 | 5 | 26 | KEVIN | MACKENZIE | PENATE |
| 428 | 3.60 | 10546 | 00000000 | 1 | 35016 | 2003 | 5 | 26 | KEVIN | MAC KENZIE | PENATE |
| 435 | 2.33 | 10758 | 00000000 | 1 | 35016 | 2004 | 8 | 3 | OSCAR | PENTON | MACHADO |
| 435 | 2.33 | 66244 | 00000000 | 1 | 35016 | 2004 | 8 | 3 | OSLEY OSCAR | PETON | MACHADO |
| 483 | 2.50 | 12121 | 00000000 | 6 | 35016 | 1994 | 1 | 25 | BELEN | LARDUETE | GARCIA |
| 483 | 2.50 | 458026 | 44720754 | 6 | 35016 | 1994 | 1 | 25 | CARIDAD BELEN | LARDUET | GARCIA |
| 384 | 2.00 | 881191 | 49407581 | 6 | 35019 | 2002 | 8 | 28 | DUNIA | ABASI | AZZAOUI |
| 384 | 2.00 | 9550 | 00000000 | 6 | 35019 | 2002 | 8 | 28 | DUNIA | AL ABASI | AL ABASI |
| 389 | 2.20 | 9773 | 00000000 | 1 | 35004 | 2001 | 7 | 29 | OSCAR | FALLE | FINO |
| 389 | 2.20 | 10234 | 00000000 | 1 | 35004 | 2001 | 7 | 29 | OSCAR | FALCE | PINO |
| 43 | 2.76 | 994 | 00000000 | 6 | 35016 | 1999 | 7 | 19 | MIRIAM | ASENSIO | CHIMIELEWSKY |
| 43 | 2.76 | 321475 | 78824303 | 6 | 35016 | 1999 | 7 | 19 | MIRIAM | ASENCIO | CHMIELEWSKI |