Record linkage methods applied to population data deduplication


Poster presented at the III Congress - Summer School of Young Researchers in Design of Experiments and Biostatistics. Pamplona, 21-22 July 2014.


Universidad de La Laguna

Pérez-González, Carlos J. (1), González Yanes, Jesús A. (2) and Martín Morales, Noelia (2)

(1) Statistics and Operations Research Dept., University of La Laguna, Spain; Member of BIOSTATNET network (email: [email protected])

(2) Canary Islands Statistics Institute (ISTAC), Spain.

Introduction

Figure 1: Population census

Record linkage is a method that enables us to compare records from several files or databases (population censuses, administrative records, etc.) and decide whether they are a match (i.e. they represent the same entity) or must be considered different.

When the process is applied to a single data set, the problem is known as deduplication. An interesting application of record linkage is the study of the database of population records in the Canary Islands across successive periods. In this case, the records usually contain duplicated data due to mistakes made while registering the inhabitants of the municipalities. Linkage methods are applied to remove these duplicates from the census records.

Similarity scores: inexact matching

To compare a pair of records, A and B, it is necessary to define a similarity score. The score is obtained as follows. First, an individual similarity measure, Simil(·, ·), is computed for each pair of corresponding variables from the two records; then the scores for each field are added up to get a total score for the pair of records:

$$
\begin{aligned}
\mathrm{Score}(A, B) ={} & \mathrm{Simil}_{sex}(Sex_A, Sex_B) + \dots + \mathrm{Simil}_{name}(Name_A, Name_B) \\
& + \mathrm{Simil}_{surn1}(Surname1_A, Surname1_B) + \mathrm{Simil}_{surn2}(Surname2_A, Surname2_B).
\end{aligned}
$$

The similarity of each variable can be defined as an exact match, but the linkage method can be extended by allowing inexact matching. In particular, to compare two string variables (name, surnames, ...) we can use edit-distance metrics, which count the number of operations needed to transform one string into the other.
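The additive score above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`BIRTH_YEAR`, `SURNAME1`, ...) are hypothetical keys modelled on the census variables, and `difflib.SequenceMatcher` from the standard library is used as a stand-in inexact measure, not necessarily the one used in this work.

```python
from difflib import SequenceMatcher


def exact(a, b):
    """Exact matching: 1.0 if the two values are identical, otherwise 0.0."""
    return 1.0 if a == b else 0.0


def inexact(a, b):
    """Inexact string matching: a similarity in [0, 1] (stand-in measure)."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical mapping from field name to similarity measure.
FIELD_SIMIL = {
    "SEX": exact,
    "BIRTH_YEAR": exact,
    "BIRTH_MONTH": exact,
    "BIRTH_DAY": exact,
    "NAME": inexact,
    "SURNAME1": inexact,
    "SURNAME2": inexact,
}


def score(rec_a, rec_b, field_simil=FIELD_SIMIL):
    """Total score of a pair: the sum of the per-field similarities."""
    return sum(simil(rec_a[f], rec_b[f]) for f, simil in field_simil.items())


# Two records that differ only in the spelling of the name.
rec_a = {"SEX": 6, "BIRTH_YEAR": 1989, "BIRTH_MONTH": 7, "BIRTH_DAY": 5,
         "NAME": "VANESA", "SURNAME1": "PEREZ", "SURNAME2": "RAMOS"}
rec_b = dict(rec_a, NAME="VANESSA")

print(score(rec_a, rec_a))  # identical records reach the maximum, 7.0
print(score(rec_a, rec_b))  # slightly below the maximum
```

Because every measure returns a value between 0 and 1, the maximum score equals the number of compared fields, which makes thresholds easy to interpret.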

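Figure 2: Similarity scores of strings using an edit metric (Levenshtein distance). The figure compares the strings JENNIFER, JENIFER and YENI, with distances 1 (JENNIFER, JENIFER), 4 (JENIFER, YENI) and 5 (JENNIFER, YENI).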
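As a minimal sketch of the edit metric named in Figure 2, the following Python function computes the Levenshtein distance with the standard dynamic-programming recurrence. The function name is an assumption made for illustration; the poster does not show the implementation that was actually used.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string s into string t."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(levenshtein("JENNIFER", "JENIFER"))  # 1
print(levenshtein("JENIFER", "YENI"))      # 4
print(levenshtein("JENNIFER", "YENI"))     # 5
```

The distance is a count of operations, so a similarity score can be derived from it, for instance by dividing by the length of the longer string and subtracting from 1; that normalization is one common choice, not something stated in the poster.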

Comparing pairs of records

The deduplication process requires that we compare every record in the data set with every other record in the same data set (this is called a "Cartesian join"). Because it is not necessary to compare B to A once we have already compared A to B, the number of comparisons is N(N − 1)/2, where N is the total number of records. For example, N = 1,000,000 records give about 5 × 10^11 comparisons. When N is large (as is typical in census data sets), the computation is extremely intensive in terms of memory and time.

(Diagram: Big Data, relating data volume, CPU time and computer memory.)

We can optimize the performance of the comparison process using several techniques:

• Early thresholding: We rule out pairs of records whose score is lower than a preset threshold, which saves disk space by dropping pairs that would probably be considered non-matches.

• Indexing: Database indexes (Oracle, MySQL, PostgreSQL) or hashing methods (SAS) provide a fast way to find observations.

• Blocking: In this case, the only records compared are those that match on certain variables. For example, we compare each record only with those that have the same values on given variables (e.g. the variables concerning birth date).
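The effect of blocking and early thresholding on the number of comparisons can be sketched as follows. The record layout, the field names and the toy name-equality score are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def n_pairs(n):
    """Number of comparisons in a full Cartesian join: N(N - 1)/2."""
    return n * (n - 1) // 2


def candidate_pairs(records, key_fields):
    """Blocking: yield only pairs of records that agree on all key fields."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[tuple(rec[f] for f in key_fields)].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)


def above_threshold(pairs, score, threshold):
    """Early thresholding: drop pairs whose score is below the threshold."""
    for a, b in pairs:
        s = score(a, b)
        if s >= threshold:
            yield a, b, s


# Illustrative records (hypothetical field names).
records = [
    {"OBS_NUM": 2, "BIRTH_YEAR": 2000, "BIRTH_MONTH": 5, "BIRTH_DAY": 3, "NAME": "CLAUDIA"},
    {"OBS_NUM": 3, "BIRTH_YEAR": 1995, "BIRTH_MONTH": 11, "BIRTH_DAY": 4, "NAME": "ARIADNA"},
    {"OBS_NUM": 4, "BIRTH_YEAR": 2000, "BIRTH_MONTH": 5, "BIRTH_DAY": 3, "NAME": "CLAUDIA"},
    {"OBS_NUM": 5, "BIRTH_YEAR": 1989, "BIRTH_MONTH": 7, "BIRTH_DAY": 5, "NAME": "VANESA"},
    {"OBS_NUM": 8, "BIRTH_YEAR": 1989, "BIRTH_MONTH": 7, "BIRTH_DAY": 5, "NAME": "VANESSA"},
]
birth_date = ["BIRTH_YEAR", "BIRTH_MONTH", "BIRTH_DAY"]

blocked = list(candidate_pairs(records, birth_date))
print(n_pairs(len(records)))  # 10 comparisons without blocking
print(len(blocked))           # 2 comparisons with blocking on birth date


def same_name(a, b):
    """Toy exact-match score on the name field."""
    return 1.0 if a["NAME"] == b["NAME"] else 0.0


kept = list(above_threshold(blocked, same_name, threshold=1.0))
print(len(kept))              # 1 pair survives the threshold
```

The toy exact-match score discards the VANESA/VANESSA pair, which is the kind of case where an inexact string measure would be preferable.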

Funding

This research was supported by the "Transnational Cooperation Programme Madeira-Azores-Canarias (MAC) 2007-2013", co-funded by the European Regional Development Fund (ERDF).

Census data of population registers

The following table shows an example of part of the census information available for the Canary Islands population, with some duplicate records marked in color.

Table 1: Census variables of population data records

| OBS NUM | IDENT | SEX | BIRTH PLACE | BIRTH YEAR | BIRTH MONTH | BIRTH DAY | NAME | SURNAME1 | SURNAME2 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 43640367 | 1 | 35013 | 1959 | 06 | 04 | JOSE DE LOS ANGELES | LUJAN | MARTIN |
| 2 | 00000000 | 6 | 35016 | 2000 | 05 | 03 | CLAUDIA | TABOADA | TAMBE |
| 3 | 00000000 | 6 | 35026 | 1995 | 11 | 04 | ARIADNA | RAMIREZ | QUINTERO |
| 4 | 00000000 | 6 | 35004 | 2000 | 05 | 03 | CLAUDIA | TABOADA | LAMBE |
| 4 | 54163991 | 1 | 35026 | 2004 | 09 | 22 | ADRIEL JOSE | SANTANA | SANCHEZ |
| 5 | 45776089 | 6 | 35016 | 1989 | 07 | 05 | VANESA | PEREZ | RAMOS |
| 6 | 44316985 | 6 | 35016 | 1978 | 07 | 13 | NAYRA | RODRIGUEZ | ALEMAN |
| 7 | 45368106 | 1 | 35016 | 2001 | 07 | 29 | VICTOR | BASTON | QUINTANA |
| 8 | 00000000 | 6 | 35016 | 1989 | 07 | 05 | VANESSA | PEREZ | RAMOS |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

The example shows the variables that are used to perform the deduplication analysis. The variable IDENT refers to the identification number of individuals and is usually unknown because this information was not available when the individual data were originally collected. For this reason, in practice the duplicates are determined using the rest of the information.

Parallelizing the comparison

The Hadoop Distributed File System (HDFS) is used to build a scalable, parallel storage system for the massive census data. We then apply a programming model for distributed computation in cluster environments to solve the deduplication problem within a reasonable time.

Figure 3: Deduplication problem

We implement a map-reduce process in which the mapper determines the blocking key of each record and outputs the elements {block key, record}, whereas the reducer compares the records that belong to the same block.

Figure 4: Map and reduce scheme (the data are stored as DFS blocks on the compute cluster and processed by Map tasks and a Reduce task to produce the results)

Combining this method with the performance optimization techniques (blocking, thresholding, ...) makes it possible to reduce the search space in which the records are compared with each other.
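The mapper and reducer roles described in this section can be sketched in plain Python. This is a single-process stand-in that only mimics the grouping step a cluster would perform; it is not Hadoop code, and the choice of birth date as blocking key, the field names and the `run_local` helper are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def mapper(record):
    """Emit (block key, record). The key is the birth date (an assumption)."""
    key = (record["BIRTH_YEAR"], record["BIRTH_MONTH"], record["BIRTH_DAY"])
    yield key, record


def reducer(block_key, records, score, threshold):
    """Compare all records sharing a block key and emit the pairs whose
    score reaches the threshold."""
    for a, b in combinations(records, 2):
        s = score(a, b)
        if s >= threshold:
            yield block_key, a["OBS_NUM"], b["OBS_NUM"], s


def run_local(records, score, threshold):
    """Group mapper output by key, then apply the reducer to each group."""
    groups = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):
            groups[key].append(value)
    results = []
    for key, recs in groups.items():
        results.extend(reducer(key, recs, score, threshold))
    return results
```

In a real cluster the grouping by key is done by the framework between the map and reduce phases, so each reducer only ever sees the records of its own blocks.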

References

[1] Fellegi, I.; Sunter, A. (1969). A Theory for Record Linkage. Journal of the American Statistical Association, 64 (328), pp. 1183-1210.

[2] Wright, G.; Hulett, D. (2010). Transitive Record Linkage in SAS using Hash Objects. Proceedings of Western Users of SAS Software 2010 (http://www.wuss.org/proceedings10).

[3] Wright, G. (2011). Probabilistic Record Linkage in SAS. Proceedings of Western Users of SAS Software 2011 (http://www.wuss.org/proceedings11).

[4] Winkler, W.E. (1990). String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage. Proceedings of the Section on Survey Research Methods (American Statistical Association), pp. 354-359.

Probabilistic record linkage

The core of a record linkage process is the choice of a decision model that defines the rules used to determine whether a pair of records is classified as a match, a non-match or a possible match.

Figure 5: Record linkage decision model (search space reduction of A × B, followed by a decision model that classifies pairs as matches, possible matches or non-matches)

Fellegi and Sunter (1969) [1] formalized a probabilistic model for record linkage. Consider the Cartesian join space, $A \times B$, of two files to be matched (in the deduplication process, $A = B$). The space is theoretically partitioned into two unknown sets: the set of true matches

$$M = \{(a, b) : a = b,\ a \in A,\ b \in B\},$$

and the set of non-matches

$$U = \{(a, b) : a \neq b,\ a \in A,\ b \in B\}.$$

Then, under this model, we assign a probabilistic (similarity) score to each pair of records, $r = (a, b)$, as a monotonically increasing function of the ratio

$$R = \frac{P(\gamma \in \Gamma \mid (a, b) \in M)}{P(\gamma \in \Gamma \mid (a, b) \in U)},$$

where $\gamma$ is an arbitrary agreement pattern in a comparison space $\Gamma$. For example, an agreement pattern could be defined as a match in the birth date fields (year, month and day) together with a value lower than 0.75 in the edit-distance similarity scores for the name and both surname fields.
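To make the ratio R concrete, here is a small sketch that evaluates it for a binary agreement pattern. It relies on two assumptions that are not stated in the poster: fields agree or disagree independently given the match status, and the probabilities `M_PROB` and `U_PROB` are invented values chosen only for illustration. The two thresholds in `classify` are likewise hypothetical.

```python
import math

# Invented probabilities, for illustration only.
# M_PROB[f] = P(field f agrees | pair is a true match)
# U_PROB[f] = P(field f agrees | pair is a non-match)
M_PROB = {"birth_date": 0.95, "name": 0.90, "surname1": 0.92, "surname2": 0.90}
U_PROB = {"birth_date": 0.003, "name": 0.02, "surname1": 0.01, "surname2": 0.01}


def likelihood_ratio(pattern, m=M_PROB, u=U_PROB):
    """R = P(pattern | M) / P(pattern | U), assuming independent fields."""
    ratio = 1.0
    for field, agrees in pattern.items():
        if agrees:
            ratio *= m[field] / u[field]
        else:
            ratio *= (1 - m[field]) / (1 - u[field])
    return ratio


def match_weight(pattern, m=M_PROB, u=U_PROB):
    """log2(R): a monotonically increasing function of R."""
    return math.log2(likelihood_ratio(pattern, m, u))


def classify(weight, lower, upper):
    """Three-way decision using two hypothetical thresholds."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"


all_agree = {"birth_date": True, "name": True, "surname1": True, "surname2": True}
name_differs = dict(all_agree, name=False)
all_differ = {field: False for field in all_agree}

for pattern in (all_agree, name_differs, all_differ):
    w = match_weight(pattern)
    print(round(w, 2), classify(w, lower=0.0, upper=20.0))
```

Taking the logarithm turns the product over fields into a sum of per-field weights, which parallels the additive score defined earlier for the deterministic approach.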

Results and conclusions

Figure 6: Distribution of matches and non-matches

The results obtained in this work illustrate that search space reduction significantly improves the performance of deterministic models (based on exact and inexact matching rules) and probabilistic models (following the Fellegi and Sunter approach). The results shown in Figure 6 and Table 2 allow us to evaluate how well the comparison process finds record matches in the population census data. Although the ability of the method to detect duplicates depends on the kind of variables and their accuracy, the findings obtained are reasonably satisfactory.

Table 2: Record matches with high score

| ID PAIR | SCORE | OBS NUM | IDENT | SEX | B PLACE | B YEAR | B MONTH | B DAY | NAME | SURNAME1 | SURNAME2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 428 | 3.60 | 10693 | 00000000 | 1 | 35016 | 2003 | 5 | 26 | KEVIN | MACKENZIE | PEÑATE |
| 428 | 3.60 | 10546 | 00000000 | 1 | 35016 | 2003 | 5 | 26 | KEVIN | MAC KENZIE | PEÑATE |
| 435 | 2.33 | 10758 | 00000000 | 1 | 35016 | 2004 | 8 | 3 | OSCAR | PENTON | MACHADO |
| 435 | 2.33 | 66244 | 00000000 | 1 | 35016 | 2004 | 8 | 3 | OSLEY OSCAR | PETON | MACHADO |
| 483 | 2.50 | 12121 | 00000000 | 6 | 35016 | 1994 | 1 | 25 | BELEN | LARDUETE | GARCIA |
| 483 | 2.50 | 458026 | 44720754 | 6 | 35016 | 1994 | 1 | 25 | CARIDAD BELEN | LARDUET | GARCIA |
| 384 | 2.00 | 881191 | 49407581 | 6 | 35019 | 2002 | 8 | 28 | DUNIA | ABASI | AZZAOUI |
| 384 | 2.00 | 9550 | 00000000 | 6 | 35019 | 2002 | 8 | 28 | DUNIA | AL ABASI | AL ABASI |
| 389 | 2.20 | 9773 | 00000000 | 1 | 35004 | 2001 | 7 | 29 | OSCAR | FALLE | FINO |
| 389 | 2.20 | 10234 | 00000000 | 1 | 35004 | 2001 | 7 | 29 | OSCAR | FALCE | PINO |
| 43 | 2.76 | 994 | 00000000 | 6 | 35016 | 1999 | 7 | 19 | MIRIAM | ASENSIO | CHIMIELEWSKY |
| 43 | 2.76 | 321475 | 78824303 | 6 | 35016 | 1999 | 7 | 19 | MIRIAM | ASENCIO | CHMIELEWSKI |