Record Linkage - Duke University · 2019-02-01


Record Linkage

Everything Data · CompSci 216, Spring 2019

Announcements Thu, Jan 31

• HW03 will be posted later tonight.

• We will begin by discussing HW solutions

• If you have “multiple” submissions, the latest submission will be graded.

2

Recap: Querying Relational Databases in SQL

SELECT columns or expressions
FROM tables
WHERE conditions
GROUP BY columns
ORDER BY output columns;

3

1. Generate all combinations of rows, one from each table; each combination forms a "wide row"

2. Filter: keep only "wide rows" satisfying the WHERE conditions

3. Group: "wide rows" with matching values for the GROUP BY columns go into the same group

4. Compute one output row for each "wide row" (or for each group of them, if the query has grouping/aggregation)

5. Sort the output rows
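The evaluation steps above can be sketched in plain Python over in-memory tables. This is a toy illustration with made-up tables and column names, not the course's actual data:

```python
from itertools import product
from collections import defaultdict

# Toy tables as lists of dicts (made-up data, not the lecture's tables).
senators = [{"name": "Alice", "sen_state": "NY"},
            {"name": "Bob",   "sen_state": "CA"}]
donations = [{"don_state": "NY", "amount": 10},
             {"don_state": "NY", "amount": 5},
             {"don_state": "CA", "amount": 7}]

# 1. Generate all combinations of rows, one from each table ("wide rows").
wide = [{**s, **d} for s, d in product(senators, donations)]

# 2. Filter: keep only wide rows satisfying the WHERE condition (a join).
wide = [r for r in wide if r["sen_state"] == r["don_state"]]

# 3. Group: wide rows with matching GROUP BY values go into the same group.
groups = defaultdict(list)
for r in wide:
    groups[r["name"]].append(r)

# 4. Compute one output row per group (SELECT name, SUM(amount)).
out = [{"name": name, "total": sum(r["amount"] for r in rows)}
       for name, rows in groups.items()]

# 5. Sort the output rows (ORDER BY total DESC).
out.sort(key=lambda r: r["total"], reverse=True)
print(out)  # [{'name': 'Alice', 'total': 15}, {'name': 'Bob', 'total': 7}]
```

A real database engine does not literally materialize the cross product, but the result is defined as if it followed these steps.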

Problem

• Forbes magazine article: “Wall Street’s favorite senators”

4

Problem

• Forbes magazine article: “Wall Street’s favorite senators”

• What are their ages?

5

Solution

• Join with the persons table (from govtrack)

• But there is no key to join on …

6

Record Linkage

• Problem of finding duplicate entities across different sources (or even within a single dataset).

7

Ironically, Record Linkage has many names

Doubles
Duplicate detection
Entity resolution
Deduplication
Object identification
Object consolidation
Coreference resolution
Entity clustering
Reference reconciliation
Reference matching
Householding
Household matching
Fuzzy match
Approximate match
Merge/purge
Hardening soft databases
Identity uncertainty

8

Motivating Example 1: Web

9

Motivating Example 2: Network Science

• Measuring the topology of the internet using traceroute

• IP Aliasing Problem [Willinger et al. 2009]

And many many more examples

• Linking census records
• Public health
• Medical records
• Web search: query disambiguation
• Comparison shopping
• Maintaining customer databases
• Law enforcement and counter-terrorism
• Scientific data
• Genealogical data
• Bibliographic data

14

Opportunity: http://lod-cloud.net/

Back to our example

• Join with the persons table (from govtrack)

• But there is no key to join on …

• What about (firstname, lastname)?

16

Attempt 1:

SELECT w.*, date_part('year', current_date) - date_part('year', p.birthday) AS age
FROM wallst w, persons p
WHERE w.first_name = p.first_name
  AND w.last_name = p.last_name;

17

Problems

• Join condition is too specific
  – Nicknames used instead of real first names

18

Attempt 2:

• Join on last name + age < 100 (senator must be alive)

SELECT w.*, date_part('year', current_date) - date_part('year', p.birthday) AS age
FROM wallst w, persons p
WHERE w.lastname = p.last_name
  AND date_part('year', current_date) - date_part('year', p.birthday) < 100;

19

Problem:

• Join condition is too inclusive
  – Many individuals share the same last name.

20

Surname    Approx #   Rank
Smith      2.4 M      1
Johnson    1.8 M      2
Williams   1.5 M      3
Brown      1.4 M      4
Jones      1.4 M      5

http://www.census.gov/genealogy/www/data/2000surnames/

“Where is Joe Liebermen?”

• Spelling mistake– Liebermen vs Lieberman

• Need an approximate matching condition!

21

Levenshtein (or edit) distance

• The minimum number of character edit operations needed to turn one string into the other.

LIEBERMAN
LIEBERMEN

– Substitute A with E. Edit distance = 1

22

Levenshtein (or edit) distance

• The distance between two strings s and t is the cost of the shortest sequence of edit commands that transforms s into t.

• Commands:
  – Copy a character from s to t (cost = 0)
  – Delete a character from s (cost = 1)
  – Insert a character into t (cost = 1)
  – Substitute one character for another (cost = 1)

23

Costs can be different

Levenshtein (or edit) distance

Ashwin Machanavajjhala
Aswhin Maachanavajhala

24

Levenshtein (or edit) distance

String s: Ashwin Ma-chanavajjhala
String t: Aswhin Maachanavaj-hala

("-" marks a gap: 2 substitutions (h↔w, w↔h), 1 insertion, 1 deletion)

Total cost: 4

25

Computing the edit distance

      A  S  W  A  N
   0  1
A  1
S
W
H
I
N

26

Cell in row ∅, column A: cost of changing "" to "A"
Cell in row H, column S: cost of changing "ASWH" to "AS"

Computing the edit distance

      A  S  W  A  N
   0  1  2
A  1  0  1
S  2  1  0
W  3  2
H
I
N

27

Cost of changing "ASW" to "AS" is the minimum of:
- cost of "AS" to "AS" + 1 (delete W)
- cost of "ASW" to "A" + 1 (insert S)
- cost of "AS" to "A" + 1 (substitute W with S)

Computing the edit distance

      A  S  W  A  N
   0  1  2  3  4  5
A  1  0  1  2  3  4
S  2  1  0  1  2  3
W  3  2  1  0  1  2
H  4  3  2  1  1  2
I  5  4  3  2  2  2
N  6  5  4  3  3  ?

28

Computing the edit distance

      A  S  W  A  N
   0  1  2  3  4  5
A  1  0  1  2  3  4
S  2  1  0  1  2  3
W  3  2  1  0  1  2
H  4  3  2  1  1  2
I  5  4  3  2  2  2
N  6  5  4  3  3  2

29

Remember the minimum in each step and retrace your path.
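The table-filling procedure above is a standard dynamic program; a possible Python implementation (a sketch of the recurrence, not the PostgreSQL `levenshtein` function used later in the slides):

```python
def edit_distance(s, t):
    """Levenshtein distance via dynamic programming.
    d[i][j] = cost of changing s[:i] into t[:j]."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all i characters of s[:i]
    for j in range(n + 1):
        d[0][j] = j          # insert all j characters of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1    # copy costs 0
            d[i][j] = min(d[i - 1][j] + 1,            # delete from s
                          d[i][j - 1] + 1,            # insert into t
                          d[i - 1][j - 1] + sub)      # copy / substitute
    # To recover the edits themselves, remember which of the three cases
    # won in each cell and retrace the path back from d[m][n].
    return d[m][n]

print(edit_distance("ASWHIN", "ASWAN"))         # 2, the bottom-right cell above
print(edit_distance("LIEBERMAN", "LIEBERMEN"))  # 1
```

Running it on the earlier name example, `edit_distance("Ashwin Machanavajjhala", "Aswhin Maachanavajhala")` gives 4, matching the alignment shown on the slide.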

Edit Distance Variants

• Needleman-Wunsch
  – Different costs for each operation

• Affine Gap distance
  – John Reed vs John Francis "Jack" Reed
  – Consecutive inserts cost less than the first insert.

30

Back to our example … Attempt 3

SELECT w.firstname, w.lastname, w.state, w.party, p.first_name, p.last_name,
       date_part('year', current_date) - date_part('year', p.birthday) AS age
FROM wallst w, persons p
WHERE levenshtein(w.lastname, p.last_name) <= 1
  AND date_part('year', current_date) - date_part('year', p.birthday) < 100;

31

Jaccard similarity

• Useful similarity function for sets (and for long strings).

• Let A and B be two sets
  – Words in two documents
  – Friends lists of two individuals

32

Jaccard(A, B) = |A ∩ B| / |A ∪ B|

Jaccard similarity for names

• Use character trigrams, padding both ends of the string with a boundary symbol "#":

LIEBERMAN = {##L, #LI, LIE, IEB, EBE, BER, ERM, RMA, MAN, AN#, N##}

LIEBERMEN = {##L, #LI, LIE, IEB, EBE, BER, ERM, RME, MEN, EN#, N##}

Jaccard(s, t) = 8/14

33
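A possible implementation of trigram Jaccard in Python, using "#" for the boundary padding shown in the sets above:

```python
def trigrams(name, pad="#"):
    """Character trigrams of name, padded with two boundary symbols on
    each side so the end characters also appear in three trigrams each."""
    s = pad * 2 + name + pad * 2
    return {s[i:i + 3] for i in range(len(s) - 2)}

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| over trigram sets."""
    A, B = trigrams(a), trigrams(b)
    return len(A & B) / len(A | B)

print(sorted(trigrams("LIEBERMAN")))      # the 11 trigrams listed above
print(jaccard("LIEBERMAN", "LIEBERMEN"))  # 8/14 ≈ 0.571
```

Note that PostgreSQL's `similarity()` (used in Attempt 4) is also trigram-based, though its exact padding convention differs slightly from this sketch.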

Attempt 4:

SELECT w.firstname, w.lastname, w.state, w.party, p.first_name, p.last_name,
       date_part('year', current_date) - date_part('year', p.birthday) AS age
FROM wallst w, persons p
WHERE similarity(w.lastname, p.last_name) >= 0.5
  AND date_part('year', current_date) - date_part('year', p.birthday) < 100;

34

Translation / Substitution Tables

• Strings that are usually used interchangeably
  – New York vs Big Apple
  – Thomas vs Tom
  – Robert vs Bob

35
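A translation table can be applied as a normalization step before comparing. A minimal sketch; the table entries and function names below are illustrative, not a standard library:

```python
# Illustrative substitution table mapping aliases to a canonical form.
SUBSTITUTIONS = {
    "big apple": "new york",
    "tom": "thomas",
    "bob": "robert",
}

def normalize(name):
    """Map a string to its canonical form before comparison."""
    key = name.strip().lower()
    return SUBSTITUTIONS.get(key, key)

def same_name(a, b):
    return normalize(a) == normalize(b)

print(same_name("Bob", "Robert"))      # True
print(same_name("Tom", "Thomas"))      # True
print(same_name("Robert", "Richard"))  # False
```

In SQL the same idea appears as the `nicknames` lookup table joined in Attempt 5 below.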

Attempt 5

SELECT w.firstname, w.lastname, w.state, p.first_name, p.last_name,
       date_part('year', current_date) - date_part('year', p.birthday) AS age
FROM wallst w, persons p
WHERE levenshtein(w.lastname, p.last_name) <= 1
  AND date_part('year', current_date) - date_part('year', p.birthday) < 100
  AND (w.firstname = p.first_name
       OR w.firstname IN (SELECT n.nickname FROM nicknames n
                          WHERE n.firstname = p.first_name));

36

Almost there …

• Tim matches both Timothy and Tim
  – Can fix it by matching on STATE
  – Try it on your own!

37

Summary of Similarity Methods

• Equality on a boolean predicate (easiest and most efficient)
• Edit distance
  – Levenshtein, Affine
• Set similarity
  – Jaccard
• Vector based
  – Cosine similarity, TF-IDF
• Translation-based
• Numeric distance between values
• Phonetic similarity
  – Soundex, Metaphone
• Other
  – Jaro-Winkler, Soft TF-IDF, Monge-Elkan

38

Summary of Similarity Methods

• Equality on a boolean predicate: easiest and most efficient
• Edit distance (Levenshtein, Affine): handles typographical errors
• Set similarity (Jaccard): good for text (reviews/tweets), sets, class membership, …
• Vector based (cosine similarity, TF-IDF)
• Translation-based: useful for abbreviations, alternate names
• Numeric distance between values
• Phonetic similarity (Soundex, Metaphone): good for names
• Other: Jaro-Winkler, Soft TF-IDF, Monge-Elkan

39

Evaluating Record Linkage

• Hard to get all the matches to be exactly correct in real-world problems
  – As we saw in real examples

• Need to quantify how good the matching is.

40

Property Testing

• Consider a universe U of objects
  – Documents (in web search)
  – Pairs of records (in record linkage)

• Suppose you want to identify a subset M in U that satisfies a specific property
  – Relevance to a query (in web search)
  – Do the records match? (in record linkage)

41

Property Testing

• Consider a universe U of objects

• Suppose you want to identify a subset M in U that satisfies a specific property

• Let A be an (imperfect) algorithm that guesses whether or not an element in U satisfies the property
  – Let MA be the subset of objects that A identifies as satisfying the property.

42

Property Testing

                                Real world
                       Satisfies P (M)        Doesn't satisfy P (U − M)
Algorithm guess: MA    True positives (TP)    False positives (FP): "Crying wolf!"
Algorithm guess: U−MA  False negatives (FN)   True negatives (TN)

43

Venn diagram view

In the Venn diagram of M and MA inside U:
M ∩ MA = true positives (TP); MA − M = false positives (FP);
M − MA = false negatives (FN); everything outside M ∪ MA = true negatives (TN).

44

Error: Precision / Recall

Precision: fraction of answers returned by A that are correct
Recall: fraction of correct answers that are returned by A

Precision = TP / (TP + FP) = |M ∩ MA| / |MA|
Recall = TP / (TP + FN) = |M ∩ MA| / |M|

45

Error: F-measure

Precision = |M ∩ MA| / |MA|
Recall = |M ∩ MA| / |M|

F1 score = 2 · precision · recall / (precision + recall)

46

Example

• M:

47

Example:

Algorithm A:

SELECT *
FROM wallst w, persons p
WHERE w.lastname = p.last_name
  AND date_part('year', current_date) - date_part('year', p.birthday) < 100
  AND (w.firstname = p.first_name
       OR w.firstname IN (SELECT n.nickname FROM nicknames n
                          WHERE n.firstname = p.first_name));

48

– Exact match on last name
– Age < 100
– First name is the same or a nickname

Example

• MA:

49

Example

Precision = |M ∩ MA| / |MA| = 9/10 = 0.9

Recall = |M ∩ MA| / |M| = 9/10 = 0.9

F1 score = 2 · (0.9 × 0.9) / (0.9 + 0.9) = 0.9

50

Summary

• Many interesting data analyses require reasoning across different datasets

• May not have access to keys that uniquely identify individual rows in both datasets

51

Summary

• Use combinations of attributes that are approximate keys (or quasi-identifiers)

• Use similarity measures for fuzzy or approximate matching– Levenshtein or Edit distance– Jaccard Similarity

• Use translation tables

52

Summary

• Record Linkage is rarely perfect– Missing attributes– Messy data errors– …

• Precision/Recall is used to measure the quality of linkage.

53

The Ugly side of Record Linkage [Sweeney IJUFKS 2002]

Medical Data: Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge, Zip, Birthdate, Sex

54

The Ugly side of Record Linkage [Sweeney IJUFKS 2002]

Medical Data: Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge
Voter List: Name, Address, Date Registered, Party Affiliation, Date Last Voted
Shared attributes: Zip, Birthdate, Sex

• Governor of MA uniquely identified using Zip Code, Birth Date, and Sex.
• Name linked to Diagnosis.

55

The Ugly side of Record Linkage [Sweeney IJUFKS 2002]

Medical Data: Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge
Voter List: Name, Address, Date Registered, Party Affiliation, Date Last Voted
Shared attributes: Zip, Birthdate, Sex (a quasi-identifier)

56

• Governor of MA uniquely identified using Zip Code, Birth Date, and Sex.
• 87% of the US population is uniquely identified by this quasi-identifier.

(anonymous) browsing history → social network profiles

57

Figure 5: Screenshots of the online experiment.

Figure 6: De-anonymization accuracy for three candidate ranking methods (MLE, intersection size, Jaccard) on user-contributed web browsing histories. Accuracy for intersection size and Jaccard rankings are approximate, as ground-truth answers are typically only available for users who were ranked in the top 15 by the MLE.

Unfortunately, because of the experiment's design, we typically only know ground truth for the individuals who ranked in the top 15 by our approach, and it is possible in theory that the other two methods succeed precisely where the MLE fails. To assess this possibility, we consider the 11 cases in which an individual did not appear in our list of top 15 candidates but disclosed their identity by signing into Twitter. In all of these 11 cases, both the intersection method and Jaccard failed to successfully identify the user. Thus, while based on a small sample, it seems reasonable to assume that if a participant is not ranked in the top 15 by the MLE method, then other de-anonymization methods would also have failed. Based on this assumption, Figure 6 compares the performance of all three de-anonymization methods on the full set of 374 users. As on the simulated data, we find that our method outperforms Jaccard similarity and intersection size, often by a substantial margin.

We can further use the MLE scores to estimate the confidence of our predictions. Given ordered candidate scores s1 ≥ s2 ≥ · · · ≥ sn for an anonymous browsing history H, the eccentricity [28] of H is (s1 − s2) / std-dev({si}). Figure 7

Figure 7: De-anonymization accuracy on the top-k histories ranked by eccentricity and history length.

shows prediction accuracy on the top-k instances ranked by eccentricity. The right-most point on the plot corresponds to accuracy on the full set of 374 histories (72%); if we limit to the 50% of histories with the highest eccentricity, accuracy increases to 96%. For comparison, the plot also shows accuracy as a function of history length, and indicates that eccentricity is the better predictor of accuracy.

7. THREAT MODELS

Our de-anonymization strategy assumes access to an individual's Twitter browsing history. Such data are available to a variety of organizations with commercial or strategic incentives to de-anonymize users. In this section, we describe two such possible attackers and evaluate the efficacy of our approach on data available to them.

Third-party trackers are entities embedded into some websites for the purpose of collecting individual user browsing habits. Trackers can determine whether a user arrived from Twitter to a site where they are embedded by examining the page's document.referrer property. We estimate the de-anonymization capabilities of four common third-party trackers: Google, Facebook, ComScore, and AppNexus. For each user-contributed history, and for each organization, we first determine which URLs in the history they are likely able to track by checking if the organization has a tracker


https://5harad.com/papers/twivacy.pdf