Record Linkage - Duke University · 2019. 2. 1. · Record Linkage • Problem of finding duplicate...

Record Linkage Everything Data CompSci 216 Spring 2019


Record Linkage

Everything Data

CompSci 216 Spring 2019

Announcements Thu, Jan 31

• HW03 will be posted later tonight.

• We will begin by discussing HW solutions

• If you have “multiple” submissions, the latest submission will be graded.


Recap: Querying Relational Databases in SQL

SELECT columns or expressions

FROM tables

WHERE conditions

GROUP BY columns

ORDER BY output columns;

1. Generate all combinations of rows, one from each table; each combination forms a “wide row”

2. Filter: keep only “wide rows” satisfying conditions

3. Group: “wide rows” with matching values for columns go into the same group

5. Compute one output row for each “wide row”

(or for each group of them if query has grouping/aggregation)


4. Sort the output rows


Problem

• Forbes magazine article: “Wall Street’s favorite senators”

• What are their ages?


Solution

• Join with the persons table (from govtrack)

• But there is no key to join on …


Record Linkage

• Problem of finding duplicate entities across different sources (or even within a single dataset).


Ironically, Record Linkage has many names

Doubles

Duplicate detection

Entity Resolution

Deduplication

Object identification

Object consolidation

Coreference resolution

Entity clustering

Reference reconciliation

Reference matching

Householding

Household matching

Fuzzy match

Approximate match

Merge/purge

Hardening soft databases

Identity uncertainty

Motivating Example 1: Web

Motivating Example 2: Network Science

• Measuring the topology of the internet … using traceroute

IP Aliasing Problem [Willinger et al. 2009]

And many many more examples

• Linking Census Records
• Public Health
• Medical records
• Web search – query disambiguation
• Comparison shopping
• Maintaining customer databases
• Law enforcement and Counter-terrorism
• Scientific data
• Genealogical data
• Bibliographic data

Opportunity

http://lod-cloud.net/

Back to our example

• Join with the persons table (from govtrack)

• But there is no key to join on …

• What about (firstname, lastname)?


Attempt 1:

SELECT w.*,
       date_part('year', current_date) - date_part('year', p.birthday) AS age
FROM wallst w, persons p
WHERE w.firstname = p.first_name
  AND w.lastname = p.last_name;

Problems

• Join condition is too specific
  – Nicknames used instead of real first names
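To make the "too specific" problem concrete, here is a minimal Python sketch of the exact-match join. The rows are made-up toy values (not the actual wallst or persons data); only the column names follow the queries in these slides.

```python
# Toy rows (hypothetical values, for illustration only).
wallst = [
    {"firstname": "Joe", "lastname": "Smith"},
    {"firstname": "Mary", "lastname": "Jones"},
]
persons = [
    {"first_name": "Joseph", "last_name": "Smith", "birth_year": 1950},
    {"first_name": "Mary", "last_name": "Jones", "birth_year": 1960},
]

def exact_join(left, right):
    """Pair rows whose first and last names are exactly equal."""
    return [
        (w, p)
        for w in left
        for p in right
        if w["firstname"] == p["first_name"] and w["lastname"] == p["last_name"]
    ]

matches = exact_join(wallst, persons)
matched_names = [(w["firstname"], w["lastname"]) for w, _ in matches]
print(matched_names)  # Joe Smith is missing because "Joe" != "Joseph"
```

The nickname row silently drops out of the result, which is why the later attempts relax the join condition.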

Attempt 2:

• Join on Last name + Age < 100 (senator must be alive)

SELECT w.*,
       date_part('year', current_date) - date_part('year', p.birthday) AS age
FROM wallst w, persons p
WHERE w.lastname = p.last_name
  AND date_part('year', current_date) - date_part('year', p.birthday) < 100;

Problem:

• Join condition is too inclusive
  – Many individuals share the same last name.

Surname    Approx. #   Rank
Smith      2.4 M       1
Johnson    1.8 M       2
Williams   1.5 M       3
Brown      1.4 M       4
Jones      1.4 M       5

http://www.census.gov/genealogy/www/data/2000surnames/

“Where is Joe Liebermen?”

• Spelling mistake
  – Liebermen vs Lieberman
• Need an approximate matching condition!

Levenshtein (or edit) distance

• The minimum number of character edit operations needed to turn one string into the other.

LIEBERMAN
LIEBERMEN

– Substitute A with E. Edit distance = 1

Levenshtein (or edit) distance

• The distance between two strings s and t is the cost of the cheapest sequence of edit commands that transforms s into t.

• Commands:
  – Copy character from s to t (cost = 0)
  – Delete a character from s (cost = 1)
  – Insert a character into t (cost = 1)
  – Substitute one character for another (cost = 1)

Costs can be different

Levenshtein (or edit) distance

Ashwin Machanavajjhala
Aswhin Maachanavajhala

Levenshtein (or edit) distance

String s: Ashwin Ma_chanavajjhala
String t: Aswhin Maachanavaj_hala

Edits: 2 × sub, 1 × ins, 1 × del (“_” marks a gap)

Total cost: 4

Computing the edit distance

      A  S  W  A  N
   0  1
A  1
S
W
H
I
N

Cost of changing “” (the empty string) → “A”

Cost of changing “ASWH” → “AS”

Computing the edit distance

      A  S  W  A  N
   0  1  2
A  1  0  1
S  2  1  0
W  3  2
H
I
N

Cost of changing “ASW” → “AS”:

Minimum of:
- Cost of “AS” → “AS” + 1 (delete W)
- Cost of “ASW” → “A” + 1 (insert S)
- Cost of “AS” → “A” + 1 (substitute W with S)
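The three cases on this slide are one instance of the standard dynamic-programming recurrence for edit distance with unit costs. As a sketch, write $D(i, j)$ (notation introduced here, not on the slides) for the cost of changing the first $i$ characters of $s$ into the first $j$ characters of $t$:

```latex
\[
D(i, 0) = i, \qquad D(0, j) = j,
\]
\[
D(i, j) = \min
\begin{cases}
D(i-1, j) + 1 & \text{(delete } s_i\text{)} \\
D(i, j-1) + 1 & \text{(insert } t_j\text{)} \\
D(i-1, j-1) + [s_i \neq t_j] & \text{(copy if equal, otherwise substitute)}
\end{cases}
\]
```

With $s$ = ASWHIN and $t$ = ASWAN, the cell for “ASW” → “AS” is $D(3, 2) = \min(0 + 1,\; 2 + 1,\; 1 + 1) = 1$.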

Computing the edit distance

      A  S  W  A  N
   0  1  2  3  4  5
A  1  0  1  2  3  4
S  2  1  0  1  2  3
W  3  2  1  0  1  2
H  4  3  2  1  1  2
I  5  4  3  2  2  2
N  6  5  4  3  3  ?

Computing the edit distance

      A  S  W  A  N
   0  1  2  3  4  5
A  1  0  1  2  3  4
S  2  1  0  1  2  3
W  3  2  1  0  1  2
H  4  3  2  1  1  2
I  5  4  3  2  2  2
N  6  5  4  3  3  2

Remember the minimum in each step and retrace your path.
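The table-filling procedure can be sketched in a few lines of Python. This is a minimal version assuming unit costs for insert, delete, and substitute; it returns only the distance and does not retrace the path. The function name `levenshtein` is chosen here to mirror the SQL function used in these slides.

```python
def levenshtein(s: str, t: str) -> int:
    """Edit distance with unit costs for insert, delete, and substitute."""
    # d[i][j] = cost of changing the first i characters of s
    # into the first j characters of t.
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = i  # delete all i characters
    for j in range(len(t) + 1):
        d[0][j] = j  # insert all j characters
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            substitute = 0 if s[i - 1] == t[j - 1] else 1  # copying costs 0
            d[i][j] = min(
                d[i - 1][j] + 1,               # delete s[i-1]
                d[i][j - 1] + 1,               # insert t[j-1]
                d[i - 1][j - 1] + substitute,  # copy or substitute
            )
    return d[len(s)][len(t)]

print(levenshtein("ASWHIN", "ASWAN"))         # 2, the bottom-right cell of the table
print(levenshtein("LIEBERMAN", "LIEBERMEN"))  # 1
```

Filling the full table takes time and space proportional to the product of the two string lengths, which is fine for names but worth keeping in mind for long strings.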

Edit Distance Variants

• Needleman-Wunsch
  – Different costs for each operation
• Affine Gap distance
  – John Reed vs John Francis “Jack” Reed
  – Consecutive inserts cost less than the first insert.
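One common way to write the affine gap idea is as the cost of a run of $k$ consecutive inserted (or deleted) characters. The symbols $o$ (cost of opening a gap) and $e$ (cost of extending it) are notation assumed here, not taken from the slides:

```latex
\[
\text{cost}(\text{gap of length } k) = o + (k - 1)\, e, \qquad 0 \le e < o .
\]
```

For instance, with $o = 1$ and $e < 1$, a run of $k \ge 2$ inserted characters costs less than the $k$ it would cost under unit-cost edit distance, so an inserted middle name or nickname is penalized less.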

Back to our example … Attempt 3

SELECT w.firstname, w.lastname, w.state, w.party, p.first_name, p.last_name,
       date_part('year', current_date) - date_part('year', p.birthday) AS age
FROM wallst w, persons p
WHERE levenshtein(w.lastname, p.last_name) <= 1
  AND date_part('year', current_date) - date_part('year', p.birthday) < 100;

Jaccard similarity

• Useful similarity function for sets
  – (and for … long strings).
• Let A and B be two sets
  – Words in two documents
  – Friends lists of two individuals

$\text{Jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|}$

Jaccard similarity for names

• Use character trigrams

LIEBERMAN = {__L, _LI, LIE, IEB, EBE, BER, ERM, RMA, MAN, AN_, N__}

LIEBERMEN = {__L, _LI, LIE, IEB, EBE, BER, ERM, RME, MEN, EN_, N__}

(“_” is a padding character)

Jaccard(s,t) = 8/14
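The trigram computation can be checked with a short Python sketch. It assumes two padding characters on each side of the name, as in the sets on this slide; the helper names `trigrams` and `jaccard` and the choice of "_" as the padding character are illustrative.

```python
def trigrams(name: str, pad: str = "_") -> set:
    """Character trigrams of name, padded with two pad characters on each side."""
    padded = pad * 2 + name + pad * 2
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def jaccard(a: set, b: set) -> float:
    """Size of the intersection over size of the union (1.0 if both sets are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

s = trigrams("LIEBERMAN")
t = trigrams("LIEBERMEN")
print(len(s & t), len(s | t))   # 8 14
print(round(jaccard(s, t), 3))  # 0.571
```

A database function such as PostgreSQL's `similarity` may use a different padding and case convention, so its scores can differ from this sketch.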

Attempt 4:

SELECT w.firstname, w.lastname, w.state, w.party, p.first_name, p.last_name,
       date_part('year', current_date) - date_part('year', p.birthday) AS age
FROM wallst w, persons p
WHERE similarity(w.lastname, p.last_name) >= 0.5
  AND date_part('year', current_date) - date_part('year', p.birthday) < 100;

Translation / Substitution Tables

• Strings that are usually used interchangeably
  – New York vs Big Apple
  – Thomas vs Tom
  – Robert vs Bob
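A translation table can be as simple as a lookup from a canonical name to its known alternates. The sketch below is a hypothetical in-memory version; the table contents and the names `NICKNAMES` and `first_names_match` are illustrative, and a real table would be much larger.

```python
# Hypothetical nickname table: canonical first name -> known nicknames.
NICKNAMES = {
    "Thomas": {"Tom"},
    "Robert": {"Bob"},
    "Timothy": {"Tim"},
}

def first_names_match(given: str, canonical: str) -> bool:
    """True if given equals canonical or is a listed nickname for it."""
    return given == canonical or given in NICKNAMES.get(canonical, set())

print(first_names_match("Bob", "Robert"))     # True
print(first_names_match("Robert", "Robert"))  # True
print(first_names_match("Tom", "Robert"))     # False
```

The lookup is directional on purpose: it asks whether the name in one source is an accepted alternate of the name in the other, the same shape as a nickname subquery in SQL.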

Attempt 5

SELECT w.firstname, w.lastname, w.state, p.first_name, p.last_name,
       date_part('year', current_date) - date_part('year', p.birthday) AS age
FROM wallst w, persons p
WHERE levenshtein(w.lastname, p.last_name) <= 1
  AND date_part('year', current_date) - date_part('year', p.birthday) < 100
  AND (w.firstname = p.first_name
       OR w.firstname IN (SELECT n.nickname
                          FROM nicknames n
                          WHERE n.firstname = p.first_name));

Almost there …

• Tim matches both Timothy and Tim
  – Can fix it by matching on STATE
  – Try it on your own … :)

Summary of Similarity Methods

• Equality on a boolean predicate
• Edit distance
  – Levenshtein, Affine
• Set similarity
  – Jaccard
• Vector Based
  – Cosine similarity, TFIDF
• Translation-based
• Numeric distance between values
• Phonetic Similarity
  – Soundex, Metaphone
• Other
  – Jaro-Winkler, Soft-TFIDF, Monge-Elkan

Easiest and most efficient

Summary of Similarity Methods

• Equality on a boolean predicate
• Edit distance
  – Levenshtein, Affine
• Set similarity
  – Jaccard
• Vector Based
  – Cosine similarity, TFIDF
• Translation-based
• Numeric distance between values
• Phonetic Similarity
  – Soundex, Metaphone
• Other
  – Jaro-Winkler, Soft-TFIDF, Monge-Elkan

Handle Typographical errors

Good for Text (reviews/tweets), sets, class membership, …

Useful for abbreviations, alternate names.

Good for Names

Evaluating Record Linkage

• Hard to get all the matches to be exactly correct in real world problems
  – As we saw in real examples
• Need to quantify how good the matching is.

Property Testing

• Consider a universe U of objects
  – Documents (in web search)
  – Pairs of records (in record linkage)
• Suppose you want to identify a subset M in U that satisfies a specific property
  – Relevance to a query (in web search)
  – Do the records match (in record linkage)

Property Testing

• Consider a universe U of objects
• Suppose you want to identify a subset M in U that satisfies a specific property
• Let A be an (imperfect) algorithm that guesses whether or not an element in U satisfies the property
  – Let M_A be the subset of objects that A identifies as satisfying the property.

Property Testing

                              | Real world:           | Real world:
Algorithm guess               | Satisfies P (M)       | Doesn’t satisfy P (U − M)
Satisfies P (M_A)             | True positives (TP)   | False positives (FP): crying wolf!
Doesn’t satisfy P (U − M_A)   | False negatives (FN)  | True negatives (TN)

Venn diagram view

[Figure: Venn diagram of M and M_A inside the universe U. M ∩ M_A: true positives (TP). M only: false negatives (FN). M_A only: false positives (FP). Outside both: true negatives (TN).]
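The four regions map directly onto set operations. Below is a minimal Python sketch; the universe and the two subsets are hypothetical integer IDs standing in for candidate record pairs, and the function name `confusion_sets` is illustrative.

```python
def confusion_sets(universe: set, truth: set, guess: set) -> dict:
    """Split the universe U into TP, FP, FN, TN given M (truth) and M_A (guess)."""
    return {
        "TP": truth & guess,               # in M and in M_A
        "FP": guess - truth,               # in M_A but not in M
        "FN": truth - guess,               # in M but missed by A
        "TN": universe - (truth | guess),  # in neither
    }

U = set(range(1, 9))  # hypothetical universe of 8 candidate pairs
M = {1, 2, 3, 4}      # pairs that truly match
M_A = {3, 4, 5}       # pairs that algorithm A reports as matches
cells = confusion_sets(U, M, M_A)
print({name: sorted(members) for name, members in cells.items()})
```

The four sets are disjoint and together cover U, which is a quick sanity check for any evaluation script.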

Error: Precision / Recall

Precision: fraction of answers returned by A that are correct

$\text{Precision} = \frac{TP}{TP + FP} = \frac{|M \cap M_A|}{|M_A|}$

Recall: fraction of correct answers that are returned by A

$\text{Recall} = \frac{TP}{TP + FN} = \frac{|M \cap M_A|}{|M|}$

Error: F-measure

$\text{Precision} = \frac{|M \cap M_A|}{|M_A|}$

$\text{Recall} = \frac{|M \cap M_A|}{|M|}$

$F_1\text{ score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
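These three measures are straightforward to compute once M and M_A are available as sets. The following Python sketch uses hypothetical integer IDs for record pairs; returning 0.0 when a denominator is zero is a convention assumed here, not something the slides specify.

```python
def precision_recall_f1(truth: set, guess: set) -> tuple:
    """Precision, recall, and F1 of guess (M_A) against truth (M)."""
    tp = len(truth & guess)
    precision = tp / len(guess) if guess else 0.0
    recall = tp / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical example: 10 true matches, 10 reported matches, 9 in common.
M = set(range(1, 11))
M_A = set(range(2, 12))
p, r, f1 = precision_recall_f1(M, M_A)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.9 0.9 0.9
```

Note that true negatives never enter the computation, which is convenient in record linkage because almost all candidate pairs are non-matches.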

Example

• M:


Example:

Algorithm A:

SELECT *
FROM wallst w, persons p
WHERE w.lastname = p.last_name  -- exact match on last name
  AND date_part('year', current_date) - date_part('year', p.birthday) < 100  -- age < 100
  AND (w.firstname = p.first_name  -- first name is same or a nickname
       OR w.firstname IN (SELECT n.nickname
                          FROM nicknames n
                          WHERE n.firstname = p.first_name));

Example

• M_A:

Example

$\text{Precision} = \frac{|M \cap M_A|}{|M_A|} = \frac{9}{10} = 0.9$

$\text{Recall} = \frac{|M \cap M_A|}{|M|} = \frac{9}{10} = 0.9$

$F_1\text{ score} = \frac{2 \times 0.9 \times 0.9}{0.9 + 0.9} = 0.9$

Summary

• Many interesting data analyses require reasoning across different datasets

• May not have access to keys that uniquely identify individual rows in both datasets


Summary

• Use combinations of attributes that are approximate keys (or quasi-identifiers)

• Use similarity measures for fuzzy or approximate matching
  – Levenshtein or Edit distance
  – Jaccard Similarity

• Use translation tables

Summary

• Record Linkage is rarely perfect
  – Missing attributes
  – Messy data errors
  – …

• Precision/Recall is used to measure the quality of linkage.

The Ugly side of Record Linkage [Sweeney IJUFKS 2002]

Medical Data: Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge, Zip, Birthdate, Sex

The Ugly side of Record Linkage [Sweeney IJUFKS 2002]

Medical Data: Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge

Voter List: Name, Address, Date Registered, Party affiliation, Date last voted

In both: Zip, Birthdate, Sex

• Governor of MA uniquely identified using ZipCode, Birth Date, and Sex.

Name linked to Diagnosis

The Ugly side of Record Linkage [Sweeney IJUFKS 2002]

Medical Data: Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge

Voter List: Name, Address, Date Registered, Party affiliation, Date last voted

In both: Zip, Birthdate, Sex

• Governor of MA uniquely identified using ZipCode, Birth Date, and Sex.

Quasi Identifier (Zip, Birthdate, Sex): unique for 87 % of US population
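How identifying a quasi-identifier is can be estimated by counting how many records carry a value that no other record shares. The Python sketch below does this on made-up toy rows (not real data); the function name `fraction_unique` and the tuple layout (zip, birthdate, sex) are illustrative.

```python
from collections import Counter

# Hypothetical toy records: (zip, birthdate, sex). Not real data.
records = [
    ("00001", "1970-01-01", "F"),
    ("00001", "1970-01-01", "F"),
    ("00001", "1985-06-15", "M"),
    ("00002", "1990-02-02", "M"),
]

def fraction_unique(rows):
    """Fraction of rows whose quasi-identifier value appears exactly once."""
    counts = Counter(rows)
    unique = sum(1 for row in rows if counts[row] == 1)
    return unique / len(rows)

print(fraction_unique(records))  # 0.5
```

Running the same count before releasing a dataset is one way to see how many rows could be singled out by a join on those columns.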

(anonymous) browsing history → social network profiles.

Figure 5: Screenshots of the online experiment.

[Figure 6 plot: accuracy (0% to 100%) versus number of links (25 to 100) for three methods: MLE, intersection size, Jaccard.]

Figure 6: De-anonymization accuracy for three candidate ranking methods on user-contributed web browsing histories. Accuracy for intersection size and Jaccard rankings are approximate, as ground-truth answers are typically only available for users who were ranked in the top 15 by the MLE.

Unfortunately, because of the experiment’s design, we typically only know ground truth for the individuals who ranked in the top 15 by our approach, and it is possible in theory that the other two methods succeed precisely where the MLE fails. To assess this possibility, we consider the 11 cases in which an individual did not appear in our list of top 15 candidates but disclosed their identity by signing into Twitter. In all of these 11 cases, both the intersection method and Jaccard failed to successfully identify the user. Thus, while based on a small sample, it seems reasonable to assume that if a participant is not ranked in the top 15 by the MLE method, then other de-anonymization methods would also have failed. Based on this assumption, Figure 6 compares the performance of all three de-anonymization methods on the full set of 374 users. As on the simulated data, we find that our method outperforms Jaccard similarity and intersection size, often by a substantial margin.

We can further use the MLE scores to estimate the confidence of our predictions. Given ordered candidate scores s1 ≥ s2 ≥ ··· ≥ sn for an anonymous browsing history H, the eccentricity [28] of H is (s1 − s2)/std-dev({si}). Figure 7 shows prediction accuracy on the top-k instances ranked by eccentricity. The right-most point on the plot corresponds to accuracy on the full set of 374 histories (72%); if we limit to the 50% of histories with the highest eccentricity, accuracy increases to 96%. For comparison, the plot also shows accuracy as a function of history length, and indicates that eccentricity is the better predictor of accuracy.

[Figure 7 plot: accuracy (70% to 100%) versus coverage (0% to 100%), for rankings by eccentricity and by history length.]

Figure 7: De-anonymization accuracy on the top-k histories ranked by eccentricity and history length.

7. THREAT MODELS

Our de-anonymization strategy assumes access to an individual’s Twitter browsing history. Such data are available to a variety of organizations with commercial or strategic incentives to de-anonymize users. In this section, we describe two such possible attackers and evaluate the efficacy of our approach on data available to them.

Third-party trackers are entities embedded into some websites for the purpose of collecting individual user browsing habits. Trackers can determine whether a user arrived from Twitter to a site where they are embedded by examining the page’s document.referrer property. We estimate the de-anonymization capabilities of four common third-party trackers: Google, Facebook, ComScore, and AppNexus. For each user-contributed history, and for each organization, we first determine which URLs in the history they are likely able to track by checking if the organization has a tracker


https://5harad.com/papers/twivacy.pdf