Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across...

54
Record Linkage Everything Data CompSci 216 Spring 2018

Transcript of Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across...

Page 1: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Record Linkage

Everything DataCompSci 216 Spring 2018

Page 2: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Recap: Querying Relational Databases in SQL

SELECT columns or expressions ��FROM tables �� WHERE conditions �GROUP BY columns� �ORDER BY output columns;

2

1. Generate all combinations of rows, one from each table; each combination forms a “wide row”

2. Filter—keep only “wide rows” satisfying conditions

3. Group—”wide rows” with matching values for columns go into the same group

5. Compute one output row for each “wide row”

(or for each group of them if query has grouping/aggregation)

× × =!

4. Sort the output rows

Page 3: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Problem

•  Forbes magazine article: “Wall Street’s favorite senators”

3

Page 4: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Problem

•  Forbes magazine article: “Wall Street’s favorite senators”

•  What are their ages?

4

Page 5: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Solution

•  Join with the persons table (from govtrack)

•  But there is no key to join on …

5

Page 6: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Record Linkage

•  Problem of finding duplicate entities across different sources (or even within a

single dataset).

6

Page 7: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Ironically, Record Linkage has many names

Doubles

Duplicate detectionEntity Resolution

Deduplication

Object identification

Object consolidation

Coreference resolution

Entity clustering

Reference reconciliation

Reference matchingHouseholding

Household matching

Fuzzy match

Approximate match

Merge/purge

Hardening soft databases

Identity uncertainty

7

Page 8: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Motivating Example 1: Web9

Page 9: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Motivating Example 1: Web

Page 10: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Motivating Example 2: Network Science•  Measuring the topology of the internet …

using traceroute

Page 11: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

IP Aliasing Problem [Willinger et al. 2009]

Page 12: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

IP Aliasing Problem [Willinger et al. 2009]

Page 13: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

And many many more examples

•  Linking Census Records

•  Public Health

•  Medical records

•  Web search – query disambiguation

•  Comparison shopping

•  Maintaining customer databases

•  Law enforcement and Counter-terrorism

•  Scientific data

•  Genealogical data

•  Bibliographic data

14

Page 14: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Opportunityhttp://lod-cloud.net/

Page 15: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Back to our example

•  Join with the persons table (from govtrack)

•  But there is no key to join on …

•  What about (firstname, lastname)?

16

Page 16: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Aiempt 1:

SELECT w.*, date_part('year', current_date) - date_part('year', p.birthday) AS age FROM wallst w, persons p WHERE w.first_name = p.first_name and w.last_name = p.last_name;

17

Page 17: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Problems

•  Join condition is too specific– Nicknames used instead of real first names

18

Page 18: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Aiempt 2:

•  Join on Last name + Age < 100 (senator must be alive)

SELECT w.*, date_part('year', current_date) - date_part('year', p.birthday) AS age FROM wallst w, persons p WHERE w.lastname = p.last_name and date_part('year', current_date) - date_part('year', p.birthday) < 100;

19

Page 19: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Problem:

•  Join condition is too inclusive– Many individuals share the same last name.

20

Surname Approx # Rank

Smith 2.4 M 1

Johnson 1.8 M 2

Williams 1.5 M 3

Brown 1.4 M 4

Jones 1.4 M 5

http://www.census.gov/genealogy/www/data/2000surnames/

Page 20: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

“Where is Joe Liebermen ?”

•  Spelling mistake– Liebermen vs Lieberman

•  Need an approximate matching condition!

21

Page 21: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Levenshtein (or edit) distance

•  The minimum number of character edit operations needed to turn one string into the other.

LIEBERMAN LIEBERMEN

– Substitute A to E. Edit distance = 1

22

Page 22: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Levenshtein (or edit) distance

•  Distance between two string s and t is the shortest sequence of edit commands that transform s to t.

•  Commands: – Copy character from s to t (cost = 0)– Delete a character from s (cost = 1)–  Insert a character into t (cost = 1)– Substitute one character for another(cost = 1)

23

Costs can be different

Page 23: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Computing the edit distance

A S W A N

0 1

A 1

S

W

H

I

N

26

Cost of changing “G” à “A”

Cost of changing “ASWH” à “AS”

Page 24: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Computing the edit distanceA S W A N

0 1 2

A 1 0 1

S 2 1 0

W 3 2

H

I

N

27

Cost of changing “ASW” à “AS”:Minimum of: - Cost of “AS” à “AS” + 1 (delete W) - Cost of “ASW” à “A” + 1 (insert S) - Cost of “AS” à “A” + 1 (substitute

W with S)

Page 25: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Computing the edit distanceA S W A N

0 1 2 3 4 5

A 1 0 1 2 3 4

S 2 1 0 1 2 3

W 3 2 1 0 1 2

H 4 3 2 1 1 2

I 5 4 3 2 2 2

N 6 5 4 3 3 ?

28

Page 26: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Computing the edit distanceA S W A N

0 1 2 3 4 5

A 1 0 1 2 3 4

S 2 1 0 1 2 3

W 3 2 1 0 1 2

H 4 3 2 1 1 2

I 5 4 3 2 2 2

N 6 5 4 3 3 2

29

Remember the minimum in each step and retrace your path.

Page 27: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Edit Distance Variants

•  Needleman-Munch– Different costs for each operation

•  Affine Gap distance–  John Reed vs John Francis “Jack” Reed– Consecutive inserts cost less than the first

insert.

30

Page 28: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Back to our example … Aiempt 3

SELECT w.firstname, w.lastname, w.state, w.party, p.first_name, p.last_name, date_part('year', current_date) - date_part('year', p.birthday) AS age FROM wallst w, persons p WHERE levenshtein(w.lastname, p.last_name) <= 1 and date_part('year', current_date) - date_part('year', p.birthday) < 100;

31

Page 29: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Jaccard similarity

•  Useful similarity function for sets –  (and for… long strings).

•  Let A and B be two sets– Words in two documents– Friends lists of two individuals

32

Jaccard !,! = ! |! ∩ !||! ∪ !|!

Page 30: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Jaccard similarity for names

•  Use character trigrams

LIEBERMAN = {GGL, GLI, LIE, IEB, EBE, BER, ERM, RMA,MAN, ANG, NGG}LIEBERMEN = {GGL, GLI, LIE, IEB, EBE, BER, ERM, RMA,MEN, ENG, NGG}

Jaccard(s,t) = 9/13 = 0.69

33

Page 31: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Aiempt 4:

SELECT w.firstname, w.lastname, w.state, w.party, p.first_name, p.last_name, date_part('year', current_date) - date_part('year', p.birthday) AS age FROM wallst w, persons p WHERE similarity(w.lastname, p.last_name) >= 0.5 and date_part('year', current_date) - date_part('year', p.birthday) < 100;

34

Page 32: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Translation / Substitution Tables

•  Strings that are usually used interchangeably– New York vs Big Apple– Thomas vs Tom– Robert vs Bob

35

Page 33: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Aiempt 5

select w.firstname, w.lastname, w.state, p.first_name, p.last_name, date_part('year', current_date) - date_part('year', p.birthday) AS age from wallst w, persons p where levenshtein(w.lastname, p.last_name) <= 1 and date_part('year', current_date) - date_part('year', p.birthday) < 100 and (w.firstname = p.first_name or w.firstname IN (select n.nickname from nicknames n where n.firstname = p.first_name));

36

Page 34: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Almost there …

•  Tim matches both Timothy and Tim– Can fix it by matching on STATE– Try it on your own … J

37

Page 35: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Summary of Similarity Methods

•  Equality on a boolean predicate

•  Edit distance–  Levenstein, Affine

•  Set similarity–  Jaccard

•  Vector Based–  Cosine similarity, TFIDF

•  Translation-based•  Numeric distance between

values•  Phonetic Similarity

–  Soundex, Metaphone•  Other

–  Jaro-Winkler, Soft-TFIDF, Monge-Elkan

38

Easiest and most efficient

Page 36: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Summary of Similarity Methods

•  Equality on a boolean predicate

•  Edit distance–  Levenstein, Affine

•  Set similarity–  Jaccard

•  Vector Based–  Cosine similarity, TFIDF

•  Translation-based•  Numeric distance between

values•  Phonetic Similarity

–  Soundex, Metaphone•  Other

–  Jaro-Winkler, Soft-TFIDF, Monge-Elkan

39

Handle Typographical errors

Good for Text (reviews/ tweets), sets, class membership, …

Useful for abbreviations,

alternate names.

Good for Names

Page 37: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Evaluating Record Linkage

•  Hard to get all the matches to be exactly correct in real world problems– As we saw in real examples

•  Need to quantify how good the matching is.

40

Page 38: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Property Testing

•  Consider a universe U of objects– Documents (in web search)– Pairs of records (in record linkage)

•  Suppose you want to identify a subset M in U that satisfies a specific property– Relevance to a query(in web search)– Do the records match (in record linkage)

41

Page 39: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Property Testing

•  Consider a universe U of objects•  Suppose you want to identify a subset M

in U that satisfies a specific property

•  Let A be an (imperfect) algorithm that guesses whether or not an element in U satisfies the property– Let MA be the subset of objects that A

identifies as satisfying the property.

42

Page 40: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Property Testing43

Satisfies P Doesn’t Satisfy P

Satisfies P

Doesn’t satisfy P

Real World

Alg

orit

hm

Gu

ess

MA

M U - M

U – MA

True positives

(TP)

True negatives

(TN)

False positives

(FP)

Crying

Wolf!

False negatives

(FN)

Page 41: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Venn diagram view44

MMA

U

True positives

(TP)

True negatives

(TN)

False negatives

(FN) False positives

(FP)

Page 42: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Error: Precision / Recall

fraction of answers returned by A that are correct

fraction of correct answers that are returned by A

45

Precision = ! !" (!" + !") != !! |! ∩!!| |!!|!!!!

Recall = ! !" (!" + !") != !! |! ∩!!| |!|!!!!

Page 43: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Error: F-measure

46

Precision = ! |! ∩!!| |!!|!!

Recall = ! |! ∩!!| |!|!!

F1!score = !2 ∙ precision ∙ recallprecision+ recall!!

Page 44: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Example

•  M:

47

Page 45: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Example:

Algorithm A: select * from wallst w, persons p where w.lastname = p.last_name and date_part('year', current_date) - date_part('year', p.birthday) < 100 and (w.firstname = p.first_name or w.firstname IN (select n.nickname from nicknames n where n.firstname = p.first_name));

48

Exact match on last name

Age < 100

First name is same or a nickname

Page 46: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Example

•  MA:

49

Page 47: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Example50

Precision = ! ! ∩!! !! !!= !9 10 = 0.9!!!

Recall = ! ! ∩!! ! != !9 10 = 0.9!!

F1!score = 2! 0.9!×!0.90.9+ 0.9 = 0.9!!

Page 48: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Summary

•  Many interesting data analyses require reasoning across different datasets

•  May not have access to keys that uniquely identify individual rows in both datasets

51

Page 49: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Summary

•  Use combinations of aiributes that are approximate keys (or quasi-identifiers)

•  Use similarity measures for fuzzy or approximate matching– Levenshtein or Edit distance– Jaccard Similarity

•  Use translation tables

52

Page 50: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

Summary

•  Record Linkage is rarely perfect– Missing aiributes– Messy data errors– …

•  Precision/Recall is used to measure the quality of linkage.

53

Page 51: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

The Ugly side of Record Linkage [Sweeney IJUFKS 2002]

• Name• SSN• Visit Date• Diagnosis• Procedure• Medication• Total Charge

•  Zip•  Birth date•  Sex

Medical Data

54

Page 52: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

The Ugly side of Record Linkage [Sweeney IJUFKS 2002]

• Name• SSN• Visit Date• Diagnosis• Procedure• Medication• Total Charge

• Name• Address• Date Registered• Party affiliation • Date last voted

•  Zip

•  Birth

date

•  Sex

Medical Data Voter List

•  Governor of MA uniquely identified using ZipCode, Birth Date, and Sex. Name linked to Diagnosis

55

Page 53: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

The Ugly side of Record Linkage [Sweeney IJUFKS 2002]

• Name• SSN• Visit Date• Diagnosis• Procedure• Medication• Total Charge

• Name• Address• Date Registered• Party affiliation • Date last voted

•  Zip

•  Birth

date

•  Sex

Medical Data Voter List

56

•  Governor of MA uniquely identified using ZipCode, Birth Date, and Sex. Quasi Identifier

87 % of US population

Page 54: Record Linkage · 2018. 1. 31. · Record Linkage • Problem of finding duplicate entities across different sources (or even within a single dataset). 6 Ironically, Record Linkage

(anonymous) browsing history à social network profiles.

57

Figure 5: Screenshots of the online experiment.

●●

●●

0%

20%

40%

60%

80%

100%

25 50 75 100Number of links

Accu

racy

MLEIntersection sizeJaccard

Figure 6: De-anonymization accuracy for three can-

didate ranking methods on user-contributed web

browsing histories. Accuracy for intersection size

and Jaccard rankings are approximate, as ground-

truth answers are typically only available for users

who were ranked in the top 15 by the MLE.

Unfortunately, because of the experiment’s design, we typi-cally only know ground truth for the individuals who rankedin the top 15 by our approach, and it is possible in theorythat the other two methods succeed precisely where the MLEfails. To assess this possibility, we consider the 11 cases inwhich an individual did not appear in our list of top 15 can-didates but disclosed their identity by signing into Twitter.In all of these 11 cases, both the intersection method andJaccard failed to successfully identify the user. Thus, whilebased on a small sample, it seems reasonable to assume thatif a participant is not ranked in the top 15 by the MLEmethod, then other de-anonymization methods would alsohave failed. Based on this assumption, Figure 6 comparesthe performance of all three de-anonymization methods onthe full set of 374 users. As on the simulated data, we findthat our method outperforms Jaccard similarity and inter-section size, often by a substantial margin.

We can further use the MLE scores to estimate the con-fidence of our predictions. Given ordered candidate scoress1 � s2 � · · · � sn for an anonymous browsing history H,the eccentricity [28] of H is (s1�s2)/std-dev({si}). Figure 7

70%

75%

80%

85%

90%

95%

100%

0% 20% 40% 60% 80% 100%Coverage

Accu

racy

EccentricityHistory length

Figure 7: De-anonymization accuracy on the top-khistories ranked by eccentricity and history length.

shows prediction accuracy on the top-k instances ranked byeccentricity. The right-most point on the plot correspondsto accuracy on the full set of 374 histories (72%); if we limitto the 50% of histories with the highest eccentricity, accu-racy increases to 96%. For comparison, the plot also showsaccuracy as a function of history length, and indicates thateccentricity is the better predictor of accuracy.

7. THREAT MODELSOur de-anonymization strategy assumes access to an in-

dividual’s Twitter browsing history. Such data are availableto a variety of organizations with commercial or strategic in-centives to de-anonymize users. In this section, we describetwo such possible attackers and evaluate the e�cacy of ourapproach on data available to them.Third-party trackers are entities embedded into some web-

sites for the purpose of collecting individual user browsinghabits. Trackers can determine whether a user arrived fromTwitter to a site where they are embedded by examiningthe page’s document.referrer property. We estimate thede-anonymization capabilities of four common third-partytrackers: Google, Facebook, ComScore, and AppNexus. Foreach user-contributed history, and for each organization, wefirst determine which URLs in the history they are likelyable to track by checking if the organization has a tracker

Figure 5: Screenshots of the online experiment.

●●

●●

0%

20%

40%

60%

80%

100%

25 50 75 100Number of links

Accu

racy

MLEIntersection sizeJaccard

Figure 6: De-anonymization accuracy for three can-

didate ranking methods on user-contributed web

browsing histories. Accuracy for intersection size

and Jaccard rankings are approximate, as ground-

truth answers are typically only available for users

who were ranked in the top 15 by the MLE.

Unfortunately, because of the experiment’s design, we typi-cally only know ground truth for the individuals who rankedin the top 15 by our approach, and it is possible in theorythat the other two methods succeed precisely where the MLEfails. To assess this possibility, we consider the 11 cases inwhich an individual did not appear in our list of top 15 can-didates but disclosed their identity by signing into Twitter.In all of these 11 cases, both the intersection method andJaccard failed to successfully identify the user. Thus, whilebased on a small sample, it seems reasonable to assume thatif a participant is not ranked in the top 15 by the MLEmethod, then other de-anonymization methods would alsohave failed. Based on this assumption, Figure 6 comparesthe performance of all three de-anonymization methods onthe full set of 374 users. As on the simulated data, we findthat our method outperforms Jaccard similarity and inter-section size, often by a substantial margin.

We can further use the MLE scores to estimate the con-fidence of our predictions. Given ordered candidate scoress1 � s2 � · · · � sn for an anonymous browsing history H,the eccentricity [28] of H is (s1�s2)/std-dev({si}). Figure 7

70%

75%

80%

85%

90%

95%

100%

0% 20% 40% 60% 80% 100%Coverage

Accu

racy

EccentricityHistory length

Figure 7: De-anonymization accuracy on the top-khistories ranked by eccentricity and history length.

shows prediction accuracy on the top-k instances ranked byeccentricity. The right-most point on the plot correspondsto accuracy on the full set of 374 histories (72%); if we limitto the 50% of histories with the highest eccentricity, accu-racy increases to 96%. For comparison, the plot also showsaccuracy as a function of history length, and indicates thateccentricity is the better predictor of accuracy.

7. THREAT MODELSOur de-anonymization strategy assumes access to an in-

dividual’s Twitter browsing history. Such data are availableto a variety of organizations with commercial or strategic in-centives to de-anonymize users. In this section, we describetwo such possible attackers and evaluate the e�cacy of ourapproach on data available to them.Third-party trackers are entities embedded into some web-

sites for the purpose of collecting individual user browsinghabits. Trackers can determine whether a user arrived fromTwitter to a site where they are embedded by examiningthe page’s document.referrer property. We estimate thede-anonymization capabilities of four common third-partytrackers: Google, Facebook, ComScore, and AppNexus. Foreach user-contributed history, and for each organization, wefirst determine which URLs in the history they are likelyable to track by checking if the organization has a tracker

https://5harad.com/papers/twivacy.pdf