Blindfolded Record Linkage

Blindfolded Record Linkage. Presented by Gautam Sanka. Susan C. Weber, Henry Lowe, Amar Das, Todd Ferris.


Transcript of Blindfolded Record Linkage

Page 1: Blindfolded Record Linkage

Blindfolded Record Linkage

Presented by Gautam Sanka

Susan C. Weber, Henry Lowe, Amar Das, Todd Ferris

Page 2: Blindfolded Record Linkage

Introduction and Objectives

Challenges
  Patient privacy vs. building cross-site records

Solutions
  Mandate that identifiers be disclosed
    Privacy officers find this unacceptable
  Keep only de-identified information in the registry, but share with third parties an algorithm for generating an anonymous identifier

Page 3: Blindfolded Record Linkage

De-identification Explained

This anonymous identifier will be created in such a way that:
  The probability of the same identifier being generated at two different sites is high for the same person
  And low for different people

Page 4: Blindfolded Record Linkage

What can be used?

Using SSN: bad idea

Using names and DOB may seem best, but:
  Nicknames at one site and full name at another
  Misspellings
  Different titles (Mr., Ms., Mrs.)

Page 5: Blindfolded Record Linkage

Goal of Project

Breast cancer patients at PAMF (Palo Alto Medical Foundation) and Stanford University Medical Center

Merge the data with de-identification under HIPAA and IRB approval

Page 6: Blindfolded Record Linkage

Interesting Approaches

Bigrams
  For the names Ann and Anne: [AN, NN] and [AN, NN, NE]
  The Dice coefficient is 2 * (2/5) = 4/5 (two shared bigrams out of five in total)

Bloom filter

Neither was implemented due to the complexities
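
The bigram example on this slide can be reproduced with a short sketch. The function names `bigrams` and `dice_coefficient` are illustrative and not taken from the presentation, and treating each name as a set of bigrams (so repeated bigrams count once) is an assumption. The slide states that this approach was not implemented in the project, so this only illustrates the arithmetic.

```python
def bigrams(name):
    """Return the set of adjacent two-letter pairs in a name (case-insensitive)."""
    s = name.upper()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_coefficient(a, b):
    """Dice coefficient: 2 * |shared bigrams| / (|bigrams of a| + |bigrams of b|)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # Assumption: two names with no bigrams at all are scored as 0.0.
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


# Ann -> {AN, NN}; Anne -> {AN, NN, NE}; two shared bigrams out of 2 + 3.
print(dice_coefficient("Ann", "Anne"))  # 0.8
```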

Page 7: Blindfolded Record Linkage

A single SHA-1 string was constructed based on:
  Gender
  DOB
  Zip
  Three-letter prefix of last name

In their case, only the first two letters of patients’ first and last names were used
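
A minimal sketch of a hashed composite identifier, built from the first two letters of the first and last names plus the DOB, might look like the following. The function name `composite_identifier`, the normalization (uppercasing, dropping non-letters), the ISO date format, and the concatenation order are all assumptions made for illustration; the slides do not specify them, and both sites would have to agree on the same rules for the hashes to match.

```python
import hashlib
from datetime import date


def composite_identifier(first_name, last_name, dob):
    """Hash of two-letter name prefixes plus DOB (hypothetical field layout)."""

    def prefix(name):
        # Assumption: ignore case, spaces, and punctuation before taking 2 letters.
        letters = [c for c in name.upper() if c.isalpha()]
        return "".join(letters[:2])

    # Assumption: prefixes and ISO-formatted DOB are simply concatenated.
    key = prefix(first_name) + prefix(last_name) + dob.isoformat()
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


# A nickname-style variant yields the same identifier as the full name.
a = composite_identifier("Ann", "Smith", date(1970, 1, 2))
b = composite_identifier("Anne", "smith", date(1970, 1, 2))
print(a == b)  # True
```

Because only the prefixes are hashed, variants such as Ann and Anne collapse to the same value, which is the behavior the slide describes.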

Page 8: Blindfolded Record Linkage

Composite Identifier

Felt that a combination of DOB and the first two letters of names would uniquely identify a patient

Most applicable when:
  Compliance restrictions preclude the exchange of actual identifiers
  Total number of comparisons is less than 10^8
  Names and DOB are easily available
  DOB has a low error rate

Page 9: Blindfolded Record Linkage

Methods

Measured rate of false positives in data
  Dropped name prefixes
  Dropped DOBs stating 1/1/1900 and 1/1/1901
  Performed a self-join on three sets of 1.5M rows, 0.5M rows and 10,000 rows
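
The self-join described above can be sketched as counting pairs of rows that share a composite key. The function name `self_join_false_positives`, the tuple layout of each row, and the assumption that every row is a distinct person (so any collision is a false positive) are illustrative and not taken from the slides.

```python
from collections import Counter
from datetime import date

# Placeholder DOBs that the slide says were dropped before the self-join.
PLACEHOLDER_DOBS = {date(1900, 1, 1), date(1901, 1, 1)}


def self_join_false_positives(rows):
    """rows: iterable of (first_name, last_name, dob) tuples.

    Returns (colliding_pairs, total_pairs), assuming each row is a
    different person, so every shared key counts as a false positive.
    """
    keys = [
        (first[:2].upper(), last[:2].upper(), dob)
        for first, last, dob in rows
        if dob not in PLACEHOLDER_DOBS
    ]
    counts = Counter(keys)
    # n rows sharing one key produce n * (n - 1) / 2 matching pairs.
    colliding_pairs = sum(n * (n - 1) // 2 for n in counts.values())
    total_pairs = len(keys) * (len(keys) - 1) // 2
    return colliding_pairs, total_pairs
```

Grouping by key avoids comparing every pair of rows directly, which matters at the 1.5M-row scale mentioned on the slide.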

Page 10: Blindfolded Record Linkage

Specificity based on Data Set Size

Page 11: Blindfolded Record Linkage

Measure False Negatives

Both sites exchanged cryptographic hashes based on SSNs

The number of matches found by matching SSNs and not by composite identifiers became the lower bound for false negatives

Removal of all false positives based on real identifiers
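
The lower bound described on this slide amounts to a set difference between the two match lists. In this sketch, each match is assumed to be a pair of record IDs, one from each site; the function name `false_negative_lower_bound` and that representation are hypothetical.

```python
def false_negative_lower_bound(ssn_matches, composite_matches):
    """Count pairs linked by hashed SSN but missed by the composite identifier.

    Each match is assumed to be a (site_a_record_id, site_b_record_id) pair.
    """
    return len(set(ssn_matches) - set(composite_matches))


ssn = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]
composite = [("a1", "b1"), ("a4", "b4")]
print(false_negative_lower_bound(ssn, composite))  # 2
```

It is only a lower bound because pairs missed by both methods are not counted.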

Page 12: Blindfolded Record Linkage

Sensitivity: Specificity:
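
For reference, the standard definitions of these two measures are given below. The symbols TP, FN, TN, and FP (true positives, false negatives, true negatives, false positives) are the conventional ones and are assumed here, not taken from the slides.

```latex
\[
\text{Sensitivity} = \frac{TP}{TP + FN},
\qquad
\text{Specificity} = \frac{TN}{TN + FP}
\]
```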

Page 13: Blindfolded Record Linkage

PAMF: 8,166

Stanford: 10,939

Common patients: 2,087

Page 14: Blindfolded Record Linkage

Total found by composite identifier: 2028

Exact matches in names + DOB: 1824

Confirmed by full identifiers later: 204

“This was a very interesting result in that it provided us with a measure of how much better our approach is compared to using full names rather than two-letter prefixes.”
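
The three counts on this slide are consistent with one another, as the following worked arithmetic shows. The percentage is computed here from the slide's numbers and is not a figure quoted in the presentation.

```latex
\[
1824 + 204 = 2028,
\qquad
\frac{204}{2028} \approx 10.1\%
\]
```

Read this way, about one in ten of the matches found by the composite identifier would not have been found by exact matching on full names and DOB.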

Page 15: Blindfolded Record Linkage

Reasons for False Negatives in Composite Identification

Found by SSN and later confirmed manually

Page 16: Blindfolded Record Linkage

Simply Using SSN

SSNs found only 1806 out of 2028

Rate of false negatives is 10% higher than a composite identifier

Reasons
  172 of the 222 with false negatives had a missing SSN
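
The counts on this slide fit together as follows. Using 2028 as the denominator for the percentage is an assumption, and the result is computed here, not quoted from the slides.

```latex
\[
2028 - 1806 = 222,
\qquad
\frac{222}{2028} \approx 10.9\%,
\qquad
222 - 172 = 50
\]
```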

Page 17: Blindfolded Record Linkage

What about the other 50?

In conclusion:
  57 false positives for SSN matches
  3 false positives for composite identifier
  Roughly 20 times worse (57 / 3 = 19)

Page 18: Blindfolded Record Linkage
Page 19: Blindfolded Record Linkage

Which identifiers are best?

Page 20: Blindfolded Record Linkage
Page 21: Blindfolded Record Linkage

When should we use this tool?

Most useful where privacy policies preclude the full exchange of the identifiers required by more sophisticated and sensitive linkage algorithms

For data sets of high quality, this approach (in comparison to complex algorithms) is:
  Easy to explain
  Adherent to the minimum rules set by HIPAA
  Faster and less cumbersome

Page 22: Blindfolded Record Linkage

Suggestions