Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.


Page 1: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

Issues in Deterministic and Probabilistic Record Linkage

Scott DuVall

Salt Lake City VHA MC

Page 2: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

the age of information

Page 3: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

information

informatician

information =

Page 4: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

Linkage Adds Information

Page 5: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

Linkage Corrects Errors

Page 6: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

• Missing information affects patient care.1

• Transitions in care cause breakdown in communication.2

1 Stiell et al. Prevalence of information gaps in the emergency department and the effect on patient outcomes. CMAJ 2003;169(10):1023-8.

2 Coleman et al. Lost in transition: challenges and opportunities for improving the quality of transitional care. Ann Intern Med 2004;141(7):533-6.

Page 7: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

• Resolving duplicates can cost $60 per case.1

1 Thornton SN, Hood SK. Reducing Duplicate Patient Creation Using a Probabilistic Matching Algorithm in an Open-access Community Data Sharing Environment. Proc AMIA Symp 2005:1135.

Page 8: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

• “between $0.30 and $0.40 of every dollar spent on health care is wasted on overuse, under use, misuse, duplication, system failures, unnecessary repetition, poor communications and inefficiency.”1

1 Reid PP, Compton WD, Grossman JH, Fanjiang G. Building a Better Delivery System: A New Engineering/Health Care Partnership. National Academies Press, 2005:99.

Page 9: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

• A key element of health care information exchange and interoperability, which is estimated to reduce costs by $77.8 billion annually.1

1 Walker J, Pan E, Johnston D, Adler-Milstein J, Bates DW, Middleton B. The value of health care information exchange and interoperability. Health Aff (Millwood). 2005 Jan-Jun;Suppl Web Exclusives:W5-10-W5-18.

Page 10: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

Record Matching

• Many systems have record matching software.

• Errors still exist

– 50% missed in CDC Survey.1

– 5% missed in 1.5 million records = 75,000 records.2

1 User Manual for the CDC Deduplication Evaluation Toolkit.

2 Snow LA, DuVall SL. Clinical Data Exchange Through A Looking Glass: A Gray-Box Approach To Record Linkage. NLM 2005.

Page 11: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

Old Technology

Page 12: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

Misunderstood Technology

Page 13: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

Misunderstood Technology

Page 14: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

Score Is Not Probability

score

probability

Page 15: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

Information is not Used

Page 16: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

MPI MPI MPI MPI

Name + Date of Birth + Social Security Number

Page 17: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

MPI MPI MPI MPI

Page 18: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

Deterministic Linkage

1) IF r1.social_security_number = r2.social_security_number
   THEN match.

2) IF SoundexCompare(r1.last_name, r2.last_name) AND
      SoundexCompare(r1.first_name, r2.first_name) AND
      EditDistance(r1.birth_place, r2.birth_place) < 2 AND
      r1.birth_date = r2.birth_date AND
      r1.multiplicity = r2.multiplicity AND
      r1.birth_order = r2.birth_order
   THEN match.
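The two rules above can be sketched in Python. This is a minimal illustration, not the presenter's implementation: the record layout (a dict keyed by the field names on the slide), the helper names `soundex`, `edit_distance`, and `is_match`, and the guard that ignores empty Social Security numbers in rule 1 are all assumptions.

```python
def soundex(name: str) -> str:
    """American Soundex code (first letter plus three digits)."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit
    letters_only = [c for c in name.upper() if c.isalpha()]
    if not letters_only:
        return ""
    result = letters_only[0]
    prev = codes.get(letters_only[0], "")
    for ch in letters_only[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it counts
    return (result + "000")[:4]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_match(r1: dict, r2: dict) -> bool:
    # Rule 1: identical SSN (assumption: an empty SSN never matches).
    ssn1 = r1.get("social_security_number")
    ssn2 = r2.get("social_security_number")
    if ssn1 and ssn1 == ssn2:
        return True
    # Rule 2: every condition must hold at once.
    return (soundex(r1["last_name"]) == soundex(r2["last_name"])
            and soundex(r1["first_name"]) == soundex(r2["first_name"])
            and edit_distance(r1["birth_place"], r2["birth_place"]) < 2
            and r1["birth_date"] == r2["birth_date"]
            and r1["multiplicity"] == r2["multiplicity"]
            and r1["birth_order"] == r2["birth_order"])
```

Because rule 2 is a conjunction, a single disagreeing field (for example a mistyped birth date) is enough to reject a pair that agrees on everything else.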

Page 19: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

IF contains(0..9)
   THEN NUMBER
ELSE IF contains(North, South, East, West)
   THEN DIRECTION
ELSE IF contains(Street, Road, Lane, Drive, ...)
   THEN STREET_TYPE
ELSE STREET_NAME

IF (r1.NUMBER = r2.NUMBER) AND (r1.DIRECTION = r2.DIRECTION) AND
   (r1.STREET_NAME = r2.STREET_NAME) AND (r1.STREET_TYPE = r2.STREET_TYPE)
   THEN MATCH
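The address rules above can be sketched as a small tokenizer followed by a component-wise comparison. The function names `parse_address` and `addresses_match` are hypothetical, and the street-type list contains only the four words shown on the slide (the slide's own list is open-ended).

```python
DIRECTIONS = {"north", "south", "east", "west"}
# Only the types named on the slide; a real list would be longer.
STREET_TYPES = {"street", "road", "lane", "drive"}


def parse_address(address: str) -> dict:
    """Assign each whitespace-separated token to one address component."""
    parts = {"number": [], "direction": [], "street_type": [], "street_name": []}
    for token in address.lower().replace(",", " ").split():
        if any(ch.isdigit() for ch in token):
            parts["number"].append(token)
        elif token in DIRECTIONS:
            parts["direction"].append(token)
        elif token in STREET_TYPES:
            parts["street_type"].append(token)
        else:
            parts["street_name"].append(token)
    return {name: " ".join(tokens) for name, tokens in parts.items()}


def addresses_match(a: str, b: str) -> bool:
    """Deterministic rule: all four parsed components must be equal."""
    return parse_address(a) == parse_address(b)
```

With these rules, "123 North Main Street" and "123 N Main St" do not match, because the abbreviations fall through to the street name. That brittleness is typical of purely deterministic rules.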

Page 20: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

Probabilistic Linkage

Each field given AGREEMENT and DISAGREEMENT weight

Weight proportional to the field’s DISCRIMINATION and RELIABILITY

Many more parameters, possibility of better matching
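As a rough sketch of how the weights are used: each field contributes its agreement weight when the two records agree on it and its disagreement weight when they do not, and the contributions are summed into a score for the pair. The weight values, the field names, and the function `pair_score` below are illustrative assumptions, not values from the talk.

```python
# Hypothetical (agreement, disagreement) weights per field.
FIELD_WEIGHTS = {
    "last_name": (4.0, -2.0),
    "first_name": (3.0, -1.5),
    "birth_date": (5.0, -3.0),
}


def pair_score(r1: dict, r2: dict, weights: dict = FIELD_WEIGHTS) -> float:
    """Sum the agreement or disagreement weight of every field."""
    total = 0.0
    for field, (agree, disagree) in weights.items():
        # Missing values are not handled specially in this sketch.
        total += agree if r1.get(field) == r2.get(field) else disagree
    return total


a = {"last_name": "DuVall", "first_name": "Scott", "birth_date": "1980-05-01"}
b = {"last_name": "DuVall", "first_name": "Scot", "birth_date": "1980-05-01"}
print(pair_score(a, b))  # 4.0 - 1.5 + 5.0 = 7.5
```

Unlike a deterministic rule, one disagreeing field lowers the score without ruling the pair out.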

Page 21: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

Record Matching

Understand your Data

+

Understand Mistakes in your Data

Good Strategy for Linkage

MANUAL REVIEW

Page 22: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

Understanding the Data

• Compare characteristics of records in the duplicate subset with records in the full enterprise data warehouse

• Describe instances where records in the duplicate subset are not typical of the database at large

• Provide considerations for others looking at duplicate records in master patient indexes

Page 23: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

Name discrepancy                     UUHSC    Friedman

Extra names and titles               34.3%    36.9%
Nicknames, spelling variations       21.8%    13.9%
One letter substitutions             13.6%    13.7%
One letter added or deleted           7.6%    12.9%
Punctuation or spaces                 1.9%    11.8%
Different last names for females     12.9%     7.8%
Permuted parts of names               3.2%     1.4%
Different first names                 2.8%     1.4%
One letter transposed                 1.9%     0.8%

Page 24: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

SSN discrepancy                      UUHSC    Grannis

Missing SSN                          52.4%    35%
Typographical errors                 62.7%    35.5%
Spouse (family) collisions           14.8%    47.5%
Unexplained collisions                9.9%    17%
All collisions (sum of the two)      24.7%    64.5%
Invalid SSN                          12.6%    0%

Page 25: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

Extension of the Probabilistic Model for Approximate Field Comparators

Page 26: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

Probabilistic Model

Field in Record A = Field in Record B: Agreement Weight

Field in Record A ≠ Field in Record B: Disagreement Weight

Page 27: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

M – probability that field matches in dup pair

U – probability that field matches in non-dup pair

Page 28: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

Agreement Weight = LOG(M / U)

Disagreement Weight = LOG((1 - M) / (1 - U))
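A short numeric sketch of the two weight formulas, using M and U as defined on the previous slide. The slide does not state the base of the logarithm; base 2 is assumed here, and the function names are hypothetical.

```python
import math


def agreement_weight(m: float, u: float) -> float:
    """LOG(M / U): positive when agreement is more likely among dups."""
    return math.log2(m / u)


def disagreement_weight(m: float, u: float) -> float:
    """LOG((1 - M) / (1 - U)): negative when M is greater than U."""
    return math.log2((1 - m) / (1 - u))


# Example: a field that agrees in 90% of dup pairs and 10% of non-dup pairs.
print(agreement_weight(0.9, 0.1))     # log2(9), about 3.17
print(disagreement_weight(0.9, 0.1))  # log2(1/9), about -3.17
```

A field with M equal to U carries no information: both weights are zero.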

Page 29: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

Field in Record A ≈ Field in Record B?

Page 30: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

Approximate Comparator

Edit Distance

ED( Johnathan, Jonathan ) = 1
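The edit distance in the example above can be computed with the standard dynamic-programming recurrence. This sketch assumes the Levenshtein variant (insertions, deletions, and substitutions each cost 1); the slide does not say which variant was used.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


print(edit_distance("Johnathan", "Jonathan"))  # 1 (delete the "h")
```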

Page 33: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

Approximate Comparator Weight

LOG(Mδ /Uδ)

Page 34: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

Mδ – probability that field approximately matches by δ in dup pair

Uδ – probability that field approximately matches by δ in non-dup pair
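One way to read these definitions is to estimate Mδ and Uδ from labelled pairs by counting how often each edit distance δ occurs among dup pairs and among non-dup pairs. The sketch below makes several assumptions that are not in the slides: δ is taken as the exact edit distance, the logarithm is base 2, additive smoothing is used to avoid dividing by zero, and the function name `approximate_weights` is hypothetical.

```python
import math
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def approximate_weights(dup_pairs, non_dup_pairs, smoothing=0.5):
    """Return {delta: log2(M_delta / U_delta)} for every observed delta."""
    dup_counts = Counter(edit_distance(a, b) for a, b in dup_pairs)
    non_counts = Counter(edit_distance(a, b) for a, b in non_dup_pairs)
    deltas = sorted(set(dup_counts) | set(non_counts))
    weights = {}
    for delta in deltas:
        # Smoothed relative frequency of this distance in each group.
        m = (dup_counts[delta] + smoothing) / (len(dup_pairs) + smoothing * len(deltas))
        u = (non_counts[delta] + smoothing) / (len(non_dup_pairs) + smoothing * len(deltas))
        weights[delta] = math.log2(m / u)
    return weights
```

Distances that occur mostly among dups receive a positive weight, and distances that occur mostly among non-dups receive a negative one, so a near-miss such as "Johnathan" and "Jonathan" can still add to the score.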

Page 35: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

Dups Non-Dups

Load and randomize training set

Classify with estimated parameters

Estimate Dups and Non-Dups

Update Parameters

Initial Parameters

Page 36: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

Dups Non-Dups

Load and randomize training set

Classify with updated parameters

Re-estimate Dups and Non-Dups

Update Parameters

Updated Parameters
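The loop in the two diagrams above (classify with the current parameters, re-estimate dups and non-dups, update the parameters, repeat) can be sketched as follows. This is a simplified hard-assignment version written for illustration, not necessarily the procedure used in the talk. The comparison-vector input (one 0/1 agreement flag per field), the zero threshold, the add-one smoothing, and all function names are assumptions.

```python
import math


def score(vector, m, u):
    """Sum of per-field agreement or disagreement weights (base-2 logs)."""
    total = 0.0
    for agrees, mi, ui in zip(vector, m, u):
        total += math.log2(mi / ui) if agrees else math.log2((1 - mi) / (1 - ui))
    return total


def agreement_rate(vectors, i):
    """Smoothed share of vectors agreeing on field i (stays inside 0..1)."""
    return (sum(v[i] for v in vectors) + 1) / (len(vectors) + 2)


def fit_parameters(vectors, m, u, threshold=0.0, max_iter=20):
    """Alternate between classifying pairs and updating M and U."""
    m, u = list(m), list(u)
    for _ in range(max_iter):
        # Classify with the current parameters.
        dups = [v for v in vectors if score(v, m, u) > threshold]
        non_dups = [v for v in vectors if score(v, m, u) <= threshold]
        # Update parameters from the estimated dups and non-dups.
        new_m = [agreement_rate(dups, i) for i in range(len(m))]
        new_u = [agreement_rate(non_dups, i) for i in range(len(u))]
        if new_m == m and new_u == u:
            break  # classification no longer changes the parameters
        m, u = new_m, new_u
    return m, u
```

The initial parameters must lie strictly between 0 and 1. The loop stops once an update leaves the parameters unchanged or after `max_iter` rounds.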

Page 37: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

Dups Non-Dups

Load and randomize validation set

Classify with training set parameters

Classified Dups and Non-Dups

Training Set Parameters

Page 43: Issues in Deterministic and Probabilistic Record Linkage Scott DuVall Salt Lake City VHA MC.

questions?