1
Probabilistic Linkage: Issues and Strategies
Craig A. Mason, Ph.D.
University of Maine
2
Faculty Disclosure Information
In the past 12 months, I have not had a significant financial interest or other relationship with the manufacturer(s) of the product(s) or provider(s) of the service(s) that will be discussed in my presentation.
This presentation will (not) include discussion of pharmaceuticals or devices that have not been approved by the FDA, or of unapproved or "off-label" uses of pharmaceuticals or devices.
3
Acknowledgements
• Shihfen Tu, Quansheng Song
• Keith Scott, Marygrace Yale, Tony Gonzalez
• Derek Chapman
4
Overview of Linkage Process
• Two databases containing information on some of the same individuals
[Diagram labels: Birth Certificates; EHDI Diagnostic Data]
5
Overview of Linkage Process
• Many births not in Diagnostic Data
[Diagram labels: Birth Certificates; EHDI Diagnostic Data]
6
Overview of Linkage Process
• Some entries in EHDI Diagnostic Data do not appear in Electronic Birth Certificates
[Diagram labels: Birth Certificates; EHDI Diagnostic Data]
7
Overview of Linkage Process
• Final linkage is a subset of each
[Diagram labels: Birth Certificates; EHDI Diagnostic Data]
8
Linkage Algorithms
• Deterministic
  – Exactly match on specified common fields
  – Easiest, quickest linkage strategy
  – Misconception that this is the “gold standard”

EHDI Data                     Birth Defects Registry
ID  First  Mid  Last          ID   First  Mid  Last
12  John   J    Dawson        382  John   J    Dawson
9
Linkage Algorithms
• Deterministic
  – May result in significant bias
    • Non-traditional spellings in African American names
  – Results in errors due to non-links
    • Many non-links can result in greater bias than a few erroneous pairings

EHDI Data                         Birth Defects Registry
ID  First    Mid  Last            ID   First    Mid  Last
9   Zbignew       Brezinsky       534  Zbignew  J    Brezinski
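
A minimal sketch of why an exact-match rule produces a non-link for the example pair. The function name `deterministic_match` and the dictionary field names are assumptions made for illustration; they do not come from the slides.

```python
def deterministic_match(rec_a, rec_b, fields=("first", "mid", "last")):
    """Link two records only if every listed field is exactly equal."""
    return all(rec_a[f] == rec_b[f] for f in fields)


# The example pair from the slide: the middle initial is missing in one
# source and the last name is spelled differently in the other.
ehdi = {"id": 9, "first": "Zbignew", "mid": "", "last": "Brezinsky"}
registry = {"id": 534, "first": "Zbignew", "mid": "J", "last": "Brezinski"}

linked = deterministic_match(ehdi, registry)
print(linked)  # the pair is not linked, although it is plausibly one person
```

Dropping `mid` from `fields` would not rescue this pair, because the last names still differ by one letter.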
10
Linkage Algorithms
• Probabilistic
  – Statistically estimate likelihood or odds that two records are for the same individual, even if they disagree on some fields
11
Linkage Algorithms
• Factors Impacting Probabilistic Linkage
  – Likelihood that a field would agree if a correct link
    • Good quality data counts more than poor quality data
  – Likelihood that fields would agree if not a correct link
    • Rare values count more than common values
  – Number of expected matches
• Much more complicated and expensive strategy
12
Implementing an Effective Data Linkage
[Cartoon: "Start" ... "Then a miracle occurs" ... "out". Caption: "Good work, but I think we might need just a little more detail right here."]
• Modified from Kim Church, Maine Genetics Program
13
Probabilistic Matching
• Probabilistic Matching: Two records are not required to match in all fields
  – Two records are compared on each of the specified fields.
  – A weight, w_i, is calculated for each field in a potential match, reflecting the strength of the agreement or disagreement

EHDI Data                         Birth Defects Registry
ID  First    Mid  Last            ID   First    Mid  Last
9   Zbignew       Brezinsky       534  Zbignew  J    Brezinski
Weights: w_1 (First), w_2 (Last)
14
Factors Influencing Likelihood of Match
• Reliability of data fields
  – Greater reliability results in increased odds of correct match
    • A match on a high-quality, reliably entered field is good
    • Not matching on a poor-quality field with lots of known data entry errors may not be a fatal error
  – If a field is pure noise, correct matches will be random across the databases
15
Factors Influencing Likelihood of Match
• Frequency of field values
  – The more common the value in a field, the greater the odds that the records will be erroneously matched
    • A match based on the name Zbignew is a relatively good indicator of a match, even if there may be disagreement in other fields
    • A match based on the name John may be of much less value, requiring matches on more fields in order to conclude two records are the same individual
• Number of expected matches one would obtain randomly
16
Calculating Match Weights
• Weight Calculation
  – M-probability
    • Probability that a field agrees if the pairing reflects a correct match
  – U-probability
    • Probability that a field agrees if the pairing reflects an incorrect match
    • Chance that a given field will agree randomly
    • Approximately = # records with a specific value / total # of records
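
The frequency approximation for the U-probability (records with a specific value divided by total records) could be sketched as follows. The helper name `estimate_u` and the list of first names are made up for illustration only.

```python
from collections import Counter


def estimate_u(values):
    """Approximate u for each distinct value as (count of value) / (total records)."""
    counts = Counter(values)
    total = len(values)
    return {value: n / total for value, n in counts.items()}


# Made-up field values: a common name, another common name, and a rare one.
first_names = ["John"] * 40 + ["Mary"] * 59 + ["Zbignew"]
u = estimate_u(first_names)

print(u["John"], u["Zbignew"])
```

A rare value such as Zbignew gets a much smaller u than a common value such as John, which is what makes agreement on it more informative.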
17
Probabilistic Matching
• If the field agrees, w_i is equal to

$$ w_i = \log_2\left(\frac{m_i}{u_i}\right) $$
18
Probabilistic Matching
– m_i for first name = .98: if it is a correct match, the first names will agree 98% of the time
– u_i for Zbignew = .00001: the probability of randomly getting two first names that are Zbignew

$$ w_1 = \log_2\left(\frac{m_1}{u_1}\right) = \log_2\left(\frac{.98}{.00001}\right) = 16.58049 $$
19
Probabilistic Matching
• In cases where two records disagree on a specified field, w_i is equal to

$$ w_i = \log_2\left(\frac{1 - m_i}{1 - u_i}\right) $$
20
Probabilistic Matching
– m_i for last name = .96: if it is a correct match, the last names will agree 96% of the time
– u_i for Brezinsky = .00003: the probability of randomly getting two last names that are Brezinsky

$$ w_2 = \log_2\left(\frac{1 - m_2}{1 - u_2}\right) = \log_2\left(\frac{1 - .96}{1 - .00003}\right) = -4.64381 $$
21
Calculating Match Weights
• A composite weight, w_t, is calculated for each pair of records
  – The sum of weights across all fields used in linkage
• A larger w_t suggests a correct match
• A smaller or negative w_t suggests an incorrect match

$$ w_t = \sum_{i=1}^{k} w_i $$

$$ w_t = 16.58049 - 4.64381 = 11.93668 $$
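
The agreement weight, the disagreement weight, and their sum can be reproduced in a few lines. This is a sketch using the m and u values quoted for the Zbignew example (first names agree, last names disagree); the function names are assumptions chosen for illustration.

```python
import math


def agreement_weight(m, u):
    """Weight when a field agrees: log2(m / u)."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Weight when a field disagrees: log2((1 - m) / (1 - u))."""
    return math.log2((1 - m) / (1 - u))


def field_weight(agrees, m, u):
    """Pick the agreement or disagreement weight for one field."""
    return agreement_weight(m, u) if agrees else disagreement_weight(m, u)


w1 = field_weight(True, 0.98, 0.00001)   # first name agrees (Zbignew)
w2 = field_weight(False, 0.96, 0.00003)  # last name disagrees (Brezinsky / Brezinski)
w_t = w1 + w2                            # composite weight: sum over fields

print(round(w1, 5), round(w2, 5), round(w_t, 5))
```

Agreement on a rare first name contributes a large positive weight, while disagreement on a usually reliable last name subtracts a smaller amount, so the composite weight stays well above zero.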
22
Blocking
• Match Determination
  – Could compare every record in one dataset with every record in the second dataset
    • Results in N1 x N2 comparisons
  – Blocking
    • Records are first “blocked” on a subset of fields for which a deterministic match is required.
    • Within each block, all records from one dataset are compared to all records from the other dataset
    • w_t is calculated for each of these possible pairings.
• The distribution of w_t values across all blocks is examined in order to determine a critical cut-off score necessary to classify two records as a match.
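
One way to sketch blocking is to index one dataset by the blocking key and compare each record in the other dataset only against records in the same block. The `dob` field, the record values, and the function name `candidate_pairs` below are hypothetical; the slides do not say which fields are used for blocking.

```python
from collections import defaultdict


def candidate_pairs(records_a, records_b, block_key):
    """Yield (a, b) pairs whose blocking key agrees exactly."""
    blocks = defaultdict(list)
    for rec_b in records_b:
        blocks[block_key(rec_b)].append(rec_b)
    for rec_a in records_a:
        # Only records in the same block are compared.
        for rec_b in blocks.get(block_key(rec_a), []):
            yield rec_a, rec_b


# Made-up records; "dob" is an assumed blocking field.
a = [
    {"id": 9, "last": "Brezinsky", "dob": "2004-03-01"},
    {"id": 12, "last": "Dawson", "dob": "2004-05-17"},
]
b = [
    {"id": 534, "last": "Brezinski", "dob": "2004-03-01"},
    {"id": 382, "last": "Dawson", "dob": "2004-05-17"},
    {"id": 700, "last": "Smith", "dob": "2004-05-17"},
]

pairs = list(candidate_pairs(a, b, lambda rec: rec["dob"]))
pair_ids = [(ra["id"], rb["id"]) for ra, rb in pairs]
print(pair_ids)
```

Here three pairings are scored instead of all 2 x 3 = 6, and the composite weight would then be calculated only for those candidates. The cost is that a true match with an error in the blocking field is never compared.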
23
[Figure: plot over "Wt for Pairings" (x-axis from -8 to 20; y-axis from 0 to 1.2)]
24
[Figure: plot over "Wt for Pairings" (x-axis from -8 to 20; y-axis from 0 to 2.5)]
25
Estimating Probabilities
• The total weight required for two records to have a probability, p, of being a match is equal to

$$ w_t = \log_2\left(\frac{p}{1 - p}\right) - \log_2\left(\frac{E}{N_1 N_2 - E}\right) $$

  – where p is the desired probability of a match,
  – E is the number of expected matches,
  – N_1 and N_2 are the number of records in each database, and
  – $\log_2\left(\frac{E}{N_1 N_2 - E}\right)$ is the base 2 log of the odds of a random match
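
A short sketch of the cut-off calculation, reading the formula as w_t = log2(p / (1 - p)) - log2(E / (N1 N2 - E)). The function name `required_weight` and the database sizes and expected-match count are made-up values for illustration.

```python
import math


def required_weight(p, expected_matches, n1, n2):
    """Total weight needed for a pair to reach match probability p."""
    # Odds that a randomly chosen pairing is a true match.
    random_match_odds = expected_matches / (n1 * n2 - expected_matches)
    return math.log2(p / (1 - p)) - math.log2(random_match_odds)


# Hypothetical sizes: 10,000 and 2,000 records, 500 expected matches.
cutoff_95 = required_weight(0.95, 500, 10_000, 2_000)
cutoff_99 = required_weight(0.99, 500, 10_000, 2_000)
print(cutoff_95, cutoff_99)
```

Because the odds of a random match are small when the databases are large, the second term is a large negative number, so a substantial positive total weight is needed before a pairing becomes probable.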
26
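Estimating Probabilities
From this formula, it is possible to derive an equation for estimating the probability that any two records are a match

$$
x_i = \begin{cases}
\dfrac{E}{N_1 N_2 - E}, & i = 0 \quad \text{(odds of a random match)} \\[2ex]
\dfrac{m_i}{u_i}, & i > 0, \text{ if two fields agree} \\[2ex]
\dfrac{1 - m_i}{1 - u_i}, & i > 0, \text{ if two fields do not agree}
\end{cases}
$$

$$ p = \frac{\prod_{i=0}^{K} x_i}{1 + \prod_{i=0}^{K} x_i} $$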
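
The probability can be computed directly as a running product of odds, with no logs or powers involved. The sketch below reads the formula as p = (x_0 x_1 ... x_K) / (1 + x_0 x_1 ... x_K), where x_0 is the odds of a random match and each later x_i is m_i / u_i on agreement or (1 - m_i) / (1 - u_i) on disagreement. The function name and the database sizes are assumptions; the m and u values are the ones quoted for the Zbignew example.

```python
import math


def match_probability(agreements, m, u, expected_matches, n1, n2):
    """Probability that a pair is a match, from per-field agreement flags."""
    odds = expected_matches / (n1 * n2 - expected_matches)  # x_0
    for agrees, m_i, u_i in zip(agreements, m, u):
        odds *= (m_i / u_i) if agrees else (1 - m_i) / (1 - u_i)  # x_i
    return odds / (1 + odds)


# First names agree, last names disagree; sizes and E are hypothetical.
p = match_probability(
    agreements=[True, False],
    m=[0.98, 0.96],
    u=[0.00001, 0.00003],
    expected_matches=500,
    n1=10_000,
    n2=2_000,
)
print(p)
```

The result depends heavily on the assumed database sizes and expected matches: the same pair of field weights yields a different probability when the odds of a random match change.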
27
Notes
• Note that the probability equation is equivalent to a base-2 version of the logistic probability formula

$$ p = \frac{2^{w_0 + w_t}}{2^{w_0 + w_t} + 1}, \qquad w_0 = \log_2\left(\frac{E}{N_1 N_2 - E}\right) $$

• The computational formula avoids the need to repeatedly calculate powers of 2 and log2
  – This is due to the weights in the exponent themselves being a log-value
• The same probability is obtained using e and the natural log in place of 2 and log2 throughout
  – Base 2 results in improved computational speed
28
That’s nice, but …..
• All right. But apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, the fresh water system, and public health… What have the Romans ever done for us?
--- Reg, spokesman for the People’s Front of Judea
Monty Python, Life of Brian
(and Martin White, UC Berkeley)