Overview of Link Plus Probabilistic Record Linkage Software

35
Overview of Link Plus Probabilistic Record Linkage Software

description

Overview of Link Plus Probabilistic Record Linkage Software. Acknowledgements. Slides adapted from training materials developed by CDC–NPCR Faculty: Melissa Jim, CDC/IHS [email protected] David Espey, CDC/IHS [email protected] CDC/Link Plus development and training: - PowerPoint PPT Presentation

Transcript of Overview of Link Plus Probabilistic Record Linkage Software

Page 1: Overview of Link Plus  Probabilistic  Record Linkage Software

Overview of Link Plus Probabilistic

Record Linkage Software

Page 2: Overview of Link Plus  Probabilistic  Record Linkage Software

AcknowledgementsSlides adapted from training materials developed by CDC–NPCR Faculty:•Melissa Jim, CDC/IHS [email protected]•David Espey, CDC/IHS [email protected]

CDC/Link Plus development and training:•Kathleen Thoburn•David Gu

Adapted by:•Megan Hoopes, NW Tribal Epidemiology Center [email protected]

Page 3: Overview of Link Plus  Probabilistic  Record Linkage Software

Link Plus Software

• Stand-alone probabilistic record linkage program

• Combines ease of use and statistical sophistication

• Detects duplicates within a single database, or links 2 database files

• Supports North American Association of Central Cancer Registries files, fixed width files, delimited files, and CRS Plus database

Page 4: Overview of Link Plus  Probabilistic  Record Linkage Software

Link Plus Software

• Can handle missing values of matching variables– automatically treats null or empty values as missing data

and allows user to indicate additional values to be treated as missing data

• Facilitates blocking ("OR blocking") by indexing the variables and comparing the pairs with the identical values on at least one of those variables

• Provides support for manual review of uncertain matches

Page 5: Overview of Link Plus  Probabilistic  Record Linkage Software

Link Plus Is Free

$0.00

Page 6: Overview of Link Plus  Probabilistic  Record Linkage Software

Link Plus Is Easy To Use

Link Plus gets you from HERE:

Last name First Name Site SSN DOB Sex DateDx

SMITH JOHN C619 123654789 02111934 1 06152004

Last name First Name DOB Date of Death COD Death Cert #

SMITH JOHN 02011934 03202006 123654789 01234

Cancer Registry data for John Smith:

Vital Statistics data for John Smith:

Page 7: Overview of Link Plus  Probabilistic  Record Linkage Software

Link Plus Is Easy To Use

To HERE:

Last name First Name Site SSN DOB Sex DateDx Death Date COD Death Cert #

SMITH JOHN C619 123654789 02011934 1 06152004 03202006 C100 01234

Linked data for John Smith:

Page 8: Overview of Link Plus  Probabilistic  Record Linkage Software

Link Plus Is Easy To Use

Without having to go HERE:

Page 9: Overview of Link Plus  Probabilistic  Record Linkage Software

Link Plus Is Easy To Use

• Designed especially for cancer registry work– HOWEVER, can be used with any data

• Mathematics largely hidden from user

• Practical default values supplied for many tasks

• Familiar Windows interface

• Includes Help and test examples

Page 10: Overview of Link Plus  Probabilistic  Record Linkage Software

UsingLink Plus

Page 11: Overview of Link Plus  Probabilistic  Record Linkage Software

Obtaining/UpdatingLink Plus

1. Go to NPCR Home Page:http://www.cdc.gov/cancer/npcr

2. In the ‘Software and Tools’ Section - click on Registry Plus

3. Under ‘Registry Plus Components’ - click on Link Plus - Download

Page 12: Overview of Link Plus  Probabilistic  Record Linkage Software

Getting Started • Make sure you know your data!

• Review and clean data files

• Frequency distributions very helpful

• Look for errant values

– e.g. - DOB day component = 16

Page 13: Overview of Link Plus  Probabilistic  Record Linkage Software

Data Cleaning Tips • Last Name

– Link Plus automatically cleans punctuation and strips off suffices, numbers III

• First Name– May find Dr. Bill or Rev Bill or Sister Mary– Remove prefix in First Name field

• Middle Name – Link Plus automatically cleans numbers, weird symbols– Link Plus accounts for the switching of first and middle names – NMI-no middle initial or NMN-no middle name

• DOB – Review day, mo, yyyy component– Replace errant values with missing

• Sex– Make sure files use same coding convention; M, F, or Blank OR 1, 2,

9

Page 14: Overview of Link Plus  Probabilistic  Record Linkage Software

Linkage Overview

Two main types of linkage:• Linkage of 2 files

– Probabilistically link one file to another file

• Deduplication– Special case of record linkage

– Records in the same file are blocked, compared, and scored against each other

– Result is a ranked list of record pairs

– High-scoring pairs may be duplicates

Page 15: Overview of Link Plus  Probabilistic  Record Linkage Software

Linkage configuration screen

Page 16: Overview of Link Plus  Probabilistic  Record Linkage Software

File ImportFile 1 versus File 2

• File 1 and File 2 have a one to many relationship– File 1 (ONE); File 2 (MANY)

• Death record is File 1; cancer registry is File 2:– One John Smith in death record file can link to many John

Smith’s in cancer registry file

• Cancer registry is File 1; death record is File 2– One John Smith in cancer registry file can link to many John

Smith’s in death record file

Page 17: Overview of Link Plus  Probabilistic  Record Linkage Software

File Layout

• After designating File 1 and File 2, specify file layout for each file

• Link Plus reads .txt files:– Fixed width– Comma delimited– Tab delimited

• Link Plus provides view of first 20 records of each input file– Verify that data is being read in properly

Page 18: Overview of Link Plus  Probabilistic  Record Linkage Software

Blocking Variables• Exact matches

• Blocks of data to compare variables within

• Common blocking variables are:– Last Name– First Name– Social Security Number – Date of Birth

Page 19: Overview of Link Plus  Probabilistic  Record Linkage Software

Phonetic Systems• Phonetic coding involves coding a string based on how it

is pronounced

• Link Plus offers a choice of 2 Phonetic Coding Systems:

Soundex

– Code for a name consisting of a letter followed by three numbers: the letter is the first letter of the name, and the numbers encode the remaining consonants

– Reduces matching problems due to different spellings

– Simple and fast

Page 20: Overview of Link Plus  Probabilistic  Record Linkage Software

Phonetic SystemsNew York State Identification and IntelligenceSystem (NYSIIS)

• Maps similar phonemes to the same letter; maintains relative vowel positioning

• String can be pronounced by the reader without decoding • Improvement to the Soundex algorithm

�More distinctive; people are more likely to have the same Soundex than the same NYSIIS

�Reported accuracy increase of 2.7% over Soundex �Studies suggest NYSIIS performs better than Soundex when Spanish

names are used

• However, Soundex may bring more pairs for comparison when it used for blocking

Page 21: Overview of Link Plus  Probabilistic  Record Linkage Software

Matching Variables• Up to 10 fields may be selected for matching

• Recommended variables (Matching Methods):– Name--Last (LastName)– Name--First (FirstName)– Name--Middle (MiddleName)– Sex (Exact)– Race (Exact)– Birth Date (Date)– Social Security Number (SSN)

Page 22: Overview of Link Plus  Probabilistic  Record Linkage Software

Matching Methods• Exact

– Case insensitive character-for-character string comparison method

– Results are either yes or no

• Generic String– Uses edit distance function (Levenshtein distance) to compute

the similarity of two long strings • Minimum number of operations (insertion, deletion, or substitution of a

single character) needed to transform one string into the other

• Last Name/First Name– Incorporate both partial matching and value-specific matching

to account for minor typographical errors, misspellings, and hyphenated names

Page 23: Overview of Link Plus  Probabilistic  Record Linkage Software

Matching Methods• SSN

– Specifically for Social Security Number– Incorporates partial matching to account for typographical

errors and transposition of digits

• Date– Incorporates partial matching to account for missing month

values and/or day values

• Middle Name– Accounts for occurrence of the middle initial only versus the full

middle name

Page 24: Overview of Link Plus  Probabilistic  Record Linkage Software

Matching Methods• Value-Specific (Frequency-Based)

– Intended for advanced users

– Sets weights for matching values based on the frequencies of values in the two files being compared

– A match on a frequent value is associated with a low weight, while a match on a rare value is associated with a high weight

– In a file with a high proportion of records with a white race (value of 01), a match on value 01 would be weighted lower than a match on the value 03 (American Indian)

Page 25: Overview of Link Plus  Probabilistic  Record Linkage Software

Missing Values

• Automatically treats null or empty values as missing data for Matching Variables

• Allows user to indicate additional values which are to be treated as missing data by the program

• Specify date format on the missing value grid when the Date Matching method is applied to a matching variable

Page 26: Overview of Link Plus  Probabilistic  Record Linkage Software

Missing Values• Specify date format on the missing value grid

Page 27: Overview of Link Plus  Probabilistic  Record Linkage Software

Direct Method• "Direct Method" refers to the method used to derive

the M-Probabilities used in linkage– Probability that a matching variable agrees given that a

comparison pair is a match

• Click if you prefer to use the default M-Probabilities or user-defined M-Probabilities

• Click on Advanced to view the default M-Probabilities or define your own

• Un-click if you prefer Link Plus to compute the M-Probabilities based on your data using the EM algorithm

Page 28: Overview of Link Plus  Probabilistic  Record Linkage Software

EM Algorithm• Expectation-Maximization algorithm

• Method for maximum likelihood estimation in problems involving incomplete data

• Unclick Direct Method if you would like Link Plus to compute the M-Probabilities based on your data– Estimates reflect the characteristics of the data dynamically

• Very popular computational method

• Basic principle is to derive a solution in a complicated case from a corresponding solution in a simple case

Page 29: Overview of Link Plus  Probabilistic  Record Linkage Software

Cut Off Value• The score value above which comparison pairs are accepted as

potential links and presented for review • Value should always be positive • Initial value of around 7 recommended when using the

recommended Matching Variables• Run linkage, and quickly review potential matches to identify

upper and lower cut off scores– Upper cut off = where do “perfect” matches end?– Lower cut off = where do non-matches begin?Example:– Solid matches somewhere between 20-30 – Non-matches 10-20 – Solid = 27 and higher, and Non-matches = 17 and lower– Gray Area = Greater than 17 and less than 27

Page 30: Overview of Link Plus  Probabilistic  Record Linkage Software

Manual Review of Uncertain Matches

Page 31: Overview of Link Plus  Probabilistic  Record Linkage Software

Manual Review of Uncertain Matches

• Once Cut Off scores have been identified

Page 32: Overview of Link Plus  Probabilistic  Record Linkage Software

Manual Review of Uncertain Matches

• Manual review screen provides special sorting that always keeps the comparison pairs together

• Fields that are not on the linkage report can be included for manual review

• Users can switch between the two viewing modes: the Datasheet View and the Pair view

• Match status may be assigned manually, or may be assigned automatically by match score

Page 33: Overview of Link Plus  Probabilistic  Record Linkage Software

Manual Review of Uncertain Matches

• Includes the option to restrict view to only uncertain matches

• Individual columns may be sorted, hidden and unhidden from view, and re-sized

• Order of the columns on the review screen may be modified

• Review sessions may be saved and re-opened at a later time

• Allows two reviews to be compared so that the difference can be resolved into a final review file

Page 34: Overview of Link Plus  Probabilistic  Record Linkage Software

Tips for Manual Review • Focus initially on SSN and DOB

– Names have a lot of issues (spelling and spacing) • Once matches on SSN go away, pay attention to DOB, name, and sex

– First and middle name switches are common• Use race & address variables if available• First time you start doubting if a pair is a match

– Score will be upper cut off score– Anything above is considered a match

• When start to see junk– Score will be lower cut off score– Anything below is considered a non-match

• Keep an eye out for– Husbands and wives matching (SSN’s match/sex different))– Brothers, sisters, and twins (LN match, SSN off by 1)

• Two people should review so that results can be combined and resolved• These are suggestions - need to know your own data

Page 35: Overview of Link Plus  Probabilistic  Record Linkage Software

Helpful Link Plus Tips• With real data Link Plus may take awhile to read data

files

• Be patient - Linkage times vary – IHS Linkage (2.4 million records)

• Wait (but not days) – something went wrong– Can run out of virtual memory

• Takes a lot of CPU– Shutdown and restart computer right before linkage to clear

up as much space as possible prior to linkage – Turn off screen saver/Close all other programs

With Cancer Registry Data:• Oregon: 36 min/94,172 recs• California: 13 hrs/1,935,255 recs

With VS Data:• Oregon: 2 hrs/400,000 recs• Michigan: 6.5 hrs/1,207,000 recs