CONCEPTS AND TECHNIQUES FOR RECORD LINKAGE, ENTITY RESOLUTION, AND DUPLICATE DETECTION BY PETER CHRISTEN PRESENTED BY JOSEPH PARK Data Matching

Page 1:

CONCEPTS AND TECHNIQUES FOR RECORD LINKAGE, ENTITY RESOLUTION, AND DUPLICATE DETECTION

BY PETER CHRISTEN

PRESENTED BY JOSEPH PARK

Data Matching

Page 2:

Introduction

“Data matching is the task of identifying, matching, and merging records that correspond to the same entities from several databases”

Also known as:
- Record or data linkage
- Entity resolution
- Object identification
- Field matching

Page 3:

Aims & Challenges

Three tasks:
- Schema matching
- Data matching
- Data fusion

Challenges:
- Lack of unique entity identifiers and data quality
- Computational complexity
- Lack of training data (e.g. gold standards)
- Privacy and confidentiality (health informatics & data mining)

Page 4:

Overview of Data Matching

Five major steps:
1. Data pre-processing
2. Indexing
3. Record pair comparison
4. Classification
5. Evaluation

Page 5:

Diagram

Page 6:

Data Pre-processing

- Remove unwanted characters and words
- Expand abbreviations and correct misspellings
- Segment attributes into well-defined and consistent output attributes
- Verify the correctness of attribute values
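
The first three pre-processing tasks can be sketched in a few lines of Python. This is only an illustration: the abbreviation table, the function names `clean` and `segment_address`, and the two output attributes are assumptions made for the example, and the cleaning rule assumes plain ASCII input. Verifying attribute values against a reference source is not shown.

```python
import re

# Hypothetical abbreviation table; a real system would use curated lookup
# tables for names, street types, and so on.
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue", "apt": "apartment"}


def clean(value: str) -> str:
    """Lowercase, remove unwanted characters, and expand abbreviations."""
    value = value.lower()
    value = re.sub(r"[^a-z0-9\s]", " ", value)  # keep ASCII letters, digits, spaces
    tokens = [ABBREVIATIONS.get(token, token) for token in value.split()]
    return " ".join(tokens)


def segment_address(value: str) -> dict:
    """Split a cleaned address into two assumed output attributes."""
    tokens = clean(value).split()
    number = tokens[0] if tokens and tokens[0].isdigit() else ""
    rest = tokens[1:] if number else tokens
    return {"street_number": number, "street_name": " ".join(rest)}


print(segment_address("42 Main St."))
```

Keeping cleaning and segmentation as separate functions lets the same `clean` step be reused for other attributes such as names.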

Page 7:

Example of Data Pre-processing

Page 8:

Indexing

- Reduces computational complexity
- Generates candidate record pairs
- Common technique: blocking
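
A minimal blocking sketch, assuming a blocking key built from the first letter of the surname plus the postcode. The key definition, the sample records, and the function names are hypothetical and chosen only to show how blocking shrinks the set of candidate pairs.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample records.
records = [
    {"id": 1, "surname": "smith", "postcode": "2600"},
    {"id": 2, "surname": "smyth", "postcode": "2600"},
    {"id": 3, "surname": "jones", "postcode": "2913"},
    {"id": 4, "surname": "smith", "postcode": "2913"},
]


def blocking_key(record: dict) -> str:
    # Assumed key: first letter of the surname followed by the postcode.
    return record["surname"][:1] + record["postcode"]


def candidate_pairs(records: list) -> set:
    """Group records by blocking key and pair up records within each block."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


all_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(candidate_pairs(records))} candidate pair(s) instead of {all_pairs}")
```

The trade-off is that records with different keys are never compared, so a true match such as records 1 and 4 (same surname, different postcode) is missed under this particular key.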

Page 9:

Example of Blocking

Page 10:

Record Pair Comparison

Comparison vector: a vector of numerical similarity values, one per compared attribute
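
As a rough illustration, a comparison vector can be built by applying one similarity function per attribute. The attribute names and the choice of comparators here are assumptions; the approximate string comparator uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for the comparison functions discussed in the book.

```python
from difflib import SequenceMatcher


def string_sim(a: str, b: str) -> float:
    """Approximate string similarity in [0, 1] (stand-in comparator)."""
    return SequenceMatcher(None, a, b).ratio()


def exact_sim(a: str, b: str) -> float:
    """Exact comparison: 1.0 on agreement, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


# Hypothetical attributes, each paired with the comparator used for it.
COMPARATORS = [
    ("given_name", string_sim),
    ("surname", string_sim),
    ("postcode", exact_sim),
]


def comparison_vector(r1: dict, r2: dict) -> list:
    """Return one numerical similarity value per compared attribute."""
    return [compare(r1[attr], r2[attr]) for attr, compare in COMPARATORS]


a = {"given_name": "gail", "surname": "smith", "postcode": "2600"}
b = {"given_name": "gayle", "surname": "smith", "postcode": "2601"}
print(comparison_vector(a, b))
```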

Page 11:

Example of Record Pair Comparison

Page 12:

Jaro and Winkler String Comparison

Jaro: Combines edit distance and q-gram based comparison

Winkler: Increases Jaro similarity for up to four agreeing initial characters
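
The two comparators can be sketched as follows. This follows the commonly published definitions of Jaro and Jaro-Winkler similarity; the scaling factor `p=0.1` is the conventional default and is an assumption here, since the slide does not state it.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: common characters within a window, minus transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk both strings' matched characters in order and count mismatches.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2  # number of transpositions
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Boost the Jaro similarity for up to max_prefix agreeing initial characters."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * p * (1.0 - sim)


print(round(jaro("martha", "marhta"), 4), round(jaro_winkler("martha", "marhta"), 4))
```

For the pair "martha" and "marhta", all six characters match with one transposition, and the shared prefix "mar" raises the Winkler score above the plain Jaro score.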

Page 13:

Record Pair Classification

- Two-class or three-class classification:
  - Match or non-match
  - Match, non-match, or potential match (requires clerical review)
- Supervised and unsupervised
- Active learning
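
A three-class decision can be sketched with two thresholds on the summed comparison vector. The threshold values and the function name `classify` are hypothetical; in practice the thresholds would be tuned for the data at hand.

```python
def classify(vector: list, lower: float = 1.5, upper: float = 2.5) -> str:
    """Classify a comparison vector by its summed similarity.

    The lower and upper thresholds are illustrative assumptions.
    """
    total = sum(vector)
    if total >= upper:
        return "match"
    if total <= lower:
        return "non-match"
    return "potential match"  # left for clerical review


for vector in ([1.0, 1.0, 0.9], [1.0, 0.8, 0.0], [0.2, 0.1, 0.0]):
    print(vector, "->", classify(vector))
```

Setting `lower` equal to `upper` collapses this into the two-class case, with no pairs sent to clerical review.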

Page 14:

Example of Record Pair Classification

Page 15:

Unsupervised Classification

- Threshold-based classification
- Probabilistic classification
- Cost-based classification
- Rule-based classification
- Clustering-based classification

Page 16:

Probabilistic Classification

- Three-class based
- Different weights assigned to different attributes
  - Newcombe & Kennedy: cardinalities
- Comparison vectors, binary comparison
- Conditionally independent attributes assumed
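
One common way to realise these points is the Fellegi-Sunter style of weighting, sketched below under stated assumptions: the m- and u-probabilities, the thresholds, and the attribute names are invented for illustration and are not taken from the slides. Because attributes are assumed conditionally independent, the per-attribute log weights are simply added. An attribute with few possible values, such as gender, agrees by chance more often, so its agreement weight is small.

```python
import math

# Hypothetical probabilities per attribute:
#   m = P(attribute agrees | the two records are a match)
#   u = P(attribute agrees | the two records are a non-match)
PROBS = {
    "surname": (0.95, 0.01),
    "postcode": (0.90, 0.05),
    "gender": (0.98, 0.50),  # two possible values, so chance agreement is high
}


def match_weight(agreement: dict) -> float:
    """Sum log2 agreement/disagreement weights over a binary comparison vector."""
    total = 0.0
    for attr, agrees in agreement.items():
        m, u = PROBS[attr]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def classify_weight(weight: float, lower: float = 0.0, upper: float = 8.0) -> str:
    """Three-class decision; both thresholds are illustrative assumptions."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "potential match"


pair = {"surname": True, "postcode": False, "gender": True}
weight = match_weight(pair)
print(round(weight, 2), classify_weight(weight))
```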

Page 17:

Formulae

Page 18:

Example of Probabilistic Classification

Page 19:

Active Learning

1. Trains a model with a small set of seed data
2. Classifies comparison vectors not in the training set as matches or non-matches
3. Asks users for help on the vectors that are most difficult to classify
4. Adds the manually classified vectors to the training data set
5. Trains the next, improved classification model
6. Repeats until the stopping criteria are met
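
The loop above can be sketched as follows. The model is deliberately trivial (a single threshold on the summed similarity), and the `oracle` function stands in for the human reviewer; both, along with `batch_size` and `max_rounds`, are assumptions made for the example. The seed set is assumed to contain at least one match and one non-match.

```python
def train(labelled: list) -> float:
    """Fit a threshold halfway between the mean match and non-match scores."""
    match_scores = [sum(v) for v, is_match in labelled if is_match]
    non_scores = [sum(v) for v, is_match in labelled if not is_match]
    mean_match = sum(match_scores) / len(match_scores)
    mean_non = sum(non_scores) / len(non_scores)
    return (mean_match + mean_non) / 2


def predict(vector: list, threshold: float) -> bool:
    """Classify a comparison vector as a match (True) or non-match (False)."""
    return sum(vector) >= threshold


def active_learning(seed, unlabelled, oracle, batch_size=2, max_rounds=5):
    labelled = list(seed)
    pool = list(unlabelled)
    threshold = train(labelled)
    for _ in range(max_rounds):  # stopping criterion: round limit or empty pool
        if not pool:
            break
        # Most difficult vectors are those closest to the decision threshold.
        pool.sort(key=lambda v: abs(sum(v) - threshold))
        queries, pool = pool[:batch_size], pool[batch_size:]
        labelled += [(v, oracle(v)) for v in queries]  # manual classification
        threshold = train(labelled)  # next, improved model
    return threshold, labelled


seed = [([1.0, 1.0], True), ([0.0, 0.1], False)]
unlabelled = [[0.9, 0.8], [0.5, 0.6], [0.2, 0.1], [0.6, 0.3]]
threshold, labelled = active_learning(seed, unlabelled, oracle=lambda v: sum(v) >= 1.0)
print(round(threshold, 3), len(labelled))
```

Querying the vectors nearest the threshold is one simple notion of "most difficult"; other uncertainty measures would slot into the same loop.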