Record Linkage/ Duplicate Elimination
Sunita Sarawagi, [email protected]

Transcript of the presentation:

Page 1:

Record Linkage/

Duplicate Elimination

Sunita Sarawagi

[email protected]

Page 2:

The de-duplication problem

Given a list of semi-structured records, find all records that refer to the same entity.

Example applications:
- Data warehousing: merging name/address lists. Entity: (a) person, (b) household
- Automatic citation databases (Citeseer): references. Entity: paper

Page 3:

Example

Link together people from several data sources: taxpayers, land records, passport, transport, telephone.

Duplicates:
SOUZA ,D ,D ,,GORA VILLA ,,VIMAN NAGAR ,,,411014 ,
DERYCK ,D ,SOZA ,03 ,GERA VILLA ,,VIMAN NAGAR PUNE ,411014

Non-duplicates:
CHAFEKAR ,RAMCHANDRA ,DAMODAR ,SHOP 8 ,H NO 509 NARAYAN PETH PUNE 411030
CHITRAV ,RAMCHANDRA ,D ,FLAT 5 ,H NO 2105 SADASHIV PETH PUNE 411 030

Page 4:

Challenges

- Errors and inconsistencies in the data
- Spotting duplicates can be hard, as they may be spread far apart and may not be group-able using obvious keys
- The problem is domain-specific: existing manual approaches require retuning with every new domain

Page 5:

How are Such Problems Created?

- Human factors: incorrect data entry; ambiguity during data transformations
- Application factors: erroneous applications populating databases; faulty database design (constraints not enforced)
- Obsolescence: the real world is dynamic

Page 6:

Database level linkage

[Figure: probabilistic links across sources: a patent database (XML) with Inventor, Title, Assignee, Abstract; computer science publications (DBLP) with Authors, Titles; medical publications with Authors, Titles, Abstracts. Inventor lists are linked to author lists.]

Exploit the various kinds of information in decreasing order of information content and increasing order of computation overhead.

Page 7:

Multi-table data

- Parts(part-id, description, importance, replacement frequency)
- Replacement(part-id, ship-id, date)
- Orders(part-id, vendor-id, order date, price)
- Trip-log(ship-id, trip start date, trip length, destination)
- Vendor(vendor-id, name, address, joining-date)
- Ship(ship-id, name, weight, capacity)

Duplicates can be in all of the interlinked tables. Goal: resolving them simultaneously could be better.

Page 8:

Outline

- Part I: Motivation, similarity measures (90 min): data quality, applications; linkage methodology, core measures; learning core measures; linkage-based measures
- Part II: Efficient algorithms for approximate join (60 min)
- Part III: Clustering/partitioning algorithms (30 min)

Page 9:

String similarity measures

- Token-based. Examples: Jaccard, TF-IDF cosine similarity. Suitable for large documents.
- Character-based. Examples: edit distance and variants like Levenshtein and Jaro-Winkler; Soundex. Suitable for short strings with spelling mistakes.
- Hybrids

Page 10:

Token-based

Tokens/words: 'AT&T Corporation' -> {'AT&T', 'Corporation'}

Similarity: various measures of the overlap of two sets S, T:
Jaccard(S,T) = |S ∩ T| / |S ∪ T|

Example:
S = 'AT&T Corporation' -> {'AT&T', 'Corporation'}
T = 'AT&T Corp' -> {'AT&T', 'Corp'}
Jaccard(S,T) = 1/3

Variants attach weights to each token. Useful for large strings, e.g., web documents.
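As a sketch, token-set Jaccard takes a few lines of Python (whitespace tokenization is an assumption here; real systems normalize case and punctuation first):

```python
def jaccard(s: str, t: str) -> float:
    """Jaccard similarity of the whitespace-token sets: |S intersect T| / |S union T|."""
    S, T = set(s.split()), set(t.split())
    return len(S & T) / len(S | T) if S | T else 0.0

# One shared token ('AT&T') out of three distinct tokens
print(jaccard("AT&T Corporation", "AT&T Corp"))  # prints 0.3333333333333333
```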

Page 11:

Cosine similarity with TF-IDF weights

- Sets are transformed to vectors, with each term as a dimension
- Similarity: dot-product of the two vectors, each normalized to unit length, i.e., the cosine of the angle between them
- Term weight = TF-IDF = log(tf + 1) * log(idf), where
  tf: frequency of the term in document d
  idf: number of documents / number of documents containing the term
- Intuitively, rare terms are more important
- Widely used in traditional IR

Example: 'AT&T Corporation' vs 'AT&T Corp' or 'AT&T Inc': low weights for 'Corporation', 'Corp', 'Inc'; higher weight for 'AT&T'.
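A minimal sketch of TF-IDF-weighted cosine similarity, following the slide's log(tf+1)*log(idf) weighting (the tokenizer and the tiny three-document corpus are illustrative assumptions):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight each term by log(tf + 1) * log(N / df), then normalize to unit length."""
    N = len(docs)
    token_lists = [d.lower().split() for d in docs]
    df = Counter()
    for toks in token_lists:
        df.update(set(toks))
    vectors = []
    for toks in token_lists:
        tf = Counter(toks)
        v = {w: math.log(c + 1) * math.log(N / df[w]) for w, c in tf.items()}
        norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
        vectors.append({w: x / norm for w, x in v.items()})
    return vectors

def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    return sum(x * v.get(w, 0.0) for w, x in u.items())

docs = ["AT&T Corporation", "AT&T Corp", "AT&T Inc"]
vecs = tfidf_vectors(docs)
# 'at&t' occurs in every document, so its idf weight is log(1) = 0;
# any similarity must come from the rarer 'corporation'/'corp'/'inc' tokens.
```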

Page 12:

Edit Distance [G98]

Given two strings S, T: edit(S,T) is the minimum-cost sequence of operations transforming S into T. Character operations: I (insert), D (delete), R (replace).

Example: edit(Error, Eror) = 1; edit(great, grate) = 2.

A folklore dynamic programming algorithm computes edit() in O(m^2), versus O(2m log m) for token-based measures. Several variants exist (gaps, weights); the problem becomes NP-complete easily. Varying costs of operations can be learnt [RY97].

Observations: suitable for common typing mistakes in short strings (Comprehensive vs Comprenhensive); problematic for specific domains.
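The folklore dynamic program, sketched in Python with unit costs (one rolling row of the table keeps memory linear):

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of insert/delete/replace operations turning s into t,
    via the classic O(|s|*|t|) dynamic program."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # replace (or match)
        prev = cur
    return prev[-1]

print(edit_distance("Error", "Eror"))   # prints 1
print(edit_distance("great", "grate"))  # prints 2
```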

Page 13:

Edit distance with affine gaps

Differences between 'duplicates' are often due to abbreviations or whole-word insertions: IBM Corp. is closer to ATT Corp. than to IBM Corporation; John Smith vs John Edward Smith vs John E. Smith.

Allow sequences of mismatched characters (gaps) in the alignment of the two strings. Penalty: the affine cost model Cost(g) = s + e*l, where s is the cost of opening a gap, e the cost of extending it, and l the length of the gap.

A similar dynamic programming algorithm applies. The parameters are domain-dependent and learnable, e.g., [BM03, MBP05].

Page 14:

Approximate edit distance: the Jaro rule

Given strings s = a1,…,aK and t = b1,…,bL: a character ai in s is common with t if there is a bj in t such that ai = bj and i - H ≤ j ≤ i + H, where H = min(|s|,|t|)/2. Let s' = a'1,…,a'K' and t' = b'1,…,b'L' be the characters of s (respectively t) that are common with t (respectively s), and let Ts',t' be the number of transpositions between s' and t'. Then

Jaro(s,t) = (1/3) * ( |s'|/|s| + |t'|/|t| + (|s'| - 0.5*Ts',t') / |s'| )

Examples:
- Martha vs Marhta: H = 3, s' = Martha, t' = Marhta, Ts',t' = 2, so Jaro(Martha, Marhta) = (1 + 1 + 5/6)/3 ≈ 0.94
- Jonathan vs Janathon: H = 4, s' = jnathn, t' = jnathn, Ts',t' = 0, so Jaro(Jonathan, Janathon) = (6/8 + 6/8 + 1)/3 ≈ 0.83
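A sketch of the Jaro rule in Python, following the slide's matching window H = min(|s|,|t|)/2 (many production implementations use max(|s|,|t|)/2 - 1 instead):

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity with matching window H = min(|s|,|t|) // 2."""
    if not s or not t:
        return 0.0
    H = min(len(s), len(t)) // 2
    used = [False] * len(t)
    s_common = []
    for i, a in enumerate(s):
        for j in range(max(0, i - H), min(len(t), i + H + 1)):
            if not used[j] and t[j] == a:
                used[j] = True
                s_common.append(a)
                break
    if not s_common:
        return 0.0
    t_common = [b for j, b in enumerate(t) if used[j]]
    # transpositions: positions where the two common-character sequences disagree
    T = sum(a != b for a, b in zip(s_common, t_common))
    c = len(s_common)
    return (c / len(s) + c / len(t) + (c - 0.5 * T) / c) / 3

print(round(jaro("MARTHA", "MARHTA"), 3))  # prints 0.944
```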

Page 15:

Hybrids [CRF03]

Example: 'Edward, John' vs 'Jon Edwerd'. Let S = {a1,…,aK} and T = {b1,…,bL} be sets of terms, and let sim'() be some other (secondary) similarity function. Then

Sim(S,T) = (1/K) Σ_{i=1..K} max_{j=1..L} sim'(ai, bj)

Soft TF-IDF: let C(t,S,T) = {w ∈ S : ∃ v ∈ T with sim'(w,v) > t} and D(w,T) = max_{v ∈ T} sim'(w,v) for w ∈ C(t,S,T). Then

sTFIDF = Σ_{w ∈ C(t,S,T)} W(w,S) * W(w,T) * D(w,T)

Page 16:

Soundex Encoding

A phonetic algorithm that indexes names by their sound when pronounced in English. A code consists of the first letter of the name followed by three numbers; the numbers encode similar-sounding consonants:

- Remove all W, H
- Encode B, F, P, V as 1; C, G, J, K, Q, S, X, Z as 2; D, T as 3; L as 4; M, N as 5; R as 6
- Remove vowels
- Concatenate the first letter of the string with the first 3 numerals

Example: great and grate become G6EA3 and G6A3E, and then both G63.

More recent: Metaphone, Double Metaphone, etc.
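A sketch of classic Soundex in Python (assumes ASCII alphabetic input; codes are zero-padded to four characters, so 'great' and 'grate' both come out as G630):

```python
def soundex(name: str) -> str:
    """First letter + up to three digits for consonant groups, zero-padded."""
    groups = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
              **dict.fromkeys("DT", "3"), "L": "4",
              **dict.fromkeys("MN", "5"), "R": "6"}
    name = name.upper()
    code = name[0]
    prev = groups.get(name[0], "")
    for ch in name[1:]:
        if ch in "HW":               # H and W are skipped entirely
            continue
        digit = groups.get(ch, "")   # vowels get no digit and reset `prev`
        if digit and digit != prev:  # drop adjacent repeats of the same digit
            code += digit
        prev = digit
    return (code + "000")[:4]

print(soundex("great"), soundex("grate"))  # prints G630 G630
```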

Page 17:

Learning similarity functions

Per attribute:
- Term-based (vector space)
- Edit-based: learn the constants in character-level distance measures like Levenshtein distance. Useful for short strings with systematic errors (e.g., OCR output) or domain-specific errors (e.g., st. vs street)

Multi-attribute records:
- Useful when the relative importance of a match along different attributes is highly domain-dependent
- Example: a comparison-shopping website. A match on title is more indicative for books than for electronics; a difference in price is less indicative for books than for electronics

Page 18:

Machine learning approach

Given examples of duplicate and non-duplicate pairs, learn to predict whether a pair is a duplicate or not.

Input features: various kinds of similarity functions between attributes: edit distance, Soundex, n-grams on text attributes; absolute difference on numeric attributes. These capture domain-specific knowledge about comparing the data.

Page 19:

The learning approach

[Figure: labeled example pairs (Record 1 / Record 2: duplicate; Record 1 / Record 3: non-duplicate; Record 4 / Record 5: duplicate) are mapped through similarity functions f1, f2, …, fn into feature vectors such as (1.0, 0.4, …, 0.2; label 1) and (0.0, 0.1, …, 0.3; label 0). A classifier trained on these vectors is then applied to the feature vectors of pairs from an unlabeled list (Records 6-11) to predict their labels.]

[Figure: an example learned decision tree over similarity features, with internal tests on AuthorTitleNgrams (0.4), AuthorEditDist (0.8), YearDifference > 1, All-Ngrams (0.48), TitleIsNull < 1, and PageMatch (0.5), and leaves labeled Duplicate / Non-Duplicate.]

Page 20:

Experiences with the learning approach

- Too much manual search in preparing the training data
- Hard to spot challenging and covering sets of duplicates in large lists
- Even harder to find the close non-duplicates that capture the nuances

Active learning is a generalization of this: examine instances that are similar on one attribute but dissimilar on another.

Page 21:

The active learning approach

[Figure: as before, labeled pairs (Record 1 / Record 2: duplicate; Record 3 / Record 4: non-duplicate) are mapped through similarity functions f1, f2, …, fn and used to train a classifier. An active learner then picks the most informative unlabeled pairs (e.g., 0.7, 0.1, …, 0.6 and 0.3, 0.4, …, 0.4) for the user to label, and the newly labeled vectors are added to the training set.]

Page 22:

The ALIAS deduplication system

- Interactive discovery of the deduplication function using active learning
- Efficient active learning on large lists using novel indexing mechanisms
- Efficient application of the learnt function to large lists using a novel cluster-based evaluation engine and a cost-based optimizer

Page 23:

Experimental analysis

- 250 references from Citeseer: 32,000 pairs, of which only 150 are duplicates
- Citeseer's script used to segment references into author, title, year, page, and rest
- 20 text and integer similarity functions
- Average of 20 runs
- Default classifier: decision tree
- Initial labeled set: just two pairs

Page 24:

Benefits of active learning

- Active learning is much better than random selection: with only 100 actively chosen instances, 97% accuracy; random selection reaches only 30%
- Committee-based selection is close to optimal

Page 25:

Analyzing selected instances

- Fraction of duplicates among the selected instances: 44%, starting from only 0.5% in the data
- Is the gain due to the increased fraction of duplicates? Replacing the non-duplicates in the selected set with random non-duplicates yields only 40% accuracy!

Page 26:

Finding all duplicate pairs in large lists

Input: a large list of records R with string attributes.
Output: all pairs (S,T) of records in R that satisfy a similarity criterion, e.g.:
- Jaccard(S,T) > 0.7
- Overlapping tokens(S,T) > 5
- TF-IDF cosine(S,T) > 0.8
- Edit distance(S,T) < k

More complicated similarity functions use these as filters (high recall, low precision).

Naive method: compute the similarity score for each record pair. This is I/O- and CPU-intensive, and not scalable to millions of records.

Goal: reduce the O(n^2) cost to O(n*w) with w << n, by reducing the number of pairs on which similarity is computed.

Page 27:

General template for similarity functions

[Figure: the set-overlap template: for sets r and s, the similarity predicate is a threshold on the number of common tokens.]

Page 28:

Approximating edit distance [GIJ+01]

Count filter: EditDistance(s,t) ≤ d implies |q-grams(s) ∩ q-grams(t)| ≥ max(|s|,|t|) - (d-1)*q - 1

q-grams are the sequences of q consecutive characters in a field. 'AT&T Corporation', 3-grams: {'AT&', 'T&T', '&T ', 'T C', ' Co', 'Cor', 'orp', 'rpo', 'por', 'ora', 'rat', 'ati', 'tio', 'ion'}

Typically q = 3. For large q-gram sets, approximate them with smaller sets.
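A sketch of the count filter in Python. The bound as stated assumes padded q-grams (q-1 sentinel characters on each end, so a length-n string has n + q - 1 grams); the sentinel characters below are an assumption:

```python
from collections import Counter

def qgram_overlap(s: str, t: str, q: int = 3) -> int:
    """Multiset overlap of q-grams over strings padded with q-1 sentinels per side."""
    def grams(x):
        padded = "#" * (q - 1) + x + "$" * (q - 1)
        return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))
    gs, gt = grams(s), grams(t)
    return sum(min(c, gt[g]) for g, c in gs.items())

def passes_count_filter(s: str, t: str, d: int, q: int = 3) -> bool:
    """Necessary condition for EditDistance(s, t) <= d; cheap pruning pre-filter."""
    return qgram_overlap(s, t, q) >= max(len(s), len(t)) - (d - 1) * q - 1
```

Pairs failing the filter are pruned; survivors still need the exact edit-distance check.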

Page 29:

Pair-Count: self-join of lists (Broder et al., WWW 1997)

Step 1: Pass over the data; for each token, create the list of sets that contain it (an inverted index t1, t2, …).
Step 2: Generate pairs of sets within each list, count them, and output pairs with count > T.

Caveats: the pair-counting table can itself be large and memory-intensive, and the method does poorly when list lengths are highly skewed.
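A sketch of the pair-count self-join in Python (whitespace tokens and the toy records are illustrative assumptions):

```python
from collections import defaultdict
from itertools import combinations

def pair_count_join(records, T):
    """Return record-id pairs sharing more than T tokens.
    Step 1: inverted index from token to the records containing it.
    Step 2: count co-occurrences by pairing ids within each token's list."""
    index = defaultdict(list)
    for rid, rec in enumerate(records):
        for tok in set(rec.split()):
            index[tok].append(rid)
    counts = defaultdict(int)
    for ids in index.values():
        for a, b in combinations(ids, 2):
            counts[(a, b)] += 1
    return {pair for pair, c in counts.items() if c > T}

records = ["AT&T Corporation USA", "AT&T Corp USA", "IBM Corp"]
print(pair_count_join(records, 1))  # prints {(0, 1)}: records 0 and 1 share 'AT&T' and 'USA'
```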

Page 30:

Probe-Count

Step 1: Create an inverted index (w1, w2, …).
Step 2: For each record, probe the index and merge the lists of its tokens (w1, w4, …, wk) with a heap, to find the record ids that co-occur in more than T of them.
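Probe-Count sketched the same way (a plain Counter merge stands in for the heap-based list merge of the slide):

```python
from collections import Counter, defaultdict

def build_index(records):
    """Step 1: inverted index from token to record ids."""
    index = defaultdict(list)
    for rid, rec in enumerate(records):
        for tok in set(rec.split()):
            index[tok].append(rid)
    return index

def probe_count(query, index, T):
    """Step 2: merge the rid lists of the query's tokens and keep rids
    that appear in more than T of them."""
    counts = Counter()
    for tok in set(query.split()):
        counts.update(index.get(tok, []))
    return {rid for rid, c in counts.items() if c > T}

index = build_index(["AT&T Corporation USA", "AT&T Corp USA", "IBM Corp"])
print(probe_count("AT&T Corporation USA", index, 1))  # prints {0, 1}
```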

Page 31:

Threshold-sensitive list merge [SK04, CGK06]

- Sort the lists to be merged by increasing size
- Except for the T-1 largest lists, organize the rest in a heap (example: T = 3)
- Pop from the heap successively
- Search the large lists in increasing order of size; use lower bounds to terminate early

Page 32:

Summary of the pair-creation step

- Can be extended to the weighted case fitting the general framework
- More complicated similarity functions use set-similarity functions as filters
- Set sizes can be reduced through techniques like MinHash (weighted versions also exist)
- Small sets (average set size < 20), e.g., most database entities with word tokens: use as-is
- Large sets, e.g., web documents or sets of q-grams: use MinHash or random projection

Page 33:

Creating partitions

Transitive closure. Danger: unrelated records get collapsed into a single cluster.

[Figure: ten records (1-10) connected by predicted duplicate edges, and a partitioning of the graph that incurs 3 disagreements.]

Correlation clustering (Bansal et al., 2002): partition to minimize the total number of disagreements:
1. Edges across partitions
2. Missing edges within a partition

More appealing than ordinary clustering:
- No magic constants: number of clusters, similarity thresholds, diameter, etc.
- Extends to real-valued scores
- NP-hard: many approximation algorithms exist
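The transitive-closure baseline can be sketched with union-find; the collapse danger is visible directly: one spurious edge merges two otherwise-unrelated clusters:

```python
def transitive_closure(n, duplicate_pairs):
    """Group n records into clusters connected by predicted duplicate edges."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for x in range(n):
        clusters.setdefault(find(x), []).append(x)
    return sorted(clusters.values())

print(transitive_closure(5, [(0, 1), (3, 4)]))          # prints [[0, 1], [2], [3, 4]]
# One spurious edge (1, 3) merges the two clusters:
print(transitive_closure(5, [(0, 1), (1, 3), (3, 4)]))  # prints [[0, 1, 3, 4], [2]]
```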

Page 34:

Empirical results on data partitioning

Setup: online comparison shopping; fields: name, model, description, price. Learner: online perceptron.

Complete-link clustering >> single-link clustering (transitive closure). An open issue: when to stop merging clusters.

[Figure: results for three categories: digital cameras, camcorders, luggage. From Bilenko et al., 2005.]

Page 35:

References

[CGGM04] Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rajeev Motwani: Robust and Efficient Fuzzy Match for Online Data Cleaning. SIGMOD Conference 2003: 313-324
[CGG+05] Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rahul Kapoor, Vivek R. Narasayya, Theo Vassilakis: Data Cleaning in Microsoft SQL Server 2005. SIGMOD Conference 2005: 918-920
[CGK06] Surajit Chaudhuri, Venkatesh Ganti, Raghav Kaushik: A Primitive Operator for Similarity Joins in Data Cleaning. ICDE 2006
[CRF03] William W. Cohen, Pradeep Ravikumar, Stephen E. Fienberg: A Comparison of String Distance Metrics for Name-Matching Tasks. IIWeb 2003: 73-78
[J89] M. A. Jaro: Advances in Record Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association 84: 414-420
[ME97] Alvaro E. Monge, Charles Elkan: An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records. DMKD 1997
[RY97] E. Ristad, P. Yianilos: Learning String Edit Distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998
[SK04] Sunita Sarawagi, Alok Kirpal: Efficient Set Joins on Similarity Predicates. SIGMOD Conference 2004: 743-754
[W99] William E. Winkler: The State of Record Linkage and Current Research Problems. IRS publication R99/04 (http://www.census.gov/srd/www/byname.html)
[Y02] William E. Yancey: BigMatch: A Program for Extracting Probable Matches from a Large File for Record Linkage. RRC 2002-01, Statistical Research Division, U.S. Bureau of the Census