Real-World Data Is Dirty

Data Cleansing and the Merge/Purge Problem

Hernandez & Stolfo: Columbia University - 1998

Class Presentation by Rhonda Kost, 06.April 2004



rmk 06.April.2004

TOPICS

Introduction

A Basic Data Cleansing Solution

Test & Real World Results

Incremental Merge Purge w/ New Data

Conclusion

Recap


Introduction


The problem:

Some corporations acquire large amounts of information every month

The data is stored in many large databases (DB)

These databases may be heterogeneous

Variations in schema

The data may be represented differently across the various datasets

Data in these DB may simply be inaccurate


Requirement of the analysis

The data mining needs to be done

Quickly

Efficiently

Accurately


Examples of real-world applications

Credit card companies

Assess risk of potential new customers

Find false identities

Match disparate records concerning a customer

Mass marketing companies

Government agencies


A Basic Data Cleansing Solution


Duplicate Elimination

Sorted-Neighborhood Method (SNM)

This is done in three phases

Create a Key for each record

Sort records on this key

Merge/Purge records


SNM: Create key

Compute a key for each record by extracting relevant fields or portions of fields

Example:

First  Last    Address           ID        Key
Sal    Stolfo  123 First Street  45678987  STLSAL123FRST456
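The slide shows the resulting key but not the rule that produced it. The sketch below is one hypothetical rule that happens to reproduce it: the first three consonants of the last name, the first three letters of the first name, the house number, the first four consonants of the street name, and the first three digits of the ID. The function names `consonants` and `make_key` are assumptions for illustration, not taken from the presentation.

```python
VOWELS = set("AEIOU")

def consonants(word: str) -> str:
    """Uppercase the word and keep only its non-vowel letters."""
    return "".join(c for c in word.upper() if c.isalpha() and c not in VOWELS)

def make_key(first: str, last: str, address: str, ident: str) -> str:
    """Build a sort key from portions of each field (hypothetical rule)."""
    number, street = address.split()[:2]  # "123 First Street" -> "123", "First"
    return (
        consonants(last)[:3]        # Stolfo -> STL
        + first.upper()[:3]         # Sal    -> SAL
        + number                    # 123
        + consonants(street)[:4]    # First  -> FRST
        + ident[:3]                 # 45678987 -> 456
    )

print(make_key("Sal", "Stolfo", "123 First Street", "45678987"))  # STLSAL123FRST456
```

Because the key keeps only portions of each field, small spelling variations in the discarded portions do not change where a record sorts.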


SNM: Sort Data

Sort the records in the data list using the key from step 1

This can be very time consuming

O(N log N) for a good algorithm,

O(N^2) for a bad algorithm


SNM: Merge records

Move a fixed-size window through the sequential list of records.

This limits the comparisons to the records in the window.
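The window scan can be sketched as follows. This is a minimal illustration, not the authors' implementation: `window_scan`, `key_fn`, and `match_fn` are hypothetical names, and the toy records are made up. Each record is compared only with the `w - 1` records that precede it in sorted order, so the number of comparisons is at most N(w - 1) rather than growing quadratically.

```python
from typing import Callable, List, Sequence, Tuple

def window_scan(
    records: Sequence[dict],
    key_fn: Callable[[dict], str],
    match_fn: Callable[[dict, dict], bool],
    w: int,
) -> List[Tuple[int, int]]:
    """Return index pairs (into `records`) judged equivalent inside a window of size w."""
    # Sort record indices by key so similar records end up near each other.
    order = sorted(range(len(records)), key=lambda i: key_fn(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        # The record entering the window is compared with the previous w - 1 records only.
        for j in order[max(0, pos - (w - 1)):pos]:
            if match_fn(records[i], records[j]):
                pairs.append((min(i, j), max(i, j)))
    return pairs

# Made-up toy data: "key" is the sort key, "group" stands in for a real match rule.
toy = [
    {"key": "A", "group": 1},
    {"key": "C", "group": 1},
    {"key": "B", "group": 2},
    {"key": "D", "group": 1},
]
same_group = lambda r, s: r["group"] == s["group"]
print(window_scan(toy, lambda r: r["key"], same_group, w=2))  # [(1, 3)]
print(window_scan(toy, lambda r: r["key"], same_group, w=3))  # [(0, 1), (1, 3)]
```

With `w=2` the records keyed "A" and "C" are never compared because "B" sorts between them; widening the window to 3 finds the pair at the cost of more comparisons.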


SNM: Considerations

What is the optimal window size while

Maximizing accuracy

Minimizing computational cost

Execution time for large DB will be bound by

Disk I/O

Number of passes over the data set


Selection of Keys

The effectiveness of the SNM highly depends on the key selected to sort the records

A key is defined to be a sequence of a subset of attributes

Keys must provide sufficient discriminating power


Example of Records and Keys

First  Last     Address            ID        Key
Sal    Stolfo   123 First Street   45678987  STLSAL123FRST456
Sal    Stolfo   123 First Street   45678987  STLSAL123FRST456
Sal    Stolpho  123 First Street   45678987  STLSAL123FRST456
Sal    Stiles   123 Forest Street  45654321  STLSAL123FRST456


Equational Theory

The comparison during the merge phase is an inferential process

Compares much more information than simply the key

The more information there is, the better inferences can be made


Equational Theory - Example

Two names are spelled nearly identically and have the same address

It may be inferred that they are the same person

Two social security numbers are the same but the names and addresses are totally different

Could be the same person who moved

Could be two different people and there is an error in the social security number


A simplified rule in English

Given two records, r1 and r2

IF the last name of r1 equals the last name of r2,

AND the first names differ slightly,

AND the address of r1 equals the address of r2

THEN

r1 is equivalent to r2
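The rule above can be sketched as a predicate over two records. This is an illustrative assumption rather than the rule language used in the paper: the record layout (dicts with `first`, `last`, `address`), the names `differ_slightly` and `equivalent`, and the 0.75 similarity threshold are all hypothetical. "Differ slightly" is approximated with the similarity ratio from Python's standard `difflib.SequenceMatcher`; identical first names also pass.

```python
from difflib import SequenceMatcher

def differ_slightly(a: str, b: str, threshold: float = 0.75) -> bool:
    """True when two strings are similar enough; identical strings also pass."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def equivalent(r1: dict, r2: dict) -> bool:
    """Simplified rule: same last name, similar first name, same address."""
    return (
        r1["last"] == r2["last"]
        and differ_slightly(r1["first"], r2["first"])
        and r1["address"] == r2["address"]
    )

a = {"first": "Kathi", "last": "Kason", "address": "48 North St."}
b = {"first": "Kathy", "last": "Kason", "address": "48 North St."}
c = {"first": "Kathy", "last": "Smith", "address": "48 North St."}
print(equivalent(a, b))  # True
print(equivalent(b, c))  # False: last names differ, so this rule alone does not fire
```

A real equational theory would combine many such rules, each looking at a different combination of fields.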


The distance function

A distance function is used to compare pieces of data (usually text)

Apply distance function to data that differ slightly

Select a threshold to capture obvious typographical errors

The threshold impacts the number of successful matches and the number of false positives
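The slides do not say which distance function is used. As one common choice, the sketch below uses Levenshtein edit distance (the number of single-character insertions, deletions, and substitutions) with a hypothetical threshold of 2; `edit_distance` and `is_probable_typo` are illustrative names.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))          # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + cost,         # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]

def is_probable_typo(a: str, b: str, threshold: int = 2) -> bool:
    """Treat strings within `threshold` edits as the same value with a typing error."""
    return edit_distance(a.lower(), b.lower()) <= threshold

print(edit_distance("Kathi", "Kathy"))      # 1
print(edit_distance("Stolfo", "Stolpho"))   # 2
print(is_probable_typo("Brik", "Brick"))    # True
```

Raising the threshold accepts more true matches but also more false positives, which is the trade-off the slide describes.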


Examples of matched records

SSN        Name (First, Initial, Last)  Address
334600443  Lisa Boardman                144 Wars St.
334600443  Lisa Brown                   144 Ward St.

525520001  Ramon Bonilla                38 Ward St.
525520001  Raymond Bonilla              38 Ward St.

0          Diana D. Ambrosion           40 Brik Church Av.
0          Diana A. Dambrosion          40 Brick Church Av.

789912345  Kathi Kason                  48 North St.
879912345  Kathy Kason                  48 North St.

879912345  Kathy Kason                  48 North St.
879912345  Kathy Smith                  48 North St.


Building an equational theory

The process of creating a good equational theory is similar to the process of creating a good knowledge-base for an expert system

In complex problems, an expert's assistance is needed to write the equational theory


Transitive Closure

In general, no single pass (i.e. no single key) will be sufficient to catch all matching records

An attribute that appears first in the key has higher discriminating power than those appearing after it

If an employee has two records in a DB with SSN 193456782 and 913456782, it's unlikely they will fall under the same window


Transitive Closure

To increase the number of similar records merged

Widen the scanning window size, w

Execute several independent runs of the SNM

Use a different key each time

Use a relatively small window

Call this the Multi-Pass approach


Transitive Closure

Each independent run of the Multi-Pass approach will produce a set of pairs of records

Although one field in a record may be in error, another field may not

Transitive closure can be applied to those pairs to be merged


Multi-pass Matches

Pass 1 (Lastname discriminates)
KSNKAT48NRTH789 (Kathi Kason 789912345)
KSNKAT48NRTH879 (Kathy Kason 879912345)

Pass 2 (Firstname discriminates)
KATKSN48NRTH789 (Kathi Kason 789912345)
KATKSN48NRTH879 (Kathy Kason 879912345)

Pass 3 (Address discriminates)
48NRTH879KSNKAT (Kathy Kason 879912345)
48NRTH879SMTKAT (Kathy Smith 879912345)
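The three passes differ only in the order in which the key components are concatenated. The sketch below reproduces the keys listed above under an assumed extraction rule (first three consonants of the last name, first three letters of the first name, house number plus first four consonants of the street, first three digits of the SSN). The names `parts`, `PASSES`, and `pass_key` are hypothetical, and the exact rule is inferred from the slide's examples rather than stated in it.

```python
VOWELS = set("AEIOU")

def consonants(word: str) -> str:
    """Uppercase the word and keep only its non-vowel letters."""
    return "".join(c for c in word.upper() if c.isalpha() and c not in VOWELS)

def parts(rec: dict) -> dict:
    """Extract the key components once; each pass only reorders them."""
    number, street = rec["address"].split()[:2]
    return {
        "last": consonants(rec["last"])[:3],
        "first": rec["first"].upper()[:3],
        "addr": number + consonants(street)[:4],
        "ssn": rec["ssn"][:3],
    }

# Component order per pass; the leading component discriminates most strongly.
PASSES = {
    "lastname": ("last", "first", "addr", "ssn"),
    "firstname": ("first", "last", "addr", "ssn"),
    "address": ("addr", "ssn", "last", "first"),
}

def pass_key(rec: dict, order: tuple) -> str:
    p = parts(rec)
    return "".join(p[name] for name in order)

kathi = {"first": "Kathi", "last": "Kason", "address": "48 North St.", "ssn": "789912345"}
smith = {"first": "Kathy", "last": "Smith", "address": "48 North St.", "ssn": "879912345"}
print(pass_key(kathi, PASSES["lastname"]))   # KSNKAT48NRTH789
print(pass_key(kathi, PASSES["firstname"]))  # KATKSN48NRTH789
print(pass_key(smith, PASSES["address"]))    # 48NRTH879SMTKAT
```

Sorting on the address-led key places the Kason and Smith records next to each other, which the name-led keys would not do.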


Transitive Equality Example

IF A implies B

AND B implies C

THEN A implies C

From example:

789912345 Kathi Kason 48 North St. (A)

879912345 Kathy Kason 48 North St. (B)

879912345 Kathy Smith 48 North St. (C)
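One common way to compute the transitive closure of matched pairs is a union-find (disjoint-set) structure. The slides do not say how the closure is computed, so the sketch below is an assumption for illustration; `transitive_closure` is a hypothetical name and the labels A, B, C refer to the three records above.

```python
def transitive_closure(pairs):
    """Group record ids into clusters, given pairs judged equivalent in any pass."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a        # merge the two clusters

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())

# A~B was found in the name-led passes, B~C in the address-led pass.
print(transitive_closure([("A", "B"), ("B", "C")]))
```

Although no single pass compares A with C directly, the closure places all three records in one cluster.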


Test Results


Test Environment 

Test data was created by a database generator

Names are randomly chosen from a list of 63000 real names

The database generator provides a large number of parameters:

size of the DB, percentage of duplicates, amount of error


Correct Duplicate Detection


Time for each run


Accuracy for each run


Real-World Test 

Data was obtained from the Office of Children Administrative Research (OCAR) of the Department of Social and Health Services (State of Washington)

OCAR's goals

How long do children stay in foster care?

How many different homes do children typically stay in?


OCAR's Database

Most of OCAR's data is stored in one relation

The DB contains 6,000,000 total records

The DB grows by about 50,000 records per month


Typical Problems in the DB

Names are frequently misspelled

SSN or birthdays are either missing or clearly wrong

Case number often changes when the child's family moves to another part of the state

Some records use service provider names instead of the child's

No reliable unique identifier


OCAR Equational Theory

Keys for the independent runs

Last Name, First Name, SSN, Case Number

First Name, Last Name, SSN, Case Number

Case Number, First Name, Last Name, SSN


OCAR Results


Incremental Merge/Purge w/New Data


Incremental Merge/Purge

Lists are concatenated for first-time processing

Concatenating new data before reapplying the merge/purge process may be very expensive in both time and space

An incremental merge/purge approach is needed: Prime Representatives method


Prime-Representative: Definition

A set of records extracted from each cluster of records used to represent the information in the cluster

The Cluster Centroid or base element of equivalence class


Prime-Representative creation

Initially, no PR exists

After the execution of the first merge/purge, create clusters of similar records

Correct selection of PR from cluster impacts accuracy of results

Selecting no PR can be the best choice for some clusters


3 Strategies for Choosing PR

Random Sample: select a sample of records at random from each cluster

N-Latest: most recent elements entered in DB

Syntactic: choose the largest or most complete record
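The three strategies can be sketched as small selection functions over a cluster. This is a hedged illustration only: a cluster is assumed to be a list of dict records, the `entered` field (an entry timestamp or sequence number) and the completeness measure are hypothetical, and the toy cluster is made up.

```python
import random

def random_sample(cluster, n, seed=None):
    """Random Sample: pick up to n records at random from the cluster."""
    rng = random.Random(seed)
    return rng.sample(cluster, min(n, len(cluster)))

def n_latest(cluster, n):
    """N-Latest: the n records most recently entered (assumes an 'entered' field)."""
    return sorted(cluster, key=lambda r: r["entered"], reverse=True)[:n]

def syntactic(cluster):
    """Syntactic: the record with the most filled-in fields, ties broken by total length."""
    def completeness(r):
        filled = [str(v) for k, v in r.items() if k != "entered" and v not in (None, "")]
        return (len(filled), sum(len(v) for v in filled))
    return [max(cluster, key=completeness)]

# Made-up toy cluster for illustration.
cluster = [
    {"name": "J Doe", "ssn": "", "entered": 1},
    {"name": "Jane Doe", "ssn": "123456789", "entered": 2},
    {"name": "Jane Do", "ssn": "", "entered": 3},
]
print(n_latest(cluster, 1)[0]["name"])   # Jane Do
print(syntactic(cluster)[0]["name"])     # Jane Doe
print(len(random_sample(cluster, 2, seed=0)))  # 2
```

Which strategy works best depends on the data: the latest records may reflect current values, while the most complete record carries the most information for later comparisons.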


Important Assumption

No data previously used to select each cluster's PR will be deleted

Deleted records could require restructuring of clusters (expensive)

No changes in the rule-set will occur after the first increment of data is processed

Substantial rule change could invalidate clusters.


Results

Cumulative running time for the Incremental Merge/Purge algorithm is higher than the classic algorithm

PR selection methodology could improve cumulative running time

Total running time of the Incremental Merge/Purge algorithm is always smaller


Conclusion


Cleansing of Data

Sorted-Neighborhood Method is expensive due to

the sorting phase

the need for large windows for high accuracy

Multiple passes with small windows followed by transitive closure improve accuracy and performance for a given level of accuracy

increasing number of successful matches

decreasing number of false positives


Recap


2 major reasons merging large databases becomes a difficult problem:

The databases are heterogeneous

The identifiers or strings differ in how they are represented within each DB


The 3 steps in SNM are:

Creation of key(s)

Sorting records on this key

Merge/Purge records


Prime representative - set of records from cluster considered to be representative of data contained in cluster

3 strategies for selecting a PR:

Random Sample

N-Latest

Syntactic


Questions: