Real-World Data Is Dirty: Data Cleansing and the Merge/Purge Problem. Hernandez & Stolfo: Columbia University - 1998. Class Presentation by Haiguang Li, 01. Dec 2011


Real-World Data Is Dirty

Data Cleansing and the Merge/Purge Problem

Hernandez & Stolfo: Columbia University - 1998

Class Presentation by Haiguang Li, 01. Dec 2011


TOPICS

  • Introduction
  • A Basic Data Cleansing Solution
  • Test & Real World Results
  • Incremental Merge Purge w/ New Data
  • Conclusion
  • Recap


Introduction


The problem:

  • Some corporations acquire large amounts of information every month
  • The data is stored in many large databases (DB)
  • These databases may be heterogeneous
      • Variations in schema
  • The data may be represented differently across the various datasets
  • Data in these DB may simply be inaccurate


Requirement of the analysis

  • The data mining needs to be done
      • Quickly
      • Efficiently
      • Accurately


Examples of real-world applications

  • Credit card companies
      • Assess risk of potential new customers
      • Find false identities
  • Match disparate records concerning a customer
  • Mass Marketing companies
  • Government agencies


A Basic Data Cleansing Solution


Duplicate Elimination

  • Sorted-Neighborhood Method (SNM)
  • This is done in three phases
      • Create a Key for each record
      • Sort records on this key
      • Merge/Purge records


SNM: Create key

Compute a key for each record by extracting relevant fields or portions of fields.

Example:

First  Last    Address           ID        Key
Sal    Stolfo  123 First Street  45678987  STLSAL123FRST456
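The key in the example appears to combine the first three consonants of the last name, the first three letters of the first name, the house number, the first four consonants of the street name, and the first three digits of the ID. The sketch below reproduces the key under that assumption. The extraction rule is inferred from the example rather than stated on the slide, and the names `consonants` and `make_key` are hypothetical.

```python
VOWELS = set("AEIOU")


def consonants(word, n):
    """Return the first n consonants of word, upper-cased."""
    letters = [c for c in word.upper() if c.isalpha() and c not in VOWELS]
    return "".join(letters[:n])


def make_key(first, last, address, ident):
    """Build a sort key from portions of each field (assumed extraction rule)."""
    house_number, street_name = address.split()[:2]
    return (
        consonants(last, 3)           # e.g. Stolfo -> STL
        + first.upper()[:3]           # e.g. Sal -> SAL
        + house_number                # e.g. 123
        + consonants(street_name, 4)  # e.g. First -> FRST
        + ident[:3]                   # e.g. 45678987 -> 456
    )


print(make_key("Sal", "Stolfo", "123 First Street", "45678987"))
```

Because only portions of each field are used, records with small spelling differences (Stolfo, Stolpho) can still receive the same key and end up next to each other after sorting.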


SNM: Sort Data

  • Sort the records in the data list using the key in step 1
  • This can be very time consuming
      • O(N log N) for a good algorithm, O(N^2) for a bad algorithm
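As a small illustration of the sort phase, the sketch below orders a few records on a precomputed key using Python's built-in `sorted`, which runs in O(N log N) time. The record layout (a dict with `name` and `key` fields) and the sample keys are assumptions made for the example.

```python
# Each record carries the key computed in the previous phase (illustrative data).
records = [
    {"name": "Kathy Smith", "key": "SMTKAT48NRTH879"},
    {"name": "Kathi Kason", "key": "KSNKAT48NRTH789"},
    {"name": "Kathy Kason", "key": "KSNKAT48NRTH879"},
]

# sorted() returns a new list ordered by the key string.
sorted_records = sorted(records, key=lambda r: r["key"])
names = [r["name"] for r in sorted_records]
print(names)
```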


SNM: Merge records

  • Move a fixed size window through the sequential list of records.
  • This limits the comparisons to the records in the window
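The window scan can be sketched as follows: each record entering the window is paired only with the w - 1 records that precede it in sorted order. This is a minimal sketch; the function name `window_pairs` is hypothetical, and the actual match decision for each pair is left out.

```python
def window_pairs(sorted_records, w):
    """Yield every pair of records that share a window of size w."""
    for i, current in enumerate(sorted_records):
        # Pair the record entering the window with the w - 1 records before it.
        for j in range(max(0, i - (w - 1)), i):
            yield sorted_records[j], current


pairs = list(window_pairs(["r1", "r2", "r3", "r4"], w=3))
print(pairs)
```

With four records and w = 3, the pair (r1, r4) is never compared, which is the saving the window provides.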


SNM: Considerations

  • What is the optimal window size while
      • Maximizing accuracy
      • Minimizing computational cost
  • Execution time for large DB will be bound by
      • Disk I/O
      • Number of passes over the data set
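To make the cost side of the window-size trade-off concrete, the sketch below counts record comparisons for a window scan and for comparing every pair. The database size of one million records is a made-up figure, and the function names are hypothetical.

```python
def all_pairs_comparisons(n):
    """Comparisons needed to compare every record with every other record."""
    return n * (n - 1) // 2


def window_comparisons(n, w):
    """Comparisons made by a window of size w sliding over n sorted records."""
    if n <= w:
        return all_pairs_comparisons(n)
    # Each record is compared with the w - 1 before it, fewer for the first few.
    return (w - 1) * n - w * (w - 1) // 2


n = 1_000_000
print(window_comparisons(n, 10))
print(all_pairs_comparisons(n))
```

The window count grows linearly in both N and w, so doubling the window roughly doubles the comparison work while the sorting cost stays the same.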


Selection of Keys

  • The effectiveness of the SNM highly depends on the key selected to sort the records
  • A key is defined to be a sequence of a subset of attributes
  • Keys must provide sufficient discriminating power


Example of Records and Keys

First  Last     Address            ID        Key
Sal    Stolfo   123 First Street   45678987  STLSAL123FRST456
Sal    Stolfo   123 First Street   45678987  STLSAL123FRST456
Sal    Stolpho  123 First Street   45678987  STLSAL123FRST456
Sal    Stiles   123 Forest Street  45654321  STLSAL123FRST456


Equational Theory

  • The comparison during the merge phase is an inferential process
  • Compares much more information than simply the key
  • The more information there is, the better inferences can be made


Equational Theory - Example

  • Two names are spelled nearly identically and have the same address
      • It may be inferred that they are the same person
  • Two social security numbers are the same but the names and addresses are totally different
      • Could be the same person who moved
      • Could be two different people and there is an error in the social security number


A simplified rule in English

Given two records, r1 and r2
IF the last name of r1 equals the last name of r2,
AND the first names differ slightly,
AND the address of r1 equals the address of r2
THEN r1 is equivalent to r2
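The rule above could be written as a predicate along these lines. This is a sketch under assumptions: records are dicts with `first`, `last` and `address` fields, "differ slightly" is approximated with the similarity ratio from Python's standard `difflib` module, and the 0.75 cutoff is an arbitrary illustrative value rather than one taken from the presentation.

```python
from difflib import SequenceMatcher


def differ_slightly(a, b, cutoff=0.75):
    """True when two strings are equal or nearly equal (assumed similarity test)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= cutoff


def is_equivalent(r1, r2):
    """Simplified rule: same last name, similar first name, same address."""
    return (
        r1["last"] == r2["last"]
        and differ_slightly(r1["first"], r2["first"])
        and r1["address"] == r2["address"]
    )


r1 = {"first": "Kathi", "last": "Kason", "address": "48 North St."}
r2 = {"first": "Kathy", "last": "Kason", "address": "48 North St."}
r3 = {"first": "Robert", "last": "Kason", "address": "48 North St."}
print(is_equivalent(r1, r2), is_equivalent(r1, r3))
```

Note that this version also accepts identical first names, which seems the natural reading of the rule.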


The distance function

  • A “distance function” is used to compare pieces of data (usually text)
  • Apply “distance function” to data that “differ slightly”
  • Select a threshold to capture obvious typographical errors.
      • Impacts number of successful matches and number of false positives
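One common choice of distance function for text is the edit (Levenshtein) distance, the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The slide does not say which function is used, so the sketch below is only an example of the idea, and the threshold of 2 is an assumed value.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def within_threshold(a, b, threshold=2):
    """Treat two strings as 'differing slightly' when the distance is small."""
    return edit_distance(a, b) <= threshold


print(edit_distance("Brik", "Brick"), edit_distance("Ramon", "Raymond"))
```

Raising the threshold catches more typographical variants but also lets more unrelated strings through, which is the trade-off between successful matches and false positives.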


Examples of matched records

SSN        Name (First, Initial, Last)  Address

334600443  Lisa Boardman                144 Wars St.
334600443  Lisa Brown                   144 Ward St.

525520001  Ramon Bonilla                38 Ward St.
525520001  Raymond Bonilla              38 Ward St.

0          Diana D. Ambrosion           40 Brik Church Av.
0          Diana A. Dambrosion          40 Brick Church Av.

789912345  Kathi Kason                  48 North St.
879912345  Kathy Kason                  48 North St.

879912345  Kathy Kason                  48 North St.
879912345  Kathy Smith                  48 North St.


Building an equational theory

  • The process of creating a good equational theory is similar to the process of creating a good knowledge-base for an expert system
  • In complex problems, an expert’s assistance is needed to write the equational theory


Transitive Closure

  • In general, no single pass (i.e. no single key) will be sufficient to catch all matching records
  • Attributes that appear first in the key have higher discriminating power than those appearing after them
      • If an employee has two records in a DB with SSN 193456782 and 913456782, it’s unlikely they will fall under the same window


Transitive Closure

  • To increase the number of similar records merged
      • Widen the scanning window size, w
      • Execute several independent runs of the SNM
          • Use a different key each time
          • Use a relatively small window
          • Call this the Multi-Pass approach
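A minimal sketch of the Multi-Pass approach, assuming records are dicts with an `id` field: each pass sorts on a different key, scans with a small window, and the matching pairs from all passes are pooled. The function names, the sample records and the toy `same_name` match rule are all hypothetical.

```python
def snm_pass(records, key, is_match, w):
    """One sorted-neighborhood pass: sort on key, compare within a window of w."""
    ordered = sorted(records, key=key)
    matches = set()
    for i, current in enumerate(ordered):
        for j in range(max(0, i - (w - 1)), i):
            if is_match(ordered[j], current):
                matches.add(frozenset((ordered[j]["id"], current["id"])))
    return matches


def multi_pass(records, keys, is_match, w):
    """Run one independent pass per key and pool the matching pairs."""
    pairs = set()
    for key in keys:
        pairs |= snm_pass(records, key, is_match, w)
    return pairs


records = [
    {"id": 1, "first": "Ann", "last": "Lee", "ssn": "193456782"},
    {"id": 2, "first": "Bob", "last": "Ray", "ssn": "500000000"},
    {"id": 3, "first": "Ann", "last": "Lee", "ssn": "913456782"},
]


def by_ssn(r):
    return r["ssn"]


def by_name(r):
    return (r["last"], r["first"])


def same_name(a, b):
    # Toy match rule for the example only.
    return (a["first"], a["last"]) == (b["first"], b["last"])


print(snm_pass(records, by_ssn, same_name, w=2))
print(multi_pass(records, [by_ssn, by_name], same_name, w=2))
```

Sorted on SSN, the two Ann Lee records are separated by another record and never share a window of size 2; sorted on name, they become neighbours and are matched.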


Transitive Closure

  • Each independent run of the Multi-Pass approach will produce a set of pairs of records
  • Although one field in a record may be in error, another field may not
  • Transitive closure can be applied to those pairs to be merged


Multi-pass Matches

Pass 1 (Lastname discriminates)
    KSNKAT48NRTH789 (Kathi Kason 789912345)
    KSNKAT48NRTH879 (Kathy Kason 879912345)

Pass 2 (Firstname discriminates)
    KATKSN48NRTH789 (Kathi Kason 789912345)
    KATKSN48NRTH879 (Kathy Kason 879912345)

Pass 3 (Address discriminates)
    48NRTH879KSNKAT (Kathy Kason 879912345)
    48NRTH879SMTKAT (Kathy Smith 879912345)


Transitive Equality Example

IF A implies B
AND B implies C
THEN A implies C

From example:
789912345 Kathi Kason 48 North St. (A)
879912345 Kathy Kason 48 North St. (B)
879912345 Kathy Smith 48 North St. (C)
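Computing the transitive closure over matched pairs amounts to finding connected groups of records. One standard way to do that is a union-find (disjoint-set) structure, sketched below; the slides do not name a particular algorithm, so this is an illustrative choice, and it treats a match as symmetric.

```python
def transitive_closure(pairs):
    """Group record ids into clusters, given pairs judged equivalent."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())


# A matched B in one pass and B matched C in another; D and E are unrelated to them.
clusters = transitive_closure([("A", "B"), ("B", "C"), ("D", "E")])
print(clusters)
```

Records A and C end up in the same cluster even though no single pass compared them directly.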


Test Results


Test Environment

  • Test data was created by a database generator
      • Names are randomly chosen from a list of 63000 real names
  • The database generator provides a large number of parameters: size of the DB, percentage of duplicates, amount of error…


Correct Duplicate Detection


Time for each run


Accuracy for each run


Real-World Test

  • Data was obtained from the Office of Children Administrative Research (OCAR) of the Department of Social and Health Services (State of Washington)
  • OCAR’s goals
      • How long do children stay in foster care?
      • How many different homes do children typically stay in?


OCAR’s Database

  • Most of OCAR’s data is stored in one relation
  • The DB contains 6,000,000 total records
  • The DB grows by about 50,000 records per month


Typical Problems in the DB

  • Names are frequently misspelled
  • SSN or birthdays are either missing or clearly wrong
  • Case number often changes when the child’s family moves to another part of the state
  • Some records use service provider names instead of the child’s
  • No reliable unique identifier


OCAR Equational Theory

  • Keys for the independent runs
      • Last Name, First Name, SSN, Case Number
      • First Name, Last Name, SSN, Case Number
      • Case Number, First Name, Last Name, SSN


OCAR Results


Incremental Merge/Purge w/ New Data


Incremental Merge/Purge

  • Lists are concatenated for first time processing
  • Concatenating new data before reapplying the merge/purge process may be very expensive in both time and space
  • An incremental merge/purge approach is needed: Prime Representatives method


Prime-Representative: Definition

  • A set of records extracted from each cluster of records used to represent the information in the cluster
  • The “Cluster Centroid” or base element of equivalence class


Prime-Representative creation

  • Initially, no PR exists
  • After the first merge/purge run, clusters of similar records are created
  • Correct selection of PR from each cluster impacts accuracy of results
  • For some clusters, selecting no PR may be the best choice
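The idea of matching new data against prime representatives instead of the full database can be sketched as below. This is a deliberately simplified illustration, not the presented algorithm: it compares each new record against every cluster's representatives in a plain loop rather than through a sorted-neighborhood pass, and `incremental_merge`, `is_match` and `choose_pr` are hypothetical names.

```python
def incremental_merge(clusters, new_records, is_match, choose_pr):
    """Add new records to existing clusters by comparing against PRs only.

    clusters is a list of lists of records; it is updated in place and returned.
    """
    for record in new_records:
        for cluster in clusters:
            # Only the cluster's representatives are inspected, not every member.
            if any(is_match(record, pr) for pr in choose_pr(cluster)):
                cluster.append(record)
                break
        else:
            # No representative matched: the record starts a new cluster.
            clusters.append([record])
    return clusters


existing = [["Kathy Kason", "Kathi Kason"]]
incoming = ["Kathy Kason", "Lisa Brown"]
result = incremental_merge(
    existing,
    incoming,
    is_match=lambda a, b: a == b,        # toy rule: exact string equality
    choose_pr=lambda cluster: cluster[:1],  # toy rule: first record represents the cluster
)
print(result)
```

The example also shows why PR selection matters: with a poorly chosen representative, a new record that belongs to a cluster would start a separate cluster instead.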


3 Strategies for Choosing PR

  • Random Sample
      • Select a sample of records at random from each cluster
  • N-Latest
      • Most recent elements entered in DB
  • Syntactic
      • Choose the largest or more complete record
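The three strategies might look like this in code. This is a sketch under assumptions: a cluster is a list of dict records ordered oldest first, "most complete" is measured by the number of non-empty fields with total text length as a tie-breaker, and all function names are hypothetical.

```python
import random


def random_sample(cluster, k=1, seed=None):
    """Random Sample: pick k records at random from the cluster."""
    rng = random.Random(seed)
    return rng.sample(cluster, min(k, len(cluster)))


def n_latest(cluster, n=1):
    """N-Latest: the n most recently entered records (n >= 1, oldest first)."""
    return cluster[-n:]


def syntactic(cluster):
    """Syntactic: the most complete record (assumed completeness measure)."""
    def completeness(record):
        values = [str(v) for v in record.values() if v not in (None, "")]
        return (len(values), sum(len(v) for v in values))
    return [max(cluster, key=completeness)]


cluster = [
    {"first": "Kathi", "last": "Kason", "ssn": ""},
    {"first": "Kathy", "last": "Kason", "ssn": "879912345"},
    {"first": "K.", "last": "Kason", "ssn": "879912345"},
]
print(n_latest(cluster, 1))
print(syntactic(cluster))
```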


Important Assumption

  • No data previously used to select each cluster’s PR will be deleted
      • Deleted records could require restructuring of clusters (expensive)
  • No changes in the rule-set will occur after the first increment of data is processed
      • Substantial rule change could invalidate clusters.


Results

  • Cumulative running time for the Incremental Merge/Purge algorithm is higher than the classic algorithm
  • PR selection methodology could improve cumulative running time
  • Total running time of the Incremental Merge/Purge algorithm is always smaller


Conclusion


Cleansing of Data

  • Sorted-Neighborhood Method is expensive due to
      • the sorting phase
      • the need for large windows for high accuracy
  • Multiple passes with small windows followed by transitive closure improve accuracy, and performance for a given level of accuracy
      • increasing number of successful matches
      • decreasing number of false positives


  • 2 major reasons merging large databases becomes a difficult problem:
      • The databases are heterogeneous
      • The identifiers or strings differ in how they are represented within each DB

Questions 1?


  • The 3 steps in SNM are:
      • Creation of key(s)
      • Sorting records on this key
      • Merge/Purge records

Questions 2?


  • 3 strategies for selecting a PR:
      • Random Sample
      • N-Latest
      • Syntactic

Questions 3?


The End

Thanks very much!