DataDirty-4


Real-World Data Is Dirty

Data Cleansing and the Merge/Purge Problem

Hernandez & Stolfo: Columbia University - 1998

Class Presentation by Rhonda Kost, 06.April 2004



rmk 06.April.2004

TOPICS

Introduction

A Basic Data Cleansing Solution

Test & Real World Results

Incremental Merge/Purge w/ New Data

Conclusion

Recap



Introduction




The problem:

Some corporations acquire large amounts of information every month

The data is stored in many large databases (DB)

These databases may be heterogeneous

Variations in schema

The data may be represented differently across the various datasets

Data in these DB may simply be inaccurate




Requirement of the analysis

The data mining needs to be done

Quickly

Efficiently

Accurately




Examples of real-world applications

Credit card companies

Assess risk of potential new customers

Find false identities

Match disparate records concerning a customer

Mass marketing companies

Government agencies



A Basic Data Cleansing Solution




Duplicate Elimination

Sorted-Neighborhood Method (SNM)

This is done in three phases

Create a Key for each record

Sort records on this key

Merge/Purge records




SNM: Create key

Compute a key for each record by extracting relevant fields or portions of fields

Example:

First  Last    Address           ID        Key
Sal    Stolfo  123 First Street  45678987  STLSAL123FRST456
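
The key in the example above can be reproduced with a short sketch. The extraction rule is an assumption inferred from that one example (first three consonants of the last name, first three letters of the first name, house number plus the first four consonants of the street name, first three digits of the ID); the slides do not state the rule, and the names `consonants` and `make_key` are hypothetical.

```python
VOWELS = set("AEIOU")

def consonants(word, n):
    """Return the first n consonants of word, upper-cased."""
    picked = [c for c in word.upper() if c.isalpha() and c not in VOWELS]
    return "".join(picked[:n])

def make_key(first, last, address, id_number):
    """Build a sort key from portions of each field (assumed rule, see above)."""
    house_number, street_name = address.split()[:2]
    return (
        consonants(last, 3)
        + first.upper()[:3]
        + house_number
        + consonants(street_name, 4)
        + id_number[:3]
    )

print(make_key("Sal", "Stolfo", "123 First Street", "45678987"))
# STLSAL123FRST456
```

Because only portions of each field are used, distinct records can share a key; the key only decides where a record lands in the sorted list, not whether two records match.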




SNM: Sort Data

Sort the records in the data list using the key created in step 1

This can be very time consuming

O(N log N) for a good algorithm,

O(N^2) for a bad algorithm




SNM: Merge records

Move a fixed-size window through the sequential list of records.

This limits the comparisons to the records in the window
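
The window scan can be sketched as follows. This is a minimal illustration, not the authors' implementation: `sorted_neighborhood`, `key_fn`, `match_fn` and the toy prefix test are all hypothetical names and stand-ins for the real key and equational theory.

```python
def sorted_neighborhood(records, key_fn, match_fn, w):
    """Return index pairs (i, j) of records judged equivalent inside a window of size w."""
    # Sort record indices by key so similar records end up close together.
    order = sorted(range(len(records)), key=lambda i: key_fn(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        # The record entering the window is compared with the previous w - 1 records only.
        for j in order[max(0, pos - (w - 1)):pos]:
            if match_fn(records[i], records[j]):
                pairs.append((min(i, j), max(i, j)))
    return pairs

records = ["kathy", "smith", "kathi", "jones"]
same_prefix = lambda a, b: a[:4] == b[:4]  # toy stand-in for the real comparison
print(sorted_neighborhood(records, key_fn=lambda r: r, match_fn=same_prefix, w=2))
# [(0, 2)]
```

With `w=2` each record is compared only with its immediate predecessor in sorted order, so matches are found only when the key places duplicates next to each other.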




SNM: Considerations

What is the optimal window size while

Maximizing accuracy

Minimizing computational cost

Execution time for large DB will be bound by

Disk I/O

Number of passes over the data set
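
To make the cost side of the window-size trade-off concrete, the sketch below counts record comparisons for one window scan and for a naive all-pairs comparison. It assumes each record is compared with at most `w - 1` predecessors in sorted order; the function names are hypothetical, and sorting and disk I/O costs are not modeled.

```python
def window_comparisons(n, w):
    """Comparisons made by one window scan over n sorted records."""
    # The record at sorted position i is compared with min(i, w - 1) predecessors.
    return sum(min(i, w - 1) for i in range(n))

def all_pairs_comparisons(n):
    """Comparisons needed to compare every record with every other record."""
    return n * (n - 1) // 2

n = 100_000
for w in (2, 10, 50):
    print(w, window_comparisons(n, w))
print("all pairs", all_pairs_comparisons(n))
```

The scan cost grows roughly linearly with `w`, which is why a larger window buys accuracy only at a proportional increase in comparisons.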




Selection of Keys

The effectiveness of the SNM depends heavily on the key selected to sort the records

A key is defined to be a sequence of a subset of attributes

Keys must provide sufficient discriminating power




Example of Records and Keys

First  Last     Address            ID        Key
Sal    Stolpho  123 First Street   45678987  STLSAL123FRST456
Sal    Stiles   123 Forest Street  45654321  STLSAL123FRST456




Equational Theory

The comparison during the merge phase is an inferential process

Compares much more information than simply the key

The more information there is, the better the inferences that can be made




Equational Theory - Example

Two names are spelled nearly identically and have the same address

It may be inferred that they are the same person

Two social security numbers are the same but the names and addresses are totally different

Could be the same person who moved

Could be two different people and there is an error in the social security number




A simplified rule in English

Given two records, r1 and r2

IF the last name of r1 equals the last name of r2,

AND the first names differ slightly,

AND the address of r1 equals the address of r2

THEN

r1 is equivalent to r2
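
The rule above could be written in code roughly as follows. This is a sketch under assumptions: "differ slightly" is approximated with the similarity ratio from Python's standard `difflib.SequenceMatcher` and an arbitrary cut-off of 0.75, identical first names are also accepted, and the record layout (a dict with `first`, `last`, `address`) is hypothetical.

```python
from difflib import SequenceMatcher

def differ_slightly(a, b, threshold=0.75):
    """True when two strings are identical or nearly so (assumed similarity cut-off)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def equivalent(r1, r2):
    """The simplified rule: same last name, similar first name, same address."""
    return (
        r1["last"] == r2["last"]
        and differ_slightly(r1["first"], r2["first"])
        and r1["address"] == r2["address"]
    )

r1 = {"first": "Kathi", "last": "Kason", "address": "48 North St."}
r2 = {"first": "Kathy", "last": "Kason", "address": "48 North St."}
print(equivalent(r1, r2))  # True
```

A full equational theory would hold many such rules, each covering a different combination of fields.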




The distance function

A distance function is used to compare pieces of data (usually text)

Apply distance function to data that differ slightly

Select a threshold to capture obvious typographical errors.

The threshold impacts the number of successful matches and the number of false positives
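
One common choice of distance function is the Levenshtein edit distance; the slides do not say which function was used, so treat this as an illustrative stand-in. The threshold of 1 and the helper name `close_enough` are assumptions, and the sample strings come from this deck's examples.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free when equal)
            ))
        previous = current
    return previous[-1]

def close_enough(a, b, threshold=1):
    """True when the two strings are within the chosen distance threshold."""
    return edit_distance(a, b) <= threshold

print(edit_distance("Brik", "Brick"))      # 1
print(close_enough("Wars", "Ward"))        # True
print(close_enough("Boardman", "Brown"))   # False
```

Raising the threshold catches more typographical variants but also lets more unrelated strings through, which is the trade-off between successful matches and false positives.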




Examples of matched records

SSN        Name (First, Initial, Last)  Address

334600443  Lisa Boardman                144 Wars St.
334600443  Lisa Brown                   144 Ward St.

525520001  Ramon Bonilla                38 Ward St.
525520001  Raymond Bonilla              38 Ward St.

0          Diana D. Ambrosion           40 Brik Church Av.
0          Diana A. Dambrosion          40 Brick Church Av.

789912345  Kathi Kason                  48 North St.
879912345  Kathy Kason                  48 North St.

879912345  Kathy Kason                  48 North St.
879912345  Kathy Smith                  48 North St.




Building an equational theory

The process of creating a good equational theory is similar to the process of creating a good knowledge-base for an expert system

In complex problems, an expert's assistance is needed to write the equational theory




Transitive Closure

In general, no single pass (i.e. no single key) will be sufficient to catch all matching records

An attribute that appears first in the key has higher discriminating power than those appearing after it

If an employee has two records in a DB with SSN 193456782 and 913456782, it's unlikely they will fall under the same window




Transitive Closure

To increase the number of similar records merged

Widen the scanning window size, w

Execute several independent runs of the SNM

Use a different key each time

Use a relatively small window

Call this the Multi-Pass approach




Transitive Closure

Each independent run of the Multi-Pass approach will produce a set of pairs of records

Although one field in a record may be in error, another field may not

Transitive closure can be applied to those pairs to be merged




Multi-pass Matches

Pass 1 (Lastname discriminates)

KSNKAT48NRTH789 (Kathi Kason 789912345)

KSNKAT48NRTH879 (Kathy Kason 879912345)

Pass 2 (Firstname discriminates)

KATKSN48NRTH789 (Kathi Kason 789912345)

KATKSN48NRTH879 (Kathy Kason 879912345)

Pass 3 (Address discriminates)

48NRTH879KSNKAT (Kathy Kason 879912345)

48NRTH879SMTKAT (Kathy Smith 879912345)
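
The three pass keys above differ only in the order of their components, which a small sketch can make explicit. The component rules (three consonants of the last name, three letters of the first name, house number plus four consonants of the street, three digits of the SSN) are assumptions inferred from the keys shown; `parts`, `pass_key` and `PASS_ORDERS` are hypothetical names.

```python
VOWELS = set("AEIOU")

def consonants(word, n):
    """Return the first n consonants of word, upper-cased."""
    return "".join([c for c in word.upper() if c.isalpha() and c not in VOWELS][:n])

def parts(rec):
    """Extract the key components of one record (assumed rules, see above)."""
    number, street = rec["address"].split()[:2]
    return {
        "last": consonants(rec["last"], 3),
        "first": rec["first"].upper()[:3],
        "address": number + consonants(street, 4),
        "ssn": rec["ssn"][:3],
    }

# One component ordering per pass; the leading component discriminates most.
PASS_ORDERS = [
    ("last", "first", "address", "ssn"),
    ("first", "last", "address", "ssn"),
    ("address", "ssn", "last", "first"),
]

def pass_key(rec, order):
    """Concatenate the record's components in the order used by one pass."""
    p = parts(rec)
    return "".join(p[name] for name in order)

smith = {"first": "Kathy", "last": "Smith", "address": "48 North St.", "ssn": "879912345"}
for order in PASS_ORDERS:
    print(pass_key(smith, order))
```

Each pass sorts on its own key, so a record with an error in one field can still land next to its duplicate in a pass whose key leads with a different field.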




Transitive Equality Example

IF A implies B

AND B implies C

THEN A implies C

From example:

789912345 Kathi Kason 48 North St. (A)

879912345 Kathy Kason 48 North St. (B)

879912345 Kathy Smith 48 North St. (C)
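
Computing the transitive closure over the pairs found by all passes amounts to grouping records into connected components. The sketch below uses a union-find structure for this; that choice and the name `transitive_closure` are assumptions, since the slides do not say how the closure was computed.

```python
def transitive_closure(pairs):
    """Group items into clusters given pairs judged equivalent (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())

# A matched B in one pass and B matched C in another, so all three are merged.
print([sorted(c) for c in transitive_closure([("A", "B"), ("B", "C")])])
# [['A', 'B', 'C']]
```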



Test Results




Test Environment

Test data was created by a database generator

Names are randomly chosen from a list of 63000 real names

The database generator provides a large number of parameters:

size of the DB, percentage of duplicates, amount of error
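
A toy generator with the three parameters named above might look like the sketch below. It is not the generator used in the study: the record layout, the single-character typo model and every name (`generate_db`, `inject_typo`, `duplicate_fraction`, `error_rate`) are assumptions made for illustration.

```python
import random

def inject_typo(text, rng):
    """Replace one randomly chosen character with a random lowercase letter."""
    if not text:
        return text
    i = rng.randrange(len(text))
    return text[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + text[i + 1:]

def generate_db(names, size, duplicate_fraction, error_rate, seed=0):
    """Return size original records plus duplicates, some carrying a typo."""
    rng = random.Random(seed)
    # "id" is the ground-truth identity, used only to score detection later.
    originals = [{"id": i, "name": rng.choice(names)} for i in range(size)]
    duplicates = []
    for source in rng.sample(originals, int(size * duplicate_fraction)):
        name = source["name"]
        if rng.random() < error_rate:
            name = inject_typo(name, rng)
        duplicates.append({"id": source["id"], "name": name})
    records = originals + duplicates
    rng.shuffle(records)
    return records

db = generate_db(["Kason", "Smith", "Bonilla"], size=10, duplicate_fraction=0.5, error_rate=0.3)
print(len(db))  # 15
```

Because the generator knows which records are true duplicates, the accuracy of a merge/purge run can be measured exactly.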




Correct Duplicate Detection




Time for each run




Accuracy for each run




Real-World Test

Data was obtained from the Office of Children Administrative Research (OCAR) of the Department of Social and Health Services (State of Washington)

OCAR's goals

How long do children stay in foster care?

How many different homes do children typically stay in?




OCAR's Database

Most of OCAR's data is stored in one relation

The DB contains 6,000,000 total records

The DB grows by about 50,000 records per month




Typical Problems in the DB

Names are frequently misspelled

SSN or birthdays are either missing or clearly wrong

Case number often changes when the child's family moves to another part of the state

Some records use service provider names instead of the child's

No reliable unique identifier




OCAR Equational Theory

Keys for the independent runs

Last Name, First Name, SSN, Case Number

First Name, Last Name, SSN, Case Number

Case Number, First Name, Last Name, SSN




OCAR Results



Incremental Merge/Purge w/New Data




Incremental Merge/Purge

Lists are concatenated for first-time processing

Concatenating new data before reapplying the merge/purge process may be very expensive in both time and space

An incremental merge/purge approach is needed: Prime Representatives method




Prime-Representative: Definition

A set of records extracted from each cluster of records used to represent the information in the cluster

The Cluster Centroid or base element of equivalence class




Prime-Representative creation

Initially, no PR exists

After the execution of the first merge/purge, create clusters of similar records

Correct selection of PR from cluster impacts accuracy of results

No PR can be the best selection for some clusters




3 Strategies for Choosing PR

Random Sample: Select a sample of records at random from each cluster

N-Latest: Most recent elements entered in DB

Syntactic: Choose the largest or most complete record
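
The three strategies can be sketched as small selection functions. The assumptions here are that a cluster is a list of dict records ordered from oldest to newest, that field values are strings, and that "most complete" means the most non-empty fields with ties broken by total length; the function names are hypothetical.

```python
import random

def random_sample(cluster, k, rng=random):
    """Random Sample: pick k records at random from the cluster."""
    return rng.sample(cluster, min(k, len(cluster)))

def n_latest(cluster, n):
    """N-Latest: pick the n most recently entered records (assumes n >= 1)."""
    return cluster[-n:]

def syntactic(cluster):
    """Syntactic: pick the most complete record in the cluster."""
    def completeness(rec):
        values = [v for v in rec.values() if v]
        return (len(values), sum(len(v) for v in values))
    return [max(cluster, key=completeness)]

cluster = [
    {"first": "K", "last": "Kason", "ssn": ""},
    {"first": "Kathy", "last": "Kason", "ssn": "879912345"},
    {"first": "Kathi", "last": "Kason", "ssn": ""},
]
print(n_latest(cluster, 1))
print(syntactic(cluster))
```

Each function returns a list so that the caller can treat the prime representative as a set of records regardless of the strategy chosen.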




Important Assumption

No data previously used to select each cluster's PR will be deleted

Deleted records could require restructuring of clusters (expensive)

No changes in the rule-set will occur after the first increment of data is processed

Substantial rule change could invalidate clusters.




Results

Cumulative running time for the Incremental Merge/Purge algorithm is higher than the classic algorithm

PR selection methodology could improve cumulative running time

Total running time of the Incremental Merge/Purge algorithm is always smaller



Conclusion




Cleansing of Data

Sorted-Neighborhood Method is expensive due to

the sorting phase

the need for large windows for high accuracy

Multiple passes with small windows followed by transitive closure improve accuracy, and performance at a given level of accuracy

increasing number of successful matches

decreasing number of false positives



Recap




2 major reasons merging large databases becomes a difficult problem:

The databases are heterogeneous

The identifiers or strings differ in how they are represented within each DB




The 3 steps in SNM are:

Creation of key(s)

Sorting records on this key

Merge/Purge records




Prime representative - set of records from cluster considered to be representative of data contained in cluster

3 strategies for selecting a PR:

Random Sample

N-Latest

Syntactic




Questions:
