DataDirty-4
Transcript of DataDirty-4
8/7/2019 DataDirty-4
http://slidepdf.com/reader/full/datadirty-4 1/49
Real-World Data Is Dirty
Data Cleansing and the Merge/Purge Problem
Hernandez & Stolfo: Columbia University - 1998
Class Presentation by Rhonda Kost, 06.April 2004
rmk 06.April.2004
TOPICS
Introduction
A Basic Data Cleansing Solution
Test & Real World Results
Incremental Merge Purge w/ New Data
Conclusion
Recap
Introduction
The problem:
Some corporations acquire large amounts of information every month
The data is stored in many large databases (DB)
These databases may be heterogeneous
Variations in schema
The data may be represented differently across the various datasets
Data in these DB may simply be inaccurate
Requirement of the analysis
The data mining needs to be done
Quickly
Efficiently
Accurately
Examples of real-world applications
Credit card companies
Assess risk of potential new customers
Find false identities
Match disparate records concerning a customer
Mass Marketing companies
Government agencies
A Basic Data Cleansing Solution
Duplicate Elimination
Sorted-Neighborhood Method (SNM)
This is done in three phases
Create a Key for each record
Sort records on this key
Merge/Purge records
SNM: Create key
Compute a key for each record by extracting relevant fields or portions of fields
Example:
First  Last     Address            ID        Key
Sal    Stolfo   123 First Street   45678987  STLSAL123FRST456
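The key in the example can be reproduced with a short sketch. The extraction rule here is inferred from the sample key alone and is an assumption, not a rule stated on the slide: first three consonants of the last name, first three letters of the first name, the street number, the consonants of the street name, and the first three digits of the ID. The function names `consonants` and `make_key` are hypothetical.

```python
VOWELS = set("AEIOU")

def consonants(word, limit=None):
    """Return the consonants of word in upper case, optionally truncated."""
    letters = [c for c in word.upper() if c.isalpha() and c not in VOWELS]
    return "".join(letters[:limit])

def make_key(first, last, address, record_id):
    """Build a sort key from portions of each field (assumed rule, see lead-in)."""
    number, street = address.split()[:2]   # e.g. "123", "First"
    return (consonants(last, 3)            # STL
            + first.upper()[:3]            # SAL
            + number                       # 123
            + consonants(street)           # FRST
            + str(record_id)[:3])          # 456

print(make_key("Sal", "Stolfo", "123 First Street", "45678987"))
```

Under this rule a misspelling such as "Stolpho" still yields the same key, which is what lets the sort place near-duplicates next to each other.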
SNM: Sort Data
Sort the records in the data list using the key in step 1
This can be very time consuming
O(N log N) for a good algorithm,
O(N^2) for a bad algorithm
SNM: Merge records
Move a fixed size window through the sequential list of records.
This limits the comparisons to the records in the window
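The three phases can be sketched in a few lines. This is a minimal illustration, assuming the caller supplies a key function and a match function; `sorted_neighborhood`, `key_fn`, `match_fn` and `w` are hypothetical names, and the window is modelled by comparing each record only with the w-1 records that precede it in sorted order.

```python
def sorted_neighborhood(records, key_fn, match_fn, w):
    """Return the pairs of records judged equivalent within a window of size w."""
    ordered = sorted(records, key=key_fn)          # phases 1 and 2: key, then sort
    pairs = []
    for i, rec in enumerate(ordered):              # phase 3: slide the window
        # Only the w-1 records just before rec are still inside the window.
        for other in ordered[max(0, i - w + 1):i]:
            if match_fn(other, rec):
                pairs.append((other, rec))
    return pairs

# Toy usage: the key is the string itself, and two strings match
# when they start with the same letter.
found = sorted_neighborhood(["b1", "a1", "a2", "c1"],
                            key_fn=lambda r: r,
                            match_fn=lambda x, y: x[0] == y[0],
                            w=2)
print(found)
```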
SNM: Considerations
What is the optimal window size while
Maximizing accuracy
Minimizing computational cost
Execution time for large DB will be bound by
Disk I/O
Number of passes over the data set
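The cost side of the window-size trade-off can be estimated with a back-of-the-envelope count. It assumes that each record entering the window is compared with the w-1 records already in it and that N >= w; the symbols C_window and C_all-pairs are introduced here for illustration only.

```latex
\[
C_{\text{window}}(N, w) \;=\; \sum_{i=0}^{N-1} \min(i,\, w-1)
\;=\; (w-1)\,N - \frac{w(w-1)}{2} \;<\; wN
\]
\[
C_{\text{all pairs}}(N) \;=\; \frac{N(N-1)}{2}
\]
```

Under these assumptions the merge phase grows linearly in N for a fixed w, compared with quadratic growth when every pair of records is compared, so the sort and the passes over the data dominate for small windows.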
Selection of Keys
The effectiveness of the SNM highly depends on the key selected to sort the records
A key is defined to be a sequence of a subset of attributes
Keys must provide sufficient discriminating power
Example of Records and Keys
First  Last     Address            ID        Key
Sal    Stolfo   123 First Street   45678987  STLSAL123FRST456
Sal    Stolfo   123 First Street   45678987  STLSAL123FRST456
Sal    Stolpho  123 First Street   45678987  STLSAL123FRST456
Sal    Stiles   123 Forest Street  45654321  STLSAL123FRST456
Equational Theory
The comparison during the merge phase is an inferential process
Compares much more information than simply the key
The more information there is, the better inferences can be made
Equational Theory - Example
Two names are spelled nearly identically and have the same address
It may be inferred that they are the same person
Two social security numbers are the same but the names and addresses are totally different
Could be the same person who moved
Could be two different people and there is an error in the social security number
A simplified rule in English
Given two records, r1 and r2
IF the last name of r1 equals the last name of r2,
AND the first names differ slightly,
AND the address of r1 equals the address of r2
THEN
r1 is equivalent to r2
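The rule above translates almost word for word into code. In this sketch, "differ slightly" is approximated with the similarity ratio from Python's standard `difflib.SequenceMatcher`; that choice, the 0.75 threshold, and the record field names (`first`, `last`, `address`) are assumptions made for illustration, not part of the original rule. Identical first names also pass the test.

```python
from difflib import SequenceMatcher

def differ_slightly(a, b, threshold=0.75):
    """True when two strings are equal or nearly equal (assumed similarity test)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def equivalent(r1, r2):
    """The simplified English rule: same last name, similar first name, same address."""
    return (r1["last"] == r2["last"]
            and differ_slightly(r1["first"], r2["first"])
            and r1["address"] == r2["address"])

kathi = {"first": "Kathi", "last": "Kason", "address": "48 North St."}
kathy = {"first": "Kathy", "last": "Kason", "address": "48 North St."}
smith = {"first": "Kathy", "last": "Smith", "address": "48 North St."}
print(equivalent(kathi, kathy), equivalent(kathy, smith))
```

A real equational theory would hold many such rules; this single rule cannot link Kathy Kason to Kathy Smith, which needs a rule that looks at other fields.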
The distance function
A distance function is used to compare pieces of data (usually text)
Apply distance function to data that differ slightly
Select a threshold to capture obvious typographical errors
Impacts number of successful matches and number of false positives
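One common choice of distance function is the edit (Levenshtein) distance. The slides do not say which function was used, so this is an illustrative stand-in, and the threshold of 2 and the names `edit_distance` and `close_enough` are arbitrary assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def close_enough(a, b, threshold=2):
    """Treat strings within the threshold as the same value with a typo."""
    return edit_distance(a, b) <= threshold

print(edit_distance("Brik", "Brick"), close_enough("Boardman", "Brown"))
```

Raising the threshold accepts more genuine typos but also lets more unrelated values through, which is the trade-off between successful matches and false positives noted above.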
Examples of matched records
SSN        Name (First, Initial, Last)  Address
334600443  Lisa Boardman                144 Wars St.
334600443  Lisa Brown                   144 Ward St.
525520001  Ramon Bonilla                38 Ward St.
525520001  Raymond Bonilla              38 Ward St.
0          Diana D. Ambrosion           40 Brik Church Av.
0          Diana A. Dambrosion          40 Brick Church Av.
789912345  Kathi Kason                  48 North St.
879912345  Kathy Kason                  48 North St.
879912345  Kathy Kason                  48 North St.
879912345  Kathy Smith                  48 North St.
Building an equational theory
The process of creating a good equational theory is similar to the process of creating a good knowledge-base for an expert system
In complex problems, an expert's assistance is needed to write the equational theory
Transitive Closure
In general, no single pass (i.e. no single key) will be sufficient to catch all matching records
An attribute that appears first in the key has higher discriminating power than those appearing after it
If an employee has two records in a DB with SSN 193456782 and 913456782, it's unlikely they will fall under the same window
Transitive Closure
To increase the number of similar records merged
Widen the scanning window size, w
Execute several independent runs of the SNM
Use a different key each time
Use a relatively small window
Call this the Multi-Pass approach
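A small sketch shows why several runs with different keys help. It reuses the transposed-SSN case (193456782 versus 913456782): a pass sorted on SSN with a small window misses the pair, while a pass sorted on name catches it. The record layout, the toy match rule, and the function names `snm_pass` and `multi_pass` are assumptions for illustration.

```python
def snm_pass(records, key_fn, match_fn, w):
    """One sorted-neighborhood run; returns matching pairs as index tuples."""
    order = sorted(range(len(records)), key=lambda i: key_fn(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[max(0, pos - w + 1):pos]:   # the w-1 preceding records
            if match_fn(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs

def multi_pass(records, key_fns, match_fn, w):
    """Run one independent pass per key and pool the pairs found."""
    pairs = set()
    for key_fn in key_fns:
        pairs |= snm_pass(records, key_fn, match_fn, w)
    return pairs

records = [
    {"name": "Kathy Kason", "ssn": "193456782"},
    {"name": "Kathy Kason", "ssn": "913456782"},   # transposed first two digits
    {"name": "Lisa Brown", "ssn": "334600443"},
    {"name": "Ramon Bonilla", "ssn": "525520001"},
]
same_person = lambda a, b: a["name"] == b["name"] or a["ssn"] == b["ssn"]  # toy rule
by_ssn = lambda r: r["ssn"]
by_name = lambda r: r["name"]

print(snm_pass(records, by_ssn, same_person, w=2))                # misses the pair
print(multi_pass(records, [by_ssn, by_name], same_person, w=2))   # finds it
```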
Transitive Closure
Each independent run of the Multi-Pass approach will produce a set of pairs of records
Although one field in a record may be in error, another field may not
Transitive closure can be applied to those pairs to be merged
Multi-pass Matches
Pass 1 (Lastname discriminates)
KSNKAT48NRTH789 (Kathi Kason 789912345)
KSNKAT48NRTH879 (Kathy Kason 879912345)
Pass 2 (Firstname discriminates)
KATKSN48NRTH789 (Kathi Kason 789912345)
KATKSN48NRTH879 (Kathy Kason 879912345)
Pass 3 (Address discriminates)
48NRTH879KSNKAT (Kathy Kason 879912345)
48NRTH879SMTKAT (Kathy Smith 879912345)
Transitive Equality Example
IF A implies B
AND B implies C
THEN A implies C
From example:
789912345 Kathi Kason 48 North St. (A)
879912345 Kathy Kason 48 North St. (B)
879912345 Kathy Smith 48 North St. (C)
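Transitive closure over the pairs from all passes can be computed with a union-find structure. This is one standard way to do it and not necessarily the one used in the original work; `transitive_closure` is a hypothetical name, and records are represented by the labels A, B and C from the example.

```python
def transitive_closure(pairs):
    """Group items into clusters so that any two items linked by a chain
    of matched pairs end up in the same cluster."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps chains short
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a         # merge the two clusters

    clusters = {}
    for item in list(parent):
        clusters.setdefault(find(item), set()).add(item)
    return list(clusters.values())

# Pass 1 matched A with B; pass 3 matched B with C.
clusters = transitive_closure([("A", "B"), ("B", "C"), ("D", "E")])
print(clusters)
```

A and C never shared a window under a rule that matched them directly, yet they land in one cluster because both were matched with B.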
Test Results
Test Environment
Test data was created by a database generator
Names are randomly chosen from a list of 63000 real names
The database generator provides a large number of parameters:
size of the DB, percentage of duplicates, amount of error
Correct Duplicate Detection
Time for each run
Accuracy for each run
Real-World Test
Data was obtained from the Office of Children Administrative Research (OCAR) of the Department of Social and Health Services (State of Washington)
OCAR's goals
How long do children stay in foster care?
How many different homes do children typically stay in?
OCAR's Database
Most of OCAR's data is stored in one relation
The DB contains 6,000,000 total records
The DB grows by about 50,000 records per month
Typical Problems in the DB
Names are frequently misspelled
SSN or birthdays are either missing or clearly wrong
Case number often changes when the child's family moves to another part of the state
Some records use service provider names instead of the child's
No reliable unique identifier
OCAR Equational Theory
Keys for the independent runs
Last Name, First Name, SSN, Case Number
First Name, Last Name, SSN, Case Number
Case Number, First Name, Last Name, SSN
OCAR Results
Incremental Merge/Purge w/New Data
Incremental Merge/Purge
Lists are concatenated for first time processing
Concatenating new data before reapplying the merge/purge process may be very expensive in both time and space
An incremental merge/purge approach is needed: Prime Representatives method
Prime-Representative: Definition
A set of records extracted from each cluster of records used to represent the information in the cluster
The Cluster Centroid or base element of equivalence class
Prime-Representative creation
Initially, no PR exists
After the execution of the first merge/purge create clusters of similar records
Correct selection of PR from cluster impacts accuracy of results
No PR can be the best selection for some clusters
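The idea of matching new data against prime representatives, instead of against every stored record, can be sketched as follows. This is a deliberately simplified illustration and not the algorithm from the original work: it compares each new record directly with every PR rather than running a sorted-neighborhood pass, and the names `incremental_merge`, `choose_pr` and `match_fn` are hypothetical.

```python
def incremental_merge(clusters, new_records, match_fn, choose_pr):
    """Add new_records to existing clusters (a list of lists, modified in place)
    by comparing each new record only against the prime representatives."""
    reps = [choose_pr(cluster) for cluster in clusters]
    for rec in new_records:
        for idx, pr in enumerate(reps):
            if any(match_fn(rec, p) for p in pr):
                clusters[idx].append(rec)        # joins an existing cluster
                break
        else:
            clusters.append([rec])               # no PR matched: new cluster
            reps.append(choose_pr(clusters[-1]))
    return clusters

# Toy usage: records are "first last" strings, matched on last name,
# and the PR of a cluster is simply its first record.
same_last = lambda a, b: a.split()[-1] == b.split()[-1]
first_record = lambda cluster: cluster[:1]
result = incremental_merge([["kathy kason"], ["lisa brown"]],
                           ["kathi kason", "raymond bonilla"],
                           same_last, first_record)
print(result)
```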
3 Strategies for Choosing PR
Random Sample
Select a sample of records at random from each cluster
N-Latest
Most recent elements entered in DB
Syntactic
Choose the largest or more complete record
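The three strategies can be sketched as small selection functions. The record layout is an assumption: each record is a dictionary with an `entered` sequence number, and "more complete" is read here as "has the most non-empty fields". The function names are hypothetical.

```python
import random

def random_sample(cluster, n, rng=random):
    """Random Sample: pick up to n records at random from the cluster."""
    return rng.sample(cluster, min(n, len(cluster)))

def n_latest(cluster, n):
    """N-Latest: the n records most recently entered in the DB."""
    return sorted(cluster, key=lambda r: r["entered"], reverse=True)[:n]

def syntactic(cluster):
    """Syntactic: the record with the most non-empty fields (assumed measure)."""
    def completeness(record):
        return sum(1 for value in record.values() if value not in ("", None))
    return max(cluster, key=completeness)

cluster = [
    {"name": "Kathi Kason", "ssn": "", "entered": 1},
    {"name": "Kathy Kason", "ssn": "879912345", "entered": 2},
    {"name": "Kathy Smith", "ssn": "", "entered": 3},
]
print([r["entered"] for r in n_latest(cluster, 2)])
print(syntactic(cluster)["name"])
```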
Important Assumption
No data previously used to select each cluster's PR will be deleted
Deleted records could require restructuring of clusters (expensive)
No changes in the rule-set will occur after the first increment of data is processed
Substantial rule change could invalidate clusters.
Results
Cumulative running time for the Incremental Merge/Purge algorithm is higher than the classic algorithm
PR selection methodology could improve cumulative running time
Total running time of the Incremental Merge/Purge algorithm is always smaller
Conclusion
Cleansing of Data
Sorted-Neighborhood Method is expensive due to
the sorting phase
the need for large windows for high accuracy
Multiple passes with small windows followed by transitive closure improve accuracy, and performance for a given level of accuracy
increasing number of successful matches
decreasing number of false positives
Recap
2 major reasons merging large databases becomes a difficult problem:
The databases are heterogeneous
The identifiers or strings differ in how they are represented within each DB
The 3 steps in SNM are:
Creation of key(s)
Sorting records on this key
Merge/Purge records
Prime representative - set of records from cluster considered to be representative of data contained in cluster
3 strategies for selecting a PR:
Random Sample
N-Latest
Syntactic
Questions: