Data Quality Class 10. Agenda Review of Last week Cleansing Applications Guest Speaker.
-
Upload
rosalyn-hampton -
Category
Documents
-
view
216 -
download
1
Transcript of Data Quality Class 10. Agenda Review of Last week Cleansing Applications Guest Speaker.
Data Quality
Class 10
Agenda
• Review of Last week
• Cleansing Applications
• Guest Speaker
Review of Similarity
• We use measures of similarity to establish measures and thresholds for linkage
• Edit Distance• Phonetic Similarity• Ngrams
Linkage
• Critical component of data quality applications• Linkage involves finding a link between a pair of
records, either through an exact match or through an approximate match
• Linkage is useful for – Deduplification– Merge/Purge– Enhancement– Householding
Linkage 2
• Two records are linked when they match with enough weighted similarity
• Matching can be a combination of exact matching on particular fields, to approximat ematching with some degree of similarity
• For example: two customer records can be linked via an account number, if account numbers are uniquely assigned to customers
Thresholding
• We have functions to determine similarity
• Tune these functions to return a value between 0 and 1 (i.e., 0% and 100%)
• 0 = absolutely no match
• 1 = absolute match
Threhsolding 2
• For any pair of values, we can apply the similarity function and get a score
• For different kinds of data, we can set a minimum threshold, above which the two values are said to match
• For example, for an n-gram match, we can set the threshold at 75%
Thresholding 3
• When comparing 2 records, we apply the similairty functions in a pairwise manner to each of the important attribute value pairs
• We assign some weight to each attribute value pair score
• Total similarity score for each pair of records is the sum of individual attribute value pairs scores, adjusted by weight
• We can assign an overall threshold indicating a match
Thresholds and Matches
• We actually assign two thresholds, to partition scores into three groups:– Definite matches– User review– Definite no-matches
• Scores above the high threshold are definite matches
• Scores between low and high thresholds are user review matches
• All others are not matches
Deduplification
• Duplicates are sets of records that represent the same entity
• Duplicate elimination is the process of finding records that belong to the same equivalence class based on similarity above a specified threshold
• When duplicates are found, one record is created to replace all the suplicates
• That record is composed of the “best” data gleaned from all equivalence class members
Merge/Purge
• Similar to duplicate elimination• Application used when merging two or
more different database• Goal: find all the records associated with
the same entity• Example: when two banks merge, find all
accounts owned by the same person in both banks
Enhancement
• We can enhance data by merging it with other data sets
• Linkage may be based on profile information extracted from each record
Householding
• We link on address as well as some permutation of the entity name
• Look to establish a location match and some relation between entities at that location
Naive Algorithm
• Goal: find all possible matches between any pair of records
• Approach: Perform a pairwise similarity score for all record pairs
• Downside: O(n2)
Improvements
• Desire to reduce the number of candidates for pairwise similarity testing
• We can use fast matching for fixed pivot values when merging data sets
• We use a concept called blocking to reduce the search set
Fast Matching for Merging
• We have looked at Bloom filters for fast matching
• We can load all (recordID,attribute value) tuples from the first data set into the Bloom filter O(n)
• We can then test each of the values from the second set to see if the pivot matches in the first set
Blocking
• Goal: reduce number of match candidates by using some form of “compression” on the records to be linked
• Example: phonetic encodings• Example: limit by fixing one of the
attributes• Example: find a pivot attribute and use that
for affinity
Blocking 2
• Example: if we want to perform householding on a mailing list:– Block by ZIP code, since we don’t expect to
find members of the same household living in different locations
Linkage Algorithms
• All linkage algorithms make use of these ideas:– A blocking mechanism
• We must choose based on the data available and what makes sense for the application
– Similarity functions• Every data type and data domain should have an associated
similarity function, even if it is a 0/1 exact match test
– Weights for the similarity functions• This requires more insight into the problem, to see how each
attribute’s scores should weigh for the overall score