Data Quality Class 10. Agenda Review of Last week Cleansing Applications Guest Speaker.

Data Quality

Class 10

Agenda

• Review of Last week

• Cleansing Applications

• Guest Speaker

Review of Similarity

• We use similarity measures to establish scores and thresholds for linkage

• Edit Distance

• Phonetic Similarity

• N-grams
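
As a reminder of how two of these measures can be computed, here is a minimal sketch. It assumes Levenshtein distance for the edit distance and Jaccard overlap of character n-grams for the n-gram measure; the function names are illustrative only, and phonetic similarity is left out.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Scale edit distance into [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def ngrams(text: str, n: int = 2) -> set:
    """Set of all n-character substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Jaccard overlap of the two n-gram sets (an assumed choice of overlap)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        # Strings shorter than n have no n-grams; fall back to exact match.
        return 1.0 if a == b else 0.0
    return len(ga & gb) / len(ga | gb)


print(edit_distance("kitten", "sitting"))  # 3
```

Both similarity functions return a value between 0 and 1, which is what the thresholding slides that follow rely on.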

Linkage

• Critical component of data quality applications

• Linkage involves finding a link between a pair of records, either through an exact match or through an approximate match

• Linkage is useful for:
– Deduplication
– Merge/Purge
– Enhancement
– Householding

Linkage 2

• Two records are linked when they match with enough weighted similarity

• Matching can range from exact matching on particular fields to approximate matching with some degree of similarity

• For example: two customer records can be linked via an account number, if account numbers are uniquely assigned to customers

Thresholding

• We have functions to determine similarity

• Tune these functions to return a value between 0 and 1 (i.e., 0% and 100%)

• 0 = absolutely no match

• 1 = absolute match

Thresholding 2

• For any pair of values, we can apply the similarity function and get a score

• For different kinds of data, we can set a minimum threshold, above which the two values are said to match

• For example, for an n-gram match, we can set the threshold at 75%
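
A small sketch of a thresholded n-gram match. It assumes bigrams scored by Jaccard overlap and treats a score at or above the threshold as a match; `ngram_score` and `values_match` are illustrative names, not part of any particular tool.

```python
def bigrams(text: str) -> set:
    """Set of all 2-character substrings of text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def ngram_score(a: str, b: str) -> float:
    """Jaccard overlap of bigram sets, always between 0 and 1."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga or not gb:
        # Too short to have bigrams; fall back to an exact test.
        return 1.0 if a == b else 0.0
    return len(ga & gb) / len(ga | gb)


NGRAM_THRESHOLD = 0.75  # the 75% figure from the slide


def values_match(a: str, b: str, threshold: float = NGRAM_THRESHOLD) -> bool:
    """Assumption: a score equal to the threshold also counts as a match."""
    return ngram_score(a, b) >= threshold


print(ngram_score("jonathan", "jonathon"))   # 0.625
print(values_match("jonathan", "jonathon"))  # False
print(values_match("smith", "smith"))        # True
```

With this scoring, "jonathan" and "jonathon" share 5 of 8 distinct bigrams, so they fall below a 75% threshold. Whether that is the right outcome depends on the data, which is why thresholds are set per kind of data.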

Thresholding 3

• When comparing two records, we apply the similarity functions in a pairwise manner to each of the important attribute value pairs

• We assign some weight to each attribute value pair score

• The total similarity score for a pair of records is the sum of the individual attribute value pair scores, each adjusted by its weight

• We can assign an overall threshold indicating a match
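
The weighted scoring described above can be sketched as follows. The field names and weights are made up for illustration, and dividing by the total weight (so the record score stays between 0 and 1) is an assumption rather than something the slides specify.

```python
def exact(a, b) -> float:
    """0/1 exact-match similarity."""
    return 1.0 if a == b else 0.0


def record_score(rec_a: dict, rec_b: dict, fields: dict) -> float:
    """fields maps attribute name -> (similarity function, weight)."""
    total_weight = sum(weight for _, weight in fields.values())
    weighted = sum(weight * sim(rec_a[name], rec_b[name])
                   for name, (sim, weight) in fields.items())
    # Assumption: normalize so the result is comparable to a 0..1 threshold.
    return weighted / total_weight


# Hypothetical attributes and weights.
FIELDS = {"name": (exact, 5), "zip": (exact, 3), "phone": (exact, 2)}

a = {"name": "john smith", "zip": "20742", "phone": "555-1234"}
b = {"name": "john smith", "zip": "20742", "phone": "555-9999"}

print(record_score(a, b, FIELDS))  # 0.8
```

Here name and ZIP agree and phone does not, so the pair scores (5 + 3) / 10 = 0.8, which is then compared with the overall threshold.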

Thresholds and Matches

• We actually assign two thresholds, to partition scores into three groups:
– Definite matches
– User review
– Definite no-matches

• Scores above the high threshold are definite matches

• Scores between low and high thresholds are user review matches

• All others are not matches
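
A sketch of the three-way partition. The threshold values 0.6 and 0.85 are placeholders, and sending a score exactly equal to a threshold to user review is an assumption.

```python
def classify(score: float, low: float = 0.6, high: float = 0.85) -> str:
    """Partition a record-pair score into one of three groups."""
    if score > high:
        return "match"       # definite match
    if score >= low:
        return "review"      # hand to a user for review
    return "non-match"       # definite no-match


for s in (0.95, 0.70, 0.30):
    print(s, classify(s))
```

Moving the two thresholds apart sends more pairs to review; moving them together automates more decisions at the cost of more wrong ones.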

Deduplication

• Duplicates are sets of records that represent the same entity

• Duplicate elimination is the process of finding records that belong to the same equivalence class based on similarity above a specified threshold

• When duplicates are found, one record is created to replace all the duplicates

• That record is composed of the “best” data gleaned from all equivalence class members
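
One way to sketch duplicate elimination is to group records into equivalence classes with a union-find structure and then build one surviving record per class. Two assumptions here: all field values are strings, and the "best" value for a field is simply the longest one. A real application would use its own survivorship rules.

```python
from itertools import combinations


def find(parent: list, i: int) -> int:
    """Return the representative of i's equivalence class."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path compression
        i = parent[i]
    return i


def merge(group: list) -> dict:
    """Assumed survivorship rule: keep the longest value seen for each field."""
    merged = dict(group[0])
    for rec in group[1:]:
        for field, value in rec.items():
            if len(value) > len(merged.get(field, "")):
                merged[field] = value
    return merged


def deduplicate(records: list, similarity, threshold: float) -> list:
    parent = list(range(len(records)))
    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= threshold:
            parent[find(parent, i)] = find(parent, j)  # join the two classes
    classes = {}
    for i, rec in enumerate(records):
        classes.setdefault(find(parent, i), []).append(rec)
    return [merge(group) for group in classes.values()]


def same_surname(a: dict, b: dict) -> float:
    """Toy similarity for the demo: 1.0 when the last name token agrees."""
    return 1.0 if a["name"].split()[-1].lower() == b["name"].split()[-1].lower() else 0.0


people = [
    {"name": "J. Smith", "phone": "555-1234"},
    {"name": "John Smith", "phone": ""},
    {"name": "Mary Jones", "phone": "555-9999"},
]
print(deduplicate(people, same_surname, 1.0))
```

The two Smith records collapse into one record carrying the fuller name from one and the phone number from the other.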

Merge/Purge

• Similar to duplicate elimination

• Application used when merging two or more different databases

• Goal: find all the records associated with the same entity

• Example: when two banks merge, find all accounts owned by the same person in both banks

Enhancement

• We can enhance data by merging it with other data sets

• Linkage may be based on profile information extracted from each record

Householding

• We link on address as well as some permutation of the entity name

• Look to establish a location match and some relation between entities at that location

Naive Algorithm

• Goal: find all possible matches between any pair of records

• Approach: compute a pairwise similarity score for all record pairs

• Downside: O(n^2) comparisons
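
The naive approach can be sketched directly. The similarity function and threshold are passed in, and the comparison counter is there only to make the quadratic cost visible.

```python
from itertools import combinations


def naive_link(records: list, similarity, threshold: float):
    """Score every unordered pair of records; return matches and the pair count."""
    matches = []
    comparisons = 0
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        comparisons += 1
        if similarity(a, b) >= threshold:
            matches.append((i, j))
    return matches, comparisons


def exact(a, b) -> float:
    return 1.0 if a == b else 0.0


names = ["ann", "bob", "ann", "cat"]
print(naive_link(names, exact, 1.0))  # ([(0, 2)], 6)
```

For n records this makes n(n-1)/2 comparisons: 6 for the four names above, and 499,500 for 1,000 records.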

Improvements

• Desire to reduce the number of candidates for pairwise similarity testing

• We can use fast matching for fixed pivot values when merging data sets

• We use a concept called blocking to reduce the search set

Fast Matching for Merging

• We have looked at Bloom filters for fast matching

• We can load all (recordID, attribute value) tuples from the first data set into the Bloom filter, which takes O(n) time

• We can then test each of the values from the second set to see if the pivot matches in the first set
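
A minimal Bloom filter sketch for the pivot test. The bit-array size, the number of hash functions, and the use of salted SHA-256 are all assumptions chosen for readability, not tuned values. This version stores only the pivot values, because a Bloom filter answers membership questions and cannot return the record ID. It can also report false positives, so each hit is confirmed with a caller-supplied exact check (`confirm`, a hypothetical hook such as a lookup against the first data set).

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits: int = 10_000, num_hashes: int = 4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for simplicity

    def _positions(self, value: str):
        # Derive num_hashes positions by salting the value with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos] = 1

    def might_contain(self, value: str) -> bool:
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(value))


def pivot_matches(first_values, second_values, confirm) -> list:
    """Load the first set's pivots, then test each pivot from the second set."""
    bloom = BloomFilter()
    for value in first_values:  # O(n) load
        bloom.add(value)
    return [v for v in second_values if bloom.might_contain(v) and confirm(v)]


first = ["A100", "A200", "A300"]
second = ["A200", "B999", "A300"]
print(pivot_matches(first, second, set(first).__contains__))  # ['A200', 'A300']
```

In the demo the confirmation is a set lookup, which would make the filter redundant; the filter pays off when the exact check is expensive and most values from the second set are absent from the first.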

Blocking

• Goal: reduce number of match candidates by using some form of “compression” on the records to be linked

• Example: phonetic encodings

• Example: limit by fixing one of the attributes

• Example: find a pivot attribute and use that for affinity

Blocking 2

• Example: if we want to perform householding on a mailing list:
– Block by ZIP code, since we don’t expect to find members of the same household living in different locations
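
Blocking by ZIP code can be sketched as grouping records on a key and generating candidate pairs only within each group. The record layout and the `zip` field name are assumptions for the example.

```python
from collections import defaultdict
from itertools import combinations


def block_by(records: list, key) -> dict:
    """Group records by a blocking key such as ZIP code."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key(rec)].append(rec)
    return blocks


def candidate_pairs(records: list, key):
    """Yield only pairs that share a block, instead of all n(n-1)/2 pairs."""
    for block in block_by(records, key).values():
        yield from combinations(block, 2)


mailing_list = [
    {"name": "A Lee", "zip": "20742"},
    {"name": "B Lee", "zip": "20742"},
    {"name": "C Kim", "zip": "10001"},
    {"name": "D Kim", "zip": "10001"},
    {"name": "E Roy", "zip": "94105"},
]

pairs = list(candidate_pairs(mailing_list, key=lambda r: r["zip"]))
print(len(pairs))  # 2 candidate pairs, versus 10 without blocking
```

The trade-off is that a wrong or missing ZIP code puts a record in the wrong block, and blocking alone will never propose that pair.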

Linkage Algorithms

• All linkage algorithms make use of these ideas:

– A blocking mechanism
  • We must choose based on the data available and what makes sense for the application

– Similarity functions
  • Every data type and data domain should have an associated similarity function, even if it is a 0/1 exact match test

– Weights for the similarity functions
  • This requires more insight into the problem, to see how much each attribute’s score should count toward the overall score
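
Putting the three ingredients together, a linkage routine might look like the sketch below. The field names, the weights, the 0.75 threshold, and the normalization by total weight are illustrative assumptions, and exact matching stands in for whatever similarity function each attribute really needs.

```python
from collections import defaultdict
from itertools import combinations


def exact(a, b) -> float:
    """0/1 exact-match similarity, the simplest possible similarity function."""
    return 1.0 if a == b else 0.0


def link(records: list, block_key, fields: dict, threshold: float) -> list:
    """fields maps attribute name -> (similarity function, weight)."""
    # 1. Blocking: only records sharing a block key are compared.
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec)].append(idx)

    total_weight = sum(weight for _, weight in fields.values())
    links = []
    for members in blocks.values():
        for i, j in combinations(members, 2):
            # 2 and 3. Per-attribute similarity, combined using the weights.
            score = sum(weight * sim(records[i][name], records[j][name])
                        for name, (sim, weight) in fields.items()) / total_weight
            if score >= threshold:
                links.append((i, j, score))
    return links


records = [
    {"zip": "20742", "surname": "lee", "street": "1 main st"},
    {"zip": "20742", "surname": "lee", "street": "1 main st"},
    {"zip": "20742", "surname": "kim", "street": "9 oak ave"},
    {"zip": "10001", "surname": "lee", "street": "1 main st"},
]
fields = {"surname": (exact, 1), "street": (exact, 3)}

print(link(records, lambda r: r["zip"], fields, threshold=0.75))  # [(0, 1, 1.0)]
```

Record 3 has the same surname and street as records 0 and 1 but sits in a different ZIP block, so it is never compared with them. That is the intended effect of blocking here.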