
Seminar 3 1

Searching and Integrating Information on the Web

Seminar 3: Data Cleansing

Professor Chen Li

UC Irvine

Seminar 3 2

Paper readings

• Efficient Merge and Purge, Hernandez and Stolfo, SIGMOD 1995

• Approximate String Joins in a Database (Almost) for Free, Gravano et al., VLDB 2001

• Efficient Record Linkage in Large Data Sets, Jin, Li, and Mehrotra, DASFAA 2003

• Interactive Deduplication Using Active Learning, Sarawagi and Bhamidipaty, KDD 2002

Seminar 3 3

Motivation
• Correlate data from different data sources (e.g., data integration)
  – Data is often dirty
  – It needs to be cleansed before being used
• Example: a hospital needs to merge patient records from different data sources
  – The records have different formats, typos, and abbreviations

Seminar 3 4

Example

Table R:
  Name           SSN           Addr
  Jack Lemmon    430-871-8294  Maple St
  Harrison Ford  292-918-2913  Culver Blvd
  Tom Hanks      234-762-1234  Main St
  ...            ...           ...

Table S:
  Name           SSN           Addr
  Ton Hanks      234-162-1234  Main Street
  Kevin Spacey   928-184-2813  Frost Blvd
  Jack Lemon     430-817-8294  Maple Street
  ...            ...           ...

• Find records from different datasets that could be the same entity

Seminar 3 5

Another Example
Two citations of the same paper, in different formats:
• P. Bernstein, D. Chiu: Using Semi-Joins to Solve Relational Queries. JACM 28(1): 25-40 (1981)
• Philip A. Bernstein, Dah-Ming W. Chiu, Using Semi-Joins to Solve Relational Queries, Journal of the ACM (JACM), v.28 n.1, p.25-40, Jan. 1981

Seminar 3 6

Record Linkage

Problem statement: “Given two relations, identify the potentially matched records
– efficiently and
– effectively.”

Seminar 3 7

Challenges
• How to define good similarity functions?
  – Many functions proposed (edit distance, cosine similarity, ...)
  – Domain knowledge is critical
    Names: “Wall Street Journal” and “LA Times”
    Address: “Main Street” versus “Main St”
• How to do matching efficiently?
  – Offline join version
  – Online (interactive) search
    Nearest-neighbor search
    Range search

Seminar 3 8

Outline
• Supporting string-similarity joins using an RDBMS
• Using mapping techniques
• Interactive deduplication

Seminar 3 9

Edit Distance
• A widely used metric to define string similarity
• ed(s1, s2) = minimum number of operations (insertion, deletion, substitution) to change s1 into s2
• Example:
  s1: Tom Hanks
  s2: Ton Hank
  ed(s1, s2) = 2
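To make the definition concrete, here is a minimal Python sketch of the standard dynamic-programming computation of edit distance (an illustration, not code from the papers; the function name is my own):

    def edit_distance(s1, s2):
        # d[i][j] = minimum number of insertions, deletions, and
        # substitutions needed to turn s1[:i] into s2[:j].
        m, n = len(s1), len(s2)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if s1[i - 1] == s2[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[m][n]

    print(edit_distance("Tom Hanks", "Ton Hank"))  # 2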

Seminar 3 10

Approximate String Joins
• We want to join tuples with “similar” string fields
• Similarity measure: edit distance
• Each insertion, deletion, or replacement increases the distance by one

  Service A               Service B
  Jenny Stamatopoulou     Panos Ipirotis
  John Paul McDougal      Jonh Smith
  Aldridge Rodriguez      Jenny Stamatopulou
  Panos Ipeirotis         John P. McDougal
  John Smith              Al Dridge Rodriguez

(Figure: each matching pair across the two services is labeled with its edit distance K, ranging from 1 to 3.)

Seminar 3 11

Focus: Approximate String Joins over Relational DBMSs
• Join two tables on string attributes and keep all pairs of strings with edit distance ≤ K
• Solve the problem in a database-friendly way (if possible, with an existing "vanilla" RDBMS)

Seminar 3 12

Current Approaches for Processing Approximate String Joins

No native support for approximate joins in RDBMSs. Two existing (straightforward) solutions:
• Join data outside of the DBMS
• Join data via user-defined functions (UDFs) inside the DBMS

Seminar 3 13

Approximate String Joins outside of a DBMS
1. Export the data
2. Join outside of the DBMS
3. Import the result

Main advantage:
• We can exploit any state-of-the-art string-matching algorithm, without restrictions from DBMS functionality

Disadvantages:
• Substantial amounts of data have to be exported/imported
• Cannot be easily integrated with further processing steps in the DBMS

Seminar 3 14

Approximate String Joins with UDFs
1. Write a UDF to check if two strings match within distance K
2. Write an SQL statement that applies the UDF to the string pairs

    SELECT R.stringAttr, S.stringAttr
    FROM R, S
    WHERE edit_distance(R.stringAttr, S.stringAttr, K)

Main advantage: ease of implementation
Main disadvantage: the UDF is applied to the entire cross-product of the relations

Seminar 3 15

Our Approach: Approximate String Joins over an Unmodified RDBMS
1. Preprocess the data and generate auxiliary tables
2. Perform the join exploiting standard RDBMS capabilities

Advantages:
• No modification of the underlying RDBMS needed
• Can leverage the RDBMS query optimizer
• Much more efficient than the approach based on naive UDFs

Seminar 3 16

Intuition and Roadmap
• Intuition:
  – Similar strings have many common substrings
  – Use exact joins to perform approximate joins (current DBMSs are good at exact joins)
  – A good candidate set can then be verified for false positives [Ukkonen 1992, Sutinen and Tarhio 1996, Ullman 1977]
• Roadmap:
  – Break strings into substrings of length q (q-grams)
  – Perform an exact join on the q-grams
  – Find candidate string pairs based on the results
  – Check only the candidate pairs with a UDF to obtain the final answer

Seminar 3 17

What is a “Q-gram”?
• Q-gram: a sequence of q consecutive characters of the (padded) original string
• Example for q = 3:
  vacations → {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}
• A string of length L yields L + q - 1 q-grams
• Similar strings have many common q-grams
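For illustration, a small Python sketch of q-gram generation with '#'/'$' padding that reproduces the example above (the helper name is my own):

    def qgrams(s, q=3):
        # Pad with q-1 '#' characters on the left and q-1 '$' characters
        # on the right, then slide a window of length q over the result,
        # yielding len(s) + q - 1 q-grams.
        padded = "#" * (q - 1) + s + "$" * (q - 1)
        return [padded[i:i + q] for i in range(len(padded) - q + 1)]

    print(qgrams("vacations"))
    # ['##v', '#va', 'vac', 'aca', 'cat', 'ati', 'tio', 'ion', 'ons', 'ns$', 's$$']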

Seminar 3 18

Q-grams and Edit Distance Operations
• With no edits: L + q - 1 common q-grams
• Replacement: at least (L + q - 1) - q common q-grams
  Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}
  Vacalions: {##v, #va, vac, aca, cal, ali, lio, ion, ons, ns$, s$$}
• Insertion: at least (Lmax + q - 1) - q common q-grams
  Vacations:  {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}
  Vacatlions: {##v, #va, vac, aca, cat, atl, tli, lio, ion, ons, ns$, s$$}
• Deletion: at least (Lmax + q - 1) - q common q-grams
  Vacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}
  Vacaions:  {##v, #va, vac, aca, cai, aio, ion, ons, ns$, s$$}

Seminar 3 19

Number of Common Q-grams and Edit Distance
• For edit distance = K, there can be at most K replacements, insertions, and deletions
• Two strings s1 and s2 within edit distance K have at least
  max(s1.len, s2.len) + q - 1 - K*q
  q-grams in common
• Useful filter: eliminate all string pairs without “enough” common q-grams (no false dismissals)
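A hedged Python sketch of this count filter, reusing the qgrams helper from the earlier sketch (for exposition only; the papers push this test into SQL, as the next slides show):

    from collections import Counter

    def count_filter_passes(s1, s2, q=3, k=2):
        # Strings within edit distance k must share at least
        # max(len) + q - 1 - k*q q-grams (counted with multiplicity).
        # The converse does not hold: survivors are only candidates
        # and still need an exact edit-distance check.
        common = sum((Counter(qgrams(s1, q)) & Counter(qgrams(s2, q))).values())
        return common >= max(len(s1), len(s2)) + q - 1 - k * q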

Seminar 3 20

Using a DBMS for Q-gram Joins
• If we have the q-grams in the DBMS, we can perform this counting efficiently
• Create auxiliary tables with tuples of the form <sid, strlen, qgram> and join these tables
• A GROUP BY ... HAVING COUNT clause can perform the counting/filtering

Seminar 3 21

Eliminating Candidate Pairs: COUNT FILTERING

SQL for this filter (parts omitted for clarity):

    SELECT R.sid, S.sid
    FROM R, S
    WHERE R.qgram = S.qgram
    GROUP BY R.sid, S.sid
    HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) - K*q

The result is the set of string pairs with enough common q-grams to guarantee that we will not have false negatives.

Seminar 3 22

Eliminating Candidate Pairs Further: LENGTH FILTERING

Strings with a length difference larger than K cannot be within edit distance K.

    SELECT R.sid, S.sid
    FROM R, S
    WHERE R.qgram = S.qgram
      AND abs(R.strlen - S.strlen) <= K
    GROUP BY R.sid, S.sid
    HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) - K*q

We refer to this filter as LENGTH FILTERING.

Seminar 3 23

Exploiting Q-gram Positions for Filtering
• Consider the strings aabbzzaacczz and aacczzaabbzz
  – The strings are at edit distance 4
  – The strings have identical q-grams for q = 3
• Problem: matching q-grams can be at very different positions in the two strings
  – Either the q-grams do not “originate” from the same q-gram, or
  – too many edit operations “caused” spurious q-grams at various parts of the strings to match

Seminar 3 24

POSITION FILTERING: Filtering Using Positions
• Keep the position of each q-gram: <sid, strlen, pos, qgram>
• Do not match q-grams that are more than K positions apart

    SELECT R.sid, S.sid
    FROM R, S
    WHERE R.qgram = S.qgram
      AND abs(R.strlen - S.strlen) <= K
      AND abs(R.pos - S.pos) <= K
    GROUP BY R.sid, S.sid
    HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) - K*q

We refer to this filter as POSITION FILTERING.

Seminar 3 25

The Actual, Complete SQL Statement

    SELECT R1.string, S1.string, R1.sid, S1.sid
    FROM R1, S1, R, S
    WHERE R1.sid = R.sid
      AND S1.sid = S.sid
      AND R.qgram = S.qgram
      AND abs(strlen(R1.string) - strlen(S1.string)) <= K
      AND abs(R.pos - S.pos) <= K
    GROUP BY R1.sid, S1.sid, R1.string, S1.string
    HAVING COUNT(*) >= (max(strlen(R1.string), strlen(S1.string)) + q - 1) - K*q

Seminar 3 26

Summary of the 1st Paper
• Introduced a technique for mapping approximate string joins into a “vanilla” SQL expression
• The technique does not require modifying the underlying RDBMS

Seminar 3 27

Outline
• Supporting string-similarity joins using an RDBMS
• Using mapping techniques
• Interactive deduplication

Seminar 3 28

Single-Attribute Case
• Given:
  – two sets of strings, R and S
  – a similarity function f between strings (a metric space):
    Reflexive: f(s1, s2) = 0 iff s1 = s2
    Symmetric: f(s1, s2) = f(s2, s1)
    Triangle inequality: f(s1, s2) + f(s2, s3) >= f(s1, s3)
  – a threshold k
• Find: all pairs of strings (r, s) from R and S such that f(r, s) <= k

Seminar 3 29

Nested Loop?
• Not desirable for large data sets
• 5 hours for 30K strings!

Seminar 3 30

Our 2-Step Approach
• Step 1: map strings (in a metric space) to objects in a Euclidean space
• Step 2: do a similarity join in the Euclidean space

Seminar 3 31

Advantages
• Applicable to many metric similarity functions
  – Edit distance used as the running example
  – Other similarity functions also tried, e.g., q-gram-based similarity
• Open to existing algorithms
  – Mapping techniques
  – Join techniques

Seminar 3 32

Step 1
Map strings into a high-dimensional Euclidean space.
(Figure: points in the metric space mapped to points in the Euclidean space.)

Seminar 3 33

Mapping: StringMap
• Input: a list of strings
• Output: points in a high-dimensional Euclidean space that preserve the original distances well
• A variation of FastMap
  – Each step greedily picks two strings (pivots) a and b to form an axis
  – All axes are orthogonal
  – The coordinate of string i on the axis defined by pivots a and b is

    x_i = (d(a,i)^2 + d(a,b)^2 - d(b,i)^2) / (2 * d(a,b))
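A sketch of this projection step in Python, assuming a dist function such as the edit_distance sketch earlier (this is the generic FastMap coordinate formula; variable names are my own):

    def axis_coordinate(s, a, b, dist):
        # Coordinate of string s on the axis defined by pivot strings
        # a and b, via the law of cosines; requires dist(a, b) > 0.
        d_ab = dist(a, b)
        return (dist(a, s) ** 2 + d_ab ** 2 - dist(b, s) ** 2) / (2 * d_ab)

    # Example: project every string onto one axis with pivots a, b.
    # coords = [axis_coordinate(s, a, b, edit_distance) for s in strings]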

Seminar 3 34

Can It Preserve Distances?
• Data sources:
  – IMDB star names: 54,000
  – German names: 132,000
• Distribution of string lengths: (chart omitted)

Seminar 3 35

Can It Preserve Distances? (cont.)
• Use data set 1 (54K names) as an example
• k = 2, d = 20
  – Use k' = 5.2 to differentiate similar and dissimilar pairs
(Chart omitted.)

Seminar 3 36

Choose Dimensionality d

Increase d?
• Good:
  – better differentiation of similar pairs from dissimilar ones
• Bad:
  – Step 1: efficiency decreases
  – Step 2: “curse of dimensionality”

Seminar 3 37

Choose Dimensionality d Using Sampling
• Sample 1K x 1K strings and find their similar pairs (within distance k)
• Calculate the maximum w of their new distances
• Define the “cost” of finding a similar pair:

  Cost = (# of pairs within distance w) / (# of similar pairs)

Seminar 3 38

Choose Dimensionality d
(Chart omitted; the sampled cost suggests choosing d = 15 to 25.)

Seminar 3 39

Choose the New Threshold k'
• Closely related to the mapping property
• Ideally, if ed(r, s) <= k, the Euclidean distance between the two corresponding points is <= k'
• Choose k' using sampling:
  – Sample 1K x 1K strings and find their similar pairs
  – Calculate their maximum new distance as k'
  – Repeat multiple times and choose the maximum

Seminar 3 40

New Threshold k' in Step 2
(Chart omitted; results shown for d = 20.)

Seminar 3 41

Step 2: Similarity Join
• Input: two sets of points in a Euclidean space
• Output: pairs of points whose distance is less than the new threshold k'
• Many join algorithms can be used

Seminar 3 42

Example
• Adopted an algorithm by Hjaltason and Samet:
  – Build two R-trees
  – Traverse the two trees and find points whose distance is within k'
  – Prune during traversal (e.g., using MinDist)
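The paper's join traverses two R-trees; as a rough stand-in, here is a Python sketch that uses SciPy's k-d trees to report all cross-set pairs within distance k' (assumes SciPy is available; this substitutes for the R-tree machinery rather than reproducing it):

    import numpy as np
    from scipy.spatial import cKDTree

    def euclidean_join(points_r, points_s, k_prime):
        # Build one tree per point set, then find, for each point of R,
        # all points of S within Euclidean distance k_prime.
        tree_r = cKDTree(np.asarray(points_r))
        tree_s = cKDTree(np.asarray(points_s))
        matches = tree_r.query_ball_tree(tree_s, r=k_prime)
        return [(i, j) for i, js in enumerate(matches) for j in js]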

Seminar 3 43

Final Processing
• Among the pairs produced by the similarity-join step, check their edit distance
• Return the pairs satisfying the threshold k

Seminar 3 44

Running Time
(Chart omitted.)

Seminar 3 45

Recall
• Recall = (# of found similar pairs) / (# of all similar pairs)
(Chart omitted.)

Seminar 3 46

Multi-Attribute Linkage
• Example: title + name + year
• Different attributes have different similarity functions and thresholds
• Consider merge rules in disjunctive format (example rules omitted)

Seminar 3 47

Evaluation Strategies
• Many ways to evaluate rules
• Finding an optimal one: NP-hard
• Heuristics:
  – Treat different conjuncts independently; pick the “most efficient” attribute in each conjunct
  – Choose the largest threshold for each attribute, then choose the “most efficient” attribute among these thresholds

Seminar 3 48

Summary of the 2nd Paper
• A novel two-step approach to record linkage
• Many existing mapping and join algorithms can be adopted
• Applicable to many distance metrics
• Time and space efficient
• Multi-attribute case studied

Seminar 3 49

Outline
• Supporting string-similarity joins using an RDBMS
• Using mapping techniques
• Interactive deduplication

Seminar 3 50

Problems with Existing Deduplication Methods
• Matching functions:
  – calculate similarity scores against thresholds
  – tedious coding
• Learning-based methods:
  – require a large (static) training set for accuracy
  – it is difficult to provide a covering and challenging training set that brings out the subtlety of the deduplication function

ALIAS: Active Learning led Interactive Alias Suppression

Seminar 3 51

New Approach
• Relegate the task of finding the deduplication function to a machine learning algorithm
• Design goals:
  – fewer training instances
  – interactive response
  – fast convergence
  – high accuracy
• Design an interactive deduplication system called ALIAS

Seminar 3 52

ALIAS System
• A learning-based method
• Exploits existing similarity functions
• Uses active learning:
  – an active learner picks the unlabeled instances with the most information gain for the training set
• Produces a deduplication function that can identify duplicates

Seminar 3 53

Overall Architecture
(Figure: Mappers apply the similarity functions F to the labeled pairs L and the unlabeled database D, producing mapped instances Lp and Dp; similarity indices support the mapping. The learner trains classifiers on the training data T, selects n instances S from Dp for labeling, and adds the user's feedback back into T. A predicate describes the classifier's uncertain region.)

Seminar 3 54

Primary Inputs for ALIAS
• A set of initial training pairs (L)
  – fewer than 10 labeled records
  – arranged in pairs of duplicates and non-duplicates
• A set of similarity functions (F)
  – e.g., word-match, q-gram-match, ...
  – compute similarity scores between two records based on any subset of attributes
  – the learner will find the right way of combining those scores to get the final deduplication function
• A database of unlabeled records (D)
• The number of classifiers (< 5)

Seminar 3 55

Mapped Labeled Instances
• Take a pair r1, r2 from the input L, with records r1(a1, a2, a3) and r2(a1, a2, a3)
• Use the similarity functions f1, f2, ..., fn to compute similarity scores between r1 and r2
• Form a new record r1&r2 = (s1, s2, ..., sn, y/n), where y = duplicate and n = non-duplicate
• Put the new record in Lp
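A minimal Python sketch of this mapping step (names are illustrative, not ALIAS's actual API):

    def map_pair(r1, r2, similarity_functions, label=None):
        # One similarity score per function; pairs drawn from the
        # labeled set L also carry a y/n label, pairs from D x D do not.
        scores = [f(r1, r2) for f in similarity_functions]
        return scores if label is None else scores + [label]

    # Lp = [map_pair(r1, r2, F, label) for (r1, r2, label) in L]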

Seminar 3 56

Mapped Unlabeled Instances
• Take a pair r1, r2 from D x D, with records r1(a1, a2, a3) and r2(a1, a2, a3)
• Use the similarity functions f1, f2, ..., fn to compute similarity scores between r1 and r2
• Form a new record r1&r2 = (s1, s2, ..., sn), with no y/n field
• Put the new record in Dp

Seminar 3 57

Active Learner
(Figure: within the architecture above, the learner trains classifiers on the training data T and selects a set S of n instances from Dp for labeling; user feedback on S flows back into T.)

Seminar 3 58

ALIAS Algorithm
1. Input: L, D, F
2. Create pairs Lp from the labeled data L and F
3. Create pairs Dp from the unlabeled data D and F
4. Initialize the training set T = Lp
5. Loop until user satisfaction:
   – Train classifier C using T
   – Use C to select a set S of n instances from Dp for labeling
   – If S is empty, exit the loop
   – Collect user feedback on the labels of S
   – Add S to T and remove S from Dp
6. Output classifier C
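A schematic Python rendering of this loop, with train, select_for_labeling, and get_user_labels assumed to be supplied as callbacks (a sketch of the control flow, not ALIAS's implementation):

    def alias_loop(Lp, Dp, train, select_for_labeling, get_user_labels):
        # Grow the training set T with the instances the current
        # classifier is least sure about, until none remain (or the
        # user is satisfied and stops providing labels).
        T = list(Lp)
        classifier = train(T)
        while True:
            S = select_for_labeling(classifier, Dp)  # n instances
            if not S:
                break
            T.extend(get_user_labels(S))             # user feedback
            Dp = [x for x in Dp if x not in S]
            classifier = train(T)
        return classifier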

Seminar 3 59

The Indexing Component
• Purpose: avoid mapping all pairs of records in D x D
• 3 methods:
  – Grouping
  – Sampling
  – Indexing

Seminar 3 60

The Indexing Component
• Grouping
  – Example: group records in D according to the field “year of publication”
  – Mapped pairs are formed only between records of the same group
• Sampling
  – Sample in units of a group instead of individual records

Seminar 3 61

The Indexing Component
• Indexing
  – Given a similarity function such as “fraction of common words between two text attributes >= 0.4”,
  – we can create an index on the words of the text attributes

Seminar 3 62

The Learning Component
• Contains a number of classifiers
• A classifier: a machine learning algorithm, such as a decision tree (D-tree), naïve Bayes (NB), or a Support Vector Machine (SVM), that classifies instances
• A classifier is trained using a training data set

Seminar 3 63

Criteria for a Classifier
• Accuracy
• Interpretability
• Indexability
• Efficient training

Seminar 3 64

Accuracy of a Classifier
• Measured by the mean F of recall r and precision p
  – r = fraction of duplicates correctly classified
  – p = fraction correct among all instances actually labeled duplicate
• F is the harmonic mean of p and r:

  F = 1 / (0.5 * (1/p + 1/r)) = 2pr / (p + r)

Seminar 3 65

Accuracy (cont.): Example
• A case with 1% duplicates, where a classifier labels all pairs as non-duplicates:
  recall r = 0, so F = 0 (0%)
• A case with 1% duplicates, where a classifier identifies all duplicates correctly but misclassifies 1% of the non-duplicates as duplicates:
  recall r = 1, precision p = 0.5, so F = 0.667 (66.7%)
• If we don't use r and p, both classifiers have 99% plain accuracy!
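A quick check of these numbers in Python:

    def f_measure(p, r):
        # Harmonic mean of precision p and recall r (0 when undefined).
        return 0.0 if p + r == 0 else 2 * p * r / (p + r)

    print(f_measure(0.0, 0.0))  # 0.0   (all pairs labeled non-duplicate;
                                #        p treated as 0, nothing predicted)
    print(f_measure(0.5, 1.0))  # 0.667 (all duplicates found, but half
                                #        of the predictions are wrong)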

Seminar 3 66

Criteria for a Classifier (cont.)
• Interpretability
  – the final deduplication rule is easy to understand and interpret
• Indexability
  – the final deduplication rule has indexable predicates
• Efficient training
  – fast to train

Seminar 3 67

Comparison of Different Classifiers
(Table omitted.)

Seminar 3 68

Active Learning

The goal is to seek out from the unlabeled pool the instances which, when labeled, will strengthen the classifier at the fastest possible rate.

Seminar 3 69

A Simple Example
• Assume all points on a line are unlabeled except a and b
• Suppose a's coordinate is 0 and b's coordinate is 1
• Any unlabeled point x to the left of a or to the right of b has no effect in reducing the region of uncertainty
• Including the midpoint m in the training set reduces the size of the uncertain region by half
  – m is the most uncertain instance

Seminar 3 70

How to Select an Unlabeled Instance
• Uncertainty
  – The instance about which the learner is most unsure is also the instance for which the expected reduction in confusion is largest
  – Uncertainty score: the disagreement among the predictions the instance gets from a committee of N classifiers
    A sure duplicate or non-duplicate would get the same prediction from all members
• Representativeness
  – An uncertain instance representing a larger number of unlabeled instances has a greater impact on the classifier
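One simple way to turn committee votes into an uncertainty score; a sketch (ALIAS's exact scoring may differ):

    def uncertainty(votes):
        # Disagreement among a committee of classifiers on one instance:
        # 0 when all members agree (a sure duplicate or non-duplicate),
        # approaching 1 when the votes are evenly split.
        frac_dup = votes.count("duplicate") / len(votes)
        return 1 - abs(2 * frac_dup - 1)

    print(uncertainty(["duplicate", "duplicate", "duplicate"]))      # 0.0
    print(uncertainty(["duplicate", "non-duplicate", "duplicate"]))  # 0.67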

Seminar 3 71

Example
• 3 similarity functions: word match f1, q-gram match f2, string edit distance f3
• Take two pairs from the mapped unlabeled instances (Dp):
  – r1&r2 = (s1, s2, ..., sn)
  – r3&r4 = (s1, s2, ..., sn)
  – s1, s2, ..., sn: scores using f1, f2, f3
• 3 classifiers: D-tree, Naïve Bayes, SVM

  Pair    D-tree     Naïve Bayes    SVM
  r1&r2   duplicate  non-duplicate  duplicate   <- selected (the committee disagrees)
  r3&r4   duplicate  duplicate      duplicate

Seminar 3 72

How to Combine Uncertainty and Representativeness
• Two approaches were proposed
• 1st approach: weighted sum (see the sketch after this list)
  – Cluster the unlabeled instances
  – Estimate the density of points around each instance
  – Score each instance by a weighted sum of its density and its uncertainty value
  – Select the n highest-scoring instances
• 2nd approach: sampling
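A minimal sketch of the first approach's score (alpha is an assumed weighting parameter, not a value from the paper):

    def selection_score(uncertainty_value, density, alpha=0.5):
        # Weighted sum of an instance's uncertainty and the estimated
        # density of unlabeled instances around it; the n instances
        # with the highest scores are sent to the user for labeling.
        return alpha * uncertainty_value + (1 - alpha) * density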

Seminar 3 73

Conclusion

ALIAS:
• Makes deduplication much easier (fewer training instances)
• Provides interactive response to the user
• Achieves high accuracy