DataLinkage1 - Compatibility Modevizclass/classes/infsci2711/Slides/DataLinkage1.pdf · Similarity...


Data Linkage

Vladimir Zadorozhny, DINS, University of Pittsburgh

Advanced Topics in Database Management (INFSCI 2711)

Some materials are from Similarity Joins in Relational Database Systems (Augsten and Böhlen) and Semantic Web for Working Ontologists (Allemang and Hendler).

Problem with matching related records

Equi-join on name attribute:

SELECT * FROM A, B
WHERE A.name = B.name

Problem: exact match does not work.
Solution: approximate match based on a “distance” measure between strings:

SELECT * FROM A, B
WHERE distance(A.name, B.name) <= k

[Figure: example tables A and B]


Edit-Based Distance

• Minimum number of edit operations that transform one string into another.
  - Edit operations: (1) insertion of a character, (2) deletion of a character, (3) replacement of a character in the string by another character.
• Instead of counting the edit operations, a cost can be assigned to each operation.
  - We introduce the empty character ε.
  - Cost of transforming character a to character b: c(a,b)
  - Cost for inserting b: c(ε,b)
  - Cost for deleting a: c(a,ε)
  - Cost for replacing a by b: c(a,b)
  - Unit cost model: all costs c(a,b), a ≠ b, are one.
• The string edit distance, sed(s1,s2), between two strings s1 and s2 is the cost of the minimum-cost sequence of edit operations that transforms s1 to s2.
• Example: sed(banana, ananas) = 2
  - remove the first character b from banana → anana
  - insert a new character s at the end: anana → ananas
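Under the unit cost model, the string edit distance can be computed with the standard dynamic-programming recurrence over prefixes of the two strings. The following is a minimal Python sketch. The function name `sed` mirrors the slides; the row-by-row implementation is an illustrative choice, not something the slides prescribe.

```python
def sed(s1: str, s2: str) -> int:
    """String edit distance under the unit cost model (each edit costs 1)."""
    # prev[j] holds the distance between the current prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))  # transforming "" into s2[:j] takes j insertions
    for i, a in enumerate(s1, start=1):
        curr = [i]  # transforming s1[:i] into "" takes i deletions
        for j, b in enumerate(s2, start=1):
            replace_cost = 0 if a == b else 1
            curr.append(
                min(
                    prev[j] + 1,                 # delete a
                    curr[j - 1] + 1,             # insert b
                    prev[j - 1] + replace_cost,  # replace a by b (free if equal)
                )
            )
        prev = curr
    return prev[-1]


banana_distance = sed("banana", "ananas")
```

The sketch uses O(|s1| × |s2|) time and keeps only two rows of the table, which is one reason the later slides look for cheaper bounds before calling it.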

Similarity Join based on Edit Distance

Edit distance join on name attribute:

SELECT * FROM A, B
WHERE sed(A.name, B.name) <= k

Result of edit distance join for k = 3

[Figure: tables A and B]


Performance

• In a similarity join all pairs of records must be considered: it is essential that the query predicate is evaluated efficiently.
• Often there is a preferable distance measure from the application point of view, but this distance is expensive to compute.
• Algorithm 1: a nested-loop similarity join which returns all pairs of records (s1,s2) from the sets (tables) X and Y such that distance(s1,s2) ≤ k.

  Algorithm 1: nestedLoopNaive(X,Y)
    foreach s1 ∈ X do
      foreach s2 ∈ Y do
        if distance(s1,s2) ≤ k
          output (s1,s2);

• The number of calls to the expensive distance function is |X| × |Y|.
• In many applications the result size of the similarity join is much smaller than the cross product.
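Algorithm 1 translates almost line for line into Python. The sketch below is only an illustration: the absolute-difference distance on integers and the call counter are assumptions introduced here to make the |X| × |Y| cost visible, and they stand in for an expensive distance such as sed.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

S = TypeVar("S")


def nested_loop_naive(
    xs: Iterable[S], ys: Iterable[S], distance: Callable[[S, S], float], k: float
) -> List[Tuple[S, S]]:
    """Return all pairs (s1, s2) from X x Y with distance(s1, s2) <= k."""
    ys = list(ys)  # materialize Y so it can be rescanned for every s1
    result = []
    for s1 in xs:
        for s2 in ys:
            if distance(s1, s2) <= k:
                result.append((s1, s2))
    return result


calls = 0


def counted_abs_diff(a: int, b: int) -> int:
    """Toy stand-in for an expensive distance; counts how often it is called."""
    global calls
    calls += 1
    return abs(a - b)


pairs = nested_loop_naive([1, 6, 11], [2, 5, 20], counted_abs_diff, k=1)
```

With three records on each side the distance function is called nine times, although only two pairs qualify.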

Similarity Join based on Edit Distance

[Figure: tables A and B]

Number of all record pairs for which the similarity is computed: |A| × |B| = 16

The result size of the similarity join is much smaller than the cross product.
Result size of the edit distance join (for k = 3) = 4

Many of the record pairs for which the similarity is computed are very different from each other. → Filters come into play.


Filters

Filters preprocess the input sets and produce a set of candidate pairs, which is a subset of the cross product, C ⊆ X × Y. The distance function is then evaluated only on the candidate pairs. If the filter condition can be evaluated faster than the distance function, the overall join will be faster.

X = {1, 6, 11}
Y = {2, 5, 20}
X × Y = { (1,2), (1,5), (1,20), (6,2), (6,5), (6,20), (11,2), (11,5), (11,20) }

Check: 2*x + 2*y < 10 (“distance” function)

Filters:

x > 10 → 2*x + 2*y > 10, C1 = X × Y - { (11,2), (11,5), (11,20) }
y > 10 → 2*x + 2*y > 10, C2 = X × Y - { (1,20), (6,20), (11,20) }
x + y > 10 → 2*x + 2*y > 10, C3 = X × Y - { (1,20), (6,5), (6,20), (11,2), (11,5), (11,20) }

Error Types of Filters

Filters place a pair (s1,s2) ∈ X × Y into the candidate set based on a fast guess. There are four possibilities for this guess:

• On true positives and true negatives the filter guess is correct.
• Candidates that do not qualify for the result set are false positives. Since the actual distance is computed on all pairs in the candidate set, false positives are removed.
• False positives increase the runtime, but do not affect the correctness.
• A false negative is a pair that should be in the result set, but does not make it into the candidate set and is thus missed.
• Ideally, a filter produces no false negatives (to guarantee the correctness of the result) and few false positives (to increase the efficiency).

[Figure: Result Pair vs. Candidate Pair]


Example

Filters preprocess the input sets and produce a set of candidate pairs, which is a subset of the cross product, C ⊆ X × Y.

X = {1, 6, 11}
Y = {2, 5, 20}
X × Y = { (1,2), (1,5), (1,20), (6,2), (6,5), (6,20), (11,2), (11,5), (11,20) }

Filters:

x > 10 → 2*x + 2*y > 10, C1 = X × Y - { (11,2), (11,5), (11,20) } (true negatives)
  = { (1,2), (1,5), (1,20), (6,2), (6,5), (6,20) } (true positive, false positives)

y > 10 → 2*x + 2*y > 10, C2 = X × Y - { (1,20), (6,20), (11,20) }
  = { (1,2), (1,5), (6,2), (6,5), (11,2), (11,5) }

x + y > 10 → 2*x + 2*y > 10, C3 = X × Y - { (1,20), (6,5), (6,20), (11,2), (11,5), (11,20) }
  = { (1,2), (1,5), (6,2) }
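The candidate sets and the error types can be checked mechanically. The short Python sketch below recomputes C1, C2 and C3 for the toy sets and classifies each candidate against the check 2*x + 2*y < 10. The variable names are chosen here for illustration only.

```python
from itertools import product

X = [1, 6, 11]
Y = [2, 5, 20]
cross = list(product(X, Y))


def qualifies(x: int, y: int) -> bool:
    """The toy 'distance' check from the slides."""
    return 2 * x + 2 * y < 10


# Pairs that belong to the join result.
result = {pair for pair in cross if qualifies(*pair)}

# Each filter is a cheap pruning condition: pairs satisfying it are dropped.
filters = {
    "C1": lambda x, y: x > 10,
    "C2": lambda x, y: y > 10,
    "C3": lambda x, y: x + y > 10,
}

candidates = {
    name: {pair for pair in cross if not prune(*pair)}
    for name, prune in filters.items()
}
false_positives = {name: cand - result for name, cand in candidates.items()}
false_negatives = {name: result - cand for name, cand in candidates.items()}
```

All three filters keep the only result pair (1,2), so none of them produces a false negative; C3 leaves the fewest false positives.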

Lower and upper bounds

• A lower bound function produces a value that is equal to or smaller than the distance value for any pair of input records.
• An upper bound function produces a value that is equal to or larger than the distance value.
• More formally, for any pair (s1,s2) of records, lowerBound(s1,s2) is a lower bound and upperBound(s1,s2) is an upper bound for the function distance iff
  - distance(s1,s2) ≥ lowerBound(s1,s2)
  - distance(s1,s2) ≤ upperBound(s1,s2)


Example

Filters preprocess the input sets and produce a set of candidate pairs, which is a subset of the cross product, C ⊆ X × Y.

X = {1, 6, 11}
Y = {2, 5, 20}
X × Y = { (1,2), (1,5), (1,20), (6,2), (6,5), (6,20), (11,2), (11,5), (11,20) }

Filters:

Lower Bounds: 2*x + 2*y ≥ x, 2*x + 2*y ≥ y, 2*x + 2*y ≥ x+y

x > 10 → 2*x + 2*y > 10, C1 = X × Y - { (11,2), (11,5), (11,20) } (true negatives)
y > 10 → 2*x + 2*y > 10, C2 = X × Y - { (1,20), (6,20), (11,20) }
x + y > 10 → 2*x + 2*y > 10, C3 = X × Y - { (1,20), (6,5), (6,20), (11,2), (11,5), (11,20) }

Upper Bounds: 2*x + 2*y ≤ 4*max(x,y)

4*max(x,y) < 10 → 2*x + 2*y < 10, C1 = X × Y - { (1,2) } (true positive)

Similarity Join with bounds

• Useful properties of lower and upper bounds:
  - upperBound(s1,s2) ≤ k → distance(s1,s2) ≤ k → (s1,s2) is in result set (true positive)
  - lowerBound(s1,s2) > k → distance(s1,s2) > k → (s1,s2) is not in result set (true negative)
• These properties are used to rephrase the nested loop similarity join.

  Algorithm 2: nestedLoopWithBounds(X,Y)
    foreach s1 ∈ X do
      foreach s2 ∈ Y do
        if upperBound(s1,s2) ≤ k
          output (s1,s2);
        else if lowerBound(s1,s2) > k
          /* nothing to do */
        else if distance(s1,s2) ≤ k
          output (s1,s2);

• Before distance(s1,s2) is computed, the lower and upper bounds are evaluated.
• If upperBound(s1,s2) ≤ k → (s1,s2) is in result set (true positive)
• If lowerBound(s1,s2) > k → (s1,s2) is not in result set (true negative)
• All other pairs are candidates and the distance function must be called to remove false positives. There are no false negatives.
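A Python sketch of Algorithm 2 follows, applied to the toy sets from the filter example. The bounds x + y and 4*max(x,y) come from the slides. The threshold k = 9 is an assumption that restates the check 2*x + 2*y < 10 as ≤ k on integers, and the recorded call list is added only to show how rarely the distance is evaluated.

```python
def nested_loop_with_bounds(xs, ys, distance, lower_bound, upper_bound, k):
    """Nested-loop similarity join that consults cheap bounds before the distance."""
    ys = list(ys)
    result = []
    for s1 in xs:
        for s2 in ys:
            if upper_bound(s1, s2) <= k:
                result.append((s1, s2))   # certain match, distance not needed
            elif lower_bound(s1, s2) > k:
                continue                  # certain non-match, distance not needed
            elif distance(s1, s2) <= k:
                result.append((s1, s2))   # candidate confirmed by the distance
    return result


distance_calls = []


def toy_distance(x, y):
    """The 'expensive' function of the toy example; records every call."""
    distance_calls.append((x, y))
    return 2 * x + 2 * y


pairs = nested_loop_with_bounds(
    [1, 6, 11],
    [2, 5, 20],
    distance=toy_distance,
    lower_bound=lambda x, y: x + y,
    upper_bound=lambda x, y: 4 * max(x, y),
    k=9,
)
```

Of the nine pairs, only (1,5) and (6,2) reach the distance function. The pair (1,2) is accepted by the upper bound alone and the remaining six are rejected by the lower bound.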


Developing bounds for distance functions

• Only bounding functions that are evaluated faster than the distance function are useful.

• Then the similarity join with upper and lower bounds is typically much faster than the naive nested loop join.

• The use of lower and upper bounds in the similarity join is very appealing since the join can be computed faster without sacrificing the correctness of the result.

• Upper and lower bounds have been developed for the edit distance between strings.

Length Filter

• The lengths of two strings s1 and s2 that are at edit distance k = sed(s1,s2) cannot differ by more than k, since at least abs(|s1|-|s2|) insertions or deletions are required to get strings of the same length.
  - sed(s1,s2) ≥ abs(|s1|-|s2|) = lowerBound(s1,s2)
• The length filter can be evaluated in constant time. Its effectiveness depends on the length distribution of the strings in the dataset.
• The filter is effective only for strings with different lengths. For datasets where many strings have similar lengths, the candidate set will contain many false positives.


Using Length Filter

[Figure: tables A and B annotated with the length of each name. A: 13, 16, 10, 13; B: 18, 11, 13, 13]

SELECT * FROM A, B
WHERE ABS(LENGTH(A.name) - LENGTH(B.name)) <= k   -- lowerBound > k → true negative
  AND sed(A.name, B.name) <= k

Number of all record pairs for which the similarity is computed (for k = 3): 12 < |A| × |B| = 16

The length filter will prune the pairs (J. R. R. Tolkien, C. S. Lewis), (Frodo Baggins, John R. R. Tolkien), (C.S. Lewis, John R. R. Tolkien), (Bilbo Baggins, John R. R. Tolkien).
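The length filter is a one-line lower bound. The Python sketch below applies it with k = 3 to the four pruned pairs named above. The helper names are hypothetical, and only names that appear in the text are used because the full tables are not reproduced here.

```python
def length_lower_bound(s1: str, s2: str) -> int:
    """Lower bound for sed(s1, s2): the difference of the string lengths."""
    return abs(len(s1) - len(s2))


def passes_length_filter(s1: str, s2: str, k: int) -> bool:
    """A pair stays a candidate only if the lower bound does not exceed k."""
    return length_lower_bound(s1, s2) <= k


k = 3
pruned_pairs = [
    ("J. R. R. Tolkien", "C. S. Lewis"),
    ("Frodo Baggins", "John R. R. Tolkien"),
    ("C.S. Lewis", "John R. R. Tolkien"),
    ("Bilbo Baggins", "John R. R. Tolkien"),
]
bounds = [length_lower_bound(a, b) for a, b in pruned_pairs]
still_candidate = passes_length_filter("J. R. R. Tolkien", "John R. R. Tolkien", k)
```

Every listed pair has a length difference above 3, so sed is never computed for it. A pair such as (J. R. R. Tolkien, John R. R. Tolkien), whose lengths differ by 2, passes the filter and is handed to sed.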

Problems with Edit Distance

• The edit distance counts the differences between strings, but does not take into account the string length. This can lead to undesired effects.
  - sed(International Business Machines Corporation, International Bussiness Machine_Corporation) = 2
  - sed(IBM, BMW) = 2
  - sed(Int. Business Machines Corp., International Business Machines Corporation) = 17
• For the above string pairs we cannot find a single distance threshold that distinguishes matches from non-matches.
• The value of the edit distance ranges between zero and the length of the longer string.
• Overall, the non-normalized edit distance is useful only for very small thresholds (e.g., when we allow a single typo) or when all involved strings are of similar length.


Normalized Edit Distance

• We normalize the distance between two strings s1 and s2 (with length |s1| and |s2|, respectively) to values between zero and one:
  - sed_norm(s1,s2) = sed(s1,s2) / max(|s1|,|s2|)
• This distance is called the normalized string edit distance. Intuitively, it is the percentage of characters that need to be changed to turn one string into the other.
• The normalized edit distance is zero if and only if the two strings are identical; the maximum value for very different strings is 1.
  - sed_norm(International Business Machines Corporation, International Bussiness Machine_Corporation) ≈ 0.047
  - sed_norm(IBM, BMW) ≈ 0.67
  - sed_norm(Int. Business Machines Corp., International Business Machines Corporation) ≈ 0.4
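The normalization is a single division once sed is available. The sketch below repeats a compact unit-cost edit distance so that it runs on its own. Returning 0.0 for two empty strings is a convention assumed here, since the slides do not cover that case.

```python
def sed(s1: str, s2: str) -> int:
    """Unit-cost string edit distance (row-by-row dynamic programming)."""
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        curr = [i]
        for j, b in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = curr
    return prev[-1]


def sed_norm(s1: str, s2: str) -> float:
    """Normalized edit distance: sed divided by the length of the longer string."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 0.0  # assumed convention: two empty strings are identical
    return sed(s1, s2) / longest


short_name = "Int. Business Machines Corp."
long_name = "International Business Machines Corporation"
abbreviation_score = sed_norm(short_name, long_name)
unrelated_score = sed_norm("IBM", "BMW")
```

With a threshold of, say, 0.5 the abbreviated company name would match while IBM and BMW would not, which no single unnormalized threshold achieves for these pairs.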

Need for other Distance Measures

• In some application scenarios even normalization does not help.
• For example, in the street matching scenario, the streets named Trienterstr and Triesterstr differ by only one character but are different streets.
• Another limitation of the edit distance is its handling of word transpositions.
  - For example, the strings James Wood and Wood James refer to the same person, but are at distance 10, which is the maximum for two strings of length 10. Normalization does not help here: the normalized edit distance is 1.0.


Similarity Measures for Sets: Overlap Similarity

• The overlap similarity between two sets S1 and S2 is the cardinality of their intersection: O(S1,S2) = |S1 ∩ S2|. It ranges from 0 (no overlap) to min(|S1|,|S2|) (one set is a subset of the other or the sets are identical).
• The overlap distance: H(S1,S2) = |S1| + |S2| - 2*|S1 ∩ S2|. It ranges from |S1| + |S2| (no common elements) to 0 (full overlap, S1 = S2).
  - |{Trienterstr} ∩ {Triesterstr}| = 0, O(S1,S2) = 0, H(S1,S2) = 1+1-2*0 = 2
  - |{James, Wood} ∩ {Wood, James}| = 2, O(S1,S2) = 2, H(S1,S2) = 2+2-2*2 = 0

Similarity Measures for Sets: Jaccard Similarity

• Normalizes overlap similarity between two sets S1 and S2: J(S1,S2) = |S1 ∩ S2| / |S1 ∪ S2|. It ranges from 0 (no overlap) to 1 (identical).
• The Jaccard distance: 1 - J(S1,S2)
  - |{Trienterstr} ∩ {Triesterstr}| = 0, J(S1,S2) = 0/2 = 0, 1 - J(S1,S2) = 1
  - |{James, Wood} ∩ {Wood, James}| = 2, J(S1,S2) = 2/2 = 1, 1 - J(S1,S2) = 0
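The set measures map directly onto Python set operations. The sketch below covers overlap, overlap distance and Jaccard. Splitting on whitespace to obtain word tokens and treating two empty sets as identical (J = 1) are assumptions made here for illustration.

```python
from typing import Set


def word_tokens(s: str) -> Set[str]:
    """Tokenize a string into the set of its whitespace-separated words."""
    return set(s.split())


def overlap(s1: Set[str], s2: Set[str]) -> int:
    """O(S1, S2) = |S1 ∩ S2|"""
    return len(s1 & s2)


def overlap_distance(s1: Set[str], s2: Set[str]) -> int:
    """H(S1, S2) = |S1| + |S2| - 2 * |S1 ∩ S2|"""
    return len(s1) + len(s2) - 2 * overlap(s1, s2)


def jaccard(s1: Set[str], s2: Set[str]) -> float:
    """J(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2|"""
    union = s1 | s2
    if not union:
        return 1.0  # assumed convention for two empty sets
    return len(s1 & s2) / len(union)


def jaccard_distance(s1: Set[str], s2: Set[str]) -> float:
    return 1.0 - jaccard(s1, s2)


names = (word_tokens("James Wood"), word_tokens("Wood James"))
streets = (word_tokens("Trienterstr"), word_tokens("Triesterstr"))
```

On word tokens the transposed name is a perfect match, while the two street names, which differ by one character, share no token at all.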


Converting Strings to Sets: String Tokens

• The tokens for a string s are typically substrings of s.
• In information retrieval scenarios, where a string is a document, the tokens may be individual words, phrases, or even sentences of the document.
  - “James Wood” → {James, Wood}
• For shorter strings (e.g., person names or addresses), overlapping tokens are used; q-grams are a widely used tokenization technique.
  - The q-grams of a string s are all contiguous substrings of length q.
  - The q-grams of a string are produced by shifting a window of length q over the string such that each character of the string appears in each window position.
  - The window positions that extend beyond the string border are filled with a dummy character #.
  - Example (q = 3): “James” → {“##J”, “#Ja”, “Jam”, “ame”, “mes”, “es#”, “s##”}
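The sliding window with padding can be sketched in a few lines of Python. The function name `qgrams` and its default parameters are illustrative choices. A list is returned so that repeated q-grams stay visible; wrap the result in `set(...)` to obtain a token set.

```python
from typing import List


def qgrams(s: str, q: int = 3, pad: str = "#") -> List[str]:
    """All q-grams of s, with q - 1 padding characters added on both sides."""
    if q < 1:
        raise ValueError("q must be at least 1")
    padded = pad * (q - 1) + s + pad * (q - 1)
    # Slide a window of length q over the padded string, one position at a time.
    return [padded[i : i + q] for i in range(len(padded) - q + 1)]


james_grams = qgrams("James")
```

A string of length n yields n + q - 1 q-grams, so “James” with q = 3 produces seven tokens.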

Set Similarity Join

• A join between two relations R and S with attributes (id, toks), where id is an identifier of some object (for example, a string) and toks is the set of tokens for the object identified by id (for example, the q-grams of a string).
• The join predicate is an overlap constraint on the toks attribute, which represents the similarity between objects.
  - E.g., find all pairs of objects which have at least 1 token in common.

R:
id   toks
r1   {a,b}
r2   {a,c}
r3   {d,e}

S:
id   toks
s1   {a,b}
s2   {f,g}
s3   {c,h}

|r1 ∩ s1| = 2   |r1 ∩ s2| = 0   |r1 ∩ s3| = 0
|r2 ∩ s1| = 1   |r2 ∩ s2| = 0   |r2 ∩ s3| = 1
|r3 ∩ s1| = 0   |r3 ∩ s2| = 0   |r3 ∩ s3| = 0


Set Similarity Join: Performance

• The set similarity join requires computing intersections between pairs of token sets and returns only the pairs that meet the overlap constraint.
• The straightforward strategy for evaluating it is a nested-loop join that computes n² intersections for two relations with n tuples each.
• Very slow when n grows large.
• Efficient strategies are based on the following core observations:
  - empty and very small intersections need not be computed;
  - some pairs of token sets can be pruned by inspecting only a small subset of the tokens.

Avoiding Empty Intersection: Token Equi-Joins

• Break up the sets (unnest) and express the set similarity join as an equality join on tokens, called token equality join.
• Equality joins have a long tradition in databases, and efficient techniques to evaluate them have been developed.

Input relations:

R:                S:
id   toks         id   toks
r1   {a,b}        s1   {a,b}
r2   {a,c}        s2   {f,g}
r3   {d,e}        s3   {c,h}

Step 1: Unnest

idr  tok          ids  tok
r1   a            s1   a
r1   b            s1   b
r2   a            s2   f
r2   c            s2   g
r3   d            s3   c
r3   e            s3   h

Step 2: Join

idr  ids  tok
r1   s1   a
r1   s1   b
r2   s1   a
r2   s3   c

Step 3: Group

idr  ids  COUNT(*)
r1   s1   2
r2   s1   1
r2   s3   1

|r1 ∩ s1| = 2   |r1 ∩ s2| = 0   |r1 ∩ s3| = 0
|r2 ∩ s1| = 1   |r2 ∩ s2| = 0   |r2 ∩ s3| = 1
|r3 ∩ s1| = 0   |r3 ∩ s2| = 0   |r3 ∩ s3| = 0
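The three steps (unnest, join, group) can be mimicked in Python with a hash index on tokens playing the role of the equality join. This is a sketch of the idea only; a database would express the same thing as a join on the token column followed by GROUP BY and COUNT(*). The function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple

R = {"r1": {"a", "b"}, "r2": {"a", "c"}, "r3": {"d", "e"}}
S = {"s1": {"a", "b"}, "s2": {"f", "g"}, "s3": {"c", "h"}}


def unnest(relation: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Step 1: one (id, tok) row per token."""
    return [(id_, tok) for id_, toks in relation.items() for tok in sorted(toks)]


def token_equality_join(
    r: Dict[str, Set[str]], s: Dict[str, Set[str]]
) -> Dict[Tuple[str, str], int]:
    """Steps 2 and 3: equality join on tok, then count rows per (idr, ids)."""
    index = defaultdict(list)  # tok -> ids of the S tuples containing it
    for ids, tok in unnest(s):
        index[tok].append(ids)

    counts = Counter()
    for idr, tok in unnest(r):
        for ids in index.get(tok, []):  # Step 2: matching rows only
            counts[(idr, ids)] += 1     # Step 3: group and count
    return dict(counts)


overlaps = token_equality_join(R, S)
```

Pairs with an empty intersection never appear in the output, because no joined row is ever produced for them.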


Token Equality Joins for Sets (cont.)

• The approach will not give a correct result for duplicate tokens, i.e., for token bags.
• Example: two tuples (r4, {a, a}) in R and (s4, {a, a}) in S.
• Then Step 2 of the token equality join produces four tuples (r4, s4, a) and Step 3 produces a tuple (r4, s4, 4).
• This is wrong since the intersection between the token bags r4 and s4 is 2, not 4.
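The overcount is easy to reproduce. In the Python sketch below, the first function counts joined rows exactly as the token equality join would, and the second computes the bag intersection with `collections.Counter`, where `&` takes the minimum multiplicity per token. Both function names are hypothetical, and the `Counter`-based variant illustrates the correct value rather than a fix proposed in the slides.

```python
from collections import Counter
from typing import List


def equality_join_count(r_toks: List[str], s_toks: List[str]) -> int:
    """Number of rows a token equality join produces for one pair of bags."""
    return sum(1 for a in r_toks for b in s_toks if a == b)


def bag_overlap(r_toks: List[str], s_toks: List[str]) -> int:
    """Size of the bag intersection: minimum multiplicity of each shared token."""
    common = Counter(r_toks) & Counter(s_toks)
    return sum(common.values())


r4 = ["a", "a"]
s4 = ["a", "a"]
joined_rows = equality_join_count(r4, s4)
true_overlap = bag_overlap(r4, s4)
```

For duplicate-free token sets the two functions agree, which is why the equality join is safe on sets but not on bags.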

Avoiding Small Intersection: Prefix Filter

• The overlap between two sets cannot reach a given threshold if a specific subset (called prefix) of the two sets has an empty intersection.
• Thus, the computation of the intersection between two sets can be avoided when the intersection of their prefixes is empty.

Prefix Filtering Principle. Given two sets A and B, a strict total order defined on the elements of both sets, and an integer threshold t, if |A ∩ B| ≥ t, then the (|A| - t + 1)-prefix Ap of A and the (|B| - t + 1)-prefix Bp of B with respect to the given order share at least one element, Ap ∩ Bp ≠ ∅.

Example: Find pairs of sets with overlap at least 4 and use the prefix filter to prune pairs of sets that cannot reach the threshold:

A = {a,c,f,g,m}, B = {b,c,f,g,m,p}, C = {a,h,g,m,p}, t = 4

The prefix lengths: |A|-t+1 = 2; |B|-t+1 = 3; |C|-t+1 = 2

A = {a,c,f,g,m}     (Ap = {a,c})
B = {b,c,f,g,m,p}   (Bp = {b,c,f})
C = {a,h,g,m,p}     (Cp = {a,h})

Using prefix filtering:
  - Ap ∩ Bp ≠ ∅ → (A,B) is a candidate, and indeed |A ∩ B| = 4 ≥ 4
  - Ap ∩ Cp ≠ ∅ → (A,C) is a candidate (false positive, |A ∩ C| = 3)
  - Bp ∩ Cp = ∅ → |B ∩ C| < 4, so (B,C) is pruned

The prefix filter condition is necessary but not sufficient (the filter may produce false positives).
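The prefix filter can be sketched in Python as follows. The element order `abcfhgmp` is an assumption chosen to reproduce the prefixes exactly as written in the example (h is listed before g in C); any strict total order works as long as the same order is used for all sets.

```python
from typing import Set

ORDER = "abcfhgmp"  # assumed global element order, matching the example's listing
RANK = {tok: pos for pos, tok in enumerate(ORDER)}


def prefix(tokens: Set[str], t: int) -> Set[str]:
    """The (|tokens| - t + 1)-prefix of a set under the global order."""
    ordered = sorted(tokens, key=lambda tok: RANK[tok])
    return set(ordered[: max(0, len(ordered) - t + 1)])


def is_candidate(a: Set[str], b: Set[str], t: int) -> bool:
    """Prefix filter: keep the pair only if the two prefixes share an element."""
    return not prefix(a, t).isdisjoint(prefix(b, t))


A = set("acfgm")
B = set("bcfgmp")
C = set("ahgmp")
t = 4

candidate_ab = is_candidate(A, B, t)
candidate_ac = is_candidate(A, C, t)
candidate_bc = is_candidate(B, C, t)
```

Only candidate pairs need their full intersection computed: (A,B) is confirmed with overlap 4, (A,C) is discarded after verification with overlap 3, and (B,C) is never verified at all.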


Summary

• Similarity joins allow us to perform data linkage based on similarity measures.

• Similarity measures may be costly to evaluate and they are not always able to capture non-obvious relationships between data items.

• More advanced data similarity and data linkage methods are required to make sense of large distributed repositories of related information.
