String Similarity Measures and Joins with Synonyms Joint work with Chunbin Lin, Wei Wang, Chen Li, Haiyong Wang Jiaheng Lu Renmin University of China


String Similarity Measures and Joins with Synonyms

Joint work with Chunbin Lin, Wei Wang, Chen Li, Haiyong Wang

Jiaheng Lu
Renmin University of China


Motivation Example (String Measure)

S1 = "International Conference on Management of Data NY USA"
S2 = "SIGMOD 2013 New York United States"

Existing string measures capture no semantics: S1 and S2 refer to the same thing but share no token.
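A minimal Python sketch of this motivation, assuming whitespace tokenization and case-insensitive matching (neither is stated on the slide): plain token-based Jaccard scores the two strings at zero, even though a human reads them as the same venue.

```python
def jaccard(s, t):
    """Token-based Jaccard similarity without any synonym knowledge."""
    a = set(s.lower().split())
    b = set(t.lower().split())
    if not (a | b):
        return 1.0  # convention chosen here for two empty strings
    return len(a & b) / len(a | b)


s1 = "International Conference on Management of Data NY USA"
s2 = "SIGMOD 2013 New York United States"
print(jaccard(s1, s2))  # 0.0, the strings have no token in common
```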


Example (String Measure)

S1 = "International Conference on Management of Data NY USA"
S2 = "SIGMOD 2013 New York United States"

Synonyms
SIGMOD -> International Conference on Management of Data
SIGMOD -> ACM's Special Interest Group on Management Of Data
NY -> New York
USA -> United States

How to use the existing synonyms?


Research Problem 1 (String Measures)

Input: two strings s and t, and a set of synonyms R.

Output: the maximal Jaccard similarity Jaccard(s,t,R) that can be obtained by using R.


Research Problem 2 (String Similarity Join)

Input: two sets of strings S and T, a set of synonyms R, and a threshold value.

Output: all similar pairs (s, t), with s in S and t in T, such that Jaccard(s,t,R) >= the threshold.


An example of similarity join

Table S1

ID   String
q1   2013 ACM Intl Conf on Management of Data USA
q2   Very Large Data Bases Conf
q3   VLDB Conf
q4   ICDE 2013

Table S2

ID   String
s1   SIGMOD
s2   VLDB

Synonyms
SIGMOD -> International Conference on Management of Data
VLDB -> Very Large Data Bases


Existing works on approximate string match with synonyms

Transformation-based framework (JaccT) [1]: the method we compare against.

Machine learning method [2]: a Hidden Markov Model-based measure. It depends on training data and is not efficient.

[1] A. Arasu, S. Chaudhuri, and R. Kaushik. Transformation-based framework for record matching. In ICDE, pages 40–49, 2008.
[2] M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In KDD, pages 39–48, 2003.



Outline

• Motivation & Problem Statement
• String Similarity Measures
• String Similarity Joins
• Experimental Results
• Conclusion


String Similarity Measures (Full-expansion)

S1 = "International Conference on Management of Data NY USA"
S2 = "SIGMOD 2013 New York United States"

Synonyms
SIGMOD -> ACM's Special Interest Group on Management Of Data
SIGMOD -> International Conference on Management of Data
NY -> New York
USA -> United States

Expanding using all synonyms:

S1' = "International Conference on Management of Data NY USA SIGMOD New York United States"
S2' = "SIGMOD 2013 New York United States International Conference on Management of Data NY USA ACM's Special Interest Group on Management Of Data"

Jaccard(S1', S2') = 13/18 = 0.72
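The full-expansion numbers can be reproduced with a short Python sketch. It assumes lowercase whitespace tokens and that a rule applies to a string whenever either side of the rule occurs in the original string as a contiguous token sequence; the helper names (`tokens`, `contains`, `full_expansion`) are illustrative and not taken from the authors' implementation.

```python
from fractions import Fraction

RULES = [
    ("SIGMOD", "ACM's Special Interest Group on Management Of Data"),
    ("SIGMOD", "International Conference on Management of Data"),
    ("NY", "New York"),
    ("USA", "United States"),
]


def tokens(text):
    return text.lower().split()


def contains(haystack, needle):
    """True if `needle` occurs in `haystack` as a contiguous token run."""
    n = len(needle)
    return any(haystack[i:i + n] == needle
               for i in range(len(haystack) - n + 1))


def full_expansion(text, rules):
    """Add the other side of every rule that matches the original string."""
    toks = tokens(text)
    expanded = set(toks)
    for lhs, rhs in rules:
        left, right = tokens(lhs), tokens(rhs)
        if contains(toks, left):
            expanded.update(right)
        if contains(toks, right):
            expanded.update(left)
    return expanded


def jaccard(a, b):
    return Fraction(len(a & b), len(a | b))


s1 = "International Conference on Management of Data NY USA"
s2 = "SIGMOD 2013 New York United States"
e1 = full_expansion(s1, RULES)
e2 = full_expansion(s2, RULES)
print(len(e1 & e2), len(e1 | e2))  # 13 18
```

Exact fractions are used so the result can be compared with the slide's 13/18 without rounding.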


String Similarity Measures (Selective-expansion)

S1 = "International Conference on Management of Data NY USA"
S2 = "SIGMOD 2013 New York United States"

Synonyms
SIGMOD -> ACM's Special Interest Group on Management Of Data
SIGMOD -> International Conference on Management of Data
NY -> New York
USA -> United States

Expanding using only good synonyms (here, the rule SIGMOD -> ACM's Special Interest Group on Management Of Data is skipped):

S1' = "International Conference on Management of Data NY USA SIGMOD New York United States"
S2' = "SIGMOD 2013 New York United States International Conference on Management of Data NY USA"

Jaccard(S1', S2') = 13/14 = 0.93


String Similarity Measures (Selective)

Selective-expansion is NP-hard (reduction from 3-SAT).

Greedy algorithm: choose synonyms that can increase the current similarity by computing the similarity gain.

Property: optimal in more than 70% of cases in practice.
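One way to picture a greedy strategy of this kind is the Python sketch below. It is a simplified illustration, not the authors' exact algorithm: it assumes lowercase whitespace tokens, treats each applicable rule as one candidate move, and at every step applies the move with the largest similarity, stopping when no move yields a positive gain. All helper names are hypothetical.

```python
from fractions import Fraction

RULES = [
    ("SIGMOD", "ACM's Special Interest Group on Management Of Data"),
    ("SIGMOD", "International Conference on Management of Data"),
    ("NY", "New York"),
    ("USA", "United States"),
]


def tokens(text):
    return text.lower().split()


def contains(haystack, needle):
    n = len(needle)
    return any(haystack[i:i + n] == needle
               for i in range(len(haystack) - n + 1))


def candidate_additions(text, rules):
    """Token sets that the applicable rules could add to `text`."""
    toks = tokens(text)
    out = []
    for lhs, rhs in rules:
        left, right = tokens(lhs), tokens(rhs)
        if contains(toks, left):
            out.append(frozenset(right))
        if contains(toks, right):
            out.append(frozenset(left))
    return out


def jaccard(a, b):
    return Fraction(len(a & b), len(a | b)) if (a | b) else Fraction(1)


def apply_move(a, b, move):
    side, added = move
    return (a | added, b) if side == "s" else (a, b | added)


def greedy_selective(s, t, rules):
    a, b = set(tokens(s)), set(tokens(t))
    moves = [("s", add) for add in candidate_additions(s, rules)]
    moves += [("t", add) for add in candidate_additions(t, rules)]
    current = jaccard(a, b)
    while moves:
        best = max(moves, key=lambda m: jaccard(*apply_move(a, b, m)))
        gain = jaccard(*apply_move(a, b, best)) - current
        if gain <= 0:  # no remaining rule increases the similarity
            break
        a, b = apply_move(a, b, best)
        current += gain
        moves.remove(best)
    return current


s1 = "International Conference on Management of Data NY USA"
s2 = "SIGMOD 2013 New York United States"
print(greedy_selective(s1, s2, RULES))  # 13/14
```

On this input the sketch never applies the "ACM's Special Interest Group" rule, because adding it would lower the similarity from 13/14 to 13/18.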


Similarity Joins (Filtering and Verification)

Filtering candidates: generate signatures with full expansion, using the prefix method or the LSH method.

Verify candidates: apply a similarity measure, either selective expansion or full expansion.


String Similarity Joins (SN-Join)

Prefix method

Global ordering: {a b c d e f g h i j k l}
Threshold = 0.8

S1 = "c, k, e, a, f"
S2 = "d, b, f, e, k"

Order the strings:
S1' = "a, c, e, f, k"
S2' = "b, d, e, f, k"

Get signatures:
Sig(s1) = "a, c"
Sig(s2) = "b, d"

No overlap, so Jacc(s1,s2) < 0.8.
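The prefix example can be sketched in Python as follows. The prefix length formula |s| - ceil(threshold * |s|) + 1 is the standard prefix-filter rule for Jaccard and is an assumption here, since the slide only shows the resulting signatures; the function name `prefix_signature` is illustrative.

```python
import math

GLOBAL_ORDER = "abcdefghijkl"
RANK = {tok: i for i, tok in enumerate(GLOBAL_ORDER)}


def prefix_signature(toks, threshold):
    """Sort tokens by the global order and keep the filtering prefix."""
    ordered = sorted(set(toks), key=RANK.__getitem__)
    n = len(ordered)
    # Small epsilon guards against floating-point noise in threshold * n.
    prefix_len = n - math.ceil(threshold * n - 1e-9) + 1
    return ordered[:prefix_len]


s1 = ["c", "k", "e", "a", "f"]
s2 = ["d", "b", "f", "e", "k"]
sig1 = prefix_signature(s1, 0.8)
sig2 = prefix_signature(s2, 0.8)
print(sig1, sig2)  # ['a', 'c'] ['b', 'd']

# Disjoint prefixes mean the pair cannot reach the threshold, so it is pruned.
is_candidate = bool(set(sig1) & set(sig2))
print(is_candidate)  # False
```

The true similarity of the pair is 3/7, which is indeed below 0.8, so pruning it loses nothing.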


Signature selection is important

How to select signatures to enhance the signature filtering power?

It is unrealistic to find a “one-size-fits-all” solution.


Estimation-based signature selection

Three steps to select signatures:

• Generate multiple signature schemes for each data set.

• Given two tables for join, quickly estimate the filtering power of each scheme.

• Select the scheme with the best filtering power.


An example on estimator

ID   String                                         Signatures
q1   2013 ACM Intl Conf on Management of Data USA   ACM, International, Conference, on
q2   Very Large Data Bases Conf                     Conf, Conference
q3   VLDB Conf                                      Conf, Conference
q4   ICDE 2013                                      ICDE

Inverted lists (self-join):
ACM: q1
Conf: q2, q3
Conference: q1, q2, q3
International: q1
on: q1
ICDE: q4

Filtering results (candidates): (q2,q3), (q1,q2), (q1,q3)
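A small Python sketch of how these candidates follow from the signatures: build one inverted list per signature token, then pair up the string IDs that share a list. The signatures are copied from the table; the function names are illustrative.

```python
from collections import defaultdict
from itertools import combinations

signatures = {
    "q1": ["ACM", "International", "Conference", "on"],
    "q2": ["Conf", "Conference"],
    "q3": ["Conf", "Conference"],
    "q4": ["ICDE"],
}


def inverted_lists(sigs):
    """Map each signature token to the IDs of the strings containing it."""
    index = defaultdict(list)
    for sid, sig in sigs.items():
        for token in sig:
            index[token].append(sid)
    return index


def candidates(index):
    """Self-join candidates: every pair of IDs that shares some list."""
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(set(ids)), 2))
    return pairs


index = inverted_lists(signatures)
pairs = candidates(index)
print(sorted(pairs))  # [('q1', 'q2'), ('q1', 'q3'), ('q2', 'q3')]
```

The number of pairs produced this way is the candidate size that an estimator tries to predict without enumerating the pairs.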


Applying FM sketches on inverted lists

ID   String                                         Signatures
q1   2013 ACM Intl Conf on Management of Data USA   ACM, International, Conference, on
q2   Very Large Data Bases Conf                     Conf, Conference
q3   VLDB Conf                                      Conf, Conference
q4   ICDE 2013                                      ICDE

Inverted lists (self-join):
ACM: q1
Conf: q2, q3
Conference: q1, q2, q3
International: q1
on: q1
ICDE: q4

Filtering results (candidates): (q2,q3), (q1,q2), (q1,q3)

Using a Flajolet-Martin (FM) sketch for each list


FM sketches (Flajolet and Martin, JCSS 1985)

• Estimates the number of distinct items in a multi-set of values from [0, …, M-1]

• Assume a hash function h(x) that maps incoming values x in [0, …, M-1] uniformly across [0, …, 2^L - 1], where L = O(log M)

• Let lsb(y) denote the position of the least-significant 1 bit in the binary representation of y
  – A value x is mapped to lsb(h(x))

Multi-set: 3 0 5 3 0 1 7 5 1 0 3 7
Number of distinct values: 5

x = 5, h(x) = 101100, lsb(h(x)) = 2

BITMAP position: 5 4 3 2 1 0
BITMAP value:    0 0 0 1 0 0
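A minimal Python sketch of a single FM bitmap, under stated assumptions: the hash function (the first 4 bytes of SHA-256) and the width L = 32 are choices made here for illustration, and 0.77351 is the correction factor from Flajolet and Martin's analysis. A single bitmap gives a very rough estimate; practical use averages many bitmaps.

```python
import hashlib

L = 32          # hash width in bits (an assumption for this sketch)
PHI = 0.77351   # correction factor from the Flajolet-Martin analysis


def h(x):
    """Hash a value to an L-bit integer (illustrative choice of hash)."""
    digest = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(digest[:4], "big")


def lsb(y):
    """Position of the least-significant 1 bit of y; L if y is zero."""
    return (y & -y).bit_length() - 1 if y else L


def fm_bitmap(values):
    """Set bit lsb(h(x)) for every value; duplicates set the same bit."""
    bitmap = 0
    for x in values:
        bitmap |= 1 << lsb(h(x))
    return bitmap


def estimate(bitmap):
    """Estimate the distinct count from the position of the lowest 0 bit."""
    r = 0
    while (bitmap >> r) & 1:
        r += 1
    return 2 ** r / PHI


stream = [3, 0, 5, 3, 0, 1, 7, 5, 1, 0, 3, 7]
print(lsb(0b101100))                # 2, as in the slide's example
print(estimate(fm_bitmap(stream)))  # rough estimate of the 5 distinct values
```

Two properties make the sketch attractive for inverted lists: repeated values do not change the bitmap, and the bitmap of a union of lists is the bitwise OR of their bitmaps.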


Estimating the filtering power of a signature scheme

Constructing a two-dimensional hash sketch

Computing tighter upper and lower bounds of candidates size


String Similarity Filtering with Length Filter

Generate signatures and compute lengths.

Filtering candidates: prefix method, LSH method, and length filter.

Verify candidates: similarity measures (selective expansion, full expansion).


String Similarity Joins (SI-Join)

Length filtering

Strings
S1 = "a b c d e"
S2 = "x y z"

Synonyms
a -> f g h
x -> s
b -> k

Full-expansion
S1' = "a b c d e f g h k"
S2' = "x y z s"

Length range
s1: [5, 9]
s2: [3, 4]

Result: Jacc(s1,s2,R) < 0.9
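The length-filter example can be checked with a short Python sketch. It assumes that any expansion of a string has a size between its original token count and its full-expansion token count, and it uses the general fact that Jaccard(A, B) is at most min(|A|, |B|) / max(|A|, |B|); the function names are illustrative, and rule matching is simplified to single-token left-hand sides as in this example.

```python
RULES = [("a", "f g h"), ("x", "s"), ("b", "k")]


def length_range(text, rules):
    """Return (original size, full-expansion size) in distinct tokens."""
    toks = text.split()
    expanded = set(toks)
    for lhs, rhs in rules:
        if lhs in toks:  # single-token left-hand sides only
            expanded.update(rhs.split())
    return len(set(toks)), len(expanded)


def may_qualify(range_s, range_t, threshold):
    """False if no pair of sizes in the two ranges can reach the threshold."""
    (lo_s, hi_s), (lo_t, hi_t) = range_s, range_t
    if hi_s < lo_t:
        bound = hi_s / lo_t
    elif hi_t < lo_s:
        bound = hi_t / lo_s
    else:
        bound = 1.0  # ranges overlap, so equal sizes are possible
    return bound >= threshold


r1 = length_range("a b c d e", RULES)
r2 = length_range("x y z", RULES)
print(r1, r2)                     # (5, 9) (3, 4)
print(may_qualify(r1, r2, 0.9))   # False: the bound is 4/5 = 0.8 < 0.9
```

Because the best case is a size-4 set against a size-5 set, the pair is pruned without computing any expansion-based similarity.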


String Similarity Joins (SI-Join)

Generate signatures and compute lengths.

Filtering candidates: prefix/LSH method and length filter.

Verify candidates: similarity measures (selective expansion, full expansion).


String Similarity Joins (SI-tree)


Data sets and algorithms

• Compared method: JaccT [Arasu et al. ICDE 2008]
• Three datasets:

Data    # of strings   String len (avg/max)   # of synonyms   # of applied synonyms (avg/max)
USPS    1M             6.75/15                300             2.19/5
CONF    10K            5.84/14                1000            1.43/4
SPROT   1M             10.32/20               10K             37.78/104


Effectiveness of different similarity measurements

String Similarity Measures

Selective-expansion (SE) achieves the best effectiveness.


String Similarity Joins

Efficiency of algorithms

SI-Join achieves the best performance.

S: selective expansion

F: full expansion


Prefix scheme vs. LSH scheme

[Figure: prefix vs. LSH, with one region annotated "LSH is better" and another annotated "Prefix is better"]


Estimation effectiveness


Conclusion and future work

String similarity measure with synonyms

Two new measures and a new join algorithm

One estimator for signature selection

Future work: how to deal with synonym ambiguity

E.g. UW = University of Washington or UW = University of Waterloo?
