String Similarity Measures and Joins with Synonyms
Joint work with Chunbin Lin, Wei Wang, Chen Li, Haiyong Wang
Jiaheng Lu, Renmin University of China
Motivation
Example (String Measure)
S1 = “International Conference on Management of Data NY USA”
S2 = “SIGMOD 2013 New York United States”
No semantics: S1 and S2 share no tokens, so a purely textual measure treats them as dissimilar.
Example (String Measure)
S1 = “International Conference on Management of Data NY USA”
S2 = “SIGMOD 2013 New York United States”
Synonyms
SIGMOD -> International Conference on Management of Data
NY -> New York
USA -> United States
SIGMOD -> ACM's Special Interest Group on Management Of Data
How to use the existing synonyms?
Research Problem 1 (String Measurements)
Input: two strings s and t, and a set of synonyms R
Output: the maximal Jaccard similarity Jaccard(s,t,R) obtainable by using R
Problem 2 (String Similarity Join)
Input: two sets of strings S and T, a set of synonyms R, and a threshold value θ
Output: all similar pairs (s, t), with s in S and t in T, such that Jaccard(s,t,R) >= θ
An example of similarity join
Table S1
ID   String
q1   2013 ACM Intl Conf on Management of Data USA
q2   Very Large Data Bases Conf
q3   VLDB Conf
q4   ICDE 2013
Table S2
ID   String
s1   SIGMOD
s2   VLDB
Synonyms
SIGMOD -> International Conference on Management of Data
VLDB -> Very Large Data Bases
Existing works on approximate string match with synonyms
Transformation-based framework (JaccT) [1], compared with our method.
Machine learning method [2], a Hidden Markov Model-based measure. Depends on training data; not efficient.
[1] A. Arasu, S. Chaudhuri, and R. Kaushik. Transformation-based framework for record matching. In ICDE, pages 40–49, 2008.
[2] M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In KDD, pages 39–48, 2003.
Outline
Motivation & Problem Statement
String Similarity Measures
String Similarity Joins
Experimental Results
Conclusion
String Similarity Measures (Full-expansion)
S1 = “International Conference on Management of Data NY USA”
S2 = “SIGMOD 2013 New York United States”
Synonyms
SIGMOD -> ACM's Special Interest Group on Management Of Data
SIGMOD -> International Conference on Management of Data
NY -> New York
USA -> United States
Expanding using all synonyms
S1’ = “International Conference on Management of Data NY USA SIGMOD New York United States”
S2’ = “SIGMOD 2013 New York United States International Conference on Management of Data NY USA ACM's Special Interest Group on Management Of Data”
Jaccard(S1’,S2’) = 13/18 = 0.72
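The full-expansion computation above can be reproduced with a few lines of Python. This is a minimal sketch, assuming whitespace tokenization, case-insensitive matching, and that a rule applies in either direction whenever all tokens of one side occur in the string. The helper names (tokens, full_expand, jaccard) are illustrative and not taken from the slides.

```python
def tokens(s):
    """Lower-cased whitespace tokens of a string, as a set (assumed tokenization)."""
    return set(s.lower().split())


def full_expand(s, synonyms):
    """Add both sides of every rule that has one side fully contained in s."""
    toks = tokens(s)
    expanded = set(toks)
    for lhs, rhs in synonyms:
        l, r = tokens(lhs), tokens(rhs)
        # Applicability is checked against the original tokens only.
        if l <= toks or r <= toks:
            expanded |= l | r
    return expanded


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


synonyms = [
    ("SIGMOD", "ACM's Special Interest Group on Management Of Data"),
    ("SIGMOD", "International Conference on Management of Data"),
    ("NY", "New York"),
    ("USA", "United States"),
]
s1 = "International Conference on Management of Data NY USA"
s2 = "SIGMOD 2013 New York United States"

e1 = full_expand(s1, synonyms)
e2 = full_expand(s2, synonyms)
print(len(e1 & e2), len(e1 | e2), round(jaccard(e1, e2), 2))
```

Under these assumptions the expanded sets share 13 tokens out of 18, matching the slide.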
String Similarity Measures (Selective-expansion)
S1 = “International Conference on Management of Data NY USA”
S2 = “SIGMOD 2013 New York United States”
Synonyms
SIGMOD -> ACM's Special Interest Group on Management Of Data
SIGMOD -> International Conference on Management of Data
NY -> New York
USA -> United States
Expanding using only good synonyms
S1’ = “International Conference on Management of Data NY USA SIGMOD New York United States”
S2’ = “SIGMOD 2013 New York United States International Conference on Management of Data NY USA”
Jaccard(S1’,S2’) = 13/14 = 0.93
String Similarity Measures (Selective)
Selective-expansion is NP-hard (reduction from 3-SAT).
Greedy algorithm: choose synonyms that increase the current similarity, by computing the similarity gain.
Property: the greedy result is optimal in more than 70% of cases in practice.
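One way to picture the greedy idea is the Python sketch below. It is a simplification under stated assumptions, not the authors' exact algorithm: rules are matched on whitespace tokens (case-insensitive), they may be used in either direction, only rules applicable to the original strings are considered, and in each round the single expansion with the largest similarity gain is applied until no expansion improves the similarity. All function names are hypothetical.

```python
def tokens(s):
    """Lower-cased whitespace tokens of a string, as a set (assumed tokenization)."""
    return set(s.lower().split())


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def applicable(toks, synonyms):
    """Token sets that some rule could add to toks (rules used in both directions)."""
    out = []
    for lhs, rhs in synonyms:
        l, r = tokens(lhs), tokens(rhs)
        if l <= toks:
            out.append(frozenset(r))
        if r <= toks:
            out.append(frozenset(l))
    return out


def greedy_selective_jaccard(s, t, synonyms):
    a, b = tokens(s), tokens(t)
    # Each candidate is (side, tokens_to_add): side 0 expands s, side 1 expands t.
    cands = [(0, add) for add in applicable(a, synonyms)]
    cands += [(1, add) for add in applicable(b, synonyms)]
    current = jaccard(a, b)
    while cands:
        scored = []
        for i, (side, add) in enumerate(cands):
            na, nb = (a | add, b) if side == 0 else (a, b | add)
            scored.append((jaccard(na, nb), i))
        best_sim, best_i = max(scored, key=lambda p: p[0])
        if best_sim <= current:
            break  # no remaining expansion has a positive similarity gain
        side, add = cands.pop(best_i)
        if side == 0:
            a |= add
        else:
            b |= add
        current = best_sim
    return current


synonyms = [
    ("SIGMOD", "ACM's Special Interest Group on Management Of Data"),
    ("SIGMOD", "International Conference on Management of Data"),
    ("NY", "New York"),
    ("USA", "United States"),
]
s1 = "International Conference on Management of Data NY USA"
s2 = "SIGMOD 2013 New York United States"
sim = greedy_selective_jaccard(s1, s2, synonyms)
print(round(sim, 2))
```

On the running example this sketch skips the "ACM's Special Interest Group" rule, because applying it would lower the similarity, and ends at 13/14.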
Similarity Joins (Filtering and Verification)
Generate Signatures with full expansion
Filtering candidates
Prefix method
LSH method
Similarity Measures
Verify candidates
Selective expansion
Full expansion
String Similarity Joins (SN-Join)
Prefix method
Global ordering: {a b c d e f g h i j k l}
S1 = “c, k, e, a, f”   S2 = “d, b, f, e, k”
Threshold = 0.8
Order the strings
S1’ = “a, c, e, f, k”   S2’ = “b, d, e, f, k”
Get signatures
Sig(s1) = “a, c”   Sig(s2) = “b, d”
No overlap, so Jacc(s1,s2) < 0.8
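The prefix example can be checked with the short Python sketch below. It assumes the standard prefix-filtering length for Jaccard, |s| - ceil(threshold * |s|) + 1, which the slide does not state explicitly but which yields the same two-token signatures. Function and variable names are illustrative.

```python
import math


def prefix_signature(tokens, rank, threshold):
    """First |s| - ceil(threshold * |s|) + 1 tokens under the global ordering."""
    ordered = sorted(set(tokens), key=lambda tok: rank[tok])
    n = len(ordered)
    # The small epsilon guards against floating-point noise in threshold * n.
    k = n - math.ceil(threshold * n - 1e-9) + 1
    return ordered[:k]


# Global ordering from the slide: a < b < ... < l
rank = {tok: i for i, tok in enumerate("abcdefghijkl")}

s1 = ["c", "k", "e", "a", "f"]
s2 = ["d", "b", "f", "e", "k"]

sig1 = prefix_signature(s1, rank, 0.8)
sig2 = prefix_signature(s2, rank, 0.8)
is_candidate = bool(set(sig1) & set(sig2))
print(sig1, sig2, is_candidate)
```

Because the two signatures are disjoint, the pair is pruned without computing the full similarity (which is 3/7 here, well below 0.8).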
Signatures selection is important
How to select signatures to enhance the signature filtering power?
It is unrealistic to find a “one-size-fits-all” solution.
Estimation-based signatures selection
Three steps to select signatures:
• Generate multiple signature schemes for each data set.
• Given two tables for join, quickly estimate the filtering power of each scheme.
• Select the scheme with the best filtering power.
An example on estimator
ID   String                                         Signatures
q1   2013 ACM Intl Conf on Management of Data USA   ACM, International, Conference, on
q2   Very Large Data Bases Conf                     Conf, Conference
q3   VLDB Conf                                      Conf, Conference
q4   ICDE 2013                                      ICDE
Self-join, inverted lists on signature tokens:
ACM: q1
Conf: q2, q3
Conference: q1, q2, q3
International: q1
on: q1
ICDE: q4
Filtering results (candidates): (q2,q3), (q1,q2), (q1,q3)
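The candidate set in this example follows directly from the inverted lists: two strings become a candidate pair when they appear in the same list. A minimal Python sketch, with the signatures copied from the table and with hypothetical helper names:

```python
from collections import defaultdict
from itertools import combinations

signatures = {
    "q1": ["ACM", "International", "Conference", "on"],
    "q2": ["Conf", "Conference"],
    "q3": ["Conf", "Conference"],
    "q4": ["ICDE"],
}


def build_inverted_lists(signatures):
    """Map each signature token to the IDs of the strings that carry it."""
    inverted = defaultdict(list)
    for sid, sig in signatures.items():
        for tok in sig:
            inverted[tok].append(sid)
    return inverted


def self_join_candidates(inverted):
    """All ID pairs that share at least one signature token."""
    cands = set()
    for ids in inverted.values():
        cands.update(combinations(sorted(ids), 2))
    return cands


inverted = build_inverted_lists(signatures)
candidates = self_join_candidates(inverted)
print(sorted(candidates))
```

Enumerating the pairs exactly, as done here, is what the estimator is meant to avoid on large inputs; the sketch only shows what quantity is being estimated.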
Applying FM sketches on inverted lists
Self-join over the same strings and signatures (q1 to q4), with one inverted list per signature token:
ACM: q1
Conf: q2, q3
Conference: q1, q2, q3
International: q1
on: q1
ICDE: q4
Using a Flajolet-Martin (FM) sketch for each list
Filtering results (candidates): (q2,q3), (q1,q2), (q1,q3)
FM sketches (Flajolet and Martin, JCSS 1985)
• Estimates the number of distinct items in a multi-set of values from [0,…, M-1]
• Assume a hash function h(x) that maps incoming values x in [0,…, M-1] uniformly across [0,…, 2^L-1], where L = O(log M)
• Let lsb(y) denote the position of the least-significant 1 bit in the binary representation of y
  – A value x is mapped to lsb(h(x))
Example stream: 3 0 5 3 0 1 7 5 1 0 3 7
Number of distinct values: 5
x = 5, h(x) = 101100, lsb(h(x)) = 2
BITMAP
position: 5 4 3 2 1 0
bit:      0 0 0 1 0 0
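The bitmap construction can be sketched in Python as follows. This is an illustration under assumptions: the hash function (the first 32 bits of SHA-256) is a stand-in chosen for determinism, the estimate uses the usual 2^R / 0.77351 correction from the Flajolet-Martin analysis with a single bitmap, and all names are hypothetical. A single bitmap gives only a rough estimate, so no particular output value is claimed.

```python
import hashlib

L = 32  # number of hash bits (an assumed choice)


def h(x):
    """Stand-in hash mapping x to an integer in [0, 2^L - 1]."""
    digest = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(digest[:4], "big")


def lsb(y):
    """0-indexed position of the least-significant 1 bit of y."""
    if y == 0:
        return L - 1  # convention for the (very unlikely) all-zero hash
    return (y & -y).bit_length() - 1


def fm_bitmap(values):
    """Set bit lsb(h(x)) for every value; duplicates do not change the bitmap."""
    bitmap = 0
    for x in values:
        bitmap |= 1 << lsb(h(x))
    return bitmap


def fm_estimate(bitmap):
    """Estimate the distinct count from the position R of the lowest zero bit."""
    r = 0
    while bitmap & (1 << r):
        r += 1
    return 2 ** r / 0.77351


stream = [3, 0, 5, 3, 0, 1, 7, 5, 1, 0, 3, 7]
bitmap = fm_bitmap(stream)
print(format(bitmap, "b"), fm_estimate(bitmap))
```

Two properties make the sketch convenient for inverted lists: repeated values leave the bitmap unchanged, and the sketch of a union of lists is the bitwise OR of their bitmaps.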
Estimating the filtering power of a signature scheme
Constructing a two-dimensional hash sketch
Computing tighter upper and lower bounds of candidates size
String Similarity Filtering with Length Filter
Generate Signatures
Filtering candidates
Prefix method
LSH method
Similarity Measures
Verify candidates
Selective expansion
Compute lengths
Length filter
Full expansion
String Similarity Joins (SI-Join)
Length filtering
Strings
S1 = “a b c d e”
S2 = “x y z”
Synonyms
a -> f g h
x -> s
b -> k
Full-expansion
S1’ = “a b c d e f g h k”
S2’ = “x y z s”
Length range
s1: [5, 9]
s2: [3, 4]
Even in the best case (lengths 5 and 4) the similarity is at most 4/5 = 0.8, so Jacc(s1,s2,R) < 0.9
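The length filter in this example can be sketched in Python as below. It assumes that a string's length range runs from its original token count to its fully expanded token count, and it uses the general bound Jaccard(A, B) <= min(|A|, |B|) / max(|A|, |B|). How the paper's filter is implemented in detail is not shown on the slide, so the function names and structure here are illustrative only.

```python
def length_range(s, synonyms):
    """(min, max) token count: original string vs. full expansion."""
    toks = set(s.split())
    expanded = set(toks)
    for lhs, rhs in synonyms:
        # A rule lhs -> rhs applies when all lhs tokens occur in the string.
        if set(lhs.split()) <= toks:
            expanded |= set(rhs.split())
    return len(toks), len(expanded)


def length_filter_prunes(range1, range2, threshold):
    """True if no choice of expansions can reach the threshold."""
    (lo1, hi1), (lo2, hi2) = range1, range2
    if hi2 < lo1:
        best = hi2 / lo1  # second string is always shorter
    elif hi1 < lo2:
        best = hi1 / lo2  # first string is always shorter
    else:
        best = 1.0        # ranges overlap: equal lengths are possible
    return best < threshold


synonyms = [("a", "f g h"), ("x", "s"), ("b", "k")]
r1 = length_range("a b c d e", synonyms)
r2 = length_range("x y z", synonyms)
pruned = length_filter_prunes(r1, r2, 0.9)
print(r1, r2, pruned)
```

With ranges [5, 9] and [3, 4] the best achievable ratio is 4/5, so the pair is pruned at threshold 0.9 but would survive at threshold 0.8.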
String Similarity Joins (SI-Join)
Generate Signatures
Filtering candidates
Prefix/LSH method
Similarity Measures
Verify candidates
Selective expansion
Compute lengths
Length filter
Full expansion
String Similarity Joins (SI-tree)
Data sets and algorithms
• Compared method: JaccT [Arasu et al. ICDE 2008]
• Three datasets:
Data    # of strings   String len (avg/max)   # of synonyms   # of applied synonyms (avg/max)
USPS    1M             6.75/15                300             2.19/5
CONF    10K            5.84/14                1000            1.43/4
SPROT   1M             10.32/20               10K             37.78/104
Effectiveness of different similarity measurements
String Similarity Measures
Selective-expansion (SE) achieves the best effectiveness.
String Similarity Joins
Efficiency of algorithms
SI-Join achieves the best performance.
S: selective expansion
F: full expansion
Prefix scheme vs. LSH scheme
LSH is better
Prefix is better
Prefix vs. LSH
Estimation effectiveness
Conclusion and future work
String similarity measure with synonyms
Two new measures and a new join algorithm
One estimator for signature selection
Future work: how to deal with synonym ambiguity
E.g. UW = University of Washington or UW = University of Waterloo?