A Pivotal Prefix Based Filtering Algorithm for String Similarity Search

Dong Deng, Guoliang Li, Jianhua Feng. Database Group, Tsinghua University. Presented by Dong Deng.


Transcript of A Pivotal Prefix Based Filtering Algorithm for String Similarity Search

Page 1: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Dong Deng, Guoliang Li, Jianhua Feng

Database Group, Tsinghua University

Presented by Dong Deng

A Pivotal Prefix Based Filtering Algorithm for String Similarity Search

Page 2: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Search is Important

Source: http://www.internetlivestats.com/google-search-statistics/

Google Searches per Year

Page 3: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Speed Matters

Source:

Page 4: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Data is Dirty

• Typos, e.g. "Argyrios Zymnis" vs. "Argyris Zymnis"

• Typo in "title", e.g. "relaxed" vs. "related"

Example source: DBLP Complete Search

Page 5: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Similarity Search

Query

String Dataset

All the strings similar to the query

Page 6: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Edit Distance

• ED(r, s): The minimum number of edit operations (insertion/deletion/substitution) needed to transform r to s.

• For example: ED(sigcom, sigmod) = 2

sigcom → sigmom (substitute c with m) → sigmod (substitute m with d)
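The definition above can be checked with the textbook dynamic program for edit distance. This is a minimal sketch, not code from the talk; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(r, s):
    """Minimum number of insertions, deletions and substitutions turning r into s."""
    # prev[j] = distance between the current prefix of r and s[:j]
    prev = list(range(len(s) + 1))
    for i, cr in enumerate(r, start=1):
        curr = [i]  # turning r[:i] into the empty string takes i deletions
        for j, cs in enumerate(s, start=1):
            curr.append(min(prev[j] + 1,                # delete cr
                            curr[j - 1] + 1,            # insert cs
                            prev[j - 1] + (cr != cs)))  # substitute, or match for free
        prev = curr
    return prev[-1]


print(edit_distance("sigcom", "sigmod"))  # 2
```

The table has (|r|+1)·(|s|+1) cells, which is why running it against every data string is the expensive step that the later filters try to avoid.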

Page 7: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Problem Definition

Query string s = “yotubecom” and τ = 2

string dataset R

ED(s, r4) ≤ 2, so output r4 as a result

Page 8: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Application

• Spell Checking

• Copy Detection

• Entity Linking

• Bioinformatics

• …

Page 9: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Challenge

Naïve Method

Time complexity: for each query

Page 10: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Filter-and-Verification Framework

Inputs: Dataset R (Index), Threshold τ, Query string s

Filter: Signature(s) ∩ Signature(r) = ϕ? If no, r goes on to verification.

Verify: ED(r, s) ≤ τ? If yes, r is added to the Results.

Page 11: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Preliminary: q-gram

• q-gram: a substring of length q

2-grams of youtbecom: yo, ou, ut, tb, be, ec, co, om
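Extracting q-grams is a sliding window over the string. A short sketch; the helper name `qgrams` is an assumption made here.

```python
def qgrams(s, q):
    """All substrings of length q, in order of their start position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("youtbecom", 2))
# ['yo', 'ou', 'ut', 'tb', 'be', 'ec', 'co', 'om']
```

A string of length l has l − q + 1 q-grams, so the 9-character example yields 8 grams for q = 2.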

Page 12: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Preliminary: q-gram

• One edit operation destroys at most q grams.

• τ edit operations destroy at most qτ grams.

• If r and s have more than qτ mismatched grams, ED(r, s) > τ.

Example (q = 2): an edit at the b of youtbecom destroys the grams tb and be; yo, ou, ut, ec, co and om survive.

Page 13: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Preliminary: Prefix Filter

Sort all q-grams by a global ordering, such as idf.

q(r): The sorted q-gram set of string r, split into Pre(r) and suffix(r)

q(s): The sorted q-gram set of string s, with prefix Pre(s)

Pre(•) is the prefix of q(•)

|Pre(•)| = qτ+1

Prefix Filter: If pre(r) ∩ pre(s) = ϕ, then ED(r, s) > τ

Page 14: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Preliminary: Prefix Filter

Sort all q-grams by a global ordering, such as idf.

q(r): The sorted q-gram set of string r, split into Pre(r) and suffix(r)

q(s): The sorted q-gram set of string s, with prefix Pre(s)

Pre(•) is the prefix of q(•)

|Pre(•)| = qτ+1

Prefix Filter: If pre(r) ∩ pre(s) = ϕ, then ED(r, s) > τ

Example sorted q-gram sets:

g1 g2 g5 g6 g11 g12 g13

g3 g4 g7 g8 g9 g10 g12

>g10 >g10 >g10 >g10 >g10 >g10

suffix(r)
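The prefix filter can be sketched as follows. Assumptions made here, not taken from the slides: the global order is ascending gram frequency in the dataset with alphabetical tie-breaking, every string has at least qτ+1 q-grams, and all function names are illustrative.

```python
from collections import Counter


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def global_order(strings, q):
    """Sort key: rarer grams first (ascending frequency), ties broken alphabetically."""
    freq = Counter(g for r in strings for g in qgrams(r, q))
    return lambda g: (freq.get(g, 0), g)


def prefix(s, q, tau, key):
    """The first q*tau+1 grams of s under the global order, as a set."""
    return set(sorted(qgrams(s, q), key=key)[:q * tau + 1])


def prefix_filter_prunes(r, s, q, tau, key):
    """True when the prefixes are disjoint, which implies ED(r, s) > tau."""
    return prefix(r, q, tau, key).isdisjoint(prefix(s, q, tau, key))


data = ["youtbecom", "yotubecom", "aaaaaaaaa", "bbbbbbbbb"]
key = global_order(data, q=2)
print(prefix_filter_prunes("youtbecom", "yotubecom", 2, 2, key))  # False
print(prefix_filter_prunes("aaaaaaaaa", "bbbbbbbbb", 2, 2, key))  # True
```

A pair that is not pruned is only a candidate; it still has to be verified with the edit distance.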

Page 15: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Preliminary: disjoint q-gram

• One edit operation destroys at most 1 disjoint gram.

• τ edit operations destroy at most τ disjoint grams.

• If r and s have more than τ mismatched disjoint grams, ED(r, s) > τ.

yout ecom

e

yout

om

Page 16: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Pivotal Prefix Filter

Sort all q-grams by a global ordering, such as idf.

q(r): The sorted q-gram set of string r, split into Pre(r) and suffix(r)

q(s): The sorted q-gram set of string s, with prefix Pre(s)

Piv(•) is the pivotal prefix of q(•): |Piv(•)| = τ+1 and the q-grams in Piv(•) are disjoint

If piv(s) ∩ pre(r) = ϕ and piv(r) ∩ pre(s) = ϕ, then ED(r, s) > τ

Page 17: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Pivotal Prefix Filter

Sort all q-grams by a global ordering, such as idf.

q(r): The sorted q-gram set of string r, split into Pre(r) and suffix(r)

q(s): The sorted q-gram set of string s, with prefix Pre(s)

Piv(•) is the pivotal prefix of q(•): |Piv(•)| = τ+1 and the q-grams in Piv(•) are disjoint

last(r) and last(s) are the last q-grams of Pre(r) and Pre(s)

Pivotal Prefix Filter: If last(s) > last(r) and piv(r) ∩ pre(s) = ϕ, then ED(r, s) > τ

Example grams:

g5 g8 g10

g1 g3 g6 g9 g11 g13

>g10 >g10 >g10 >g10 >g10 >g10 >g10

Page 18: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Pivotal Prefix Filter

Sort all q-grams by a global ordering, such as idf.

q(r): The sorted q-gram set of string r, split into Pre(r) and suffix(r)

q(s): The sorted q-gram set of string s, with prefix Pre(s)

Piv(•) is the pivotal prefix of q(•): |Piv(•)| = τ+1 and the q-grams in Piv(•) are disjoint

last(r) and last(s) are the last q-grams of Pre(r) and Pre(s)

Pivotal Prefix Filter: If last(r) > last(s) and piv(s) ∩ pre(r) = ϕ, then ED(r, s) > τ

Example grams:

g1 g4 g6 g9 g12 g13

g3 g7 g10 g11

>g10 >g10 >g10 >g10 >g10 >g10 >g10

Page 19: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Pivotal Prefix Filter

If last(r) > last(s) and piv(s) ∩ pre(r) = ϕ, then ED(r, s) > τ

If last(s) > last(r) and piv(r) ∩ pre(s) = ϕ, then ED(r, s) > τ

• Existence: There must exist τ+1 disjoint grams in the prefix

• The Pivotal Prefix is a subset of the Prefix

– The pivotal prefix filter dominates the prefix filter

– Signature sizes are O(τ) and O(qτ) respectively
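A pairwise version of the pivotal prefix filter might look like the sketch below. Several things are assumptions made for illustration: the global order is passed in as a sort key (plain alphabetical order in the usage lines), pivotal grams are picked greedily by position instead of by the optimal selection discussed later, every string has at least qτ+1 q-grams, and the case last(r) = last(s) is handled together with last(s) > last(r).

```python
def sorted_grams(s, q, key):
    """Positional q-grams (gram, position), sorted by the global order, then position."""
    grams = [(s[i:i + q], i) for i in range(len(s) - q + 1)]
    return sorted(grams, key=lambda gp: (key(gp[0]), gp[1]))


def prefix(s, q, tau, key):
    return sorted_grams(s, q, key)[:q * tau + 1]


def pivotal(pre, q, tau):
    """Greedily pick tau+1 grams of the prefix whose positions do not overlap."""
    piv, next_free = [], 0
    for g, pos in sorted(pre, key=lambda gp: gp[1]):
        if pos >= next_free:
            piv.append((g, pos))
            next_free = pos + q
    return piv[:tau + 1]


def pivotal_filter_prunes(r, s, q, tau, key):
    """True when the pivotal prefix filter proves ED(r, s) > tau."""
    pre_r, pre_s = prefix(r, q, tau, key), prefix(s, q, tau, key)
    last_r, last_s = key(pre_r[-1][0]), key(pre_s[-1][0])
    if last_r > last_s:
        probe, target = pivotal(pre_s, q, tau), pre_r   # piv(s) against pre(r)
    else:
        probe, target = pivotal(pre_r, q, tau), pre_s   # piv(r) against pre(s)
    target_grams = {g for g, _ in target}
    return all(g not in target_grams for g, _ in probe)


alphabetical = lambda g: g  # stand-in global order, an assumption of this sketch
print(pivotal_filter_prunes("youtbecom", "yotubecom", 2, 2, alphabetical))  # False
print(pivotal_filter_prunes("aaaaaaaa", "bbbbbbbb", 2, 2, alphabetical))    # True
```

Only τ+1 grams are probed against a prefix of qτ+1 grams, which is where the O(τ) versus O(qτ) signature sizes come from.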

Page 20: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Related Work

Method                  |Sig(r)|   |Sig(s)|
Prefix Filter           O(qτ)      O(qτ)
Mismatch Filter         O(qτ)      O(qτ)
Qchunk Filter           O(τ)       O(l)
Pivotal Prefix Filter   O(τ)       O(qτ)

• Mismatch Filter [Xiao VLDB08]: Shortens the prefix length, but it is still O(qτ)

• Qchunk Filter [Qin SIGMOD11]: Shortens one signature to O(τ) but increases the other one to O(l)

• Adaptive Prefix [Wang SIGMOD12]

– Increases the prefix length to reduce the number of candidates

– Orthogonal and can be integrated into our method

• Flamingo [Li ICDE08]

– Based on the count filter; accelerates the counting process

– Orthogonal and can be integrated into our method

Page 21: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Pivotal Search Algorithm

• Indexing

– Build inverted indexes for both the prefix and the pivotal prefix of the data strings

• Querying

– Generate prefix and pivotal prefix for the query string

– Probe the prefix index with the pivotal prefix of the query

– Probe the pivotal prefix index with the prefix of the query

– Verify the candidates and output results
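The indexing and querying steps above can be sketched end to end as follows. This is a simplified illustration, not the authors' implementation: it takes the union of both probes as the candidate set, picks pivotal grams greedily, assumes every string has at least qτ+1 q-grams, and uses data strings reconstructed from the q-gram sets shown on the indexing slides (treat them as an assumption). All function names are hypothetical.

```python
from collections import Counter, defaultdict


def edit_distance(r, s):
    prev = list(range(len(s) + 1))
    for i, cr in enumerate(r, 1):
        curr = [i]
        for j, cs in enumerate(s, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (cr != cs)))
        prev = curr
    return prev[-1]


def signatures(s, q, tau, key):
    """Return (prefix grams, pivotal prefix grams) of s as two sets."""
    grams = sorted(((s[i:i + q], i) for i in range(len(s) - q + 1)),
                   key=lambda gp: (key(gp[0]), gp[1]))
    pre = grams[:q * tau + 1]
    assert len(pre) == q * tau + 1, "sketch assumes at least q*tau+1 grams per string"
    piv, next_free = [], 0
    for g, pos in sorted(pre, key=lambda gp: gp[1]):   # greedy disjoint pick by position
        if pos >= next_free and len(piv) <= tau:
            piv.append(g)
            next_free = pos + q
    return {g for g, _ in pre}, set(piv)


def build_index(data, q, tau):
    """Inverted indexes over the prefixes and the pivotal prefixes of the data strings."""
    freq = Counter(r[i:i + q] for r in data for i in range(len(r) - q + 1))
    key = lambda g: (freq.get(g, 0), g)                # ascending frequency order
    pre_index, piv_index = defaultdict(set), defaultdict(set)
    for rid, r in enumerate(data):
        pre, piv = signatures(r, q, tau, key)
        for g in pre:
            pre_index[g].add(rid)
        for g in piv:
            piv_index[g].add(rid)
    return key, pre_index, piv_index


def search(data, index, s, q, tau):
    key, pre_index, piv_index = index
    pre_s, piv_s = signatures(s, q, tau, key)
    candidates = set()
    for g in piv_s:                                    # pivotal prefix of s vs. prefix index
        candidates |= pre_index.get(g, set())
    for g in pre_s:                                    # prefix of s vs. pivotal prefix index
        candidates |= piv_index.get(g, set())
    return [data[rid] for rid in sorted(candidates)
            if edit_distance(data[rid], s) <= tau]     # verification


# Reconstructed from the q-gram sets q(r1)..q(r5) on the slides (an assumption).
DATA = ["imyouteca", "ubuntucom", "utubbecou", "youtbecom", "yoytubeca"]
index = build_index(DATA, q=2, tau=2)
print(search(DATA, index, "yotubecom", q=2, tau=2))
```

Taking the union of both probes is the weaker of the two filter statements on the slides; comparing last(pre(r)) with last(pre(s)) would let each data string be checked by only one of the two probes.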

Page 22: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Pivotal Prefix Selection

Evaluating Different Pivotal Prefixes: The longer the inverted lists we probe, the more candidates we may have.

For query string: min_{piv(s)} Σ_{g ∈ piv(s)} (length of inverted list of g)

For data string: min_{piv(r)} Σ_{g ∈ piv(r)} (frequency of g)

Page 23: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Optimal Pivotal Prefix Selection

Dynamic Programming:

Objective: Select m=τ+1 optimal pivotal q-grams from the first n=qτ+1 grams in the prefix

Case 1: Select the n-th q-gram as the last pivotal q-gram, and select m-1 optimal pivotal q-grams from the first n-1 q-grams in the prefix

Page 24: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Optimal Pivotal Prefix Selection

Dynamic Programming:

Case 2: Select the (n-1)-th q-gram as the last pivotal q-gram, and select m-1 optimal pivotal q-grams from the first n-2 q-grams

Page 25: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Optimal Pivotal Prefix Selection

Dynamic Programming:

Last case: Select the m-th q-gram as the last pivotal q-gram, and select m-1 optimal pivotal q-grams from the first m-1 q-grams

Recursive formula: f(m, n) = min_{1 ≤ k ≤ m} …

The weight is the length of the inverted list for the query string and the frequency for a data string.
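One way to realize this selection is the dynamic program sketched below. It is not necessarily the exact recurrence on the slide (whose body did not survive): here the prefix grams are sorted by start position, two grams count as disjoint when their start positions differ by at least q, and `best[i][j]` is the cheapest choice of i disjoint grams among the first j. The weights in the usage example are the gram frequencies from the global gram order table, used as a stand-in for inverted list lengths (an assumption), with weight 0 for ot because it does not occur in the table.

```python
def optimal_pivots(prefix, q, m):
    """Pick m pairwise-disjoint grams of minimum total weight.

    prefix: list of (gram, start_position, weight) with distinct positions.
    Returns (total_weight, grams); total_weight is inf if no selection exists.
    """
    grams = sorted(prefix, key=lambda t: t[1])            # order by start position
    n = len(grams)
    inf = float("inf")
    # compat[j]: how many earlier grams end before gram j starts
    compat = [sum(1 for _, p, _ in grams[:j] if p + q <= grams[j][1])
              for j in range(n)]
    # best[i][j]: cheapest way to pick i disjoint grams among the first j grams
    best = [[(0, ())] * (n + 1)] + [[(inf, ())] * (n + 1) for _ in range(m)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            g, _, w = grams[j - 1]
            prev_w, prev_picks = best[i - 1][compat[j - 1]]
            take = (prev_w + w, prev_picks + (g,))         # gram j is the last pick
            best[i][j] = min(best[i][j - 1], take)         # or skip gram j
    return best[m][n]


# pre(s) for s = yotubecom with q = 2, tau = 2: (gram, position, weight)
pre_s = [("yo", 0, 3), ("ot", 1, 0), ("ub", 3, 3), ("co", 6, 3), ("om", 7, 2)]
print(optimal_pivots(pre_s, q=2, m=3))  # (5, ('ot', 'ub', 'om'))
```

With these weights the result agrees with piv(s) = {ot, om, ub} shown on the querying slides. The table has m·n cells, each filled in constant time once `compat` is known.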

Page 26: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Filter-and-Verification Framework

Inputs: Dataset R (Index), Threshold τ, Query string s

Filter: Signature(s) ∩ Signature(r) = ϕ? If no, r goes on to verification.

Verify: Does r pass the alignment filter? If yes, is ED(r, s) ≤ τ? If yes, r is added to the Results.

Complexity Improvement: Improved from to

Page 27: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Alignment Filter

Intuition of Alignment Filter: suppose in the best case we need erri edit operations to transform to a substring of r, then

If

Page 28: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Alignment Filter

is the minimum edit distance between and any substring of r.

Substring edit distance (sed)

Alignment filter: If
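The substring edit distance can be computed with the same dynamic program as the edit distance, except that a match may start and end anywhere in r. The sketch below assumes that the arguments missing from the slide text are disjoint q-grams of the query (for example its pivotal prefix) and that the filter prunes r when the sum of their substring edit distances exceeds τ. That reading is an assumption, and the function names are illustrative.

```python
def sed(g, r):
    """Minimum edit distance between g and any substring of r (possibly empty)."""
    prev = [0] * (len(r) + 1)            # an alignment may start anywhere in r for free
    for i, cg in enumerate(g, 1):
        curr = [i]
        for j, cr in enumerate(r, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (cg != cr)))
        prev = curr
    return min(prev)                     # and it may end anywhere in r


def alignment_filter_prunes(disjoint_grams, r, tau):
    """True when the summed substring edit distances already exceed tau."""
    return sum(sed(g, r) for g in disjoint_grams) > tau


print(sed("abc", "xaxcx"))                                         # 1
print(alignment_filter_prunes(["ot", "be", "co"], "youtbecom", 2))  # False
print(alignment_filter_prunes(["ab", "cd", "ef"], "zzzzzzzzz", 2))  # True
```

Because the grams are disjoint, each edit operation can be charged to at most one of them, so the sum is a lower bound on ED(r, s).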

Page 29: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Alignment Filter

Accelerating Calculation:

• The computation complexity of sed(, r) is O().

• By position filter, can only align to a substring xi of r where |xi|<.

• Thus if , ED(r, s)

• The complexity reduced to

Complexity Improvement: Improved from to

Page 30: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Experiments

Settings:

• C++, g++ 4.8.2 with -O3 flags

• 64-bit Ubuntu Server 12.04 LTS

• Intel Xeon E5-2650 2.00GHz processor and 16GB memory

Page 31: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Evaluating Pivotal Prefix Filter: Average Search Time

Mismatch: From EDJoin

CrossFilter: Cross Filter

PivotalFilter: PivotalFilter

CrossSelect: CrossFilter + Pivotal Prefix Selection

PivotalSearch: PivotalFilter + Pivotal Prefix Selection

Page 32: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Evaluating Pivotal Prefix Filter: Candidate Number

Mismatch: From EDJoin

CrossFilter: Cross Filter

PivotalFilter: PivotalFilter

CrossSelect: CrossFilter + Pivotal Prefix Selection

PivotalSearch: PivotalFilter + Pivotal Prefix Selection

Page 33: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Evaluating Alignment Filter: Average Search Time

NoFilter: without any filter

ContentFilter: From EDJoin

AlignFilter: Alignment Filter

Page 34: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Evaluating Alignment Filter: Candidate Number

NoFilter: without any filter

ContentFilter: From EDJoin

AlignFilter: Alignment Filter

Real: Number of results

Page 35: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Comparison with the State of the Art

PivotalSearch: Our method

Adaptive: [Wang2012]

Flamingo: [Li2008]

Qchunk: [Qin 2011]

Page 36: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Scalability

Page 37: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Conclusion

• Pivotal prefix filter

• Pivotal search algorithm

• Optimal pivotal prefix selection

• Alignment filter

Page 38: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Thank You

Q & A

Project homepage: http://dbgroup.cs.tsinghua.edu.cn/dd/pivotal.html

Page 39: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Outline

• Problem Definition

• Pivotal Prefix Filter

• The Similarity Search Algorithm

• Alignment Filter

• Experiment

• Conclusion

Page 40: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Outline

• Motivation and Problem Definition

• Pivotal Prefix Filter

• The Similarity Search Algorithm

• Alignment Filter

• Experiment

• Conclusion

Page 44: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Complexity

• Space Complexity:

• Time Complexity:

Page 45: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Pivotal Prefix Selection

Evaluating Different Pivotal Prefixes: The longer the inverted lists we scan, the larger the filtering cost and the smaller the pruning power.

For query string: min_{piv(s)} Σ_{g ∈ piv(s)} |I⁺[g]|

For data string: min_{piv(r)} Σ_{g ∈ piv(r)} frequency[g]

Existence of Pivotal Prefix: There must exist at least τ+1 disjoint q-grams in the prefix pre(r) for any string r

Page 46: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Complexity

• Space Complexity:

– Prefix Inverted Index Size:

– Pivotal Prefix Inverted Index Size:

• Query Time Complexity:

– Preprocess Query s:

– Probing Inverted Indexes: where is the average length of probed prefix inverted lists

• Verification Complexity: where c is the number of candidates and l is the average string length

Page 48: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Preliminary: Prefix Filter

Sort all q-grams by a global ordering, such as idf.

q(r): The sorted q-gram set of string r, with prefix Pre(r)

q(s): The sorted q-gram set of string s, with prefix Pre(s)

Pre(•) is the prefix of q(•)

|Pre(•)| = qτ+1

Prefix Filter: If pre(r) ∩ pre(s) = ϕ, then ED(r, s) > τ

Example sorted q-gram sets:

g1 g2 g5 g6 g9 g10 g11

g3 g4 g7 g8 g11 g12 g13

>g10 >g10 >g10 >g10 >g10 >g10 >g10

Page 49: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Alignment Filter

Non-consecutive errors: youtubecom vs. yoytupecxm

q=3, the 3 non-consecutive errors destroy 8 q-grams

Consecutive errors: youtubecom vs. youtzpxcom

q=3, the 3 consecutive errors only destroy 5 q-grams
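The two counts can be reproduced by comparing the q-grams of the two strings position by position. This small sketch only handles the substitution-only case shown here, where both strings have the same length; the function name is an assumption.

```python
def destroyed_qgrams(original, edited, q):
    """Number of q-gram positions at which the two equal-length strings differ."""
    assert len(original) == len(edited), "sketch covers substitutions only"
    return sum(original[i:i + q] != edited[i:i + q]
               for i in range(len(original) - q + 1))


print(destroyed_qgrams("youtubecom", "yoytupecxm", 3))  # 8
print(destroyed_qgrams("youtubecom", "youtzpxcom", 3))  # 5
```

Adjacent errors fall inside the same q-grams, so they destroy fewer grams than the qτ worst case that the prefix length is sized for.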

Page 50: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Indexing

• Fix a global gram order

We use ascending gram frequency order (τ=2, q=2)

Global gram order

im my te bu un nt uc bb tb oy yt ca om yo ou ut ub co tu be ec

1 1 1 1 1 1 1 1 1 1 1 2 2 3 3 3 3 3 3 3 4

Page 51: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Indexing

• Build inverted indexes for prefix and pivotal prefix

Global gram order

im my te bu un nt uc bb tb oy yt ca om yo ou ut ub co tu be ec

1 1 1 1 1 1 1 1 1 1 1 2 2 3 3 3 3 3 3 3 4

Sort and Split String, Sort q-grams (τ=2, q=2)

In each set, the grams before "|" form pre(ri), and last(pre(ri)) is the gram just before "|".

q(r1): {im, my, te, ca, yo | ou, ut, ec}

q(r2): {bu, un, nt, uc, om | ub, co, tu}

q(r3): {bb, ou, ut, ub, co | tu, be, ec}

q(r4): {tb, om, yo, ou, ut | co, be, ec}

q(r5): {oy, yt, ca, yo, ub | tu, be, ec}

slt(ri)

pre(ri)

Piv(ri)

Page 52: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Indexing

• Build inverted indexes for prefix and pivotal prefix

q(r1): {im, my, te, ca, yo | ou, ut, ec}

q(r2): {bu, un, nt, uc, om | ub, co, tu}

q(r3): {bb, ou, ut, ub, co | tu, be, ec}

q(r4): {tb, om, yo, ou, ut | co, be, ec}

q(r5): {oy, yt, ca, yo, ub | tu, be, ec}

Figure: the Pivotal Prefix Index, built from Piv(ri), and the Prefix Index, built from pre(ri). Each inverted index maps a q-gram to entries <ri, position>, e.g. im → <r1,1>.

Page 53: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Querying

• Generate prefix and pivotal prefix for the query string

Global gram order

im my te bu un nt uc bb tb oy yt ca om yo ou ut ub co tu be ec

1 1 1 1 1 1 1 1 1 1 1 2 2 3 3 3 3 3 3 3 4

s: yotubecom

pre(s): {ot, om, yo, ub, co}

piv(s): {ot, om, ub}

last(pre(s)): co

Page 54: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Querying

• Probe the prefix index with the pivotal prefix of the query

• Probe the pivotal prefix index with the prefix of the query

s: yotubecom

pre(s): {ot, om, yo, ub, co}

piv(s): {ot, om, ub}

Figure: Preprocess, Probe, Probe. The query's pivotal prefix is looked up in the prefix index and its prefix is looked up in the pivotal prefix index; both inverted indexes map q-grams to entries <ri, position>.

Page 55: A Pivotal Prefix Based Filtering  Algorithm for  String Similarity Search

Querying

• Verify the candidates and output results

s: yotubecom

pre(s): {ot, om, yo, ub, co}

piv(s): {ot, om, ub}

Figure: Preprocess, Probe, Probe, Verify, over the same two inverted indexes.

Candidates: r3, r4, r5

Result after verification: r4