Efficient Algorithms for Approximate Member Extraction Using Signature-based Inverted Lists

Jialong Han

Co-authored with Jiaheng Lu, Xiaofeng Meng Renmin University of China

Introduction: An Example

A dictionary of strings we are interested in
  E.g. product names, postal addresses…

We are going to locate their “approximate occurrences” in a series of documents.
  See the meaning of “approximate occurrence” in the following example:

Problem Definition

Given a dictionary R and a threshold δ, extract all proper substrings m from input documents S such that there exists r ∈ R with Similarity(r, m) ≥ δ (or Distance(r, m) ≤ k).
  Here we call r a piece of evidence for m.
  Similarity() is a function measuring the similarity of two strings
  Strings are viewed as sets of tokens (words)
  An example for Sim(): Jaccard similarity:

    J(r, m) = wt(r ∩ m) / wt(r ∪ m)
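
A minimal sketch of the weighted Jaccard similarity defined above, with strings treated as sets of whitespace-separated tokens. The token weights are borrowed from the dictionary example used later in the talk; the function names `wt` and `jaccard` are illustrative and not taken from the authors' code.

```python
# Token weights, taken from the camera example later in the slides.
WEIGHT = {"digital": 1.0, "camera": 1.0, "canon": 2.0, "nikon": 2.0,
          "slr": 2.0, "eos": 7.0, "5d": 9.0}


def wt(tokens):
    """Total weight of a set of tokens."""
    return sum(WEIGHT[t] for t in tokens)


def jaccard(r, m):
    """Weighted Jaccard similarity J(r, m) = wt(r ∩ m) / wt(r ∪ m)."""
    r_tokens, m_tokens = set(r.split()), set(m.split())
    union = r_tokens | m_tokens
    if not union:
        return 0.0
    return wt(r_tokens & m_tokens) / wt(union)


# Shared tokens weigh 2 + 7 + 1 + 1 = 11, the union weighs 20.
similarity = jaccard("canon eos 5d digital camera", "canon eos digital camera")
print(similarity)  # 11 / 20 = 0.55
```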

Outline

Introduction
State-of-the-art techniques
  The filtration-verification framework
  K-signature scheme
  Inverted Signature-based Hashtable
Our algorithms and evaluations
Conclusion

Why pre-pruning is needed

We need to spot evidence to decide whether a substring m should be extracted
  Simple verification against all dictionary strings may be inefficient
  Pre-pruning followed by post-verification is beneficial
But should it be running-speed-oriented or filtering-power-oriented?
  Less time or fewer survivors?

The issue of compromise comes again

Balance between the two stages should be reached:

  More (less) filtration time → strong (weak) filtration power → fewer (more) candidates → less (more) verification time

Overall performance = Tf + Tv ?
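
The two stages can be sketched as a generic filter-then-verify loop. This is an illustration only: the filter below (keep dictionary strings that share at least one token with m) is a deliberately cheap, weak filter chosen as an assumption for the sketch, not one of the schemes discussed in this talk. It is lossless for δ > 0 with positive weights, because a non-zero Jaccard similarity requires a non-empty intersection. The dictionary and weights come from the camera example later in the slides; all function names are hypothetical.

```python
WEIGHT = {"digital": 1.0, "camera": 1.0, "canon": 2.0, "nikon": 2.0,
          "slr": 2.0, "eos": 7.0, "5d": 9.0}
DICTIONARY = {1: "canon eos 5d digital camera",
              2: "nikon digital slr camera",
              3: "canon slr camera"}


def jaccard(a, b):
    """Weighted Jaccard similarity of two token sets."""
    union = a | b
    if not union:
        return 0.0
    return sum(WEIGHT[t] for t in a & b) / sum(WEIGHT[t] for t in union)


def filter_candidates(m_tokens):
    """Filtration stage (cost Tf): cheap test, may keep false positives."""
    return [rid for rid, r in DICTIONARY.items() if m_tokens & set(r.split())]


def verify(m_tokens, candidates, delta):
    """Verification stage (cost Tv): exact similarity on survivors only."""
    return [rid for rid in candidates
            if jaccard(set(DICTIONARY[rid].split()), m_tokens) >= delta]


def extract_evidence(m, delta):
    m_tokens = set(m.split())
    candidates = filter_candidates(m_tokens)
    return candidates, verify(m_tokens, candidates, delta)


candidates, evidence = extract_evidence("nikon slr", 0.6)
print(candidates)  # r1 shares no token with m and is pruned
print(evidence)    # only r2 reaches similarity 4 / 6
```

A stronger filter would shrink `candidates` (less Tv) at the price of a more expensive `filter_candidates` (more Tf), which is the compromise this slide refers to.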

K-signature scheme

Proposed by Chakrabarti et al. (SIGMOD 2008)
Choose several top-weighted tokens in a string as signatures to represent it: s => Sig(s)
Observation: if r cannot match m, r is likely to have insufficient signature overlap with m
K is a parameter for tuning filtration power

Potential evidence loss
  A counter-example was found for k=3
  We tried and could only prove that it works for k=1 and k=∞

Inverted Signature-based Hashtable

Proposed by Chakrabarti et al. (SIGMOD 2008)
Each dictionary string is encoded into a solid 0-1 matrix
  A ‘1’ for each occurrence of a <token, sig-token> tuple (a ‘1’-rectangle)
Bitwise-OR all solid matrices to get the matrix of R

Observation: if m is an approximate member of R, the matrix of m must have enough intersections with that of R.

Formalized into an NP-complete problem
  The resulting solution has too weak filtering power

Outline

Introduction
State-of-the-art techniques
Our algorithms and evaluations
  Corrected filtering conditions
  EvSCAN: Filtration by SIL
  EvITER: Incremental optimization on EvSCAN
  Supporting Dynamic Thresholds
Conclusion

If Sim(m, r) ≥ δ, what do we have?

  wt(Sig(m)∩Sig(r)) ≥ τ(m)    (Too strict!)
  wt(Sig(m)∩Sig(r)) ≥ min{τ(m), τ(r)}    (Our proposed theorem, proved by us)

So the threshold does not remain constant: it involves the unknown evidence r.

Our solution: use inverted lists to count sig-token overlaps.
  Note that sig-tokens usually have low document frequency (e.g. with IDF as weights)

Signature-based Inverted Lists

Lists indexed by sig-tokens
Each sig-token of a string creates a node (containing the string’s id) in the corresponding list.
E.g. R = { r1 = “canon eos 5d digital camera”, r2 = “nikon digital slr camera”, r3 = “canon slr camera” }.
  wt(digital, camera, canon, nikon, slr, eos, 5d) = (1, 1, 2, 2, 2, 7, 9).

[Figure: one inverted list per sig-token (token, weight), each node holding a string id]

  5d, 9.0      → 1
  canon, 2.0   → 1, 3
  camera, 1.0  → 2
  eos, 7.0     → 1
  nikon, 2.0   → 2
  slr, 2.0     → 2, 3
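
A minimal sketch of how such lists could be built, assuming the signature of each dictionary string is already known. The `SIGNATURES` table below is an assumption read off the slide's figure (the rule that selects Sig(r) is not spelled out here), and `build_sil` is a hypothetical name.

```python
from collections import defaultdict

# Assumed signatures of r1, r2, r3, as suggested by the figure.
SIGNATURES = {
    1: ["5d", "eos", "canon"],
    2: ["nikon", "slr", "camera"],
    3: ["canon", "slr"],
}


def build_sil(signatures):
    """Map each sig-token to the ids of the strings whose signature holds it."""
    sil = defaultdict(list)
    for rid in sorted(signatures):
        for token in signatures[rid]:
            sil[token].append(rid)  # one node per <sig-token, rid> pair
    return dict(sil)


sil = build_sil(SIGNATURES)
for token in sorted(sil):
    print(token, sil[token])
```

Tokens that appear in no signature (here, "digital") get no list at all, which keeps the index small.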

Filtration by SIL

Using an array called “accumulator” to compute the overlapped sig weight wt(Sig(m)∩Sig(r))
E.g. m = “canon eos digital camera”, δ = 0.8

[Figure: the signature-based inverted lists for R are scanned to fill the accumulator]

  rid                              1      2      3
  wt(Sig(m)∩Sig(r)) (accumulator)  9.0    0      2.0
  min{τ(m),τ(r)}                   6.8    3.8    3

  r1: Qualified!
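
The accumulator scan can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: Sig(m) = {eos, canon} is inferred from the accumulator values on the slide, the inverted lists are those of the camera example, and the per-string thresholds min{τ(m), τ(r)} are simply copied from the slide rather than computed.

```python
# Inverted lists and weights from the camera example.
SIL = {"5d": [1], "canon": [1, 3], "camera": [2],
       "eos": [1], "nikon": [2], "slr": [2, 3]}
WEIGHT = {"digital": 1.0, "camera": 1.0, "canon": 2.0, "nikon": 2.0,
          "slr": 2.0, "eos": 7.0, "5d": 9.0}

SIG_M = ["eos", "canon"]              # assumed signature of m
THRESHOLD = {1: 6.8, 2: 3.8, 3: 3.0}  # min{tau(m), tau(r)}, from the slide


def scan(sig_m, sil, weight):
    """Accumulate wt(Sig(m) ∩ Sig(r)) for every r reachable from Sig(m)."""
    accumulator = {}
    for token in sig_m:
        for rid in sil.get(token, []):
            accumulator[rid] = accumulator.get(rid, 0.0) + weight[token]
    return accumulator


accumulator = scan(SIG_M, SIL, WEIGHT)
qualified = sorted(rid for rid, w in accumulator.items() if w >= THRESHOLD[rid])
print(accumulator)  # r1 collects eos + canon, r3 only canon, r2 is never touched
print(qualified)    # only r1 reaches its threshold
```

Only the lists of the sig-tokens of m are touched, so strings with no signature overlap cost nothing.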

EvITER: Progressive Computation

Recall we are checking all substrings
  Some of them are quite similar, indicating that they share duplicate computation
An intuition: if m has potential evidence r, then m·t (m extended by one more token t) is very likely to match r
Formally, we proved that:
  Let ES(m) be the set of “potential evidence” for m, and list[t] = {s | s is a dictionary string that contains token t}
  We have ES(m·t) ⊆ ES(m) ∪ list[t]

Example

Document M: “… canon eos digital camera lens …”
  m = “canon eos digital camera”, t = “lens”

  ES(m) = {r1}
  list[t] (lens, 3.0) → 22, 53

We know that only r1, r22 and r53 can possibly match “canon eos digital camera lens”
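
The containment ES(m·t) ⊆ ES(m) ∪ list[t] suggests an incremental update: when a substring grows by one token, only the previous evidence set and one inverted list need to be re-checked. The sketch below uses the values of this example; `next_evidence_set` and the `passes_filter` callback (standing in for the filtering condition on the extended substring) are hypothetical names introduced for illustration.

```python
def next_evidence_set(es_m, t, lists, passes_filter):
    """Compute ES(m·t) from ES(m) and list[t] instead of scanning from scratch.

    es_m          -- set of string ids in ES(m)
    t             -- the token appended to m
    lists         -- mapping token -> ids of dictionary strings in list[token]
    passes_filter -- predicate telling whether an id is still potential
                     evidence for the extended substring (assumed given)
    """
    candidates = es_m | set(lists.get(t, ()))  # superset of ES(m·t)
    return {rid for rid in candidates if passes_filter(rid)}


LISTS = {"lens": [22, 53]}  # list[t] from the example
ES_M = {1}                  # ES(m) from the example

# With a filter that keeps everything, we get the full candidate superset.
superset = next_evidence_set(ES_M, "lens", LISTS, lambda rid: True)
print(sorted(superset))  # r1, r22 and r53 are the only possible matches
```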

Flow of Evidence

EvITER stands for “Evidence ITERation”

……

The Static Threshold Problem

How does this index work so far?
  - “Get ready for δ=0.8 please.”
  - “Please wait 30min for index generation…”
  - “Ready!”
  - “Document M1, δ=0.8. Go!”
  - “…Extraction complete.”
  - “Document M2, and I want δ=0.9…”
  - “Sorry, please wait another 30min for index regeneration…”
  - “:-(”

The Static Threshold Problem

This One Seems Better
  - “Get ready for δ≥0.8 please.”
  - “Please wait 30min for index generation…”
  - “Ready!”
  - “Document M1, δ=0.8. Go!”
  - “…Extraction complete.”
  - “Document M2, and I want δ=0.9…”
  - “…Extraction complete.”
  - “:-)”

Supporting Dynamic Thresholds

An Observation
  When δ descends, a string r’s tokens fall into Sig(r) one by one, in the order of their weight ranking.
  I.e. any node <sig-token, rid> is “active” when δ is below a certain “threshold” u<sig-token, rid>.

We record u<sig-token, rid> in each node and sort all nodes in each list in descending order of their u value.

For any given δ, we only need to retrieve a prefix of each list to get all “active” nodes
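
A minimal sketch of the prefix retrieval, assuming each node stores a pair (u, rid) and that a node is active exactly when δ < u. The class name `ThresholdList` and the u values in the example are made up for illustration; how u is derived from the weights is not shown here.

```python
from bisect import bisect_left


class ThresholdList:
    """One inverted list whose nodes are sorted by descending u value."""

    def __init__(self, nodes):
        # nodes: iterable of (u, rid) pairs
        self.nodes = sorted(nodes, key=lambda node: -node[0])
        # Negated u values are ascending, which is what bisect expects.
        self._neg_u = [-u for u, _ in self.nodes]

    def active(self, delta):
        """Return the rids of all nodes with u > delta (a prefix of the list)."""
        end = bisect_left(self._neg_u, -delta)
        return [rid for _, rid in self.nodes[:end]]


# Hypothetical u values for one sig-token's list.
lst = ThresholdList([(0.6, 2), (0.95, 1), (0.85, 3)])
print(lst.active(0.8))  # nodes with u = 0.95 and u = 0.85
print(lst.active(0.9))  # only the node with u = 0.95
```

Because the active nodes always form a prefix, one index serves every δ: the binary search locates the cut in logarithmic time and nothing has to be rebuilt when δ changes.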

Experimental Datasets

DBLP: 274,788 paper titles
1,838,973 URLs

Balance should be reached

Recall our two stages of filtration and verification

Performance (DBLP)

Conclusion

Our method causes no false negatives
Our method achieves a good balance between the two phases of filtration and verification
We also propose EvITER to eliminate duplicate computation
Our method is both effective and efficient

References

[1] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918–929, 2006.
[2] K. Chakrabarti, S. Chaudhuri, V. Ganti, and D. Xin. An efficient filter for approximate membership checking. In SIGMOD Conference, 2008.
[3] A. Chandel, P. C. Nagesh, and S. Sarawagi. Efficient batch top-k search for dictionary-based entity recognition. In ICDE, page 28, 2006.
[4] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, page 5, 2006.
[5] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness.
[6] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, pages 491–500, 2001.
[7] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257–266, 2008.
[8] C. Li, B. Wang, and X. Yang. VGRAM: Improving performance of approximate queries on string collections using variable-length grams. In VLDB, 2007.
[9] G. Navarro. A guided tour to approximate string matching. ACM Comput. Surv., 33(1):31–88, 2001.
[10] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD Conference, 2004.
[11] A. Singhal. Modern information retrieval: A brief overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 24(4):35–43, 2001.
[12] E. Sutinen and J. Tarhio. On using q-gram locations in approximate string matching. In ESA, pages 327–340, 1995.
[13] W. Wang, C. Xiao, X. Lin, and C. Zhang. Efficient approximate entity extraction with edit distance constraints. In SIGMOD Conference, 2009.