
Efficient Algorithms for Approximate Member Extraction Using Signature-based Inverted Lists

Jialong Han

Co-authored with Jiaheng Lu and Xiaofeng Meng

Renmin University of China


Introduction: An Example

A dictionary of strings we are interested in, e.g. product names, postal addresses…

We want to locate their “approximate appearances” in a series of documents. The following example shows what an “approximate appearance” means:


Problem Definition

Given a dictionary R and a threshold δ, extract all proper substrings m from the input documents S such that there exists r ∈ R with Similarity(r, m) ≥ δ (or Distance(r, m) ≤ k). We call r a piece of evidence for m.

Similarity() is a function measuring the similarity of two strings.

Strings are viewed as sets of tokens (words).

An example for Sim(): Jaccard similarity:

J(r, m) = wt(r ∩ m) / wt(r ∪ m)
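As a small illustration of the weighted Jaccard similarity, the sketch below treats each string as a set of whitespace-separated tokens. The token weights are borrowed from the camera example that appears later in the slides; the helper names `wt` and `jaccard` are chosen here for illustration only.

```python
# Token weights taken from the camera example used later in the slides.
WT = {"digital": 1, "camera": 1, "canon": 2, "nikon": 2, "slr": 2, "eos": 7, "5d": 9}

def wt(tokens):
    """Total weight of a set of tokens."""
    return sum(WT[t] for t in tokens)

def jaccard(r, m):
    """Weighted Jaccard similarity of two strings viewed as token sets."""
    r_set, m_set = set(r.split()), set(m.split())
    union = r_set | m_set
    return wt(r_set & m_set) / wt(union) if union else 0.0

r1 = "canon eos 5d digital camera"
m = "canon eos digital camera"
print(jaccard(r1, m))  # 11 / 20 = 0.55
```

Here the shared tokens weigh 11 and the union weighs 20, so the missing heavy token "5d" alone pulls the similarity down to 0.55.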


Outline

Introduction
State-of-the-art techniques
  The filtration-verification framework
  K-signature scheme
  Inverted Signature-based Hashtable
Our algorithms and evaluations
Conclusion


Why pre-pruning is needed

We need to spot evidence to decide whether a substring m should be extracted.

Simply verifying m against every dictionary string may be inefficient.

Pre-pruning followed by post-verification is beneficial.

But should pre-pruning be running-speed-oriented or filtering-power-oriented? Less time or fewer survivors?
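The two-stage idea can be sketched as follows. This is only a minimal illustration under stated assumptions, not the method of the talk: the cheap filter (sharing at least one token) is a stand-in chosen for simplicity, all token weights are assumed to be 1, and the names `windows`, `jaccard`, and `extract` are hypothetical.

```python
def windows(doc_tokens, max_len):
    """Yield every contiguous token window (candidate substring m) of up to max_len tokens."""
    n = len(doc_tokens)
    for i in range(n):
        for j in range(i + 1, min(i + max_len, n) + 1):
            yield doc_tokens[i:j]

def jaccard(a, b):
    """Unweighted Jaccard similarity of two non-empty token sets (all weights assumed 1)."""
    return len(a & b) / len(a | b)

def extract(doc, dictionary, delta, max_len=5):
    """Return (m, r) pairs with Jaccard(r, m) >= delta and the number of verifications run."""
    dict_sets = {r: set(r.split()) for r in dictionary}
    results, verifications = [], 0
    for w in windows(doc.split(), max_len):
        m_set = set(w)
        # Filtration: for delta > 0 a match needs at least one shared token,
        # so this cheap test never drops real evidence (stand-in filter).
        candidates = [r for r, r_set in dict_sets.items() if m_set & r_set]
        # Verification: the exact similarity test, run only on the survivors.
        for r in candidates:
            verifications += 1
            if jaccard(m_set, dict_sets[r]) >= delta:
                results.append((" ".join(w), r))
    return results, verifications

dictionary = ["canon eos 5d digital camera", "nikon digital slr camera", "canon slr camera"]
hits, n_verified = extract("new canon slr camera in stock", dictionary, delta=0.8)
print(hits, n_verified)
```

A stronger filter would cost more per window but leave fewer candidates for the verification loop, which is the trade-off the slide asks about.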


The issue of compromise comes again

Balance between the two stages should be reached:

More (less) filtration time → stronger (weaker) filtration power → fewer (more) candidates → less (more) verification time

Overall performance = Tf + Tv ?


K-signature scheme

Proposed by Chakrabarti et al. (SIGMOD 2008)

Choose several top-weighted tokens in a string as signatures to represent it: s => Sig(s)

Observation: if r cannot match m, r is likely to have insufficient signature overlap with m

K is a parameter for tuning filtration power

Potential evidence loss:
  A counter-example was found for k=3
  We could only prove that the scheme works for k=1 and k=∞


Inverted Signature-based Hashtable

Proposed by Chakrabarti et al. (SIGMOD 2008)

Each dictionary string is encoded into a solid 0-1 matrix, with a ‘1’ for each occurrence of a <token, sig-token> tuple (a ‘1’-rectangle)

Bitwise-OR all solid matrices to get the matrix of R

Observation: if m is an approximate member of R, the matrix of m must have enough intersections with that of R.

Formalized into an NP-complete problem; the resulting solution has filtering power that is too weak


Outline

Introduction
State-of-the-art techniques
Our algorithms and evaluations
  Corrected filtering conditions
  EvSCAN: Filtration by SIL
  EvITER: Incremental optimization on EvSCAN
  Supporting Dynamic Thresholds
Conclusion


Our proposed theorem

If Sim(m, r) ≥ δ, what do we have?

wt(Sig(m) ∩ Sig(r)) ≥ τ(m) (too strict!)

wt(Sig(m) ∩ Sig(r)) ≥ min{τ(m), τ(r)} (proved by us)

So the threshold does not remain constant, and it involves the unknown evidence r.

Our solution: use inverted lists to count sig-token overlaps. Note that sig-tokens usually have low document frequency (e.g. with IDF as weights).




Signature-based Inverted Lists

Lists are indexed by sig-tokens. Each sig-token of a string creates a node (containing the string’s id) in the corresponding list.

E.g. R = { r1 = “canon eos 5d digital camera”, r2 = “nikon digital slr camera”, r3 = “canon slr camera” }, wt(digital, camera, canon, nikon, slr, eos, 5d) = (1, 1, 2, 2, 2, 7, 9).

[Figure: one inverted list per sig-token, labeled with its weight: 5d (9.0), canon (2.0), camera (1.0), eos (7.0), nikon (2.0), slr (2.0). The list nodes hold the string ids 1, 2 and 3.]
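A minimal sketch of how such lists could be built is shown below. The signature sets in `SIG` are an assumption made for illustration, since the figure on the slide does not survive well enough to confirm which ids sit in which list; `build_sil` is a hypothetical helper name.

```python
from collections import defaultdict

# Token weights from the slide.
WT = {"digital": 1.0, "camera": 1.0, "canon": 2.0, "nikon": 2.0,
      "slr": 2.0, "eos": 7.0, "5d": 9.0}

# ASSUMPTION: signature sets picked by hand for illustration only.
SIG = {
    1: {"5d", "eos", "canon"},      # r1 = "canon eos 5d digital camera"
    2: {"nikon", "slr", "camera"},  # r2 = "nikon digital slr camera"
    3: {"canon", "slr"},            # r3 = "canon slr camera"
}

def build_sil(sig):
    """Map each sig-token to the list of ids of strings having it as a signature token."""
    lists = defaultdict(list)
    for rid in sorted(sig):           # visiting ids in order keeps each list sorted
        for token in sig[rid]:
            lists[token].append(rid)  # one node per (sig-token, rid) pair
    return dict(lists)

sil = build_sil(SIG)
for token in sorted(sil):
    print(f"{token} ({WT[token]}): {sil[token]}")
```

Tokens that are in no signature (here "digital") get no list at all, which keeps the index small.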


Filtration by SIL

Use an array called the “accumulator” to compute the overlapped sig weight wt(Sig(m)∩Sig(r)).

E.g. m = “canon eos digital camera”, δ = 0.8

[Figure: the inverted lists for 5d (9.0), canon (2.0), camera (1.0), eos (7.0), nikon (2.0) and slr (2.0) are scanned to fill the accumulator.]

rid                                1     2     3
wt(Sig(m)∩Sig(r)) (accumulator)    9.0   0     2.0
min{τ(m), τ(r)}                    6.8   3.8   3

r1 is qualified: 9.0 ≥ 6.8.
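The accumulator scan can be sketched as below. This is an illustration under assumptions rather than the authors' implementation: the inverted lists and Sig(m) = {canon, eos} are assumed, the thresholds are the min{τ(m), τ(r)} values printed on the slide, and `ev_scan` is a hypothetical function name.

```python
# Token weights from the slide.
WT = {"digital": 1.0, "camera": 1.0, "canon": 2.0, "nikon": 2.0,
      "slr": 2.0, "eos": 7.0, "5d": 9.0}

# ASSUMPTION: inverted lists (sig-token -> string ids) are illustrative.
SIL = {"5d": [1], "canon": [1, 3], "camera": [2],
       "eos": [1], "nikon": [2], "slr": [2, 3]}

def ev_scan(sig_m, thresholds):
    """Accumulate wt(Sig(m) ∩ Sig(r)) per rid; keep rids that reach their threshold."""
    acc = {rid: 0.0 for rid in thresholds}
    for token in sig_m:
        # Every node in the token's list shares this sig-token with m.
        for rid in SIL.get(token, []):
            acc[rid] += WT[token]
    survivors = [rid for rid in sorted(acc) if acc[rid] >= thresholds[rid]]
    return acc, survivors

# Sig(m) for m = "canon eos digital camera" is ASSUMED to be {canon, eos}.
acc, survivors = ev_scan({"canon", "eos"}, {1: 6.8, 2: 3.8, 3: 3.0})
print(acc, survivors)  # {1: 9.0, 2: 0.0, 3: 2.0} [1]
```

Only the lists of m's own sig-tokens are touched, so the cost depends on list lengths rather than on the size of the dictionary.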




EvITER: Progressive Computation

Recall that we are checking all substrings.

Some of them are quite similar, which means they share duplicate computation.

An intuition: if m has potential evidence r, then m·t (m extended by the next token t) is very likely to match r as well.

Formally, we proved the following. Let ES(m) be the set of “potential evidence” for m, and list[t] = { s | s is a dictionary string that contains token t }. Then ES(m·t) ⊆ ES(m) ∪ list[t].


Example

Document M: “…. cannon eos digital camera lens…”, with m = “cannon eos digital camera” and t = “lens”

ES(m) = {r1}

list[t]: the list of lens (3.0) holds r22 and r53

We know that only r1, r22 and r53 can possibly match “cannon eos digital camera lens”.
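The incremental step in the example above can be sketched as follows: when a substring is extended by one token t, only the old evidence set plus list[t] needs to be re-checked. The function name `extend_evidence` and the `passes_filter` callback are hypothetical, and the filter outcome used in the demo is made up purely to show the mechanics.

```python
def extend_evidence(es_m, list_t, passes_filter):
    """Return the potential evidence of the extended substring and how many ids were checked."""
    # The candidate pool is bounded by ES(m) ∪ list[t]; nothing outside it can qualify.
    pool = es_m | list_t
    es_mt = {rid for rid in pool if passes_filter(rid)}
    return es_mt, len(pool)

es_m = {1}            # ES(m) = {r1}, as in the example
list_lens = {22, 53}  # list[t] for t = "lens"

# HYPOTHETICAL filter outcome: pretend r53 fails the filtering condition.
es_mt, checked = extend_evidence(es_m, list_lens, lambda rid: rid != 53)
print(sorted(es_mt), checked)  # [1, 22] 3
```

Only three dictionary strings are examined here, however large the dictionary is.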


Flow of Evidence

EvITER for “Evidence ITERATION”

……




The Static Threshold Problem

How does this index work so far?
- “Get ready for δ=0.8 please.”
- “Please wait 30 min for index generation…”
- “Ready!”
- “Document M1, δ=0.8. Go!”
- “…Extraction complete.”
- “Document M2, and I want δ=0.9…”
- “Sorry, please wait another 30 min for index regeneration…”
- “:-(”


The Static Threshold Problem

This One Seems Better
- “Get ready for δ≥0.8 please.”
- “Please wait 30 min for index generation…”
- “Ready!”
- “Document M1, δ=0.8. Go!”
- “…Extraction complete.”
- “Document M2, and I want δ=0.9…”
- “…Extraction complete.”
- “:-)”


Supporting Dynamic Thresholds

An Observation

When δ descends, a string r’s tokens fall into Sig(r) one by one, in the order of their weight ranking. I.e. any node <sig-token, rid> is “active” when δ is below a certain “threshold” u<sig-token, rid>.

We record u<sig-token, rid> in each node and sort the nodes in each list in descending order of their u values.

For any given δ, we only need to retrieve a prefix of each list to get all “active” nodes.
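The prefix retrieval can be sketched with a binary search over one list. The u values below are invented for illustration, `active_prefix` is a hypothetical name, and treating a node as active only when δ is strictly below u is an assumption about the boundary case.

```python
from bisect import bisect_left

def active_prefix(nodes, delta):
    """nodes: list of (u, rid) sorted by descending u. Return the nodes with u > delta."""
    # bisect needs ascending keys, so search over the negated u values.
    keys = [-u for u, _ in nodes]
    # Every key before this index is < -delta, i.e. its node has u > delta.
    cut = bisect_left(keys, -delta)
    return nodes[:cut]

# HYPOTHETICAL list for one sig-token, already sorted by descending u.
nodes = [(0.95, 1), (0.9, 3), (0.85, 2), (0.8, 5)]

print(active_prefix(nodes, 0.9))  # [(0.95, 1)]
print(active_prefix(nodes, 0.8))  # [(0.95, 1), (0.9, 3), (0.85, 2)]
```

A real index would keep the sorted keys alongside each list instead of rebuilding them per query; the point here is only that one index built for the lowest supported δ serves every higher δ.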


Experimental Datasets

DBLP: 274,788 Paper titles 1,838,973 URLs


Balance should be reached

Recall our two stages of filtration and verification


Performance (DBLP)




Conclusion

Our method causes no false negatives.

Our method achieves a good balance between the two phases of filtration and verification.

We also propose EvITER to eliminate duplicate computation.

Our method is both effective and efficient.


References

[1] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918-929, 2006.

[2] K. Chakrabarti, S. Chaudhuri, V. Ganti, and D. Xin. An efficient filter for approximate membership checking. In SIGMOD Conference, 2008.

[3] A. Chandel, P. C. Nagesh, and S. Sarawagi. Efficient batch top-k search for dictionary-based entity recognition. In ICDE, page 28, 2006.

[4] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, page 5, 2006.

[5] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness.

[6] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, pages 491-500, 2001.

[7] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257-266, 2008.

[8] C. Li, B. Wang, and X. Yang. VGRAM: Improving performance of approximate queries on string collections using variable-length grams. In VLDB, 2007.

[9] G. Navarro. A guided tour to approximate string matching. ACM Comput. Surv., 33(1):31-88, 2001.

[10] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD Conference, 2004.

[11] A. Singhal. Modern information retrieval: A brief overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 24(4):35-43, 2001.

[12] E. Sutinen and J. Tarhio. On using q-gram locations in approximate string matching. In ESA, pages 327-340, 1995.

[13] W. Wang, C. Xiao, X. Lin, and C. Zhang. Efficient approximate entity extraction with edit distance constraints. In SIGMOD Conference, 2009.
