Efficient Algorithms for Approximate Member Extraction Using Signature-based Inverted Lists
Jialong Han
Co-authored with Jiaheng Lu, Xiaofeng Meng Renmin University of China
Jiaheng Lu, Jialong Han, Xiaofeng Meng
Introduction: An Example
A dictionary of strings we are interested in, e.g. product names, postal addresses…
We want to locate their “approximate appearances” in a series of documents. The following example shows what an “approximate appearance” means:
Problem Definition
Given a dictionary R and a threshold δ, extract all proper substrings m from input documents S such that there exists r ∈ R with Similarity(r, m) ≥ δ (or Distance(r, m) ≤ k).
Here we call r a piece of evidence for m.
Similarity() is a function measuring the similarity of two strings.
Strings are viewed as sets of tokens (words).
An example for Sim(): weighted Jaccard similarity, where wt(·) is the total token weight of a set:
J(r, m) = wt(r ∩ m) / wt(r ∪ m)
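As a small illustration of the similarity function, the sketch below computes the weighted Jaccard similarity of two strings viewed as token sets. The function names `wt` and `jaccard` are hypothetical, and the token weights are borrowed from the dictionary example that appears later in these slides; treat it as a sketch under those assumptions rather than the authors' code.

```python
def wt(tokens, weights):
    """Total weight of a set of tokens."""
    return sum(weights[t] for t in tokens)


def jaccard(r, m, weights):
    """Weighted Jaccard similarity of two whitespace-tokenized strings."""
    r_set, m_set = set(r.split()), set(m.split())
    union = r_set | m_set
    if not union:
        return 0.0
    return wt(r_set & m_set, weights) / wt(union, weights)


# Token weights taken from the slides' later example (an assumption here).
WEIGHTS = {"digital": 1, "camera": 1, "canon": 2, "nikon": 2,
           "slr": 2, "eos": 7, "5d": 9}

r1 = "canon eos 5d digital camera"
m = "canon eos digital camera"
score = jaccard(r1, m, WEIGHTS)  # shared weight 11 over union weight 20
```

Because the missing token “5d” carries a large weight, the two strings score well below a threshold such as δ = 0.8 even though they differ in only one word.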
Outline
Introduction
State-of-the-art techniques
  The filtration-verification framework
  K-signature scheme
  Inverted Signature-based Hashtable
Our algorithms and evaluations
Conclusion
Why pre-pruning is needed
We need to spot evidence to decide whether a substring m should be extracted
Simply verifying against all dictionary strings may be inefficient
Pre-pruning followed by post-verification is beneficial
But should it be running-speed-oriented or filtering-power-oriented? Less time or fewer survivors?
The issue of compromise comes again
A balance between the two stages should be reached:
More (less) filtration time → stronger (weaker) filtration power → fewer (more) candidates → less (more) verification time
Overall performance = Tf + Tv ?
K-signature scheme
Proposed by Chakrabarti et al. (SIGMOD 2008)
Choose several top-weighted tokens in a string as signatures to represent it: s => Sig(s)
Observation: if r cannot match m, r is likely to have insufficient signature overlap with m
K is a parameter for tuning the filtration power
Potential evidence loss
  A counter-example was found when k=3
  We tried and could only prove that it works for k=1 and k=∞
Inverted Signature-based Hashtable
Proposed by Chakrabarti et al. (SIGMOD 2008)
Each dictionary string is encoded into a solid 0-1 matrix
  A ‘1’ for each occurrence of a <token, sig-token> tuple (a ‘1’-rectangle)
Bitwise-OR all solid matrices to get the matrix of R
Observation: if m is an approximate member of R, the matrix of m must have enough intersections with that of R.
  Formalized into an NP-complete problem
  The solution results in filtering power that is too weak
Outline
Introduction
State-of-the-art techniques
Our algorithms and evaluations
  Corrected filtering conditions
  EvSCAN: Filtration by SIL
  EvITER: Incremental optimization on EvSCAN
  Supporting Dynamic Thresholds
Conclusion
Our proposed theorem
If Sim(m, r) ≥ δ, what do we have?
  wt(Sig(m)∩Sig(r)) ≥ τ(m) (too strict!)
  wt(Sig(m)∩Sig(r)) ≥ min{τ(m), τ(r)} (proved by us)
So the threshold is not constant: it involves the unknown evidence r.
Our solution: use inverted lists to count sig-token overlaps.
  Note that sig-tokens usually have low document frequency (e.g. with IDF as weights)
Signature-based Inverted Lists
Lists indexed by sig-tokens
Each sig-token of a string creates a node (containing the string’s id) in the corresponding list.
E.g. R = { r1 = “canon eos 5d digital camera”, r2 = “nikon digital slr camera”, r3 = “canon slr camera” }, with wt(digital, camera, canon, nikon, slr, eos, 5d) = (1, 1, 2, 2, 2, 7, 9).
Resulting lists (sig-token, weight: ids of the strings in the list):
5d, 9.0: 1
canon, 2.0: 1, 3
camera, 1.0: 2
eos, 7.0: 1
nikon, 2.0: 2
slr, 2.0: 2, 3
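The list construction described on this slide can be sketched as follows. The helper name `build_sil` is hypothetical, and the per-string signatures are an assumption read off the example lists on this slide, not computed from a signature rule.

```python
from collections import defaultdict


def build_sil(signatures):
    """Map each sig-token to the ascending ids of strings whose signature holds it."""
    lists = defaultdict(list)
    for rid in sorted(signatures):
        for token in signatures[rid]:
            lists[token].append(rid)
    return dict(lists)


# Assumed signatures for r1, r2, r3 (taken from the slide's example lists).
SIGNATURES = {
    1: {"5d", "eos", "canon"},
    2: {"nikon", "slr", "camera"},
    3: {"canon", "slr"},
}

sil = build_sil(SIGNATURES)
```

Tokens that are in no signature (here “digital”) get no list at all, which keeps the index small.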
Filtration by SIL
Using an array called “accumulator” to compute the overlapped sig weight wt(Sig(m)∩Sig(r))
E.g. m = “canon eos digital camera”, δ = 0.8
Lists (sig-token, weight: ids of the strings in the list):
5d, 9.0: 1
canon, 2.0: 1, 3
camera, 1.0: 2
eos, 7.0: 1
nikon, 2.0: 2
slr, 2.0: 2, 3

rid                              1      2      3
Accumulator: wt(Sig(m)∩Sig(r))   9.0    0      2.0
min{τ(m), τ(r)}                  6.8    3.8    3
r1: Qualified!
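The accumulator scan on this slide can be sketched as below. The function name `scan_with_accumulator` is hypothetical; the lists, the bounds min{τ(m), τ(r)}, and the signature Sig(m) = {canon, eos} are hard-coded assumptions chosen to match the slide's example, since the slides do not spell out how τ or Sig(m) are computed.

```python
from collections import defaultdict


def scan_with_accumulator(sig_m, sil, weights, min_tau):
    """Add up shared sig-token weight per dictionary string, then keep
    the strings whose accumulated weight reaches their bound."""
    acc = defaultdict(float)
    for token in sig_m:
        for rid in sil.get(token, []):
            acc[rid] += weights[token]
    survivors = sorted(rid for rid, bound in min_tau.items()
                       if acc.get(rid, 0.0) >= bound)
    return dict(acc), survivors


# Example data mirroring the slide (all values assumed from the figure).
SIL = {"5d": [1], "canon": [1, 3], "camera": [2],
       "eos": [1], "nikon": [2], "slr": [2, 3]}
WEIGHTS = {"5d": 9.0, "canon": 2.0, "camera": 1.0,
           "eos": 7.0, "nikon": 2.0, "slr": 2.0}
SIG_M = {"canon", "eos"}              # assumed signature of m
MIN_TAU = {1: 6.8, 2: 3.8, 3: 3.0}    # min{tau(m), tau(r)} from the slide

acc, survivors = scan_with_accumulator(SIG_M, SIL, WEIGHTS, MIN_TAU)
```

Only the lists of the sig-tokens of m are touched, so dictionary strings sharing no sig-token with m are never visited. Survivors are candidates only and still go through verification.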
EvITER: Progressive Computation
Recall that we are checking all substrings
  Some of them are quite similar, so they share duplicate computation
An intuition: if m has potential evidence r, then m·t (m extended by the next token t) is very likely to match r as well
Formally, we proved the following:
  Let ES(m) be the set of “potential evidence” for m, and list[t] = {s | s is a dictionary string that contains token t}
  Then ES(m·t) ⊆ ES(m) ∪ list[t]
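The containment above means that, when a substring is extended by one token, only a small pool of dictionary strings has to be re-examined. A minimal sketch, with hypothetical helper names and a caller-supplied `passes_filter` predicate standing in for the actual filtering condition:

```python
def candidate_pool(es_m, t, token_lists):
    """Upper bound on the potential evidence of m extended by t:
    ES(m) united with the strings that contain token t."""
    return set(es_m) | set(token_lists.get(t, ()))


def extend_evidence(es_m, t, token_lists, passes_filter):
    """Potential evidence of the extended substring, searched only in the pool."""
    return {rid for rid in candidate_pool(es_m, t, token_lists)
            if passes_filter(rid)}


# Values in the spirit of the slides' example: ES(m) = {r1}, list["lens"] = [22, 53].
ES_M = {1}
TOKEN_LISTS = {"lens": [22, 53]}

pool = candidate_pool(ES_M, "lens", TOKEN_LISTS)
# A made-up predicate, only to show the pool being narrowed down.
es_next = extend_evidence(ES_M, "lens", TOKEN_LISTS, lambda rid: rid != 53)
```

The point of the design is that the pool is built from the previous step's result instead of rescanning the whole index for every substring.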
Example
Document M: “… cannon eos digital camera lens …”, with m = “cannon eos digital camera” and t = “lens”
ES(m) = {r1}
list[t] (lens, 3.0): 22, 53
We know that only r1, r22 and r53 can possibly match “cannon eos digital camera lens”
Flow of Evidence
EvITER for “Evidence ITERATION”
The Static Threshold Problem
How does this index work so far?
- “Get ready for δ=0.8 please.”
- “Please wait 30min for index generation…”
- “Ready!”
- “Document M1, δ=0.8. Go!”
- “…Extraction complete.”
- “Document M2, and I want δ=0.9…”
- “Sorry, please wait another 30min for index regeneration…”
- “:-(”
The Static Threshold Problem
This One Seems Better
- “Get ready for δ≥0.8 please.”
- “Please wait 30min for index generation…”
- “Ready!”
- “Document M1, δ=0.8. Go!”
- “…Extraction complete.”
- “Document M2, and I want δ=0.9…”
- “…Extraction complete.”
- “:-)”
Supporting Dynamic Thresholds
An Observation
  When δ descends, a string r’s tokens fall into Sig(r) one by one, in the order of their weight ranking.
  I.e. any node <sig-token, rid> is “active” when δ is below a certain threshold u<sig-token, rid>.
We record u<sig-token, rid> in each node and sort all nodes in each list in descending order of their u value.
For any given δ, we only need to retrieve a prefix of each list to get all “active” nodes.
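The prefix retrieval can be sketched as below. The class name `ThresholdList` and the u values are made up for illustration, and treating a node as active only when δ is strictly below u is an assumption about how the boundary case is handled.

```python
from bisect import bisect_left


class ThresholdList:
    """One inverted list whose nodes (u, rid) are kept in descending order of u."""

    def __init__(self, nodes):
        self.nodes = sorted(nodes, key=lambda node: -node[0])
        # Negated u values are ascending, so bisect can locate the cut point.
        self._neg_u = [-u for u, _ in self.nodes]

    def active(self, delta):
        """Ids of the nodes with u > delta, i.e. a prefix of the sorted list."""
        cut = bisect_left(self._neg_u, -delta)
        return [rid for _, rid in self.nodes[:cut]]


# Hypothetical activation thresholds for three nodes of one list.
canon_list = ThresholdList([(0.85, 3), (0.95, 1), (0.60, 7)])

at_08 = canon_list.active(0.8)
at_09 = canon_list.active(0.9)
at_05 = canon_list.active(0.5)
```

The index is built once for the lowest δ to be supported; each query threshold then just selects a shorter or longer prefix, so no regeneration is needed.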
Experimental Datasets
DBLP: 274,788 paper titles
1,838,973 URLs
Balance should be reached
Recall our two stages of filtration and verification
Performance (DBLP)
Conclusion
Our method causes no false negatives
Our method achieves a good balance between the two phases of filtration and verification
We also propose EvITER to eliminate duplicate computation
Our method is both effective and efficient
References
[1] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918-929, 2006.
[2] K. Chakrabarti, S. Chaudhuri, V. Ganti, and D. Xin. An efficient filter for approximate membership checking. In SIGMOD Conference, 2008.
[3] A. Chandel, P. C. Nagesh, and S. Sarawagi. Efficient batch top-k search for dictionary-based entity recognition. In ICDE, page 28, 2006.
[4] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, page 5, 2006.
[5] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness.
[6] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, pages 491-500, 2001.
[7] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257-266, 2008.
[8] C. Li, B. Wang, and X. Yang. VGRAM: Improving performance of approximate queries on string collections using variable length grams. In VLDB, 2007.
[9] G. Navarro. A guided tour to approximate string matching. ACM Comput. Surv., 33(1):31-88, 2001.
[10] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD Conference, 2004.
[11] A. Singhal. Modern information retrieval: A brief overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 24(4):35-43, 2001.
[12] E. Sutinen and J. Tarhio. On using q-gram locations in approximate string matching. In ESA, pages 327-340, 1995.
[13] W. Wang, C. Xiao, X. Lin, and C. Zhang. Efficient approximate entity extraction with edit distance constraints. In SIGMOD Conference, 2009.