HmSearch : An Efficient Hamming Distance Query Processing Algorithm
Never Stand Still Faculty of Engineering Computer Science and Engineering
Xiaoyang Zhang [1], Jianbin Qin [1], Wei Wang [1], Yifang Sun [1], Jiaheng Lu [2]
[1] University of New South Wales, Australia
[2] Renmin University of China, China
School of Computer Science and Engineering
Motivation
• Identify near-duplicate webpages
  – simhash maps similar pages to similar codes, e.g. 0012345679ABCDEF and 1012345679ABCDEF
• Chemical data
  – Maps into binary codes; similar data get similar codes, e.g. 012345679ABCDEF0 and 012345679ABCDEF1
More Applications
• Iris recognition
• Image retrieval
• C2LSH
Outline
• Problem Definition
• Framework
• HmSearch
  – Partitioning Scheme
  – Signature Generation
  – Enhanced Filtering
  – Hierarchical Filtering and Verification
  – Dimension Rearrangement
• Conclusion
• Experiment
Hamming Distance Query
• Hamming distance
  Number of positions at which the corresponding symbols are different for two equal-length vectors.
  q: ABCD
  v: ACCD
  hd(q, v) = 1
• Hamming distance query
  Given a database V of vectors, a query vector Q (all the vectors have the same dimensionality N) and a Hamming distance threshold k, find all vi in V such that hd(vi, Q) <= k.
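The two definitions above can be sketched as a naive linear scan. This is only a reference point for the problem statement, not the HmSearch method; the function names and the toy database are illustrative assumptions.

```python
from typing import List, Sequence


def hamming_distance(u: Sequence, v: Sequence) -> int:
    """Number of positions at which u and v differ (equal lengths required)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    return sum(a != b for a, b in zip(u, v))


def hamming_query(database: List[Sequence], q: Sequence, k: int) -> List[Sequence]:
    """Naive linear scan: every v in the database with hd(v, q) <= k."""
    return [v for v in database if hamming_distance(v, q) <= k]


# Hypothetical toy database; "ABCD" vs "ACCD" is the example from the slide.
database = ["ACCD", "ABCD", "DCBA"]
print(hamming_distance("ABCD", "ACCD"))       # 1
print(hamming_query(database, "ABCD", k=1))   # ['ACCD', 'ABCD']
```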
Basic Idea
• General framework:
  1. We can do k = 1 efficiently (shown later)
  2. So we transform a larger-k problem into several small k = 1 problems by partitioning
  3. We do filtering by looking at each partition
  4. We do verification last
• Example with k = 1:
  q = 1 1 | 1 1
  v = 1 2 | 1 1   (the right parts are the same)
  hd(q, v) <= 1 implies hd(q_left, v_left) = 0 or hd(q_right, v_right) = 0
• So if k = 1, the following v can be filtered by looking at each part:
  q = 1 1 | 1 1
  v = 1 2 | 2 1
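A minimal sketch of the k = 1 case, using the two example vectors from the slide. The helper name and the even split into a left and a right half are assumptions made for illustration.

```python
def passes_k1_filter(q, v):
    """k = 1 with two partitions: v survives only if its left half or its
    right half is identical to the corresponding half of q."""
    mid = len(q) // 2
    return q[:mid] == v[:mid] or q[mid:] == v[mid:]


q = [1, 1, 1, 1]
v_close = [1, 2, 1, 1]   # hd = 1: the right halves match, so it survives
v_far = [1, 2, 2, 1]     # hd = 2: neither half matches, so it is filtered

print(passes_k1_filter(q, v_close))  # True
print(passes_k1_filter(q, v_far))    # False
```

A vector that survives is only a candidate; the final verification step still computes the full distance.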
Framework
[Figure: framework diagram. Indexing path: Data → Partitioning → Generating Signatures → Indexing → Index. Query path: Query → Partitioning → Generating Signatures → Candidates0 → Filtering → Candidates1 → Verification → Results.]
Techniques shown in the diagram: General Partitioning Scheme; 1-variants and 1-deletion variants; Enhanced Filtering; Hierarchical Filtering and Verification; Dimension Rearrangement
Partitioning
Lower bound for the partition strategy
Given q and v such that hd(q, v) <= k, if the N dimensions are divided into κ parts, there should be at least m partitions such that hd(q_part, v_part) <= k'.
In our algorithm, κ is chosen so that:
• When k = 0 or 1: m = 1 and hd(q_part, v_part) = 0
• When k >= 2: hd(q_part, v_part) <= 1, with m = 1 when k is even and m = 2 when k is odd
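The even/odd values of m follow from a pigeonhole argument, sketched below. The partition count used here, κ = floor((k + 3) / 2), is an assumption: the value is not legible in this transcript, and this choice is simply one that reproduces m = 1 for even k and m = 2 for odd k. The function names are illustrative.

```python
def guaranteed_close_partitions(k: int, kappa: int) -> int:
    """Lower bound on the number of partitions with at most 1 mismatch when
    at most k mismatches are spread over kappa partitions."""
    # Pushing a partition beyond distance 1 costs at least 2 mismatches,
    # so at most k // 2 partitions can be pushed that far.
    return kappa - min(kappa, k // 2)


def assumed_partition_count(k: int) -> int:
    # ASSUMPTION: kappa = floor((k + 3) / 2); not stated in the transcript.
    return (k + 3) // 2


for k in range(2, 8):
    kappa = assumed_partition_count(k)
    print(k, kappa, guaranteed_close_partitions(k, kappa))
# prints: 2 2 1 / 3 3 2 / 4 3 1 / 5 4 2 / 6 4 1 / 7 5 2 (one triple per line)
```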
Signature Generation
• 1-variants: substitute each dimension with each domain value, one at a time (plus the vector itself)
  v = [1, 2, 3] and Σ (domain) = [1, 2, 3]
  1-val(v) = [1, 2, 3], [2, 2, 3], [3, 2, 3], [1, 1, 3], [1, 3, 3], [1, 2, 1], [1, 2, 2]
  We index all 1-val(v); when q comes in, we search for q in the index.
OR
• 1-deletion-variants: substitute each dimension with '#', one at a time
  v = [1, 2, 3]
  1-del-val(v) = [#, 2, 3], [1, #, 3], [1, 2, #]
  We index all 1-del-val(v); when q comes in, we generate 1-del-val(q) and search for all 1-del-val(q) in the index.
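Both signature schemes can be sketched in a few lines. The function names, the dictionary-based index and the toy vectors are assumptions for illustration; the sketch also assumes '#' never occurs as a domain value.

```python
from collections import defaultdict


def one_variants(v, domain):
    """v itself plus every vector that differs from v in exactly one dimension."""
    variants = [tuple(v)]
    for i, original in enumerate(v):
        for value in domain:
            if value != original:
                variants.append(tuple(v[:i]) + (value,) + tuple(v[i + 1:]))
    return variants


def one_deletion_variants(v):
    """Replace each dimension in turn with the placeholder '#'."""
    return [tuple(v[:i]) + ("#",) + tuple(v[i + 1:]) for i in range(len(v))]


def build_deletion_index(vectors):
    """Map every 1-deletion variant to the ids of the vectors producing it."""
    index = defaultdict(set)
    for vid, v in enumerate(vectors):
        for sig in one_deletion_variants(v):
            index[sig].add(vid)
    return index


def search_within_one(index, q):
    """Ids of indexed vectors sharing a 1-deletion variant with q (hd <= 1)."""
    hits = set()
    for sig in one_deletion_variants(q):
        hits |= index.get(sig, set())
    return hits


print(one_deletion_variants([1, 2, 3]))    # [('#', 2, 3), (1, '#', 3), (1, 2, '#')]
print(len(one_variants([1, 2, 3], [1, 2, 3])))  # 7

index = build_deletion_index([[1, 2, 3], [1, 3, 3], [3, 2, 1]])
print(sorted(search_within_one(index, [1, 1, 3])))  # [0, 1]
```

The trade-off visible here: 1-variants enlarge the index by the domain size but need a single lookup per query, while 1-deletion-variants keep one signature per dimension on both sides.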
Enhanced Filter (Even)
Based on the formula before: when k (k >= 1) is even, m = 1.
Example:
  v = 1 2 3 | 4 5 6
  q = 1 2 1 | 4 2 3
If k = 2, based on the formula before, m = 1 and hd(v_part, q_part) = 1 in the first partition, so this v becomes a false positive.
However, we find that if k (k >= 1) is even, v is qualified in two situations:
  1) m = 1, where hd(v_part, q_part) = 0
  2) m = 2, where hd(v_part, q_part) <= 1
Using the enhanced filter, neither situation applies, so v is filtered.
Enhanced Filter (Odd)
Based on the formula before: when k (k >= 1) is odd, m = 2.
Example:
  v = 1 2 | 3 4 | 5 6
  q = 1 1 | 1 4 | 2 3
If k = 3, based on the formula before, m = 2 and hd(v_part, q_part) = 1 in two partitions, so this v becomes a false positive.
However, we find that if k (k >= 1) is odd, v is qualified in two situations:
  1) m = 2, where hd(v_part, q_part) <= 1 and at least one of them = 0
  2) m = 3, where hd(v_part, q_part) <= 1
Using the enhanced filter, neither situation applies, so v is filtered.
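The even and odd rules can be checked against the two slide examples with a short sketch. The helper names and the partition counts (2 parts for k = 2, 3 parts for k = 3) are assumptions for illustration; which row is v and which is q does not matter, because the Hamming distance is symmetric.

```python
def hd(u, v):
    return sum(a != b for a, b in zip(u, v))


def partition_distances(q, v, kappa):
    """Per-partition Hamming distances for kappa contiguous, near-equal parts."""
    n = len(q)
    bounds = [i * n // kappa for i in range(kappa + 1)]
    return [hd(q[bounds[i]:bounds[i + 1]], v[bounds[i]:bounds[i + 1]])
            for i in range(kappa)]


def basic_filter(dists, k):
    """Basic rule: at least m partitions within distance 1 (m = 1 even, 2 odd)."""
    m = 1 if k % 2 == 0 else 2
    return sum(d <= 1 for d in dists) >= m


def enhanced_filter(dists, k):
    """Enhanced rule from the two slides; True means v stays a candidate."""
    zeros = sum(d == 0 for d in dists)
    within_one = sum(d <= 1 for d in dists)
    if k % 2 == 0:
        return zeros >= 1 or within_one >= 2
    return (within_one >= 2 and zeros >= 1) or within_one >= 3


v = [1, 2, 3, 4, 5, 6]
even = partition_distances([1, 2, 1, 4, 2, 3], v, kappa=2)   # [1, 2]
odd = partition_distances([1, 1, 1, 4, 2, 3], v, kappa=3)    # [1, 1, 2]

print(basic_filter(even, 2), enhanced_filter(even, 2))  # True False
print(basic_filter(odd, 3), enhanced_filter(odd, 3))    # True False
```

In both examples the basic rule keeps v as a false positive while the enhanced rule rejects it.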
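Hierarchical Filtering and Verification
We can use binary operations to do hierarchical filtering and verification.
Example: v = [5, 0, 3, 6], q = [5, 2, 2, 5], |Σ| = 8 (3 bits per dimension; the 1st significant bit is the lowest bit)

Significant bit   v         q         diff (XOR)   cumulative diff (OR)
1st               1 0 1 0   1 0 0 1   0011         0011
2nd               0 0 1 1   0 1 1 0   0101         0111
3rd               1 0 0 1   1 0 0 1   0000         0111

With k = 1: after the 1st bit the cumulative diff already contains two 1s, so hd(v, q) >= 2 and v is filtered.
Moreover, even if k = 4, the bitwise operations (XOR, XOR, XOR, OR, OR) yield 0111, i.e. hd(v, q) = 3, in place of 4 per-dimension comparisons.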
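A minimal sketch of the bit-level procedure, assuming each bit level of a vector is packed into one integer and that the levels are processed from the lowest bit upward. The function names and this packing layout are illustrative assumptions, not taken from the slides.

```python
def bit_planes(vec, bits):
    """planes[b] packs bit b (0 = lowest) of every dimension into one integer,
    with the first dimension in the highest position."""
    planes = []
    for b in range(bits):
        plane = 0
        for x in vec:
            plane = (plane << 1) | ((x >> b) & 1)
        planes.append(plane)
    return planes


def hierarchical_hd(v_planes, q_planes, k):
    """Return hd(v, q), or None as soon as the cumulative diff exceeds k."""
    cumdiff = 0
    for pv, pq in zip(v_planes, q_planes):
        cumdiff |= pv ^ pq                   # XOR marks differing dimensions
        if bin(cumdiff).count("1") > k:      # early termination (filtering)
            return None
    return bin(cumdiff).count("1")           # verification result


v_planes = bit_planes([5, 0, 3, 6], bits=3)
q_planes = bit_planes([5, 2, 2, 5], bits=3)

print(hierarchical_hd(v_planes, q_planes, k=1))  # None (filtered after one level)
print(hierarchical_hd(v_planes, q_planes, k=4))  # 3
```

The number of 1s in the cumulative diff never decreases from one level to the next, which is what makes stopping early safe.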
Hierarchical Filtering and Verification
Example: v = [5, 0, 3, 6], q = [5, 2, 3, 5], k = 1

Significant bit   v         q         diff (XOR)   cumdiff (OR)   Number of 1s in cumdiff
1st               1 0 1 0   1 0 1 1   0001         0001           1 (<= 1, continue)
2nd               0 0 1 1   0 1 1 0   0101         0101           2 (> 1, filtered)
3rd               1 0 0 1   1 0 0 1   0000         (not needed)
Impact of Data Skewness
Given k = 2, then m = 1 and k' = 1.

Original dimension order: all vectors are qualified
        Partition1   Partition2
Dim     1  2  3      4  5  6
q       1  1  1      1  0  0
v1      1  1  1      0  0  0
v2      0  0  0      2  0  0
v3      2  0  2      0  0  0
v4      3  0  0      0  0  0

We propose to reset the dimension order and the partition length to improve performance.

Rearranged dimension order: only v1 is qualified
        Partition1   Partition2
Dim     1  2  5      4  3  6
q       1  1  0      1  1  0
v1      1  1  0      0  1  0
v2      0  0  0      2  0  0
v3      2  0  0      0  2  0
v4      3  0  0      0  0  0
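The effect of the dimension order can be reproduced with a short sketch, using m = 1 and k' = 1 as on the slide. The function name, the 1-based order lists and the equal-length split are illustrative assumptions.

```python
def hd(u, v):
    return sum(a != b for a, b in zip(u, v))


def is_candidate(q, v, order, kappa=2, k_prime=1, m=1):
    """Reorder the dimensions (1-based ids in `order`), split them into kappa
    equal parts, and require at least m parts within distance k_prime."""
    rq = [q[d - 1] for d in order]
    rv = [v[d - 1] for d in order]
    size = len(order) // kappa
    close = sum(hd(rq[i:i + size], rv[i:i + size]) <= k_prime
                for i in range(0, len(order), size))
    return close >= m


q = [1, 1, 1, 1, 0, 0]
data = {
    "v1": [1, 1, 1, 0, 0, 0],
    "v2": [0, 0, 0, 2, 0, 0],
    "v3": [2, 0, 2, 0, 0, 0],
    "v4": [3, 0, 0, 0, 0, 0],
}

for order in ([1, 2, 3, 4, 5, 6], [1, 2, 5, 4, 3, 6]):
    print(order, [name for name, v in data.items() if is_candidate(q, v, order)])
# [1, 2, 3, 4, 5, 6] ['v1', 'v2', 'v3', 'v4']
# [1, 2, 5, 4, 3, 6] ['v1']
```

Only v1 is a true result for k = 2 (hd = 1; the others have hd = 4), so the rearranged order removes all three false candidates before verification.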
Greedy Dimension Rearrangement
Original dimension order:
        Partition1   Partition2
Dim     1  2  3      4  5  6
v1      1  1  1      0  0  0
v2      0  0  0      2  0  0
v3      2  0  2      0  0  0
v4      3  0  0      0  0  0
MaxFreq for Dim 1 to 6: 1, 3, 2, 3, 4, 4
MaxFreq is the maximum frequency of any value in each dimension.
Rearranged dimension order:
        Partition1   Partition2
Dim     5  1  2      6  3  4
v1      0  1  1      0  1  0
v2      0  0  0      0  0  2
v3      0  2  0      0  2  0
v4      0  3  0      0  0  0
Our goal: Minimize the global MaxFreq
MaxFreqfor partition 4 41 211
Achieve the goal
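A minimal sketch of the quantities involved, using the four data vectors from this slide. The per-dimension MaxFreq follows the definition given here; the per-partition version (frequency of the most common combination of values inside a partition) is an assumption about what the slide measures, and the greedy procedure itself is not shown in the transcript, so the sketch only evaluates two given orders.

```python
from collections import Counter

vectors = [
    [1, 1, 1, 0, 0, 0],  # v1
    [0, 0, 0, 2, 0, 0],  # v2
    [2, 0, 2, 0, 0, 0],  # v3
    [3, 0, 0, 0, 0, 0],  # v4
]


def max_freq_per_dim(vectors):
    """Frequency of the most common value in each dimension."""
    return [max(Counter(column).values()) for column in zip(*vectors)]


def max_freq_per_partition(vectors, order, kappa=2):
    """ASSUMED definition: frequency of the most common projected value in
    each partition, for 1-based dimension ids listed in `order`."""
    size = len(order) // kappa
    result = []
    for start in range(0, len(order), size):
        dims = order[start:start + size]
        projected = Counter(tuple(v[d - 1] for d in dims) for v in vectors)
        result.append(max(projected.values()))
    return result


print(max_freq_per_dim(vectors))
print(max_freq_per_partition(vectors, [1, 2, 3, 4, 5, 6]))  # [1, 3]
print(max_freq_per_partition(vectors, [5, 1, 2, 6, 3, 4]))  # [1, 1]
```

Under this assumed measure the largest partition frequency drops from 3 to 1 after rearrangement, which is the direction the stated goal asks for.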
Conclusion
1. General Partition Scheme
2. 1-variants and 1-deletion-variants
3. Techniques that help boost the performance
   – Enhanced Filtering
   – Hierarchical Filtering and Verification
   – Dimension Rearrangement
Experiment Settings
• Environment
  – Intel Xeon X3330 2.664GHz CPU, 4GB RAM
  – Debian 5.0.6
  – AMD Opteron™ 8378 2.4GHz CPU, 96GB RAM (for PubChem)
  – Ubuntu/Linaro 4.6.4-1 ubuntu5
  – All compiled with GCC 4.1.2 with -O3
• Dataset
Experiment Settings
• Terms
  – EF: Enhanced Filtering
  – HB: Hierarchical Binary Filter
  – RD: Rearranging Dimensions
• Our algorithms
  1. HSD, HSV: our proposed algorithms, the former using 1-deletion-variants as signatures and the latter using 1-variants as signatures
  2. HSD-nEB, HSV-nEB: variations that remove EF and HB
  3. HSD-nB, HSV-nB: variations that remove HB
  4. HSD-nR, HSV-nR: variations that remove RD
• Baseline algorithm
  1. Scancount (Li et al., ICDE08)
• State-of-the-art algorithms
  1. Google (Manku et al., WWW07)
  2. Hengine (Liu et al., ICDE11)
Query time
HSV has the best performance
Candidate Size
HSV has the smallest candidate size
Effect of EF and HB
EF and HB help improve the performance
Effect of RD
RD boosts the performance for PubChem data
Index Size
HSV and HSD have a larger index size
Thank you