HmSearch: An Efficient Hamming Distance Query Processing Algorithm


Page 1: HmSearch : An Efficient Hamming Distance Query Processing Algorithm

HmSearch: An Efficient Hamming Distance Query Processing Algorithm

Xiaoyang Zhang1, Jianbin Qin1, Wei Wang1, Yifang Sun1, Jiaheng Lu2

1 University of New South Wales, Australia
2 Renmin University of China, China

Faculty of Engineering, Computer Science and Engineering

Page 2: HmSearch : An Efficient Hamming Distance Query Processing Algorithm

School of Computer Science and Engineering

Motivation

• Identify Near Duplicate Webpages
  Webpages are mapped by simhash to codes such as 0012345679ABCDEF and 1012345679ABCDEF, which are similar.

• Chemical data
  Chemical data maps into binary codes such as 012345679ABCDEF0 and 012345679ABCDEF1, which are similar.

Page 3: HmSearch : An Efficient Hamming Distance Query Processing Algorithm

More Applications

• Iris recognition
• Image retrieval
• C2LSH

Page 4: HmSearch : An Efficient Hamming Distance Query Processing Algorithm

Outline

• Problem Definition
• Framework
• HmSearch
  – Partitioning Scheme
  – Signature Generation
  – Enhanced Filtering
  – Hierarchical Filtering and Verification
  – Dimension Rearrangement
• Conclusion
• Experiment

Page 5: HmSearch : An Efficient Hamming Distance Query Processing Algorithm

Hamming Distance Query

• Hamming distance
  The number of positions at which the corresponding symbols differ, for two vectors of equal length.

  q: ABCD
  v: ACCD
  hd(q, v) = 1

• Hamming distance query
  Given a database V of vectors, a query vector Q (all vectors have the same dimensionality N), and a Hamming distance threshold k, find all vi in V such that hd(vi, Q) <= k.
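The definition above can be sketched as a naive linear scan, which is the behaviour an index is meant to speed up. This is only an illustration; the function names and the small database V are made up for the example and do not come from the slides.

```python
from typing import List, Sequence


def hamming_distance(a: Sequence, b: Sequence) -> int:
    """Count the positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    return sum(x != y for x, y in zip(a, b))


def hamming_query(database: List[Sequence], q: Sequence, k: int) -> List[Sequence]:
    """Naive baseline: scan every vector and keep those within distance k."""
    return [v for v in database if hamming_distance(v, q) <= k]


# Hypothetical toy database; the first entry is the v from the slide.
V = ["ACCD", "ABCD", "DCBA"]
print(hamming_distance("ABCD", "ACCD"))  # 1
print(hamming_query(V, "ABCD", 1))       # ['ACCD', 'ABCD']
```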

Page 7: HmSearch : An Efficient Hamming Distance Query Processing Algorithm

Basic Idea

• General framework:
  1. We can do k=1 efficiently (shown later)
  2. So we transform a larger-k problem into several small k=1 problems by partitioning
  3. We do filtering by looking at each partition
  4. We do verification at the end

Example, with each vector split into a left and a right part:

  q: 1 1 | 1 1
  v: 1 2 | 1 1    (right parts are the same)

  hd(q, v)<=1 implies hd(qleft, vleft)=0 or hd(qright, vright)=0

  q: 1 1 | 1 1
  v: 1 2 | 2 1

  So if k=1, this v can be filtered by looking at each part.
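The two-part example can be checked with a few lines of code. This is a minimal sketch of the k=1 case only, assuming the vector is cut into two halves at the midpoint; the helper names are invented for the illustration.

```python
def split_halves(vec):
    """Cut a vector into a left and a right part at the midpoint."""
    mid = len(vec) // 2
    return tuple(vec[:mid]), tuple(vec[mid:])


def passes_k1_filter(q, v):
    """For k = 1: keep v only if at least one part matches q exactly."""
    q_left, q_right = split_halves(q)
    v_left, v_right = split_halves(v)
    return q_left == v_left or q_right == v_right


q = [1, 1, 1, 1]
print(passes_k1_filter(q, [1, 2, 1, 1]))  # True: the right parts are the same
print(passes_k1_filter(q, [1, 2, 2, 1]))  # False: both parts differ, so hd > 1
```

A vector that passes is only a candidate; it still has to be verified against the full threshold.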

Page 8: HmSearch : An Efficient Hamming Distance Query Processing Algorithm

Framework

[Figure: framework diagram. Indexing side: Data, Partitioning, Generating Signatures, Indexing, Index. Query side: Query, Partitioning, Generating Signatures, Candidates 0, Filtering, Candidates 1, Verification, Results. Techniques labelled on the diagram: General Partitioning Scheme; 1-variants and 1-deletion variants; Enhanced Filtering; Hierarchical Filtering and Verification; Dimension Rearrangement.]

Page 10: HmSearch : An Efficient Hamming Distance Query Processing Algorithm

Partitioning

Lower bound for the partition strategy

Given q and v such that hd(q, v)<=k, if the N dimensions are divided into κ parts, there should be at least m partitions such that hd(qpart, vpart) is within a per-partition bound (the formulas for m and for the bound are missing from this transcript).

In our algorithm, we choose κ (the chosen value is missing from this transcript), which gives:

• When k is even, m = 1
• When k is odd, m = 2

• When k = 0 or 1, m = 1, hd = 0
• When k >= 2, hd <= 1
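The slide's formulas did not survive in the transcript, so the sketch below rests on an assumption: the number of partitions is taken to be floor((k + 3) / 2), the value used in the published HmSearch paper and the one that reproduces the m values listed above. The bound itself is the plain pigeonhole argument: a partition with more than t errors uses at least t + 1 of the k allowed errors, so at most floor(k / (t + 1)) partitions can exceed t. All function names are hypothetical.

```python
from itertools import product


def num_partitions(k):
    """ASSUMPTION (not in the transcript): kappa = floor((k + 3) / 2)."""
    return (k + 3) // 2


def per_partition_threshold(k):
    """Per-partition bound from the slide: 0 for k = 0 or 1, otherwise 1."""
    return 0 if k <= 1 else 1


def guaranteed_partitions(k, kappa, t):
    """Pigeonhole bound: at most k // (t + 1) partitions can hold more than t errors."""
    return kappa - k // (t + 1)


def brute_force_minimum(k, kappa, t):
    """Worst case over every way to spread at most k errors over kappa partitions."""
    return min(
        sum(e <= t for e in errors)
        for errors in product(range(k + 1), repeat=kappa)
        if sum(errors) <= k
    )


for k in range(7):
    kappa = num_partitions(k)
    t = per_partition_threshold(k)
    print(k, kappa, t, guaranteed_partitions(k, kappa, t))
```

Under this assumption the printed m column is 1, 1, 1, 2, 1, 2, 1 for k = 0..6, which agrees with the even/odd rule on the slide for k >= 2 and with m = 1 for k = 0 or 1.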

Page 12: HmSearch : An Efficient Hamming Distance Query Processing Algorithm

Signature Generation

• 1-variants
  Substitute each dimension with each domain value, one at a time (plus the vector itself).

  v=[1, 2, 3] and Σ (domain) =[1, 2, 3]
  1-val(v)=[1, 2, 3], [2, 2, 3], [3, 2, 3], [1, 1, 3], [1, 3, 3], [1, 2, 1], [1, 2, 2]

  We index all 1-val(v), and when q comes in, we search for q in the index.

OR

• 1-deletion-variants
  Substitute each dimension with '#', one at a time.

  v=[1, 2, 3]
  1-del-val(v)=[#, 2, 3], [1, #, 3], [1, 2, #]

  We index all 1-del-val(v), and when q comes in, we generate 1-del-val(q) and search for every 1-del-val(q) in the index.
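Both signature schemes are easy to sketch. The code below generates the two kinds of variants for the slide's v=[1, 2, 3] and then builds a toy 1-deletion-variant index. The index layout (a dictionary from signature to vector ids) and all function names are assumptions made for illustration, not the authors' data structure.

```python
from collections import defaultdict


def one_variants(v, domain):
    """v itself plus every vector obtained by changing exactly one dimension."""
    variants = [tuple(v)]
    for i, original in enumerate(v):
        for value in domain:
            if value != original:
                variants.append(tuple(v[:i]) + (value,) + tuple(v[i + 1:]))
    return variants


def one_deletion_variants(v, placeholder="#"):
    """Replace each dimension in turn with the placeholder symbol."""
    return [tuple(v[:i]) + (placeholder,) + tuple(v[i + 1:]) for i in range(len(v))]


def build_deletion_index(vectors):
    """Map every 1-deletion variant to the ids of the vectors that produce it."""
    index = defaultdict(set)
    for vid, v in enumerate(vectors):
        for sig in one_deletion_variants(v):
            index[sig].add(vid)
    return index


def query_within_one(index, q):
    """Ids of indexed vectors sharing a 1-deletion variant with q (hd <= 1)."""
    hits = set()
    for sig in one_deletion_variants(q):
        hits |= index.get(sig, set())
    return hits


print(len(one_variants([1, 2, 3], [1, 2, 3])))  # 7
print(one_deletion_variants([1, 2, 3]))

data = [[1, 2, 3], [1, 1, 3], [3, 2, 1]]        # hypothetical toy data
print(query_within_one(build_deletion_index(data), [1, 2, 3]))  # {0, 1}
```

The trade-off visible here is that 1-variants put the work on the index side (one entry per dimension and domain value) while 1-deletion-variants need only one entry per dimension but several lookups per query.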

Page 14: HmSearch : An Efficient Hamming Distance Query Processing Algorithm

Enhanced Filter (Even)

Based on the formula before: when k (k>=1) is even, m = 1.

Example (k = 2, two partitions):

  q: 1 2 3 | 4 5 6
  v: 1 2 1 | 4 2 3

If k = 2, based on the formula before, m=1, and hd(vpart, qpart)=1 holds for the first partition. So this v becomes a false positive.

However, we find that if k (k>=1) is even, v is qualified in only two situations:
1) m=1, where hd(vpart, qpart)=0
2) m=2, where hd(vpart, qpart)<=1

Using the enhanced filter, neither situation applies, so v is filtered.

Page 15: HmSearch : An Efficient Hamming Distance Query Processing Algorithm

Enhanced Filter (Odd)

Based on the formula before: when k (k>=1) is odd, m = 2.

Example (k = 3, three partitions):

  q: 1 2 | 3 4 | 5 6
  v: 1 1 | 1 4 | 2 3

If k = 3, based on the formula before, m=2, and hd(vpart, qpart)=1 holds for the first two partitions. So this v becomes a false positive.

However, we find that if k (k>=1) is odd, v is qualified in only two situations:
1) m=2, where hd(vpart, qpart)<=1 and at least one of them = 0
2) m=3, where hd(vpart, qpart)<=1

Using the enhanced filter, neither situation applies, so v is filtered.
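The even and odd rules can be compared against the basic rule on the two slide examples. This is a sketch under two assumptions: the dimensions are split into equal contiguous partitions, and k >= 2 so that the per-partition bound is 1. Function names are invented for the example.

```python
def partition_distances(q, v, kappa):
    """Hamming distance of each of kappa equal, contiguous partitions."""
    n = len(q)
    bounds = [i * n // kappa for i in range(kappa + 1)]
    return [
        sum(a != b for a, b in zip(q[lo:hi], v[lo:hi]))
        for lo, hi in zip(bounds, bounds[1:])
    ]


def basic_filter(dists, k):
    """Basic rule: m partitions with distance <= 1 (m = 1 if k is even, else 2)."""
    m = 1 if k % 2 == 0 else 2
    return sum(d <= 1 for d in dists) >= m


def enhanced_filter(dists, k):
    """Enhanced rule from the two slides above."""
    zeros = sum(d == 0 for d in dists)
    at_most_one = sum(d <= 1 for d in dists)
    if k % 2 == 0:
        return zeros >= 1 or at_most_one >= 2
    return (zeros >= 1 and at_most_one >= 2) or at_most_one >= 3


q = [1, 2, 3, 4, 5, 6]

even = partition_distances(q, [1, 2, 1, 4, 2, 3], 2)   # k = 2, two partitions
print(even, basic_filter(even, 2), enhanced_filter(even, 2))  # [1, 2] True False

odd = partition_distances(q, [1, 1, 1, 4, 2, 3], 3)    # k = 3, three partitions
print(odd, basic_filter(odd, 3), enhanced_filter(odd, 3))     # [1, 1, 2] True False
```

In both examples the basic rule keeps v as a candidate while the enhanced rule discards it, matching the slides.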

Page 17: HmSearch : An Efficient Hamming Distance Query Processing Algorithm

Hierarchical Filtering and Verification

We can use binary operations to do hierarchical filtering and verification.

Σ=|8|, k=1

  v=[5, 0, 3, 6]
  q=[5, 2, 2, 5]

[Figure: the values of v and q are stored bit by bit (1st, 2nd and 3rd significant bit), one bit vector per level with one bit per dimension. Matching levels of v and q are XORed into a diff, and the diffs are ORed together.]

A diff of 0011 already shows two differing dimensions, so hd(v, q)>=2 and v is filtered.

Moreover, even if k=4, hd(v, q)=3 is calculated with 4 comparisons: the ORed diff is 0111, which has three 1s.

Page 18: HmSearch : An Efficient Hamming Distance Query Processing Algorithm

Hierarchical Filtering and Verification

  v=[5, 0, 3, 6]
  q=[5, 2, 3, 5]

[Figure: bit vectors of v and q for the 1st, 2nd and 3rd significant bit, XORed level by level into diff and ORed into cumdiff, which starts at 0000.]

  diff    cumdiff   Number of 1s in cumdiff
  0001    0001      1   (<=1, continue)
  0101    0101      2   (>1, filtered)
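The level-by-level XOR/OR check can be sketched with Python integers as bit vectors. Two choices here are assumptions: levels are processed from the least significant bit upward (this order reproduces the cumdiff sequence 0001, 0101 shown on the slide), and the leftmost bit of each plane belongs to the first dimension. Function names are hypothetical.

```python
def bit_planes(vec, bits):
    """One integer per bit level, least significant level first.

    Within a plane, the leftmost bit belongs to the first dimension.
    """
    planes = []
    for level in range(bits):
        plane = 0
        for x in vec:
            plane = (plane << 1) | ((x >> level) & 1)
        planes.append(plane)
    return planes


def hierarchical_check(v_planes, q_planes, k):
    """XOR level by level, OR into cumdiff, stop once more than k bits are set.

    Returns (survives, cumdiff). If every level is processed, the number of
    1s in cumdiff equals the exact Hamming distance.
    """
    cumdiff = 0
    for pv, pq in zip(v_planes, q_planes):
        cumdiff |= pv ^ pq
        if bin(cumdiff).count("1") > k:
            return False, cumdiff
    return True, cumdiff


v = bit_planes([5, 0, 3, 6], 3)

ok, cum = hierarchical_check(v, bit_planes([5, 2, 3, 5], 3), 1)
print(ok, format(cum, "04b"))   # False 0101

ok, cum = hierarchical_check(v, bit_planes([5, 2, 2, 5], 3), 4)
print(ok, format(cum, "04b"), bin(cum).count("1"))   # True 0111 3
```

The early exit is what makes the check hierarchical: for small k most non-matching candidates are rejected after one or two word-level operations.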

Page 20: HmSearch : An Efficient Hamming Distance Query Processing Algorithm

Impact of Data Skewness

Given k=2, then m = 1 and k'=1.

        Partition1   Partition2
  Dim   1  2  3      4  5  6
  q     1  1  1      1  0  0
  v1    1  1  1      0  0  0
  v2    0  0  0      2  0  0
  v3    2  0  2      0  0  0
  v4    3  0  0      0  0  0

All vectors are qualified.

We propose to reset the dimension order and the partition length to improve performance.

        Partition1   Partition2
  Dim   1  2  5      4  3  6
  q     1  1  0      1  1  0
  v1    1  1  0      0  1  0
  v2    0  0  0      2  0  0
  v3    2  0  0      0  2  0
  v4    3  0  0      0  0  0

Only v1 is qualified.
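The effect of the reordering can be reproduced with the slide's data. The sketch applies the basic rule for k=2 (a vector is a candidate if at least one partition has distance <= 1) under the original order and under the order 1 2 5 | 4 3 6. The function name and the dictionary layout are assumptions for the illustration.

```python
def candidates(q, vectors, order, parts=2, per_part_threshold=1):
    """Names of vectors with at least one partition within the threshold.

    `order` lists 1-based dimension numbers; consecutive equal-size chunks
    of it form the partitions.
    """
    size = len(order) // parts
    groups = [order[i * size:(i + 1) * size] for i in range(parts)]
    kept = []
    for name, v in vectors.items():
        for dims in groups:
            if sum(q[d - 1] != v[d - 1] for d in dims) <= per_part_threshold:
                kept.append(name)
                break
    return kept


q = [1, 1, 1, 1, 0, 0]
vectors = {
    "v1": [1, 1, 1, 0, 0, 0],
    "v2": [0, 0, 0, 2, 0, 0],
    "v3": [2, 0, 2, 0, 0, 0],
    "v4": [3, 0, 0, 0, 0, 0],
}

print(candidates(q, vectors, [1, 2, 3, 4, 5, 6]))  # ['v1', 'v2', 'v3', 'v4']
print(candidates(q, vectors, [1, 2, 5, 4, 3, 6]))  # ['v1']
```

In the original order, dimensions 5 and 6 are 0 for every vector, so the second partition matches almost anything; spreading those skewed dimensions across partitions removes the false candidates.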

Page 21: HmSearch : An Efficient Hamming Distance Query Processing Algorithm

Greedy Dimension Rearrangement

MaxFreq is the maximum frequency of any value in each dimension.

                    Partition1   Partition2
  Dim               1  2  3      4  5  6
  v1                1  1  1      0  0  0
  v2                0  0  0      2  0  0
  v3                2  0  2      0  0  0
  v4                3  0  0      0  0  0
  MaxFreq for Dim   1  3  2      3  4  4

Our goal: minimize the global MaxFreq.

                          Partition1   Partition2
  Dim                     5  1  2      6  3  4
  v1                      0  1  1      0  1  0
  v2                      0  0  0      0  0  2
  v3                      0  2  0      0  2  0
  v4                      0  3  0      0  0  0
  MaxFreq for partition   4  1  1      4  2  1

The goal is achieved.
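MaxFreq can be computed directly from the four vectors. The sketch below assumes that the MaxFreq of a partition is the highest count of any value combination on that partition's dimensions, which is an interpretation of the slide rather than something it states. It evaluates the two arrangements shown and does not implement the greedy search itself. Computed from the four vectors as listed, dimension 3 comes out as 2.

```python
from collections import Counter


def max_freq(vectors, dims):
    """Highest count of any value combination on the given 1-based dimensions."""
    projections = Counter(tuple(v[d - 1] for d in dims) for v in vectors)
    return max(projections.values())


data = [
    [1, 1, 1, 0, 0, 0],  # v1
    [0, 0, 0, 2, 0, 0],  # v2
    [2, 0, 2, 0, 0, 0],  # v3
    [3, 0, 0, 0, 0, 0],  # v4
]

# MaxFreq of each single dimension.
print([max_freq(data, [d]) for d in range(1, 7)])          # [1, 3, 2, 3, 4, 4]

# Global MaxFreq (worst partition) before and after the rearrangement.
print(max(max_freq(data, p) for p in ([1, 2, 3], [4, 5, 6])))  # 3
print(max(max_freq(data, p) for p in ([5, 1, 2], [6, 3, 4])))  # 1

# MaxFreq of each partition as its dimensions are added one by one.
print([max_freq(data, [5, 1, 2][:i]) for i in (1, 2, 3)])  # [4, 1, 1]
print([max_freq(data, [6, 3, 4][:i]) for i in (1, 2, 3)])  # [4, 2, 1]
```

A lower partition MaxFreq means fewer vectors share the same partition contents, so fewer vectors can become candidates through that partition.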

Page 23: HmSearch : An Efficient Hamming Distance Query Processing Algorithm

Conclusion

1. General Partition Scheme
2. 1-variants and 1-deletion-variants
3. Techniques that help boost the performance
   – Enhanced Filtering
   – Hierarchical Filtering and Verification
   – Dimension Rearrangement

Page 25: HmSearch : An Efficient Hamming Distance Query Processing Algorithm

Experiment Settings

• Environment
  – Intel Xeon X3330 2.664GHz CPU, 4GB RAM
  – Debian 5.0.6
  – AMD Opteron™ 8378 2.4GHz CPU, 96GB RAM (for PubChem)
  – Ubuntu/Linaro 4.6.4-1 ubuntu5
  – All compiled with GCC 4.1.2 with -O3

• Dataset

Page 26: HmSearch : An Efficient Hamming Distance Query Processing Algorithm

Experiment Settings

• Terms
  – EF: Enhanced Filtering
  – HB: Hierarchical Binary Filter
  – RD: Rearranging Dimensions

• Our algorithms
  1. HSD and HSV, our proposed algorithms; the former uses 1-deletion-variants as signatures and the latter uses 1-variants as signatures
  2. HSD-nEB and HSV-nEB, variations that remove EF and HB
  3. HSD-nB and HSV-nB, variations that remove HB
  4. HSD-nR and HSV-nR, variations that remove RD

• Baseline algorithm
  1. Scancount (Li et al., ICDE08)

• State-of-the-art algorithms
  1. Google (Manku et al., WWW07)
  2. Hengine (Liu et al., ICDE11)

Page 27: HmSearch : An Efficient Hamming Distance Query Processing Algorithm

Query time

HSV has the best performance

Page 28: HmSearch : An Efficient Hamming Distance Query Processing Algorithm

Candidate Size

HSV has the smallest candidate size

Page 29: HmSearch : An Efficient Hamming Distance Query Processing Algorithm

Effect of EF and HB

EF and HB help improve the performance

Page 30: HmSearch : An Efficient Hamming Distance Query Processing Algorithm

Effect of RD

RD boosts the performance on the PubChem data

Page 31: HmSearch : An Efficient Hamming Distance Query Processing Algorithm

Index Size

HSV and HSD have a larger index size

Page 32: HmSearch : An Efficient Hamming Distance Query Processing Algorithm

Thank you