HmSearch : An Efficient Hamming Distance Query Processing Algorithm

Never Stand Still Faculty of Engineering Computer Science and Engineering

Xiaoyang Zhang1, Jianbin Qin1, Wei Wang1, Yifang Sun1, Jiaheng Lu2

1 University of New South Wales, Australia
2 Renmin University of China, China

School of Computer Science and Engineering

Motivation

• Identify near-duplicate webpages

0012345679ABCDEF

simhash

1012345679ABCDEF

Similar

• Chemical data

Maps into binary code

012345679ABCDEF0

012345679ABCDEF1

Similar


More Applications

• Iris recognition

• Image retrieval

• C2LSH


Outline

• Problem Definition
• Framework
• HmSearch
– Partitioning Scheme
– Signature Generation
– Enhanced Filtering
– Hierarchical Filtering and Verification
– Dimension Rearrangement
• Conclusion
• Experiment


Hamming Distance Query

• Hamming distance

Number of positions at which the corresponding symbols are different for two equal-length vectors.

q: ABCD

v: ACCD

Hamming distance(q, v) = 1

• Hamming distance query

Given a database V of vectors, a query vector Q (all the vectors have the same dimensionality N) and a Hamming distance threshold k, find all vi in V such that hd(vi, Q) <= k
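As a point of reference, the query definition can be written as a plain linear scan. This is only a minimal sketch of the problem statement, not the HmSearch algorithm; the function names and the small sample database are made up for illustration.

```python
from typing import List, Sequence


def hamming_distance(u: Sequence, v: Sequence) -> int:
    """Count the positions at which two equal-length vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    return sum(a != b for a, b in zip(u, v))


def hamming_query(database: List[Sequence], q: Sequence, k: int) -> List[Sequence]:
    """Return every vector in the database within Hamming distance k of q."""
    return [v for v in database if hamming_distance(v, q) <= k]


# Hypothetical sample data; "ABCD" vs "ACCD" is the pair from the slide.
database = ["ACCD", "ABCD", "ADDA", "ABDD"]
print(hamming_distance("ABCD", "ACCD"))
print(hamming_query(database, "ABCD", 1))
```

The scan touches every vector, which is the cost the partitioning and signature techniques on the following slides try to avoid.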


Basic Idea

• General framework:

1. We can do k=1 efficiently (shown later)

2. So we transform a larger-k problem into several small k=1 problems by partitioning

3. We do filtering by looking at each partition

4. We do verification at last

If hd(q, v) <= 1, then hd(qleft, vleft) = 0 or hd(qright, vright) = 0. So if k = 1, v can be filtered by looking at each part.

q: 1 1 | 1 1

v: 1 2 | 1 1 (the right part is the same, so v is kept)

q: 1 1 | 1 1

v: 1 2 | 2 1 (neither part is the same, so v is filtered)


Framework

[Figure: framework diagram. Indexing path: Data, Partitioning, Generating Signatures, Indexing, Index. Query path: Query, Partitioning, Generating Signatures, Candidates 0, Filtering, Candidates 1, Verification, Results. Techniques annotated on the stages: General Partitioning Scheme, 1-variants and 1-deletion variants, Enhanced Filtering, Hierarchical Filtering and Verification, Dimension Rearrangement.]


Partitioning

Lower bound for partition strategy

Given q and v such that hd(q, v) <= k, if the N dimensions are divided into κ parts, there should be at least m = κ - ⌊k/(k'+1)⌋ partitions such that hd(qpart, vpart) <= k'

In our algorithm, we choose κ = ⌊(k+3)/2⌋

When k is even, m = 1
When k is odd, m = 2

When k = 0 or 1, m = 1, hd = 0

When k >= 2, hd <= 1
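The m values quoted above can be checked with a small counting experiment. This is a sketch under two assumptions that are reconstructions rather than text from the slide: the number of partitions is κ = ⌊(k+3)/2⌋, and the lower bound is m = κ - ⌊k/(k'+1)⌋, where k' is the per-partition threshold. All function names are hypothetical.

```python
from itertools import product


def num_partitions(k: int) -> int:
    """Assumed choice of kappa: floor((k + 3) / 2)."""
    return (k + 3) // 2


def min_qualifying_partitions(k: int, kappa: int, k_prime: int = 1) -> int:
    """Assumed pigeonhole bound: partitions guaranteed to have <= k_prime errors."""
    return kappa - k // (k_prime + 1)


def worst_case_qualifying(k: int, kappa: int, k_prime: int = 1) -> int:
    """Brute force over every way to spread at most k errors across kappa
    partitions, returning the fewest partitions with <= k_prime errors."""
    fewest = kappa
    for errors in product(range(k + 1), repeat=kappa):
        if sum(errors) <= k:
            fewest = min(fewest, sum(e <= k_prime for e in errors))
    return fewest


for k in range(7):
    kappa = num_partitions(k)
    m = min_qualifying_partitions(k, kappa)
    # The closed form agrees with exhaustive enumeration for small k.
    assert m == worst_case_qualifying(k, kappa)
    print(f"k={k} kappa={kappa} m={m}")
```

For k >= 2 the loop prints m = 1 for even k and m = 2 for odd k, matching the slide. With k' = 0 the same bound gives m = 1 for k = 0 and k = 1.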


Signature Generation

• 1-variants

Substituting each dimension with each domain value each time (plus itself)

v=[1, 2, 3] and Σ (domain) =[1, 2, 3]

1-val(v)=[1, 2, 3], [2, 2, 3], [3, 2, 3], [1, 1, 3], [1, 3, 3], [1, 2, 1], [1, 2, 2]

We index all 1-val(v) and when q comes in, we search q in the index

OR

• 1-deletion-variants

Substituting each dimension with '#' each time

v=[1, 2, 3]

1-del-val(v)=[#, 2, 3], [1, #, 3], [1, 2, #]

We index all 1-del-val(v) and when q comes in, we generate 1-del-val(q), and search all 1-del-val(q) in the index
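Both signature schemes can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the function names, the dictionary-of-sets index and the sample vectors are assumptions.

```python
from collections import defaultdict


def one_variants(v, domain):
    """v itself plus every vector that differs from v in exactly one dimension."""
    variants = [tuple(v)]
    for i, original in enumerate(v):
        for value in domain:
            if value != original:
                variants.append(tuple(v[:i]) + (value,) + tuple(v[i + 1:]))
    return variants


def one_deletion_variants(v):
    """Replace each dimension in turn with the placeholder '#'."""
    return [tuple(v[:i]) + ("#",) + tuple(v[i + 1:]) for i in range(len(v))]


def build_deletion_index(vectors):
    """Map every 1-deletion variant to the ids of the vectors that produce it."""
    index = defaultdict(set)
    for vid, v in enumerate(vectors):
        for signature in one_deletion_variants(v):
            index[signature].add(vid)
    return index


def search_within_one(index, q):
    """Ids of indexed vectors sharing a 1-deletion variant with q, i.e. hd <= 1."""
    hits = set()
    for signature in one_deletion_variants(q):
        hits |= index.get(signature, set())
    return hits


print(one_variants([1, 2, 3], [1, 2, 3]))
print(one_deletion_variants([1, 2, 3]))

vectors = [[1, 2, 3], [1, 3, 3], [3, 3, 1]]  # hypothetical data
index = build_deletion_index(vectors)
print(search_within_one(index, [1, 1, 3]))
```

The trade-off the two schemes expose: 1-variants make the index larger (one entry per domain value per dimension) but need a single lookup per query, while 1-deletion-variants keep the index at one entry per dimension and move the variant generation to query time.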


Enhanced Filter (Even)

Based on the formula before: when k (k>=1) is even, m = 1

Example

q: 1 2 3 | 4 5 6

v: 1 2 1 | 4 2 3

If k = 2, based on the formula before, m=1 and hd(vpart, qpart)=1 in the first partition, so this v becomes a false positive

However, we find that if k (k>=1) is even, v is qualified in two situations:
1) m=1, where hd(vpart, qpart)=0
2) m=2, where hd(vpart, qpart)<=1

Using the enhanced filter, neither situation applies, so v is filtered


Enhanced Filter (Odd)

Based on the formula before: when k (k>=1) is odd, m = 2

Example

q: 1 2 | 3 4 | 5 6

v: 1 1 | 1 4 | 2 3

If k = 3, based on the formula before, m=2 and hd(vpart, qpart)=1 in the first two partitions, so this v becomes a false positive

However, we find that if k (k>=1) is odd, v is qualified in two situations:
1) m=2, where hd(vpart, qpart)<=1 and at least one of them = 0
2) m=3, where hd(vpart, qpart)<=1

Using the enhanced filter, neither situation applies, so v is filtered
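The even and odd rules can be compared against the basic filter on the two slide examples. This is a sketch under stated assumptions: partitions are equal-width and contiguous, κ = ⌊(k+3)/2⌋, and the function names are hypothetical.

```python
def partition_distances(q, v, kappa):
    """Hamming distance inside each of kappa equal-width, contiguous partitions.
    Assumes len(q) is divisible by kappa."""
    width = len(q) // kappa
    return [
        sum(a != b for a, b in zip(q[i:i + width], v[i:i + width]))
        for i in range(0, len(q), width)
    ]


def basic_filter(dists, k):
    """Keep v if at least m partitions have distance <= 1 (m=1 even, m=2 odd)."""
    m = 1 if k % 2 == 0 else 2
    return sum(d <= 1 for d in dists) >= m


def enhanced_filter(dists, k):
    """Keep v only if one of the two situations on the slides applies."""
    exact = sum(d == 0 for d in dists)  # partitions that match exactly
    near = sum(d <= 1 for d in dists)   # partitions with at most one error
    if k % 2 == 0:
        return exact >= 1 or near >= 2
    return (near >= 2 and exact >= 1) or near >= 3


q = [1, 2, 3, 4, 5, 6]

even_dists = partition_distances(q, [1, 2, 1, 4, 2, 3], kappa=2)  # k = 2
print(even_dists, basic_filter(even_dists, 2), enhanced_filter(even_dists, 2))

odd_dists = partition_distances(q, [1, 1, 1, 4, 2, 3], kappa=3)  # k = 3
print(odd_dists, basic_filter(odd_dists, 3), enhanced_filter(odd_dists, 3))
```

In both cases the basic filter lets v through while the enhanced filter rejects it, which is the behaviour the slides describe.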


Hierarchical Filtering and Verification

We can use binary operations to do a hierarchical filtering and verification

|Σ|=8, k=1

v=[5, 0, 3, 6]

q=[5, 2, 2, 5]

[Figure: the 1st, 2nd and 3rd significant bits of every dimension of v and q, laid out as bit planes. Each bit plane of v is XORed with the matching plane of q, and the resulting diffs are ORed together.]

diff (XOR): 0011, 0101, 0000

OR of the diffs: 0111, so hd(v,q)=3

So hd(v, q)>=2, filtered

Moreover, even if k=4, 4 comparisons to calculate hd(v,q)=3


Hierarchical Filtering and Verification

v=[5, 0, 3, 6]

q=[5, 2, 3, 5]

[Figure: bit planes of v and q, XORed plane by plane into diff and ORed into cumdiff, starting from cumdiff = 0000.]

Step   diff (XOR)   cumdiff (OR)   Number of 1s in cumdiff
1      0001         0001           1   (<=1, continue)
2      0101         0101           2   (>1, filtered)
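The plane-by-plane check can be sketched with integer bit operations. This is a minimal sketch, not the authors' code: the function names are hypothetical, planes are built on the fly rather than precomputed in the index, and the least significant plane is processed first, which is an assumption chosen to reproduce the diff order shown on the slide.

```python
def bit_plane(vec, bit):
    """Pack the given bit of every dimension into one integer,
    with the first dimension as the leftmost bit."""
    plane = 0
    for value in vec:
        plane = (plane << 1) | ((value >> bit) & 1)
    return plane


def hierarchical_verify(v, q, k, bits):
    """Return hd(v, q) if it is <= k, otherwise None.
    Stops as soon as the accumulated difference already exceeds k."""
    cumdiff = 0
    for bit in range(bits):
        diff = bit_plane(v, bit) ^ bit_plane(q, bit)  # dimensions differing in this plane
        cumdiff |= diff                                # dimensions differing in any plane so far
        if bin(cumdiff).count("1") > k:
            return None                                # filtered early
    return bin(cumdiff).count("1")


v = [5, 0, 3, 6]
print(hierarchical_verify(v, [5, 2, 3, 5], k=1, bits=3))  # stops after the second plane
print(hierarchical_verify(v, [5, 2, 2, 5], k=4, bits=3))  # runs all planes, returns the distance
```

A dimension differs as soon as any one of its bits differs, so the number of 1s in cumdiff never decreases and is a valid lower bound on hd(v, q) after every plane. That is what makes the early exit safe.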


Impact of Data Skewness

Given k=2, then m = 1 and k'=1

        Partition1    Partition2
Dim     1  2  3       4  5  6
q       1  1  1       1  0  0
v1      1  1  1       0  0  0
v2      0  0  0       2  0  0
v3      2  0  2       0  0  0
v4      3  0  0       0  0  0

All vectors are qualified

We propose to reset the order and partitionLength to improve performance

        Partition1    Partition2
Dim     1  2  5       4  3  6
q       1  1  0       1  1  0
v1      1  1  0       0  1  0
v2      0  0  0       2  0  0
v3      2  0  0       0  2  0
v4      3  0  0       0  0  0

Only v1 is qualified
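The effect of the reordering on the candidate set can be reproduced directly. This is a sketch under assumptions: the vectors are a reading of the slide's two tables (the v1 and v2 columns are reconstructed from flattened text), partitions are equal-width, and the function name is hypothetical.

```python
def qualifies(q, v, order, kappa=2, k_prime=1, m=1):
    """Basic partition filter applied after permuting dimensions by `order`
    (1-based dimension numbers, as on the slide)."""
    qp = [q[d - 1] for d in order]
    vp = [v[d - 1] for d in order]
    width = len(order) // kappa
    near = 0
    for start in range(0, len(order), width):
        dist = sum(a != b for a, b in zip(qp[start:start + width],
                                          vp[start:start + width]))
        if dist <= k_prime:
            near += 1
    return near >= m


q = [1, 1, 1, 1, 0, 0]
data = {
    "v1": [1, 1, 1, 0, 0, 0],
    "v2": [0, 0, 0, 2, 0, 0],
    "v3": [2, 0, 2, 0, 0, 0],
    "v4": [3, 0, 0, 0, 0, 0],
}

before = [name for name, v in data.items() if qualifies(q, v, [1, 2, 3, 4, 5, 6])]
after = [name for name, v in data.items() if qualifies(q, v, [1, 2, 5, 4, 3, 6])]
print(before)
print(after)
```

With the original order every vector survives because dimensions 5 and 6 are almost always 0, so the second partition matches nearly everything. Spreading those skewed dimensions across both partitions leaves only v1, the one vector that is actually within distance 2 of q.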


Greedy Dimension Rearrangement

              Partition1    Partition2
Dim           1  2  3       4  5  6
v1            1  1  1       0  0  0
v2            0  0  0       2  0  0
v3            2  0  2       0  0  0
v4            3  0  0       0  0  0
MaxFreq       1  3  3       3  4  4

MaxFreq is the max frequency of any value in each dimension

Our goal: Minimize the global MaxFreq

              Partition1    Partition2
Dim           5  1  2       6  3  4
v1            0  1  1       0  1  0
v2            0  0  0       0  0  2
v3            0  2  0       0  2  0
v4            0  3  0       0  0  0

MaxFreq for partition 4 41 211

Achieve the goal
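One way to see why the rearranged order helps is to measure how often the most common value occurs in each partition. This is a sketch under a loudly stated assumption: the partition-level MaxFreq is taken here to be the number of vectors sharing the most common projected value of a partition, which is an illustrative definition and not necessarily the one behind the figures on the slide. The vectors are a reading of the slide's table and the function name is hypothetical.

```python
from collections import Counter


def partition_max_freq(vectors, order, kappa=2):
    """For each partition, the largest number of vectors that share the same
    projected value (assumed definition of partition-level MaxFreq)."""
    width = len(order) // kappa
    result = []
    for start in range(0, len(order), width):
        dims = order[start:start + width]  # 1-based dimension numbers
        counts = Counter(tuple(v[d - 1] for d in dims) for v in vectors)
        result.append(max(counts.values()))
    return result


vectors = [
    [1, 1, 1, 0, 0, 0],  # v1
    [0, 0, 0, 2, 0, 0],  # v2
    [2, 0, 2, 0, 0, 0],  # v3
    [3, 0, 0, 0, 0, 0],  # v4
]

original = partition_max_freq(vectors, [1, 2, 3, 4, 5, 6])
rearranged = partition_max_freq(vectors, [5, 1, 2, 6, 3, 4])
print(original, max(original))
print(rearranged, max(rearranged))
```

Under this definition the original order has three vectors colliding in the second partition, while the rearranged order has no collisions in either partition, so fewer vectors share a signature with any given query.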


Conclusion

1. General Partition Scheme

2. 1-variants and 1-deletion-variants

3. Techniques that help boost the performance
– Enhanced Filtering
– Hierarchical Filtering and Verification
– Dimension Rearrangement


Experiment Settings

• Environment

– Intel Xeon X3330 2.664GHz CPU, 4GB RAM
– Debian 5.0.6
– AMD Opteron™ 8378 2.4GHz CPU, 96GB RAM (for PubChem)
– Ubuntu/Linaro 4.6.4-1ubuntu5
– All compiled with GCC 4.1.2 with -O3

• Dataset


Experiment Settings

• Terms

– EF, Enhanced Filtering
– HB, Hierarchical Binary Filter
– RD, Rearranging Dimensions

• Our algorithms

1. HSD, HSV, our proposed algorithms, the former using 1-deletion-variants as signatures and the latter using 1-variants as signatures

2. HSD-nEB, HSV-nEB, variations that remove EF and HB

3. HSD-nB, HSV-nB, variations that remove HB

4. HSD-nR, HSV-nR, variations that remove RD

• Baseline algorithm

1. Scancount (Li et al., ICDE08)

• State-of-the-art algorithms

1. Google (Manku et al., WWW07)
2. Hengine (Liu et al., ICDE11)


Query time

HSV has the best performance


Candidate Size

HSV has the smallest candidate size


Effect of EF and HB

EF and HB help improve the performance


Effect of RD

RD boosts the performance for PubChem data


Index Size

HSV and HSD have a larger index size


Thank you
