Probabilistic Hashing for Efficient Search and Learning - Stanford


Transcript of Probabilistic Hashing for Efficient Search and Learning - Stanford

Page 1

Probabilistic Hashing for Efficient Search and Learning

Ping Li

Department of Statistical Science

Faculty of Computing and Information Science

Cornell University

Ithaca, NY 14853

July 12, 2012

Page 2

BigData Everywhere

Conceptually, consider a dataset as a matrix of size n × D.

In modern applications, # examples n = 10^6 is common and n = 10^9 is not rare, for example, images, documents, spams, search click-through data.

High-dimensional (image, text, biological) data are common: D = 10^6 (million), D = 10^9 (billion), D = 10^12 (trillion), D = 2^64, or even higher. In a sense, D can be arbitrarily high by considering pairwise, 3-way, or higher interactions.

Page 3

Examples of BigData Challenges: Linear Learning

Binary classification: Dataset {(xi, yi)}_{i=1}^n, xi ∈ R^D, yi ∈ {−1, 1}.

One can fit an L2-regularized linear SVM:

min_w (1/2) w^T w + C ∑_{i=1}^n max(1 − yi w^T xi, 0),

or the L2-regularized logistic regression:

min_w (1/2) w^T w + C ∑_{i=1}^n log(1 + exp(−yi w^T xi)),

where C > 0 is the penalty (regularization) parameter.

The weight vector w has length D, the same as the data dimensionality.

Page 4

Challenges of Learning with Massive High-dimensional Data

• The data often cannot fit in memory (even when the data are sparse).

• Data loading takes too long. For example, online algorithms require less memory but often need to iterate over the data for several (or many) passes to achieve sufficient accuracy. Data loading in general dominates the cost.

• Training can be very expensive, even for simple linear models such as logistic regression and linear SVM.

• Testing may be too slow to meet the demand; this is especially crucial for applications in search or interactive data visual analytics.

• The model itself can be too large to store; for example, we cannot really store a vector of weights for logistic regression on data of 2^64 dimensions.

Page 5

A Popular Solution Based on Normal Random Projections

Random Projections: Replace the original data matrix A by B = A × R.

R ∈ R^{D×k}: a random matrix, with i.i.d. entries sampled from N(0, 1).
B ∈ R^{n×k}: projected matrix, also random.

B approximately preserves the Euclidean distances and inner products between any two rows of A. In particular, E(BB^T) = AA^T.

Therefore, we can simply feed B into (e.g.,) SVM or logistic regression solvers.
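As an illustration (ours, not from the slides), a minimal numpy sketch of this recipe; scaling R by 1/√k makes E(BB^T) = AA^T hold exactly:

import numpy as np

# Minimal sketch of normal random projections (assumes only numpy).
rng = np.random.default_rng(0)
n, D, k = 100, 10_000, 256

A = rng.standard_normal((n, D))               # original data, n x D
R = rng.standard_normal((D, k)) / np.sqrt(k)  # i.i.d. N(0,1) entries, scaled
B = A @ R                                     # projected data, n x k

# Inner products between rows are approximately preserved:
print(A[0] @ A[1], B[0] @ B[1])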

Page 6

Experiments on Classification, Clustering, and Regression

Dataset        # Examples (n)   # Dims (D)    Avg # Nonzeros   Train vs. Test
Webspam        350,000          16,609,143    3,728            4/5 vs. 1/5
MNIST-PW       70,000           193,816       12,346           6/7 vs. 1/7
PEMS (UCI)     440              138,672       138,672          1/2 vs. 1/2
CT (UCI)       53,500           384           384              4/5 vs. 1/5

MNIST-PW = original MNIST + all pairwise features.

CT = regression task (reducing the number of examples instead of dimensions).

———————

Instead of using dense (normal) projections, we will experiment with very sparse random projections.

Page 7

Very Sparse Random Projections

The projection matrix: R = {rij} ∈ R^{D×k}. Instead of sampling from normals, we sample from a sparse distribution parameterized by s ≥ 1:

rij = −1 with prob. 1/(2s),  0 with prob. 1 − 1/s,  +1 with prob. 1/(2s)

If s = 100, then on average, 99% of the entries are zero.

If s = 10000, then on average, 99.99% of the entries are zero.

Usually, s = √D or s = k is a good choice.

——————-

Ref: Li, Hastie, Church, Very Sparse Random Projections, KDD’06.
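As a sketch (ours; the function name is hypothetical), sampling such a matrix is one thresholding pass over uniform random numbers:

import numpy as np

# Sketch: sample a very sparse projection matrix with entries in {-1, 0, +1},
# where P(+1) = P(-1) = 1/(2s) and P(0) = 1 - 1/s.
def very_sparse_projection(D, k, s, rng):
    u = rng.random((D, k))
    R = np.zeros((D, k))
    R[u < 1 / (2 * s)] = -1.0
    R[u > 1 - 1 / (2 * s)] = 1.0
    return R

rng = np.random.default_rng(0)
R = very_sparse_projection(D=100_000, k=64, s=316, rng=rng)  # s ~ sqrt(D)
print((R == 0).mean())  # about 1 - 1/s of the entries are zero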

Page 8

Linear SVM Test Accuracies on Webspam

Red dashed curves: results based on the original data

[Figure: Webspam, linear SVM. Classification accuracy (%) vs. C, two panels (s = 1 and s = 1000), curves for k = 32, 64, 128, 256, 512, 1024, 4096.]

Observations:

• We need a large number of projections (e.g., k ≥ 4096) for high accuracy.

• The sparsity parameter s matters little unless k is small.

Page 9

Linear SVM Test Accuracies on Webspam

Red dashed curves: results based on the original data

[Figure: Webspam, linear SVM. Classification accuracy (%) vs. C, panels for s = 1 and s = 10000, curves for k = 32 to 4096.]

As long as k is large (necessary for high accuracy), the projection matrix can be extremely sparse, even with s = 10000.

Page 10

Linear SVM Test Accuracies on MNIST-PW

Red dashed curves: results based on the original data

[Figure: MNIST-PW, linear SVM. Classification accuracy (%) vs. C, panels for s = 1 and s = 10000, curves for k = 32 to 4096.]

As long as k is large (necessary for high accuracy), the projection matrix can be extremely sparse. For this data, we probably need k > 10^4 for high accuracy.

Page 11

Linear SVM Test Accuracies

[Figure: linear SVM classification accuracy (%) vs. C on Webspam and MNIST-PW, one panel per s ∈ {1, 100, 1000, 10000}, curves for k = 32 to 4096.]

Page 12

Logistic Regression Test Accuracies

[Figure: logistic regression classification accuracy (%) vs. C on Webspam and MNIST-PW, one panel per s ∈ {1, 100, 1000, 10000}, curves for k = 32 to 4096.]

Again, as long as k is large (necessary for high accuracy), the projection matrix can be extremely sparse (i.e., very large s).

Page 13

SVM and Logistic Regression Test Accuracies on PEMS

[Figure: PEMS, SVM and logistic regression classification accuracy (%) vs. C, panels for s = 1 and s = 10000, curves for k = 32 to 4096.]

Again, as long as k is large, the projection matrix can be extremely sparse.

Page 14

K-Means Clustering Accuracies on MNIST-PW

Random projections were applied before the data were fed to k-means clustering. As the classification labels are known, we simply use the classification accuracy to assess the clustering quality.

[Figure: MNIST-PW, k-means clustering accuracy (%) vs. k, curves for s = 1, 10, 100 (overlapping), s = 1000, and s = 10000.]

Again, s does not really seem to matter much, as long as k is not small.

Page 15

Ridge Regression on CT

Random projections were applied to reduce the number of examples before ridge regression. The task becomes estimating the original regression coefficients.

[Figure: CT, regression error vs. k, train and test panels, one curve per s.]

Different curves are for different s values. Again, the sparsity parameter s does not really seem to matter much, as long as k is not small.

Page 16

Advantages of Very Sparse Random Projections (Large s)

Consequences of using a very sparse projection matrix R:

• Matrix multiplication A × R becomes much faster.

• Much easier to store R if necessary.

• Much easier to generate (and re-generate) R on the fly.

• Sparse original data =⇒ sparse projected data. The average number of 0's in a projected data vector is

k × (1 − 1/s)^f,

where f is the number of nonzero elements in the original data vector.

——————-

Ref: Li, Hastie, Church, Very Sparse Random Projections, KDD’06.

Page 17

Disadvantages of Random Projections (and Variants)

Inaccurate, especially on binary data.

Page 18

Variance Analysis for Inner Product Estimates

A × R = B

First two rows in A: u1, u2 ∈ R^D (D is very large):

u1 = (u1,1, u1,2, ..., u1,i, ..., u1,D),  u2 = (u2,1, u2,2, ..., u2,i, ..., u2,D)

First two rows in B: v1, v2 ∈ R^k (k is small):

v1 = (v1,1, v1,2, ..., v1,j, ..., v1,k),  v2 = (v2,1, v2,2, ..., v2,j, ..., v2,k)

Page 19

â = (1/k) ∑_{j=1}^k v1,j v2,j  (which is also an inner product)

E(â) = a

Var(â) = (1/k)(m1 m2 + a²),  where  m1 = ∑_{i=1}^D |u1,i|²,  m2 = ∑_{i=1}^D |u2,i|²

———

Random projections may not be good for inner products because the variance is dominated by the marginal l2 norms m1 m2, especially when a ≈ 0.

For real-world datasets, most pairs are often close to orthogonal (a ≈ 0).

——————-

Ref: Li, Hastie, and Church, Very Sparse Random Projections, KDD’06.

Ref: Li, Shrivastava, Moore, Konig, Hashing Algorithms for Large-Scale Learning, NIPS’11
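A minimal simulation (ours, not from the talk) confirms the variance formula above:

import numpy as np

# Sketch: empirically check Var(a_hat) = (m1*m2 + a^2)/k for random projections.
rng = np.random.default_rng(1)
D, k, trials = 1000, 64, 1000
u1, u2 = rng.standard_normal(D), rng.standard_normal(D)
a, m1, m2 = u1 @ u2, u1 @ u1, u2 @ u2

est = np.empty(trials)
for t in range(trials):
    Rm = rng.standard_normal((D, k))
    est[t] = (u1 @ Rm) @ (u2 @ Rm) / k   # a_hat = (1/k) sum_j v1,j * v2,j

print(est.var(), (m1 * m2 + a**2) / k)   # empirical vs. theoretical variance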

Page 20

b-Bit Minwise Hashing

• Simple algorithm designed specifically for massive binary data.

• Much more accurate than random projections for estimating inner products.

• Much smaller space requirement than the original minwise hashing algorithm.

• Capable of estimating 3-way similarity, while random projections cannot.

• Useful for large-scale linear learning (and kernel learning of course).

• Useful for sub-linear time near neighbor search.

Page 21

Massive, High-dimensional, Sparse, and Binary Data

Binary sparse data are very common in the real world:

• For many applications (such as text), binary sparse data are very natural.

• Many datasets can be quantized/thresholded to be binary without hurting the prediction accuracy.

• In some cases, even when the "original" data are not too sparse, they often become sparse when considering pairwise and higher-order interactions.

Page 22

An Example of Binary (0/1) Sparse Massive Data

A set S ⊆ Ω = {0, 1, ..., D − 1} can be viewed as a 0/1 vector in D dimensions.

Shingling: Each document (Web page) can be viewed as a set of w-shingles. For example, after parsing, a sentence "today is a nice day" becomes

• w = 1: {"today", "is", "a", "nice", "day"}

• w = 2: {"today is", "is a", "a nice", "nice day"}

• w = 3: {"today is a", "is a nice", "a nice day"}

Previous studies used w ≥ 5, as the single-word (unigram) model is not sufficient.

Shingling generates extremely high-dimensional vectors, e.g., D = (10^5)^w. (10^5)^5 = 10^25 ≈ 2^83, although in current practice, it seems D = 2^64 suffices.

Page 23

Notation

A binary (0/1) vector can be equivalently viewed as a set (the locations of its nonzeros).

Consider two sets S1, S2 ⊆ Ω = {0, 1, 2, ..., D − 1} (e.g., D = 2^64), with

f1 = |S1|, f2 = |S2|, a = |S1 ∩ S2|.

The resemblance R is a popular measure of set similarity:

R = |S1 ∩ S2| / |S1 ∪ S2| = a / (f1 + f2 − a)

Page 24

Minwise Hashing: The Standard Algorithm in the Context of Search

Suppose a random permutation π is performed on Ω, i.e.,

π : Ω −→ Ω, where Ω = {0, 1, ..., D − 1}.

An elementary probability argument shows that

Pr(min(π(S1)) = min(π(S2))) = |S1 ∩ S2| / |S1 ∪ S2| = R.

Page 25

An Example

D = 5. S1 = {0, 3, 4}, S2 = {1, 2, 3}, R = |S1 ∩ S2| / |S1 ∪ S2| = 1/5.

One realization of the permutation π can be:

0 =⇒ 3
1 =⇒ 2
2 =⇒ 0
3 =⇒ 4
4 =⇒ 1

π(S1) = {3, 4, 1} = {1, 3, 4}, π(S2) = {2, 0, 4} = {0, 2, 4}

In this example, min(π(S1)) ≠ min(π(S2)).

Page 26

Minwise Hashing Estimator

After k permutations, π1, π2, ..., πk, one can estimate R without bias:

R̂M = (1/k) ∑_{j=1}^k 1{min(πj(S1)) = min(πj(S2))},

Var(R̂M) = (1/k) R(1 − R).

—————————-

We recently realized that this estimator can be written as an inner product in 2^64 × k dimensions (assuming D = 2^64). This means one can potentially use it for linear learning, although the dimensionality would be excessively high.

Ref: Li, Shrivastava, Moore, Konig, Hashing Algorithms for Large-Scale Learning, NIPS’11
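As an illustration (ours, not from the slides), a minimal numpy simulation of this estimator on the D = 5 example from the previous slide, where R = 1/5:

import numpy as np

# Sketch: minwise hashing with k random permutations; the match fraction is an
# unbiased estimate of the resemblance R, with variance R(1-R)/k.
def minhash_estimate(S1, S2, D, k, rng):
    s1, s2 = np.array(sorted(S1)), np.array(sorted(S2))
    matches = 0
    for _ in range(k):
        perm = rng.permutation(D)                    # one permutation of Omega
        matches += perm[s1].min() == perm[s2].min()  # collision of min-hashes
    return matches / k

rng = np.random.default_rng(0)
print(minhash_estimate({0, 3, 4}, {1, 2, 3}, D=5, k=20000, rng=rng))  # ~0.2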

Page 27

b-Bit Minwise Hashing: the Intuition

Basic idea: Only store the lowest b bits of each hashed value, for small b.

Intuition:

• When two sets are identical, their lowest b bits of the hashed values are of course also equal. b = 1 only stores whether a number is even or odd.

• When two sets are similar, their lowest b bits of the hashed values "should be" also similar (True?? Need a proof).

• Therefore, hopefully we do not need many bits to obtain useful information, especially since real applications care about pairs with reasonably high resemblance values (e.g., 0.5).

Page 28

The Basic Collision Probability Result

Consider two sets S1, S2 ⊆ Ω = {0, 1, 2, ..., D − 1}, with f1 = |S1|, f2 = |S2|, a = |S1 ∩ S2|.

Define the minimum values under π : Ω → Ω to be z1 and z2:

z1 = min(π(S1)),  z2 = min(π(S2)),

and their b-bit versions:

z1^(b) = the lowest b bits of z1,  z2^(b) = the lowest b bits of z2.

Example: if z1 = 7 (= 111 in binary), then z1^(1) = 1, z1^(2) = 3.

Page 29

Collision probability: Assume D is large (which is virtually always true). Then

Pb = Pr(z1^(b) = z2^(b)) = C1,b + (1 − C2,b) R

——————–

Recall (assuming infinite precision, or as many digits as needed), we have

Pr(z1 = z2) = R

Page 30

Collision probability: Assume D is large (which is virtually always true). Then

Pb = Pr(z1^(b) = z2^(b)) = C1,b + (1 − C2,b) R,

where r1 = f1/D, r2 = f2/D, f1 = |S1|, f2 = |S2|,

C1,b = A1,b r2/(r1 + r2) + A2,b r1/(r1 + r2),
C2,b = A1,b r1/(r1 + r2) + A2,b r2/(r1 + r2),

A1,b = r1 [1 − r1]^(2^b − 1) / (1 − [1 − r1]^(2^b)),
A2,b = r2 [1 − r2]^(2^b − 1) / (1 − [1 − r2]^(2^b)).

——————–

Ref: Li and Konig, b-Bit Minwise Hashing, WWW’10.
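In code, the closed form is a direct transcription (a sketch, ours; the function name is hypothetical):

# Sketch of the closed-form collision probability P_b = C1,b + (1 - C2,b) R,
# with r1 = f1/D and r2 = f2/D as defined on the slide.
def collision_probability(R, r1, r2, b):
    A1 = r1 * (1 - r1) ** (2**b - 1) / (1 - (1 - r1) ** (2**b))
    A2 = r2 * (1 - r2) ** (2**b - 1) / (1 - (1 - r2) ** (2**b))
    C1 = (A1 * r2 + A2 * r1) / (r1 + r2)
    C2 = (A1 * r1 + A2 * r2) / (r1 + r2)
    return C1 + (1 - C2) * R

print(collision_probability(R=0.5, r1=0.1, r2=0.1, b=1))  # ~0.737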

Page 31

The closed-form collision probability is remarkably accurate even for small D. The absolute errors (approximate − exact) are very small even for D = 20.

[Figure: approximation error vs. a/f2 for b = 1, at (D = 20, f1 = 4), (D = 20, f1 = 10), (D = 500, f1 = 50), and (D = 500, f1 = 250); errors stay below 0.01 for D = 20 and below 4 × 10^−4 for D = 500.]

—————–

Ref: Li and Konig, Theory and Applications of b-Bit Minwise Hashing, CACM Research Highlights, 2011.

Page 32

An Unbiased Estimator

An unbiased estimator R̂b for R:

R̂b = (P̂b − C1,b) / (1 − C2,b),

P̂b = (1/k) ∑_{j=1}^k 1{z1,πj^(b) = z2,πj^(b)},

Var(R̂b) = Var(P̂b) / [1 − C2,b]² = (1/k) Pb(1 − Pb) / [1 − C2,b]²

Page 33

The Variance-Space Trade-off

Smaller b =⇒ smaller space for each sample. However, smaller b =⇒ larger variance.

B(b; R, r1, r2) characterizes the variance-space trade-off:

B(b; R, r1, r2) = b × Var(R̂b)

Lower B(b) is better.

Page 34

The ratio

B(b1; R, r1, r2) / B(b2; R, r1, r2)

measures the improvement of using b = b2 (e.g., b2 = 1) over using b = b1 (e.g., b1 = 64).

B(64)/B(1) = 20 means that, to achieve the same accuracy (variance), using b = 64 bits per hashed value will require 20 times more space (in bits) than using b = 1.

Page 35

B(64)/B(b): relative improvement of using b = 1, 2, 3, 4 bits, compared to 64 bits.

[Figure: improvement B(64)/B(b) vs. resemblance R for b = 1, 2, 3, 4, at r1 = r2 = 10^−10, 0.1, 0.5, and 0.9; the improvement reaches roughly 35-fold for sparse data and 60-fold for dense data.]

Page 36

Relative Improvements in the Least Favorable Situation

If r1, r2 → 0 (the least favorable situation; there is a proof), then

B(64)/B(1) = 64 R / (R + 1).

If R = 0.5, then the improvement will be 64/3 ≈ 21.3-fold.

Page 37

Experiments: Duplicate Detection on Microsoft News Articles

The dataset was crawled as part of the BLEWS project at Microsoft. We computed pairwise resemblances for all documents and retrieved document pairs with resemblance R larger than a threshold R0.

[Figure: precision vs. sample size k for thresholds R0 = 0.3 and R0 = 0.4, curves for b = 1, 2, 4 and M (the original minwise hashing).]

Page 38

[Figure: precision vs. sample size k, continued, for R0 = 0.5, 0.6, 0.7, and 0.8.]

Page 39

1/2 Bit Suffices for High Similarity

Practical applications are sometimes interested in pairs of very high similarities. For such data, one can XOR two bits from two permutations into just one bit. The new estimator is denoted by R̂1/2. Compared to the 1-bit estimator R̂1:

• A good idea for highly similar data:

lim_{R→1} Var(R̂1) / Var(R̂1/2) = 2.

• Not such a good idea when the data are not very similar:

Var(R̂1) < Var(R̂1/2) if R < 0.5774, assuming sparse data.

—————

Ref: Li and Konig, b-Bit Minwise Hashing, WWW’10.
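A sketch (ours; the array names are hypothetical) of how the XORed bits would be formed and compared:

import numpy as np

# Sketch of the "1/2 bit" idea: XOR the lowest bits of min-hashes from two
# permutations into one stored bit, halving space (k must be even).
def half_bit_collision_fraction(mins1, mins2):
    x1 = (mins1[0::2] & 1) ^ (mins1[1::2] & 1)  # one XORed bit per pair
    x2 = (mins2[0::2] & 1) ^ (mins2[1::2] & 1)
    return (x1 == x2).mean()                     # collision fraction on k/2 bits

mins1 = np.array([7, 4, 9, 2])                   # toy min-hash values (k = 4)
mins2 = np.array([7, 5, 9, 2])
print(half_bit_collision_fraction(mins1, mins2))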

Page 40

b-Bit Minwise Hashing for Estimating 3-Way Similarities

Notation for 3-way set intersections:

[Venn diagram: three sets with f_i = |S_i|, pairwise intersections a_ij, and triple intersection a; normalized quantities r_i = f_i/D, s_ij = a_ij/D, s = a/D.]

3-way collision probability: Assume D is large. Then

Pr(lowest b bits of the 3 hashed values are equal) = Z/u + R123 = (Z + s)/u,

where u = r1 + r2 + r3 − s12 − s13 − s23 + s (so that R123 = s/u), and ...

Page 41

Z = (s12 − s) A3,b + s12 G12,b (r3 − s13 − s23 + s)/(r1 + r2 − s12)
+ (s13 − s) A2,b + s13 G13,b (r2 − s12 − s23 + s)/(r1 + r3 − s13)
+ (s23 − s) A1,b + s23 G23,b (r1 − s12 − s13 + s)/(r2 + r3 − s23)
+ [(r2 − s23) A3,b + (r3 − s23) A2,b] G23,b (r1 − s12 − s13 + s)/(r2 + r3 − s23)
+ [(r1 − s13) A3,b + (r3 − s13) A1,b] G13,b (r2 − s12 − s23 + s)/(r1 + r3 − s13)
+ [(r1 − s12) A2,b + (r2 − s12) A1,b] G12,b (r3 − s13 − s23 + s)/(r1 + r2 − s12),

Aj,b = rj (1 − rj)^(2^b − 1) / [1 − (1 − rj)^(2^b)],

Gij,b = (ri + rj − sij)(1 − ri − rj + sij)^(2^b − 1) / [1 − (1 − ri − rj + sij)^(2^b)],  i, j ∈ {1, 2, 3}, i ≠ j.

—————

Ref: Li, Konig, Gui, b-Bit Minwise Hashing for Estimating 3-Way Similarities, NIPS'10.

Page 42

Useful messages:

1. Substantial improvement over using 64 bits, just like in the 2-way case.

2. Must use b ≥ 2 bits for estimating 3-way similarities.

Page 43

b-Bit Minwise Hashing for Large-Scale Linear Learning

Linear learning algorithms require the estimators to be inner products.

If we look carefully, the estimator of minwise hashing is indeed an inner product of two (extremely sparse) vectors in D × k dimensions (infeasible when D = 2^64):

R̂M = (1/k) ∑_{j=1}^k 1{min(πj(S1)) = min(πj(S2))}

because, with z1 = min(π(S1)), z2 = min(π(S2)) ∈ Ω = {0, 1, ..., D − 1},

1{z1 = z2} = ∑_{i=0}^{D−1} 1{z1 = i} × 1{z2 = i}

———–

Ref: Li, Shrivastava, Moore, Konig, Hashing Algorithms for Large-Scale Learning, NIPS’11.

Page 44

Consider D = 5. We can expand numbers into vectors of length 5:

0 =⇒ [0, 0, 0, 0, 1], 1 =⇒ [0, 0, 0, 1, 0], 2 =⇒ [0, 0, 1, 0, 0],
3 =⇒ [0, 1, 0, 0, 0], 4 =⇒ [1, 0, 0, 0, 0].

———————

If z1 = 2, z2 = 3, then 0 = 1{z1 = z2} = inner product between [0, 0, 1, 0, 0] and [0, 1, 0, 0, 0].

If z1 = 2, z2 = 2, then 1 = 1{z1 = z2} = inner product between [0, 0, 1, 0, 0] and [0, 0, 1, 0, 0].

Page 45

Linear Learning Algorithms

Linear algorithms such as linear SVM and logistic regression have become very powerful and extremely popular. Representative software packages include SVMperf, Pegasos, Bottou's SGD SVM, and LIBLINEAR.

Given a dataset {(xi, yi)}_{i=1}^n, xi ∈ R^D, yi ∈ {−1, 1}, the L2-regularized linear SVM solves the following optimization problem:

min_w (1/2) w^T w + C ∑_{i=1}^n max(1 − yi w^T xi, 0),

and the L2-regularized logistic regression solves a similar problem:

min_w (1/2) w^T w + C ∑_{i=1}^n log(1 + exp(−yi w^T xi)).

Here C > 0 is an important penalty (regularization) parameter.

Page 46

Integrating b-Bit Minwise Hashing for (Linear) Learning

Very simple:

1. We apply k independent random permutations to each (binary) feature vector xi and store the lowest b bits of each hashed value. This way, we obtain a new dataset which can be stored using merely n × b × k bits.

2. At run-time, we expand each new data point into a (2^b × k)-length vector, i.e., we concatenate the k vectors (each of length 2^b). The new feature vector has exactly k 1's.

Page 47

An example with k = 3 and b = 2. For set (vector) S1:

Original hashed values (k = 3):      12013  25964  20191
Original binary representations:     010111011101101  110010101101100  100111011011111
Lowest b = 2 binary digits:          01  00  11
Corresponding decimal values:        1  0  3
Expanded 2^b = 4 binary digits:      0010  0001  1000
New feature vector fed to a solver:  [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0]

Same procedure on sets S2, S3, ...
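As an illustration (ours, not from the slides), a minimal numpy sketch of steps 1-2, using the same one-hot convention as the example above (the rightmost slot of each block encodes value 0):

import numpy as np

# Sketch: k permutations, keep the lowest b bits of each min-hash, and expand
# each b-bit value into a one-hot block of length 2^b.
def bbit_feature_vector(S, perms, b):
    idx = np.array(sorted(S))
    k, width = len(perms), 2**b
    out = np.zeros(k * width)
    for j, perm in enumerate(perms):
        z = int(perm[idx].min()) & (width - 1)  # lowest b bits of the min-hash
        out[j * width + (width - 1 - z)] = 1.0  # one-hot within block j
    return out                                   # exactly k ones; feed to a solver

rng = np.random.default_rng(0)
D, k, b = 2**16, 3, 2
perms = [rng.permutation(D) for _ in range(k)]
print(bbit_feature_vector({0, 3, 4}, perms, b))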

Page 48

Datasets and Solver for Linear Learning

This talk presents the experiments on two text datasets.

Dataset           # Examples (n)  # Dims (D)      Avg # Nonzeros  Train / Test
Webspam (24 GB)   350,000         16,609,143      3,728           80% / 20%
Rcv1 (200 GB)     781,265         1,010,017,424   12,062          50% / 50%

To generate the Rcv1 dataset, we used the original features + all pairwise features + 1/30 of the 3-way features.

We chose LIBLINEAR as the basic solver for linear learning. Note that our method is purely statistical/probabilistic, independent of the underlying solver.

All experiments were conducted on workstations with Xeon(R) CPUs and 48GB RAM, under the Windows 7 system.

Page 49

Experimental Results on Linear SVM and Webspam

• We conducted extensive experiments for a wide range of regularization C values (from 10^−3 to 10^2), with fine spacings in [0.1, 10].

• We experimented with k = 30 to k = 500, and b = 1, 2, 4, 8, and 16.

Page 50

Testing Accuracy

[Figure: Webspam, SVM test accuracy (%) vs. C at k = 200, curves for b = 1, 2, 4, 6, 8, 10, 16.]

• Solid: b-bit hashing. Dashed (red): the original data.

• Using b ≥ 8 and k ≥ 200 achieves about the same test accuracies as using the original data.

• The results are averaged over 50 runs.

Page 51

[Figure: Webspam, SVM test accuracy (%) vs. C for k = 50, 100, 150, 200, 300, 500, curves for b = 1 to 16.]

Page 52

Stability of Testing Accuracy (Standard Deviation)

Our method produces very stable results, especially for b ≥ 4.

[Figure: Webspam, standard deviation of SVM test accuracy vs. C for k = 50 to 500, curves for b = 1 to 16.]

Page 53

Training Time

[Figure: Webspam, SVM training time (sec) vs. C for k = 50 to 500, curves for b = 1 to 16.]

• The timings do not include data loading time (which is small for b-bit hashing).

• The original training time is about 100 seconds.

• b-bit minwise hashing needs about 3 ∼ 7 seconds (3 seconds when b = 8).

Page 54

Testing Time

[Figure: Webspam, SVM testing time (sec) vs. C for k = 50 to 500.]

Page 55

Experimental Results on L2-Regularized Logistic Regression

Testing Accuracy

[Figure: Webspam, logistic regression test accuracy (%) vs. C for k = 50 to 500, curves for b = 1 to 16.]

Page 56

Training Time

[Figure: Webspam, logistic regression training time (sec) vs. C for k = 50 to 500, curves for b = 1 to 16.]

Page 57

Comparisons with Other Algorithms

We conducted extensive experiments with the VW algorithm (Weinberger et al., ICML'09; not the VW online learning platform), which has the same variance as random projections.

Consider two sets S1, S2, with f1 = |S1|, f2 = |S2|, a = |S1 ∩ S2|, and R = |S1 ∩ S2| / |S1 ∪ S2| = a / (f1 + f2 − a). Then, from their variances

Var(âVW) ≈ (1/k)(f1 f2 + a²),

Var(R̂MINWISE) = (1/k) R(1 − R),

we can immediately see the significant advantage of minwise hashing, especially when a ≈ 0 (which is common in practice).

Page 58

Comparing b-Bit Minwise Hashing with VW

8-bit minwise hashing (dashed, red) with k ≥ 200 (and C ≥ 1) achieves about the same test accuracy as VW with k = 10^4 ∼ 10^6.

[Figure: Webspam test accuracy (%) vs. k, VW vs. b = 8 hashing, for SVM and logistic regression, curves for C = 0.01 to 100.]

Page 59

8-bit hashing is substantially faster than VW (to achieve the same accuracy).

[Figure: Webspam training time (sec) vs. k, VW vs. b = 8 hashing, for SVM and logistic regression.]

Page 60

Experimental Results on Rcv1 (200GB)

Test accuracy using linear SVM (we cannot train on the original data)

[Figure: Rcv1, SVM test accuracy (%) vs. C for k = 50, 100, 150, 200, 300, 500, curves for b = 1, 2, 4, 8, 12, 16.]

Page 61

Training time using linear SVM

[Figure: Rcv1, SVM training time (sec) vs. C for k = 50 to 500, curves for b = 1 to 16.]

Page 62

Test accuracy using logistic regression

[Figure: Rcv1, logistic regression test accuracy (%) vs. C for k = 50 to 500, curves for b = 1 to 16.]

Page 63

Training time using logistic regression

[Figure: Rcv1, logistic regression training time (sec) vs. C for k = 50 to 500, curves for b = 1 to 16.]

Page 64

Comparisons with VW on Rcv1

Test accuracy using linear SVM

[Figure: Rcv1, SVM test accuracy (%) vs. k, VW vs. b-bit hashing for b = 1, 2, 4, 8, 12, 16, curves for C = 0.01 to 10.]

Page 65

Test accuracy using logistic regression

[Figure: Rcv1, logistic regression test accuracy (%) vs. k, VW vs. b-bit hashing for b = 1 to 16, curves for C = 0.01 to 10.]

Page 66

Training time

[Figure: Rcv1, training time (sec) vs. k, VW vs. 8-bit hashing, for SVM and logistic regression, curves for C = 0.01 to 10.]

Page 67

b-Bit Minwise Hashing for Efficient Near Neighbor Search

Near neighbor search is a much more frequent operation than training an SVM.

The bits from b-bit minwise hashing can be directly used to build hash tables to enable sub-linear time near neighbor search. This is an instance of the general family of locality-sensitive hashing (LSH).

Page 68

An Example of b-Bit Hashing for Near Neighbor Search

We use b = 2 bits and k = 2 permutations to build a hash table indexed from 0000 to 1111, i.e., the table size is 2^(2×2) = 16.

[Tables: two example hash tables; each 4-bit index maps to a bucket of data-point IDs, e.g., one table has buckets {8, 13, 251}, {5, 14, 19, 29}, (empty), {33, 174, 3153}, {7, 24, 156}, {61, 342}; the other has {8}, {17, 36, 129}, {2, 19, 83}, {7, 198}, {56, 989}, {9, 156, 879}, {4, 34, 52, 796}.]

Then, the data points are placed in the buckets according to their hashed values. We look for near neighbors in the bucket which matches the hash value of the query. Replicating the hash table (twice in this case) catches good near neighbors that a single table misses. The final retrieved data points (before re-ranking) are the union of the buckets.
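As an illustration (ours, not from the slides; the function names are hypothetical), a minimal sketch of building and querying such tables:

from collections import defaultdict

# Sketch: L hash tables keyed by the concatenation of k b-bit values
# (B = b*k bits per key); a query retrieves the union of matching buckets.
def build_tables(codes_per_table):
    # codes_per_table: L dicts, each mapping point id -> tuple of k b-bit values
    tables = []
    for codes in codes_per_table:
        table = defaultdict(list)
        for point_id, key in codes.items():
            table[key].append(point_id)
        tables.append(table)
    return tables

def query(tables, query_keys):
    hits = set()                              # union over the L tables
    for table, key in zip(tables, query_keys):
        hits.update(table.get(key, []))
    return hits                               # candidates, to be re-ranked

tables = build_tables([{8: (1, 0), 13: (1, 0), 5: (2, 3)},
                       {8: (0, 3), 13: (2, 1), 5: (0, 3)}])
print(query(tables, [(1, 0), (0, 3)]))        # {8, 13, 5}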

Page 69

Experiments on the Webspam Dataset

B = b × k: table size. L: number of tables.

Fractions of retrieved data points (before re-ranking):

[Figure: Webspam, fraction of data points evaluated vs. B (bits) for L = 25, 50, 100; curves for b = 1, 2, 4 (b-bit) and SRP.]

SRP (sign random projections) and b-bit hashing (with b = 1, 2) retrieved similar numbers of data points.

————————

Ref: Shrivastava and Li, Fast Near Neighbor Search in High-Dimensional Binary Data, ECML'12.

Page 70

Precision-Recall curves (the higher the better) for retrieving top-T data points.

[Figure: Webspam, precision-recall curves for top-T retrieval (T = 5, 10, 50) at B = 24 and B = 16 with L = 100; b-bit hashing (b = 1, 4) vs. SRP.]

Using L = 100 tables, b-bit hashing considerably outperformed SRP (sign random projections), for table sizes B = 24 and B = 16. Note that there are many LSH schemes, which we did not compare exhaustively.

Page 71

Simulating Permutations with Universal (2U and 4U) Hashing

[Figure: Webspam, SVM accuracy (%) vs. C for k = 10, 100, 200, 500 and b = 1, 2, 4, 8; curves for permutations, 2U, and 4U hashing largely overlap.]

Both 2U and 4U perform well, except when b = 1 or k is small.

———————–

Ref: Li, Shrivastava, Konig, b-Bit Minwise Hashing in Practice, arXiv:1205.2958
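As a sketch (ours; the function name is hypothetical) of what a 2U hash looks like as a stand-in for a full random permutation:

import random

# Sketch: a 2-universal (2U) hash h(x) = ((a*x + b) mod p) mod D,
# with a prime p > D (here the Mersenne prime 2^61 - 1) and random a, b.
def make_2u_hash(D, p=2**61 - 1, rng=random.Random(0)):
    a = rng.randrange(1, p)
    b = rng.randrange(0, p)
    return lambda x: ((a * x + b) % p) % D

h = make_2u_hash(D=2**16)
print(min(h(x) for x in {0, 3, 4}))  # the "min-hash" under this 2U hash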

Page 72

Speeding Up Preprocessing Using GPUs

CPU: Data loading and preprocessing (for k = 500 permutations) times, in seconds. Note that we measured the loading times of LIBLINEAR, which uses a plain-text data format; using binary data will speed up data loading.

Dataset    Loading      Permu        2U           4U (Mod)     4U (Bit)
Webspam    9.7 × 10^2   6.1 × 10^3   4.1 × 10^3   4.4 × 10^4   1.4 × 10^4
Rcv1       1.0 × 10^4   –            3.0 × 10^4   –            –

GPU: Data loading and preprocessing (for k = 500 permutations) times, in seconds.

Dataset    Loading      GPU 2U       GPU 4U (Mod)  GPU 4U (Bit)
Webspam    9.7 × 10^2   51           5.2 × 10^2    1.2 × 10^2
Rcv1       1.0 × 10^4   1.4 × 10^3   1.5 × 10^4    3.2 × 10^3

Note that the modular operations in 4U can be replaced by bit operations.

Page 73

Breakdown of the overhead: a batch-based GPU implementation.

[Figure: GPU profiling on Webspam and Rcv1, for 2U and 4U-Bit: time (sec) vs. batch size, broken down into GPU kernel, CPU→GPU, and GPU→CPU transfer.]

GPU kernel times dominate the cost. The cost is not sensitive to the batch size.

————————

Ref: Li, Shrivastava, Konig, b-Bit Minwise Hashing in Practice, arXiv:1205.2958

Ref: Li, Shrivastava, Konig, GPU-Based Minwise Hashing, WWW’12 Poster

Page 74

Online Learning (SGD) with Hashing

b-bit minwise hashing can also be highly beneficial for online learning, in that it significantly reduces the data loading time, which is usually the bottleneck for online algorithms, especially if many epochs are required.

[Figure: Webspam, SGD SVM accuracy (%) vs. λ for k = 50 and k = 200, curves for b = 1, 2, 4, 6.]

λ = 1/C, following the convention in the online learning literature.

Red dashed curve: results based on the original data.

————————

Ref: Li, Shrivastava, Konig, b-Bit Minwise Hashing in Practice, arXiv:1205.2958

Page 75

Conclusion

• BigData is everywhere. For many operations, including clustering, classification, near neighbor search, etc., exact answers are often not necessary.

• Very sparse random projections (KDD 2006) work as well as dense random projections. They do, however, require a large number of projections.

• Minwise hashing is a standard procedure in the context of search. b-bit minwise hashing is a substantial improvement, using only b bits per hashed value instead of 64 bits (e.g., a 20-fold reduction in space).

• b-bit minwise hashing provides a very simple strategy for efficient linear learning, by expanding the bits into vectors of length 2^b × k.

• We can directly build hash tables from b-bit minwise hashing for sub-linear time near neighbor search.

• It can be substantially more accurate than random projections (and variants).

Page 76

• Major drawbacks of b-bit minwise hashing:

– It was designed (mostly) for binary, static data.

– It needs a fairly high-dimensional representation (2^b × k) after expansion.

– Preprocessing is expensive, and consequently the testing phase is expensive for unprocessed data. Parallelization solutions based on (e.g.,) GPUs offer very good performance in terms of time, but they are not energy-efficient.