Probabilistic Hashing for Efficient Search and Learning - Stanford
Transcript of Probabilistic Hashing for Efficient Search and Learning - Stanford
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 1
Probabilistic Hashing for Efficient Search and Learning
Ping Li
Department of Statistical Science
Faculty of Computing and Information Science
Cornell University
Ithaca, NY 14853
July 12, 2012
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 2
BigData Everywhere
Conceptually, consider a dataset as a matrix of size n × D.
In modern applications, # examples n = 106 is common and n = 109 is not
rare, for example, images, documents, spams, search click-through data.
High-dimensional (image, text, biological) data are common: D = 106 (million),
D = 109 (billion), D = 1012 (trillion), D = 264 or even higher. In a sense, D
can be arbitrarily high by considering pairwise, 3-way or higher interactions.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 3
Examples of BigData Challenges: Linear Learning
Binary classification: Dataset (xi, yi)ni=1, xi ∈ R
D , yi ∈ −1, 1.
One can fit an L2-regularized linear SVM:
minw
1
2w
Tw + C
n∑
i=1
max
1 − yiwTxi, 0
,
or the L2-regularized logistic regression:
minw
1
2w
Tw + C
n∑
i=1
log(
1 + e−yiwTxi
)
,
where C > 0 is the penalty (regularization) parameter.
The weight vector w has length D, the same as the data dimensionality.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 4
Challenges of Learning with Massive High-dimensional Data
• The data often can not fit in memory (even when the data are sparse).
• Data loading takes too long. For example, online algorithms require less
memory but often need to iterate over the data for several (or many) passes to
achieve sufficient accuracies. Data loading in general dominates the cost.
• Training can be very expensive, even for simple linear models such as logistic
regression and linear SVM.
• Testing may be too slow to meet the demand, especially crucial for
applications in search or interactive data visual analytics.
• The model itself can be too large to store, for example, we can not really store
a vector of weights for logistic regression on data of 264 dimensions.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 5
A Popular Solution Based on Normal Random Projections
Random Projections : Replace original data matrix A by B = A × R
A R = B
R ∈ RD×k: a random matrix, with i.i.d. entries sampled from N(0, 1).
B ∈ Rn×k : projected matrix, also random.
B approximately preserves the Euclidean distance and inner products between
any two rows of A. In particular, E (BBT) = AA
T.
Therefore, we can simply feed B into (e.g.,) SVM or logistic regression solvers.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 6
Experiments on Classification, Clustering, and Regression
Dataset # Examples (n) # Dims (D) Avg # nonzeros Train v.s. Test
Webspam 350,000 16,609,143 3,728 4/5 v.s. 1/5
MNIST-PW 70,000 193,816 12,346 6/7 v.s. 1/7
PEMS (UCI) 440 138,672 138,672 1/2 v.s. 1/2
CT (UCI) 53,500 384 384 4/5 v.s. 1/5
MNIST-PW = Original MNIST + all pairwise features.
CT = Regression task (reducing number of examples instead of dimensions).
———————
Instead of using dense (normal) projections, we will experiment with
very sparse random projections.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 7
Very Sparse Random Projections
The projection matrix: R = rij ∈ RD×k. Instead of sampling from normals,
we sample from a sparse distribution parameterized by s ≥ 1:
rij =
−1 with prob. 12s
0 with prob. 1 − 1s
1 with prob. 12s
If s = 100, then on average, 99% of the entries are zero.
If s = 10000, then on average, 99.99% of the entries are zero.
Usually, s =√
D or s = k is a good choice.
——————-
Ref: Li, Hastie, Church, Very Sparse Random Projections, KDD’06.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 8
Linear SVM Test Accuracies on Webspam
Red dashed curves: results based on the original data
10−2
10−1
100
101
102
75
80
85
90
95
100
k = 32
k = 64
k = 128
k = 256k = 512k = 1024k = 4096
Webspam: SVM (s = 1)
C
Cla
ssifi
catio
n A
cc (
%)
10−2
10−1
100
101
102
75
80
85
90
95
100
k = 32
k = 64
k = 128
k = 256k = 512k = 1024k = 4096
Webspam: SVM (s = 1000)
CC
lass
ifica
tion
Acc
(%
)Observations:
• We need a large number of projections (e.g., k ≥ 4096) for high accuracy.
• The sparsity parameter s matters little unless k is small.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 9
Linear SVM Test Accuracies on Webspam
Red dashed curves: results based on the original data
10−2
10−1
100
101
102
75
80
85
90
95
100
k = 32
k = 64
k = 128
k = 256k = 512k = 1024k = 4096
Webspam: SVM (s = 1)
C
Cla
ssifi
catio
n A
cc (
%)
10−2
10−1
100
101
102
75
80
85
90
95
100
k = 32
k = 64
k = 128
k = 256k = 512k = 1024k = 4096
Webspam: SVM (s = 10000)
CC
lass
ifica
tion
Acc
(%
)
As long as k is large (necessary for high accuracy), the projection matrix can be
extremely sparse, even with s = 10000.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 10
Linear SVM Test Accuracies on MNIST-PW
Red dashed curves: results based on the original data
10−2
10−1
100
101
102
60
70
80
90
100
k = 32
k = 64
k = 128
k = 256k = 512k = 1024k = 4096
MNIST−PW: SVM (s = 1)
C
Cla
ssifi
catio
n A
cc (
%)
10−2
10−1
100
101
102
60
70
80
90
100
k = 32
k = 64
k = 128
k = 256k = 512k = 1024k = 4096
MNIST−PW: SVM (s = 10000)
CC
lass
ifica
tion
Acc
(%
)
As long as k is large (necessary for high accuracy), the projection matrix can be
extremely sparse. For this data, we probably need k > 104 for high accuracy.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 11
Linear SVM Test Accuracies
10−2
10−1
100
101
102
75
80
85
90
95
100
k = 32
k = 64
k = 128
k = 256k = 512k = 1024k = 4096
Webspam: SVM (s = 1)
C
Cla
ssifi
catio
n A
cc (
%)
10−2
10−1
100
101
102
75
80
85
90
95
100
k = 32
k = 64k = 128
k = 256k = 512k = 1024k = 4096
Webspam: SVM (s = 100)
C
Cla
ssifi
catio
n A
cc (
%)
10−2
10−1
100
101
102
75
80
85
90
95
100
k = 32
k = 64
k = 128
k = 256k = 512k = 1024k = 4096
Webspam: SVM (s = 1000)
C
Cla
ssifi
catio
n A
cc (
%)
10−2
10−1
100
101
102
75
80
85
90
95
100
k = 32
k = 64
k = 128
k = 256k = 512k = 1024k = 4096
Webspam: SVM (s = 10000)
C
Cla
ssifi
catio
n A
cc (
%)
10−2
10−1
100
101
102
60
70
80
90
100
k = 32
k = 64
k = 128
k = 256k = 512k = 1024k = 4096
MNIST−PW: SVM (s = 1)
C
Cla
ssifi
catio
n A
cc (
%)
10−2
10−1
100
101
102
60
70
80
90
100
k = 32
k = 64
k = 128
k = 256k = 512k = 1024k = 4096
MNIST−PW: SVM (s = 100)
C
Cla
ssifi
catio
n A
cc (
%)
10−2
10−1
100
101
102
60
70
80
90
100
k = 32
k = 64
k = 128
k = 256k = 512k = 1024k = 4096
MNIST−PW: SVM (s = 1000)
C
Cla
ssifi
catio
n A
cc (
%)
10−2
10−1
100
101
102
60
70
80
90
100
k = 32
k = 64
k = 128
k = 256k = 512k = 1024k = 4096
MNIST−PW: SVM (s = 10000)
C
Cla
ssifi
catio
n A
cc (
%)
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 12
Logistic Regression Test Accuracies
10−2
10−1
100
101
102
75
80
85
90
95
100
k = 32
k = 64
k = 128
k = 256k = 512k = 1024k = 4096
Webspam: Logit (s = 1)
C
Cla
ssifi
catio
n A
cc (
%)
10−2
10−1
100
101
102
75
80
85
90
95
100
k = 32
k = 64
k = 128
k = 256k = 512k = 1024k = 4096
Webspam: Logit (s = 100)
C
Cla
ssifi
catio
n A
cc (
%)
10−2
10−1
100
101
102
75
80
85
90
95
100
k = 32
k = 64
k = 128
k = 256k = 512k = 1024k = 4096
Webspam: Logit (s = 1000)
C
Cla
ssifi
catio
n A
cc (
%)
10−2
10−1
100
101
102
75
80
85
90
95
100
k = 32
k = 64
k = 128
k = 256
k = 512k = 1024k = 4096
Webspam: Logit (s = 10000)
C
Cla
ssifi
catio
n A
cc (
%)
10−2
10−1
100
101
102
60
70
80
90
100
k = 32
k = 64
k = 128
k = 256k = 512k = 1024
k = 4096
MNIST−PW: Logit (s = 1)
C
Cla
ssifi
catio
n A
cc (
%)
10−2
10−1
100
101
102
60
70
80
90
100
k = 32
k = 64
k = 128
k = 256k = 512k = 1024
k = 4096
MNIST−PW: Logit (s = 100)
C
Cla
ssifi
catio
n A
cc (
%)
10−2
10−1
100
101
102
60
70
80
90
100
k = 32
k = 64
k = 128
k = 256k = 512k = 1024
k = 4096
MNIST−PW: Logit (s = 1000)
C
Cla
ssifi
catio
n A
cc (
%)
10−2
10−1
100
101
102
60
70
80
90
100
k = 32
k = 64
k = 128
k = 256k = 512k = 1024
k = 4096
MNIST−PW: Logit (s = 10000)
C
Cla
ssifi
catio
n A
cc (
%)
Again, as long as k is large (necessary for high accuracy), the projection matrix
can be extremely sparse (i.e., very large s).
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 13
SVM and Logistic Regression Test Accuracies on PEMS
10−2
100
102
104
0
20
40
60
80
100
k = 32
k = 64
k = 128 k = 256k = 4096
PEMS: SVM (s = 1)
C
Cla
ssifi
catio
n A
cc (
%)
10−2
100
102
104
0
20
40
60
80
100
k = 32
k = 64
k = 128
k = 4096
PEMS: SVM (s = 10000)
C
Cla
ssifi
catio
n A
cc (
%)
10−2
100
102
104
0
20
40
60
80
100
k = 32k = 64
k = 128
k = 4096
PEMS: Logit (s = 1)
C
Cla
ssifi
catio
n A
cc (
%)
10−2
100
102
104
0
20
40
60
80
100
k = 32
k = 64
k = 128
k = 4096
PEMS: Logit (s = 10000)
C
Cla
ssifi
catio
n A
cc (
%)
Again, as long as k is large, the projection matrix can be extremely sparse.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 14
K-Means Clustering Accuracies on MNIST-PW
Random projections were applied before the data were fed to k-means clustering.
As the classification labels are known, we simply use the classification accuracy
to assess the clustering quality.
101
102
103
104
30
40
50
60
70
s = 10000
k
Clu
ster
ing
Acc
(%
)
MNIST−PW: K−means
s = 1,10,100
s = 1000
Again, s does not really seem to matter much, as long as k is not small.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 15
Ridge Regression on CT
Random projections were applied to reduce the number of examples before ridge
regression. The task becomes estimating the original regression coefficients.
101
102
103
104
5
10
15
20
k
Reg
ress
ion
Err
or
CT (Train): Regression Error
101
102
103
104
5
10
15
20
CT (Test): Regression Error
kR
egre
ssio
n E
rror
Different curves are for different s values. Again, the sparsity parameter s does
not really seem to matter much, as long as k is not small.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 16
Advantages of Very Sparse Random Projections (Large s)
Consequences of using very sparse projection matrix R:
• Matrix multiplication A × R becomes much faster.
• Much easier to store R if necessary.
• Much easier to generate (and re-generate) R on the fly.
• Sparse original data =⇒ sparse projected data. Average number of 0’s of the
projected data vector would be
k ×(
1 − 1
s
)f
where f is the number of nonzero elements in the original data vector.
——————-
Ref: Li, Hastie, Church, Very Sparse Random Projections, KDD’06.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 17
Disadvantages of Random Projections (and Variants)
Inaccurate, especially on binary data.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 18
Variance Analysis for Inner Product Estimates
A R = B
First two rows in A: u1, u2 ∈ RD (D is very large):
u1 = u1,1, u1,2, ..., u1,i, ..., u1,Du2 = u2,1, u2,2, ..., u2,i, ..., u2,D
First two rows in B: v1, v2 ∈ Rk (k is small):
v1 = v1,1, v1,2, ..., v1,j, ..., v1,kv2 = v2,1, v2,2, ..., v2,j, ..., v2,k
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 19
a =1
k
k∑
j=1
v1,jv2,j , (which is also an inner product)
E(a) = a
V ar(a) =1
k
(
m1m2 + a2)
m1 =
D∑
i=1
|u1,i|2, m2 =
D∑
i=1
|u2,i|2
———
Random projections may not be good for inner products because the variance is
dominated by marginal l2 norms m1m2, especially when a ≈ 0.
For real-world datasets, most pairs are often close to be orthogonal (a ≈ 0).
——————-
Ref: Li, Hastie, and Church, Very Sparse Random Projections, KDD’06.
Ref: Li, Shrivastava, Moore, Konig, Hashing Algorithms for Large-Scale Learning, NIPS’11
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 20
b-Bit Minwise Hashing
• Simple algorithm designed specifically for massive binary data.
• Much more accurate than random projections for estimating inner products.
• Much smaller space requirement than the original minwise hashing algorithm.
• Capable of estimating 3-way similarity, while random projections can not.
• Useful for large-scale linear learning (and kernel learning of course).
• Useful for sub-linear time near neighbor search.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 21
Massive, High-dimensional, Sparse, and Binary Data
Binary sparse data are very common in the real-world :
• For many applications (such as text), binary sparse data are very natural.
• Many datasets can be quantized/thresholded to be binary without hurting the
prediction accuracy.
• In some cases, even when the “original” data are not too sparse, they often
become sparse when considering pariwise and higher-order interactions.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 22
An Example of Binary (0/1) Sparse Massive Data
A set S ⊆ Ω = 0, 1, ..., D− 1 can be viewed as 0/1 vector in D dimensions.
Shingling : Each document (Web page) can be viewed as a set of w-shingles.
For example, after parsing, a sentence “today is a nice day” becomes
• w = 1: “today”, “is”, “a”, “nice”, “day”
• w = 2: “today is”, “is a”, “a nice”, “nice day”
• w = 3: “today is a”, “is a nice”, “a nice day”
Previous studies used w ≥ 5, as single-word (unit-gram) model is not sufficient.
Shingling generates extremely high dimensional vectors, e.g., D = (105)w .
(105)5 = 1025 = 283, although in current practice, it seems D = 264 suffices.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 23
Notation
A binary (0/1) vector can be equivalently viewed as a set (locations of nonzeros).
Consider two sets S1, S2 ⊆ Ω = 0, 1, 2, ..., D − 1 (e.g., D = 264)
f2
f1 a
f1 = |S1|, f2 = |S2|, a = |S1 ∩ S2|.
The resemblance R is a popular measure of set similarity
R =|S1 ∩ S2||S1 ∪ S2|
=a
f1 + f2 − a
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 24
Minwise Hashing: Stanford Algorithm in the Context of Searc h
Suppose a random permutation π is performed on Ω, i.e.,
π : Ω −→ Ω, where Ω = 0, 1, ..., D − 1.
An elementary probability argument shows that
Pr (min(π(S1)) = min(π(S2))) =|S1 ∩ S2|
|S1 ∪ S2|= R.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 25
An Example
D = 5. S1 = 0, 3, 4, S2 = 1, 2, 3, R = |S1∩S2||S1∪S2|
= 15 .
One realization of the permutation π can be
0 =⇒ 3
1 =⇒ 2
2 =⇒ 0
3 =⇒ 4
4 =⇒ 1
π(S1) = 3, 4, 1 = 1, 3, 4, π(S2) = 2, 0, 4 = 0, 2, 4
In this example, min(π(S1)) 6= min(π(S2)).
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 26
Minwise Hashing Estimator
After k permutations, π1, π2, ..., πk , one can estimate R without bias:
RM =1
k
k∑
j=1
1min(πj(S1)) = min(πj(S2)),
Var(
RM
)
=1
kR(1 − R).
—————————-
We recently realized that this estimator could be written as an inner product in
264 × k (assuming D = 264) dimensions). This means one can potentially use
it for linear learning, although the dimensionality would be excessively high.
Ref: Li, Shrivastava, Moore, Konig, Hashing Algorithms for Large-Scale Learning, NIPS’11
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 27
b-Bit Minwise Hashing: the Intuition
Basic idea : Only store the lowest b-bits of each hashed value, for small b.
Intuition:
• When two sets are identical, then their lowest b-bits of the hashed values are
of course also equal. b = 1 only stores whether a number is even or odd.
• When two sets are similar, then their lowest b-bits of the hashed values
“should be” also similar (True?? Need a proof).
• Therefore, hopefully we do not need many bits to obtain useful information,
especially when real applications care about pairs with reasonably high
resemblance values (e.g., 0.5).
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 28
The Basic Collision Probability Result
Consider two sets, S1 and S2,
S1, S2 ⊆ Ω = 0, 1, 2, ..., D − 1,f1 = |S1|, f2 = |S2|, a = |S1 ∩ S2|
Define the minimum values under π : Ω → Ω to be z1 and z2:
z1 = min (π (S1)) , z2 = min (π (S2)) .
and their b-bit versions
z(b)1 = The lowest b bits of z1, z
(b)2 = The lowest b bits of z2
Example: if z1 = 7(= 111 in binary), then z(1)1 = 1, z
(2)1 = 3.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 29
Collision probability : Assume D is large (which is virtually always true)
Pb = Pr
(
z(b)1 = z
(b)2
)
= C1,b + (1 − C2,b) R
——————–
Recall, (assuming infinite precision, or as many digits as needed), we have
Pr (z1 = z2) = R
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 30
Collision probability : Assume D is large (which is virtually always true)
Pb = Pr
(
z(b)1 = z
(b)2
)
= C1,b + (1 − C2,b) R
r1 =f1
D, r2 =
f2
D, f1 = |S1|, f2 = |S2|,
C1,b = A1,br2
r1 + r2+ A2,b
r1
r1 + r2,
C2,b = A1,br1
r1 + r2+ A2,b
r2
r1 + r2,
A1,b =r1 [1 − r1]
2b−1
1 − [1 − r1]2b
, A2,b =r2 [1 − r2]
2b−1
1 − [1 − r2]2b
.
——————–
Ref: Li and Konig, b-Bit Minwise Hashing, WWW’10.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 31
The closed-form collision probability is remarkably accurate even for small D.
The absolute errors (approximate - exact) are very small even for D = 20.
0 0.2 0.4 0.6 0.8 10
0.002
0.004
0.006
0.008
0.01
f2 = 2
f2 = 4
a / f2
App
roxi
mat
ion
Err
or
D = 20, f1 = 4, b = 1
0 0.2 0.4 0.6 0.8 10
1
2
3
4x 10
−4
D = 500, f1 = 50, b = 1
f2 = 2
f2 = 25
f2 = 50
a / f2
App
roxi
mat
ion
Err
or0 0.2 0.4 0.6 0.8 1
0
0.002
0.004
0.006
0.008
0.01
f2 = 2
f2 = 5
f2 = 10
a / f2
App
roxi
mat
ion
Err
or
D = 20, f1 = 10, b = 1
0 0.2 0.4 0.6 0.8 10
1
2
3
4x 10
−4
a / f2
App
roxi
mat
ion
Err
or
D = 500, f1 = 250, b = 1
—————–
Ref: Li and Konig, Theory and Applications of b-Bit Minwise Hashing, CACM
Research Highlights, 2011 .
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 32
An Unbiased Estimator
An unbiased estimator Rb for R:
Rb =Pb − C1,b
1 − C2,b,
Pb =1
k
k∑
j=1
z(b)1,πj
= z(b)2,πj
,
Var(
Rb
)
=Var
(
Pb
)
[1 − C2,b]2 =
1
k
Pb(1 − Pb)
[1 − C2,b]2
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 33
The Variance-Space Trade-off
Smaller b =⇒ smaller space for each sample. However,
smaller b =⇒ larger variance.
B(b; R, r1, r2) characterizes the var-space trade-off:
B(b; R, r1, r2) = b × Var(
Rb
)
Lower B(b) is better.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 34
The ratio,
B(b1; R, r1, r2)
B(b2; R, r1, r2)
measures the improvement of using b = b2 (e.g., b2 = 1) over using b = b1
(e.g., b1 = 64).
B(64)B(b) = 20 means, to achieve the same accuracy (variance), using b = 64 bits
per hashed value will require 20 times more space (in bits) than using b = 1.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 35
B(64)B(b) , relative improvement of using b = 1, 2, 3, 4 bits, compared to 64 bits.
0 0.2 0.4 0.6 0.8 10
5
10
15
20
25
30
35
Resemblance (R)
Impr
ovem
ent
r1 = r
2 = 10−10 b = 1
b = 2
b = 3
b = 4
0 0.2 0.4 0.6 0.8 10
5
10
15
20
25
30
35
Resemblance (R)
Impr
ovem
ent
r1 = r
2 = 0.1
b = 1
b = 2
b = 3
b = 4
0 0.2 0.4 0.6 0.8 10
10
20
30
40
50
60
Resemblance (R)
Impr
ovem
ent
r1 = r
2 = 0.5
b = 1
b = 2
b = 3
b = 4
0 0.2 0.4 0.6 0.8 10
10
20
30
40
50
60
Resemblance (R)
Impr
ovem
ent
r1 = r
2 = 0.9
b = 1
b = 2
b = 3
b = 4
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 36
Relative Improvements in the Least Favorable Situation
If r1, r2 → 0 (least favorable situation; there is a proof), then
B(64)
B(1)= 64
R
R + 1
If R = 0.5, then the improvement will be 643 = 21.3-fold.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 37
Experiments: Duplicate detection on Microsoft news articl es
The dataset was crawled as part of the BLEWS project at Microsoft. We
computed pairwise resemblances for all documents and retrieved documents
pairs with resemblance R larger than a threshold R0.
0 100 200 300 400 5000
0.10.20.30.40.50.60.70.80.9
1
R0 = 0.3
Sample size (k)
Pre
cisi
on
b=1b=2
b=1b=2b=4M
0 100 200 300 400 5000
0.10.20.30.40.50.60.70.80.9
1
R0 = 0.4
Sample size (k)
Pre
cisi
on
2
b=1
b=1b=2b=4M
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 38
0 100 200 300 400 5000
0.10.20.30.40.50.60.70.80.9
1
R0 = 0.5
Sample size (k)
Pre
cisi
on
b=1
2
b=1b=2b=4M
0 100 200 300 400 5000
0.10.20.30.40.50.60.70.80.9
1
R0 = 0.6
Sample size (k)
Pre
cisi
on
2
b=1
b=1b=2b=4M
0 100 200 300 400 5000
0.10.20.30.40.50.60.70.80.9
1
R0 = 0.7
Sample size (k)
Pre
cisi
on
b=1
2
b=1b=2b=4M
0 100 200 300 400 5000
0.10.20.30.40.50.60.70.80.9
1
R0 = 0.8
Sample size (k)
Pre
cisi
on
2
b=1
b=1b=2b=4M
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 39
1/2 Bit Suffices for High Similarity
Practical applications are sometimes interested in pairs of very high similarities.
For these data, one can XOR two bits from two permutations, into just one bit.
The new estimator is denoted by R1/2. Compared to the 1-bit estimator R1:
• A good idea for highly similar data :
limR→1
Var(
R1
)
Var(
R1/2
) = 2.
• Not so good idea when data are not very similar :
Var(
R1
)
< Var(
R1/2
)
, if R < 0.5774, Assuming sparse data
—————
Ref: Li and Konig, b-Bit Minwise Hashing, WWW’10.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 40
b-Bit Minwise Hashing for Estimating 3-Way Similarities
Notation for 3-way set intersections
a12
f1 a
a23
f3
a13
f2
r1
r3
s12
ss23
r2
s13
3-Way collision probability : Assume D is large.
Pr (lowest b bits of 3 hashed values are equal) =Z
u+ R123 =
Z + s
u,
where u = r1 + r2 + r3 − s12 − s13 − s23 + s, and ...
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 41
Z =(s12 − s)A3,b +(r3 − s13 − s23 + s)
r1 + r2 − s12
s12G12,b
+(s13 − s)A2,b +(r2 − s12 − s23 + s)
r1 + r3 − s13
s13G13,b
+(s23 − s)A1,b +(r1 − s12 − s13 + s)
r2 + r3 − s23
s23G23,b
+[
(r2 − s23)A3,b + (r3 − s23)A2,b
] (r1 − s12 − s13 + s)
r2 + r3 − s23
G23,b
+[
(r1 − s13)A3,b + (r3 − s13)A1,b
] (r2 − s12 − s23 + s)
r1 + r3 − s13
G13,b
+[
(r1 − s12)A2,b + (r2 − s12)A1,b
] (r3 − s13 − s23 + s)
r1 + r2 − s12
G12,b,
Aj,b =rj(1 − rj)2
b−1
1 − (1 − rj)2b
,
Gij,b =(ri + rj − sij)(1 − ri − rj + sij)2
b−1
1 − (1 − ri − rj + sij)2b
, i, j ∈ 1, 2, 3, i 6= j.
—————
Ref: Li, Konig, Gui, b-Bit Minwise Hashing for Estimating 3-Way Similarities,
NIPS’10.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 42
Useful messages :
1. Substantial improvement over using 64 bits, just like in the 2-way case.
2. Must use b ≥ 2 bits for estimating 3-way similarities.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 43
b-Bit Minwise Hashing for Large-Scale linear Learning
Linear learning algorithms require the estimators to be inner products.
If we look carefully, the estimator of minwise hashing is indeed an inner product of
two (extremely sparse) vectors in D × k dimensions (infeasible when D = 264):
RM =1
k
k∑
j=1
1min(πj(S1)) = min(πj(S2))
because, as z1 = min(π(S1)), z2 = min(π(S2)) ∈ Ω = 0, 1, ..., D − 1.
1z1 = z2 =D−1∑
i=0
1z1 = i × z2 = i
———–
Ref: Li, Shrivastava, Moore, Konig, Hashing Algorithms for Large-Scale Learning, NIPS’11.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 44
Consider D = 5. We can expand numbers into vectors of length 5.
0 =⇒ [0, 0, 0, 0, 1], 1 =⇒ [0, 0, 0, 1, 0], 2 =⇒ [0, 0, 1, 0, 0]
3 =⇒ [0, 1, 0, 0, 0], 4 =⇒ [1, 0, 0, 0, 0].
———————
If z1 = 2, z2 = 3, then
0 = 1z1 = z2 = inner product between [0, 0, 1, 0, 0] and [0, 1, 0, 0, 0].
If z1 = 2, z2 = 2, then
1 = 1z1 = z2 = inner product between [0, 0, 1, 0, 0] and [0, 0, 1, 0, 0].
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 45
Linear Learning Algorithms
Linear algorithms such as linear SVM and logistic regression have become very
powerful and extremely popular. Representative software packages include
SVMperf, Pegasos, Bottou’s SGD SVM, and LIBLINEAR.
Given a dataset (xi, yi)ni=1, xi ∈ R
D , yi ∈ −1, 1, the L2-regularized
linear SVM solves the following optimization problem:
minw
1
2w
Tw + C
n∑
i=1
max
1 − yiwTxi, 0
,
and the L2-regularized logistic regression solves a similar problem:
minw
1
2w
Tw + C
n∑
i=1
log(
1 + e−yiwTxi
)
.
Here C > 0 is an important penalty (regularization) parameter.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 46
Integrating b-Bit Minwise Hashing for (Linear) Learning
Very simple :
1. We apply k independent random permutations on each (binary) feature
vector xi and store the lowest b bits of each hashed value. This way, we
obtain a new dataset which can be stored using merely nbk bits.
2. At run-time, we expand each new data point into a 2b × k-length vector, i.e.
we concatenate the k vectors (each of length 2b). The new feature vector has
exactly k 1’s.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 47
An example with k = 3 and b = 2
For set (vector) S1:
Original hashed values (k = 3) : 12013 25964 20191
Original binary representations :
010111011101101 110010101101100 100111011011111
Lowest b = 2 binary digits : 01 00 11
Corresponding decimal values : 1 0 3
Expanded 2b = 4 binary digits : 0010 0001 1000
New feature vector fed to a solver : 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0
Same procedures on sets S2, S3, ...
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 48
Datasets and Solver for Linear Learning
This talk presents the experiments on two text datasets.
Dataset # Examples (n) # Dims (D) Avg # Nonzeros Train / Test
Webspam (24 GB) 350,000 16,609,143 3728 80% / 20%
Rcv1 (200 GB) 781,265 1,010,017,424 12062 50% / 50%
To generate the Rcv1 dataset, we used the original features + all pairwise
features + 1/30 3-way features.
We chose LIBLINEAR as the basic solver for linear learning. Note that our method
is purely statistical/probabilistic, independent of the underlying procedures.
All experiments were conducted on workstations with Xeon(R) CPU
([email protected]) and 48GB RAM, under Windows 7 System.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 49
Experimental Results on Linear SVM and Webspam
• We conducted our extensive experiments for a wide range of regularization C
values (from 10−3 to 102) with fine spacings in [0.1, 10].
• We experimented with k = 30 to k = 500, and b = 1, 2, 4, 8, and 16.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 50
Testing Accuracy
10−3
10−2
10−1
100
101
102
80828486889092949698
100
C
Acc
urac
y (%
)
b = 6,8,10,16
b = 1
svm: k = 200
b = 24
b = 4
Spam: Accuracy
• Solid: b-bit hashing. Dashed (red) the original data
• Using b ≥ 8 and k ≥ 200 achieves about the same test accuracies as using
the original data.
• The results are averaged over 50 runs.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 51
10−3
10−2
10−1
100
101
102
80828486889092949698
100
C
Acc
urac
y (%
)
b = 1
b = 2
b = 4
b = 6b = 8,10,16
svm: k = 50Spam: Accuracy
10−3
10−2
10−1
100
101
102
80828486889092949698
100
CA
ccur
acy
(%)
Spam: Accuracy
svm: k = 100
b = 1
b = 2
b = 4b = 8,10,16
4
6
6
10−3
10−2
10−1
100
101
102
80828486889092949698
100
C
Acc
urac
y (%
)
svm: k = 150
Spam: Accuracy
b = 1
b = 2
b = 4b = 6,8,10,16
4
10−3
10−2
10−1
100
101
102
80828486889092949698
100
C
Acc
urac
y (%
)
b = 6,8,10,16
b = 1
svm: k = 200
b = 24
b = 4
Spam: Accuracy
10−3
10−2
10−1
100
101
102
80828486889092949698
100
C
Acc
urac
y (%
)
Spam: Accuracy
svm: k = 300
b = 1
b = 24
b = 4b = 6,8,10,16
10−3
10−2
10−1
100
101
102
80828486889092949698
100
b = 1
b = 24
b = 6,8,10,16
C
Acc
urac
y (%
)
svm: k = 500
Spam: Accuracy
b = 4
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 52
Stability of Testing Accuracy (Standard Deviation)
Our method produces very stable results, especially b ≥ 4
10−3
10−2
10−1
100
101
102
10−2
10−1
100
C
Acc
urac
y (s
td %
)
b = 1
b = 2
b = 4
b = 6b = 8
b = 1610
svm: k = 50Spam accuracy (std)
10−3
10−2
10−1
100
101
102
10−2
10−1
100
C
Acc
urac
y (s
td %
)
b = 1
b = 2
b = 4
b = 6b = 8
b = 10,16svm: k = 100Spam accuracy (std)
10−3
10−2
10−1
100
101
102
10−2
10−1
100
C
Acc
urac
y (s
td %
)
svm: k = 150
Spam:Accuracy (std)
b = 1
b = 2
b = 4
b = 8b = 16
10−3
10−2
10−1
100
101
102
10−2
10−1
100
C
Acc
urac
y (s
td %
)
b = 1
b = 2
b = 4
b = 6
b = 8,10,16svm: k = 200Spam accuracy (std)
10−3
10−2
10−1
100
101
102
10−2
10−1
100
C
Acc
urac
y (s
td %
)
Spam:Accuracy (std)
b = 1
b = 2
b = 4b = 8b = 16svm: k = 300
10−3
10−2
10−1
100
101
102
10−2
10−1
100
b = 1
b = 2
b = 4
b = 6,8,10,16Spam accuracy (std)
C
Acc
urac
y (s
td %
)
svm: k = 500
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 53
Training Time
10−3
10−2
10−1
100
101
102
100
101
102
103
C
Tra
inin
g tim
e (s
ec)
svm: k = 50Spam: Training time
10−3
10−2
10−1
100
101
102
100
101
102
103
C
Tra
inin
g tim
e (s
ec)
svm: k =100Spam: Training time
10−3
10−2
10−1
100
101
102
100
101
102
103
C
Tra
inin
g tim
e (s
ec)
svm: k = 150
Spam:Training Time
b = 16
b = 1
b=8
b = 4b = 1
2416
10−3
10−2
10−1
100
101
102
100
101
102
103
C
Tra
inin
g tim
e (s
ec)
b = 16
svm: k = 200Spam: Training time
10−3
10−2
10−1
100
101
102
100
101
102
103
C
Tra
inin
g tim
e (s
ec)
b = 16
b = 1
2
416
b=8
b = 1
svm: k = 300
Spam:Training Time10
−310
−210
−110
010
110
210
0
101
102
103
Spam: Training time
b = 10
b = 16
C
Tra
inin
g tim
e (s
ec)
svm: k = 500
• They did not include data loading time (which is small for b-bit hashing)
• The original training time is about 100 seconds.
• b-bit minwise hashing needs about 3 ∼ 7 seconds (3 seconds when b = 8).
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 54
Testing Time
10−3
10−2
10−1
100
101
102
12
10
100
1000
C
Tes
ting
time
(sec
)
svm: k = 50Spam: Testing time
10−3
10−2
10−1
100
101
102
12
10
100
1000
C
Tes
ting
time
(sec
)
svm: k = 100Spam: Testing time
10−3
10−2
10−1
100
101
102
100
101
102
103
C
Tes
ting
time
(sec
)
svm: k = 150Spam:Testing Time
10−3
10−2
10−1
100
101
102
12
10
100
1000
C
Tes
ting
time
(sec
)
svm: k = 200Spam: Testing time
10−3
10−2
10−1
100
101
102
100
101
102
103
C
Tes
ting
time
(sec
) Spam:Testing Timesvm: k = 300
10−3
10−2
10−1
100
101
102
12
10
100
1000
C
Tes
ting
time
(sec
)
svm: k = 500Spam: Testing time
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 55
Experimental Results on L2-Regularized Logistic Regression
Testing Accuracy
10−3
10−2
10−1
100
101
102
80828486889092949698
100
C
Acc
urac
y (%
)
logit: k = 50Spam: Accuracy
b = 1
b = 2
b = 4
b = 6b = 8,10,16
10−3
10−2
10−1
100
101
102
80828486889092949698
100
C
Acc
urac
y (%
)logit: k = 100
Spam: Accuracy
b = 1
b = 2
b = 4b = 6
b = 8,10,16
10−3
10−2
10−1
100
101
102
80
85
90
95
100
C
Acc
urac
y (%
)
Spam:Accuracylogit: k = 150
b = 1
b = 2
b = 4b = 8
10−3
10−2
10−1
100
101
102
80828486889092949698
100
C
Acc
urac
y (%
)
Spam: Accuracy
logit: k = 200
b = 6,8,10,16
b = 1
b = 2
b = 4
10−3
10−2
10−1
100
101
102
80
90
95
100
C
Acc
urac
y (%
)
logit: k = 300Spam:Accuracy
b = 1
b = 2
b = 4b = 8
10−3
10−2
10−1
100
101
102
80828486889092949698
100
C
Acc
urac
y (%
)
logit: k = 500
Spam: Accuracy
b = 1
b = 2b = 4
4
b = 6,8,10,16
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 56
Training Time
10−3
10−2
10−1
100
101
102
100
101
102
103
C
Tra
inin
g tim
e (s
ec)
logit: k = 50
Spam: Training time
10−3
10−2
10−1
100
101
102
100
101
102
103
C
Tra
inin
g tim
e (s
ec)
logit: k = 100
Spam: Training time
10−3
10−2
10−1
100
101
102
100
101
102
103
C
Tra
inin
g tim
e (s
ec)
logit: k = 150Spam:Training Time
b = 16b = 1
2 b = 84
10−3
10−2
10−1
100
101
102
100
101
102
103
C
Tra
inin
g tim
e (s
ec)
logit: k = 200
Spam: Training time
10−3
10−2
10−1
100
101
102
100
101
102
103
C
Tra
inin
g tim
e (s
ec)
b = 16
b = 1
b = 482
logit: k = 300Spam:Training Time
10−3
10−2
10−1
100
101
102
100
101
102
103
CT
rain
ing
time
(sec
) b = 16
logit: k = 500
Spam: Training time
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 57
Comparisons with Other Algorithms
We conducted extensive experiments with the VW algorithm (Weinberger et. al.,
ICML’09, not the VW online learning platform), which has the same variance as
random projections.
Consider two sets S1, S2, f1 = |S1|, f2 = |S2|, a = |S1 ∩ S2|,R = |S1∩S2|
|S1∪S2|= a
f1+f2−a . Then, from their variances
V ar (aV W ) ≈ 1
k
(
f1f2 + a2)
V ar(
RMINWISE
)
=1
kR (1 − R)
we can immediately see the significant advantages of minwise hashing, especially
when a ≈ 0 (which is common in practice).
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 58
Comparing b-Bit Minwise Hashing with VW
8-bit minwise hashing (dashed, red) with k ≥ 200 (and C ≥ 1) achieves about
the same test accuracy as VW with k = 104 ∼ 106.
101
102
103
104
105
106
80828486889092949698
100
C = 0.01
1,10,100
0.1
k
Acc
urac
y (%
)
svm: VW vs b = 8 hashing
C = 0.01
C = 0.1C = 1
10,100
Spam: Accuracy
101
102
103
104
105
106
80828486889092949698
100
k
Acc
urac
y (%
)
1
C = 0.01
C = 0.1
C = 110
C = 0.01
C = 0.1
10,100
logit: VW vs b = 8 hashingSpam: Accuracy
100
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 59
8-bit hashing is substantially faster than VW (to achieve the same accuracy).
102
103
104
105
106
100
101
102
103
k
Tra
inin
g tim
e (s
ec) C = 100
C = 10
C = 1,0.1,0.01C = 100
C = 10
C = 1,0.1,0.01
Spam: Training time
svm: VW vs b = 8 hashing
102
103
104
105
106
100
101
102
103
kT
rain
ing
time
(sec
)
C = 0.01
10,1.0,0.1
100
logit: VW vs b = 8 hashingSpam: Training time
C = 0.1,0.01
C = 100,10,1
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 60
Experimental Results on Rcv1 (200GB)
Test accuracy using linear SVM (Can not train the original data)
10−3
10−2
10−1
100
101
102
50
60
70
80
90
100
C
Acc
urac
y (%
)
rcv1: Accuracy
svm: k =50 b = 16
b = 12
b = 8
b = 4
b = 2
b = 1
10−3
10−2
10−1
100
101
102
50
60
70
80
90
100
C
Acc
urac
y (%
)rcv1: Accuracy
svm: k =100 b = 16
b = 12b = 8
b = 4b = 2
b = 1
10−3
10−2
10−1
100
101
102
50
60
70
80
90
100
C
Acc
urac
y (%
)
svm: k =150
rcv1: Accuracy
b = 16b = 12
b = 8
b = 4
b = 2
b = 1
10−3
10−2
10−1
100
101
102
50
60
70
80
90
100
C
Acc
urac
y (%
)
svm: k =200
rcv1: Accuracy
b = 16
b = 12b = 8
b = 4
b = 2
b = 1
10−3
10−2
10−1
100
101
102
50
60
70
80
90
100
C
Acc
urac
y (%
)
rcv1: Accuracy
svm: k =300 b = 16b = 12
b = 8
b = 4
b = 2
b = 1
10−3
10−2
10−1
100
101
102
50
60
70
80
90
100
b = 1
b = 2
b = 4
b = 8b = 12
b = 16
CA
ccur
acy
(%)
svm: k =500
rcv1: Accuracy
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 61
Training time using linear SVM
10−3
10−2
10−1
100
101
102
100
101
102
103
C
Tra
inin
g tim
e (s
ec)
rcv1: Train Time
svm: k=50
12
16
b = 8
12
10−3
10−2
10−1
100
101
102
100
101
102
103
CT
rain
ing
time
(sec
)
rcv1: Train Time
svm: k=100
b = 16
12
b = 8
12
10−3
10−2
10−1
100
101
102
100
101
102
103
C
Tra
inin
g tim
e (s
ec)
rcv1: Train Time
svm: k=150
b = 1612
b = 8
10−3
10−2
10−1
100
101
102
100
101
102
103
C
Tra
inin
g tim
e (s
ec)
rcv1: Train Time
svm: k=200
b = 1612
b = 8
10−3
10−2
10−1
100
101
102
100
101
102
103
C
Tra
inin
g tim
e (s
ec)
rcv1: Train Time
svm: k=300
b = 16
12
b = 8
10−3
10−2
10−1
100
101
102
100
101
102
103
svm: k=500
b = 8
12
b = 16
C
Tra
inin
g tim
e (s
ec)
rcv1: Train Time
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 62
Test accuracy using logistic regression
10−2
100
102
50
60
70
80
90
100
C
Acc
urac
y (%
)
rcv1: Accuracy
logit: k =50 b = 16
b = 12
b = 8
b = 4
b = 2
b = 1
10−3
10−2
10−1
100
101
102
50
60
70
80
90
100
CA
ccur
acy
(%)
rcv1: Accuracyb = 1
b = 2
b = 4
b = 8
b = 12
b = 16logit: k =100
10−3
10−2
10−1
100
101
102
50
60
70
80
90
100
C
Acc
urac
y (%
)
rcv1: Accuracy
logit: k =150 b = 16
b = 12b = 8
b = 4
b = 2
b = 1
10−3
10−2
10−1
100
101
102
50
60
70
80
90
100
C
Acc
urac
y (%
)
rcv1: Accuracy
logit: k =200 b = 16b = 12
b = 8
b = 4
b = 2
b = 1
10−3
10−2
10−1
100
101
102
50
60
70
80
90
100
C
Acc
urac
y (%
)
b = 16b = 12
b = 8
b = 4
b = 2
b = 1
logit: k =300
rcv1: Accuracy
10−3
10−2
10−1
100
101
102
50
60
70
80
90
100
C
Acc
urac
y (%
)
b = 16
b = 12b = 8
b = 4
b = 2
b = 1
rcv1: Accuracy
logit: k =500
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 63
Training time using logistic regression
10−3
10−2
10−1
100
101
102
100
101
102
103
C
Tra
inin
g tim
e (s
ec)
rcv1: Train Time
logit: k=50
b = 168
b = 1
b = 4
b = 2
b = 12
10−3
10−2
10−1
100
101
102
100
101
102
103
CT
rain
ing
time
(sec
)
rcv1: Train Time
logit: k=100
b = 8
b = 16
b = 12
b = 1
10−3
10−2
10−1
100
101
102
100
101
102
103
C
Tra
inin
g tim
e (s
ec)
rcv1: Train Time
logit: k=150
b = 16
128
b = 4
10−3
10−2
10−1
100
101
102
100
101
102
103
C
Tra
inin
g tim
e (s
ec)
rcv1: Train Time
logit: k=200
b = 16
128
b = 1
10−3
10−2
10−1
100
101
102
100
101
102
103
C
Tra
inin
g tim
e (s
ec)
rcv1: Train Time
logit: k=300b = 16
12
b = 8
10−3
10−2
10−1
100
101
102
100
101
102
103
C
Tra
inin
g tim
e (s
ec)
rcv1: Train Time
b = 1612
b = 8
logit: k=500
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 64
Comparisons with VW on Rcv1
Test accuracy using linear SVM
101
102
103
104
50
60
70
80
90
100
K
Acc
urac
y (%
)
svm: VW vs hashing
rcv1: Accuracy
C = 0.01
C = 0.1,1,10
C = 0.01,0.1,1,10
1−bit
101
102
103
104
50
60
70
80
90
100
K
Acc
urac
y (%
)rcv1: Accuracy
svm: VW vs hashing2−bit
C = 0.01,0.1,1,10 C = 0.01
C = 0.1,1,10
101
102
103
104
50
60
70
80
90
100
K
Acc
urac
y (%
)
rcv1: Accuracysvm: VW vs hashing4−bit
C = 0.01
C = 0.1,1,10
C = 0.01
C = 10
101
102
103
104
50
60
70
80
90
100
K
Acc
urac
y (%
)
rcv1: Accuracy
svm: VW vs hashing8−bit
C = 0.1,1,10
C = 0.01C = 0.01
C = 10
101
102
103
104
50
60
70
80
90
100
C = 0.01
K
Acc
urac
y (%
)
C = 0.01C = 0.1,1,10
rcv1: Accuracysvm: VW vs hashing12−bit
C = 10
101
102
103
104
50
60
70
80
90
100
K
Acc
urac
y (%
)
C = 0.1,1,10
C = 0.01
rcv1: Accuracy
svm: VW vs hashing16−bit
C = 10
C = 0.01
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 65
Test accuracy using logistic regression
101
102
103
104
50
60
70
80
90
100
C = 0.01
C = 0.01,0.1,1,10
C = 0.1,1,10
K
Acc
urac
y (%
)
logit: VW vs hashingrcv1: Accuracy
1−bit
101
102
103
104
50
60
70
80
90
100
K
Acc
urac
y (%
)
C = 1logit: VW vs hashing2−bit
rcv1: Accuracy
C = 0.1,1,10
C = 0.01C = 0.01,0.1,1,10
101
102
103
104
50
60
70
80
90
100
K
Acc
urac
y (%
)
C = 0.1,1,10
C = 0.01
C = 0.1,1,10
C = 0.01
rcv1: Accuracylogit: VW vs hashing4−bit
101
102
103
104
50
60
70
80
90
100
K
Acc
urac
y (%
)
rcv1: Accuracylogit: VW vs hashing8−bit
C = 0.01
C = 0.1,1,10
C = 0.01
C = 0.1,1,10
101
102
103
104
50
60
70
80
90
100
K
Acc
urac
y (%
)
rcv1: Accuracylogit: VW vs hashing12−bit
C = 0.01
C = 0.1,1,10
C = 0.1,1,10
C = 0.01
101
102
103
104
50
60
70
80
90
100
KA
ccur
acy
(%)
C = 0.01
C = 0.1,1,10
C = 0.01
C = 10
logit: VW vs hashing16−bit
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 66
Training time
101
102
103
104
100
101
102
103
C = 0.01
C = 0.1
C = 10
8−bit
K
Tra
inin
g tim
e (s
ec)
C = 0.01
C = 0.1
C = 1
rcv1: Train Time
svm: VW vs hashing
C = 1
C = 10
101
102
103
104
100
101
102
103
KT
rain
ing
time
(sec
)
rcv1: Train Time
logit: VW vs hashing8−bit
C = 10
C = 1
C = 0.1
C = 0.01
C = 0.01
C = 0.1
C = 1
C = 10
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 67
b-Bit Minwise Hashing for Efficient Near Neighbor Search
Near neighbor search is a much more frequent operation than training an SVM.
The bits from b-bit minwise hashing can be directly used to build hash tables to
enable sub-linear time near neighbor search.
This is an instance of the general family of locality-sensitive hashing.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 68
An Example of b-Bit Hashing for Near Neighbor Search
We use b = 2 bits and k = 2 permutations to build a hash table indexed from
0000 to 1111, i.e., the table size is 22×2 = 16.
00 10
11 1011 11
00 0000 01
Index Data Points
11 01
8, 13, 251 5, 14, 19, 29(empty)
33, 174, 3153 7, 24, 156
61, 342
00 10
11 1011 11
00 0000 01
Index Data Points
11 01
8
17, 36, 1292, 19, 83
7, 198
56, 989,9, 156, 879
4, 34, 52, 796
Then, the data points are placed in the buckets according to their hashed values.
Look for near neighbors in the bucket which matches the hash value of the query.
Replicate the hash table (twice in this case) for missed and good near neighbors.
Final retrieved data points (before re-ranking) are the union of the buckets.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 69
Experiments on the Webspam Dataset
B = b × k: table size L: number of tables.
Fractions of retrieved data points (before re-ranking)
8 12 16 20 2410
−2
10−1
100
Webspam: L = 25
B (bits)
Fra
ctio
n E
valu
ated
b = 1
b = 2
b = 4
SRPb−bit
8 12 16 20 2410
−2
10−1
100
B (bits)
Fra
ctio
n E
valu
ated
b = 1
b = 2
b = 4
Webspam: L = 50
SRPb−bit
8 12 16 20 2410
−2
10−1
100
Webspam: L = 100
B (bits)
Fra
ctio
n E
valu
ated
b = 1
b = 2
b = 4
SRPb−bit
SRP (sign random projections) and b-bit hashing (with b = 1, 2) retrieved similar
numbers of data points.
————————
Ref: Shrivastava and Li, Fast Near Neighbor Search in High-Dimensional Binary
Data, ECML’12
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 70
Precision-Recall curves (the higher the better) for retrieving top-T data points.
0 10 20 30 40 50 60 70 80 90 1000
102030405060708090
100
Recall (%)
Pre
cisi
on (
%)
B= 24 L= 100Webspam: T = 5
b = 1
b = 4
SRPb−bit
0 10 20 30 40 50 60 70 80 90 1000
102030405060708090
100
Recall (%)
Pre
cisi
on (
%)
B= 24 L= 100Webspam: T = 10
b = 1
b = 4
SRPb−bit
0 10 20 30 40 50 60 70 80 90 1000
102030405060708090
100
Recall (%)
Pre
cisi
on (
%)
B= 24 L= 100Webspam: T = 50
b = 1
b = 4
SRPb−bit
0 10 20 30 40 50 60 70 80 90 1000
102030405060708090
100
Recall (%)
Pre
cisi
on (
%)
B= 16 L= 100Webspam: T = 5
1
b = 4
SRPb−bit
0 10 20 30 40 50 60 70 80 90 1000
102030405060708090
100
Recall (%)
Pre
cisi
on (
%)
B= 16 L= 100Webspam: T = 10
1
b = 4
SRPb−bit
0 10 20 30 40 50 60 70 80 90 1000
102030405060708090
100
Recall (%)
Pre
cisi
on (
%)
B= 16 L= 100Webspam: T = 50
1
b = 4
SRPb−bit
Using L = 100 tables, b-bit hashing considerably outperformed SRP (sign
random projections), for table sizes B = 24 and B = 16. Note that there are
many LSH schemes, which we did not compare exhaustively.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 71
Simulating Permutations with Universal (2U and 4U) Hashing
10−3
10−2
10−1
100
101
102
70
80
90
100
b = 1
b = 2
b = 4
b = 8
C
Acc
urac
y (%
)
SVM: k = 10
Spam: Accuracy
Perm2U4U
10−3
10−2
10−1
100
101
102
85
90
95
100
b = 1
b = 2
b = 4
b = 8
C
Acc
urac
y (%
)
SVM: k = 100Spam: Accuracy
Perm2U4U
10−3
10−2
10−1
100
101
102
85
90
95
100
b = 1
b = 2
b = 4
b = 8
C
Acc
urac
y (%
)
SVM: k = 200Spam: Accuracy
SVM: k = 200Spam: AccuracySVM: k = 200Spam: Accuracy
Perm2U4U
10−3
10−2
10−1
100
101
102
85
90
95
100
b = 1
b = 2
4b = 8
C
Acc
urac
y (%
)
SVM: k = 500Spam: Accuracy
SVM: k = 500Spam: AccuracySVM: k = 500Spam: Accuracy
Perm2U4U
Both 2U and 4U perform well except when b = 1 or small k.
———————–
Ref: Li, Shrivastava, Konig, b-Bit Minwise Hashing in Practice, arXiv:1205.2958
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 72
Speed Up PreProcessing Using GPUs
CPU: Data loading and preprocessing (for k = 500 permutations) times (in
seconds). Note that we measured the loading times of LIBLINEAR which used a
plain text data format. Using binary data will speed up data loading.
Dataset Loading Permu 2U 4U (Mod) 4U (Bit)
Webspam 9.7 × 102 6.1 × 103 4.1 × 103 4.4 × 104 1.4 × 104
Rcv1 1.0 × 104 – 3.0 × 104 – –
GPU: Data loading and preprocessing (for k = 500 permutations) times
Dataset Loading GPU 2U GPU 4U (Mod) GPU 4U (Bit)
Webspam 9.7 × 102 51 5.2 × 102 1.2 × 102
Rcv1 1.0 × 104 1.4 × 103 1.5 × 104 3.2 × 103
Note that the modular operations in 4U can be replaced by bit operations.
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 73
Breakdowns of the overhead : A batch-based GPU implementation.
100
101
102
103
104
105
10−2
100
102
104
Batch size
Tim
e (s
ec) GPU Kernel
CPU −−> GPU
GPU −−> CPUSpam: GPU Profiling, 2U
100
101
102
103
104
105
10−2
100
102
104
Batch size
Tim
e (s
ec)
GPU Kernel
CPU −−> GPU
GPU −−> CPU
Rcv1: GPU Profiling, 2U
100
101
102
103
104
105
10−2
100
102
104
Batch size
Tim
e (s
ec)
Spam: GPU Profiling, 4U−Bit
GPU Kernel
CPU −−> GPU
GPU −−> CPU
100
101
102
103
104
105
10−2
10−1
100
101
102
103
104
105
Batch size
Tim
e (s
ec)
Rcv1: GPU Profiling, 4U−Bit
GPU −−> CPU
CPU −−> GPU
GPU Kernel
GPU kernel times dominate the cost. The cost is not sensitive to the batch size.
————————
Ref: Li, Shrivastava, Konig, b-Bit Minwise Hashing in Practice, arXiv:1205.2958
Ref: Li, Shrivastava, Konig, GPU-Based Minwise Hashing, WWW’12 Poster
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 74
Online Learning (SGD) with Hashing
b-Bit minwise hashing can also be highly beneficial for online learning in that it
significantly reduces the data loading time which is usually the bottleneck for
online algorithms, especially if many epochs are required.
10−10
10−8
10−6
10−4
10−2
80
85
90
95
100
b = 1b = 2
b = 4
λ
Acc
urac
y (%
)
SGD SVM: k = 50Spam: Accuracy
b = 6
10−10
10−8
10−6
10−4
10−2
80
85
90
95
100
b = 1
b = 2
b = 4
λA
ccur
acy
(%)
SGD SVM: k = 200Spam: Accuracy
λ: 1/C , following the convention in online learning literature.
Red dashed curve: Results based on the original data
————————
Ref: Li, Shrivastava, Konig, b-Bit Minwise Hashing in Practice, arXiv:1205.2958
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 75
Conclusion
• BigData everywhere. For many operations including clustering, classification,
near neighbor search, etc. exact answers are often not necessary.
• Very sparse random projections (KDD 2006) work as well as dense random
projections. They however require a large number of projections.
• Minwise hashing is a standard procedure in the context of search. b-bit
minwise hashing is a substantial improvement by using only b bits per hashed
value instead of 64 bits (e.g., a 20-fold reduction in space).
• b-bit minwise hashing provides a very simple strategy for efficient linear
learning, by expanding the bits into vectors of length 2b × k.
• We can directly build hash tables from b-bit minwise hashing for sub-linear
time near neighbor search.
• It can be substantially more accurate than random projections (and variants).
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 76
• Major drawbacks of b-bit minwise hashing:
– It was designed (mostly) for binary static data.
– It needs a fairly high-dimensional representation (2b × k) after expansion.
– Expensive preprocessing and consequently expensive testing phrase for
unprocessed data. Parallelization solutions based on (e.g.,) GPUs offer
very good performance in terms of time but they are not energy-efficient.