Data Science Research Day (Talk)
Transcript of Data Science Research Day (Talk)
![Page 1: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/1.jpg)
Variable Bit Quantisation for Large Scale Search
Sean Moran
Final year PhD student, Institute of Language, Cognition and Computation
12th September 2014
![Page 2: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/2.jpg)
Variable Bit Quantisation for Large Scale Search
Nearest Neighbour Search
Variable Bit Quantisation for LSH
Evaluation
Summary
![Page 4: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/4.jpg)
Nearest Neighbour Search
- Given a query q, find its nearest neighbour NN(q) from X = {x1, x2, ..., xN}, where NN(q) = argmin_{x ∈ X} dist(q, x)
- dist(q, x) is a distance measure, e.g. Euclidean or cosine
[Figure: a query point q and its 1-NN among database points in a 2-D (x, y) feature space]
- Generalised variant: K-nearest neighbour search, KNN(q)
- Brute force compares the query to all N database items: O(N) query time
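The brute-force baseline can be sketched in a few lines (a minimal illustration, not the talk's code; NumPy and Euclidean distance are assumptions):

```python
import numpy as np

def nearest_neighbour(q, X):
    """Exhaustive 1-NN: compare the query against all N database
    items, so query time is O(N)."""
    dists = np.linalg.norm(X - q, axis=1)  # Euclidean dist(q, x) per row
    return int(np.argmin(dists))           # index of argmin_x dist(q, x)

X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])  # database, N = 3
q = np.array([1.1, 0.9])                             # query
print(nearest_neighbour(q, X))  # 1 -- the point (1, 1)
```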
![Page 5: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/5.jpg)
Example: First Story Detection in Twitter
- Real-time detection of first stories in the Twitter stream
- The Haiti earthquake struck at 21:53; the first story appeared at 22:17 UTC
- 22:17:43 justinholtweb NOT expecting Tsunami on east coast after haiti earthquake. good news.
- State-of-the-art FSD uses NN search under the bonnet
- Problems: dimensionality (1 million+ features) and data volume (250 GB/day)
- Hashing-based approximate NN search operates in O(1) query time [1]

[1] Real-Time Detection, Tracking and Monitoring of Discovered Events in Social Media. S. Moran et al. In ACL'14.
![Page 6: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/6.jpg)
Example: First Story Detection in Twitter
[Figure: tweet volume over time; at time T1 a story breaks: "Rapper Lil Wayne ends up in hospital after a skateboarding accident"]
![Page 7: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/7.jpg)
Example: First Story Detection in Twitter
[Figure: tweet volume over time; at T2: "Thor Hushovd wins stage 13 of Tour de France"]
![Page 8: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/8.jpg)
Example: First Story Detection in Twitter
[Figure: tweet volume over time; at T3: "Magnitude 5.4 earthquake hits western Japan"]
![Page 9: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/9.jpg)
Example: First Story Detection in Twitter
[Figure: tweet volume over time; at T8: "USGS and Nat Weather Service NOT expecting Tsunami on east coast after haiti earthquake."]
![Page 10: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/10.jpg)
Hashing-based approximate NN search
[Figure: tweets arriving from times T1 to T7 are hashed (H) into the database]
![Page 11: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/11.jpg)
Hashing-based approximate NN search
[Figure: each database tweet is hashed (H) to a bitcode (e.g. 110101, 010111, 010101, 111101, ...) that keys a hash-table bucket]
![Page 12: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/12.jpg)
Hashing-based approximate NN search
[Figure: the query tweet from T8 is hashed (H) with the same function as the database]
![Page 13: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/13.jpg)
Hashing-based approximate NN search
[Figure: the query's bitcode indexes the matching bucket in the hash table]
![Page 14: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/14.jpg)
Hashing-based approximate NN search
[Figure: similarity is computed only between the query and the tweets in its bucket]
![Page 15: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/15.jpg)
Hashing-based approximate NN search
[Figure: the most similar tweets in the bucket are returned as approximate nearest neighbours]
![Page 16: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/16.jpg)
Hashing-based approximate NN search
[Figure: a query with no sufficiently similar nearest neighbour is flagged as a first story]
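The pipeline in these slides can be sketched with an ordinary dictionary as the hash table (a toy illustration; the hash function below is a hypothetical stand-in and is not locality sensitive — real LSH is introduced on the next slides):

```python
from collections import defaultdict

def H(text):
    """Toy deterministic 6-bit hash -- a stand-in for a real LSH
    function (this one is NOT locality sensitive; it only
    illustrates the table mechanics)."""
    return format(sum(map(ord, text)) % 64, '06b')

# Index the database: bucket tweets by bitcode
table = defaultdict(list)
for tweet in ["lil wayne hospital", "tour de france stage 13",
              "earthquake western japan"]:
    table[H(tweet)].append(tweet)

# Query: hash, then compare only against the matching bucket
query = "earthquake western japan"
candidates = table[H(query)]        # O(1) bucket lookup, not an O(N) scan
is_first_story = query not in candidates
print(candidates, is_first_story)
```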
![Page 17: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/17.jpg)
Locality Sensitive Hashing (LSH)
[Figure: data points in a 2-D (x, y) feature space]
![Page 18: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/18.jpg)
Locality Sensitive Hashing (LSH)
[Figure: two hyperplanes h1, h2 with normals n1, n2 partition the (x, y) space into regions with bitcodes 11, 01, 00, 10]
![Page 19: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/19.jpg)
Locality Sensitive Hashing (LSH)
[Figure: the query q lands in region 11, i.e. h1(q) = 1 and h2(q) = 1]
![Page 20: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/20.jpg)
Step 1: Projection
[Figure: the query q is projected onto the hyperplane normals n1 and n2]
![Page 21: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/21.jpg)
Step 2: Single Bit Quantisation (SBQ)
[Figure: the projection of q onto n2 is quantised to 0 or 1 by a single threshold t]
- The threshold is typically zero (sign function): sgn(n2 · q)
- Generate the full 2-bit hash key (bitcode) by concatenation:
  g(q) = h1(q) ⊕ h2(q) = sgn(n1 · q) ⊕ sgn(n2 · q) = 1 ⊕ 1 = 11
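Both steps fit in a few lines (a sketch of sign-random-projection LSH; the specific hyperplane normals are illustrative assumptions):

```python
import numpy as np

def lsh_bitcode(x, normals):
    """Step 1: project x onto each hyperplane normal n_i.
    Step 2: single bit quantisation -- threshold each projection
    at zero (sign function) and concatenate the bits."""
    projections = normals @ x              # n_i . x for each hyperplane
    bits = (projections > 0).astype(int)   # sgn(...) mapped to {0, 1}
    return ''.join(map(str, bits))         # g(x) = h1(x) concat h2(x) ...

normals = np.array([[1.0, -1.0],   # n1
                    [1.0,  1.0]])  # n2: two hyperplanes through the origin
q = np.array([2.0, 1.0])
print(lsh_bitcode(q, normals))  # '11': n1.q = 1 > 0 and n2.q = 3 > 0
```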
![Page 22: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/22.jpg)
Many more methods exist...
- Very active area of research:
  - Kernel methods [3]
  - Spectral methods [4] [5]
  - Neural networks [6]
  - Loss-based methods [7]
- Commonality: all use single bit quantisation (SBQ)
[3] M. Raginsky and S. Lazebnik. Locality-Sensitive Binary Codes from Shift-Invariant Kernels. NIPS '09.
[4] Y. Weiss, A. Torralba and R. Fergus. Spectral Hashing. NIPS '08.
[5] J. Wang, S. Kumar and S.-F. Chang. Semi-Supervised Hashing for Large-Scale Search. PAMI '12.
[6] R. Salakhutdinov and G. Hinton. Semantic Hashing. NIPS '08.
[7] B. Kulis and T. Darrell. Learning to Hash with Binary Reconstructive Embeddings. NIPS '09.
![Page 23: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/23.jpg)
Variable Bit Quantisation for Large Scale Search
Nearest Neighbour Search
Variable Bit Quantisation for LSH
Evaluation
Summary
![Page 24: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/24.jpg)
Problem 1: SBQ leads to high quantisation errors
- A threshold at zero can separate many related tweets:

[Figure: histogram of projected values (Projected Value, -1.5 to 1.5, vs. Count, 0 to 6000); the distribution peaks around zero, exactly where the SBQ threshold sits]
![Page 25: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/25.jpg)
Problem 2: some hyperplanes are better than others
[Figure: left, hyperplanes h1 and h2 with normals n1, n2 partition the (x, y) space into regions 11, 01, 00, 10; right, the same data viewed along projected dimensions 1 and 2]
![Page 28: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/28.jpg)
Threshold Positioning
- Multiple bits per hyperplane require multiple thresholds [8]
- F-score optimisation: maximise the number of related tweets falling inside the same thresholded regions:

[Figure: two placements of thresholds t1, t2, t3 along projected dimension n2, defining regions 00, 01, 10, 11; a good placement achieves an F1 score of 1.00, a poor one 0.44]
[8] Neighbourhood Preserving Quantisation for LSH. S. Moran et al. In SIGIR'13.
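The F-score criterion can be sketched as follows (hypothetical toy data; treating a pair as "retrieved" if both points land in the same thresholded region, and "relevant" if it is a true neighbour pair, is an assumption about the exact definition in [8]):

```python
import numpy as np

def f1_threshold_score(proj, neighbour_pairs, thresholds):
    """F1 measure of neighbourhood preservation for one projected
    dimension: precision/recall over point pairs, where a pair is
    retrieved if both points fall in the same thresholded region."""
    regions = np.digitize(proj, thresholds)   # region index per point
    n = len(proj)
    relevant = {tuple(sorted(p)) for p in neighbour_pairs}
    retrieved = {(i, j) for i in range(n) for j in range(i + 1, n)
                 if regions[i] == regions[j]}
    tp = len(relevant & retrieved)
    if not retrieved or not relevant or tp == 0:
        return 0.0
    precision, recall = tp / len(retrieved), tp / len(relevant)
    return 2 * precision * recall / (precision + recall)

# Two tight groups of related tweets on the projected dimension
proj = np.array([-1.2, -1.0, -0.9, 0.8, 1.0, 1.1])
pairs = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]
print(f1_threshold_score(proj, pairs, [0.0]))    # 1.0: threshold well placed
print(f1_threshold_score(proj, pairs, [-1.05]))  # lower: splits the left group
```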
![Page 29: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/29.jpg)
Variable Bit Allocation
- The F-score measures how well the neighbourhood is preserved [9]:

[Figure: hyperplane n1 is assigned 0 bits, since zero thresholds do just as well as one or more (F1 score 0.25); hyperplane n2 is assigned 2 bits, since its three thresholds t1, t2, t3 perfectly preserve the neighbourhood structure over regions 00, 01, 10, 11 (F1 score 1.00)]

- Compute the bit allocation that maximises the cumulative F-score
- Bit allocation is solved as a binary integer linear program (BILP)

[9] Variable Bit Quantisation for LSH. S. Moran et al. In ACL'13.
![Page 30: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/30.jpg)
Variable Bit Allocation
max ‖F ◦ Z‖
subject to ‖Z_h‖ = 1, h ∈ {1 ... B}
           ‖Z ◦ D‖ ≤ B
           Z is binary

- F contains the F-scores per hyperplane, per bit count
- Z is an indicator matrix specifying the bit allocation
- D is a constraint matrix
- B is the bit budget
- ‖·‖ denotes the Frobenius L1 norm
- ◦ denotes the Hadamard (elementwise) product

[9] Variable Bit Quantisation for LSH. S. Moran et al. In ACL'13.
![Page 34: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/34.jpg)
Variable Bit Allocation
max ‖F ◦ Z‖
subject to ‖Z_h‖ = 1, h ∈ {1 ... B}
           ‖Z ◦ D‖ ≤ B
           Z is binary

Worked example (rows b0, b1, b2 = 0, 1, 2 bits; columns h1, h2):

F (F-score per bit count, per hyperplane):

|    | h1   | h2   |
|----|------|------|
| b0 | 0.25 | 0.25 |
| b1 | 0.35 | 0.50 |
| b2 | 0.40 | 1.00 |

D (bit cost of each choice):

|    | h1 | h2 |
|----|----|----|
| b0 | 0  | 0  |
| b1 | 1  | 1  |
| b2 | 2  | 2  |

Z (solution):

|    | h1 | h2 |
|----|----|----|
| b0 | 1  | 0  |
| b1 | 0  | 0  |
| b2 | 0  | 1  |

- Sparse solution: lower-quality hyperplanes are discarded (here h1 receives 0 bits) [9].
[9] Variable Bit Quantisation for LSH. S. Moran et al. In ACL’13
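On a problem this small the BILP can be checked by exhaustive search (a stand-in for a real integer-programming solver, using the F and D values from this slide and an assumed budget B = 2):

```python
from itertools import product

# F-scores per hyperplane (columns h1, h2) for 0, 1 or 2 bits
# (rows b0, b1, b2), from the worked example on this slide
F = [[0.25, 0.25],
     [0.35, 0.50],
     [0.40, 1.00]]
D = [0, 1, 2]   # cost in bits of choosing row b0, b1 or b2
B = 2           # total bit budget (assumption for this example)

def allocate_bits(F, D, B):
    """Tiny stand-in for the BILP: pick one bit count per hyperplane,
    spend at most B bits, maximise the cumulative F-score.
    (Exhaustive search; the paper solves this as a binary integer
    linear program.)"""
    n_planes = len(F[0])
    best = None
    for choice in product(range(len(F)), repeat=n_planes):
        if sum(D[c] for c in choice) <= B:            # budget constraint
            score = sum(F[c][h] for h, c in enumerate(choice))
            if best is None or score > best[0]:
                best = (score, choice)
    return best

score, bits = allocate_bits(F, D, B)
print(bits, score)  # (0, 2) 1.25: 0 bits to h1, 2 bits to h2
```

The optimum matches the Z matrix above: h1 is discarded and both bits go to the better hyperplane h2.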
![Page 35: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/35.jpg)
Variable Bit Quantisation for Large Scale Search
Nearest Neighbour Search
Variable Bit Quantisation for LSH
Evaluation
Summary
![Page 36: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/36.jpg)
Evaluation Protocol
- Task: text and image retrieval
- Projections: LSH [2], Shift-Invariant Kernel Hashing (SIKH) [3], Spectral Hashing (SH) [4] and PCA-Hashing (PCAH) [5]
- Baselines: Single Bit Quantisation (SBQ), Manhattan Hashing (MQ) [10], Double-Bit Quantisation (DBQ) [11]
- Evaluation: how well do we retrieve the true NNs of each query?

[2] P. Indyk and R. Motwani. Approximate Nearest Neighbors: Removing the Curse of Dimensionality. STOC '98.
[3] M. Raginsky and S. Lazebnik. Locality-Sensitive Binary Codes from Shift-Invariant Kernels. NIPS '09.
[4] Y. Weiss, A. Torralba and R. Fergus. Spectral Hashing. NIPS '08.
[5] J. Wang, S. Kumar and S.-F. Chang. Semi-Supervised Hashing for Large-Scale Search. PAMI '12.
[10] W. Kong, W. Li and M. Guo. Manhattan Hashing for Large-Scale Image Retrieval. SIGIR '12.
[11] W. Kong and W. Li. Double Bit Quantisation for Hashing. AAAI '12.
![Page 37: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/37.jpg)
AUPRC across different projections (variable # bits)
Images (32 bits):

| Projection | SBQ | MQ | DBQ | VBQ |
|---|---|---|---|---|
| SIKH | 0.042 | 0.046 | 0.047 | 0.161 |
| LSH | 0.119 | 0.091 | 0.066 | 0.207 |
| SH | 0.051 | 0.144 | 0.111 | 0.202 |
| PCAH | 0.036 | 0.132 | 0.107 | 0.219 |

Text (128 bits):

| Projection | SBQ | MQ | DBQ | VBQ |
|---|---|---|---|---|
| SIKH | 0.102 | 0.112 | 0.087 | 0.389 |
| LSH | 0.276 | 0.201 | 0.175 | 0.538 |
| SH | 0.033 | 0.028 | 0.030 | 0.154 |
| PCAH | 0.095 | 0.034 | 0.027 | 0.154 |
- Variable bit allocation yields substantial gains in retrieval accuracy
- VBQ is an effective multimodal quantisation scheme
![Page 38: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/38.jpg)
AUPRC for LSH across a broad bit range
[Figure: AUPRC vs. number of bits for SBQ, MQ and VBQ; (a) text, 32-256 bits (AUPRC up to ~0.7), (b) images, 8-64 bits (AUPRC up to ~0.35); VBQ leads across the range]

- VBQ is effective for both long and short bitcodes (hash keys)
![Page 39: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/39.jpg)
Variable Bit Quantisation for Large Scale Search
Nearest Neighbour Search
Variable Bit Quantisation for LSH
Evaluation
Summary
![Page 40: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/40.jpg)
Summary
- Proposed a novel data-driven scheme (VBQ) that adaptively assigns a variable number of bits per LSH hyperplane
- Hyperplanes that better preserve the neighbourhood structure are afforded more bits from the budget
- VBQ substantially increases LSH retrieval performance across text and image datasets
- Current work: a method to couple the quantisation and projection stages of LSH
![Page 41: Data Science Research Day (Talk)](https://reader034.fdocuments.in/reader034/viewer/2022052620/55761a60d8b42a4e1c8b477a/html5/thumbnails/41.jpg)
Thank you for your attention!
- FSD live system: goo.gl/Q7WQOk [1]
  - Runs over the live Twitter stream in real time
  - Sub-second detection latency via approximate NN search (on a single CPU!)
- Papers: www.seanjmoran.com [1][8][9]
- Contact: [email protected]
[1] Real-Time Detection, Tracking and Monitoring of Discovered Events in Social Media. S. Moran et al. In ACL'14.
[8] Neighbourhood Preserving Quantisation for LSH. S. Moran et al. In SIGIR'13.
[9] Variable Bit Quantisation for LSH. S. Moran et al. In ACL'13.