Approximate Nearest Neighbor Queries with a Tiny Index
Transcript of Approximate Nearest Neighbor Queries with a Tiny Index
NICTA Machine Learning Research Group Seminar, 6/26/14
Wei Wang, University of New South Wales
Outline
- Overview of our research
- SRS: c-approximate nearest neighbor with a tiny index [PVLDB 2015]
- Conclusions
Research Projects
- Similarity query processing
- Keyword search on (semi-)structured data
- Graphs
- Succinct data structures
NN and c-ANN Queries
Definitions:
- A set of points, D = ∪_{i=1..n} {O_i}, in d-dimensional Euclidean space (d is large, e.g., hundreds)
- Given a query point q, find the closest point, O*, in D
- Relaxed version: return a c-ANN point, i.e., a point whose distance to q is at most c · Dist(O*, q); the algorithm may return such a point only with at least constant probability
- Example: for D = {a, b, x}, the NN is a, and x may be returned as a c-ANN point
- Also known as (1+ε)-ANN
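These definitions can be checked directly with a brute-force scan. The sketch below is illustrative only; the helper names (`nn`, `is_c_ann`) and the toy point set are not from the talk:

```python
import math

def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nn(D, q):
    """Exact nearest neighbor by linear scan: O(d*n) time."""
    return min(D, key=lambda o: dist(o, q))

def is_c_ann(o, D, q, c):
    """Check the c-ANN condition: Dist(o, q) <= c * Dist(O*, q)."""
    return dist(o, q) <= c * dist(nn(D, q), q)

# Toy instance mirroring the slide's D = {a, b, x}:
a, b, x = (0.0, 1.0), (5.0, 5.0), (1.5, 0.0)
D = [a, b, x]
q = (0.0, 0.0)
assert nn(D, q) == a            # the exact NN is a
assert is_c_ann(x, D, q, 2.0)   # x is a valid 2-ANN answer
```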
Applications and Challenges
Applications:
- Feature vectors: data mining, multimedia databases
- A fundamental geometric problem: the "post-office problem"
- Quantization in coding/compression, ...
Challenges:
- Curse of dimensionality / concentration of measure: hard to find algorithms sub-linear in n and polynomial in d
- Large data size: 1 KB for a single point with 256 dimensions
Existing Solutions
NN:
- O(d^5 · log n) query time, O(n^(2d+δ)) space
- O(d · n^(1−ε(d))) query time, O(dn) space
- Linear scan: O(dn/B) I/Os, O(dn) space
(1+ε)-ANN:
- O(log n + 1/ε^((d−1)/2)) query time, O(n · log(1/ε)) space
- Probabilistic tests remove the exponential dependency on d
- Fast JLT: O(d · log d + ε^(−3) · log² n) query time, O(n^max(2, ε^(−2))) space
- LSH-based: Õ(d · n^(ρ+o(1))) query time, Õ(n^(1+ρ+o(1)) + nd) space, where ρ = 1/(1+ε) + o_c(1)
LSH is the best approach using sub-quadratic space; linear scan is (practically) the best approach using linear space and time.
Approximate NN for Multimedia Retrieval
- Cover-tree
- Spill-tree
- Reduction to NN search with Hamming distance
- Dimensionality reduction (e.g., PCA)
- Quantization-based approaches (e.g., CK-Means)
Locality Sensitive Hashing (LSH)
Equality search:
- Index: store o into bucket h(o)
- Query: retrieve every o in the bucket h(q); verify whether o = q
LSH:
- ∀h ∈ LSH-family, Pr[h(q) = h(o)] ∝ 1/Dist(q, o) (technically, dependent on a radius r); h: R^d → Z
- "Near-by" points (blue in the figure) have a higher chance of colliding with q than "far-away" points (red)
LSH is the best approach using sub-quadratic space.
LSH: Indexing & Query Processing
Index:
- For a fixed r: sig(o) = ⟨h1(o), h2(o), …, hk(o)⟩; store o into bucket sig(o)
- Iteratively increase r
Query:
- Search with a fixed r: retrieve and "verify" points in the bucket sig(q); repeat this L times (boosting, to reach constant success probability)
- Galloping search to find the first good r: reduces query cost, but incurs additional cost and gives only a c²-quality guarantee
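The index/query loop above can be sketched with the classic 2-stable LSH family h(o) = ⌊(⟨a, o⟩ + b)/r⌋ for a fixed radius. All concrete values below (dimensions, k, L, bucket width) are illustrative assumptions, not the talk's configuration:

```python
import math, random
from collections import defaultdict

random.seed(7)
d, k, L, r = 16, 4, 8, 4.0  # dims, hashes per key, tables, bucket width

def make_hash():
    """One (r, cr)-sensitive hash for Euclidean distance:
    h(o) = floor((<a, o> + b) / r) with a ~ N(0, I), b ~ U[0, r)."""
    a = [random.gauss(0, 1) for _ in range(d)]
    b = random.uniform(0, r)
    return lambda o: math.floor((sum(ai * oi for ai, oi in zip(a, o)) + b) / r)

# L composite keys sig(o) = <h1(o), ..., hk(o)>, one hash table each.
tables = [([make_hash() for _ in range(k)], defaultdict(list)) for _ in range(L)]

def sig(hashes, o):
    return tuple(h(o) for h in hashes)

def insert(o):
    for hashes, table in tables:
        table[sig(hashes, o)].append(o)

def query(q):
    """Retrieve candidates colliding with q in any of the L tables."""
    cands = set()
    for hashes, table in tables:
        cands.update(table.get(sig(hashes, q), []))
    return cands

pts = [tuple(random.gauss(0, 1) for _ in range(d)) for _ in range(200)]
for p in pts:
    insert(p)
assert pts[0] in query(pts[0])  # a point always collides with itself
```

Verification of the retrieved candidates against the true distance (the "verify" step) would follow the `query` call.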
Locality Sensitive Hashing (LSH) Variants
- Standard LSH: c²-ANN via binary search over R(c^i, c^(i+1))-NN problems
LSH on external memory:
- LSB-forest [SIGMOD'09, TODS'10]: a different reduction from c²-ANN to a single R(c^i, c^(i+1))-NN problem; O((dn/B)^0.5) query, O((dn/B)^1.5) space
- C2LSH [SIGMOD'12]: does not use composite hash keys; performs fine-granular counting of the number of collisions in m LSH projections; O(n·log(n)/B) query, O(n·log(n)/B) space
- SRS (ours): O(n/B) query, O(n/B) space
Weakness of External-Memory LSH Methods
- Existing methods use super-linear space: thousands (or more) of hash tables are needed if done rigorously, so people resort to hashing into binary codes (and using Hamming distance) for multimedia retrieval
- Can only handle c where c = x² for integer x ≥ 2, to enable reusing the hash tables (merging buckets)
- Valuable information is lost (due to quantization)
- Updates? (changes to n and c)
Index size on the Audio dataset (40 MB):
  LSB-forest: 1500 MB
  C2LSH: 127 MB
  SRS (ours): 2 MB
SRS: Our Proposed Method
- Solves c-ANN queries with O(n) query time and O(n) space, with constant probability; the constants hidden in the O(·) are very small
- The early-termination condition is provably effective
- Advantages: small index, rich functionality, simple
- Central idea: reduce a c-ANN query in d dimensions to a kNN query in m dimensions with filtering; model the distribution of m "stable random projections"
2-stable Random Projection
- Let D be the 2-stable distribution, i.e., the standard Normal distribution N(0, 1)
- For two i.i.d. random variables A ~ D and B ~ D: x·A + y·B ~ (x² + y²)^(1/2) · D
- Illustration: projecting a vector v onto random directions r1 and r2 gives ⟨v, r1⟩ ~ N(0, ǁvǁ²) and ⟨v, r2⟩ ~ N(0, ǁvǁ²)
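A quick numerical check of the 2-stability property; sample sizes and tolerances below are arbitrary choices:

```python
import random, statistics

random.seed(0)

# 2-stability: x*A + y*B ~ sqrt(x^2 + y^2) * N(0, 1)
x, y = 3.0, 4.0                       # sqrt(x^2 + y^2) = 5
s = [x * random.gauss(0, 1) + y * random.gauss(0, 1) for _ in range(200_000)]
assert abs(statistics.fmean(s)) < 0.05
assert abs(statistics.pstdev(s) - 5.0) < 0.05

# Consequence: a random projection <v, r> with r ~ N(0, I) is N(0, ||v||^2)
v = [1.0, 2.0, 2.0]                   # ||v|| = 3
projs = [sum(vi * random.gauss(0, 1) for vi in v) for _ in range(200_000)]
assert abs(statistics.pstdev(projs) - 3.0) < 0.05
```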
Dist(O) and ProjDist(O) and Their Relationship
- With m 2-stable random projections r1, …, rm, a point O in d dimensions maps to Proj(O) = (z1, …, zm) in m dimensions
- Let V = O − Q, so ǁVǁ = Dist(O); then z_i := ⟨V, r_i⟩ ~ N(0, ǁVǁ²)
- z1² + … + zm² ~ ǁVǁ² · χ²_m, i.e., a scaled chi-squared distribution with m degrees of freedom
- Ψ_m(x): cdf of the standard χ²_m distribution
- Hence ProjDist²(O), the squared distance between Proj(Q) and Proj(O), is distributed as Dist²(O) · χ²_m
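The scaled χ²_m relationship can be verified numerically. For even m, Ψ_m has a simple closed form (the Erlang cdf); the function name, sample sizes, and seeds below are illustrative:

```python
import math, random

def chi2_cdf_even(x, m):
    """Psi_m(x): cdf of the chi-squared distribution with an even
    number m of degrees of freedom (closed form via the Erlang cdf)."""
    return 1.0 - math.exp(-x / 2.0) * sum(
        (x / 2.0) ** j / math.factorial(j) for j in range(m // 2))

random.seed(1)
m, d = 6, 16
v = [random.gauss(0, 1) for _ in range(d)]     # V = O - Q, so Dist(O) = ||v||
dist2 = sum(vi * vi for vi in v)

def ratio():
    """One draw of ProjDist^2(O) / Dist^2(O) using m fresh projections."""
    zs = [sum(vi * random.gauss(0, 1) for vi in v) for _ in range(m)]
    return sum(z * z for z in zs) / dist2

# Empirical cdf of the ratio at x = m should match Psi_m(m) (~0.577 for m = 6)
frac = sum(ratio() <= m for _ in range(10_000)) / 10_000.0
assert abs(frac - chi2_cdf_even(m, m)) < 0.02
```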
LSH-like Property
- Intuitive idea: if Dist(O1) ≪ Dist(O2), then ProjDist(O1) < ProjDist(O2) with high probability
- But the converse is NOT true: with only a few projections, the NN object in the projected space is most likely not the NN object in the original space, as many far-away objects are projected before the NN/c-ANN objects
- However, we can bound the expected number of such cases (say, by T)
- Solution: perform incremental k-NN search in the projected space until T objects have been accessed, plus an early-termination test
Indexing
Finding the minimum m:
- Input: n, c, and T := the maximum number of points to be accessed by the algorithm
- Output: m, the number of 2-stable random projections, and T' ≤ T, a better bound on T
- m = O(n/T); we use T = O(n), so m = O(1), which achieves a linear-space index
Then:
- Generate m 2-stable random projections, giving n projected points in an m-dimensional space
- Index these projections using any index that supports incremental kNN search, e.g., an R-tree
- Space cost: O(m · n) = O(n)
SRS-αβ(T, c, p_τ)
1. Compute Proj(Q)
2. Perform incremental kNN search from Proj(Q), for k = 1 to T:   // stopping condition α
   - Compute Dist(O_k); maintain O_min = argmin_{1≤i≤k} Dist(O_i)
   - If the early-termination test (c, p_τ) returns TRUE, break   // stopping condition β
3. Return O_min

Early-termination test: Ψ_m( c² · ProjDist²(O_k) / Dist²(O_min) ) > p_τ

Example setting: c = 4, d = 256, m = 6, T = 0.00242n, B = 1024, p_τ = 0.18 gives index size 0.0059n, query cost 0.0084n, and success probability 0.13.

Main Theorem: SRS-αβ returns a c-ANN point with probability p_τ − f(m, c), with O(n) I/O cost.
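A compact, in-memory sketch of the whole pipeline. A sorted flat list stands in for the R-tree's incremental kNN search, and every parameter value and helper name is an illustrative assumption, not the paper's implementation:

```python
import math, random

def psi(x, m):
    """Chi-squared cdf for an even number m of degrees of freedom."""
    return 1.0 - math.exp(-x / 2.0) * sum(
        (x / 2.0) ** j / math.factorial(j) for j in range(m // 2))

def build_index(data, m, seed=0):
    """m 2-stable random projections of every point; a real implementation
    would store them in an R-tree supporting incremental kNN search."""
    rng = random.Random(seed)
    d = len(data[0])
    R = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(m)]
    project = lambda o: [sum(ri * oi for ri, oi in zip(r, o)) for r in R]
    return R, [project(o) for o in data]

def srs_query(data, index, q, T, c, p_tau):
    """SRS-alpha-beta sketch: scan points in order of projected distance,
    stop after T points (condition alpha) or when the early-termination
    test Psi_m(c^2 * ProjDist^2(O_k) / Dist^2(O_min)) > p_tau (beta)."""
    R, projs = index
    m = len(R)
    pq = [sum(ri * qi for ri, qi in zip(r, q)) for r in R]
    pd2 = [sum((a - b) ** 2 for a, b in zip(p, pq)) for p in projs]
    order = sorted(range(len(data)), key=lambda i: pd2[i])
    best, best_d2 = None, float("inf")
    for i in order[:T]:                                  # condition alpha
        d2 = sum((a - b) ** 2 for a, b in zip(data[i], q))
        if d2 < best_d2:
            best, best_d2 = data[i], d2
        if psi(c * c * pd2[i] / best_d2, m) > p_tau:     # condition beta
            break
    return best

# Toy usage: the returned point should satisfy the c-ANN condition.
random.seed(3)
data = [tuple(random.gauss(0, 1) for _ in range(32)) for _ in range(500)]
index = build_index(data, m=6)
q = tuple(random.gauss(0, 1) for _ in range(32))
ans = srs_query(data, index, q, T=100, c=4, p_tau=0.18)
d2 = lambda o: sum((a - b) ** 2 for a, b in zip(o, q))
assert d2(ans) <= 16 * min(d2(o) for o in data)  # within c^2 in squared distance
```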
Variations of SRS-αβ(T, c, p_τ)
The same algorithm supports three variations:
1. SRS-α: uses only stopping condition α; better quality; query cost is O(T)
2. SRS-β: uses only stopping condition β; best quality; query cost bounded by O(n); handles c = 1
3. SRS-αβ(T, c', p_τ): better quality; query cost bounded by O(T)
All have success probability at least p_τ.
Other Results
- Easily extended to support top-k c-ANN queries (k > 1); no previously known guarantee on the correctness of the returned results
- We guarantee correctness with probability at least p_τ if SRS-αβ stops due to the early-termination condition: ≈100% in practice (97% in theory)
Analysis
Stopping Condition α
- "Near" point: the NN point; its distance =: r
- "Far" points: points whose distance > c · r
- Then for any κ > 0 and any point o:
  - Pr[ProjDist(o) ≤ κ·r | o is a near point] ≥ Ψ_m(κ²)
  - Pr[ProjDist(o) ≤ κ·r | o is a far point] ≤ Ψ_m(κ²/c²)
  - Both follow because ProjDist²(o)/Dist²(o) ~ χ²_m
- P1 := Pr[the NN point is projected before κ·r] ≥ Ψ_m(κ²)
- P2 := Pr[fewer than T far points are projected before κ·r] ≥ 1 − Ψ_m(κ²/c²) · (n/T), by Markov's inequality
- Choose κ such that P1 + P2 − 1 > 0; this is feasible due to the good concentration bounds for χ²_m
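One can check numerically that a feasible κ exists for the example parameters quoted earlier (c = 4, m = 6, T = 0.00242n); the grid search below is an illustrative sketch:

```python
import math

def psi(x, m=6):
    """Chi-squared cdf for an even number m of degrees of freedom."""
    return 1.0 - math.exp(-x / 2.0) * sum(
        (x / 2.0) ** j / math.factorial(j) for j in range(m // 2))

c, m, T_over_n = 4.0, 6, 0.00242     # parameters from the example setting

def lower_bound(kappa):
    """P1 + P2 - 1 for a given threshold kappa."""
    p1 = psi(kappa ** 2, m)                            # NN projected before kappa*r
    p2 = 1.0 - psi(kappa ** 2 / c ** 2, m) / T_over_n  # Markov: < T far points first
    return p1 + p2 - 1.0

best = max(lower_bound(k / 100.0) for k in range(1, 500))
assert best > 0.1   # a suitable kappa gives a positive success probability
```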
Choosing κ
[Figure: densities of the two scaled χ²_m distributions for c = 4 (mode at m − 2), with the threshold κ·r between the blue (near, scale 4) curve and the red (far, scale 4·c² = 64) curve]
ProjDist(O_T): Case I
[Figure: distribution of ProjDist(o) in m dimensions, with the threshold ProjDist(O_T); here O_min = the NN point]
Consider the cases where both conditions (on near and far points) hold, which happens with probability at least P1 + P2 − 1.
ProjDist(O_T): Case II
[Figure: distribution of ProjDist(o) in m dimensions, with the threshold ProjDist(O_T); here O_min = a c-NN point]
Again, both conditions hold with probability at least P1 + P2 − 1.
Early-termination Condition (β)
- Proof omitted here; it also relies on the fact that the sum of the m squared projected distances follows a scaled χ²_m distribution
- Key to handling the case where c = 1: returns the NN point with guaranteed probability, which is impossible for LSH-based methods
- Guarantees the correctness of the top-k c-ANN points returned when stopped by this condition; no previous method offers such a guarantee
Experiment Setup
- Algorithms: LSB-forest [SIGMOD'09, TODS'10], C2LSH [SIGMOD'12], SRS-* [VLDB'15]
- Data
- Measures: index size, query cost, result quality, success probability
Datasets
[Table of datasets; only the sizes 5.6 PB, 369 GB, and 16 GB are recoverable from the transcript]
Tiny Image Dataset (8M points, 384 dims)
- Fastest: SRS-αβ; slowest: C2LSH; result quality is the other way around
- SRS-α has quality comparable to C2LSH yet much lower cost
- SRS-* dominates LSB-forest
Approximate Nearest Neighbor
Empirically better than the theoretical guarantee:
- With 15% of the I/Os of a linear scan, returns the NN with probability 71%
- With 62% of the I/Os of a linear scan, returns the NN with probability 99.7%
Large Dataset (0.45 Billion Points)
Summary
- Reduces c-ANN queries in arbitrarily high-dimensional space to kNN queries in low-dimensional space
- Our index size is approximately m/d of the size of the data file
- Opens up a new direction for c-ANN queries in high-dimensional space: finding efficient solutions to the kNN problem in 6-10 dimensional space
Q&A
Similarity Query Processing project homepage: http://www.cse.unsw.edu.au/~weiw/project/simjoin.html