Approximate nearest neighbor methods and vector models – NYC ML meetup
erik-bernhardsson -
![Page 1: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/1.jpg)
Approximate nearest neighbors & vector
models
![Page 2: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/2.jpg)
I’m Erik
• @fulhack
• Author of Annoy, Luigi
• Currently CTO of Better
• Previously 5 years at Spotify
![Page 3: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/3.jpg)
What’s nearest neighbor(s)
• Let’s say you have a bunch of points
![Page 4: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/4.jpg)
Grab a bunch of points
![Page 5: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/5.jpg)
5 nearest neighbors
![Page 6: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/6.jpg)
20 nearest neighbors
![Page 7: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/7.jpg)
100 nearest neighbors
![Page 8: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/8.jpg)
…But what’s the point?
• vector models are everywhere
• lots of applications (language processing, recommender systems, computer vision)
![Page 9: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/9.jpg)
MNIST example
• 28x28 = 784-dimensional dataset
• Define distance in terms of pixels:
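The distance formula on this slide did not survive in the transcript. A minimal sketch of one common choice, assumed here rather than taken from the slide, is Euclidean distance over the flattened pixel intensities. The function name `pixel_distance` and the tiny 2x2 images are illustrative only.

```python
import math

def pixel_distance(img_a, img_b):
    """Euclidean distance between two images given as 2D lists of pixel intensities."""
    # Flatten each image row by row (a 28x28 MNIST digit becomes a 784-vector).
    flat_a = [p for row in img_a for p in row]
    flat_b = [p for row in img_b for p in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("images must have the same number of pixels")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(flat_a, flat_b)))

# Tiny 2x2 stand-ins for 28x28 digits (hypothetical data).
blank = [[0, 0], [0, 0]]
other = [[3, 0], [0, 4]]
print(pixel_distance(blank, other))  # 5.0
```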
![Page 10: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/10.jpg)
MNIST neighbors
![Page 11: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/11.jpg)
…Much better approach
1. Start with high dimensional data
2. Run dimensionality reduction to 10-1000 dims
3. Do stuff in a small dimensional space
![Page 12: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/12.jpg)
Deep learning for food
• Deep model trained on a GPU on 6M random pics downloaded from Yelp
[Figure: network architecture. Layer sizes: 156x156x32, 154x154x32, 152x152x32, 76x76x64, 74x74x64, 72x72x64, 36x36x128, 34x34x128, 32x32x128, 16x16x256, 14x14x256, 12x12x256, 6x6x512, 4x4x512, 2x2x512, 2048, 2048, 128, 1244. Legend: 3x3 convolutions, 2x2 maxpool, fully connected with dropout, bottleneck layer.]
![Page 13: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/13.jpg)
Distance in smaller space
1. Run image through the network
2. Use the 128-dimensional bottleneck layer as an item vector
3. Use cosine distance in the reduced space
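Step 3 can be sketched in a few lines. This is an illustration, not the code used for the talk: the function name `cosine_distance` and the 3-dimensional vectors (standing in for 128-dimensional bottleneck vectors) are assumptions.

```python
import math

def cosine_distance(u, v):
    """1 - cos(u, v): 0 for vectors pointing the same way, 1 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical item vectors; only the direction matters, not the length.
pizza = [1.0, 2.0, 3.0]
bigger_pizza = [2.0, 4.0, 6.0]
salad = [3.0, 0.0, -1.0]
print(cosine_distance(pizza, bigger_pizza))
print(cosine_distance(pizza, salad))
```

Because only the angle matters, scaling a vector leaves its distance to every other vector unchanged.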
![Page 14: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/14.jpg)
Nearest food pics
![Page 15: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/15.jpg)
Vector methods for text
• TF-IDF (old) – no dimensionality reduction
• Latent Semantic Analysis (1988)
• Probabilistic Latent Semantic Analysis (2000)
• Semantic Hashing (2007)
• word2vec (2013), RNN, LSTM, …
![Page 16: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/16.jpg)
Represent documents and/or words as f-dimensional vectors
[Figure: scatter plot with axes "Latent factor 1" and "Latent factor 2"; points labeled banana, apple, boat.]
![Page 17: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/17.jpg)
Vector methods for collaborative filtering
• Supervised methods: See everything from the Netflix Prize
• Unsupervised: Use NLP methods
![Page 18: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/18.jpg)
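CF vectors – examples
IPMF item item:
$$P(i \to j) = \frac{\exp(b_j^T b_i)}{Z_i} = \frac{\exp(b_j^T b_i)}{\sum_k \exp(b_k^T b_i)}$$
VECTORS:
$$p_{ui} = a_u^T b_i$$
$$\mathrm{sim}_{ij} = \cos(b_i, b_j) = \frac{b_i^T b_j}{|b_i|\,|b_j|}$$
$O(f)$

| $i$ | $j$ | $\mathrm{sim}_{i,j}$ |
|---|---|---|
| 2pac | 2pac | 1.0 |
| 2pac | Notorious B.I.G. | 0.91 |
| 2pac | Dr. Dre | 0.87 |
| 2pac | Florence + the Machine | 0.26 |
| Florence + the Machine | Lana Del Rey | 0.81 |

IPMF item item MDS:
$$P(i \to j) = \frac{\exp(-|b_j - b_i|^2)}{Z_i} = \frac{\exp(-|b_j - b_i|^2)}{\sum_k \exp(-|b_k - b_i|^2)}$$
$$\mathrm{sim}_{ij} = -|b_j - b_i|^2$$
$(u, i, \mathrm{count})$
$\partial L / \partial a_u$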
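The item-to-item probability P(i → j) above is a softmax over inner products of item vectors. A minimal sketch of that computation follows; the function name `transition_probs`, the dictionary layout, and the 2-dimensional vectors are all hypothetical, not taken from the talk.

```python
import math

def transition_probs(item_vectors, i):
    """P(i -> j) = exp(b_j . b_i) / sum_k exp(b_k . b_i) for every item j."""
    b_i = item_vectors[i]
    scores = {j: sum(x * y for x, y in zip(b_j, b_i))
              for j, b_j in item_vectors.items()}
    # Subtract the max score before exponentiating for numerical stability;
    # the shift cancels in the ratio.
    top = max(scores.values())
    exps = {j: math.exp(s - top) for j, s in scores.items()}
    z_i = sum(exps.values())
    return {j: e / z_i for j, e in exps.items()}

# Hypothetical 2-dimensional item vectors.
vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [-1.0, 0.0]}
probs = transition_probs(vectors, "a")
for item, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(item, round(p, 3))
```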
![Page 19: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/19.jpg)
Geospatial indexing
• Ping the world: https://github.com/erikbern/ping
• k-NN regression using Annoy
![Page 20: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/20.jpg)
Nearest neighbors the brute force way
• we can always do an exhaustive search to find the nearest neighbors
• imagine MySQL doing a linear scan for every query…
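For reference, the exhaustive search is only a few lines. It touches every point on every query, which is what makes it slow. This is a sketch using Euclidean distance; the function name `brute_force_knn` and the sample points are illustrative.

```python
import heapq
import math

def brute_force_knn(points, query, n):
    """Indices of the n points nearest to `query`, found by a full linear scan."""
    # One distance computation per stored point, on every single query.
    return heapq.nsmallest(n, range(len(points)),
                           key=lambda i: math.dist(points[i], query))

points = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.1, 0.0]]
print(brute_force_knn(points, [0.0, 0.0], 2))  # [0, 3]
```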
![Page 21: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/21.jpg)
Using word2vec’s brute force search
    $ time echo -e "chinese river\nEXIT\n" | ./distance GoogleNews-vectors-negative300.bin
    Qiantang_River 0.597229
    Yangtse 0.587990
    Yangtze_River 0.576738
    lake 0.567611
    rivers 0.567264
    creek 0.567135
    Mekong_river 0.550916
    Xiangjiang_River 0.550451
    Beas_river 0.549198
    Minjiang_River 0.548721
    real 2m34.346s
    user 1m36.235s
    sys 0m16.362s
![Page 22: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/22.jpg)
Introducing Annoy
• https://github.com/spotify/annoy
• mmap-based ANN library
• Written in C++, with Python and R bindings
• 585 stars on GitHub
![Page 23: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/23.jpg)
Using Annoy’s search
    $ time echo -e "chinese river\nEXIT\n" | python nearest_neighbors.py ~/tmp/word2vec/GoogleNews-vectors-negative300.bin 100000
    Yangtse 0.907756
    Yangtze_River 0.920067
    rivers 0.930308
    creek 0.930447
    Mekong_river 0.947718
    Huangpu_River 0.951850
    Ganges 0.959261
    Thu_Bon 0.960545
    Yangtze 0.966199
    Yangtze_river 0.978978
    real 0m0.470s
    user 0m0.285s
    sys 0m0.162s
![Page 24: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/24.jpg)
Using Annoy’s search
    $ time echo -e "chinese river\nEXIT\n" | python nearest_neighbors.py ~/tmp/word2vec/GoogleNews-vectors-negative300.bin 1000000
    Qiantang_River 0.897519
    Yangtse 0.907756
    Yangtze_River 0.920067
    lake 0.929934
    rivers 0.930308
    creek 0.930447
    Mekong_river 0.947718
    Xiangjiang_River 0.948208
    Beas_river 0.949528
    Minjiang_River 0.950031
    real 0m2.013s
    user 0m1.386s
    sys 0m0.614s
![Page 25: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/25.jpg)
(performance)
![Page 26: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/26.jpg)
1. Building an Annoy index
![Page 27: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/27.jpg)
Start with the point set
![Page 28: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/28.jpg)
Split it in two halves
![Page 29: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/29.jpg)
Split again
![Page 30: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/30.jpg)
Again…
![Page 31: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/31.jpg)
…more iterations later
![Page 32: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/32.jpg)
Side note: making trees small
• Split until K items in each leaf (K~100)
• Takes O(n/K) memory instead of O(n)
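The recursive splitting can be sketched as follows. The slides do not say how each split is chosen, so this sketch assumes one simple rule: pick two random points and split along the hyperplane equidistant from them. The dictionary node layout and the names `build_tree` and `leaves` are illustrative, not Annoy's internals.

```python
import random

def build_tree(points, indices, leaf_size, rng):
    """Split `indices` recursively until each leaf holds at most `leaf_size` items."""
    if len(indices) <= leaf_size:
        return {"leaf": indices}
    # Assumed split rule: hyperplane equidistant from two randomly chosen points.
    a, b = (points[i] for i in rng.sample(indices, 2))
    normal = [x - y for x, y in zip(a, b)]
    offset = -sum(n * (x + y) / 2 for n, x, y in zip(normal, a, b))

    def on_right(i):
        return sum(n * x for n, x in zip(normal, points[i])) + offset > 0

    left = [i for i in indices if not on_right(i)]
    right = [i for i in indices if on_right(i)]
    if not left or not right:
        # Degenerate split (the two sampled points coincide): stop here.
        return {"leaf": indices}
    return {"normal": normal, "offset": offset,
            "left": build_tree(points, left, leaf_size, rng),
            "right": build_tree(points, right, leaf_size, rng)}

def leaves(node):
    """Collect the index lists stored in all leaves of a tree."""
    if "leaf" in node:
        return [node["leaf"]]
    return leaves(node["left"]) + leaves(node["right"])

rng = random.Random(0)
points = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(1000)]
tree = build_tree(points, list(range(len(points))), leaf_size=100, rng=rng)
print([len(leaf) for leaf in leaves(tree)])
```

Stopping at K items per leaf means the tree stores roughly n/K split nodes rather than one per point.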
![Page 33: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/33.jpg)
Binary tree
![Page 34: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/34.jpg)
2. Searching
![Page 35: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/35.jpg)
Nearest neighbors
![Page 36: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/36.jpg)
Searching the tree
![Page 37: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/37.jpg)
![Page 38: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/38.jpg)
Problemo
• The point that’s the closest isn’t necessarily in the same leaf of the binary tree
• Two points that are really close may end up on different sides of a split
• Solution: go to both sides of a split if it’s close
![Page 39: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/39.jpg)
![Page 40: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/40.jpg)
![Page 41: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/41.jpg)
![Page 42: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/42.jpg)
Trick 1: Priority queue
• Traverse the tree using a priority queue
• sort by min(margin) for the path from the root
![Page 43: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/43.jpg)
Trick 2: many trees
• Construct trees randomly many times
• Use the same priority queue to search all of them at the same time
![Page 44: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/44.jpg)
![Page 45: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/45.jpg)
heap + forest = best
• Since we use a priority queue, we dive down the best splits first: the ones with the biggest margin
• More trees always help!
• Only constraint is more trees require more RAM
![Page 46: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/46.jpg)
Annoy query structure
1. Use priority queue to search all trees until we’ve found k items
2. Take union and remove duplicates (a lot)
3. Compute distance for remaining items
4. Return the nearest n items
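The four steps above can be sketched end to end. This is a simplified illustration under stated assumptions, not Annoy's actual code: the split rule (hyperplane equidistant from two random points), the dictionary node layout, Euclidean distance, and the names `build_tree`, `query` and `search_k` are all choices made for this example.

```python
import heapq
import itertools
import math
import random

def margin(node, point):
    """Signed distance-like score of `point` relative to the node's hyperplane."""
    return sum(n * x for n, x in zip(node["normal"], point)) + node["offset"]

def build_tree(points, indices, leaf_size, rng):
    if len(indices) <= leaf_size:
        return {"leaf": indices}
    a, b = (points[i] for i in rng.sample(indices, 2))
    normal = [x - y for x, y in zip(a, b)]
    offset = -sum(n * (x + y) / 2 for n, x, y in zip(normal, a, b))
    node = {"normal": normal, "offset": offset}
    left = [i for i in indices if margin(node, points[i]) <= 0]
    right = [i for i in indices if margin(node, points[i]) > 0]
    if not left or not right:  # degenerate split: keep as a leaf
        return {"leaf": indices}
    node["left"] = build_tree(points, left, leaf_size, rng)
    node["right"] = build_tree(points, right, leaf_size, rng)
    return node

def query(points, forest, q, n, search_k):
    # Step 1: one priority queue shared by all trees. Priority is the smallest
    # margin on the path from the root; heapq is a min-heap, so store it negated.
    # The counter breaks ties so that node dicts are never compared.
    counter = itertools.count()
    heap = [(-math.inf, next(counter), tree) for tree in forest]
    heapq.heapify(heap)
    candidates = []
    while heap and len(candidates) < search_k:
        neg_priority, _, node = heapq.heappop(heap)
        if "leaf" in node:
            candidates.extend(node["leaf"])
            continue
        m = margin(node, q)
        priority = -neg_priority
        heapq.heappush(heap, (-min(priority, m), next(counter), node["right"]))
        heapq.heappush(heap, (-min(priority, -m), next(counter), node["left"]))
    # Step 2: union of all visited leaves, duplicates removed.
    unique = set(candidates)
    # Steps 3 and 4: exact distances for the candidates, then the nearest n.
    return heapq.nsmallest(n, unique, key=lambda i: math.dist(points[i], q))

rng = random.Random(0)
points = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]
forest = [build_tree(points, list(range(len(points))), 10, rng) for _ in range(10)]
print(query(points, forest, points[0], n=5, search_k=50))
```

Raising `search_k` visits more leaves and trades speed for accuracy. Once it is large enough to exhaust every tree, the result matches an exhaustive search.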
![Page 47: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/47.jpg)
Find candidates
![Page 48: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/48.jpg)
Take union of all leaves
![Page 49: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/49.jpg)
Compute distances
![Page 50: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/50.jpg)
Return nearest neighbors
![Page 51: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/51.jpg)
“Curse of dimensionality”
![Page 52: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/52.jpg)
![Page 53: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/53.jpg)
![Page 54: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/54.jpg)
Are we screwed?
• Would be nice if the data has a much smaller “intrinsic dimension”!
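One way to see the curse of dimensionality is to compare the nearest and the farthest distance from a query point. The sketch below assumes uniformly random data, which has no low-dimensional structure at all. The function name `distance_contrast` and the sample sizes are illustrative.

```python
import math
import random

def distance_contrast(dim, n_points, rng):
    """Ratio of nearest to farthest distance from a random query to random points."""
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, [rng.random() for _ in range(dim)])
             for _ in range(n_points)]
    return min(dists) / max(dists)

rng = random.Random(0)
for dim in (2, 10, 100, 1000):
    # A ratio near 1 means the nearest neighbor is barely nearer than anything else.
    print(dim, round(distance_contrast(dim, 500, rng), 3))
```

The ratio creeps toward 1 as the dimension grows, so splits separate near from far points less and less. Data with a small intrinsic dimension does not suffer from this to the same degree.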
![Page 55: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/55.jpg)
![Page 56: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/56.jpg)
Improving the algorithm
[Plot: Queries/s vs. 1-NN accuracy; annotations: "more accurate", "faster".]
![Page 57: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/57.jpg)
ann-benchmarks
• https://github.com/erikbern/ann-benchmarks
![Page 58: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/58.jpg)
perf/accuracy tradeoffs
[Plot: Queries/s vs. 1-NN accuracy; annotations: "search more nodes", "more trees".]
![Page 59: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/59.jpg)
Things that work
• Smarter plane splitting
• Priority queue heuristics
• Search more nodes than number of results
• Align nodes closer together
![Page 60: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/60.jpg)
Things that don’t work
• Use lower-precision arithmetic
• Priority queue by other heuristics (number of trees)
• Precompute vector norms
![Page 61: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/61.jpg)
Things for the future
• Use an optimization scheme for tree building
• Add more distance functions (e.g. edit distance)
• Use a proper KV store as a backend (e.g. LMDB) to support incremental adds, out-of-core, arbitrary keys: https://github.com/Houzz/annoy2
![Page 62: Approximate nearest neighbor methods and vector models – NYC ML meetup](https://reader031.fdocuments.in/reader031/viewer/2022012323/586fc52d1a28aba24c8b590f/html5/thumbnails/62.jpg)
Thanks!
• https://github.com/spotify/annoy
• https://github.com/erikbern/ann-benchmarks
• https://github.com/erikbern/ann-presentation
• erikbern.com
• @fulhack