https://www.slideshare.net/erikbern/approximate-nearest-neighbor-methods-and-vector-models-nyc-ml-meetup
출처: spotify engineering lead slideshare
- Start with high dimensional data
- Run dimension reduction to 10-1000 dims
- Do stuff in a small dimensional space
Annoy
-
Buliding an Annoy index
- start with the point set
- split it in 2 halves
- split again (until k items in each leaf, takes n/k memory instead n)
- binary tree
-
Search
- the closest isn't necessarily in the same leaf of the binary tree
- 2 points that are really close may end up on different sides of split
→ Go both sides of a split if it's close
- query structure
- use priority queue to search all trees until we've k items
- take union and remove dupliates
- compute distance for remaining items
- return NN items
pip install --user annoy
pip install pynndescent