벡터간의 유사도 계산은 엄청난게 오래걸리고, 이를 실시간으로 처리하기에는 너무 오래걸린다
벡터를 색인한다는 것은 유사 벡터를 빠르게 찾을 수 있는 데이터 구조를 구축하는 것을 의미
해싱 기반의 방법
트리 기반의 방법
네트워크 기반의 방법
Network based NN indexer
Scaling nearest neighbors search with approximate methods
Tree-based data structures
: K-dimensional trees 는 binary search 트리의 컨셉을 다차원으로 일반화 한 것
A toy 2-dimensional example is visualized below.
We can view how the two-dimensional vectors are partitioned at each level of the k-d tree in the figure below.
In order to see the usefulness of this tree,
let's now consider how we could use this data structure to perform an approximate nearest neighbor query. As we walk down the tree, notice how the highlighted area (the area in vector space that we're interested in) shrinks down to a small subset of the original space. (I'll use the level 4 subplot for this example.)
→ K-d trees are popular due to their simplicity, however, this technique struggles to perform well when dealing with high dimensional data.
→ 지금까지의 예시는 나뉘어진 cell의 중간에 위치했을 경우를 살펴보고 있지만, cell의 경계에 위치하였을 때는 어떻게 해야할까? we miss out on vectors which lie just outside of the cell.
We can then maintain an inverted list to keep track of all of the original objects in relation to which centroid represents the quantized vector.
Similar to before, let's now look at how we can use this method to perform a query. For a given query vector, we'll calculate the distances between the query vector and each centroid in order to find the closest centroid. We can then look up the centroid in our inverted list in order to find all of the nearest vectors.
Unfortunately, in order to get good performance using quantization, you typically need to use very large numbers of centroids for quantization; this impedes on original goal of alleviating the computational burden of calculating too many distances.
Product quantization는 문제를 이렇게 풀어낸다 by first subdividing the original vectors into subcomponents and then quantizing (ie. running K-means on) each subcomponent separately.
A single vector is now represented by a collection of centroids, one for each subcomponent.
→ In the 8D case, you can see how our vector is divided into subcomponents and each subcomponent is represented by some centroid value.
→ However, the 2D example shows us the benefit of this approach. In this case, we can only split our 2D vector into a maximum of two components.
We'll then quantize each dimension separately, squashing all of the data onto the horizontal axis and running k-means and then squashing all of the data onto the vertical axis and running k-means again.
We find 3 centroids for each subcomponent with a total of 6 centroids. However, the total set of all possible quantized states for the overall vector is the Cartesian product of the subcomponent centroids.
In other words, if we divide our vector into m subcomponents and find k centroids, we can represent k^m possible quantizations using only km vectors! The chart below shows how many centroids are needed in order to get 90% of the top 5 search results correct for an approximate nearest neighbors query.
→ represent 3^2 quantization using 3*2 vectors
Notice how using product quantization (m>1m>1) vastly reduces the number of centroids needed to represent our data. One of the reasons why I love this idea so much is that we've effectively turned the curse of dimensionality into something highly beneficial!
Handling multi-modal data
Product quantization 는 데이터가 벡터공간에 고르게 퍼져있을 때 잘 작동한다
However, in reality our data is usually multi-modal.
→ To handle this, a common technique involves first training a coarse quantizer to roughly "slice" up the vector-space, and then we'll run product quantization on each individual coarse cell.
Pro-tip: If you want to scale to really large datasets you can use product quantization as both the coarse quantizer and the fine-grained quantizer within each coarse cell. See this paper for the details.
Locally optimized product quantization
The ideal goal for quantization is to develop a codebook which is (1) concise and (2) highly representative of our data. More specifically, we'd like all of the vectors in our codebook to represent dense regions of our data in vector-space. A centroid in a low-density area of our data is inefficient at representing data and introduces high distortion error for any vectors which fall in its Voronoi cell.
One potential way we can attempt to avoid these inefficient centroids is to add an alignment step to our product quantization. This allows for our product quantizers to better cover the local data for each coarse Voronoi cell.
We can do this by applying a transformation to our data such that we minimize our quantization distortion error. One simple way to minimize this quantization distortion error is to simply apply PCA in order to mean-center the data and rotate it such that the axes capture most of the variance within the data.
Recall my earlier example where we ran product quantization on a toy 2D dataset. In doing so, we effectively squashed all of the data onto the horizontal axis and ran k-means and then repeated this for the vertical axis. By rotating the data such that the axes capture most of the variance, we can more effectively cover our data when using product quantization.
This technique is known as locally optimized product quantization, since we're manipulating the local data within each coarse Voronoi cell in order to optimize the product quantization performance. The authors who introduced this technique have a great illustrative example of how this technique can better fit a given set of vectors.
This blog post glances over (c) Optimized Product Quantization which is the same idea of aligning our data for better product quantization performance, but the alignment is performed globally instead of aligning local data in each Voronoi cell independently. Image credit
The authors who introduced product quantization noted that the technique works best when the vector subcomponents had similar variance. A nice side effect of doing PCA alignment is that during the process we get a matrix of eigenvalues which describe the variance of each principal component. We can use this to our advantage by allocating principal components into buckets of equal variance.
Vector Search는 point가 매우 많아지면 곤란한 상황이 된다.
그래서 Approximated 계산이 필요해진다.
기본적인 아이디어는 벡터를 공간상에 분리하거나, 해싱 트릭으로 Bucketizing 하는 것이다. 어쨌든 데이터를 큰 그룹으로 나눠놓는 게 목표가 된다.
결국 특정 벡터와 가까운 벡터를 찾으려면, 그룹 내 벡터들과의 거리만 계산하면 된다.
이걸 Real-time으로 활용하기 위해 색인 과정까지 거치면 된다. 검색이니까 역색인을 활용하면 더 좋다. 비유하자면 학생들한테 "너랑 가장 친한 친구가 누구인지 물어볼테니까 미리 결정해놔" 같은 작업이다.