Is Cosine-Similarity of Embeddings Really About Similarity?

임재석 · March 15, 2024

paper-study

1. Introduction

  • Discrete entities are embedded into dense real-valued vectors

    • word embedding for LLM
    • recommender system
  • The embedding vector can be used as input to other models

  • Also, they can provide a data-driven notion of similarity between entities

  • Cosine Similarity has become a very popular measure of semantic similarity

    • the norm of the embedding vectors is not as important as the directional alignment between the embedding vectors
    • the unnormalized dot-product does not have this property
  • Cosine similarity of the learned embeddings can in fact yield arbitrary results

    • learned embeddings have a DoF that can render arbitrary cosine-similarities even though their dot-products are well-defined and unique
  • the paper studies linear Matrix Factorization (MF) models, since they admit analytical (closed-form) solutions

2. Matrix Factorization Models

  • focus on linear models as they allow for closed-form solutions

  • matrix $X \in \mathbb{R}^{n \times p}$, containing $n$ data points and $p$ features

  • the goal is to estimate a low-rank matrix $AB^{\top} \in \mathbb{R}^{p \times p}$, where $A, B \in \mathbb{R}^{p \times k}$ with $k \le p$, such that $XAB^{\top}$ is a good approximation of $X$: $X \approx XAB^{\top}$

  • $X$ is a user-item matrix

    • the rows $\vec{b_i}$ of $B$: item embeddings ($k$-dimensional)
    • the rows $\vec{x_u} \cdot A$ of $XA$: user embeddings
    • the embedding of user $u$ is the sum of the item embeddings $\vec{a_j}$ (rows of $A$) of the items the user has consumed
  • the model is defined in terms of the unnormalized dot-product between the two embeddings

    • $(XAB^{\top})_{u,i} = \langle \vec{x_u} \cdot A, \vec{b_i} \rangle$

    • once the model has been learned, it is common to consider

      • the cosine similarity between two items
      • the cosine similarity between two users
      • the cosine similarity between a user and an item
    • this can lead to arbitrary results, and they may not even be unique (see the numpy sketch below)
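
As a concrete illustration of the setup above, here is a minimal numpy sketch of the dot-product model and the three kinds of cosine similarity. The matrices below are random placeholders with toy sizes, not learned factors from the paper.

```python
# Minimal sketch of the linear MF setup: X is a toy user-item matrix,
# A and B are random placeholders standing in for learned factors.
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 8, 5, 3                        # users, items, embedding dimension
X = rng.random((n, p))                   # toy user-item interaction matrix
A = rng.normal(size=(p, k))
B = rng.normal(size=(p, k))              # rows b_i: item embeddings

user_emb = X @ A                         # rows x_u A: user embeddings
pred = user_emb @ B.T                    # (X A B^T)_{u,i} = <x_u A, b_i>

def cos_sim(U, V):
    """Pairwise cosine similarity between the rows of U and the rows of V."""
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    return U @ V.T

item_item = cos_sim(B, B)                # item-item
user_user = cos_sim(user_emb, user_emb)  # user-user
user_item = cos_sim(user_emb, B)         # user-item
```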

2.1 Training

  • the key factor affecting the utility of cosine similarity is the regularization employed when learning the embeddings $A, B$

    • $\min_{A, B} \| X - XAB^{\top} \|_F^2 + \lambda \|AB^{\top}\|_F^2$
    • $\min_{A, B} \| X - XAB^{\top} \|_F^2 + \lambda (\|XA\|_F^2 + \|B\|_F^2)$
  • The first one applies the L2-norm regularization $\|AB^{\top}\|_F^2$ to the product $AB^{\top}$

    • in linear models, this L2-regularization is equivalent to learning with denoising (dropout in the input layer)
    • the resulting prediction accuracy on test data was superior to that of the second objective
    • i.e., denoising/dropout (first objective) works better than weight decay (second objective)
  • The second one is equivalent to the usual matrix factorization objective

    • $\| X - PQ^{\top} \|_F^2 + \lambda (\|P\|_F^2 + \|Q\|_F^2)$, where $P = XA$ and $Q = B$
    • regularizing $P$ and $Q$ separately is similar to weight decay in deep learning
  • if $\hat{A}, \hat{B}$ are solutions of either objective, then $\hat{A}R, \hat{B}R$ are solutions as well for an arbitrary rotation matrix $R \in \mathbb{R}^{k \times k}$

  • cosine similarity is invariant under such rotations $R$

  • only the first objective is invariant to rescalings of the columns of $A$ and $B$ (i.e., of the different latent dimensions of the embeddings)

    • if $\hat{A}\hat{B}^{\top}$ is a solution of the first objective, then for an arbitrary invertible diagonal matrix $D \in \mathbb{R}^{k \times k}$, $\hat{A}DD^{-1}\hat{B}^{\top}$ is a solution as well

    • we can then define a new solution as a function of $D$:

      $\hat{A}^{(D)} := \hat{A}D, \quad \hat{B}^{(D)} := \hat{B}D^{-1}$
    • this diagonal matrix $D$ affects the normalization of the learned user and item embeddings (rows)

      $(X\hat{A}^{(D)})_{\text{(normalized)}} = \Omega_A X\hat{A}^{(D)} = \Omega_A X\hat{A}D, \quad \hat{B}^{(D)}_{\text{(normalized)}} = \Omega_B \hat{B}^{(D)} = \Omega_B \hat{B}D^{-1}$

      where $\Omega_A, \Omega_B$ are the appropriate diagonal matrices that normalize each learned embedding (row) to unit Euclidean norm

    • a different choice of $D$ cannot be compensated by the normalizing matrices $\Omega_A, \Omega_B$

    • since they depend on $D$, we can write them as $\Omega_A(D), \Omega_B(D)$

    • hence, the cosine similarities of the embeddings depend on this arbitrary matrix $D$

  • the cosine similarity becomes

    • item-item
      $\text{cosSim}(\hat{B}^{(D)}, \hat{B}^{(D)}) = \Omega_B(D) \cdot \hat{B} \cdot D^{-2} \cdot \hat{B}^{\top} \cdot \Omega_B(D)$
    • user-user
      $\text{cosSim}(X\hat{A}^{(D)}, X\hat{A}^{(D)}) = \Omega_A(D) \cdot X\hat{A} \cdot D^{2} \cdot (X\hat{A})^{\top} \cdot \Omega_A(D)$
    • user-item
      $\text{cosSim}(X\hat{A}^{(D)}, \hat{B}^{(D)}) = \Omega_A(D) \cdot X\hat{A} \cdot \hat{B}^{\top} \cdot \Omega_B(D)$
  • These cosine similarities all depend on the arbitrary matrix $D$

  • the user-user and item-item similarities depend on $D$ directly, while the user-item similarity depends on $D$ only indirectly, through its effect on the normalizing matrices (illustrated in the sketch below)
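
This rescaling freedom can be checked numerically. In the sketch below, $X$, $\hat{A}$, $\hat{B}$ are random placeholders and $D$ is a random positive diagonal matrix; none of this is the paper's code or data.

```python
# Sketch of the degree of freedom discussed above: rescaling the latent
# dimensions by a diagonal D leaves the predictions X A B^T unchanged,
# but changes the cosine similarities.
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 8, 5, 3
X = rng.random((n, p))
A_hat = rng.normal(size=(p, k))
B_hat = rng.normal(size=(p, k))
d = rng.uniform(0.1, 10.0, size=k)       # diagonal entries of D

A_D = A_hat * d                          # A^(D) = A_hat D
B_D = B_hat / d                          # B^(D) = B_hat D^{-1}

def cos_sim(U, V):
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    return U @ V.T

print(np.allclose(X @ A_hat @ B_hat.T, X @ A_D @ B_D.T))      # True: same model
print(np.allclose(cos_sim(B_hat, B_hat), cos_sim(B_D, B_D)))  # False: item-item cosSim changes
print(np.allclose(cos_sim(X @ A_hat, B_hat),
                  cos_sim(X @ A_D, B_D)))                     # False in general: user-item changes via the normalizations
```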

2.2 Details on First Objective

  • The closed-form solution of the first objective is $\hat{A}_{(1)}\hat{B}_{(1)}^{\top} = V_k \cdot \text{dMat}(..., {1 \over 1 + \lambda/\sigma_i^2}, ...)_k \cdot V_k^{\top}$, where $X =: U\Sigma V^{\top}$ is the SVD of $X$ with $\Sigma = \text{dMat}(..., \sigma_i, ...)$, and the subscript $k$ denotes the rank-$k$ truncation

  • Since $D$ is arbitrary, w.l.o.g. we may define $\hat{A}_{(1)} = \hat{B}_{(1)} := V_k \cdot \text{dMat}(..., {1 \over 1 + \lambda/\sigma_i^2}, ...)_k^{1/2}$

  • considering the special case of a full-rank MF model ($k = p$), two illustrative choices of $D$ are:

    • choose $D = \text{dMat}(..., {1 \over 1 + \lambda/\sigma_i^2}, ...)^{1/2}$

      • $A_{(1)}^{(D)} = \hat{A}_{(1)} \cdot D = V \cdot \text{dMat}(..., {1 \over 1 + \lambda/\sigma_i^2}, ...)$
      • $B_{(1)}^{(D)} = \hat{B}_{(1)} \cdot D^{-1} = V$
      • since the rows of the orthonormal matrix $V$ already have unit norm, the normalization is $\Omega_B = I$
      • then $\text{cosSim}(\hat{B}_{(1)}^{(D)}, \hat{B}_{(1)}^{(D)}) = VV^{\top} = I$
      • the cosine similarity between any pair of different item embeddings is zero
      • $\text{cosSim}(X\hat{A}_{(1)}^{(D)}, \hat{B}_{(1)}^{(D)}) = \Omega_A \cdot X \cdot V \cdot \text{dMat}(..., {1 \over 1 + \lambda/\sigma_i^2}, ...) \cdot V^{\top} = \Omega_A \cdot X \cdot \hat{A}_{(1)}\hat{B}_{(1)}^{\top}$
      • the only difference from the user-item dot-product is the per-user normalization $\Omega_A$ → the item ranking for each user is the same ($\Omega_A$ is irrelevant for ranking)
    • choose $D = \text{dMat}(..., {1 \over 1 + \lambda/\sigma_i^2}, ...)^{-1/2}$

      • similar to the previous case
      • $A_{(1)}^{(D)} = \hat{A}_{(1)} \cdot D = V$
      • $B_{(1)}^{(D)} = \hat{B}_{(1)} \cdot D^{-1} = V \cdot \text{dMat}(..., {1 \over 1 + \lambda/\sigma_i^2}, ...)$
      • $\text{cosSim}(X\hat{A}_{(1)}^{(D)}, X\hat{A}_{(1)}^{(D)}) = \Omega_A \cdot X \cdot X^{\top} \cdot \Omega_A$
      • the user-user similarity is now simply the cosine similarity of the raw data matrix $X$
      • it does not use the learned embeddings at all
      • $\text{cosSim}(X\hat{A}_{(1)}^{(D)}, \hat{B}_{(1)}^{(D)}) = \Omega_A \cdot X \cdot \hat{A}_{(1)} \cdot \hat{B}_{(1)}^{\top} \cdot \Omega_B$
      • $\Omega_B$ normalizes the rows of $\hat{B}_{(1)}^{(D)}$, so the user-item scores are now also rescaled per item, which can change the ranking of items for a user
      • $\text{cosSim}(\hat{B}_{(1)}^{(D)}, \hat{B}_{(1)}^{(D)}) = \Omega_B \cdot V \cdot \text{dMat}(..., {1 \over 1 + \lambda/\sigma_i^2}, ...)^{2} \cdot V^{\top} \cdot \Omega_B$
      • this is very different from the previous choice
    • Hence, different choices of $D$ result in different cosine similarities, even though the learned model $\hat{A}_{(1)}^{(D)}\hat{B}_{(1)}^{(D)\top} = \hat{A}_{(1)}\hat{B}_{(1)}^{\top}$ is invariant to $D$

    • the results of cosine-similarity are arbitrary and not unique for this model
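
The full-rank analysis above can also be reproduced numerically. The sketch below uses a random placeholder $X$ and a toy $\lambda$ (not the paper's data) and compares the two choices of $D$:

```python
# Full-rank closed-form solution of the first objective,
# A_hat B_hat^T = V diag(1/(1+λ/σ_i^2)) V^T, and the two choices of D above.
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 20, 6, 1.0
X = rng.random((n, p))

_, sigma, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T
shrink = 1.0 / (1.0 + lam / sigma**2)        # diagonal of dMat(1/(1+λ/σ_i²))

A_hat = V * np.sqrt(shrink)                  # symmetric split: A_hat = B_hat
B_hat = V * np.sqrt(shrink)

def cos_sim(M, N):
    M = M / np.linalg.norm(M, axis=1, keepdims=True)
    N = N / np.linalg.norm(N, axis=1, keepdims=True)
    return M @ N.T

# choice 1: D = diag(shrink)^{1/2}  ->  B^(D) = B_hat D^{-1} = V, cosSim = V V^T = I
B_D1 = B_hat / np.sqrt(shrink)
print(np.allclose(cos_sim(B_D1, B_D1), np.eye(p)))   # True

# choice 2: D = diag(shrink)^{-1/2} ->  B^(D) = B_hat D^{-1} = V diag(shrink)
B_D2 = B_hat * np.sqrt(shrink)
print(np.allclose(cos_sim(B_D2, B_D2), np.eye(p)))   # False: a very different result
```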

2.3 Details on Second Objective

  • The solution of the second objective is

    • $\hat{A}_{(2)} = V_k \cdot \text{dMat}(..., \sqrt{{1 \over \sigma_i} \cdot (1 - {\lambda \over \sigma_i})_+}, ...)_k$
    • $\hat{B}_{(2)} = V_k \cdot \text{dMat}(..., \sqrt{\sigma_i \cdot (1 - {\lambda \over \sigma_i})_+}, ...)_k$
    • where $(y)_+ = \max(0, y)$
  • If we use the usual MF notation $P = XA$ and $Q = B$,

    • we get $\hat{P} = X\hat{A}_{(2)} = U_k \cdot \text{dMat}(..., \sqrt{\sigma_i \cdot (1 - {\lambda \over \sigma_i})_+}, ...)_k$
    • this diagonal matrix is the same for the user and item embeddings, due to the symmetry of the L2-norm regularization of $P$ and $Q$
    • this solution is unique → there is no freedom to choose an arbitrary diagonal matrix $D$
  • In this case, the cosine-similarity yields unique results

  • but is this matrix $\text{dMat}(..., \sqrt{{1 \over \sigma_i} \cdot (1 - {\lambda \over \sigma_i})_+}, ...)_k$ the one that yields the best possible semantic similarities?

    • comparing this solution with the one in 2.2 suggests that the arbitrary diagonal matrix $D$ in 2.2 may analogously be chosen as $D = \text{dMat}(..., \sqrt{{1 \over \sigma_i}}, ...)_k$
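
The closed-form factors of the second objective can be written down directly from the SVD. The sketch below uses a random placeholder $X$ and toy values of $k$ and $\lambda$, and checks that the user embeddings $X\hat{A}_{(2)}$ carry the same diagonal scaling as $\hat{B}_{(2)}$:

```python
# Closed-form solution of the second (weight-decay / MF) objective via the SVD.
import numpy as np

rng = np.random.default_rng(3)
n, p, k, lam = 20, 6, 3, 0.05
X = rng.random((n, p))

U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
Uk, sk, Vk = U[:, :k], sigma[:k], Vt[:k].T

scale = np.sqrt(np.maximum(0.0, 1.0 - lam / sk))   # sqrt((1 - λ/σ_i)_+)
A2 = Vk * (scale / np.sqrt(sk))                    # V_k diag(sqrt((1/σ_i)(1-λ/σ_i)_+))
B2 = Vk * (scale * np.sqrt(sk))                    # V_k diag(sqrt(σ_i (1-λ/σ_i)_+))

P = X @ A2                                         # user embeddings P = X A_(2)
print(np.allclose(P, Uk * (scale * np.sqrt(sk))))  # True: same diagonal scaling as B_(2)
```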

3. Remedies and Alternatives to Cosine-Similarity

  • when a model is trained w.r.t. the dot-product, its effect on cosine-similarity can be opaque and sometimes not even unique
    • train the model directly w.r.t. cosine similarity, which can be facilitated by layer normalization

    • project the embeddings back into the original space, where cosine similarity can then be applied

      • view $X\hat{A}\hat{B}^{\top}$ as a smoothed version of the raw data, and the rows of $X\hat{A}\hat{B}^{\top}$ as the users' embeddings in the original space
  • in cosine similarity, normalization is applied only after the embeddings have been learned
    • this can degrade the resulting similarities compared to applying some normalization, or reduction of popularity bias, before or during learning
  • To resolve this,
    • standardize the data $X$ (zero mean, unit variance)

    • negative sampling or inverse propensity scaling to account for the different item popularities

      • word2vec is trained by sampling negatives with a probability proportional to their frequency
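
A rough numpy sketch of two of these remedies, with a random placeholder $X$ and random stand-in factors $A, B$ (not the paper's pipeline):

```python
# Sketch of two remedies: (1) standardize X before learning, and
# (2) apply cosine similarity to the rows of the smoothed data X A B^T
# (i.e., back in the original item space).
import numpy as np

rng = np.random.default_rng(4)
X = rng.random((20, 6))
A = rng.normal(size=(6, 3))
B = rng.normal(size=(6, 3))

# (1) zero mean, unit variance per column, applied before training
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# (2) user-user cosine similarity computed in the original space
smoothed = X @ A @ B.T                      # rows: user representations in item space
S = smoothed / np.linalg.norm(smoothed, axis=1, keepdims=True)
user_user = S @ S.T
```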

4. Experiments

  • illustrate these findings for low-rank embeddings

  • Not aware of a good metric for semantic similarity \rightarrow experiments on simulated data \rightarrow ground-truths are known (clustered items data)

  • generated interactions between 20,000 users and 1,000 items, with items assigned to 5 clusters with probabilities $p_c$

  • sampled a power-law exponent for each cluster $c$: $\beta_c \sim \text{Uniform}(\beta_{min}^{(item)}, \beta_{max}^{(item)})$

    • where $\beta_{min}^{(item)} = 0.25$ and $\beta_{max}^{(item)} = 1.5$
  • assigned a baseline popularity to each item $i$ according to the power law $p_i = \text{PowerLaw}(\beta_c)$

  • then generated the items that each user $u$ had interacted with (a sketch of this generation process follows after this list)

    • first, randomly sampled user-cluster preferences $p_{uc}$
    • computed the user-item probabilities $p_{ui} = {p_{uc_i} p_i \over \sum_j p_{uc_j} p_j}$
    • sampled the number of items for this user, $k_u \sim \text{PowerLaw}(\beta^{(user)})$ with $\beta^{(user)} = 0.5$, and sampled $k_u$ items according to $p_{ui}$
  • Learned the matrices $A, B$ with the two training objectives ($\lambda = 10000$ and $\lambda = 100$)

    • used a low-rank constraint $k = 50 \ll p = 1000$ to complement the analytical results for the full-rank case above

  • in the paper's figure, the leftmost panel shows the ground-truth item-item similarities

  • the middle three panels are trained with the first objective, with three different re-scalings of the singular vectors in $V_k$

  • the rightmost panel is trained with the second objective → unique solution

  • the resulting cosine similarities can be vastly different even for reasonable (non-extreme) choices of the re-scaling
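
For reference, here is a rough sketch of the simulated-data generation described above. The paper's exact PowerLaw(β) parametrization and cluster-assignment probabilities $p_c$ are not given here, so this assumes a Zipf-style popularity $p \propto \text{rank}^{-\beta}$, uniform cluster assignment, a Zipf draw for $k_u$, and reduced sizes; all of these are assumptions for illustration only.

```python
# Rough sketch (not the paper's code) of the simulated interaction data.
import numpy as np

rng = np.random.default_rng(6)
n_users, n_items, n_clusters = 2000, 1000, 5        # paper uses 20,000 users

beta_c = rng.uniform(0.25, 1.5, size=n_clusters)    # per-cluster exponents
item_cluster = rng.integers(0, n_clusters, size=n_items)   # uniform assignment (assumption)

# baseline item popularity: Zipf-like over a random item ranking (assumption)
rank = (rng.permutation(n_items) + 1).astype(float)
p_item = rank ** (-beta_c[item_cluster])

X = np.zeros((n_users, n_items))
for u in range(n_users):
    p_uc = rng.dirichlet(np.ones(n_clusters))       # user-cluster preferences
    p_ui = p_uc[item_cluster] * p_item
    p_ui /= p_ui.sum()                              # p_ui ∝ p_{u,c_i} * p_i
    k_u = min(200, rng.zipf(1.5))                   # #items per user (Zipf(1+β_user), assumption)
    items = rng.choice(n_items, size=k_u, replace=False, p=p_ui)
    X[u, items] = 1.0
```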

5. Conclusions

  • cosine similarities are heavily dependent on the method and regularization technique

  • in some cases they can even be rendered meaningless

  • cosine similarity applied to the embeddings of deep models is likely plagued by similar problems

    • different layers of a deep model may be subject to different regularization → this can introduce an arbitrary scaling analogous to $D$

6. Comment

  • A reflection on cosine similarity, which we tend to use blindly. The experiments feel somewhat limited in scope, but as an exercise in questioning a default choice, it was worthwhile.
