Discrete entities are embedded into dense real-valued vectors
The embedding vectors can be used as input to other models
Also, they can provide a data-driven notion of similarity between entities
Cosine Similarity has become a very popular measure of semantic similarity
Cosine similarity of the learned embeddings can in fact yield arbitrary results
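For concreteness, a minimal sketch of the cosine-similarity computation referred to throughout these notes (plain NumPy; the helper name cos_sim is just for illustration):

```python
import numpy as np

def cos_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product of a and b divided by the product of their norms."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example: parallel vectors have similarity 1, orthogonal vectors have similarity 0
print(cos_sim(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])))  # ~1.0
print(cos_sim(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])))  # 0.0
```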
The paper studies linear matrix factorization (MF) models, for which analytical solutions can be derived
It focuses on linear models because they allow for closed-form solutions and hence analytical insights
Given a matrix $X \in \mathbb{R}^{n \times p}$, containing $n$ data points and $p$ features
The goal is to estimate a low-rank matrix $AB^\top$, where $A, B \in \mathbb{R}^{p \times k}$ with $k \le p$, such that $XAB^\top$ is a good approximation of $X$: $X \approx XAB^\top$
In recommender systems, $X$ is typically the user-item interaction matrix
The model is defined in terms of the (unnormalized) dot-product between the two embeddings: $(XAB^\top)_{u,i} = \langle \vec{x}_u A, \vec{b}_i \rangle$, where row $\vec{x}_u A$ of $XA$ is the embedding of user $u$ and row $\vec{b}_i$ of $B$ is the embedding of item $i$
Once the embeddings have been learned, it is common to also consider the cosine similarity between two users, $\mathrm{cosSim}(\vec{x}_u A, \vec{x}_{u'} A)$, two items, $\mathrm{cosSim}(\vec{b}_i, \vec{b}_{i'})$, or a user and an item, $\mathrm{cosSim}(\vec{x}_u A, \vec{b}_i)$
This can lead to arbitrary results, and they may not even be unique
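A small sketch to make the setup concrete, assuming the model form $X \approx XAB^\top$ described above (all matrices are random placeholders here; only the shapes and the dot-product structure matter):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 100, 50, 10            # users, items, embedding dimension

X = rng.random((n, p))            # user-item data matrix
A = rng.normal(size=(p, k))       # defines user embeddings via X @ A
B = rng.normal(size=(p, k))       # rows are item embeddings

user_emb = X @ A                  # (n, k) user embeddings
item_emb = B                      # (p, k) item embeddings

# The model's scores are unnormalized dot products: (X A B^T)_{u,i} = <x_u A, b_i>
scores = user_emb @ item_emb.T

def cos_sim_matrix(E, F):
    """Pairwise cosine similarities between the rows of E and the rows of F."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    F = F / np.linalg.norm(F, axis=1, keepdims=True)
    return E @ F.T

user_user = cos_sim_matrix(user_emb, user_emb)   # (n, n)
item_item = cos_sim_matrix(item_emb, item_emb)   # (p, p)
user_item = cos_sim_matrix(user_emb, item_emb)   # (n, p)
```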
The key factor affecting the utility of cosine similarity is the regularization employed when learning the embeddings $A$ and $B$
The first objective, $\min_{A,B} \|X - XAB^\top\|_F^2 + \lambda \|AB^\top\|_F^2$, applies the L2-norm regularization to their product $AB^\top$
The second objective, $\min_{A,B} \|X - XAB^\top\|_F^2 + \lambda (\|XA\|_F^2 + \|B\|_F^2)$, is equivalent to the usual matrix factorization objective with weight decay on the individual factors
If $\hat{A}, \hat{B}$ are solutions of either objective, then $\hat{A}R, \hat{B}R$ are solutions as well, for an arbitrary rotation matrix $R$
This is not a problem, since cosine similarity is invariant under such rotations
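A quick numerical check of this invariance (a random orthogonal matrix stands in for the rotation $R$; this is a sketch, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
k = 10
A_hat = rng.normal(size=(50, k))   # stand-in user embeddings
B_hat = rng.normal(size=(40, k))   # stand-in item embeddings

# Orthogonal matrix (rotation/reflection) from a QR decomposition
R, _ = np.linalg.qr(rng.normal(size=(k, k)))

def cos_sim_matrix(E, F):
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    F = F / np.linalg.norm(F, axis=1, keepdims=True)
    return E @ F.T

# The product A B^T is unchanged, and so are all pairwise cosine similarities
assert np.allclose(A_hat @ B_hat.T, (A_hat @ R) @ (B_hat @ R).T)
assert np.allclose(cos_sim_matrix(A_hat, B_hat), cos_sim_matrix(A_hat @ R, B_hat @ R))
```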
Only the first objective is also invariant to rescalings of the columns of $A$ and $B$ (i.e., of the different latent dimensions of the embeddings)
If $\hat{A}\hat{B}^\top$ is a solution of the first objective, then so is $\hat{A}DD^{-1}\hat{B}^\top$ for an arbitrary diagonal matrix $D$ with positive entries
We can thus define a new solution as a function of $D$: $\hat{A}^{(D)} := \hat{A}D$ and $\hat{B}^{(D)} := \hat{B}D^{-1}$
This diagonal matrix $D$ affects the normalization of the learned user and item embeddings (i.e., their rows): the normalized embeddings are $\Omega_A(D)^{-1} X\hat{A}D$ and $\Omega_B(D)^{-1} \hat{B}D^{-1}$,
where $\Omega_A(D)$ and $\Omega_B(D)$ are the appropriate diagonal matrices that normalize each learned embedding (row) to unit Euclidean norm
A different choice for $D$ cannot in general be compensated by the normalizing matrices,
since they themselves depend on $D$, which is made explicit by writing them as $\Omega_A(D)$ and $\Omega_B(D)$
Hence, the cosine similarities of the embeddings depend on this arbitrary matrix $D$
The cosine similarities become:
user-user: $\mathrm{cosSim}(X\hat{A}^{(D)}, X\hat{A}^{(D)}) = \Omega_A(D)^{-1}\, X\hat{A}\,D^{2}\,\hat{A}^\top X^\top\, \Omega_A(D)^{-1}$
item-item: $\mathrm{cosSim}(\hat{B}^{(D)}, \hat{B}^{(D)}) = \Omega_B(D)^{-1}\, \hat{B}\,D^{-2}\,\hat{B}^\top\, \Omega_B(D)^{-1}$
user-item: $\mathrm{cosSim}(X\hat{A}^{(D)}, \hat{B}^{(D)}) = \Omega_A(D)^{-1}\, X\hat{A}\hat{B}^\top\, \Omega_B(D)^{-1}$
All of these depend on the arbitrary matrix $D$
The user-user and item-item similarities depend on $D$ directly, while the user-item similarity depends on $D$ only indirectly, through its effect on the normalizing matrices $\Omega_A(D)$ and $\Omega_B(D)$
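A sketch showing that such a rescaling $D$ leaves the model (and hence the first objective) untouched while changing the cosine similarities; random matrices are used purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 100, 50, 10
X = rng.random((n, p))
A_hat = rng.normal(size=(p, k))
B_hat = rng.normal(size=(p, k))

def cos_sim_matrix(E, F):
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    F = F / np.linalg.norm(F, axis=1, keepdims=True)
    return E @ F.T

# Arbitrary positive rescaling of the latent dimensions: A D and B D^{-1}
d = rng.uniform(0.1, 10.0, size=k)
A_D = A_hat * d
B_D = B_hat / d

# The predictions X A B^T (and the regularizer on A B^T) are unchanged ...
assert np.allclose(X @ A_hat @ B_hat.T, X @ A_D @ B_D.T)

# ... but the user-user and item-item cosine similarities are not
print(np.abs(cos_sim_matrix(X @ A_hat, X @ A_hat) - cos_sim_matrix(X @ A_D, X @ A_D)).max())
print(np.abs(cos_sim_matrix(B_hat, B_hat) - cos_sim_matrix(B_D, B_D)).max())
```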
The closed-form solution of the first objective is $\hat{A}^{(1)}\hat{B}^{(1)\top} = V_k \cdot \mathrm{diag}\!\left(\frac{\sigma_1^2}{\sigma_1^2+\lambda},\dots,\frac{\sigma_k^2}{\sigma_k^2+\lambda}\right)\cdot V_k^\top$, where $X = U\Sigma V^\top$ is the SVD of $X$ and $U_k, \Sigma_k = \mathrm{diag}(\sigma_1,\dots,\sigma_k), V_k$ are the truncated matrices of rank $k$
Since $D$ is arbitrary, w.l.o.g. we may define $\hat{A}^{(1)} := V_k \cdot \mathrm{diag}\!\left(\frac{\sigma_i^2}{\sigma_i^2+\lambda}\right)^{1/2} D$ and $\hat{B}^{(1)} := V_k \cdot \mathrm{diag}\!\left(\frac{\sigma_i^2}{\sigma_i^2+\lambda}\right)^{1/2} D^{-1}$
Consider the special case of a full-rank MF model ($k = p$); two choices of $D$ illustrate the problem
Choose $D = \mathrm{diag}\!\left(\frac{\sigma_i^2}{\sigma_i^2+\lambda}\right)^{1/2}$, so that $\hat{A}^{(1)} = V\cdot\mathrm{diag}\!\left(\frac{\sigma_i^2}{\sigma_i^2+\lambda}\right)$ and $\hat{B}^{(1)} = V$; then the item-item cosine similarity equals $VV^\top = I$, i.e., every item is similar only to itself
Choose $D = \mathrm{diag}\!\left(\frac{\sigma_i^2}{\sigma_i^2+\lambda}\right)^{-1/2}$, so that $\hat{A}^{(1)} = V$ and $\hat{B}^{(1)} = V\cdot\mathrm{diag}\!\left(\frac{\sigma_i^2}{\sigma_i^2+\lambda}\right)$; then the user-user cosine similarity reduces to the cosine similarity between the rows of the raw data $X$, as if no embedding had been learned at all
Hence, different choices of $D$ result in different cosine similarities, even though the learned model $\hat{A}^{(1)}\hat{B}^{(1)\top}$ is invariant to $D$
the results of cosine-similarity are arbitrary and not unique for this model
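A sketch of the two full-rank choices above, based on the closed-form solution as stated in these notes (the shrinkage factors $\sigma_i^2/(\sigma_i^2+\lambda)$ are taken from that statement and should be treated as an assumption of this sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 50
lam = 10.0
X = rng.random((n, p))

# SVD of X; full-rank case k = p
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T
shrink = sigma**2 / (sigma**2 + lam)    # diag(sigma_i^2 / (sigma_i^2 + lambda))

def cos_sim_matrix(E, F):
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    F = F / np.linalg.norm(F, axis=1, keepdims=True)
    return E @ F.T

# Choice 1: D = diag(...)^{1/2}   ->  A = V diag(...), B = V
A1, B1 = V * shrink, V
# Choice 2: D = diag(...)^{-1/2}  ->  A = V,           B = V diag(...)
A2, B2 = V, V * shrink

# Both choices define exactly the same model ...
assert np.allclose(X @ A1 @ B1.T, X @ A2 @ B2.T)

# ... yet under choice 1 the item-item cosine similarity is the identity matrix,
print(np.allclose(cos_sim_matrix(B1, B1), np.eye(p)))                      # True
# ... and under choice 2 the user-user cosine similarity equals that of the raw rows of X
print(np.allclose(cos_sim_matrix(X @ A2, X @ A2), cos_sim_matrix(X, X)))   # True
```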
The solution of the second objective is $X\hat{A}^{(2)} = U_k \cdot \mathrm{diag}\!\left(\sqrt{\sigma_i\,(1-\lambda/\sigma_i)_+}\right)$ and $\hat{B}^{(2)} = V_k \cdot \mathrm{diag}\!\left(\sqrt{\sigma_i\,(1-\lambda/\sigma_i)_+}\right)$, unique up to a common rotation
In the usual notation of MF, this corresponds to the user-embedding matrix $W = X\hat{A}^{(2)}$ and the item-embedding matrix $H = \hat{B}^{(2)}$
In this case, the cosine-similarity yields unique results, because this objective is not invariant to rescalings of the embedding dimensions and the embedding norms are therefore determined
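A sketch of this solution as stated above; the singular-value shrinkage $\sqrt{\sigma_i\,(1-\lambda/\sigma_i)_+}$ is my reading of the closed form, so treat it as an assumption — the point is only that both factors carry the same fixed scaling, so no arbitrary diagonal rescaling is left:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, k = 200, 50, 10
lam = 0.5
X = rng.random((n, p))

U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
Uk, sk, Vk = U[:, :k], sigma[:k], Vt[:k].T

# Balanced factors: both sides carry the square root of the shrunk singular values
scale = np.sqrt(np.maximum(sk * (1.0 - lam / sk), 0.0))
W = Uk * scale      # user embeddings  (X A_hat in the notation above)
H = Vk * scale      # item embeddings  (B_hat)

def cos_sim_matrix(E, F):
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    F = F / np.linalg.norm(F, axis=1, keepdims=True)
    return E @ F.T

# These cosine similarities are now uniquely determined
# (a common rotation of W and H would leave them unchanged)
user_user = cos_sim_matrix(W, W)
item_item = cos_sim_matrix(H, H)
```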
It is unclear, however, whether these unique similarities are the best possible semantic similarities
Possible remedies and alternatives:
train the model directly with respect to cosine similarity, e.g., via layer normalization
project the embeddings back into the original space and apply cosine similarity there (see the sketch after this list)
standardize the data (zero mean, unit variance) before or during training
use negative sampling or inverse propensity scaling to account for the different item popularities
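A rough sketch of the "project back into the original space" remedy as I read it: apply cosine similarity to rows and columns of the reconstruction $\hat{X} = X\hat{A}\hat{B}^\top$ instead of to the $k$-dimensional embeddings; this interpretation is mine, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, k = 200, 50, 10
X = rng.random((n, p))
A_hat = rng.normal(size=(p, k))   # stand-ins for learned factors
B_hat = rng.normal(size=(p, k))

def cos_sim_matrix(E, F):
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    F = F / np.linalg.norm(F, axis=1, keepdims=True)
    return E @ F.T

# Reconstruction in the original item space
X_hat = X @ A_hat @ B_hat.T          # (n, p)

# User-user similarity from reconstructed rows, item-item from reconstructed columns;
# any diagonal rescaling D of the latent dimensions cancels out inside X_hat
user_user = cos_sim_matrix(X_hat, X_hat)
item_item = cos_sim_matrix(X_hat.T, X_hat.T)
```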
Experiments illustrate these findings for low-rank embeddings
Since there is no established metric for semantic similarity, the experiments use simulated data where the ground truth is known (items grouped into clusters)
Generated interactions between 20,000 users and 1,000 items, with items randomly assigned to 5 clusters with cluster probabilities $p_c$
Sampled a power-law exponent $\beta_c$ for each cluster $c$
Assigned a baseline popularity to each item according to the power law of its cluster
Then generated the items that each user interacted with (rough sketch below)
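A sketch of this kind of simulation; the user/item/cluster counts follow the notes, while the specific distributions (Dirichlet cluster probabilities, the range of the power-law exponents, the per-user cluster preference and interaction count) are placeholder assumptions just to make the sketch runnable:

```python
import numpy as np

rng = np.random.default_rng(6)
n_users, n_items, n_clusters = 20_000, 1_000, 5

# Cluster probabilities p_c and random item-to-cluster assignment
p_c = rng.dirichlet(np.ones(n_clusters))                   # assumed distribution
item_cluster = rng.choice(n_clusters, size=n_items, p=p_c)

# Power-law exponent per cluster and a baseline popularity per item
beta = rng.uniform(0.5, 1.5, size=n_clusters)              # assumed range
popularity = np.zeros(n_items)
for c in range(n_clusters):
    idx = np.flatnonzero(item_cluster == c)
    ranks = np.arange(1, len(idx) + 1)
    popularity[idx] = ranks ** (-beta[c])                   # power law within the cluster

# Each user mostly interacts with one preferred cluster, items drawn by popularity
X = np.zeros((n_users, n_items), dtype=np.int8)
for u in range(n_users):
    c = rng.integers(n_clusters)                            # assumed uniform user preference
    weights = popularity * np.where(item_cluster == c, 1.0, 0.05)   # assumed mixing weight
    weights /= weights.sum()
    items = rng.choice(n_items, size=30, replace=False, p=weights)  # assumed 30 items/user
    X[u, items] = 1
```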
Learned the embedding matrices with the two training objectives (the first and the second objective above)
In the resulting figure, the left panel shows the ground-truth item-item similarities
The middle three panels show training with the first objective, using three different re-scalings $D$ of the singular vectors in $V_k$
The right panel shows training with the second objective, whose solution is unique
The resulting cosine similarities can be vastly different even for reasonable choices of the re-scaling (extreme cases were not used)
cosine similarities are heavily dependent on the method and regularization technique
In some cases, they can even be rendered meaningless
Cosine similarity applied to embeddings learned in deep models is likely plagued by similar problems