Metric learning is the task of learning a distance function over objects.
A model takes two or more examples and outputs a (non-negative) score.
The distance may have a different meaning depending on the data.
The model learns whatever relationship the training data encodes, whether or not that matches an intended notion of similarity.
Data
(weakly) Supervised learning
But often obtained without human labor
Relative-similarity data is easier to collect than traditionally labeled data, e.g.:
Photos taken at the same place
Videos watched in the same YouTube session
Data augmentation
Ordinal regression
Map to an ordered set of classes
17.1. Learning to Rank
Training data consists of lists of items with a partial order specified between the items in each list. The ranking model aims to rank, i.e., to produce a permutation of the items in new, unseen lists that is consistent with the rankings in the training data.
Problem Formulation
Point-wise
Pair-wise
Each training example is an ordered pair of two items for a query.
We train the model to predict a score per item for the given query, such that the order between the two items is preserved.
Usually, the goal is to minimize the average number of inversions in ranking.
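A minimal sketch of one common pair-wise objective, a hinge loss on the score difference; the function name and the `margin` parameter are assumptions for illustration, not from the text:

```python
import numpy as np

def pairwise_hinge_loss(score_preferred, score_other, margin=1.0):
    """Hinge loss on an ordered pair: penalize whenever the preferred
    item's score does not exceed the other item's score by `margin`."""
    return np.maximum(0.0, margin - (score_preferred - score_other))
```

When the preferred item already wins by the margin, the loss is zero and the pair contributes no gradient; otherwise the loss grows linearly with the inversion.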
List-wise
Representation Learning
We optimize the model to discriminate or order items as much as in the training set, hoping such ability can generalize to unseen items. By observing which item is more similar than another, the model can learn general understanding.
relᵢ = 1 if the i-th output item is actually relevant to the query, and 0 otherwise. Alternatively, graded relevance scores can be used.
NDCG is DCG divided by the maximum possible DCG score for that query. The higher, the better.
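A minimal sketch of these two metrics, assuming the standard log₂ rank discount and relevance values listed in ranked order:

```python
import numpy as np

def dcg(relevances):
    """Discounted cumulative gain: each relevance value is discounted
    by log2(rank + 1) of its position in the ranking (ranks start at 1)."""
    ranks = np.arange(1, len(relevances) + 1)
    return float(np.sum(relevances / np.log2(ranks + 1)))

def ndcg(relevances):
    """DCG normalized by the ideal DCG (relevances sorted descending),
    so a perfect ranking scores 1.0."""
    ideal = dcg(np.sort(relevances)[::-1])
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

For example, a list already sorted by relevance gets NDCG of exactly 1.0, while pushing a relevant item down the list lowers the score.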
17.2. Triplet Loss
L(A,P,N) = max(‖f(A)−f(P)‖² − ‖f(A)−f(N)‖² + α, 0)
The distance from the anchor to the positive is pushed down, and the distance from the anchor to the negative is pushed up, until they differ by at least the margin α.
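The loss above can be sketched directly on precomputed embeddings; the argument names and the default margin α = 0.2 are illustrative assumptions:

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """L(A,P,N) = max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0),
    where f_a, f_p, f_n are the embeddings of anchor, positive, negative."""
    d_ap = np.sum((f_a - f_p) ** 2)  # squared anchor-positive distance
    d_an = np.sum((f_a - f_n) ** 2)  # squared anchor-negative distance
    return max(d_ap - d_an + alpha, 0.0)
```

Note the loss is zero once the negative is farther than the positive by at least α, so such easy triplets stop contributing gradients.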
Data
Random Negative
The anchor and positive are usually collected as positive pairs (either by human labeling or by implicit data collection).
Negative examples are usually not collected explicitly, so assigning random negatives is common.
Since random negatives are often too easy to distinguish, the model stops learning after a few iterations.
Online Negative Mining
To resolve this problem, online negative mining looks for hard negatives within the current batch, instead of using the initially assigned ones.
Practically, a large batch size is required for good performance, since small batches contain few hard negatives. However, online negative mining takes O(B²) time for the within-batch k-NN search.
Semi-hard Negative Mining
Always using the hardest negative can be dangerous: a hard negative can push the model toward a degenerate solution such as f(x)=0 for all inputs, which reduces L to the constant α.
So it is important to pick a negative whose current anchor-negative distance is just above the anchor-positive distance (a semi-hard negative).
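A minimal sketch of the semi-hard selection rule, assuming precomputed embeddings; the function name, fallback behavior, and default margin are illustrative assumptions:

```python
import numpy as np

def semi_hard_negative(f_a, f_p, candidates, alpha=0.2):
    """Pick the index of a semi-hard negative among candidate embeddings:
    one satisfying d(a,p) < d(a,n) < d(a,p) + alpha, i.e., farther than
    the positive (so it cannot collapse the embedding) but still inside
    the margin (so the loss is nonzero). Falls back to the easiest
    candidate beyond the positive; returns None if none qualifies."""
    d_ap = np.sum((f_a - f_p) ** 2)
    d_an = np.sum((candidates - f_a) ** 2, axis=1)
    in_margin = (d_an > d_ap) & (d_an < d_ap + alpha)
    if in_margin.any():
        return int(np.argmin(np.where(in_margin, d_an, np.inf)))
    beyond = d_an > d_ap
    if beyond.any():
        return int(np.argmin(np.where(beyond, d_an, np.inf)))
    return None
```

In a real batch this selection runs once per anchor, which is where the O(B²) distance computation above comes from.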
Only one term (where yᵢ=1) survives in the cross-entropy, but the softmax depends on all other logits through the denominator.
This makes the loss depend on every output of the network, so every output parameter receives a non-zero gradient and must be updated for every training example.
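This can be seen from the well-known gradient of cross-entropy with respect to the logits, softmax(z) − onehot(y): a small sketch, with function names chosen for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def xent_grad(z, y):
    """Gradient of cross-entropy w.r.t. logits z for true class y:
    softmax(z) - onehot(y). Every entry is nonzero, even for classes
    absent from the label, because each logit appears in the softmax
    denominator."""
    p = softmax(z)
    p[y] -= 1.0
    return p
```

Even though only class y appears in the loss itself, every logit gets a gradient, which is exactly what makes the full softmax expensive for large output vocabularies.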
Can we just sample a few negatives, instead of computing all of them?