contrastive loss is a hardness-aware loss function. (hardness-aware: it drives the embedding toward clear, unambiguous separation - far apart between classes (inter), close together within a class (intra))
temperature controls the strength of penalties on hard negative samples.
uniformity helps learn separable features, but pursuing it excessively breaks the underlying semantic structure.
tolerance: how close semantically similar samples are allowed to remain.
→ uniformity-tolerance dilemma: the temperature τ can be tuned to strike a compromise between the two
→ the loss is almost the same, but (a) is better.
: contrastive loss aims to learn a general feature function that maps data onto a hypersphere so that positive pairs are attracted and negative pairs are separated.
: contrastive loss is a hardness-aware loss function which concentrates on optimizing the hard negative samples.
: temperature plays a role in controlling the strength of penalties on the hard negative samples.
: if the contrastive loss is equipped with a very small temperature, the loss function gives very large penalties to the nearest neighbours, which are very likely to share similar semantic content with the anchor point.
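For reference, the softmax-based contrastive loss these quotes refer to (written from memory, so the exact indexing is my own convention; s_{i,j} is the cosine similarity between anchor x_i and sample x_j on the hypersphere) is:

$$
\mathcal{L}(x_i) = -\log \frac{\exp(s_{i,i}/\tau)}{\exp(s_{i,i}/\tau) + \sum_{k \neq i} \exp(s_{i,k}/\tau)}
$$

Dividing by a small τ sharpens the softmax, so the denominator (and hence the penalty) is dominated by the negatives most similar to the anchor.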
we observe that embeddings trained with τ = 0.07 are more uniformly distributed; however, the embeddings trained with τ = 0.2 present a more reasonable distribution that is locally clustered and globally separated.
CPC, CMC, SimCLR,
We emphasize the significance of the temperature τ, and use it as a proxy to analyze some intriguing phenomena of contrastive learning.
the softmax-based contrastive loss is a hardness-aware loss function, which automatically concentrates on separating more informative negative samples to make the embedding distribution more uniform
the magnitude of positive gradient is equal to the sum of negative gradients.
The temperature controls the distribution of negative gradients. A smaller temperature concentrates the gradients more on the nearest neighbours of the anchor point, which is how it controls the hardness-aware sensitivity.
The magnitude of the gradient with respect to the positive sample is equal to the sum of the gradients with respect to all negative samples, which defines a probability distribution that helps in understanding the role of the temperature τ.
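Differentiating the loss above with respect to the similarities makes this explicit (standard softmax calculus, so the form below should match the paper up to notation):

$$
\frac{\partial \mathcal{L}(x_i)}{\partial s_{i,i}} = -\frac{1}{\tau} \sum_{k \neq i} P_{i,k},
\qquad
\frac{\partial \mathcal{L}(x_i)}{\partial s_{i,k}} = \frac{1}{\tau} P_{i,k},
\qquad
P_{i,k} = \frac{\exp(s_{i,k}/\tau)}{\exp(s_{i,i}/\tau) + \sum_{j \neq i} \exp(s_{i,j}/\tau)}
$$

The positive gradient's magnitude equals the sum of the negative gradients, and the P_{i,k} are exactly the probability distribution over negatives whose shape τ controls.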
if τ → 0, the loss effectively sees only the nearest (hardest) negative sample
if τ → ∞, the weights over the negatives become almost uniform
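A quick numerical illustration of these two limits (the similarity values below are made up; only the softmax behaviour matters):

```python
import numpy as np

def negative_weight_dist(sim_pos, sims_neg, tau):
    """Relative weight P_{i,k} each negative receives in the gradient (see the formula above)."""
    logits = np.concatenate(([sim_pos], sims_neg)) / tau
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs[1:] / probs[1:].sum()  # distribution over negatives only

sims_neg = np.array([0.9, 0.5, 0.1, -0.3])  # hypothetical similarities to 4 negatives
for tau in (0.07, 0.2, 1.0, 10.0):
    print(tau, np.round(negative_weight_dist(0.95, sims_neg, tau), 3))
# tau=0.07 -> almost all weight on the hardest negative (sim 0.9)
# tau=10.0 -> weights close to uniform (~0.25 each)
```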
the hard contrastive loss truncates the gradients with respect to the uninformative negative samples.
split the similarity range [-1.0, 1.0] at a threshold into [-1.0, threshold] and [threshold, 1.0]; the interval below the threshold is zeroed out (no gradient)
→ hard negative sampling
→ in this case, the simple loss (L_simple) is used
this performs better than the original contrastive loss formulation.
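A rough PyTorch sketch of this truncated "simple" loss (the threshold value, the λ weight, and the tensor shapes are my own placeholders rather than the paper's exact formulation):

```python
import torch

def hard_simple_contrastive_loss(sim_pos, sim_neg, threshold=0.5, lam=1.0):
    """Simplified loss that only penalizes 'hard' (nearby) negatives.

    sim_pos: (B,)   cosine similarity of each anchor to its positive
    sim_neg: (B, K) cosine similarities of each anchor to K negatives
    Negatives with similarity below `threshold` get zero gradient,
    i.e. the [-1, threshold) part of the range is truncated.
    """
    hard_mask = (sim_neg >= threshold).float()  # mask is constant, so it truncates gradients
    return (-sim_pos + lam * (hard_mask * sim_neg).sum(dim=1)).mean()

# hypothetical usage with random unit-norm embeddings
B, K, D = 8, 32, 128
anchor = torch.nn.functional.normalize(torch.randn(B, D), dim=-1)
positive = torch.nn.functional.normalize(anchor + 0.1 * torch.randn(B, D), dim=-1)
negatives = torch.nn.functional.normalize(torch.randn(B, K, D), dim=-1)

sim_pos = (anchor * positive).sum(dim=1)
sim_neg = torch.einsum('bd,bkd->bk', anchor, negatives)
print(hard_simple_contrastive_loss(sim_pos, sim_neg).item())
```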
In this section, we study two properties: uniformity of the embedding distribution and the tolerance to semantically similar samples. The two properties are both important to the feature quality
contrastive loss - positive features to be aligned (adjusted to coincide), embeddings to match a uniform distribution on the hypersphere
kernel function: mathematically, a kernel function is defined as a non-negative function that is symmetric about the origin and integrates to 1
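This kernel definition presumably refers to the Gaussian potential kernel behind the uniformity metric of Wang & Isola (2020), which I believe the paper reuses; sketching it here for reference:

$$
\mathcal{L}_{uniformity} = \log \, \mathbb{E}_{x,y \sim p_{data}} \left[ e^{-t \, \lVert f(x) - f(y) \rVert_2^2} \right], \quad t > 0
$$

Lower (more negative) values mean the embeddings cover the hypersphere more uniformly.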
when the temperature is small, the contrastive loss tends to push away even the (potentially positive) samples that lie close to the anchor, which makes the local distribution sparse → the embedding tends to be more uniform
The objective of contrastive learning is to learn embeddings that are aligned across augmentations and discriminative across instances.
when the temperature τ is very small, the penalties on the nearest neighbours are strengthened, which pushes semantically similar samples apart so strongly that it breaks the semantic structure of the embedding distribution
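The tolerance whose symbols the next line defines is, as far as I recall the paper's definition (so treat the exact form as an assumption), the mean similarity between samples that share a supervised label:

$$
T = \mathbb{E}_{x, y \sim p_{data}} \left[ \left( f(x)^{\top} f(y) \right) \cdot I_{l(x)=l(y)} \right]
$$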
where l(x) represents the supervised label of image x, and I_{l(x)=l(y)} is an indicator function taking the value 1 when l(x) = l(y) and 0 when l(x) ≠ l(y).
the tolerance is positively related to the temperature τ
💡 the tolerance cannot directly reflect the feature quality. For example, when all the samples collapse onto a single point of the hypersphere, the tolerance is maximized while the feature quality is bad. The hard contrastive loss deals better with the uniformity-tolerance dilemma.
A relatively large temperature helps the model be more tolerant of the potential positive samples without giving up too much uniformity
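To make the two measurements concrete, a small helper sketch (my own code, not the paper's) that estimates both on a batch of L2-normalized embeddings:

```python
import torch

def uniformity(feats, t=2.0):
    """log E[exp(-t * ||f(x) - f(y)||^2)] over pairs; lower (more negative) = more uniform."""
    sq_dists = torch.pdist(feats, p=2).pow(2)
    return sq_dists.mul(-t).exp().mean().log()

def tolerance(feats, labels):
    """Mean cosine similarity over pairs of samples sharing a supervised label."""
    sims = feats @ feats.T
    same_label = labels.unsqueeze(0) == labels.unsqueeze(1)
    off_diag = ~torch.eye(len(labels), dtype=torch.bool)
    return sims[same_label & off_diag].mean()

feats = torch.nn.functional.normalize(torch.randn(256, 128), dim=-1)  # fake embeddings
labels = torch.randint(0, 10, (256,))                                 # fake labels
print(uniformity(feats).item(), tolerance(feats, labels).item())
```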
pretraining - evaluation
As τ decreases, the gap between samples becomes larger → a small τ pushes the hard negative samples harder, concentrating most of the penalty on them, so the embedding becomes more uniform
As τ increases, positive similarities tend to be closer to 1 → the positive samples are more aligned, and the model tends to learn features that are more invariant to the data augmentations.
: same content as above
the learned models with L_simple perform much worse than models trained with the ordinary contrastive loss (74.83 vs 83.27 on CIFAR10, 39.31 vs 56.44 on CIFAR100, 70.83 vs 95.47 on SVHN, 48.09 vs 75.10 on ImageNet100). However, when the negative samples of L_simple are drawn from the nearest neighbours, the trained models achieve competitive results on these datasets. This shows that the hardness-aware property is the core of the success of the contrastive loss.
the hardness-aware property is significant to the success of the contrastive loss. Besides, the temperature plays a key role in controlling the local separation and global uniformity of the embedding distributions.