review - understanding contrastive loss

hyukhun koh · December 31, 2021

Abstract

contrastive loss is a hardness-aware loss function. (hardness-aware: it drives the embedding to classify samples distinctly rather than ambiguously: far apart between classes (inter-class) and close together within a class (intra-class))

temperature controls the strength of penalties.

uniformity helps learn separable features, but pursuing it excessively breaks the underlying semantic structure.

tolerance allows semantically similar samples to stay close to each other

→ uniformity-tolerance dilemma: the temperature τ can help strike a compromise between the two

Introduction

→ two embedding distributions can have the same loss value, yet one of them is better (locally clustered while globally separated).

: contrastive loss aims to learn a general feature function which maps data onto a hypersphere so that positive pairs are attracted and negative pairs are separated.

: contrastive loss is a hardness-aware loss function which concentrates on optimizing the hard negative samples.

: temperature plays a role in controlling the strength of penalties on the hard negative samples.

: if the contrastive loss is equipped with a very small temperature, the loss function will give very large penalties to the nearest neighbours, which are very likely to share similar semantic content with the anchor point

we observe that embeddings trained with τ = 0.07 are more uniformly distributed; however, the embeddings trained with τ = 0.2 present a more reasonable distribution which is locally clustered and globally separated.

CPC, CMC, SimCLR,

We emphasize the significance of the temperature τ, and use it as a proxy to analyze some intriguing phenomena of contrastive learning.

Hardness-aware property

the softmax-based contrastive loss is a hardness-aware loss function, which automatically concentrates on separating the more informative negative samples to make the embedding distribution more uniform
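The softmax-based loss in question can be sketched in NumPy as below. This is a minimal single-anchor sketch, not the paper's code; the variable names, vector shapes, and the default `tau=0.2` are illustrative assumptions.

```python
import numpy as np

def contrastive_loss(anchor, positive, negatives, tau=0.2):
    """Softmax-based contrastive loss (InfoNCE form) for a single anchor.

    anchor, positive: L2-normalized vectors of shape (d,)
    negatives:        L2-normalized vectors of shape (K, d)
    tau:              temperature
    """
    s_pos = anchor @ positive            # cosine similarity to the positive
    s_neg = negatives @ anchor           # cosine similarities to the K negatives
    logits = np.concatenate(([s_pos], s_neg)) / tau
    m = logits.max()                     # log-sum-exp trick for stability
    log_denom = m + np.log(np.exp(logits - m).sum())
    return log_denom - s_pos / tau       # = -log softmax(positive slot)
```

The loss is always positive and shrinks as the positive pair becomes more similar, which is the attract/separate behaviour described above.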

1. gradient analysis

the magnitude of the gradient with respect to the positive sample is equal to the sum of the gradients with respect to all negative samples; normalizing each negative gradient by this total defines a probability distribution, which helps understand the role of the temperature τ.

The temperature controls the distribution of the negative gradients. A smaller temperature concentrates more of the penalty on the nearest neighbours of the anchor point, which is how τ controls the hardness-aware sensitivity.
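This balance can be checked numerically. For the softmax loss, the gradient with respect to the positive similarity is -(1 - p_pos)/τ and with respect to each negative similarity is p_i/τ, where p is the softmax over the scaled similarities. A sketch with made-up similarity values (`tau` and the inputs are arbitrary):

```python
import numpy as np

def similarity_gradients(s_pos, s_negs, tau=0.2):
    """Gradients of the softmax contrastive loss w.r.t. the similarities.

    Returns (g_pos, g_negs): dL/ds_pos and dL/ds_i for each negative.
    """
    logits = np.concatenate(([s_pos], np.asarray(s_negs))) / tau
    p = np.exp(logits - logits.max())
    p /= p.sum()                     # softmax over positive + negatives
    g_pos = -(1.0 - p[0]) / tau      # pulls the positive closer
    g_negs = p[1:] / tau             # pushes each negative away
    return g_pos, g_negs
```

The magnitudes balance exactly, and `g_negs / g_negs.sum()` is the probability distribution over negatives that the temperature reshapes.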

2. the role of temperature

as τ → 0, the loss attends only to the nearest (hardest) negative sample

as τ → ∞, the penalty distribution over the negatives becomes almost uniform
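Both limits can be read off the softmax weights the loss puts on the negatives. An illustrative sketch (the similarity values are made up):

```python
import numpy as np

def negative_penalty_weights(s_negs, tau):
    """Relative penalty weight on each negative: softmax(s_i / tau)."""
    z = np.asarray(s_negs, dtype=float) / tau
    w = np.exp(z - z.max())          # subtract max for numerical stability
    return w / w.sum()
```

With a tiny τ virtually all the weight lands on the most similar negative; with a huge τ the weights flatten toward uniform.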

3. explicit hard negative sampling

truncates the gradients with respect to the uninformative negative samples.

the similarity range [-1.0, 1.0] is split at a threshold; the gradients of the negatives whose similarity falls below the threshold are set to zero

→ hard negative sampling

→ in this case the loss takes a simple form (L_simple)

This performs better than the ordinary contrastive loss formulation.
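The truncation above can be sketched as follows. This is a NumPy sketch, not the paper's exact formulation: the threshold name `s_min` and the per-anchor form are assumptions.

```python
import numpy as np

def hard_contrastive_loss(anchor, positive, negatives, tau=0.2, s_min=0.0):
    """Contrastive loss that truncates uninformative negatives: negatives
    whose similarity to the anchor falls below `s_min` are dropped, so
    their gradients become exactly zero. Vectors are L2-normalized."""
    s_pos = anchor @ positive
    s_neg = negatives @ anchor
    s_neg = s_neg[s_neg >= s_min]        # keep only the hard negatives
    logits = np.concatenate(([s_pos], s_neg)) / tau
    m = logits.max()
    return m + np.log(np.exp(logits - m).sum()) - s_pos / tau
```

Setting `s_min=-1.0` recovers the ordinary loss over all negatives; raising the threshold removes easy negatives from the denominator.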

Uniformity-tolerance dilemma

In this section, we study two properties: the uniformity of the embedding distribution and the tolerance to semantically similar samples. Both properties are important for feature quality.

1. embedding uniformity


contrastive loss encourages positive features to be aligned and the embeddings to match a uniform distribution on the hypersphere

kernel function: mathematically, a kernel function is defined as a non-negative function that is symmetric about the origin and integrates to 1

when the temperature is small, the contrastive loss tends to push away even the samples closest to the anchor, which makes the local distribution sparse → the embedding tends to be uniform
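The kernel remark above refers to measuring uniformity with a Gaussian kernel, as in the Wang & Isola (2020) uniformity metric commonly used for hypersphere embeddings: the log of the mean pairwise Gaussian potential, where lower values mean a more uniform distribution. A minimal sketch (the choice t = 2 is the conventional one, assumed here):

```python
import numpy as np

def uniformity(features, t=2.0):
    """Log of the mean pairwise Gaussian potential over distinct pairs of
    L2-normalized embeddings (shape (n, d)); lower = more uniform."""
    diffs = features[:, None, :] - features[None, :, :]
    sq_dists = (diffs ** 2).sum(-1)
    iu = np.triu_indices(features.shape[0], k=1)  # distinct pairs only
    return np.log(np.mean(np.exp(-t * sq_dists[iu])))
```

A fully collapsed embedding scores 0 (the worst value), while points spread over the sphere score lower (better).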

2. Tolerance to potential positive samples


The objective of contrastive learning is to learn the augmentation alignment and instance discriminative embedding.

when the temperature τ is very small, the penalties on the nearest neighbours are strengthened, which pushes semantically similar samples strongly apart and breaks the semantic structure of the embedding distribution

where l(x) represents the supervised label of image x, and the indicator function 1_{l(x)=l(y)} takes the value 1 when l(x) = l(y) and 0 when l(x) ≠ l(y).

the tolerance is positively related to the temperature τ
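The tolerance definition above can be computed directly. A sketch, assuming `features` are L2-normalized embeddings and `labels` are the supervised labels l(x):

```python
import numpy as np

def tolerance(features, labels):
    """Mean similarity over distinct pairs sharing a supervised label,
    i.e. the pairs where the indicator 1_{l(x)=l(y)} equals 1."""
    sims = features @ features.T                 # pairwise cosine similarities
    same = labels[:, None] == labels[None, :]    # same-label mask
    np.fill_diagonal(same, False)                # exclude x paired with itself
    return sims[same].mean()
```

Coincident same-class points give the maximal tolerance of 1, orthogonal same-class points give 0, matching the caveat below that high tolerance alone does not imply good features.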

💡 the tolerance cannot directly reflect the feature quality. For example, when all the samples reside in a single point of the hypersphere, the tolerance is maximized while the feature quality is bad

3. ETC

The hard contrastive loss deals better with the uniformity-tolerance dilemma

A relatively large temperature helps the model be more tolerant to the potential positive samples without decreasing uniformity too much

Results

1. experiment details

pretraining - evaluation

2. local separation

As τ decreases, the gaps between samples become larger → a small τ pushes hard negative samples apart more strongly and concentrates most of the penalty on them, making the distribution more uniform

As τ increases, positive similarities tend to be closer to 1 → the positive samples are better aligned, and the model tends to learn features that are more invariant to the data augmentations.

3. Feature Quality, uniformity and tolerance

: same content as above

5. substitution of contrastive loss

the learned models with L_simple perform much worse than models trained with the ordinary contrastive loss (74.83 vs 83.27 on CIFAR10, 39.31 vs 56.44 on CIFAR100, 70.83 vs 95.47 on SVHN, 48.09 vs 75.10 on ImageNet100). However, when the negative samples of L_simple are drawn from the nearest neighbours, the trained models achieve competitive results on all four datasets. This shows that the hardness-aware property is the core of the success of the contrastive loss.

6. conclusion

the hardness-aware property is significant to the success of the contrastive loss. Besides, the temperature plays a key role in controlling the local separation and global uniformity of the embedding distributions.

profile
NLP Researcher : https://hyukhunkoh-ai.github.io/
