[Paper Review] [DINO] Emerging Properties in Self-Supervised Vision Transformers

gredora · March 7, 2023

https://arxiv.org/pdf/2104.14294.pdf

Abstract

The paper examines whether self-supervised learning offers unique benefits to Vision Transformers (ViT) compared to convolutional networks. Self-supervised ViT features contain explicit information about the semantic segmentation of an image and are excellent k-NN classifiers. The study highlights the importance of the momentum encoder, multi-crop training, and the use of small patches with ViTs. The authors propose a simple self-supervised method called DINO, which they interpret as a form of self-distillation with no labels.

Introduction

The authors propose a self-supervised pretraining approach called DINO, a form of knowledge distillation with no labels, and show that the resulting ViT features explicitly contain the scene layout and object boundaries and perform well with a basic k-NN classifier, reaching 78.3% top-1 accuracy on ImageNet. The study also highlights the importance of using smaller patches with ViTs, the momentum encoder, and multi-crop augmentation. DINO is flexible and works on both convnets and ViTs without modifying the architecture or internal normalizations.

Self-supervised learning

The paper discusses different approaches to self-supervised learning, including discriminative instance classification, which does not scale well with the number of images and requires large batches or memory banks. Recent works such as BYOL show that unsupervised features can be learned without discriminating between images, by using a momentum encoder and matching features to the representations it produces. The authors build on a similar formulation, but with a different similarity-matching loss and the same architecture for both student and teacher networks, interpreting the result as a form of Mean Teacher self-distillation with no labels.

Self-training and knowledge distillation

The paper discusses self-training, which aims to improve the quality of features by propagating a small initial set of annotations to a large set of unlabeled instances. This can be done with hard or soft label assignments; the soft case is often referred to as knowledge distillation. The authors extend knowledge distillation to the case where no labels are available at all, building on previous work that combined self-supervised learning and knowledge distillation. However, unlike previous work that relied on a pre-trained, fixed teacher, their teacher is built dynamically during training. Their work is also related to codistillation, where student and teacher share the same architecture and distill from each other during training; unlike codistillation, however, their teacher is updated with an exponential moving average of the student.

Approach

SSL with Knowledge Distillation

The paper describes the DINO framework, which shares similarities with recent self-supervised approaches but also incorporates knowledge distillation. The authors illustrate DINO with a figure and present a PyTorch-style pseudo-code implementation in Algorithm 1. They adapt the problem of matching the probability distributions of a student and a teacher network to self-supervised learning by constructing different distorted views, or crops, of an image with a multi-crop strategy. All crops are passed through the student network, while only the global views are passed through the teacher network, encouraging local-to-global correspondences. The loss is the cross-entropy between the teacher's and the student's softmax outputs, summed over all pairs of views in which the two networks see different crops; it can be used with any number of views, but the authors follow the standard multi-crop setting with 2 global views and several local views. Both networks share the same architecture with different sets of parameters, and the student's parameters are learned by minimizing the loss with stochastic gradient descent.
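
A minimal sketch of this training step, loosely following the paper's Algorithm 1. The `student` and `teacher` callables, the crop lists, and the temperature and centering values are assumptions made for illustration:

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, student_temp=0.1, teacher_temp=0.04):
    # Cross-entropy between the centered, sharpened teacher distribution
    # and the student distribution.
    t = F.softmax((teacher_out - center) / teacher_temp, dim=-1)
    log_s = F.log_softmax(student_out / student_temp, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()

def dino_step(student, teacher, global_crops, local_crops, center):
    # The teacher only sees the two global views and receives no gradients.
    with torch.no_grad():
        teacher_outs = [teacher(x) for x in global_crops]
    # The student sees every crop, global and local.
    student_outs = [student(x) for x in global_crops + local_crops]

    loss, n_terms = 0.0, 0
    for t_idx, t_out in enumerate(teacher_outs):
        for s_idx, s_out in enumerate(student_outs):
            if t_idx == s_idx:  # skip the pair where both networks see the same view
                continue
            loss += dino_loss(s_out, t_out, center)
            n_terms += 1
    return loss / n_terms
```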

Teacher network

DINO builds the teacher network from an exponential moving average (EMA) of the student weights, playing a role similar to the mean teacher used in self-training. The update rule for the teacher is θ_t ← λθ_t + (1 − λ)θ_s. The teacher thus performs a form of model ensembling similar to Polyak-Ruppert averaging with exponential decay, and it guides the training of the student by providing target features of higher quality; the authors observe that the teacher consistently outperforms the student throughout training, a dynamic not reported in previous works. Freezing the teacher network over an epoch also works well in the DINO framework, while simply copying the student weights into the teacher fails to converge.
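
A minimal sketch of the EMA update, assuming standard PyTorch modules for the two networks. The paper ramps λ from 0.996 toward 1 with a cosine schedule; a fixed value is used here only to keep the sketch short:

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    # theta_t <- lambda * theta_t + (1 - lambda) * theta_s
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)
```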

Network architecture

The network is composed of a backbone (a ViT or a ResNet) followed by a projection head. DINO does not use a predictor, so the student and teacher have exactly the same architecture. When the backbone is a ViT, which itself contains no batch normalization, no BN is used in the projection head either, making the system entirely BN-free.
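
A rough sketch of such a BN-free projection head: a small MLP with an l2-normalized bottleneck and a weight-normalized output layer. The layer sizes below are assumptions made for illustration:

```python
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    def __init__(self, in_dim, out_dim=65536, hidden_dim=2048, bottleneck_dim=256):
        super().__init__()
        # MLP without any batch normalization.
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, bottleneck_dim),
        )
        self.last_layer = nn.utils.weight_norm(
            nn.Linear(bottleneck_dim, out_dim, bias=False))

    def forward(self, x):
        x = self.mlp(x)
        x = F.normalize(x, dim=-1)  # l2-normalize the bottleneck features
        return self.last_layer(x)
```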

Avoiding Collapse

Different self-supervised methods avoid model collapse in different ways, for example with a contrastive loss, clustering constraints, a predictor, or batch normalization. DINO can be stabilized using only centering and sharpening of the momentum teacher outputs: centering amounts to adding a bias term to the teacher outputs, and sharpening is obtained with a low temperature in the teacher softmax. The center is updated with an exponential moving average of the batch statistics, which lets the approach work well across different batch sizes.
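
A minimal sketch of the center update, assuming the teacher outputs are collected as in the training-step sketch above; the momentum value is an assumed typical choice:

```python
import torch

@torch.no_grad()
def update_center(center, teacher_outputs, momentum=0.9):
    # EMA of the mean teacher output over the batch. Subtracting this center
    # from the teacher logits before the softmax prevents any single dimension
    # from dominating; sharpening comes from the low teacher temperature.
    batch_center = torch.cat(teacher_outputs, dim=0).mean(dim=0, keepdim=True)
    return center * momentum + batch_center * (1.0 - momentum)
```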

Evaluation

The most striking evaluation result is k-NN classification: frozen DINO features reach 78.3% top-1 accuracy on ImageNet with a simple weighted k-NN classifier, nearly matching a linear probe without training any classifier weights, fine-tuning, or data augmentation.
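
A sketch of the usual weighted k-NN protocol on frozen features; the values of k, the temperature, and the number of classes are assumptions:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def knn_classify(train_feats, train_labels, test_feats, k=20, temp=0.07,
                 num_classes=1000):
    # Weighted k-NN vote over cosine similarities of l2-normalized features.
    train_feats = F.normalize(train_feats, dim=-1)
    test_feats = F.normalize(test_feats, dim=-1)
    sims = test_feats @ train_feats.t()              # cosine similarities
    topk_sims, topk_idx = sims.topk(k, dim=-1)
    topk_labels = train_labels[topk_idx]             # (num_test, k)
    weights = (topk_sims / temp).exp()
    votes = torch.zeros(test_feats.size(0), num_classes)
    votes.scatter_add_(1, topk_labels, weights)
    return votes.argmax(dim=-1)
```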

Conclusion

The paper shows the potential of self-supervised pretraining for a standard ViT model, achieving performance comparable to specially designed convnets. The features also show promise for k-NN classification, image retrieval, and weakly supervised image segmentation. The authors suggest that self-supervised learning could be the key to developing a BERT-like model based on ViT, and they plan to explore pretraining a large ViT model with DINO on random uncurated images to push the limits of visual features.
