[논문 리뷰] DINOv2: Learning Robust Visual Features without Supervision

한의진 · September 22, 2024

Goal

Explore the potential of self-supervised learning to produce general-purpose visual features when pretrained on a large quantity of curated data

Motivation

The success of Natural Language Processing (NLP) was fueled by pretraining on large quantities of raw text using pretext objectives such as language modeling or word vectors

Most image foundation models focus on text-guided pretraining, and this form of pretraining limits the information that can be retained about the image

Contribution

Self-supervised learning has the potential to learn general-purpose visual features if pretrained on a large quantity of curated data
Most of the technical contributions are tailored toward stabilizing and accelerating discriminative self-supervised learning

Revisited existing discriminative self-supervised approaches that learn features at both the image and patch level (iBOT), and reconsidered some of their design choices under the lens of a larger dataset
These improvements make the approach around 2 times faster and require around 3 times less memory than similar discriminative self-supervised methods (allowing longer training with larger batch sizes)

A naïve clustering approach works reasonably well to rebalance concepts across the data and avoid overfitting to a few dominant modes

Data Processing

Assembled the curated LVD-142M dataset

Data Sources

The selection of curated datasets contains ImageNet-22k, the train split of ImageNet-1k, Google Landmarks and several fine-grained datasets
Collected a raw unfiltered dataset of images from a publicly available repository of crawled web data

Deduplication

Applied the copy detection pipeline of Pizzi et al. (2022) to the uncurated data and removed near-duplicate images

Reduces redundancy and increases diversity among images

Self-supervised image retrieval

Computed an image embedding using a self-supervised ViT-H/16 network pretrained on ImageNet-22k

Used cosine-similarity as a distance measure between images and performed k-means clustering of the uncurated data

Implementation Details

The deduplication and retrieval stages of the pipeline rely on the Faiss library (Johnson et al., 2019)

Distributed over a compute cluster of 20 nodes equipped with 8 V100-32GB GPUs each, the processing takes less than two days to produce the LVD-142M dataset
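
As a rough illustration of the retrieval stage, the sketch below uses Faiss for spherical k-means clustering and cosine-similarity retrieval over L2-normalized embeddings. The embedding dimension, dataset sizes, cluster count, and number of neighbors are illustrative assumptions, not the paper's settings.

```python
import faiss
import numpy as np

# Hypothetical L2-normalized self-supervised embeddings
# (cosine similarity equals inner product on normalized vectors).
d = 768                                                    # assumed embedding dimension
curated = np.random.randn(2_000, d).astype("float32")      # curated query embeddings
uncurated = np.random.randn(20_000, d).astype("float32")   # uncurated pool embeddings
faiss.normalize_L2(curated)
faiss.normalize_L2(uncurated)

# Spherical k-means over the uncurated pool (cluster count is illustrative).
kmeans = faiss.Kmeans(d, 256, niter=20, spherical=True)
kmeans.train(uncurated)
_, cluster_ids = kmeans.index.search(uncurated, 1)   # cluster assignment per image

# For each curated image, retrieve nearest uncurated neighbors by cosine similarity.
index = faiss.IndexFlatIP(d)
index.add(uncurated)
sims, neighbors = index.search(curated, 4)           # k=4 neighbors per query (assumed)
```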

Discriminative Self-supervised Pre-training

Learning features with a discriminative self-supervised method that can be seen as a combination of DINO and iBOT losses with the centering of SwAV

Image-level objective (Caron et al., 2021)

Student and teacher features are obtained from the class token of a ViT, computed on different crops of the same image
Passed the student class token through the student DINO head (an MLP outputting a vector of prototype scores)
Applied the teacher DINO head to the teacher class token to obtain the teacher prototype scores
Then applied a softmax followed by a centering with a moving average to obtain p_t
The parameters of the student are learned, and the teacher head is built with an exponential moving average (EMA) of past iterates; a minimal sketch follows below
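
A minimal sketch of this image-level objective, following the original DINO recipe in which the center is subtracted before the teacher softmax; the head modules, temperatures, and momentum values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dino_image_loss(student_cls, teacher_cls, student_head, teacher_head,
                    center, t_s=0.1, t_t=0.04, center_momentum=0.9):
    """Cross-entropy between student scores and centered teacher scores.

    student_cls / teacher_cls: class tokens from different crops of the same image.
    center: running buffer used for teacher centering (updated in place).
    """
    s_scores = student_head(student_cls)                    # student prototype scores
    with torch.no_grad():
        t_scores = teacher_head(teacher_cls)                # teacher prototype scores
        p_t = F.softmax((t_scores - center) / t_t, dim=-1)  # centering + sharpening
        # moving-average update of the center with the batch mean of teacher scores
        center.mul_(center_momentum).add_(t_scores.mean(dim=0), alpha=1 - center_momentum)
    log_p_s = F.log_softmax(s_scores / t_s, dim=-1)
    return -(p_t * log_p_s).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher_params, student_params, m=0.996):
    """Teacher parameters follow an exponential moving average of the student."""
    for t_param, s_param in zip(teacher_params, student_params):
        t_param.mul_(m).add_(s_param.detach(), alpha=1 - m)
```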

Patch-level objective (Zhou et al., 2022a)

Randomly masked some of the input patches given to the student, but not to the teacher
Applied the student iBOT head to the student mask tokens
Similarly, applied the teacher iBOT head to the teacher patch tokens corresponding to the ones masked in the student
Then applied the softmax and centering steps as above to obtain the iBOT loss term (sketched below)
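
The patch-level term reuses the same softmax-and-centering machinery, but only over the masked positions; a minimal sketch, with tensor shapes and the masking convention assumed for illustration:

```python
import torch
import torch.nn.functional as F

def ibot_patch_loss(student_patch_scores, teacher_patch_scores, mask,
                    center, t_s=0.1, t_t=0.04):
    """Loss over masked patch tokens only.

    *_patch_scores: (B, N, K) iBOT-head outputs for all N patch tokens.
    mask:           (B, N) boolean, True where the student's input patch was masked.
    """
    with torch.no_grad():
        p_t = F.softmax((teacher_patch_scores - center) / t_t, dim=-1)
    log_p_s = F.log_softmax(student_patch_scores / t_s, dim=-1)
    loss_per_token = -(p_t * log_p_s).sum(dim=-1)   # (B, N)
    return loss_per_token[mask].mean()              # average over masked tokens only
```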

➢ Untying head weights between both objectives

• Both the DINO and the iBOT loss use a learnable MLP projection head
• The original iBOT ablation study showed that sharing parameters between the DINO and iBOT heads leads to better performance
• At scale, the authors observed that the opposite is true, so two separate heads are used in all the experiments

➢ Sinkhorn-Knopp centering

• Ruan et al. (2023) recommend replacing the teacher softmax-centering step of DINO and iBOT with the Sinkhorn-Knopp (SK) batch normalization of SwAV (Caron et al., 2020)
• The Sinkhorn-Knopp algorithm is run for 3 iterations (see the sketch below)
• For the student, the softmax normalization is applied as before
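
A minimal sketch of the Sinkhorn-Knopp step applied to the teacher scores, following the SwAV-style normalization with 3 iterations; the batch size, prototype count, and temperature are assumptions:

```python
import torch

@torch.no_grad()
def sinkhorn_knopp(teacher_scores, temperature=0.04, n_iters=3):
    """Alternately normalize prototypes (rows) and samples (columns) so the
    teacher assignment matrix has approximately uniform marginals."""
    Q = torch.exp(teacher_scores / temperature).t()   # (K prototypes, B samples)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True)   # normalize rows: prototypes
        Q /= K
        Q /= Q.sum(dim=0, keepdim=True)   # normalize columns: samples
        Q /= B
    Q *= B                                # each column now sums to 1 (a soft assignment)
    return Q.t()                          # back to (B, K)

# Example: replaces the teacher softmax-centering in the losses sketched above.
p_t = sinkhorn_knopp(torch.randn(64, 65536))   # assumed batch of 64, 65536 prototypes
```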

➢ KoLeo regularizer

• The KoLeo regularizer, derived from the Kozachenko-Leonenko differential entropy estimator, encourages a uniform span of the features within a batch (a sketch follows below)
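
A minimal sketch of a KoLeo-style regularizer: each feature is pushed away from its nearest neighbor in the batch, which spreads the features out. L2-normalizing the features first follows the paper's description; the rest of the implementation details are assumptions.

```python
import torch
import torch.nn.functional as F

def koleo_loss(features, eps=1e-8):
    """-1/n * sum_i log(d_i), where d_i is the distance from feature i
    to its nearest neighbor within the batch."""
    x = F.normalize(features, dim=-1)        # features are L2-normalized first
    dist = torch.cdist(x, x)                 # pairwise distances, shape (B, B)
    dist.fill_diagonal_(float("inf"))        # exclude self-distances
    nearest, _ = dist.min(dim=1)             # distance to the nearest neighbor
    return -torch.log(nearest + eps).mean()
```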

Efficient implementation

Fast and memory-efficient attention

Implemented their own version of FlashAttention to improve memory usage and speed of the self-attention layers
It matches or outperforms the original implementation in all cases considered, while covering more use-cases and hardware

The efficiency is best when the embedding dimension per head is a multiple of 64, and the matrix operations are even better when the full embedding dimension is a multiple of 256

The ViT-g architecture in the paper therefore differs from the one proposed by Zhai et al. in order to maximize compute efficiency: it uses an embedding dimension of 1536 with 24 heads (64 dimensions per head) rather than 1408 with 16 heads
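
The paper relies on its own FlashAttention variant; as a hedged stand-in, the sketch below uses PyTorch's built-in scaled_dot_product_attention, which dispatches to a flash or memory-efficient kernel when the hardware and dtypes allow it. The shapes simply mirror the 1536-dim / 24-head choice mentioned above.

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

B, H, N = 2, 24, 1024            # batch, heads, tokens (illustrative)
D = 1536 // H                    # 64 dims per head: a multiple of 64, as noted above

q = torch.randn(B, H, N, D, device=device, dtype=dtype)
k = torch.randn(B, H, N, D, device=device, dtype=dtype)
v = torch.randn(B, H, N, D, device=device, dtype=dtype)

# Fused attention kernel; not the paper's implementation, just a stand-in.
out = F.scaled_dot_product_attention(q, k, v)   # (B, H, N, D)
```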

Sequence Packing

The DINO algorithm requires forwarding both large crops and small crops
When split into patches, these two groups are represented by token sequences of different lengths and cannot be forwarded together
Instead, the sequences that must be forwarded through the transformer are concatenated into a single long sequence
A block-diagonal mask is applied to the self-attention matrix in attention layers, preventing attention between different sequences
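
A rough sketch of the idea: pack the large-crop and small-crop token sequences into one long sequence and build a block-diagonal attention mask so tokens only attend within their own crop. The actual implementation uses a dedicated block-diagonal attention kernel; the sizes below are assumptions.

```python
import torch
import torch.nn.functional as F

def block_diagonal_mask(lengths):
    """Boolean mask that is True only where attention is allowed,
    i.e. between tokens belonging to the same packed sequence."""
    blocks = [torch.ones(n, n, dtype=torch.bool) for n in lengths]
    return torch.block_diag(*blocks)

# Example: one large crop (256 patches + 1 class token) and one small crop (64 + 1).
lengths = [257, 65]
mask = block_diagonal_mask(lengths)            # (322, 322)

tokens = torch.randn(1, sum(lengths), 384)     # packed sequence, assumed dim 384
out = F.scaled_dot_product_attention(tokens, tokens, tokens, attn_mask=mask)
```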

Efficient stochastic depth

Skips the computation of the dropped residuals rather than masking the result
Saves memory and compute in proportion approximately equal to the drop rate
With high drop rates, this gives a drastic improvement in compute efficiency and memory usage
The implementation consists of randomly shuffling the B samples over the batch dimension and slicing the first (1-d)*B samples for the computations in the block (sketched below)
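
A minimal sketch of this efficient stochastic depth, under the assumption that `block` is the residual branch (attention or MLP) of a transformer block; the rescaling of the kept residuals is one common convention and may differ from the paper's exact code.

```python
import torch

def efficient_stochastic_depth(x, block, drop_rate=0.4, training=True):
    """Run the residual branch only on a random subset of the batch,
    instead of computing it for everyone and masking the result.

    x:     (B, N, D) token sequences.
    block: callable residual branch, e.g. attention or MLP sub-block.
    """
    if not training or drop_rate == 0.0:
        return x + block(x)
    B = x.shape[0]
    keep = max(1, int(round(B * (1.0 - drop_rate))))   # (1 - d) * B samples are kept
    perm = torch.randperm(B, device=x.device)          # shuffle over the batch dimension
    kept_idx = perm[:keep]                             # only these samples are computed
    out = x.clone()
    # rescale the kept residuals so the expected output matches full computation
    out[kept_idx] = x[kept_idx] + block(x[kept_idx]) / (1.0 - drop_rate)
    return out
```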

Fully-Sharded Data Parallel (FSDP)

Minimizing the objective with the AdamW optimizer requires 4 model replicas in float32 precision: student, teacher, optimizer first moments, and optimizer second moments
This sums to 16GB of memory for a billion-parameter model such as the ViT-g used in the paper
To reduce the memory footprint per GPU, the model replicas are split across GPUs
The 16GB is sharded across GPUs using the PyTorch implementation of FSDP
The PyTorch implementation of FSDP also saves on cross-GPU communication costs
Broadcasting weights and reducing gradients is done in float16 precision for the backbone
This leads to approximately a 50% reduction in communication costs compared to float32
The training procedure scales more efficiently than DDP with float16 autocast when scaling the number of GPU nodes (a minimal FSDP sketch follows below)
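
A rough sketch of sharding a model with PyTorch FSDP while doing the weight all-gather and gradient reduction in float16, as described above; the backbone is a placeholder module and the wrapping is deliberately minimal, so this is not the paper's training setup.

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

# Placeholder backbone; in the paper this would be the ViT-g student/teacher.
model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=1536, nhead=24, batch_first=True),
    num_layers=4,
)

# Sharded parameters and optimizer state stay in float32, while weight
# broadcasts (all-gather) and gradient reductions happen in float16.
mp_policy = MixedPrecision(
    param_dtype=torch.float16,    # dtype for the forward/backward and weight all-gather
    reduce_dtype=torch.float16,   # dtype for the gradient reduce-scatter
)

# Assumes torch.distributed has already been initialized (e.g. via torchrun).
sharded_model = FSDP(model, mixed_precision=mp_policy)
optimizer = torch.optim.AdamW(sharded_model.parameters(), lr=1e-4)
```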

Model distillation

Most of the technical improvements to the training loop aim at improving the training of large models over large quantities of data
Knowledge distillation aims at reproducing the output of a large model with a smaller model by minimizing some distance between both outputs for a set of given inputs
Used a larger model as a frozen teacher, kept a spare EMA of the student that serves as the final model, removed the masking and stochastic depth, and applied the iBOT loss on the two global crops
This approach achieved better performance than training from scratch, even for a ViT-L
The distillation method ends up close to the one described by Duval et al. (2023), except that the loss terms are not modified for distillation and the EMA of the student is the model that is evaluated (a sketch follows below)
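
A minimal sketch of one distillation step in this spirit: a frozen large teacher provides targets on the two global crops, the smaller student is trained against them, and a spare EMA copy of the student is maintained as the final model. The `loss_fn` and module names are placeholders, not the paper's code.

```python
import torch

def distillation_step(student, frozen_teacher, ema_student, optimizer,
                      global_crop_1, global_crop_2, loss_fn, m=0.999):
    """One training step: distill from a frozen teacher on two global crops,
    then update the spare EMA copy of the student (used as the final model)."""
    with torch.no_grad():
        t1 = frozen_teacher(global_crop_1)        # teacher targets, no masking
        t2 = frozen_teacher(global_crop_2)
    s1 = student(global_crop_1)                   # student runs without stochastic depth
    s2 = student(global_crop_2)
    loss = loss_fn(s1, t2) + loss_fn(s2, t1)      # cross-crop distillation loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():                         # spare EMA of the student
        for p_ema, p_s in zip(ema_student.parameters(), student.parameters()):
            p_ema.mul_(m).add_(p_s.detach(), alpha=1 - m)
    return loss.item()
```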


The actual implementation stage is currently in progress: research is underway that applies DINOv2 to lesion images to improve accuracy.
