Exploring the potential of self-supervised learning to learn general-purpose visual features when pretrained on a large quantity of curated data
The success of Natural Language Processing (NLP) was fueled by pretraining on large quantities of raw text using pretext objectives or word vectors
Most image foundation models focus on text-guided pretraining, which limits the information that can be retained about the image
Self-supervised learning has the potential to learn general-purpose visual features if pretrained on a large quantity of curated data
Most of the technical contributions are tailored toward stabilizing and accelerating discriminative self-supervised learning
Revisited existing discriminative self-supervised approaches that learn features at both the image and patch level (iBOT), and reconsidered some of their design choices under the lens of a larger dataset
These improvements make the approach around 2 times faster and reduce memory usage by about 3 times compared to similar discriminative self-supervised methods, allowing longer training with larger batch sizes
A naïve clustering approach works reasonably well to rebalance concepts in the data and avoid overfitting on a few dominant modes
Assembled the curated LVD-142M dataset
The selection of curated datasets contains ImageNet-22k, the train split of ImageNet-1k, Google Landmarks and several fine-grained datasets
Collected a raw unfiltered dataset of images from a publicly available repository of crawled web data
Applied the copy detection pipeline of Pizzi et al. (2022) to the uncurated data and removed near-duplicate images
This reduces redundancy and increases diversity among the images
Computed an image embedding using a self-supervised ViT-H/16 network pretrained on ImageNet-22k
Used cosine-similarity as a distance measure between images and performed k-means clustering of the uncurated data
The deduplication and retrieval stages of the pipeline rely on the Faiss library (Johnson et al., 2019)
The processing was distributed over a compute cluster of 20 nodes equipped with 8 V100-32GB GPUs and took less than two days to produce the LVD-142M dataset
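A minimal sketch of how the deduplication/retrieval-style steps above could look with Faiss, assuming the ViT-H/16 embeddings are already computed; the array shapes, cluster count, and neighbor count are illustrative placeholders, not values from the text:

```python
import faiss
import numpy as np

# Placeholder embeddings standing in for the self-supervised ViT-H/16 features
# (rows: images, columns: embedding dimensions).
uncurated = np.random.rand(10_000, 1280).astype("float32")
curated_queries = np.random.rand(100, 1280).astype("float32")

# L2-normalize so that inner product equals cosine similarity.
faiss.normalize_L2(uncurated)
faiss.normalize_L2(curated_queries)

# k-means clustering of the uncurated embeddings; spherical=True keeps the
# centroids normalized, consistent with the cosine-similarity distance.
kmeans = faiss.Kmeans(d=uncurated.shape[1], k=1000, niter=20, spherical=True)
kmeans.train(uncurated)
_, cluster_ids = kmeans.index.search(uncurated, 1)   # cluster assignment per image

# Retrieval: for each curated image, find its nearest uncurated neighbors
# by cosine similarity with a flat inner-product index.
index = faiss.IndexFlatIP(uncurated.shape[1])
index.add(uncurated)
similarities, neighbor_ids = index.search(curated_queries, 4)  # top-4 per query
```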
Learning features with a discriminative self-supervised method that can be seen as a combination of DINO and iBOT losses with the centering of SwAV
For both the student and teacher networks, features come from the class token of a ViT, obtained from different crops of the same image
Passed the student class token through the student DINO head (an MLP outputting a vector of "prototype scores")
Applied the teacher DINO head to the teacher class token to obtain the teacher prototype scores
Then applied a softmax followed by centering with a moving average to obtain the teacher distribution p_t; the DINO loss term is the cross-entropy between p_t and the student's softmax output
Learned the parameters of the student and built the teacher head with an exponential moving average (EMA) of past iterates
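A minimal PyTorch sketch of the image-level objective described above; the temperatures, momentum values, and function names are illustrative assumptions rather than values from the text:

```python
import torch
import torch.nn.functional as F

def dino_loss(student_scores, teacher_scores, center,
              student_temp=0.1, teacher_temp=0.05, center_momentum=0.9):
    """Cross-entropy between teacher and student prototype scores.

    student_scores / teacher_scores: (batch, num_prototypes) outputs of the
    student / teacher DINO heads for class tokens of different crops of the
    same image. `center` is a running mean of teacher scores used for centering.
    """
    with torch.no_grad():
        # Teacher: centering with a moving average, then softmax (no gradient).
        p_t = F.softmax((teacher_scores - center) / teacher_temp, dim=-1)
        center = center_momentum * center + (1 - center_momentum) * teacher_scores.mean(dim=0)
    # Student: softmax over its own prototype scores.
    log_p_s = F.log_softmax(student_scores / student_temp, dim=-1)
    loss = -(p_t * log_p_s).sum(dim=-1).mean()
    return loss, center

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # The teacher's parameters are an exponential moving average of the student's.
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1 - momentum)
```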
Randomly masked some of the input patches given to the student, but not to the teacher
Applied the student iBOT head to the student mask tokens
Similarly applied the teacher iBOT head to the teacher patch tokens corresponding to the ones masked in the student
Then applied the softmax and centering steps as above to obtain the iBOT loss term
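A corresponding sketch of the patch-level (iBOT-style) term, restricted to the masked positions; shapes and temperatures are again illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def ibot_patch_loss(student_patch_scores, teacher_patch_scores, mask,
                    center, student_temp=0.1, teacher_temp=0.05):
    """Patch-level loss computed only at the masked positions.

    student_patch_scores: (batch, num_patches, num_prototypes) from the student
        iBOT head, where the patches flagged in `mask` were replaced by mask tokens.
    teacher_patch_scores: same shape, from the teacher iBOT head on the unmasked image.
    mask: (batch, num_patches) boolean, True where the student's input was masked.
    """
    s = student_patch_scores[mask]   # (num_masked, num_prototypes)
    t = teacher_patch_scores[mask]
    with torch.no_grad():
        # Same softmax + centering treatment as for the image-level loss.
        p_t = F.softmax((t - center) / teacher_temp, dim=-1)
    log_p_s = F.log_softmax(s / student_temp, dim=-1)
    return -(p_t * log_p_s).sum(dim=-1).mean()
```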
Both the DINO and the iBOT loss use a learnable MLP projection head
An ablation study shows that sharing parameters between the DINO and iBOT heads leads to better performance
At scale, they observed that the opposite is true, and used two separate heads in all the experiments
Ruan et al. (2023) recommend replacing the teacher softmax-centering step of DINO and iBOT with the Sinkhorn-Knopp (SK) batch normalization of SwAV (Caron et al., 2020)
Run the Sinkhorn-Knopp algorithm steps for 3 iterations on the teacher; for the student, the softmax normalization is applied
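A minimal sketch of a SwAV-style Sinkhorn-Knopp normalization for the teacher scores, run for 3 iterations as noted above; the temperature is an illustrative assumption:

```python
import torch

@torch.no_grad()
def sinkhorn_knopp(teacher_scores, temp=0.05, n_iters=3):
    """Normalize teacher scores into assignments whose rows (samples) are
    distributions and whose columns (prototypes) receive equal total mass."""
    Q = torch.exp(teacher_scores / temp).t()   # (num_prototypes, batch)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        # Normalize rows: each prototype gets equal total mass.
        Q /= Q.sum(dim=1, keepdim=True)
        Q /= K
        # Normalize columns: each sample's assignment sums to 1/B.
        Q /= Q.sum(dim=0, keepdim=True)
        Q /= B
    Q *= B   # make each sample a proper distribution over prototypes
    return Q.t()   # (batch, num_prototypes)
```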
The KoLeo regularizer, derived from the Kozachenko-Leonenko differential entropy estimator, encourages a uniform span of the features within a batch
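A minimal sketch of a KoLeo-style regularizer based on within-batch nearest-neighbor distances; the epsilon and the use of L2-normalized features are assumptions of this sketch:

```python
import torch
import torch.nn.functional as F

def koleo_loss(features, eps=1e-8):
    """Penalize small nearest-neighbor distances within the batch, pushing
    features to spread out (span the space more uniformly).

    features: (batch, dim) feature vectors."""
    x = F.normalize(features, dim=-1)
    # Pairwise distances; mask the diagonal so a point is not its own neighbor.
    dists = torch.cdist(x, x)
    dists.fill_diagonal_(float("inf"))
    nearest = dists.min(dim=1).values
    # Maximizing the log nearest-neighbor distance == minimizing its negative.
    return -torch.log(nearest + eps).mean()
```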
Implemented their own version of FlashAttention to improve memory usage and speed in the self-attention layers
Their version is on par with or better than the original in all cases considered, while covering more use cases and hardware
The ViT-g architecture of the paper differs from the architecture proposed by Zhai et al. in order to maximize compute efficiency: it uses an embedding dimension of 1536 with 24 heads, rather than 1408 with 16 heads
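The per-head arithmetic behind this choice (the dimensions come from the text; the claim that a 64-dimensional head maps better onto fused attention kernels is an assumption of this note):

```python
# Per-head embedding dimension implied by the two configurations.
zhai_head_dim = 1408 // 16    # 88 dimensions per head
dinov2_head_dim = 1536 // 24  # 64 dimensions per head, a friendlier size
                              # for fused attention kernels (assumption)
```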
The DINO algorithm requires forwarding both large crops and small crops
After splitting into patches, these two groups are represented by token sequences of different lengths and cannot be forwarded together
Instead, the sequences to be forwarded through the transformer are concatenated into a single long sequence
A block-diagonal mask is applied to the self-attention matrix in attention layers, preventing attention between different sequences
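A minimal sketch of this sequence packing with a block-diagonal mask, using plain PyTorch scaled-dot-product attention; the crop token counts and head dimension are illustrative:

```python
import torch
import torch.nn.functional as F

def block_diagonal_mask(seq_lens):
    """Boolean attention mask that only allows attention within each packed
    sequence (True = attend, False = blocked)."""
    total = sum(seq_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in seq_lens:
        mask[start:start + n, start:start + n] = True
        start += n
    return mask

# Pack one large crop and two small crops into a single long sequence.
seq_lens = [257, 50, 50]            # illustrative token counts per crop
mask = block_diagonal_mask(seq_lens)

# One attention call over the packed sequence; the mask prevents tokens of
# different crops from attending to each other.
q = k = v = torch.randn(1, 1, sum(seq_lens), 64)   # (batch, heads, tokens, head_dim)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```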
The stochastic depth implementation skips the computation of the dropped residuals rather than masking the result
Saves memory and compute in proportion approximately equal to the drop rate
Drastic improvement in compute efficiency and memory usage
The implementation consists of randomly shuffling the B samples over the batch dimension and slicing the first (1-d)*B samples for the computations in the block
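A minimal sketch of that batch-slicing trick (the residual rescaling used in standard stochastic depth is omitted for brevity; `block` is a placeholder for a residual branch):

```python
import torch

def efficient_stochastic_depth(x, block, drop_rate):
    """Apply a residual block only to a random subset of the batch and skip
    the computation entirely for the rest, instead of masking the output.

    x: (B, N, D) token sequences; block: residual branch (attention or MLP);
    drop_rate: fraction d of samples whose residual is dropped."""
    B = x.shape[0]
    keep = max(1, int(B * (1 - drop_rate)))
    # Randomly shuffle the batch and keep the first (1 - d) * B samples.
    perm = torch.randperm(B, device=x.device)
    kept = perm[:keep]
    out = x.clone()                       # dropped samples pass through unchanged
    out[kept] = x[kept] + block(x[kept])  # residual computed only for kept samples
    return out
```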
Minimizing the objective with the AdamW optimizer requires 4 model replicas in float32 precision – student, teacher, optimizer first moments, optimizer second moments
This sums to 16GB of memory for a billion-parameter model such as the ViT-g used in the paper (4 replicas × 1B parameters × 4 bytes ≈ 16GB)
To reduce the memory footprint per GPU, the model replicas are split across GPUs
Sharding 16GB across GPUs using the PyTorch implementation of FSDP
The PyTorch implementation of FSDP also saves on cross-GPU communication costs
Broadcasting weights and reducing gradients is done in float16 precision for the backbone
This leads to approximately a 50% reduction in communication costs compared to float32
The training procedure scales more efficiently than DDP with float16 autocast when scaling the number of GPU nodes
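A minimal sketch of sharding the student with PyTorch FSDP and a float16 communication policy, assuming the distributed process group is already initialized (e.g. via torchrun) and that `student` is the ViT backbone built elsewhere:

```python
import torch
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)

def wrap_with_fsdp(student: torch.nn.Module) -> FSDP:
    # Weights are broadcast and gradients reduced in float16, while the
    # sharded master weights and optimizer state remain in float32.
    fp16_comm = MixedPrecision(
        param_dtype=torch.float16,
        reduce_dtype=torch.float16,
    )
    return FSDP(
        student,
        sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads, optimizer state
        mixed_precision=fp16_comm,
        device_id=torch.cuda.current_device(),
    )

# Usage sketch: the AdamW state is then sharded along with the parameters.
# sharded_student = wrap_with_fsdp(student)
# optimizer = torch.optim.AdamW(sharded_student.parameters(), lr=1e-3)
```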
Most of the technical improvements to the training loop aim at improving the training of large models over large quantities of data
Knowledge distillation aims at reproducing the output of a large model with a smaller model by minimizing some distance between both outputs for a set of given inputs
Used a larger model as a frozen teacher, kept a spare EMA of the student as the final model, removed the masking and stochastic depth, and applied the iBOT loss on the two global crops
This approach achieved better performance than training from scratch, even for a ViT-L
The distillation method ends up close to the one described by Duval et al. (2023), except that they do not modify the loss terms for distillation and evaluate the EMA of the student
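A minimal sketch of this distillation setup with a frozen large teacher; the temperatures are illustrative, and the centering/normalization details of the full objective are omitted:

```python
import torch
import torch.nn.functional as F

def distill_step(frozen_teacher, student, global_crops,
                 teacher_temp=0.05, student_temp=0.1):
    """One distillation step: the smaller student matches the prototype
    distributions of a frozen, larger pretrained teacher on the two global
    crops (no masking, no stochastic depth).

    global_crops: (2 * batch, C, H, W) — the two global crops concatenated."""
    with torch.no_grad():
        p_t = F.softmax(frozen_teacher(global_crops) / teacher_temp, dim=-1)
    log_p_s = F.log_softmax(student(global_crops) / student_temp, dim=-1)
    return -(p_t * log_p_s).sum(dim=-1).mean()
```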
The actual implementation stage is currently in progress: research is underway on applying DINOv2 to lesions to improve accuracy.