[Simple Review] [DeiT] Training data-efficient image transformers & distillation through attention

Hyungseop Lee · June 19, 2024

https://proceedings.mlr.press/v139/touvron21a

Paper Info

Touvron, Hugo, et al. "Training data-efficient image transformers & distillation through attention." International conference on machine learning. PMLR, 2021.

Resources that helped me understand the paper quickly

https://velog.io/@heomollang/DeiT-%EA%B4%80%EB%A0%A8-%EB%85%BC%EB%AC%B8-%EB%A6%AC%EB%B7%B0-04-Training-data-efficient-image-transformers-distillation-through-attentionDeiT

Abstract

Recently, neural networks purely based on attention have been shown to
address image understanding tasks such as image classification.
These high-performing vision transformers are pre-trained on hundreds of millions of images
using a large infrastructure, which limits their adoption.
In this work, we produce competitive convolution-free transformers
trained on ImageNet only, on a single computer, in less than 3 days.
Our reference vision transformer (86M parameters) achieves
a top-1 accuracy of 83.1% (single-crop) on ImageNet with no external data.
We also introduce a teacher-student strategy specific to transformers.
It relies on a distillation token that ensures the student learns from the teacher through attention,
typically from a convnet teacher.
The learned transformers are competitive with the state of the art on ImageNet (85.2% top-1 accuracy), and the same holds when they are transferred to other tasks.

3. Vision transformer: overview

The class token

The class token is a trainable vector that is appended to the patch tokens before the first layer.
This vector passes through the transformer layers
and is then projected with a linear layer to predict the class.
The class token is inherited from NLP
and is used to predict the class, in contrast to the pooling layers typically used in computer vision.
The transformer therefore processes batches of $(N+1)$ tokens of dimension $D$,
of which only the class vector is used to predict the output.
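
The flow described above can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the toy sizes `N`, `D`, and `NUM_CLASSES`, the helper names `add_class_token` and `linear_head`, and the choice to place the class token at index 0 are all hypothetical, and the transformer layers themselves are omitted.

```python
import random

def add_class_token(patch_tokens, class_token):
    """Return a new sequence with the class token placed at index 0 (assumed layout)."""
    return [class_token] + list(patch_tokens)

def linear_head(vector, weights, bias):
    """Project one D-dimensional vector to one logit per class."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

random.seed(0)
N, D, NUM_CLASSES = 4, 8, 3  # toy sizes, chosen only for illustration

patch_tokens = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
class_token = [random.gauss(0, 1) for _ in range(D)]  # trainable in a real model

tokens = add_class_token(patch_tokens, class_token)  # N + 1 tokens of dimension D

# The transformer layers would update every token here; they are omitted.
encoded = tokens

weights = [[random.gauss(0, 1) for _ in range(D)] for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

# Only the class token's output is read to predict the class.
logits = linear_head(encoded[0], weights, bias)
```

The patch outputs are still computed by every layer, but the classifier never reads them directly; they only influence the prediction through attention with the class token.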

Fixing the positional encoding across resolutions.

Touvron et al. (2019) showed that it is desirable to train at a lower resolution
and then fine-tune the network at a larger resolution.
This speeds up the full training and improves accuracy under prevailing data augmentation schemes.
When the resolution of the input image is increased, the patch size stays the same, so the number of input patches $N$ changes.
Thanks to the architecture of the transformer blocks and the class token,
neither the model nor the classifier needs to be modified to process more tokens.
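
To make the change in $N$ concrete, here is a small arithmetic sketch. The patch size of 16 and the resolutions 224 and 384 are values I picked for illustration, not numbers stated in this section, and `num_patches` is a hypothetical helper.

```python
def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side

PATCH_SIZE = 16  # assumed patch size, kept fixed across resolutions

for resolution in (224, 384):  # assumed training / fine-tuning resolutions
    n = num_patches(resolution, PATCH_SIZE)
    # The sequence fed to the transformer has N patch tokens plus one class token.
    print(f"resolution={resolution}: N={n}, sequence length={n + 1}")
```

The layer weights do not depend on the sequence length, which is why only the positional embeddings need attention when the resolution changes.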

However, there is one positional embedding per patch, so the positional encoding has to be adapted.
Dosovitskiy et al. (2020) interpolate the positional encoding when changing the resolution and demonstrate that this method works well with the subsequent fine-tuning stage.
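
A minimal sketch of what such an interpolation could look like, assuming the patch embeddings are arranged as a square grid. The function name `resize_pos_embed` is hypothetical, and bilinear interpolation with aligned corners is my own simplification for brevity; it is not necessarily the kernel used in the papers. The class-token embedding has no spatial position, so this sketch assumes it is kept aside and left unchanged.

```python
def resize_pos_embed(grid, new_side):
    """Bilinearly resize a square grid of embedding vectors.

    grid[r][c] is the D-dimensional embedding of the patch at row r, column c.
    Corners of the old and new grids are aligned (a simplifying assumption).
    """
    old_side = len(grid)
    dim = len(grid[0][0])
    # Step, in old-grid units, between neighbouring cells of the new grid.
    scale = (old_side - 1) / (new_side - 1) if new_side > 1 else 0.0
    resized = []
    for r in range(new_side):
        y = r * scale
        y0 = min(int(y), old_side - 1)
        y1 = min(y0 + 1, old_side - 1)
        wy = y - y0
        row = []
        for c in range(new_side):
            x = c * scale
            x0 = min(int(x), old_side - 1)
            x1 = min(x0 + 1, old_side - 1)
            wx = x - x0
            vec = []
            for d in range(dim):
                top = (1 - wx) * grid[y0][x0][d] + wx * grid[y0][x1][d]
                bottom = (1 - wx) * grid[y1][x0][d] + wx * grid[y1][x1][d]
                vec.append((1 - wy) * top + wy * bottom)
            row.append(vec)
        resized.append(row)
    return resized

# Toy example: a 2x2 grid of 1-dimensional embeddings resized to 3x3.
small = [[[0.0], [1.0]],
         [[2.0], [3.0]]]
large = resize_pos_embed(small, 3)
```

In a real model the same call would map, for example, a 14x14 grid to a 24x24 grid, after which fine-tuning adjusts the interpolated embeddings.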

4. Distillation through attention

Distillation token.

This part focuses on our proposal, as illustrated in Figure 2.
We add a new token, the distillation token, to the initial embeddings.
The distillation token is used like the class token: it interacts with the other embeddings through self-attention and is output by the network after the last layer.
Its target objective is given by the distillation component of the loss.
The distillation embedding allows the model
to learn from the output of the teacher, as in regular distillation,
while remaining complementary to the class embedding.
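
As a rough sketch of the token layout, the sequence now holds $N+2$ tokens and the network exposes two output embeddings. The ordering `[class, distillation, patches]` and the helper names `build_sequence` and `read_outputs` are assumptions made for this example; the transformer layers are again omitted.

```python
def build_sequence(class_token, distill_token, patch_tokens):
    """Assumed layout: class token, distillation token, then the N patch tokens."""
    return [class_token, distill_token] + list(patch_tokens)

def read_outputs(encoded_tokens):
    """Return the class embedding and the distillation embedding after the last layer."""
    return encoded_tokens[0], encoded_tokens[1]

N, D = 4, 2  # toy sizes, chosen only for illustration
patch_tokens = [[float(i)] * D for i in range(N)]
class_token = [0.5] * D      # trainable in a real model
distill_token = [-0.5] * D   # trainable in a real model

sequence = build_sequence(class_token, distill_token, patch_tokens)  # N + 2 tokens

# Self-attention would let both special tokens interact with every patch here.
encoded = sequence

class_embedding, distill_embedding = read_outputs(encoded)
# class_embedding feeds the head supervised by the true label;
# distill_embedding feeds the head supervised by the teacher.
```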

Soft distillation
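
For reference, soft distillation as defined in the paper combines a cross-entropy term on the true label $y$ with a temperature-scaled KL term between the softened student and teacher outputs. With $Z_s$ and $Z_t$ the student and teacher logits, $\psi$ the softmax, $\tau$ the temperature, and $\lambda$ the balancing coefficient, the objective can be written as:

$$
\mathcal{L}_{\text{global}} = (1-\lambda)\,\mathcal{L}_{\text{CE}}\big(\psi(Z_s),\, y\big) + \lambda\,\tau^{2}\,\mathrm{KL}\big(\psi(Z_s/\tau),\, \psi(Z_t/\tau)\big)
$$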

Hard-label distillation
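
In hard-label distillation the teacher's hard decision $y_t = \arg\max_c Z_t(c)$ is treated as a second label, and the loss averages two cross-entropy terms, one against $y$ and one against $y_t$. The sketch below is my own illustration: the function names are hypothetical, and feeding the class-token logits to the first term and the distillation-token logits to the second is an assumption that follows the token setup described above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Cross-entropy between softmax(logits) and a hard target index."""
    return -math.log(softmax(logits)[target])

def teacher_hard_label(teacher_logits):
    """Index of the teacher's most confident class (argmax)."""
    return max(range(len(teacher_logits)), key=teacher_logits.__getitem__)

def hard_distillation_loss(class_logits, distill_logits, teacher_logits, label):
    """Average of CE against the true label and CE against the teacher's label."""
    y_t = teacher_hard_label(teacher_logits)
    return (0.5 * cross_entropy(class_logits, label)
            + 0.5 * cross_entropy(distill_logits, y_t))

# Toy logits for a 3-class problem (made-up numbers).
class_logits = [2.0, 0.5, -1.0]
distill_logits = [0.1, 1.5, 0.3]
teacher_logits = [0.2, 3.0, 0.1]
loss = hard_distillation_loss(class_logits, distill_logits, teacher_logits, label=0)
```

Note that the teacher and the true label may disagree, as in this toy example; that is the case in which the two tokens receive genuinely different supervision.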

5. Experiments

Distillation

Distillation between the teacher and the student is carried out through the distillation token.
