Training data-efficient image transformers & distillation through attention, Part 3

이준석 · June 26, 2022

Image Classification

Image classification is so core to computer vision that it is often used as a benchmark to measure progress in image understanding.

Any progress usually translates to improvement in other related tasks such as detection or segmentation.

Since 2012's AlexNet [32], convnets have dominated this benchmark and have become the de facto standard.
de facto: in practice, in effect

The evolution of the state of the art on the ImageNet dataset [42] reflects the progress with convolutional neural network architectures and learning [32, 44, 48, 50, 51, 57].

Despite several attempts to use transformers for image classification [7], until now their performance has been inferior to that of convnets.
inferior: lower, worse

Nevertheless, hybrid architectures that combine convnets and transformers, including the self-attention mechanism, have recently exhibited competitive results in image classification [56], detection [6, 28], video processing [45, 53], unsupervised object discovery [35], and unified text-vision tasks [8, 33, 37].

Recently, Vision Transformers (ViT) [15] closed the gap with the state of the art on ImageNet without using any convolution.

This performance is remarkable since convnet methods for image classification have benefited from years of tuning and optimization [22, 55].

Nevertheless, according to this study [15], a pretraining phase on a large volume of curated data is required for the learned transformer to be effective.

In our paper we achieve a strong performance without requiring a large training dataset, i.e., with Imagenet1k only.

The Transformer architecture

The Transformer architecture, introduced by Vaswani et al. [52] for machine translation, is currently the reference model for all natural language processing (NLP) tasks.

Many improvements of convnets for image classification are inspired by transformers.

For example, Squeeze and Excitation [2], Selective Kernel [34], and Split-Attention Networks [61] exploit mechanisms akin to the transformer's self-attention (SA) mechanism.
exploit: to use, to make use of; akin: similar to

Knowledge Distillation

Knowledge Distillation (KD), introduced by Hinton et al. [24], refers to the training paradigm in which a student model leverages "soft" labels coming from a strong teacher network.

The soft label is the full output vector of the teacher's softmax function, rather than just the class with the maximum score, which would give a "hard" label.
rather than: instead of
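
To make the distinction concrete, here is a minimal sketch of a soft label versus a hard label. The class names and the teacher scores are made-up values for illustration only, and `softmax` and `hard_label` are hypothetical helper names, not functions from the paper's code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hard_label(logits):
    """Index of the highest-scoring class (the "hard" label)."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Assumed teacher scores for three classes: (cat, dog, fox).
teacher_logits = [2.0, 1.0, 0.1]

soft = softmax(teacher_logits)     # full probability vector: the soft label
hard = hard_label(teacher_logits)  # single class index: the hard label

print("soft label:", [round(p, 3) for p in soft])
print("hard label:", hard)
```

The soft label keeps the teacher's relative confidence in the non-winning classes, which the hard label discards.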

Such training improves the performance of the student model (alternatively, it can be regarded as a form of compression of the teacher model into a smaller one, the student).
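
One common way to write such a training objective, in the form usually attributed to Hinton et al. [24], mixes the ordinary cross-entropy on the true label with a KL divergence between temperature-softened teacher and student distributions. The sketch below assumes that form; the mixing weight `lam`, the temperature `tau`, and all function names are illustrative choices, not values taken from the text above.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (higher = softer)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[target_index])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def soft_distillation_loss(student_logits, teacher_logits, target_index,
                           lam=0.5, tau=3.0):
    """(1 - lam) * CE(student, y) + lam * tau^2 * KL(teacher_tau || student_tau)."""
    ce = cross_entropy(softmax(student_logits), target_index)
    kl = kl_divergence(softmax(teacher_logits, tau),
                       softmax(student_logits, tau))
    return (1.0 - lam) * ce + lam * tau ** 2 * kl

# Assumed example logits; class 0 is the true label.
teacher = [2.0, 1.0, 0.1]
loss_agree = soft_distillation_loss([2.0, 1.0, 0.1], teacher, 0)
loss_disagree = soft_distillation_loss([0.1, 1.0, 2.0], teacher, 0)
print(loss_agree < loss_disagree)
```

When the student reproduces the teacher's logits exactly, the KL term vanishes and only the weighted cross-entropy remains.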

On the one hand, the teacher's soft labels will have a similar effect to label smoothing [58].
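
For comparison, here is a minimal sketch of label smoothing in its common uniform-mixture form, where a fraction `epsilon` of the probability mass is spread evenly over all classes. The value of `epsilon` and the function name are assumptions for illustration; other variants distribute the mass only over the non-target classes.

```python
def smooth_labels(target_index, num_classes, epsilon=0.1):
    """Mix a one-hot label with the uniform distribution over classes."""
    uniform = epsilon / num_classes
    return [
        (1.0 - epsilon) + uniform if i == target_index else uniform
        for i in range(num_classes)
    ]

# Three classes, true class is index 0.
smoothed = smooth_labels(target_index=0, num_classes=3, epsilon=0.1)
print([round(p, 4) for p in smoothed])
```

Like a teacher's soft label, the smoothed target is no longer a one-hot vector. Unlike the teacher's output, it assigns the same mass to every non-target class regardless of the image.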

On the other hand, as shown by Wei et al. [54], the teacher's supervision takes into account the effects of the data augmentation, which sometimes causes a misalignment between the real label and the image.
misalignment: mismatch; takes into account: considers

For example, let us consider an image with a "cat" label that represents a large landscape and a small cat in a corner.

If the cat is no longer in the crop produced by the data augmentation, the crop implicitly changes the label of the image.
implicitly: in an implied, unstated way
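
The cat example can be simulated with a toy setup. The sketch below assumes a 100x100 image, a 10x10 cat in the bottom-right corner, and fixed-size 50x50 random crops; all of these numbers and the helper names are hypothetical, chosen only to show how often a crop can lose the labeled object.

```python
import random

def crop_overlaps(crop, box):
    """True if the crop rectangle overlaps the object box at all.

    Both rectangles are (x0, y0, x1, y1) with exclusive upper bounds.
    """
    cx0, cy0, cx1, cy1 = crop
    bx0, by0, bx1, by1 = box
    return cx0 < bx1 and bx0 < cx1 and cy0 < by1 and by0 < cy1

def random_crop(width, height, crop_size, rng):
    """Sample a square crop that lies fully inside the image."""
    x0 = rng.randint(0, width - crop_size)
    y0 = rng.randint(0, height - crop_size)
    return (x0, y0, x0 + crop_size, y0 + crop_size)

IMAGE_W, IMAGE_H = 100, 100
CAT_BOX = (90, 90, 100, 100)  # assumed: small cat in the bottom-right corner

rng = random.Random(0)
crops = [random_crop(IMAGE_W, IMAGE_H, 50, rng) for _ in range(1000)]
missing = sum(not crop_overlaps(c, CAT_BOX) for c in crops)
print(f"{missing} of {len(crops)} crops contain no part of the cat")
```

Under these assumptions the large majority of crops show only landscape while still carrying the "cat" label, whereas a teacher that sees the same crop can give a target that reflects what is actually visible.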

KD can transfer inductive biases [1] in a soft way into a student model using a teacher model where they would be incorporated in a hard way.

For example, it may be useful to induce biases due to convolutions in a transformer model by using a convolutional model as teacher.
induce: to persuade, to bring about

In our paper we study the distillation of a transformer student by either a convnet or a transformer teacher.
distillation: the process of distilling

We introduce a new distillation procedure specific to transformers and show its superiority.
specific: particular to, distinctive