Training data-efficient image transformers & distillation through attention, Part 6

이준석 · June 27, 2022


4. Distillation through attention

We verified that our distillation token adds something to the model, compared to simply adding an additional class token associated with the same target label:
verify: to prove, to confirm
In other words, the distillation token contributes something that a second class token trained on the same target label would not.

instead of a teacher pseudo-label, we experimented with a transformer with two class tokens.
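That is, the second token is trained on the true label rather than on the teacher's pseudo-label, which gives a transformer with two class tokens.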

Even if we initialize them randomly and independently, during training they converge towards the same vector (cos=0.999), and the output embedding are also quasi-identical.
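quasi: almost, nearly; even if: regardless of whether
Even with random, independent initialization, the two tokens converge to the same vector during training (cos = 0.999), and their output embeddings are nearly identical as well.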
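
The "cos = 0.999" figure is a cosine similarity between the two learned class tokens. As a minimal sketch of how such a value is computed, the snippet below uses plain Python lists; the vectors are hypothetical toy values chosen for illustration, not tokens from a trained DeiT model.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors standing in for two class tokens after training.
class_token_a = [0.50, -1.20, 0.30, 2.00]
class_token_b = [0.52, -1.18, 0.31, 1.97]
# Hypothetical unrelated vector, included only for contrast.
unrelated = [-2.0, 0.1, 1.5, 0.3]

sim_tokens = cosine_similarity(class_token_a, class_token_b)
sim_unrelated = cosine_similarity(class_token_a, unrelated)
print(f"two class tokens: {sim_tokens:.4f}")
print(f"unrelated vector: {sim_unrelated:.4f}")
```

A value close to 1 means the two vectors point in almost the same direction, which is why the paper reads it as the two class tokens having become redundant.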

This additional class token does not bring anything to the classification performance.
The extra class token therefore has no effect on classification performance.

In contrast, our distillation strategy provides a significant improvement over a vanilla distillation baseline, as validated by our experiments in Section 5.2.
validate: to prove, to verify, to confirm
By contrast, the distillation strategy improves significantly on a vanilla distillation baseline, as the experiments in Section 5.2 confirm.

Fine-tuning with distillation

We use both the true label and teacher prediction during the fine-tuning stage at higher resolution.
During fine-tuning at higher resolution, both the true label and the teacher's prediction are used.

We use a teacher with the same target resolution, typically obtained from the lower-resolution teacher by the method of Touvron et al [50].
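typically: usually, in general
The teacher has the same target resolution as the fine-tuning stage; it is usually obtained from the lower-resolution teacher with the method of Touvron et al. [50].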

We have also tested with true labels only but this reduces the benefit of the teacher and leads to a lower performance.
Fine-tuning with true labels only was also tested, but it reduces the benefit of the teacher and lowers performance.
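
To make "both the true label and teacher prediction" concrete, here is a minimal sketch of one possible per-sample loss. It assumes the hard-label variant of distillation, where the class head is trained on the true label and the distillation head on the teacher's argmax prediction, with equal weights of 0.5. This passage does not state the weights or the variant, so both are assumptions, and the function name `finetune_distillation_loss` is hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(logits, target):
    """Cross-entropy of the logits against one integer class index."""
    return -log_softmax(logits)[target]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


def finetune_distillation_loss(class_logits, distill_logits,
                               teacher_logits, true_label):
    """Assumed loss: equal-weight sum of two cross-entropy terms."""
    # Hard pseudo-label: the class the teacher considers most likely.
    teacher_label = argmax(teacher_logits)
    # The class head learns from the ground-truth label.
    loss_true = cross_entropy(class_logits, true_label)
    # The distillation head learns from the teacher's prediction.
    loss_teacher = cross_entropy(distill_logits, teacher_label)
    return 0.5 * loss_true + 0.5 * loss_teacher


# Hypothetical logits for a 3-class problem.
loss = finetune_distillation_loss(
    class_logits=[2.0, 0.5, -1.0],
    distill_logits=[1.5, 1.0, -0.5],
    teacher_logits=[0.2, 3.0, -1.0],
    true_label=0,
)
print(f"loss = {loss:.4f}")
```

Dropping the second term corresponds to the "true labels only" setting, which the paper reports as performing worse.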

Classification with our approach: joint classifiers

At test time, both the class or the distillation embeddings produced by the transformer are associated with linear classifiers and able to infer the image label.
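infer: to deduce, to reason out
At test time, the class embedding and the distillation embedding produced by the transformer each feed a linear classifier, and either one can infer the image label.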

Yet our referent method is the late fusion of these two separate heads, for which we add the softmax output by the two classifiers to make the prediction.
for which: for this purpose
The reference method, however, is late fusion of the two separate heads: the softmax outputs of the two classifiers are added to make the prediction.

We evaluate these three options in Section 5.
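
The three options (class head only, distillation head only, late fusion) can be sketched as follows. This is an illustrative sketch in plain Python; the logits are hypothetical values, and the helper name `predict_three_ways` is made up for this note.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


def predict_three_ways(class_logits, distill_logits):
    """Return the predicted label under each of the three options."""
    p_class = softmax(class_logits)
    p_distill = softmax(distill_logits)
    # Late fusion: add the two softmax outputs, then take the argmax.
    fused = [a + b for a, b in zip(p_class, p_distill)]
    return {
        "class_only": argmax(p_class),
        "distill_only": argmax(p_distill),
        "fused": argmax(fused),
    }


# Hypothetical logits where the two heads disagree.
result = predict_three_ways(
    class_logits=[2.0, 1.8, 0.0],
    distill_logits=[0.0, 3.0, 0.0],
)
print(result)
```

Because the fusion adds probabilities rather than raw logits, a head that is confident (here the distillation head) can outvote a head that is nearly undecided between two classes.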
