- Architecture

- re-parameterize (see the sketch below)
    - training time: a multi-branched architecture
    - inference time: a plain CNN-like structure
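
A minimal PyTorch sketch of the fusion step, assuming a training-time block with a 3x3 conv + BN branch and a 1x1 conv + BN branch (the identity branch is omitted for brevity, and helper names such as `fuse_conv_bn` are illustrative, not the paper's code):

```python
import torch.nn as nn
import torch.nn.functional as F

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d):
    """Fold a BatchNorm (using its running statistics) into the preceding conv."""
    std = (bn.running_var + bn.eps).sqrt()
    scale = bn.weight / std                         # per-output-channel scale
    kernel = conv.weight * scale.reshape(-1, 1, 1, 1)
    bias = bn.bias - bn.running_mean * scale
    return kernel, bias

# Training-time branches (channel count chosen arbitrarily for the sketch).
C = 64
conv3, bn3 = nn.Conv2d(C, C, 3, padding=1, bias=False), nn.BatchNorm2d(C)
conv1, bn1 = nn.Conv2d(C, C, 1, bias=False), nn.BatchNorm2d(C)

# Inference-time plain conv: pad the fused 1x1 kernel to 3x3 and add it to
# the fused 3x3 kernel, so a single conv reproduces the multi-branch output.
k3, b3 = fuse_conv_bn(conv3, bn3)
k1, b1 = fuse_conv_bn(conv1, bn1)
fused = nn.Conv2d(C, C, 3, padding=1, bias=True)
fused.weight.data = (k3 + F.pad(k1, [1, 1, 1, 1])).detach()
fused.bias.data = (b3 + b1).detach()
```

In eval mode, `fused(x)` matches the sum of the two branch outputs, which is why the deployed network can be a plain stack of 3x3 convolutions.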
- Review blog
- Architecture

- Depthwise Separable Conv (see the sketch below)
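
A minimal PyTorch sketch of a depthwise separable convolution, assuming a 3x3 depthwise step followed by a 1x1 pointwise step (the `depthwise_separable` helper and the BN/ReLU placement are illustrative):

```python
import torch.nn as nn

def depthwise_separable(in_ch: int, out_ch: int, stride: int = 1) -> nn.Sequential:
    return nn.Sequential(
        # depthwise: one 3x3 filter per input channel (groups=in_ch)
        nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                  padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        # pointwise: 1x1 conv that mixes information across channels
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```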

- Architecture

- A distillation token is added to the ViT; it interacts with the class and patch tokens through the self-attention layers.
- The distillation token is used in a similar fashion to the class token, except that its objective is to reproduce the (hard) label predicted by the teacher, rather than the true label, at the output of the network.
- Both the class and distillation tokens fed to the transformer are learned by back-propagation (a minimal sketch of the token layout follows below).
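
A minimal PyTorch sketch of that token layout, assuming an embedding dimension of 768; the class name, initialization, and shapes are illustrative rather than the paper's implementation:

```python
import torch
import torch.nn as nn

class DistilledTokenInput(nn.Module):
    """Prepend learnable class and distillation tokens to the patch tokens."""
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, embed_dim)
        b = patch_tokens.shape[0]
        cls = self.cls_token.expand(b, -1, -1)
        dist = self.dist_token.expand(b, -1, -1)
        # Both extra tokens then interact with the patch tokens through the
        # transformer's self-attention layers (not shown here).
        return torch.cat([cls, dist, patch_tokens], dim=1)
```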
- Notation
| notation | description |
| --- | --- |
| $Z_t$ | the logits of the teacher model |
| $Z_s$ | the logits of the student model |
| $\tau$ | the temperature for the distillation |
| $\lambda$ | the coefficient balancing the KL divergence loss and the cross-entropy $\mathcal{L}_{\text{CE}}$ |
| $\mathcal{L}_{\text{CE}}$ | the cross-entropy loss |
| $y$ | the ground truth labels |
| $\psi$ | the softmax function |
Soft distillation
minimizes the Kullback-Leibler divergence between the softmax of the teacher and the softmax of the student model.
$$\mathcal{L}_{\text{global}} = (1-\lambda)\,\mathcal{L}_{\text{CE}}(\psi(Z_s), y) + \lambda \tau^2\, \mathrm{KL}\big(\psi(Z_s/\tau),\, \psi(Z_t/\tau)\big)$$
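
A minimal PyTorch sketch of this objective; the defaults for `tau` and `lam` are placeholders rather than the paper's settings, and, as in common distillation implementations, the teacher's softened distribution is used as the KL target:

```python
import torch.nn.functional as F

def soft_distillation_loss(z_s, z_t, y, tau: float = 3.0, lam: float = 0.5):
    """(1 - lam) * CE(student, y) + lam * tau^2 * KL over softened logits."""
    ce = F.cross_entropy(z_s, y)
    kl = F.kl_div(F.log_softmax(z_s / tau, dim=-1),   # student log-probs
                  F.softmax(z_t / tau, dim=-1),        # teacher probs (target)
                  reduction="batchmean")
    return (1 - lam) * ce + lam * tau ** 2 * kl
```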
Hard-label distillation
We introduce a variant of distillation where we take the hard decision of the teacher as a true label. Let $y_t = \mathrm{argmax}_c Z_t(c)$ be the hard decision of the teacher; the objective associated with this hard-label distillation is:
$$\mathcal{L}_{\text{global}}^{\text{hardDistill}} = \frac{1}{2}\mathcal{L}_{\text{CE}}(\psi(Z_s), y) + \frac{1}{2}\mathcal{L}_{\text{CE}}(\psi(Z_s), y_t)$$
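
A minimal PyTorch sketch of the hard-label objective, with `z_s`, `z_t`, and `y` as above:

```python
import torch.nn.functional as F

def hard_distillation_loss(z_s, z_t, y):
    """Average the CE on the true labels and the CE on the teacher's argmax."""
    y_t = z_t.argmax(dim=-1)   # hard decision of the teacher
    return 0.5 * F.cross_entropy(z_s, y) + 0.5 * F.cross_entropy(z_s, y_t)
```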