[ FastViT ] 1. ref papers.

d4r6j · September 25, 2023

1. RepVGG

2. MobileOne

  • Architecture

    • Left : Train time MobileOne block with reparameterizable branches.
    • Right : MobileOne block at inference where the branches are reparameterized.
    • Up : depth-wise conv
    • Down : point-wise conv
  • Conv-BN folding notes. (Equations to be added; a code sketch follows below.)
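
The branch merging in MobileOne relies on folding each Conv-BN pair into a single convolution at inference. Below is a minimal sketch of that Conv-BN folding in PyTorch, assuming a plain `nn.Conv2d` followed by `nn.BatchNorm2d`; the function name `fuse_conv_bn` is mine, not from the MobileOne code.

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold an inference-time BatchNorm into the preceding convolution."""
    fused = nn.Conv2d(
        conv.in_channels, conv.out_channels, conv.kernel_size,
        stride=conv.stride, padding=conv.padding, dilation=conv.dilation,
        groups=conv.groups, bias=True,
    )
    with torch.no_grad():
        # BN at inference is a per-channel affine map: scale = gamma / sqrt(var + eps)
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        conv_bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
        # b' = (b - running_mean) * scale + beta
        fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused

# toy usage: one 3x3 conv + BN branch (shapes only)
conv = nn.Conv2d(3, 16, 3, padding=1, bias=False)
bn = nn.BatchNorm2d(16).eval()
fused = fuse_conv_bn(conv, bn)
x = torch.randn(1, 3, 32, 32)
print(torch.allclose(fused(x), bn(conv(x)), atol=1e-5))  # True
```

Once every branch is a single convolution of the same kernel size (1×1 kernels zero-padded to the larger kernel), the branch weights can be summed into one conv, which is the reparameterized block on the right of the figure.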

3. MetaFormer

4. MLP-Mixer

  • Architecture

5. ResNet-v2

  • Architecture

6. Xception

  • Architecture

  • Depthwise Separable Conv (see the sketch below)
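
A minimal sketch of a depthwise separable convolution, assuming the usual PyTorch formulation (a depthwise conv via `groups=in_channels`, then a 1×1 pointwise conv); the class and layer names are illustrative, not Xception's code.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv (one spatial filter per channel) followed by a 1x1 pointwise conv."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        # groups=in_ch -> each input channel is filtered independently (spatial mixing only)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        # 1x1 conv recombines channels (cross-channel mixing only)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 32, 56, 56)
print(DepthwiseSeparableConv(32, 64)(x).shape)  # torch.Size([1, 64, 56, 56])
```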

7. DeiT

  • Architecture
  • A distillation token is added to ViT; it interacts with the class and patch tokens through the self-attention layers.
  • This distillation token is used in a similar fashion to the class token, except that its objective is to reproduce the (hard) label predicted by the teacher at the output of the network, instead of the true label.
  • Both the class and distillation tokens fed into the transformer are learned by back-propagation. (A token-level sketch follows the notation table below.)
  • Notation

    | notation | description |
    | --- | --- |
    | $Z_t$ | the logits of the teacher model |
    | $Z_s$ | the logits of the student model |
    | $\tau$ | the temperature for the distillation |
    | $\lambda$ | the coefficient balancing the KL divergence loss |
    | $\mathcal{L}_{CE}$ | cross-entropy |
    | $y$ | ground truth labels |
    | $\psi$ | softmax function |
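
A minimal sketch of how the two tokens might be joined with the patch tokens and read out by separate heads. Module and attribute names (`dist_token`, `head_dist`, the token ordering) are my own assumptions, and the stack of self-attention blocks is stubbed out with `nn.Identity`.

```python
import torch
import torch.nn as nn

class DeiTTokens(nn.Module):
    """Class + distillation tokens joined with patch tokens, read out by two heads."""

    def __init__(self, embed_dim: int = 192, num_classes: int = 1000):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))   # learned by backprop
        self.dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim))  # learned by backprop
        self.blocks = nn.Identity()  # stand-in for the stack of self-attention blocks
        self.head = nn.Linear(embed_dim, num_classes)       # supervised by the true label
        self.head_dist = nn.Linear(embed_dim, num_classes)  # supervised by the teacher label

    def forward(self, patch_tokens: torch.Tensor):
        B = patch_tokens.shape[0]
        cls = self.cls_token.expand(B, -1, -1)
        dist = self.dist_token.expand(B, -1, -1)
        # both extra tokens interact with the patch tokens (and each other) in the blocks
        x = self.blocks(torch.cat((cls, dist, patch_tokens), dim=1))
        return self.head(x[:, 0]), self.head_dist(x[:, 1])

tokens = torch.randn(2, 196, 192)  # e.g. 14x14 patches, dim 192
logits_cls, logits_dist = DeiTTokens()(tokens)
```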

Soft distillation

minimizes the Kullback-Leibler divergence between the softmax of the teacher and the softmax of the student model.

$$
\mathcal{L}_{global} = (1-\lambda)\mathcal{L}_{CE}(\psi(Z_s), y) + \lambda \tau^2\,{\rm KL} \left( \psi\left(\frac{Z_s}{\tau}\right), \psi\left(\frac{Z_t}{\tau}\right) \right)
$$
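
A sketch of this soft-distillation objective in PyTorch. The function name and the default values of `tau` and `lam` are illustrative; as in common distillation implementations, `F.kl_div` takes the student log-probabilities as input and the teacher probabilities as target.

```python
import torch
import torch.nn.functional as F

def soft_distillation_loss(z_s, z_t, y, tau: float = 3.0, lam: float = 0.1):
    """(1 - lam) * CE(psi(Z_s), y) + lam * tau^2 * KL between softened distributions."""
    ce = F.cross_entropy(z_s, y)
    kl = F.kl_div(
        F.log_softmax(z_s / tau, dim=-1),  # student log-probabilities (input)
        F.softmax(z_t / tau, dim=-1),      # teacher probabilities (target)
        reduction="batchmean",
    )
    return (1.0 - lam) * ce + lam * tau * tau * kl

# toy usage: batch of 4, 10 classes
z_s, z_t = torch.randn(4, 10), torch.randn(4, 10)
y = torch.randint(0, 10, (4,))
print(soft_distillation_loss(z_s, z_t, y))
```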

Hard-label distillation

We introduce a variant of distillation where we take the hard decision of the teacher as a true label. Let $y_t = {\rm argmax}_c Z_t(c)$ be the hard decision of the teacher; the objective associated with this hard-label distillation is

$$
\mathcal{L}^{\rm hardDistill}_{\rm global} = \frac{1}{2}\mathcal{L}_{\rm CE}(\psi(Z_s), y) + \frac{1}{2}\mathcal{L}_{\rm CE}(\psi(Z_s), y_t)
$$
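
A corresponding sketch for the hard-label variant (illustrative, not the authors' code):

```python
import torch
import torch.nn.functional as F

def hard_distillation_loss(z_s, z_t, y):
    """0.5 * CE(psi(Z_s), y) + 0.5 * CE(psi(Z_s), y_t) with y_t = argmax_c Z_t(c)."""
    y_t = z_t.argmax(dim=-1)  # teacher's hard decision, used as a second label
    return 0.5 * F.cross_entropy(z_s, y) + 0.5 * F.cross_entropy(z_s, y_t)

print(hard_distillation_loss(torch.randn(4, 10), torch.randn(4, 10), torch.randint(0, 10, (4,))))
```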
