PR-383: Solving ImageNet: a Unified Scheme for Training any Backbone to Top Results

YeonJu Kim · March 2, 2023

PR12 Season 4 notes


Regularization
- Augmentation: RandAugment, AutoAugment
- Image: Cutout (erases a rectangular patch), CutMix (combines two images of different classes), MixUp (blends pixels)
- Architecture: drop-path, drop-block (certain parts of the architecture are not updated)
- Label smoothing
- Progressive image resizing during training
- Different train-test resolution
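As a rough illustration of two items from the list above, here is a minimal sketch of label smoothing and MixUp in plain Python. The helper names (`one_hot`, `smooth_labels`, `mixup`), the `eps=0.1` default, and the Beta parameter `0.2` are assumptions chosen for the example, not values taken from the paper.

```python
import random


def one_hot(label, num_classes):
    """Return a one-hot label vector as a list of floats."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def smooth_labels(target, eps=0.1):
    """Label smoothing: keep (1 - eps) of the target and spread eps uniformly."""
    k = len(target)
    return [(1.0 - eps) * t + eps / k for t in target]


def mixup(x_a, y_a, x_b, y_b, lam):
    """MixUp: blend two flattened images and their label vectors with weight lam."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


# Toy usage: two 4-pixel "images" from classes 0 and 1 (made-up values).
lam = random.betavariate(0.2, 0.2)  # mixing weight; 0.2 is an assumed setting
image, label = mixup([0.0] * 4, one_hot(0, 3), [1.0] * 4, one_hot(1, 3), lam)
print(image, label)
print(smooth_labels(one_hot(2, 3)))
```

Both techniques replace the hard one-hot target with a softer distribution. The notes further down list making separate label smoothing unnecessary as one advantage of KD.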
Training configuration
- More training epochs
- Dedicated optimizer for large batch sizes (LAMB optimizer)
- Scaling learning rate with batch size
- Exponential moving average (EMA) of model weights
- Improved weight initializations
- AdamW: Decoupled weight decay
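Two of the items above reduce to one-line formulas, sketched below. The notes do not say which scaling rule is meant, so the linear rule, the reference batch size of 256, and the decay of 0.999 are assumptions for illustration; `scaled_lr` and `ema_update` are hypothetical helper names.

```python
def scaled_lr(base_lr, batch_size, base_batch_size=256):
    """Linear scaling rule (assumed): learning rate grows in proportion to batch size."""
    return base_lr * batch_size / base_batch_size


def ema_update(ema_weights, weights, decay=0.999):
    """One EMA step: ema <- decay * ema + (1 - decay) * current weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# Toy usage with made-up numbers.
print(scaled_lr(0.1, batch_size=1024))

ema = [0.0, 0.0]
for step_weights in ([1.0, 2.0], [1.0, 2.0], [1.0, 2.0]):
    ema = ema_update(ema, step_weights, decay=0.9)
print(ema)
```

The EMA copy is typically the one evaluated at test time, while the raw weights continue to receive optimizer updates.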
Architecture
- VGG, Transformer, EfficientNet, MLP
- Each architecture comes with a tailor-made training scheme fitted to it

A training scheme that works regardless of architecture is needed
- CNN, MLP, mobile-oriented, Transformer
ResNet
- A variety of training schemes work well
- ResNet strikes back: An improved training procedure in timm.
Mobile-oriented models
- Depth-wise convolution
- RMSProp, waterfall learning rate scheduling, EMA
Transformer-based, MLP-only models
- No inductive bias
- Longer training
- Strong augmentation (CutMix-MixUp), drop-path regularization
- Large weight decay, repeated augmentations
Applying a scheme designed for one model to a different model degrades performance

Hinton loss
- Distillation loss: KL divergence between the teacher model's soft labels and the student model's soft predictions, pulling the two distributions close together
- Student loss: cross-entropy on the student model's hard predictions (against the ground-truth labels)
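The two terms above can be sketched in plain Python as follows. This is a minimal illustration rather than the paper's implementation: the `alpha` / `(1 - alpha)` weighting convention is an assumption, the $\tau^2$ factor follows the original Hinton et al. formulation, and all function names and logits are made up.

```python
import math


def softmax(logits, tau=1.0):
    """Softmax with temperature tau (tau = 1 is the vanilla softmax)."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def cross_entropy(probs, label):
    """Cross-entropy of predicted probabilities against a hard label index."""
    return -math.log(probs[label])


def kd_loss(student_logits, teacher_logits, label, alpha=0.5, tau=1.0):
    """Weighted sum of distillation loss and student loss (weighting is assumed)."""
    distill = kl_divergence(softmax(teacher_logits, tau), softmax(student_logits, tau))
    student = cross_entropy(softmax(student_logits), label)
    return alpha * tau ** 2 * distill + (1.0 - alpha) * student


# Toy usage with made-up logits for a 3-class problem.
print(kd_loss([1.0, 2.0, 3.0], [0.5, 2.5, 3.0], label=2, alpha=0.5, tau=1.0))
```

When the student reproduces the teacher's logits exactly, the distillation term is zero and only the cross-entropy term remains.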
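Papers on the importance of KD
- Compounding the performance improvement
  - KD matters for ResNet50 classification
- DeiT: same architecture as ViT, applies KD through a distillation token
- Once-for-All: super-network and sub-network training
- Circumventing outliers of AutoAugment with knowledge distillation
  - KD reduces the noise introduced by augmentation
  - Makes strong augmentation possible
However, KD is not commonly used on ImageNet
Advantages of KD
- Cases where the classes in an image are not fully mutually exclusive: ambiguous even to a human, or two classes in one image
  - Teacher labels carry more information than ground-truth labels: similarities and correlations between classes
  - Corrects label errors; no separate label smoothing needed
  - More effective, robust optimization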

Architecture, batch size, teacher model, architecture-based regularization
- Scaling SGD batch size to 32K for ImageNet training.
- Large batch optimization for deep learning: Training BERT in 76 minutes.
- Architecture-based regularization: whether drop-path is applied does not matter
The value of $\alpha$ (which controls the influence of the distillation loss) does not matter
KD temperature ($\tau$): vanilla works best
- $\tau \gt 1$ : softening the teacher predictions
- $\tau \lt 1$ : sharpening the teacher predictions
- $\tau = 1$ : vanilla softmax
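To make the three temperature regimes concrete, the sketch below applies a temperature-scaled softmax to one set of made-up teacher logits and reports the entropy of the result. The function names and logits are assumptions for illustration only.

```python
import math


def softmax_with_temperature(logits, tau=1.0):
    """Divide logits by tau before the softmax; tau = 1 is the vanilla softmax."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a softer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


teacher_logits = [4.0, 2.0, 1.0]  # made-up logits, for illustration only
for tau in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(teacher_logits, tau)
    print(f"tau={tau}: probs={[round(p, 3) for p in probs]}, entropy={entropy(probs):.3f}")
```

Entropy rises with $\tau$: a larger temperature spreads probability mass across classes (softening), while a smaller one concentrates it on the top class (sharpening).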