[Paper Review] Swin Transformer

이윤석 · October 12, 2021

Swin Transformer : Hierarchical Vision Transformer using Shifted Windows

End-to-end semi-supervised object detection approach (previous models: complex multi-stage methods)
- multi-stage approach
  - achieves reasonably good accuracy
  - final performance is limited when there is not enough data
effective techniques
- soft teacher mechanism
- a box jittering approach : to select reliable pseudo boxes
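Idea (Swin = Shifted Window)
- previous (ViT): every patch attends to every other patch in self-attention => expensive computation cost
- Swin Transformer: patches are grouped into windows and self-attention is computed only within each window; the windows are then shifted and self-attention is computed again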
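The window idea above can be sketched in plain Python. This is only an illustration on a grid of patch ids, not the authors' implementation: the helper names (`window_partition`, `cyclic_shift`, `msa_cost`, `wmsa_cost`) are made up for this note, and the attention mask that the paper applies to windows that wrap around the border is left out. The two cost functions follow the complexity formulas given in the Swin paper for global attention and window attention.

```python
def window_partition(grid, m):
    """Split an H x W grid (list of rows) into non-overlapping m x m windows.

    Each window is returned as a flat list of the patches it contains.
    """
    h, w = len(grid), len(grid[0])
    assert h % m == 0 and w % m == 0, "grid size must be divisible by m"
    windows = []
    for top in range(0, h, m):
        for left in range(0, w, m):
            windows.append([grid[r][c]
                            for r in range(top, top + m)
                            for c in range(left, left + m)])
    return windows


def cyclic_shift(grid, s):
    """Roll the grid up and to the left by s patches, wrapping around."""
    h, w = len(grid), len(grid[0])
    return [[grid[(r + s) % h][(c + s) % w] for c in range(w)]
            for r in range(h)]


def msa_cost(h, w, c):
    """Global self-attention cost: quadratic in the number of patches h*w."""
    return 4 * h * w * c ** 2 + 2 * (h * w) ** 2 * c


def wmsa_cost(h, w, c, m):
    """Window self-attention cost: linear in h*w for a fixed window size m."""
    return 4 * h * w * c ** 2 + 2 * m ** 2 * h * w * c


# 4 x 4 grid of patch ids 0..15, window size 2, shift of 1 (= window size // 2)
grid = [[r * 4 + c for c in range(4)] for r in range(4)]

regular = window_partition(grid, 2)
shifted = window_partition(cyclic_shift(grid, 1), 2)

print(regular[0])  # [0, 1, 4, 5]
print(shifted[0])  # [5, 6, 9, 10]
```

After the shift, the first window holds patches 5, 6, 9 and 10, which came from four different windows of the regular partition. This is how information passes between windows even though each attention step stays local.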
Unlike a standard Transformer, it proposes a hierarchical structure => performs well on object detection and segmentation
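To make the hierarchical structure concrete, here is a small sketch of how the feature map could change from stage to stage. It assumes the Swin-T style settings from the paper (4 x 4 patches, 96 channels in the first stage, 4 stages) and that each later stage halves the resolution and doubles the channels through patch merging. The function name `stage_shapes` is hypothetical.

```python
def stage_shapes(image_size=224, patch_size=4, embed_dim=96, num_stages=4):
    """Return (height, width, channels) of the feature map at each stage.

    Assumed defaults follow the Swin-T configuration; they are illustrative.
    """
    side = image_size // patch_size   # patches per side after patch embedding
    dim = embed_dim
    shapes = []
    for stage in range(num_stages):
        if stage > 0:
            # Patch merging: 2 x 2 neighbouring patches are merged,
            # so resolution halves and the channel count doubles.
            side //= 2
            dim *= 2
        shapes.append((side, side, dim))
    return shapes


print(stage_shapes())
# [(56, 56, 96), (28, 28, 192), (14, 14, 384), (7, 7, 768)]
```

The result is a pyramid of feature maps at 1/4, 1/8, 1/16 and 1/32 of the input resolution, similar to what CNN backbones give to detection and segmentation heads. A plain ViT keeps a single resolution throughout.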

Results with ADE20K for Semantic Segmentation
- With Swin-L, the computational cost (FLOPs) is generally larger, but the speed (FPS) can be seen to be fast
  - Explanation of mIoU: link below
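As a quick reminder of the metric, mIoU (mean Intersection over Union) is the IoU computed per class and then averaged over the classes. A minimal sketch over flattened label lists follows. The function name `mean_iou` is made up for this note, and skipping classes that appear in neither the prediction nor the ground truth is an assumption; benchmark toolkits may handle that case differently.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes for two equal-length lists of class ids."""
    assert len(pred) == len(target)
    inter = [0] * num_classes   # pixels where prediction and label agree
    union = [0] * num_classes   # pixels in the prediction or the label
    for p, t in zip(pred, target):
        if p == t:
            inter[p] += 1
            union[p] += 1
        else:
            union[p] += 1
            union[t] += 1
    # Assumption: classes absent from both pred and target are skipped.
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious)


pred   = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
# Per-class IoU: class 0 = 1/3, class 1 = 2/3, class 2 = 1/2
print(mean_iou(pred, target, 3))
```

For this toy example the three per-class IoU values average to about 0.5.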

Results with ImageNet-22K pre-training
- Swin Transformer outperforms existing models such as ViT on object detection. It is better in both accuracy and speed, and the speed advantage is especially large
- The #Param values differ so much; wouldn't training take extremely long..?

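SOTA in classification (in object detection, Swin Transformer)
Vision problems used to be solved with CNN architectures, but these are being replaced by Transformer architectures
- CNN: good at finding local features, but does not consider relationships between distant pixels
Pre-trains on more data at a lower cost
Requires large-scale training resources and data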