CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows, Part 1

이준석 · October 20, 2022

Paper link: https://openaccess.thecvf.com/content/CVPR2022/papers/Dong_CSWin_Transformer_A_General_Vision_Transformer_Backbone_With_Cross-Shaped_Windows_CVPR_2022_paper.pdf

Abstract

We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks. A challenging issue in Transformer design is that global self-attention is very expensive to compute, whereas local self-attention often limits the field of interactions of each token.

To address this issue, we develop the Cross-Shaped Window self-attention mechanism, which computes self-attention in horizontal and vertical stripes in parallel. Together the two stripes form a cross-shaped window, and each stripe is obtained by splitting the input feature into stripes of equal width.
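To make the shape of the window concrete, here is a minimal sketch that lists which grid positions one token can reach inside a single block. The function name `cross_shaped_window` and the parameter names (`H`, `W`, `sw`) are chosen for this post and are not taken from the authors' code; the sketch assumes a horizontal stripe spans `sw` rows across the full width and a vertical stripe spans `sw` columns across the full height.

```python
def cross_shaped_window(row, col, H, W, sw):
    """Return the set of (r, c) positions that token (row, col) can attend to.

    Assumption for illustration: the H x W feature map is cut into
    non-overlapping stripes of width sw, and a token sees the horizontal
    stripe and the vertical stripe that contain it.
    """
    # Horizontal stripe: sw consecutive rows, every column.
    r0 = (row // sw) * sw
    horizontal = {(r, c) for r in range(r0, min(r0 + sw, H)) for c in range(W)}

    # Vertical stripe: sw consecutive columns, every row.
    c0 = (col // sw) * sw
    vertical = {(r, c) for r in range(H) for c in range(c0, min(c0 + sw, W))}

    # The union of the two stripes is the cross-shaped window.
    return horizontal | vertical


if __name__ == "__main__":
    window = cross_shaped_window(row=3, col=5, H=8, W=8, sw=2)
    print(len(window))  # 16 + 16 - 4 overlapping positions = 28
```

With `sw` equal to the feature-map size, the window covers every position, which is the global-attention case.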

We provide a mathematical analysis of the effect of the stripe width and vary the stripe width across the layers of the Transformer network, which achieves strong modeling capability while limiting the computation cost.
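The trade-off between stripe width and cost can be sketched with a rough multiply-accumulate count. This is a back-of-envelope derivation made for this post, not a formula quoted from the text above. It assumes four `C x C` linear projections (query, key, value, output), heads split evenly between the horizontal and vertical groups, and it ignores softmax and positional-encoding costs. The function names are hypothetical.

```python
def full_attention_cost(H, W, C):
    """Rough cost of global self-attention on an H x W map with C channels."""
    n = H * W
    projections = 4 * n * C * C   # Q, K, V and output projections
    attention = 2 * n * n * C     # Q @ K^T and attention @ V over all n tokens
    return projections + attention


def cswin_attention_cost(H, W, C, sw):
    """Rough cost of stripe-based attention with stripe width sw.

    Assumption: half of the channels attend within horizontal stripes
    (sw * W tokens each) and half within vertical stripes (sw * H tokens each).
    """
    n = H * W
    projections = 4 * n * C * C
    attention = n * C * (sw * W + sw * H)
    return projections + attention


if __name__ == "__main__":
    # Illustrative sizes only: a 56 x 56 feature map with 64 channels.
    for sw in (1, 2, 7):
        ratio = cswin_attention_cost(56, 56, 64, sw) / full_attention_cost(56, 56, 64)
        print(f"sw={sw}: {ratio:.3f} of the global-attention cost")
```

Under this count the attention term grows linearly in `sw`, and setting `sw` to the full feature-map size recovers the global-attention cost, which is why widening the stripes only in the later, lower-resolution layers stays affordable.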

We also introduce Locally-enhanced Positional Encoding (LePE), which handles local positional information better than existing encoding schemes. LePE naturally supports arbitrary input resolutions, and is thus especially effective and friendly for downstream tasks.

Incorporating these designs and a hierarchical structure, CSWin Transformer demonstrates competitive performance on common vision tasks.

Specifically, it achieves 85.4% Top-1 accuracy on ImageNet-1K without any extra training data or labels, 53.9 box AP and 46.4 mask AP on the COCO detection task, and 52.2 mIoU on the ADE20K semantic segmentation task, surpassing the previous state-of-the-art Swin Transformer backbone by +1.2, +2.0, +1.4, and +2.0 respectively under a similar FLOPs setting.

By further pretraining on the larger dataset ImageNet-21K, we achieve 87.5% Top-1 accuracy on ImageNet-1K and high segmentation performance on ADE20K with 55.7 mIoU.

5. Conclusion

In this paper, we have presented a new Vision Transformer architecture named CSWin Transformer. The core design of CSWin Transformer is the CSWin Self-Attention, which performs self-attention in the horizontal and vertical stripes by splitting the multi-heads into parallel groups.
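The head-grouping idea can be illustrated with a small pure-Python sketch. It is a toy written for this post, not the authors' implementation: it skips the learned query/key/value projections (each token's features serve as query, key, and value), assumes an even number of heads, and assumes the channel count is divisible by the number of heads. The first half of the heads attends within horizontal stripes, the second half within vertical stripes, and the per-head outputs are concatenated.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]


def stripe_members(r, c, H, W, sw, horizontal):
    """Positions sharing a stripe of width sw with token (r, c)."""
    if horizontal:
        r0 = (r // sw) * sw
        return [(rr, cc) for rr in range(r0, min(r0 + sw, H)) for cc in range(W)]
    c0 = (c // sw) * sw
    return [(rr, cc) for rr in range(H) for cc in range(c0, min(c0 + sw, W))]


def cswin_attention(x, num_heads, sw):
    """x[r][c] is a list of C floats; returns a map of the same shape.

    Simplification: no learned projections, so Q = K = V = the head's slice of x.
    """
    H, W, C = len(x), len(x[0]), len(x[0][0])
    d = C // num_heads  # channels per head
    out = [[[] for _ in range(W)] for _ in range(H)]
    for h in range(num_heads):
        horizontal = h < num_heads // 2  # first group: horizontal stripes
        lo, hi = h * d, (h + 1) * d
        for r in range(H):
            for c in range(W):
                members = stripe_members(r, c, H, W, sw, horizontal)
                feats = [x[rr][cc][lo:hi] for rr, cc in members]
                # Concatenate this head's output onto the token's result.
                out[r][c].extend(attend(x[r][c][lo:hi], feats, feats))
    return out
```

Because the two groups run independently on disjoint channel slices, they can be computed in parallel, and a single block already mixes information along both a row stripe and a column stripe.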

This multi-head grouping design can enlarge the attention area of each token within one Transformer block efficiently. In addition, the mathematical analysis allows us to increase the stripe width along the network depth to further enlarge the attention area at only a small extra computation cost.

We further introduce locally-enhanced positional encoding into CSWin Transformer for downstream tasks. We achieved state-of-the-art performance on various vision tasks under constrained computation complexity. We are looking forward to applying it to more vision tasks.

1. Introduction

Transformer-based architectures [12, 30, 42, 49] have recently achieved competitive performance compared to their CNN counterparts in various vision tasks. By leveraging the multi-head self-attention mechanism, these vision Transformers demonstrate a high capability in modeling long-range dependencies, which is especially helpful for handling high-resolution inputs in downstream tasks, e.g., object detection and segmentation. Despite this success, the Transformer architecture with a full-attention mechanism [12] is computationally inefficient.
