CvT: Introducing Convolutions to Vision Transformers, Part 3

이준석 · June 29, 2022

Transformers that exclusively rely on the self-attention mechanism to capture global dependencies have dominated in natural language modelling [31, 10, 25].
Transformers, which rely solely on the self-attention mechanism to capture global dependencies, have been dominant in natural language modeling [31, 10, 25].

Recently, the Transformer based architecture has been viewed as a viable alternative to the convolutional neural networks (CNNs) in visual recognition tasks, such as classification [11, 30], object detection [3, 45, 43, 8, 28], segmentation [33, 36], image enhancement [4, 40], image generation [24], video processing [42, 44] and 3D point cloud processing [12].
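alternative: a substitute option; viable: workable, feasible; visual: relating to sight
Recently, Transformer-based architectures have come to be regarded as a viable alternative to convolutional neural networks (CNNs) in visual recognition tasks such as classification [11, 30], object detection [3, 45, 43, 8, 28], segmentation [33, 36], image enhancement [4, 40], image generation [24], video processing [42, 44], and 3D point cloud processing [12].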

Vision Transformers.

The Vision Transformer (ViT) is the first to prove that a pure Transformer architecture can attain state-of-the-art performance (e.g. ResNets [15], EfficientNet [29]) on image classification when the data is large enough (i.e. on ImageNet-22k, JFT-300M).
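The Vision Transformer (ViT) was the first to demonstrate that a pure Transformer architecture can reach state-of-the-art performance (e.g., ResNets [15], EfficientNet [29]) on image classification when the data is large enough (i.e., on ImageNet-22k, JFT-300M).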

Specifically, ViT decomposes each image into a sequence of tokens (i.e. non-overlapping patches) with fixed length, and then applies multiple standard Transformer layers, consisting of Multi-Head Self-Attention module (MHSA) and Positionwise Feed-forward module (FFN), to model these tokens.
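decompose: to break down into parts
Specifically, ViT breaks each image into a fixed-length sequence of tokens (i.e., non-overlapping patches) and then models these tokens with several standard Transformer layers, each consisting of a Multi-Head Self-Attention module (MHSA) and a Position-wise Feed-forward module (FFN).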
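
The tokenization step described above can be illustrated with a minimal sketch. It assumes a single-channel image stored as a list of rows and a hypothetical helper named `patchify`; a real ViT additionally applies a learned linear embedding and positional embeddings to each patch, which are omitted here.

```python
def patchify(image, patch):
    """Split a 2D image (list of rows) into non-overlapping patch tokens."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch x patch block, row by row, into a single token.
            token = [image[top + dy][left + dx]
                     for dy in range(patch) for dx in range(patch)]
            tokens.append(token)
    return tokens


# A 4x4 toy image whose pixel values are 0..15 in row-major order.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, patch=2)
print(len(tokens), tokens[0])  # 4 [0, 1, 4, 5]
```

Because the patches do not overlap, each pixel lands in exactly one token, and the sequence length is fixed by the image and patch sizes.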

DeiT [30] further explores the data-efficient training and distillation for ViT.
DeiT [30] goes further and investigates data-efficient training and distillation for ViT.

In this work, we study how to combine CNNs and Transformers to model both local and global dependencies for image classification in an efficient way.
In this work, we study how to combine CNNs and Transformers so that both local and global dependencies are modeled efficiently for image classification.

In order to better model local context in vision Transformers, some concurrent works have introduced design changes.
concurrent: occurring at the same time
To model local context better in vision Transformers, several concurrent works have introduced design changes.

For example, the Conditional Position encodings Visual Transformer (CPVT) [6] replaces the predefined positional embedding used in ViT with conditional position encodings (CPE), enabling Transformers to process input images of arbitrary size without interpolation.
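interpolation: estimating values between known points; arbitrary: of any kind, unrestricted; enabling: making possible
For example, the Conditional Position encodings Visual Transformer (CPVT) [6] replaces the predefined positional embedding used in ViT with conditional position encodings (CPE), so that Transformers can process input images of arbitrary size without interpolation.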

Transformer-iN-Transformer (TNT) [14] utilizes both an outer Transformer block that processes the patch embeddings, and an inner Transformer block that models the relation among pixel embeddings, to model both patch-level and pixel-level representation.
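model (verb): to build a representation of
Transformer-iN-Transformer (TNT) [14] uses both an outer Transformer block, which processes the patch embeddings, and an inner Transformer block, which models the relations among pixel embeddings, to model both patch-level and pixel-level representations.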

Tokens-to-Token (T2T) [41] mainly improves tokenization in ViT by concatenating multiple tokens within a sliding window into one token.
Tokens-to-Token (T2T) [41] mainly improves the tokenization in ViT by concatenating multiple tokens within a sliding window into a single token.

However, this operation fundamentally differs from convolutions especially in normalization details, and the concatenation of multiple tokens greatly increases complexity in computation and memory.
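However, this operation differs fundamentally from convolution, especially in its normalization details, and concatenating multiple tokens greatly increases computation and memory complexity.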
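
To see why concatenation inflates the cost, here is a rough sketch of merging tokens inside a sliding window. The helper name `soft_split`, the toy sizes, and the absence of padding are assumptions made for illustration, not details taken from the T2T paper.

```python
def soft_split(token_map, window, stride):
    """Concatenate the tokens inside each sliding window into one longer token."""
    rows, cols = len(token_map), len(token_map[0])
    merged = []
    for top in range(0, rows - window + 1, stride):
        for left in range(0, cols - window + 1, stride):
            token = []
            for dy in range(window):
                for dx in range(window):
                    # Append the neighbor's channels; nothing is summed or averaged.
                    token.extend(token_map[top + dy][left + dx])
            merged.append(token)
    return merged


# A 4x4 map of 2-dimensional tokens; each token stores its own (row, col).
token_map = [[[r, c] for c in range(4)] for r in range(4)]
merged = soft_split(token_map, window=3, stride=1)
print(len(merged), len(merged[0]))  # 4 18
```

Each merged token has `window * window` times as many channels as an input token (here 3 * 3 * 2 = 18), whereas a convolution over the same window would sum the weighted neighbors and keep the channel count under direct control.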

PVT [34] incorporates a multi-stage design (without convolutions) for Transformer similar to multi-scales in CNNs, favoring dense prediction tasks.
PVT [34] incorporates a multi-stage design (without convolutions) for Transformers, similar to the multiple scales in CNNs, which favors dense prediction tasks.

In contrast to these concurrent works, this work aims to achieve the best of both worlds by introducing convolutions, with image domain specific inductive biases, into the Transformer architecture.
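best (of both worlds): the advantages of each
In contrast to these concurrent works, this work aims to get the best of both worlds by introducing convolutions, which carry inductive biases specific to the image domain, into the Transformer architecture.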

Table 1 shows the key differences in terms of necessity of positional encodings, type of token embedding, type of projection, and Transformer structure in the backbone, between the above representative concurrent works and ours.
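representative: typical of a group
Table 1 shows the key differences between the representative concurrent works above and ours in terms of the need for positional encodings, the type of token embedding, the type of projection, and the Transformer structure in the backbone.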

Introducing Self-attentions to CNNs.

Self-attention mechanisms have been widely applied to CNNs in vision tasks.
Self-attention has seen wide use in CNNs for vision tasks.

Among these works, the non-local networks [35] are designed for capturing long range dependencies via global attention.
Among these works, non-local networks [35] are designed to capture long-range dependencies through global attention.

The local relation networks [17] adapts its weight aggregation based on the compositional relations (similarity) between pixels/features within a local window, in contrast to convolution layers which employ fixed aggregation weights over spatially neighboring input feature.
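adapt: to adjust; aggregation: combining into a whole; compositional: relating to how parts combine; spatially: in terms of space; employ: to use
Local relation networks [17] adapt their aggregation weights based on the compositional relations (similarity) between pixels/features within a local window, unlike convolution layers, which use fixed aggregation weights over spatially neighboring input features.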

Such an adaptive weight aggregation introduces geometric priors into the network which are important for the recognition tasks.
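priors: prior knowledge or assumptions built into a model (not "priorities")
Such adaptive weight aggregation introduces geometric priors into the network, which are important for recognition tasks.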
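
The contrast between fixed and adaptive aggregation weights can be sketched on a 1D signal. This is only an illustration under loud assumptions: the similarity score (negative absolute difference) and the function names are hypothetical stand-ins, and the actual local relation layer [17] uses learned embeddings rather than raw feature values.

```python
import math


def fixed_aggregate(features, weights):
    """Convolution-style: the same weights are reused at every position."""
    half = len(weights) // 2
    out = []
    for i in range(len(features)):
        acc = 0.0
        for offset, w in enumerate(weights):
            j = i + offset - half
            if 0 <= j < len(features):  # out-of-range neighbors count as zero
                acc += w * features[j]
        out.append(acc)
    return out


def adaptive_aggregate(features, half):
    """Local-relation-style: weights depend on similarity inside the window."""
    out = []
    for i in range(len(features)):
        window = [j for j in range(i - half, i + half + 1)
                  if 0 <= j < len(features)]
        # Illustrative similarity: closer feature values get larger scores.
        scores = [-abs(features[i] - features[j]) for j in window]
        exps = [math.exp(s) for s in scores]
        total = sum(exps)  # softmax normalization over the local window
        out.append(sum(e / total * features[j] for e, j in zip(exps, window)))
    return out


features = [0.0, 0.0, 0.0, 5.0, 5.0, 5.0]  # a sharp edge in the middle
fixed = fixed_aggregate(features, [1 / 3, 1 / 3, 1 / 3])
adaptive = adaptive_aggregate(features, half=1)
print(round(fixed[2], 3), round(adaptive[2], 3))
```

At the position just before the edge, the fixed weights blur the step, while the similarity-based weights largely ignore the dissimilar neighbor and keep the output close to the original value.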

Recently, BoTNet [27] proposes a simple yet powerful backbone architecture that just replaces the spatial convolutions with global self-attention in the final three bottleneck blocks of a ResNet and achieves a strong performance in image recognition.
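Recently, BoTNet [27] proposed a simple yet powerful backbone architecture that merely replaces the spatial convolutions with global self-attention in the final three bottleneck blocks of a ResNet, and it achieves strong performance in image recognition.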

Instead, our work performs an opposite research direction: introducing convolutions to Transformers.
Our work instead takes the opposite research direction: introducing convolutions to Transformers.

Introducing Convolutions to Transformers.

In NLP and speech recognition, convolutions have been used to modify the Transformer block, either by replacing multihead attentions with convolution layers [38], or adding additional convolution layers in parallel [39] or sequentially [13], to capture local relationships.
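sequentially: one after another; either ... or: by one of two means
In NLP and speech recognition, convolutions have been used to modify the Transformer block and capture local relationships, either by replacing multi-head attention with convolution layers [38] or by adding extra convolution layers in parallel [39] or sequentially [13].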

Other prior work [37] proposes to propagate attention maps to succeeding layers via a residual connection, which is first transformed by convolutions.
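propagate: to pass along; succeeding: following, subsequent
Other prior work [37] proposes propagating attention maps to succeeding layers through a residual connection that is first transformed by convolutions.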

Different from these works, we propose to introduce convolutions to two primary parts of the vision Transformer:
Unlike these works, we propose introducing convolutions into two primary parts of the vision Transformer:
first, to replace the existing Position-wise Linear Projection for the attention operation with our Convolutional Projection,
existing: already in use; replace A with B: put B in place of A
first, replacing the existing position-wise linear projection for the attention operation with our Convolutional Projection,
and second, to use our hierarchical multi-stage structure to enable varied resolution of 2D reshaped token maps, similar to CNNs.
varied: differing, diverse
and second, using our hierarchical multi-stage structure to allow 2D-reshaped token maps of varied resolution, as in CNNs.
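
As a rough sketch of the first idea, the snippet below runs a small convolution over a 2D token map and then flattens the result into a token sequence, instead of projecting each token independently. The function name `conv_project`, the single-channel tokens, the averaging kernel, and the `stride` parameter are all assumptions for illustration; this section of the paper does not specify the layer's details.

```python
def conv_project(token_map, kernel, stride=1):
    """Convolve a 2D map of scalar tokens (zero padding), then flatten it."""
    k = len(kernel)
    pad = k // 2
    rows, cols = len(token_map), len(token_map[0])
    out = []
    for top in range(0, rows, stride):
        for left in range(0, cols, stride):
            acc = 0.0
            for dy in range(k):
                for dx in range(k):
                    y, x = top + dy - pad, left + dx - pad
                    if 0 <= y < rows and 0 <= x < cols:
                        # Each output token mixes in its spatial neighbors.
                        acc += kernel[dy][dx] * token_map[y][x]
            out.append(acc)
    return out


token_map = [[1.0] * 4 for _ in range(4)]      # 4x4 map of scalar tokens
mean_kernel = [[1 / 9] * 3 for _ in range(3)]  # hypothetical 3x3 weights
full = conv_project(token_map, mean_kernel, stride=1)
reduced = conv_project(token_map, mean_kernel, stride=2)
print(len(full), len(reduced))  # 16 4
```

Unlike a position-wise linear projection, every projected token here depends on its local neighborhood, and a stride larger than 1 yields a lower-resolution token map, which hints at how token maps of varied resolution could arise across stages.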

Our unique design affords significant performance and efficiency benefits over prior works.
afford: to provide; prior: earlier, previous
Our unique design provides significant performance and efficiency benefits over prior works.