MPViT : Multi-Path Vision Transformer for Dense Prediction 제1부

이준석·2022년 10월 21일

MPViT : Multi-Path Vision Transformer for Dense Prediction

목록 보기

1/2

MPViT : Multi-Path Vision Transformer for Dense Prediction
MPViT : 조밀한 예측을 위한 다중 경로 비전 변압기

깃허브 : https://github.com/youngwanLEE/MPViT

Abstract

Dense computer vision tasks such as object detection and segmentation require effective multi-scale feature representation for detecting or classifying objects or regions with varying sizes. While Convolutional Neural Networks (CNNs) have been the dominant architectures for such tasks, recently introduced Vision Transformers (ViTs) aim to replace them as a backbone. Similar to CNNs, ViTs build a simple multi-stage structure (i.e., fine-to-coarse) for multi-scale representation with single-scale patches.
물체 감지 및 분할과 같은 조밀한 컴퓨터 비전 작업에는 다양한 크기의 물체 또는 영역을 감지하거나 분류하기 위한 효과적인 다중 스케일 기능 표현이 필요합니다. CNN(Convolutional Neural Networks)이 이러한 작업의 지배적인 아키텍처였지만 최근에 도입된 ViT(Vision Transformers)는 이를 백본으로 대체하는 것을 목표로 합니다. CNN과 유사하게 ViT는 단일 규모 패치로 다중 규모 표현을 위한 간단한 다단계 구조(예: 미세에서 거친)를 구축합니다.

In this work, with a different perspective from existing Transformers, we explore multi-scale patch embedding and multi-path structure, constructing the Multi-Path Vision Transformer (MPViT). MPViT embeds features of the same size (i.e., sequence length) with patches of different scales simultaneously by using overlapping convolutional patch embedding.
이 작업에서는 기존 Transformer와 다른 관점에서 다중 스케일 패치 임베딩 및 다중 경로 구조를 탐색하여 MPViT(Multi-Path Vision Transformer)를 구성합니다. MPViT는 중첩 컨볼루션 패치 임베딩을 사용하여 동일한 크기(즉, 시퀀스 길이)의 기능을 다른 스케일의 패치와 동시에 임베딩합니다.

Tokens of different scales are then independently fed into the Transformer encoders via multiple paths and the resulting features are aggregated, enabling both fine and coarse feature representations at the same feature level.
그런 다음 다양한 규모의 토큰이 여러 경로를 통해 Transformer 인코더에 독립적으로 공급되고 결과 기능이 집계되어 동일한 기능 수준에서 미세 및 대략적인 기능 표현이 모두 가능합니다.

Thanks to the diverse, multi-scale feature representations, our MPViTs scaling from tiny (5M) to base (73M) consistently achieve superior performance over state-of-the-art Vision Transformers on ImageNet classification, object detection, instance segmentation, and semantic segmentation.
다양한 다중 스케일 기능 표현 덕분에 작은(5M)에서 기본(73M)으로 확장되는 MPViT는 ImageNet 분류, 객체 감지, 인스턴스 분할 및 의미 분할에서 최첨단 비전 트랜스포머보다 우수한 성능을 일관되게 달성한다.

5. Discussion and Conclusion

Model Capacity Analysis.

Measuring actual GPU throughput and memory usage, we analyze the model capacity of MPViT-S, comparing with recent SOTA Transformers [16, 33, 59, 60] in Table 7. We test all models on the same Nvidia V100 GPU with a batch size of 256. Although CoaT Small [59] achieves the best detection performance thanks its additional cross-layer attention, it exhibits heavier memory usage and GPU computation than CoaT-Lite Small with a simple multi-stage structure similar to Swin-T [33] and Focal-T [60].
실제 GPU 처리량과 메모리 사용량을 측정하여 표 7의 최근 SOTA Transformers [16, 33, 59, 60]와 비교하여 MPViT-S의 모델 용량을 분석합니다. 우리는 배치 크기가 256인 동일한 Nvidia V100 GPU에서 모든 모델을 테스트한다. CoaT Small [59]은 추가적인 교차 레이어 주의 덕분에 최고의 검출 성능을 달성하지만, Swin-T [33] 및 Focolative-T [60]와 유사한 단순한 다단계 구조를 가진 CoaT-Lite Small보다 더 무거운 메모리 사용 및 GPU 계산을 보여준다.

Compared to CoaT Small, MPViT-S consumes much less memory and runs 4× faster with comparable detection performance, which means MPViT can perform efficiently and its multi-scale representations are effective without the additional cross layer attention of CoaT. Moreover, CoaT has limitations in scaling up models due to its exhaustive memory usage, but MPViT can scale to larger models. For XCiT [16] having single-stage structure, XCiT-S12/16 (16x16 patches : scale 4) shows faster speed and less memory usage, while XCiT-S12/8 requires more computation and memory than MPViT-S due to its higher feature resolution. We note that XCiT-S12/8 shows higher classification accuracy (83.4%) than MPViT-S (83.0%), whereas detection performance is the opposite (47.0 vs. 48.4).
CoaT Small에 비해 MPViT-S는 메모리 소모가 훨씬 적고 비슷한 탐지 성능과 함께 4배 더 빠르게 실행되므로 MPViT는 CoaT의 추가적인 교차 계층 주의 없이도 효율적으로 수행되고 다중 스케일 표현이 효과적이다. 또한 CoaT는 메모리 사용량이 많아 모델을 스케일업하는 데 한계가 있지만 MPViT는 더 큰 모델로 스케일업할 수 있다. 단일 단계 구조를 갖는 XCiT[16]의 경우, XCiT-S12/16(16x16 패치: scale 4)은 더 빠른 속도와 더 적은 메모리 사용량을 보이는 반면, XCiT-S12/8은 더 높은 기능 해상도로 인해 MPViT-S보다 더 많은 연산과 메모리를 필요로 한다. XCiT-S12/8은 MPViT-S(83.0%)보다 높은 분류 정확도(83.4%)를 보이는 반면, 탐지 성능은 반대(47.0 대 48.4)이다.

This result demonstrates that for dense prediction tasks, the mutli-scale embedding and multi-path structure of MPViT is both more efficient and effective than the single-stage structure of XCiT equipped with additional up-/down-sampling layers. MPViT also has a relatively smaller memory footprint than most models.
이 결과는 밀도 높은 예측 작업의 경우 MPViT의 다중 스케일 임베딩 및 다중 경로 구조가 추가 업/다운 샘플링 레이어가 장착된 XCiT의 단일 단계 구조보다 더 효율적이고 효과적이라는 것을 보여준다. MPViT는 또한 대부분의 모델보다 상대적으로 메모리 공간이 작다.

Attention maps generated by CoaT-Lite and MPViT at stage4. MPViT has a triple-path structure with patches of various sizes (e.g., 3 × 3, 5 × 5, 7 × 7), leading to fine and coarse features.
See Appendix for more visualization results.
4단계에서 CoaT-Lite 및 MPViT에 의해 생성된 어텐션 맵. MPViT는 다양한 크기(예: 3 × 3, 5 × 5, 7 × 7)의 패치가 있는 3중 경로 구조로 되어 있어 미세하고 거친 기능을 제공합니다.
더 많은 시각화 결과는 부록을 참조하십시오.

Qualitative Analysis.

In Fig. 4, we visualize the attention maps, comparing the triple-path (c in Table 5) with the single-path (CoaT-Lite Mini). Since the triple-path embeds different patch sizes, we visualize attention maps for each path. The attention maps from CoaT-Lite and path-1 have similar patch sizes and show similar attention maps. Interestingly, we observe that attention maps from path-3, which attends to larger patches with higher-level representations, are more object centric, precisely capturing the extents of the objects, as shown in the rightmost column of Fig. 4.
그림 4에서 우리는 삼중 경로(표 5의 c)와 단일 경로(CoaT-Lite Mini)를 비교하여 어텐션 맵을 시각화합니다. 삼중 경로는 서로 다른 패치 크기를 포함하므로 각 경로에 대한 주의 맵을 시각화합니다. CoaT-Lite와 path-1의 어텐션 맵은 패치 크기가 비슷하고 어텐션 맵도 비슷합니다. 흥미롭게도, 우리는 더 높은 수준의 표현으로 더 큰 패치에 주의를 기울이는 경로 3의 주의 맵이 더 객체 중심적이며 그림 4의 가장 오른쪽 열에 표시된 것처럼 객체의 범위를 정확하게 캡처한다는 것을 관찰합니다.

However, at the same time, path-3 suppresses small objects and noise. Contrarily, path-1 attends to small objects due to fine patches, but does not precisely capture largeobject boundaries due to its usage of low-level representations.
This is especially apparent in the 3rd-row of Fig. 4, where path-1 captures a smaller ball, while path-3 attends to a larger person. These results demonstrate that combining fine and coarse features via a multi-path structure can capture objects of varying scales in the given visual inputs.
그러나 동시에 path-3은 작은 물체와 노이즈를 억제합니다. 반대로 path-1은 미세한 패치로 인해 작은 객체에 주의를 기울이지만 저수준 표현을 사용하기 때문에 큰 객체 경계를 정확하게 포착하지 못합니다.
이것은 그림 4의 3번째 줄에서 특히 분명합니다. 여기서 경로 1은 더 작은 공을 포착하고 경로 3은 더 큰 사람에게 주의를 기울입니다. 이러한 결과는 다중 경로 구조를 통해 미세 기능과 거친 기능을 결합하면 주어진 시각적 입력에서 다양한 스케일의 객체를 캡처할 수 있음을 보여줍니다.

Limitations and Future work.

Thanks to the proposed multi-scale embedding strategy and multi-path scheme, we have observed that MPViT significantly outperforms current SOTA Vision Transformers not only on image-level prediction, but also on dense predictions tasks.
제안된 다중 규모 임베딩 전략 및 다중 경로 체계 덕분에 MPViT가 이미지 수준 예측뿐만 아니라 밀집 예측 작업에서도 현재 SOTA Vision Transformers보다 훨씬 우수한 성능을 보입니다.

However, a possible limitation of our MPViT model is the latency at inference time. We hypothesize that the multi-path structure leads to suboptimal GPU utilization as similar observations have been made for grouped convolution [36,58] (e.g., GPU context switching, kernel synchronization, etc.). To alleviate this issue, future works could implement an efficient MPViT and consider the path dimension in the compound scaling strategy [14, 44] which considers all depths, widths, and resolutions.
그러나 MPViT 모델의 가능한 제한 사항은 추론 시간의 대기 시간입니다. 우리는 그룹화된 컨볼루션[36,58](예: GPU 컨텍스트 스위칭, 커널 동기화 등)에 대해 유사한 관찰이 이루어졌기 때문에 다중 경로 구조가 최적이 아닌 GPU 활용으로 이어진다고 가정합니다. 이 문제를 완화하기 위해 향후 작업에서는 효율적인 MPViT를 구현하고 모든 깊이, 너비 및 해상도를 고려하는 복합 스케일링 전략[14, 44]에서 경로 차원을 고려할 수 있습니다.

이준석

인공지능 전문가가 될레요

다음 포스트

MPViT : Multi-Path Vision Transformer for Dense Prediction 제1부