MaxViT: Multi-Axis Vision Transformer 논문 리뷰

Serendipity·2023년 7월 1일
1

Problem
lack of scalability of self-attention mechanisms with respect to image size has limited their wide adoption in state-of-the-art vision backbones.

Solution
multi-axis attention, which allows global-local spatial interactions on arbitrary input resolutions with only linear complexity

-> Ensenble (Backbone : dubbed MaxViT, by repeating the basic building block over multiple stages.)

Performance : image classification, object detection, and visual aesthetic assessment.

Contribution
MaxViT, can achieve state-of-the-art performance on a variety of vision tasks, and more importantly, scale extremely well to massive scale data sizes.

State-of-the-art performancea
MaxViT model on the AVA benchmark [61]. This
dataset consists of 255K images rated by armature photographers through pho-tography contest

how well?

Psuedo code

profile
I'm an graduate student majoring in Computer Engineering at Inha University. I'm interested in Machine learning developing frameworks, Formal verification, and Concurrency.

0개의 댓글