MaxViT: Multi-Axis Vision Transformer Paper Review

Serendipity · July 1, 2023

Problem
The lack of scalability of self-attention mechanisms with respect to image size has limited their wide adoption in state-of-the-art vision backbones.

Solution
Multi-axis attention, which enables both global and local spatial interactions at arbitrary input resolutions with only linear complexity.
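To make the "linear complexity" claim concrete, here is a rough count of query-key pairs, written as a minimal sketch rather than the paper's own analysis. It assumes that multi-axis attention combines a local attention over fixed-size windows with a sparse global attention over a fixed-size grid. The function names and the default sizes of 7 are illustrative assumptions.

```python
def full_attention_pairs(height, width):
    """Query-key pairs when every position attends to every other position."""
    n = height * width
    return n * n


def multi_axis_pairs(height, width, window=7, grid=7):
    """Query-key pairs when each position attends to one fixed-size window
    (local) and one fixed-size grid (sparse global).

    The sizes are assumptions for illustration. Because they stay fixed,
    the total grows linearly with the number of positions.
    """
    n = height * width
    return n * window * window + n * grid * grid


if __name__ == "__main__":
    for side in (56, 112):
        print(side, full_attention_pairs(side, side), multi_axis_pairs(side, side))
```

Doubling the side length from 56 to 112 multiplies the number of positions by 4. Under these assumptions the full-attention count grows by 16x, while the multi-axis count grows by only 4x.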

-> Ensemble (backbone: dubbed MaxViT, built by repeating the basic building block over multiple stages.)

Performance: image classification, object detection, and visual aesthetic assessment.

Contribution
MaxViT can achieve state-of-the-art performance on a variety of vision tasks and, more importantly, scales extremely well to massive data sizes.

State-of-the-art performance
MaxViT is evaluated on the AVA benchmark [61]. This dataset consists of 255K images rated by amateur photographers through photography contests.

How well?

Pseudo code
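As a minimal sketch of the two partitioning schemes behind multi-axis attention, the code below groups feature-map positions the way a block (local) step and a grid (sparse global) step would. It is not the authors' implementation: it works on (row, col) coordinates instead of tensors, and the names block_partition and grid_partition are hypothetical, chosen only for this illustration. Attention would then run independently inside each group.

```python
def block_partition(height, width, window):
    """Group positions into non-overlapping window x window blocks.

    Positions in the same group are spatial neighbours (local interaction).
    """
    assert height % window == 0 and width % window == 0
    groups = {}
    for r in range(height):
        for c in range(width):
            groups.setdefault((r // window, c // window), []).append((r, c))
    return list(groups.values())


def grid_partition(height, width, grid):
    """Group positions into sets of grid x grid evenly strided samples.

    Positions in the same group are spread across the whole map
    (sparse global interaction).
    """
    assert height % grid == 0 and width % grid == 0
    stride_r, stride_c = height // grid, width // grid
    groups = {}
    for r in range(height):
        for c in range(width):
            groups.setdefault((r % stride_r, c % stride_c), []).append((r, c))
    return list(groups.values())


if __name__ == "__main__":
    print(block_partition(4, 4, 2)[0])  # a 2x2 neighbourhood
    print(grid_partition(4, 4, 2)[0])   # four positions spread over the map
```

On a 4x4 map, the first block group holds a 2x2 neighbourhood in the top-left corner, while the first grid group holds positions two steps apart in each direction. Group size is fixed in both cases, so the cost of attending within groups grows linearly with the number of positions.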
