[paper review] Vision Transformers with Mixed-Resolution Tokenization

dusruddl2 · February 21, 2024

paper review


Abstract

Existing ViTs process an input image by splitting it into equal-size patches.
However, in NLP, where the Transformer was first proposed, each token represents a subword, so models have always been trained on tokens of arbitrary size.

The authors question this mismatch and propose the following.
Shouldn't a ViT also be trained on patches of arbitrary, non-equal size?

The core of this paper is a new image tokenization method that produces a mixed-resolution sequence of tokens instead of a standard uniform grid.

To implement this method,
a patch mosaic is built using the Quadtree algorithm and a new saliency scorer.
Low-saliency areas are processed at low resolution, which lets the model focus more on important image regions.
(In other words, unimportant regions are processed at low resolution and important regions at high resolution.)
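
To make the patch mosaic idea concrete, here is a minimal sketch. It assumes a greedy split under a fixed patch budget and uses pixel variance as a stand-in saliency score. Both are my assumptions for illustration, not the paper's scorer, and the names `saliency` and `quadtree_patches` are hypothetical.

```python
import heapq
import numpy as np

def saliency(patch):
    # Stand-in scorer (assumption): pixel variance, NOT the paper's saliency scorer.
    return float(patch.var())

def quadtree_patches(image, max_size=64, min_size=16, num_patches=100):
    """Return a list of (y, x, size) patches that tile a 2D `image`."""
    h, w = image.shape[:2]
    assert h % max_size == 0 and w % max_size == 0
    heap = []    # splittable patches, max-heap via negated score
    leaves = []  # patches already at min_size
    counter = 0  # tie-breaker so heap entries always compare
    # Start from a coarse uniform grid of max_size patches.
    for y in range(0, h, max_size):
        for x in range(0, w, max_size):
            score = saliency(image[y:y + max_size, x:x + max_size])
            heapq.heappush(heap, (-score, counter, y, x, max_size))
            counter += 1
    # Each split replaces one patch with four children (+3 patches).
    while heap and len(heap) + len(leaves) + 3 <= num_patches:
        _, _, y, x, size = heapq.heappop(heap)  # most salient patch
        half = size // 2
        for dy in (0, half):
            for dx in (0, half):
                cy, cx = y + dy, x + dx
                if half > min_size:
                    score = saliency(image[cy:cy + half, cx:cx + half])
                    heapq.heappush(heap, (-score, counter, cy, cx, half))
                    counter += 1
                else:
                    leaves.append((cy, cx, half))
    return leaves + [(y, x, s) for _, _, y, x, s in heap]

# Toy image: only the top-left quadrant has texture.
rng = np.random.default_rng(0)
img = np.zeros((256, 256))
img[:128, :128] = rng.random((128, 128))
patches = quadtree_patches(img, num_patches=76)
```

On this toy image, the textured quadrant ends up covered by 16-pixel patches while the flat area stays at 64 pixels, which is the "low saliency, low resolution" behavior described above.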

Using the same architecture as the vanilla ViT,
the Quadformer model improves image classification performance while keeping the same computational budget.

Introduction

There have been earlier attempts to bring multi-resolution processing to ViT:

introducing feature pyramids, as in CNN architectures,
using multi-resolution attention,
or combining token representations across the whole image.

Unlike this prior work, however,
this paper is the first to propose mixed resolution at the tokenization stage.

As already mentioned in the abstract, the paper can be summarized as follows.

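patch mosaic
: low-saliency areas are processed at low resolution, so the model can focus more on important areas.
quadtree algorithm
: the algorithm used to split the image; it decides whether or not to split a patch based on its saliency score.
2D position embedding
: unlike the original ViT, which used 1D position embeddings, 2D position embeddings are used to encode where each patch is located.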
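
As a rough illustration of why 2D positions are natural here, the sketch below builds a sinusoidal embedding from each patch's center coordinates. The sinusoidal form, the use of patch centers, and the names `sincos_1d` and `pos_embed_2d` are all assumptions on my part and are not taken from the paper.

```python
import numpy as np

def sincos_1d(pos, dim):
    # Standard sinusoidal encoding of scalar positions; `dim` must be even.
    freqs = 1.0 / (10000 ** (np.arange(dim // 2) / (dim // 2)))
    angles = np.outer(pos, freqs)  # shape (n, dim // 2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def pos_embed_2d(patches, dim=64):
    """patches: list of (y, x, size) in pixels; returns an (n, dim) array.

    Assumption: each patch is located by its center, so patches of
    different sizes share a single coordinate system.
    """
    assert dim % 4 == 0
    p = np.asarray(patches, dtype=np.float64)
    cy = p[:, 0] + p[:, 2] / 2  # center row
    cx = p[:, 1] + p[:, 2] / 2  # center column
    # First half encodes the vertical position, second half the horizontal one.
    return np.concatenate([sincos_1d(cy, dim // 2), sincos_1d(cx, dim // 2)], axis=1)

# Two neighboring 32-pixel patches in the same row, plus one 16-pixel patch.
emb = pos_embed_2d([(0, 0, 32), (0, 32, 32), (64, 64, 16)], dim=64)
```

With a uniform grid, a 1D index is enough to recover a patch's location. Once patch sizes vary, the sequence index no longer maps to a fixed place in the image, so the location has to be given explicitly.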

The experiments were run on the ImageNet-1K classification dataset, comparing the proposed method against the vanilla ViT model.

Here, the vanilla ViT uses a single patch size (16^2),
whereas this work uses three patch sizes (16^2, 32^2, 64^2).
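
A quick bit of arithmetic shows how mixing patch sizes changes the token count. The 256x256 image size and the particular mix of patches below are assumptions chosen for illustration, and `count_tokens` is a hypothetical helper.

```python
def count_tokens(image_size, n64, n32, n16):
    """Number of tokens for a mosaic of 64/32/16-pixel patches.

    Raises ValueError if the patches do not tile the image area exactly.
    """
    covered = n64 * 64**2 + n32 * 32**2 + n16 * 16**2
    if covered != image_size**2:
        raise ValueError("patches do not tile the image exactly")
    return n64 + n32 + n16

image_size = 256  # assumed image size, for illustration only

# Uniform 16x16 grid, as in the vanilla ViT.
uniform_tokens = (image_size // 16) ** 2

# One possible mixed-resolution mosaic covering the same image.
mixed_tokens = count_tokens(image_size, n64=4, n32=32, n16=64)

print(uniform_tokens, mixed_tokens)  # 256 100
```

The same image is covered with fewer tokens while a quarter of it is still seen at the finest patch size. This is the trade-off behind keeping the computational budget fixed.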

Method

To be continued..

