ViTAR: Vision Transformer with Any Resolution (Fan et al., 2024)
ViT -> constrained scalability across different image resolutions
ViTAR
Element 1: Adaptive Token Merger (ATM)
Element 2: Fuzzy Positional Encoding (FPE)
Results
ATM receives tokens that have been processed through patch embedding as its input
Preset G_h × G_w -> number of tokens we ultimately aim to obtain
ATM partitions tokens with the shape of H × W into a grid of size G_th × G_tw
Assume H is divisible by G_th, W is divisible by G_tw
=> Number of tokens contained in each grid -> H/G_th × W/G_tw (each typically set to 1 or 2)
Case H ≥ 2G_h -> H/G_th = 2 -> G_th = H/2
Case G_h < H < 2G_h -> pad H to 2G_h, set G_th = G_h -> H/G_th = 2
Case H = G_h -> tokens on the edge of H are no longer fused -> H/G_th = 1
Same with W (see the per-axis sketch below)
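A minimal sketch of this per-axis case analysis, assuming one merging iteration at a time; the name plan_merge_axis and the pad-odd-sizes-to-even handling are my assumptions, not from the ViTAR release:

```python
def plan_merge_axis(size: int, target: int):
    """One-axis grid planning for a single ATM iteration.

    Returns (padded_size, num_grids, tokens_per_grid) with
    padded_size == num_grids * tokens_per_grid. Assumes size >= target.
    """
    if size >= 2 * target:            # case H >= 2*G_h: fuse 2 tokens per grid
        padded = size + (size % 2)    # assumption: pad odd sizes to even
        return padded, padded // 2, 2
    elif size > target:               # case G_h < H < 2*G_h: pad to 2*G_h
        return 2 * target, target, 2
    else:                             # case H == G_h: edge tokens no longer fused
        return size, size, 1

# e.g. with target G_h = 14: 56 -> (56, 28, 2); 20 -> (28, 14, 2); 14 -> (14, 14, 1)
```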
Specific grid -> has tokens {x_ij}, with 0 ≤ i < H/G_th and 0 ≤ j < W/G_tw
Average pooling on all {x_ij} -> mean token
Cross attention (mean token as query; all {x_ij} as key and value) to merge all tokens within a grid into a single token
Residual connections
Fused token -> FFN to complete channel fusion => one iteration of token merging
All iterations share the same weights
Gradually decreasing the values of G_th and G_tw until G_th = G_h & G_tw = G_w (see the merging sketch below)
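A PyTorch sketch of GridAttention and the shared-weight merging loop, reusing the plan_merge_axis helper sketched above; class and argument names are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GridAttention(nn.Module):
    """Merges each grid of tokens into a single token, as described above."""
    def __init__(self, dim: int, num_heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, grids: torch.Tensor) -> torch.Tensor:
        # grids: (B * num_grids, tokens_per_grid, dim)
        q = grids.mean(dim=1, keepdim=True)        # average pooling -> mean token (query)
        fused, _ = self.attn(q, grids, grids)      # cross attention: K, V = grid tokens
        fused = q + fused                          # residual connection
        fused = fused + self.ffn(fused)            # FFN for channel fusion
        return fused.squeeze(1)                    # (B * num_grids, dim)

def atm_merge(x: torch.Tensor, gh: int, gw: int, block: GridAttention) -> torch.Tensor:
    """Iteratively merge (B, H, W, C) tokens down to (B, gh, gw, C).
    Every iteration reuses the same `block`, i.e. all iterations share weights.
    Assumes H >= gh and W >= gw."""
    B, H, W, C = x.shape
    while (H, W) != (gh, gw):
        Hp, nh, th = plan_merge_axis(H, gh)
        Wp, nw, tw = plan_merge_axis(W, gw)
        x = F.pad(x, (0, 0, 0, Wp - W, 0, Hp - H))           # pad right/bottom if needed
        x = x.view(B, nh, th, nw, tw, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(B * nh * nw, th * tw, C)               # one row per grid cell
        x = block(x).view(B, nh, nw, C)                      # each grid -> single token
        H, W = nh, nw
    return x
```

E.g. atm_merge(torch.randn(2, 56, 56, 192), 14, 14, GridAttention(192)) takes 56 × 56 tokens to 14 × 14 in two passes through the same weights.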
Common choices: learnable positional encoding + sin-cos positional encoding
=> highly sensitive to changes in input resolution (fail to provide effective resolution adaptability)
Conv-based PE -> better resolution robustness
FPE: randomly initialize a set of learnable positional embeddings.
During training, FPE provides only fuzzy positional information -> the position shifts within a certain range
(i, j) = exact coordinates of the target token; (i + s1, j + s2) as positional coordinates generated with FPE, where s1, s2 ~ U(-0.5, 0.5)
Add randomly generated coordinate offsets to the reference coordinates in the training stage
Perform grid sample on the learnable positional embeddings -> resulting in the FPE (sketched below)
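A minimal PyTorch sketch of FPE at training time, assuming the learnable embeddings live in a (1, dim, grid_h, grid_w) map sampled with F.grid_sample; the class name and border padding mode are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuzzyPositionalEncoding(nn.Module):
    def __init__(self, dim: int, grid_h: int = 14, grid_w: int = 14):
        super().__init__()
        # randomly initialized set of learnable positional embeddings
        self.pe = nn.Parameter(torch.randn(1, dim, grid_h, grid_w) * 0.02)

    def forward(self, h: int, w: int) -> torch.Tensor:
        # reference coordinates (i, j) of each of the h x w tokens
        ii, jj = torch.meshgrid(
            torch.arange(h, dtype=torch.float32, device=self.pe.device),
            torch.arange(w, dtype=torch.float32, device=self.pe.device),
            indexing="ij",
        )
        if self.training:
            # fuzzy offsets s1, s2 ~ U(-0.5, 0.5), drawn per token
            ii = ii + torch.rand_like(ii) - 0.5
            jj = jj + torch.rand_like(jj) - 0.5
        # map (i + s1, j + s2) into grid_sample's normalized [-1, 1] range
        grid = torch.stack(
            [jj / max(w - 1, 1) * 2 - 1,    # x comes first in grid_sample's last dim
             ii / max(h - 1, 1) * 2 - 1],   # then y
            dim=-1,
        ).unsqueeze(0)                      # (1, h, w, 2)
        # bilinear grid sample of the learnable embeddings -> the FPE
        return F.grid_sample(self.pe, grid, mode="bilinear",
                             padding_mode="border", align_corners=True)  # (1, dim, h, w)
```

In eval mode the offsets vanish, so the same grid sample reduces to precise positional encoding: an exact lookup when the token grid matches the stored map, bilinear interpolation when the resolution differs.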
Case 1: inference at the training resolution -> use precise (non-fuzzy) positional encoding directly
Case 2: inference at a new resolution -> interpolate the learnable positional embeddings