ViTAR review

진성현 · March 31, 2024

Title:

ViTAR: Vision Transformer with Any Resolution (Fan et al., 2024)

Abstract

ViT -> constrained scalability across different image resolutions

ViTAR

  • Element 1

    • Dynamic resolution adjustment with a single Transformer block
    • highly efficient incremental token integration
  • Element 2

    • fuzzy positional encoding
    • consistent positional awareness across multiple resolutions
  • Results

    • 83.3% top-1 acc (1120x1120)
    • 80.4% top-1 acc (4032x4032)
    • compatible with self-supervised learning

1. Introduction

ViTs

  • Segment images into non-overlapping patches, project each patch into tokens, and then apply MHSA to capture the dependencies among different tokens

Results

  • Diverse visual tasks
    • Image classification (Swin Transformer, Neighborhood Attention Transformer)
    • Object detection (DETR, PVT)
    • Vision-language learning (ALIGN, CLIP)
    • Video recognition (SVFormer)

Variable input resolutions

  • Previous works -> no single training setting can cover all resolutions

Direct interpolation

  • Simple and widely used approach
  • Directly interpolate the positional encodings before feeding them into the ViT (see the sketch below)
  • Leads to significant performance degradation at unseen resolutions
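
A minimal PyTorch sketch of this baseline, assuming a learnable PE stored as a (1, N, dim) tensor for a 14×14 token grid (the function name and layout are illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, new_hw, old_hw=(14, 14)):
    """Resize a learnable positional embedding to a new token-grid size.

    pos_embed: (1, old_h * old_w, dim), no class token assumed.
    Bicubic interpolation is the usual choice for plain ViTs; the paper notes
    this degrades accuracy when the test resolution differs a lot from training.
    """
    old_h, old_w = old_hw
    new_h, new_w = new_hw
    dim = pos_embed.shape[-1]
    pe = pos_embed.reshape(1, old_h, old_w, dim).permute(0, 3, 1, 2)  # (1, dim, h, w)
    pe = F.interpolate(pe, size=(new_h, new_w), mode="bicubic", align_corners=False)
    return pe.permute(0, 2, 3, 1).reshape(1, new_h * new_w, dim)
```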

ResFormer

  • Multiple resolution images during training
  • Improvements on positional encodings -> flexible, convolution-based positional encoding

Challenges

  • High performance only on a relatively narrow range of resolutions
  • Cannot integrate with self-supervised learning (e.g., Masked AutoEncoder) because of the convolution-based positional encoding

Vision Transformer with Any Resolution (ViTAR)

  • ViT which processes high-resolution images with low computational burden and exhibits strong resolution generalization capability
  • Adaptive Token Merger (ATM) module
    • Iteratively processes tokens that have undergone the patch embedding
    • Scatters all tokens onto grids
    • Tokens within a grid act as a single unit -> merges tokens within each unit -> maps all tokens onto a grid of fixed shape => grid tokens
    • low computational complexity with high-resolution images
  • Fuzzy Positional Encoding (FPE)
    • introduces a positional perturbation
    • Precise position perception -> fuzzy perception with random noise
    • Prevents the model overfitting to position at specific resolutions
    • Form of implicit data augmentation

2. Related Works

Vision Transformers

Multi-Resolution Inference

  • Remains a largely uncharted field
  • NaViT -> employs original resolution images as input to ViT
  • FlexiViT -> uses patches of multiple sizes to train the model -> adapts to various patch sizes
  • ResFormer -> multi-resolution training -> CNN based positional encoding
    • Challenging to be used in self-supervised learning
    • Significant computational overhead at high resolutions (built on the original ViT)

Positional Encodings

  • Crucial for ViT -> providing it with positional awareness and performance improvements
  • Early ViTs -> sin-cos encoding -> limited resolution robustness
  • CNN based PEs -> stronger resolution robustness

3. Methods

3.1. Overall Architecture

  • ATM + FPE + standard ViT encoder (composition sketched below)
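
A rough composition sketch of this pipeline (module interfaces are my own assumptions for illustration, not the released code):

```python
import torch.nn as nn

class ViTARSketch(nn.Module):
    """Patch embedding -> ATM merges tokens onto a fixed G_h x G_w grid ->
    FPE adds positional information -> standard ViT blocks and a head."""
    def __init__(self, patch_embed, atm, fpe, blocks, head, grid=(14, 14)):
        super().__init__()
        self.patch_embed = patch_embed  # image -> (B, H*W, dim) tokens plus (H, W)
        self.atm = atm                  # Adaptive Token Merger -> (B, G_h*G_w, dim)
        self.fpe = fpe                  # Fuzzy Positional Encoding
        self.blocks = blocks            # stack of standard MHSA Transformer blocks
        self.head = head                # classification head
        self.grid = grid

    def forward(self, x):
        tokens, hw = self.patch_embed(x)        # hw: token-map height and width
        tokens = self.atm(tokens, hw)           # fixed-size grid, regardless of input resolution
        tokens = tokens + self.fpe(*self.grid)  # add (fuzzy) positional encoding
        return self.head(self.blocks(tokens).mean(dim=1))
```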

3.2. Adaptive Token Merger (ATM)

  • Receives tokens that have been processed through patch embedding as its input

  • $G_h \times G_w$ -> number of tokens the model ultimately aims to obtain

  • ATM partitions tokens with the shape of $H \times W$ into a grid of size $G_{th} \times G_{tw}$

    • Assume $H$ is divisible by $G_{th}$ and $W$ is divisible by $G_{tw}$
      => Number of tokens contained in each grid -> ${H \over G_{th}} \times {W \over G_{tw}}$ (each factor typically set to 1 or 2)

    • Case $H \geq 2G_h$ -> $G_{th} = {H \over 2}$ -> ${H \over G_{th}} = 2$

    • Case $2G_h > H > G_h$ -> pad $H$ to $2G_h$ and set $G_{th} = G_h$ -> ${H \over G_{th}} = 2$

    • Case $H = G_h$ -> tokens along $H$ are no longer fused -> ${H \over G_{th}} = 1$

    • The same rule applies to $W$ with $G_{tw}$ (a sketch of this rule follows below)
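
A small helper illustrating the three cases above for one ATM iteration along the height axis (`grid_size_and_pad` is a hypothetical name; the paper does not publish this exact routine):

```python
def grid_size_and_pad(H, G_h):
    """Pick the grid count G_th and padded height for one merging iteration."""
    if H >= 2 * G_h:            # plenty of tokens: fuse pairs, halving the height
        G_th, padded_H = H // 2, H         # assumes H is even (divisibility assumption)
    elif H > G_h:               # between G_h and 2*G_h: pad up to 2*G_h, then fuse pairs
        G_th, padded_H = G_h, 2 * G_h
    else:                       # H == G_h: target size reached, no further fusion on this axis
        G_th, padded_H = G_h, H
    tokens_per_grid = padded_H // G_th     # 2 while merging, 1 once H == G_h
    return G_th, padded_H, tokens_per_grid
```

For example, with H = 56 and G_h = 14 the token-map height shrinks 56 -> 28 -> 14 over successive iterations, fusing two tokens per grid at each step.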

Grid Attention

  • A specific grid contains tokens $x_{ij}$, with $0 \leq i < {H \over G_{th}}$ and $0 \leq j < {W \over G_{tw}}$

  • Average pooling on all $\{x_{ij}\}$ -> mean token

  • Cross attention to merge all tokens within a grid into a single token

    • Q: mean token
    • K, V: all $\{x_{ij}\}$
  • Residual connections

  • $x_{avg} = \text{AvgPool}(\{x_{ij}\})$

  • $\text{GridAttn}(\{x_{ij}\}) = x_{avg} + \text{Attn}(x_{avg}, \{x_{ij}\}, \{x_{ij}\})$

  • Fused token -> FFN to complete channel fusion => one iteration of token merging (sketched below)

  • All iterations share the same weights

  • Gradually decrease $(G_{th}, G_{tw})$ until $G_{th} = G_h$ and $G_{tw} = G_w$

    • $G_h = G_w = 14$ (similar to standard ViT)
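
A minimal PyTorch sketch of one GridAttn step for a single grid, assuming `nn.MultiheadAttention` for the cross attention (head count and module boundaries are illustrative):

```python
import torch.nn as nn

class GridAttention(nn.Module):
    """Merge the tokens of one grid into a single token: average-pool them into
    a query, cross-attend to the original tokens (K, V), and add a residual."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, grid_tokens):                  # (B, n, dim), n = tokens in the grid
        x_avg = grid_tokens.mean(dim=1, keepdim=True)           # (B, 1, dim) mean token
        fused, _ = self.attn(x_avg, grid_tokens, grid_tokens)   # Q = mean, K = V = grid tokens
        return x_avg + fused                                    # residual, one token per grid
```

In the full ATM, the fused token then passes through the FFN for channel fusion, and the same weights are reused across all merging iterations.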

3.3. Fuzzy Positional Encoding

  • Learnable positional encoding + sin-cos positional encoding
    => highly sensitive to changes in input resolution (fail to provide effective resolution adaptability)

  • Conv-based PE -> better resolution robustness

    • relies on the perception of adjacent tokens -> prevents its application in self-supervised learning with masked inputs

Advantages of FPE

  • Enhance the model's resolution robustness
  • No specific spatial structure like convolution -> self-supervised learning

FPE

Training

  • Randomly initialize a set of learnable PE.

    • Typical PE -> provide precise location information to the model
    • FPE supplies the model with fuzzy positional information
  • Positional information shifts within a certain range

  • $(i+s_1, j+s_2)$: positional coordinates generated by FPE

    • $(i, j)$: precise coordinates of the target token
    • $-0.5 \leq s_1, s_2 \leq 0.5$, sampled from uniform distributions
  • Add randomly generated coordinate offsets to the reference coordinates in the training stage

  • Perform grid sampling on the learnable positional embeddings -> resulting in the FPE

Inference

  • Use precise PEs
  • Change in the input image resolution -> interpolation on the learnable PEs.
    • The model may have already seen these interpolated PEs during training, thanks to FPE
      => robust to changes in input resolution (see the sketch below)
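
A sketch of FPE as a module, assuming a learnable 14×14 PE table sampled with `F.grid_sample`: during training each token's reference coordinate is jittered by offsets drawn from U(-0.5, 0.5); at inference the exact (interpolated) coordinates are used. The table size and normalization details are my assumptions, not the official code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuzzyPositionalEncoding(nn.Module):
    def __init__(self, dim, size=14):
        super().__init__()
        self.pos_table = nn.Parameter(torch.randn(1, dim, size, size) * 0.02)

    def forward(self, grid_h, grid_w):
        # Reference coordinates of each token on the output grid.
        ys = torch.arange(grid_h, dtype=torch.float32)
        xs = torch.arange(grid_w, dtype=torch.float32)
        yy, xx = torch.meshgrid(ys, xs, indexing="ij")
        if self.training:
            # Fuzzy part: shift each coordinate by s ~ U(-0.5, 0.5).
            yy = yy + torch.rand_like(yy) - 0.5
            xx = xx + torch.rand_like(xx) - 0.5
        # Normalize to [-1, 1] and sample the PE table (interpolating if sizes differ).
        grid_y = yy / max(grid_h - 1, 1) * 2 - 1
        grid_x = xx / max(grid_w - 1, 1) * 2 - 1
        grid = torch.stack([grid_x, grid_y], dim=-1).unsqueeze(0)        # (1, H, W, 2)
        pe = F.grid_sample(self.pos_table, grid, mode="bilinear", align_corners=True)
        return pe.flatten(2).transpose(1, 2)                             # (1, H*W, dim)
```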

3.4. Multi-Resolution Training

  • Multi-resolution training following ResFormer
  • ViTAR: high-resolution image with significantly lower computational demands

Training details

ResFormer

  • process each batch containing various resolutions
  • KL loss for inter-resolution supervision

ViTAR

  • processes each batch with consistent resolution
  • Basic cross-entropy loss for supervision (a training-loop sketch follows below)
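
A hypothetical training loop in the spirit described above: every batch is resized to a single resolution sampled from a preset list, and plain cross-entropy supervises the output (the resolution list and resizing step are illustrative):

```python
import random
import torch.nn.functional as F

RESOLUTIONS = [128, 160, 224, 320, 448, 640]        # illustrative candidate resolutions

def train_one_epoch(model, loader, optimizer, device):
    model.train()
    for images, labels in loader:                   # images at some base resolution
        r = random.choice(RESOLUTIONS)              # one resolution per batch (unlike ResFormer)
        images = F.interpolate(images, size=(r, r), mode="bilinear", align_corners=False)
        images, labels = images.to(device), labels.to(device)
        loss = F.cross_entropy(model(images), labels)   # basic CE loss, no inter-resolution KL
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```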

4. Experiments

  • Classification (ImageNet-1K)
  • Object detection (COCO)
  • Semantic segmentation (ADE20K)
  • Self-supervised (MAE)

4.1. Image classification

Setting

  • ImageNet-1K
  • Training strategy following DeiT without distillation loss
  • AdamW optimizer
  • Strong data augmentation and regularization
  • Layer decay

Results

  • Capable of running inference on high-resolution images

4.2. Object Detection

Setting

  • COCO
  • Experiment settings follow ResFormer and ViTDet
  • AdamW
  • Do not utilize multi-resolution training strategy
    • ATM iterates only once
    • ${H \over G_{th}} = {W \over G_{tw}} = 1$ => excellent performance
    • ${H \over G_{th}} = {W \over G_{tw}} = 2$ => effective

Results

  • Case 1

  • Case 2

4.3. Semantic Segmentation

Setting

  • ADE20K
  • UperNet with MMSegmentation
    • ${H \over G_{th}} = {W \over G_{tw}} = 1$ => excellent performance
    • ${H \over G_{th}} = {W \over G_{tw}} = 2$ => effective

Results

  • Case 1
  • Case 2

4.4. Compatibility with Self-Supervised Learning

Setting

  • FPE's advantage over ResFormer's convolution-based PE: compatible with self-supervised learning
  • MAE
  • Multi-resolution input strategy
  • Pretrain for 300 epochs, fine-tune for 100 epochs

Results

  • Better performance overall

Explanation for performance advantage

  • ATM enables the model to learn higher-quality tokens (information gain)
  • FPE as implicit data augmentation -> robust positional information

4.5. Ablation Study

Adaptive Token Merger

  • ATM vs AvgPool

Fuzzy Positional Encoding

Comparison

  • Absolute positional encoding (APE)
  • Conditional positional encoding (CPE)
  • Global-local positional encoding (GLPE in ResFormer)
  • Relative Positional Bias (RPB in SwinT)
  • FPE

MAE

  • only APE and FPE are compatible with MAE

Results

Training resolutions

5. Conclusion

  • ViTAR