MambaOut

진성현 · May 19, 2024

paper_reviews


Title

  • MambaOut: Do We Really Need Mamba for Vision? (Yu & Wang, arXiv 2024)

Abstract

  • Mamba -> architecture with an RNN-like token mixer, the state space model (SSM)
  • Mamba addresses the quadratic complexity of attention -> applied to vision tasks
  • Performance of Mamba for vision -> underwhelming compared to convolutional and attention-based models
  • This paper conceptually concludes that Mamba is ideally suited for tasks with long-sequence and autoregressive characteristics.
  • Hypothesizes that Mamba is not necessary for image classification
  • Detection & segmentation -> not autoregressive, but adhere to the long-sequence characteristic -> worthwhile to explore Mamba's potential
  • MambaOut model to empirically verify the hypotheses
  • MambaOut: stacked Mamba blocks with the SSM removed
  • MambaOut surpasses all visual Mamba models on ImageNet classification
  • On detection and segmentation, MambaOut cannot match the performance of sota visual Mamba models.

1 Introduction

  • Transformers and linear-complexity models
  • Mamba in vision
    • Vision Mamba
    • VMamba
    • LocalMamba
    • PlainMamba
  • Underwhelming performance compared with sota convolutional and attention-based models

    Do we really need Mamba for Vision?

The paper's approach

  • Mamba is ideally suited for tasks with
    • long-sequence characteristics
    • autoregressive characteristics

Two hypotheses

  • SSM is not necessary for image classification, since this task conforms to neither the long-sequence nor the autoregressive characteristic
  • SSM may be potentially beneficial for object detection & instance segmentation and semantic segmentation, since they follow the long-sequence characteristic, though they are not autoregressive.

MambaOut

  • Gated CNN blocks (Mamba blocks without the SSM)
  • Performs better than visual Mamba models on ImageNet classification => verifying Hypothesis 1
  • Falls short on detection and segmentation => validating Hypothesis 2

Contribution

  • Analyzing the RNN-like mechanism of SSM and conceptually concluding that Mamba is suited for tasks with long-sequence and autoregressive characteristics
  • Examining the characteristics of visual tasks and hypothesizing that SSM is unnecessary for image classification on ImageNet, while the potential of SSM for detection and segmentation tasks remains worth exploring
  • Developing MambaOut based on Gated CNN blocks without SSM. MambaOut may readily serve as a natural baseline for future research on visual Mamba models.

2 Related work

  • Transformer and models to solve its quadratic scaling
  • SSM and Mamba for visual tasks
    • Vision Mamba - isotropic vision models akin to ViT
    • VMamba - hierarchical vision models similar to ResNet
    • LocalMamba - enhances visual Mamba with local inductive biases
    • PlainMamba - aims to enhance the performance of isotropic Mamba models

3 Conceptual discussion

3.1 What tasks is Mamba suitable for?

  • Selective SSM (the token mixer of Mamba)

  • Four input-dependent parameters $(\Delta, A, B, C)$

  • Discretization transforms them into $(\bar{A}, \bar{B}, C)$

    • $\bar{A} = \exp(\Delta A)$
    • $\bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I) \cdot \Delta B$
  • Sequence-to-sequence transform of the SSM (see the sketch below)

    • $h_t = \bar{A} h_{t-1} + \bar{B} x_t$
    • $y_t = C h_t$

  • Causal attention: stores all keys and values as its memory

  • RNN-like SSM: fixed-size, constant, lossy memory

  • Limitation of Mamba: $h_t$ can only access information from the previous and current timesteps
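A minimal NumPy sketch of the recurrence above makes the fixed-memory point concrete. The diagonal-$A$ simplification and all names here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def selective_ssm(x, A, B, C, delta):
    """Sketch of the selective-SSM recurrence for one channel.

    x:     (T,)   input sequence
    A:     (N,)   state dynamics (assumed diagonal and nonzero, so elementwise ops suffice)
    B, C:  (T, N) input-dependent projections, one row per timestep
    delta: (T,)   input-dependent step sizes
    """
    T, N = B.shape
    h = np.zeros(N)                          # fixed-size hidden state: the lossy memory
    y = np.zeros(T)
    for t in range(T):
        A_bar = np.exp(delta[t] * A)         # A_bar = exp(Delta A)
        B_bar = (A_bar - 1.0) / A * B[t]     # (Delta A)^{-1}(exp(Delta A) - I) * Delta B, diagonal case
        h = A_bar * h + B_bar * x[t]         # h_t depends only on h_{t-1} and x_t
        y[t] = C[t] @ h                      # y_t = C h_t
    return y
```

Unlike attention, the memory footprint is a constant $N$ regardless of sequence length, which is exactly why the state is lossy.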

Token Mixing

  • Causal mode
    • $y_t = f(x_1, x_2, \cdots, x_t)$
  • Fully-visible mode
    • $y_t = f(x_1, x_2, \cdots, x_t, \cdots, x_T)$
    • $T$ is the total number of tokens
    • Suitable for understanding tasks, where all inputs can be accessed by the model at once
  • Attention is in fully-visible mode by default + can easily turn into causal mode (causal masks); see the toy example below
  • RNN-like models inherently operate in causal mode -> no fully-visible mode
    • Can approximate fully-visible mode using bidirectional branches, but each branch is individually causal
      => Mamba is well-suited for tasks that require causal token mixing
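A quick way to see the two modes is through attention masks; this toy PyTorch snippet only illustrates the two definitions above:

```python
import torch

T = 4
scores = torch.randn(T, T)   # raw attention logits among T tokens

# Fully-visible mode: y_t = f(x_1, ..., x_T); every token attends to all tokens.
full = scores.softmax(dim=-1)

# Causal mode: y_t = f(x_1, ..., x_t); mask out future positions.
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
causal = scores.masked_fill(mask, float("-inf")).softmax(dim=-1)

print(full[0])    # the first token already mixes information from all T tokens
print(causal[0])  # the first token can only see itself
```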

3.2 Do visual recognition tasks have very long sequences?

  • Consider a Transformer block with an MLP ratio of 4

  • Input: $X \in \mathbb{R}^{L \times D}$ ($L$ tokens, $D$ channels)

  • FLOPs: $24D^2L + 4DL^2$

  • Ratio of the quadratic term to the linear term

    • $r_L = \frac{L}{6D}$
      => if $L > 6D$, the computational load of the quadratic term in $L$ surpasses the linear term
  • This gives a threshold $\tau = 6D$ as a metric for long-sequence tasks

  • ViT-S: 384 channels => $\tau_{small} = 6 \times 384 = 2304$

  • ViT-B: 768 channels => $\tau_{base} = 6 \times 768 = 4608$

On tasks

  • ImageNet: $224^2$ images => $14^2 = 196$ tokens with patch size $16^2$ => does not qualify as a long-sequence task
  • COCO & ADE20K (inference sizes $800 \times 1280$ and $512 \times 2048$)
    => number of tokens: ~4K, exceeding $\tau_{small}$
    => can be considered long-sequence tasks (see the check below)
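The token counts and thresholds above are simple arithmetic; a sketch to reproduce them (patch size $16^2$ assumed throughout):

```python
def is_long_sequence(num_tokens, channels):
    # The paper's rule of thumb: the quadratic term dominates when L > tau = 6 * D.
    return num_tokens > 6 * channels

patch = 16
imagenet = (224 // patch) ** 2             # 14^2 = 196 tokens
coco = (800 // patch) * (1280 // patch)    # 50 * 80 = 4000 tokens
ade20k = (512 // patch) * (2048 // patch)  # 32 * 128 = 4096 tokens

for name, L in [("ImageNet", imagenet), ("COCO", coco), ("ADE20K", ade20k)]:
    print(name, L, is_long_sequence(L, 384), is_long_sequence(L, 768))
# ImageNet: 196 << tau_small = 2304 -> not long-sequence
# COCO/ADE20K: ~4K > tau_small (and close to tau_base = 4608) -> long-sequence
```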

3.3 Do visual recognition tasks need causal token mixing mode?

  • Visual recognition -> understanding task where the model can see the entire image at once
  • ViT shows performance degradation in causal mode

4 Experimental verification

4.1 Gated CNN and MambaOut

  • The meta-architecture of the Gated CNN and Mamba blocks is identical (sketched below)
    • $X' = \mathrm{Norm}(X)$
    • $Y = (\mathrm{TokenMixer}(X'W_1) \odot \sigma(X'W_2))\,W_3 + X$
  • Only the token mixers differ
    • $\mathrm{TokenMixer}_{GatedCNN}(Z) = \mathrm{Conv}(Z)$
    • $\mathrm{TokenMixer}_{Mamba}(Z) = \mathrm{SSM}(\sigma(\mathrm{Conv}(Z)))$
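A minimal PyTorch sketch of the shared meta-architecture with the Gated CNN token mixer. The hidden width, the 1-D depthwise conv, and the SiLU choice for $\sigma$ are illustrative assumptions (the paper operates on 2-D token maps):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedCNNBlock(nn.Module):
    """Y = (TokenMixer(X' W1) ⊙ σ(X' W2)) W3 + X, with TokenMixer = Conv."""

    def __init__(self, dim, hidden=None, kernel_size=7):
        super().__init__()
        hidden = hidden or 2 * dim           # expansion ratio is an assumption
        self.norm = nn.LayerNorm(dim)
        self.w1 = nn.Linear(dim, hidden)
        self.w2 = nn.Linear(dim, hidden)
        self.w3 = nn.Linear(hidden, dim)
        # Depthwise conv over the token axis plays the role of TokenMixer.
        self.conv = nn.Conv1d(hidden, hidden, kernel_size,
                              padding=kernel_size // 2, groups=hidden)

    def forward(self, x):                    # x: (batch, tokens, dim)
        shortcut = x
        x = self.norm(x)                     # X' = Norm(X)
        mixed = self.conv(self.w1(x).transpose(1, 2)).transpose(1, 2)
        gate = F.silu(self.w2(x))            # σ taken as SiLU, as in Mamba
        return self.w3(mixed * gate) + shortcut   # gated branch + residual
```

Swapping `self.conv(...)` for `SSM(σ(Conv(...)))` would recover the Mamba block, which is the entire difference the paper ablates.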

4.2 Image classification on ImageNet

Setup

  • Follows DeiT

Results

  • MambaOut models consistently outperform visual Mamba models across all model sizes
  • MambaOut-Small achieves 84.1% top-1 accuracy vs. LocalVMamba-S's 83.7%, while requiring only 79% of its MACs
  • Visual Mamba models show a significant performance gap compared to sota convolutional and attention models
    • CAFormer-M36 (conv + attn) outperforms all visual Mamba models by 1% in accuracy

4.3 Object detection & instance segmentation on COCO

Setup

  • MambaOut as backbone within Mask R-CNN

Results

  • MambaOut lags behind sota visual Mamba models => Mamba does have benefits for long-sequence visual tasks
  • Visual Mamba still exhibits a significant performance gap with sota conv-attn hybrid models such as TransNeXt

4.4 Semantic segmentation on ADE20K

Setup

  • MambaOut as backbone for UperNet

Results

  • Same trends as on COCO

5 Conclusion

  • Future works
    • Further explore the Mamba and RNN concepts
    • Integration of RNNs and Transformers for LLMs and LMMs

