MambaOut

진성현 · May 19, 2024

Title

  • MambaOut: Do We Really Need Mamba for Vision? (Yu & Wang, arXiv 2024)

Abstract

  • Mamba -> architecture with RNN-like token mixer of SSM
  • Mamba addresses the quadratic complexity of attention -> applied to vision tasks
  • Performance of Mamba for vision -> underwhelming when compared to convolutional and attention-based models
  • This paper conceptually concludes that Mamba is ideally suited for tasks with long-sequence and autoregressive characteristics.
  • Hypothesize that Mamba is not necessary for image classification
  • Detection & segmentation -> not autoregressive, but adhere to the long-sequence characteristic -> worthwhile to explore Mamba's potential.
  • MambaOut model to empirically verify the hypothesis
  • MambaOut: Stacked Mamba blocks without SSM
  • MambaOut model surpasses all visual Mamba models on ImageNet classification
  • For detection and segmentation, MambaOut cannot match the performance of sota visual Mamba models.

1 Introduction

  • Transformers and linear-complexity models
  • Mamba in vision
    • Vision Mamba
    • VMamba
    • LocalMamba
    • PlainMamba
  • Underwhelming performance compared with sota convolutional and attention-based models

    Do we really need Mamba for Vision?

The paper's analysis

  • Mamba is ideally suited for tasks with
    • Long-sequence
    • autoregressive

Two hypotheses

  • SSM is not necessary for image classification, since this task conforms to neither the long-sequence nor the autoregressive characteristic
  • SSM may be potentially beneficial for object detection & instance segmentation and semantic segmentation, since they follow the long-sequence characteristic, though they are not autoregressive.

MambaOut

  • Gated CNN blocks (Mamba without SSM)
  • Performs better than visual Mamba models on ImageNet classification => verifying Hypothesis 1.
  • Falls short of visual Mamba models on detection and segmentation => validating Hypothesis 2.

Contribution

  • Analyzing the RNN-like mechanism of SSM and conceptually concluding that Mamba is suited for tasks with long-sequence and autoregressive characteristics
  • Examining the characteristics of visual tasks and hypothesizing that SSM is unnecessary for image classification on ImageNet, while the potential of SSM for detection and segmentation tasks remains worth exploring.
  • Developing MambaOut based on Gated CNN blocks without SSM. MambaOut may readily serve as a natural baseline for future research on visual Mamba models.

2 Related work

  • Transformers and models designed to solve their quadratic scaling
  • SSM and Mamba for visual tasks
    • Vision Mamba - isotropic vision models akin to ViT
    • VMamba - hierarchical vision models similar to ResNet
    • LocalMamba - enhance visual Mamba with local inductive biases
    • PlainMamba - aims to enhance performance of isotropic Mamba models

3 Conceptual discussion

3.1 What tasks is Mamba suitable for?

  • Selective SSM (token mixer of Mamba)

  • Four input-dependent parameters $(\Delta, A, B, C)$

  • Transforms them to $(\bar{A}, \bar{B}, C)$

    • $\bar{A} = \exp(\Delta A)$
    • $\bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I) \cdot \Delta B$
  • Sequence-to-sequence transform of SSM (sketched below)

    • $h_t = \bar{A} h_{t-1} + \bar{B} x_t$
    • $y_t = C h_t$
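
To make the recurrence concrete, here is a minimal NumPy sketch, assuming a scalar state for a single channel (the real selective SSM uses vector states and per-timestep input-dependent $B, C$; names are mine):

```python
import numpy as np

def selective_ssm_scalar(x, A, B, C, delta):
    """Minimal sketch of the SSM recurrence above (scalar state, one channel).

    x:     (T,) input sequence
    A:     scalar state-transition parameter
    B, C:  scalar input/output projections (input-dependent in real Mamba)
    delta: (T,) input-dependent step sizes
    """
    h = 0.0
    y = np.zeros_like(x)
    for t in range(len(x)):
        A_bar = np.exp(delta[t] * A)                                   # A_bar = exp(ΔA)
        B_bar = (1.0 / (delta[t] * A)) * (A_bar - 1.0) * delta[t] * B  # B_bar = (ΔA)^{-1}(exp(ΔA)-I)·ΔB
        h = A_bar * h + B_bar * x[t]                                   # h_t = A_bar h_{t-1} + B_bar x_t
        y[t] = C * h                                                   # y_t = C h_t
    return y
```

Note how $h_t$ is a fixed-size summary: once the loop passes timestep $t$, earlier inputs survive only through $h$, which is exactly the lossy-memory limitation discussed next.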

  • Causal attention: stores all keys and values as its memory

  • RNN-like SSM: fixed-size, constant, and therefore lossy memory

  • Limitation of Mamba: $h_t$ can only access information from the previous and current timesteps

Token Mixing

  • Causal mode
    • $y_t = f(x_1, x_2, \cdots, x_t)$
  • Fully-visible mode
    • $y_t = f(x_1, x_2, \cdots, x_t, \cdots, x_T)$
    • $T$ is the total number of tokens
    • suitable for understanding tasks, where all inputs can be accessed by the model at once
  • Attention is in fully-visible mode by default + can easily turn into causal mode (causal masks); the two modes are contrasted in the sketch below
  • RNN-like models inherently operate in causal mode -> no fully-visible mode
    • Can approximate fully-visible mode using bidirectional branches, but each branch is individually causal.
      => Mamba is well-suited for tasks that require causal token mixing
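
A small PyTorch sketch of the two modes, using `F.scaled_dot_product_attention`, which exposes causal masking via its `is_causal` flag (shapes and sizes are illustrative):

```python
import torch
import torch.nn.functional as F

T, D = 8, 16                                   # tokens, head dim
q = k = v = torch.randn(1, 1, T, D)            # (batch, heads, tokens, dim)

# Fully-visible mode: every y_t mixes all T tokens.
y_full = F.scaled_dot_product_attention(q, k, v)

# Causal mode: y_t only mixes x_1..x_t, like an RNN/SSM.
y_causal = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Only the final token attends to the full sequence in causal mode,
# so only its output matches the fully-visible result.
print(torch.allclose(y_full[..., -1, :], y_causal[..., -1, :]))  # True
print(torch.allclose(y_full[..., 0, :], y_causal[..., 0, :]))    # False
```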

3.2 Do visual recognition tasks have very long sequences?

  • Consider a Transformer block with MLP ratio of 4

  • Input: $X \in \mathbb{R}^{L \times D}$

  • FLOPs: $24D^2L + 4DL^2$

  • Ratio of the quadratic term to the linear term

    • $r_L = \frac{L}{6D}$
      => if $L > 6D$, the computational load of the quadratic term in $L$ surpasses the linear term.
  • Threshold metric $\tau$ for long-sequence tasks (computed below)

  • ViT-S: 384 channels => $\tau_{small} = 6 \times 384 = 2304$

  • ViT-B: 768 channels => $\tau_{base} = 6 \times 768 = 4608$
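
A quick check of the arithmetic (function and variable names are mine; the threshold names $\tau$ follow the paper):

```python
def quadratic_to_linear_ratio(L, D):
    """r_L = (4*D*L**2) / (24*D**2*L) = L / (6*D).
    The quadratic term dominates once r_L > 1, i.e. L > 6*D."""
    return L / (6 * D)

tau_small = 6 * 384    # ViT-S, D=384 -> 2304 tokens
tau_base = 6 * 768     # ViT-B, D=768 -> 4608 tokens
print(tau_small, tau_base)                   # 2304 4608
print(quadratic_to_linear_ratio(2304, 384))  # 1.0: the break-even point
```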

On tasks

  • ImageNet: $224^2$ images => $14^2 = 196$ tokens with patch size $16^2$ => does not qualify as a long-sequence task
  • COCO ($800 \times 1280$) & ADE20K ($512 \times 2048$)
    => number of tokens: ~4K (see the counts below)
    => can be considered long-sequence
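
The token counts behind these numbers, assuming a standard $16 \times 16$ patch embedding (a sketch; the helper name is mine):

```python
def num_tokens(height, width, patch=16):
    """ViT-style token count for one image."""
    return (height // patch) * (width // patch)

print(num_tokens(224, 224))    # ImageNet: 196  << tau_small = 2304 -> not long-sequence
print(num_tokens(800, 1280))   # COCO inference size: 4000 > tau_small -> long-sequence
print(num_tokens(512, 2048))   # ADE20K inference size: 4096 > tau_small -> long-sequence
```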

3.3 Do visual recognition tasks need causal token mixing mode?

  • Visual recognition -> an understanding task where the model can see the entire image at once
  • Performance degradation of ViT in causal mode

4 Experimental verification

4.1 Gated CNN and MambaOut

  • Meta-architecture of Gated CNN and Mamba is identical (sketched below)
    • $X' = \text{Norm}(X)$
    • $Y = (\text{TokenMixer}(X'W_1) \odot \sigma(X'W_2))W_3 + X$
  • Token mixers of Gated CNN and Mamba
    • $\text{TokenMixer}_{GatedCNN}(Z) = \text{Conv}(Z)$
    • $\text{TokenMixer}_{Mamba}(Z) = \text{SSM}(\sigma(\text{Conv}(Z)))$
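
A minimal PyTorch sketch of the Gated CNN block as written above. The activation, expansion ratio, and use of a full (rather than partial-channel) depthwise conv are illustrative assumptions, not the exact MambaOut implementation:

```python
import torch
import torch.nn as nn

class GatedCNNBlock(nn.Module):
    """Y = (TokenMixer(X'W1) ⊙ σ(X'W2)) W3 + X, with TokenMixer = depthwise conv.
    Channel-last input of shape (B, H, W, D); sizes are illustrative."""
    def __init__(self, dim, expansion=2, kernel_size=7):
        super().__init__()
        hidden = dim * expansion
        self.norm = nn.LayerNorm(dim)           # Norm(X)
        self.fc1 = nn.Linear(dim, hidden)       # W1
        self.fc2 = nn.Linear(dim, hidden)       # W2 (gate branch)
        self.conv = nn.Conv2d(hidden, hidden, kernel_size,
                              padding=kernel_size // 2, groups=hidden)  # token mixer
        self.fc3 = nn.Linear(hidden, dim)       # W3
        self.act = nn.SiLU()                    # σ (activation choice is an assumption)

    def forward(self, x):                       # x: (B, H, W, D)
        shortcut = x
        x = self.norm(x)
        gate = self.act(self.fc2(x))            # σ(X'W2)
        z = self.fc1(x).permute(0, 3, 1, 2)     # (B, hidden, H, W) for Conv2d
        z = self.conv(z).permute(0, 2, 3, 1)    # TokenMixer(X'W1)
        return self.fc3(z * gate) + shortcut    # (… ⊙ gate) W3 + X

block = GatedCNNBlock(dim=96)
print(block(torch.randn(1, 14, 14, 96)).shape)  # torch.Size([1, 14, 14, 96])
```

Swapping the conv for $\text{SSM}(\sigma(\text{Conv}(\cdot)))$ would recover the Mamba block; MambaOut is exactly this architecture with the SSM left out.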

4.2 Image classification on ImageNet

Setup

  • Follows DeiT

Results

  • MambaOut models consistently outperform visual Mamba models across all model sizes
  • MambaOut-Small achieves 84.1% top-1 accuracy vs. 83.7% for LocalVMamba-S, while requiring only 79% of its MACs.
  • Visual Mamba models show a significant performance gap compared to sota convolutional and attention models.
    • CAFormer-M36 (conv + attn) outperforms all visual Mamba models by 1% in accuracy

4.3 Object detection & instance segmentation on COCO

Setup

  • MambaOut as backbone within Mask R-CNN

Results

  • MambaOut lags behind sota visual Mamba models => Mamba does have benefits in long-sequence visual tasks
  • Visual Mamba still exhibits a significant performance gap with sota conv-attn hybrid models such as TransNeXt.

4.4 Semantic segmentation on ADE20K

Setup

  • MambaOut as backbone for UperNet

Results

  • Trends similar to those on COCO

5 Conclusion

  • Future work
    • Further explore Mamba and RNN concepts
    • Integration of RNNs and Transformers for LLMs and LMMs

Side Notes


(??)
