Since the introduction of Mamba, many studies have tried to apply SSMs to a variety of domains, and Vision Mamba and VMamba were posted to arXiv just one day apart.
Vision Mamba uses a bidirectional Mamba, while VMamba uses a Cross-Scan Module, each aiming to replace ViT as a general-purpose vision backbone.
In this review, I compare Vision Mamba and VMamba, and also compare them with S4ND, a vision model built on the previous-generation S4 model.
Mamba SSM
- figure from recent paper 'MambaByte: Token-free Selective State Space Model'
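For reference, the discretized selective-SSM recurrence that both papers build on (the standard S4/Mamba formulation with step size $\Delta$) is:

$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\big(\exp(\Delta A) - I\big)\,\Delta B$$

$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t$$

In Mamba, $B$, $C$, and $\Delta$ are made input-dependent (the "selective" part), which rules out S4's global-convolution formulation and motivates the hardware-aware parallel scan.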
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
2024.01.18
Code
https://github.com/hustvl/Vim
Abstract
Recently the state space models (SSMs) with efficient hardware-aware designs, i.e., Mamba, have shown great potential for long sequence modeling. Building efficient and generic vision backbones purely upon SSMs is an appealing direction. However, representing visual data is challenging for SSMs due to the position-sensitivity of visual data and the requirement of global context for visual understanding.
In this paper, we show that the reliance of visual representation learning on self-attention is not necessary and propose a new generic vision backbone with bidirectional Mamba blocks (Vim), which marks the image sequences with position embeddings and compresses the visual representation with bidirectional state space models. On ImageNet classification, COCO object detection, and ADE20k semantic segmentation tasks, Vim achieves higher performance compared to well-established vision transformers like DeiT,
while also demonstrating significantly improved computation & memory efficiency. For example, Vim is 2.8× faster than DeiT and saves 86.8% GPU memory when performing batch inference to extract features on images with a resolution of 1248×1248. The results demonstrate that Vim is capable of overcoming the computation & memory constraints on performing Transformer-style understanding for high-resolution images and it has great potential to become the next-generation backbone for vision foundation models.
Backgrounds
Pure SSM based vision model
- A generic pure-SSM-based backbone has not been explored for vision tasks.
ViT vs CNNs
- ViT can provide each patch with global context through self-attention, while CNNs apply the same convolutional filters at all positions.
- ViTs enable modality-agnostic modeling by treating an image as a sequence of patches without 2D inductive bias, which makes them the preferred architecture for multimodal applications.
Mamba
- Success in language modeling
- Two challenges in applying Mamba to vision: unidirectional modeling and lack of positional awareness
- Explore building a pure-SSM-based model as a generic vision backbone without using attention, while preserving the sequential, modality-agnostic modeling merit of ViT.
State space models for long sequence modeling
- Structured State-Space Sequence (S4) model
- S5 (MIMO SSM and efficient parallel scan)
- H3
- Gated State Space layer on S4
- Mamba
State Space models for visual applications
- ViS4mer (long-range temporal dependencies for video classification)
- S4ND (multi-dimensional data handling-2D, 3D)
- TranS4mer (state-of-the-art performance for movie scene detection)
- Selective S4 (S5 model: selectivity mechanism for long-form video understanding)
- U-Mamba (hybrid CNN-SSM for biomedical image segmentation)
=> All of them are hybrid or domain specific
Main Contributions
Vision Mamba(Vim)
- bidirectional SSM
- data-dependent global visual context modeling
- position embeddings for location-aware visual understanding
vs ViT
- Same modeling power without attention
- subquadratic-time computation & linear memory complexity
- 2.8× faster than DeiT, saving 86.8% GPU memory when performing batch inference to extract features on images at resolution 1248 × 1248
Vision Mamba
The 2-D image $t \in \mathbb{R}^{H \times W \times C}$ is split into flattened patches $x_p \in \mathbb{R}^{J \times (P^2 \cdot C)}$, which are linearly projected to dimension $D$ and combined with a class token and position embeddings:

$$\mathbf{T}_0 = [\,\mathbf{t}_{cls};\ \mathbf{t}_p^{1}\mathbf{W};\ \mathbf{t}_p^{2}\mathbf{W};\ \cdots;\ \mathbf{t}_p^{J}\mathbf{W}\,] + \mathbf{E}_{pos}$$

where $\mathbf{W} \in \mathbb{R}^{(P^2 \cdot C) \times D}$ is the learnable projection matrix and $\mathbf{E}_{pos} \in \mathbb{R}^{(J+1) \times D}$ is the position embedding.
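A minimal PyTorch sketch of this patch-embedding step (module and parameter names such as `PatchEmbed` and `embed_dim` are my own, not taken from the Vim codebase):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into P x P patches, project them to D dims (t_p^j W),
    then prepend a class token and add learned position embeddings (E_pos)."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=192):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2                  # J
        # A stride-P conv is equivalent to flattening P*P*C patches and applying W
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))                # t_cls
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))  # E_pos

    def forward(self, t):                        # t: (B, C, H, W)
        x = self.proj(t)                         # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)         # (B, J, D)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos_embed   # T_0: (B, J+1, D)
```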
Vim Block
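The block itself is essentially a Mamba layer run over the token sequence in both directions. Below is a heavily simplified sketch of just that bidirectional idea (the real Vim block also includes 1-D convolutions and a gating branch); `ssm_forward` and `ssm_backward` are hypothetical stand-ins for two independently parameterized selective-SSM scans:

```python
import torch.nn as nn

class BidirectionalSSMBlock(nn.Module):
    """Simplified view of a Vim block: normalize, scan the token sequence
    forward and backward with two separate SSMs, sum, and add a residual."""
    def __init__(self, dim, ssm_forward: nn.Module, ssm_backward: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ssm_forward = ssm_forward      # selective SSM scanning left-to-right
        self.ssm_backward = ssm_backward    # selective SSM scanning right-to-left

    def forward(self, tokens):              # tokens: (B, L, D)
        x = self.norm(tokens)
        y_fwd = self.ssm_forward(x)                      # forward scan
        y_bwd = self.ssm_backward(x.flip(1)).flip(1)     # backward scan, re-ordered
        return tokens + y_fwd + y_bwd                    # residual connection
```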
Experiment
Image Classification(ImageNet-1K)
- Shows strong results on ImageNet-1K, outperforming DeiT at comparable model sizes
Semantic Segmentation (ADE20K)
Object detection and instance segmentation
Ablation Study
- Using the naive (unidirectional) Mamba leads to the best result on image classification, but its segmentation performance is lower.
- Adding bidirectionality makes it work better at segmentation while achieving similar classification accuracy.
Questions
- The benchmark results cover only a small set of settings
- Only small model sizes are evaluated
VMamba: Visual State Space Model
2024.01.19
Code
https://github.com/MzeroMiko/VMamba
Abstract
Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) stand as the two most popular foundation models for visual representation learning. While CNNs exhibit remarkable scalability with linear complexity w.r.t. image resolution, ViTs surpass them in fitting capabilities despite contending with quadratic complexity. A closer inspection reveals that ViTs achieve superior visual modeling performance through the incorporation of global receptive fields and dynamic weights. This observation motivates us to propose a novel architecture that inherits these components while enhancing computational efficiency.
To this end, we draw inspiration from the recently introduced state space model and propose the Visual State Space Model (VMamba), which achieves linear complexity without sacrificing global receptive fields.
To address the encountered direction-sensitive issue, we introduce the Cross-Scan Module (CSM) to traverse the spatial domain and convert any non-causal visual image into ordered patch sequences.
Extensive experimental results substantiate that VMamba not only demonstrates promising capabilities across various visual perception tasks, but also exhibits more pronounced advantages over established benchmarks as the image resolution increases.
Introduction & Backgrounds
- Non-causal nature of visual data -> restricted receptive fields when patches are scanned causally
- Relationships with patches that have not yet been scanned cannot be estimated
=> the direction-sensitive problem -> addressed with the Cross-Scan Module (CSM)
CSM
- four-way scanning (from four corners all across the feature map to the opposite location)
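A minimal sketch of the cross-scan idea, assuming the four directions are simply the row-major, reversed row-major, column-major, and reversed column-major flattenings of the feature map (function names here are mine, not VMamba's):

```python
import torch

def cross_scan(x):
    """Unfold a 2-D feature map (B, C, H, W) into four 1-D patch sequences,
    one per scan direction."""
    B, C, H, W = x.shape
    row_major = x.flatten(2)                      # left-to-right, top-to-bottom
    col_major = x.transpose(2, 3).flatten(2)      # top-to-bottom, left-to-right
    return torch.stack(
        [row_major, row_major.flip(-1), col_major, col_major.flip(-1)],
        dim=1,                                    # (B, 4, C, H*W)
    )

def cross_merge(ys, H, W):
    """Invert the four scans and sum the per-direction outputs back onto the grid."""
    B, _, C, L = ys.shape
    rows = ys[:, 0] + ys[:, 1].flip(-1)           # undo the reversed row scan
    cols = (ys[:, 2] + ys[:, 3].flip(-1)).view(B, C, W, H).transpose(2, 3).flatten(2)
    return (rows + cols).view(B, C, H, W)
```

Each of the four sequences is then processed by its own selective SSM and the outputs are merged back onto the grid, which is the 2D-Selective-Scan (SS2D) operation described below.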
VMamba Model
2D-Selective-Scan
Model architecture
- Architecture of VMamba-Tiny
Steps of VMamba
- Partition the input image into patches without flattening them, preserving the 2D structure of the image
- Stage 1 -> stack of VSS blocks (spatial dimension unchanged)
- Stage 2 -> downsampling through a patch-merging operation (as in Swin Transformer v2) + stack of VSS blocks
- Stage 3 & Stage 4 -> downsampling + VSS blocks (overall pipeline sketched below)
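A rough sketch of that four-stage layout, using a strided convolution as a stand-in for the patch-merging operation and treating the VSS block constructor as given (the dims and depths below are illustrative, not the exact VMamba-Tiny configuration):

```python
import torch.nn as nn

def make_stage(dim, depth, block_factory, downsample=None):
    """One stage: optional 2x downsampling, then a stack of VSS blocks."""
    layers = [] if downsample is None else [downsample]
    layers += [block_factory(dim) for _ in range(depth)]
    return nn.Sequential(*layers)

class HierarchicalBackbone(nn.Module):
    """VMamba-style backbone: patch partition via a stride-4 conv (keeps the 2-D grid),
    stage 1 at full token resolution, stages 2-4 each preceded by downsampling."""
    def __init__(self, block_factory, dims=(96, 192, 384, 768), depths=(2, 2, 9, 2)):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dims[0], kernel_size=4, stride=4)  # no flattening
        stages = []
        for i, (dim, depth) in enumerate(zip(dims, depths)):
            down = None if i == 0 else nn.Conv2d(dims[i - 1], dim, kernel_size=2, stride=2)
            stages.append(make_stage(dim, depth, block_factory, down))
        self.stages = nn.ModuleList(stages)

    def forward(self, x):                  # x: (B, 3, H, W)
        x = self.patch_embed(x)            # (B, dims[0], H/4, W/4)
        for stage in self.stages:
            x = stage(x)                   # spatial resolution halves at stages 2-4
        return x
```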
VSS Block
- Flow: LN → conv → SS2D (CSM) → LN
- Refrains from using position-embedding bias in VMamba due to its causal nature
- Discards the MLP operation (cf. ViT's Norm → attention → Norm → MLP structure)
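Putting the pieces together, a simplified VSS block that follows the flow listed above (and omits the gating and linear-projection details of the actual implementation) could look like the sketch below; `ss2d` is assumed to be a module implementing the cross-scan / per-direction SSM / merge step:

```python
import torch.nn as nn

class SimplifiedVSSBlock(nn.Module):
    """Simplified VSS block: LN -> depth-wise conv -> SS2D (CSM) -> LN,
    wrapped in a residual connection, with no MLP sub-block and no position embedding."""
    def __init__(self, dim, ss2d: nn.Module):
        super().__init__()
        self.norm_in = nn.LayerNorm(dim)
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)  # depth-wise
        self.ss2d = ss2d                       # cross-scan -> per-direction SSM -> merge
        self.norm_out = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (B, C, H, W)
        y = self.norm_in(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)   # LN over channels
        y = self.dwconv(y)
        y = self.ss2d(y)                       # keeps the (B, C, H, W) shape
        y = self.norm_out(y.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        return x + y                           # residual; no MLP follows (unlike ViT)
```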
Experiments
Image Classification on ImageNet-1K
- The authors reported that there is a bug in the VMamba-B model
Object Detection on COCO
Semantic Segmentation on ADE20K
Analysis
Effective Receptive Field (ERF)
- Measures how significant each part of the model input is to its output
- Only DeiT (ViT) and VMamba exhibit global ERFs
- ViT "evenly activates all pixels using attention", whereas VMamba activates all pixels while notably emphasizing cross-shaped activations (due to CSM, the central pixel is most influenced by pixels along the cross)
- Before training, VMamba exhibits only a local ERF; training transforms the ERF into a global one (DeiT maintains nearly identical ERFs before and after training)
- Models trained on 224×224 images are evaluated on classification at various input resolutions
- VMamba shows the most stable performance across resolutions (and is the only model with an upward trend from 224 to 384)
S4ND: Modeling Images and Videos as Multidimensional Signals Using State Spaces