Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks (Park et al, 2024)
Abstract
State-space models (SSMs), such as Mamba (Gu & Dao, 2023), have been proposed as alternatives to Transformer networks in language modeling, by incorporating gating, convolutions, and input-dependent token selection to mitigate the quadratic cost of multi-head attention. Although SSMs exhibit competitive performance, their in-context learning (ICL) capabilities, a remarkable emergent property of modern language models that enables task execution without parameter optimization, remain underexplored compared to Transformers. In this study, we evaluate the ICL performance of SSMs, focusing on Mamba, against Transformer models across various tasks. Our results show that SSMs perform comparably to Transformers in standard regression ICL tasks, while outperforming them in tasks like sparse parity learning. However, SSMs fall short in tasks involving non-standard retrieval functionality. To address these limitations, we introduce a hybrid model, MambaFormer, that combines Mamba with attention blocks, surpassing individual models in tasks where they struggle independently. Our findings suggest that hybrid architectures offer promising avenues for enhancing ICL in language models.
1. Introduction
In-context learning (ICL)

- Figure from "What Can Transformers Learn In-Context? A Case Study of Simple Function Classes (Garg et al, 2022)"
Transformer language models -> currently the only large models that are capable of ICL in practice.
Can attention-free models perform ICL?
ICL study
- ICL capabilities usually emerge(?) at scales beyond 3 billion parameters
- Testing this hypothesis usually requires models of 7B parameters or more.
Small-scale ICL capabilities
- specifically training a model to perform in-context learning, following Garg et al (2022)
- most SSMs match the performance of Transformers across multiple tasks
- Mamba shows limitations in learning decision trees and retrieval tasks
- Mamba outperforms Transformers in other complex tasks like sparse parity

- MambaFormer: interleaving SSM (Mamba) blocks with MHA blocks
- Leverages the strengths of both Mamba and Transformers (good at both sparse parity and retrieval); see the sketch below
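A rough PyTorch sketch of this interleaving idea. The MambaBlockPlaceholder is a stand-in (a simple gated residual block) rather than the real Mamba/S6 layer, and the block ordering and sizes are illustrative assumptions, not the paper's exact MambaFormer definition:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaBlockPlaceholder(nn.Module):
    """Stand-in for a real Mamba (S6) block; a simple gated residual block so the
    sketch runs without external packages."""
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                               # x: (batch, seq, d_model)
        h, gate = self.in_proj(self.norm(x)).chunk(2, dim=-1)
        return x + self.out_proj(F.silu(gate) * h)

class AttentionBlock(nn.Module):
    """Pre-norm causal multi-head self-attention block with a residual connection."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        seq = x.size(1)
        causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=x.device), 1)
        out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        return x + out

class HybridModel(nn.Module):
    """Interleave SSM-style blocks with attention blocks, starting with an SSM block
    (so no positional encoding is added)."""
    def __init__(self, d_model=64, n_heads=4, n_layers=4):
        super().__init__()
        blocks = [MambaBlockPlaceholder(d_model)]
        for _ in range(n_layers):
            blocks += [AttentionBlock(d_model, n_heads), MambaBlockPlaceholder(d_model)]
        self.blocks = nn.Sequential(*blocks)

    def forward(self, x):                               # x: (batch, seq, d_model)
        return self.blocks(x)

model = HybridModel()
print(model(torch.randn(2, 16, 64)).shape)              # torch.Size([2, 16, 64])
```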



- Figures from "What Can Transformers Learn In-Context? A Case Study of Simple Function Classes (Garg et al, 2022)"
2. Sub-quadratic architectures
S4
- family of sequence models with the discretized state-space recurrence $h_t = \bar{A} h_{t-1} + \bar{B} x_t,\; y_t = C h_t$
Mamba
- selection mechanism in $\bar{A}, \bar{B}, C$, making them dependent on the input $x_t$
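A minimal NumPy sketch of the recurrence above, contrasting a time-invariant (S4-style) scan with a toy input-dependent (Mamba-style) selection. The per-step Python loop and the way parameters are made input-dependent are illustrative only, not the paper's hardware-aware S6 implementation:

```python
import numpy as np

def s4_scan(x, A_bar, B_bar, C):
    """Time-invariant SSM: h_t = A_bar h_{t-1} + B_bar x_t,  y_t = C h_t."""
    h = np.zeros(A_bar.shape[0])
    ys = []
    for x_t in x:
        h = A_bar @ h + B_bar @ x_t      # state update
        ys.append(C @ h)                 # readout
    return np.stack(ys)

def selective_scan(x, A_bar_fn, B_bar_fn, C_fn):
    """Mamba-style selection: A_bar, B_bar, C are functions of the current input x_t."""
    h = None
    ys = []
    for x_t in x:
        A_bar, B_bar, C = A_bar_fn(x_t), B_bar_fn(x_t), C_fn(x_t)
        if h is None:
            h = np.zeros(A_bar.shape[0])
        h = A_bar @ h + B_bar @ x_t
        ys.append(C @ h)
    return np.stack(ys)

# Tiny usage example with random parameters (state size n=4, input/output dim d=3).
rng = np.random.default_rng(0)
n, d, T = 4, 3, 5
x = rng.standard_normal((T, d))
A, B, C = 0.9 * np.eye(n), rng.standard_normal((n, d)), rng.standard_normal((d, n))
print(s4_scan(x, A, B, C).shape)                         # (T, d)
print(selective_scan(x, lambda z: A * np.tanh(z[0]),     # toy input dependence
                     lambda z: B, lambda z: C).shape)    # (T, d)
```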
3. Experimental Setup
- Trained each model from scratch

3.1 Model Training in In-context Learning
- Train models to learn specific function classes $\mathcal{F}$ in-context
- Training Step
- Select a function $f \in \mathcal{F}$ from distribution $D_{\mathcal{F}}$
- Sample a sequence of random inputs $x_1, \cdots, x_N \in \mathbb{R}^d$ i.i.d. from $D_{\mathcal{X}}$
  ($N$: number of in-context examples, $d$: dimension of $x_i$)
- Prompt $P = (x_1, f(x_1), \cdots, x_N, f(x_N))$ from steps 1 & 2
- Train model $f_\theta$ with
  $\min_\theta \, \mathbb{E}_P \left[ \frac{1}{N} \sum_{i=1}^{N-1} \ell\big(f_\theta(P^i), f(x_{i+1})\big) \right]$,
  where $P^i := (x_1, f(x_1), \cdots, x_i, f(x_i), x_{i+1})$. For $f: \mathbb{R}^d \to \mathbb{R}$, append $d-1$ zeros to $f(x)$.
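To make the setup concrete, here is a minimal NumPy sketch of the prompt construction and objective above, using linear regression as the function class and a trivial placeholder model; the names and sizes are illustrative, not the paper's training code:

```python
import numpy as np

d, N = 20, 40                          # input dimension and number of in-context examples

def sample_linear_task(rng):
    """Sample f(x) = w^T x with w ~ N(0, I_d)."""
    w = rng.standard_normal(d)
    return lambda x: w @ x

def build_prompt(f, rng):
    """Prompt P = (x_1, f(x_1), ..., x_N, f(x_N)); scalar f(x) padded with d-1 zeros."""
    xs = rng.standard_normal((N, d))
    ys = np.array([f(x) for x in xs])
    y_tokens = np.zeros((N, d))
    y_tokens[:, 0] = ys
    prompt = np.empty((2 * N, d))
    prompt[0::2] = xs                  # interleave x_i and f(x_i) tokens
    prompt[1::2] = y_tokens
    return prompt, ys

def icl_loss(model, prompt, ys):
    """Mean squared error of predicting f(x_{i+1}) from each prefix P^i."""
    losses = []
    for i in range(N - 1):
        prefix = prompt[: 2 * i + 3]   # P^i = (x_1, f(x_1), ..., x_i, f(x_i), x_{i+1})
        pred = model(prefix)           # the sequence model's prediction for f(x_{i+1})
        losses.append((pred - ys[i + 1]) ** 2)
    return float(np.mean(losses))

rng = np.random.default_rng(0)
f = sample_linear_task(rng)
prompt, ys = build_prompt(f, rng)
print(icl_loss(lambda p: 0.0, prompt, ys))   # trivial model that always predicts 0
```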
Model architectures
- Mamba (state-of-the-art SSM)
- S4 (linear time-invariant counterpart to Mamba)
- S4-Mamba (Mamba's S6 replaced with S4)

- Figure from Mamba (Gu & Dao, 2023)
Model Training & Evaluation
- 500,000 iterations of training
- 1,280 prompts for evaluation, sampled from $D_{\mathcal{F}}, D_{\mathcal{X}}$ consistent with training
3.2 ICL tasks
3.2.1 Learning regression
In-context examples $x_i$: sampled from the Gaussian distribution $\mathcal{N}(0, I_d)$
Loss: squared error loss
- Linear regression: $\mathcal{F} = \{f \mid f(x) = w^\top x,\ w \in \mathbb{R}^d\}$, $w$ sampled from $\mathcal{N}(0, I_d)$.
- Sparse linear regression: identical to linear regression, except only $k$ randomly chosen coordinates of $w$ are used (the rest are set to 0)
- Two-layer neural network: $\mathcal{F} = \{f \mid f(x) = W^{(2)} \sigma(W^{(1)} x),\ W^{(2)} \in \mathbb{R}^{1 \times h},\ W^{(1)} \in \mathbb{R}^{h \times d}\}$, where $\sigma$ is ReLU.
- Decision tree: full binary tree with a fixed depth and input $x \in \mathbb{R}^d$
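A minimal NumPy sketch of how these function classes can be sampled; the values of d, h, k are illustrative rather than the paper's settings, and the decision-tree class is omitted for brevity:

```python
import numpy as np

d, h, k = 20, 100, 3        # input dim, hidden width, sparsity (illustrative values)

def sample_linear(rng):
    w = rng.standard_normal(d)
    return lambda x: w @ x

def sample_sparse_linear(rng):
    w = rng.standard_normal(d)
    mask = np.zeros(d)
    mask[rng.choice(d, size=k, replace=False)] = 1.0   # keep only k coordinates of w
    w = w * mask
    return lambda x: w @ x

def sample_two_layer_nn(rng):
    W1 = rng.standard_normal((h, d))
    W2 = rng.standard_normal((1, h))
    return lambda x: float(W2 @ np.maximum(W1 @ x, 0.0))  # ReLU hidden layer

rng = np.random.default_rng(0)
x = rng.standard_normal(d)             # in-context input x ~ N(0, I_d)
for sampler in (sample_linear, sample_sparse_linear, sample_two_layer_nn):
    f = sampler(rng)
    print(sampler.__name__, f(x))
```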
3.2.2 Learning with outliers
Each pair $(x_i, f(x_i))$ is replaced with "dummy" vectors with a fixed probability $p$
The loss is not computed for the replaced outliers during training
- Orthogonal-outlier regression
- Many-outlier regression: $x$ and $f(x)$ are randomly replaced with a $\{1\}^d$ (all-ones) vector and a one-hot vector, respectively, with 90% probability
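A sketch of the dummy-replacement and loss-masking scheme, assuming (for illustration) an all-ones dummy input and a one-hot dummy output token:

```python
import numpy as np

d, N, p = 20, 40, 0.5                  # dims / examples / outlier probability (illustrative)
rng = np.random.default_rng(0)

def add_outliers(xs, ys, p, rng):
    """Replace each (x_i, f(x_i)) pair with dummy vectors with probability p.

    Returns the corrupted tokens plus a mask marking positions where the loss
    is still computed (the untouched, non-outlier pairs)."""
    is_outlier = rng.random(len(xs)) < p
    xs = xs.copy()
    y_tokens = np.zeros((len(ys), d))
    y_tokens[:, 0] = ys
    dummy_x = np.ones(d)               # assumed all-ones dummy input
    dummy_y = np.eye(d)[0]             # assumed one-hot dummy output token
    xs[is_outlier] = dummy_x
    y_tokens[is_outlier] = dummy_y
    return xs, y_tokens, ~is_outlier   # loss mask: True where the loss is computed

xs = rng.standard_normal((N, d))
ys = xs @ rng.standard_normal(d)
xs_c, y_tok, loss_mask = add_outliers(xs, ys, p, rng)
print(loss_mask.sum(), "of", N, "pairs contribute to the loss")
```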
3.2.3 Learning discrete functions
- Sparse parity: $x_i$ sampled uniformly at random from $\{-1, 1\}^d$, $\mathcal{F} = \{f \mid f(x) = \prod_{j \in S} x[j],\ S \subset \{1, \dots, d\},\ |S| = k\}$; uses cross-entropy loss
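A short sketch of how sparse-parity tasks and inputs can be generated (sizes are illustrative):

```python
import numpy as np

d, k, N = 10, 2, 64                    # ambient dim, parity size, #examples (illustrative)
rng = np.random.default_rng(0)

def sample_parity_task(rng):
    """Pick a random size-k subset S; f(x) = prod_{j in S} x[j]."""
    S = rng.choice(d, size=k, replace=False)
    return lambda x: int(np.prod(x[S]))

f = sample_parity_task(rng)
xs = rng.choice([-1, 1], size=(N, d))  # inputs uniform over {-1, 1}^d
labels = np.array([f(x) for x in xs])  # each label is +1 or -1
print(labels[:10])
```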
3.2.4 Learning Chain-of-Thought
- Chain-of-Thought-I/O: $\mathcal{F} = \{f \mid f(x) = W^{(2)} \sigma(W^{(1)} x),\ W^{(2)} \in \mathbb{R}^{1 \times h},\ W^{(1)} \in \mathbb{R}^{h \times d}\}$; interleaves the intermediate hidden feature $s_i = \sigma(W^{(1)} x_i)$ to create the input sequence $(x_1, s_1, f(x_1), \cdots, x_N, s_N, f(x_N), x_{\text{test}})$
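A sketch of how the interleaved CoT-I/O prompt can be assembled; padding every token to a common width is an assumption made here so the sequence stacks into one array:

```python
import numpy as np

d, h, N = 20, 16, 8                    # illustrative sizes
rng = np.random.default_rng(0)
W1 = rng.standard_normal((h, d))
W2 = rng.standard_normal((1, h))

def pad(v, width):
    """Right-pad a vector with zeros so every token has the same width."""
    out = np.zeros(width)
    out[: v.size] = v
    return out

width = max(d, h)
tokens = []
for _ in range(N):
    x = rng.standard_normal(d)
    s = np.maximum(W1 @ x, 0.0)        # intermediate hidden feature s_i = ReLU(W1 x_i)
    y = W2 @ s                         # final output f(x_i)
    tokens += [pad(x, width), pad(s, width), pad(y, width)]
x_test = rng.standard_normal(d)
tokens.append(pad(x_test, width))      # (x_1, s_1, f(x_1), ..., x_N, s_N, f(x_N), x_test)
prompt = np.stack(tokens)
print(prompt.shape)                    # (3N + 1, max(d, h))
```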
3.2.5 Learning Retrieval
- Vector-valued multi-query associative recall (MQAR)
A model's associative recall ability is highly related to its ICL abilities
The model is given key-value pairs of vectors $\{k_1, v_1, \cdots, k_n, v_n\}$
Queries: $\{q_1, \cdots, q_m\}$, each sampled from the key set
For each query $q_j$, the model must output the value $v_l$ associated with the matching key $k_l$.
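A minimal sketch of constructing a vector-valued MQAR instance (key-value pairs followed by queries, with the associated values as targets); dimensions are illustrative:

```python
import numpy as np

d, n, m = 16, 8, 4                     # vector dim, #key-value pairs, #queries (illustrative)
rng = np.random.default_rng(0)

keys = rng.standard_normal((n, d))
values = rng.standard_normal((n, d))
query_idx = rng.choice(n, size=m, replace=False)

# Input sequence: the key-value pairs followed by the queries.
kv_tokens = np.empty((2 * n, d))
kv_tokens[0::2] = keys
kv_tokens[1::2] = values
queries = keys[query_idx]
sequence = np.concatenate([kv_tokens, queries])

# Targets: for each query q_j (a copy of key k_l), recall the associated value v_l.
targets = values[query_idx]
print(sequence.shape, targets.shape)   # (2n + m, d), (m, d)
```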
4. Experimental results
4.1 Mamba can in-context learn!

- Tasks of particular interest: decision tree, sparse parity, Chain-of-Thought
- Sparse parity

Filtering outliers in regression

Chain of Thought
- Mamba models excel over Transformers at smaller model sizes

MQAR

5. The Advantages of Hybrid Architectures for In-context Learning

5.1 Simultaneously learning parities and retrieval

6. Discussion
- SSMs are capable in-context learners
- Neither SSMs nor Transformers are great at all tasks.
- Hybrid architecture MambaFormer achieves best-of-both-worlds performance
Future research directions
- How performance on artificial ICL tasks correlates with general language modeling capabilities, such as perplexity on standard NLP benchmarks.
- The potential for developing more effective architectures by integrating elements from transformers, SSMs, and gating mechanisms.
- Identifying architectural features that contribute to effective in-context learning
- Assessing the impact of MambaFormer and other innovative architectures on language modeling performance.