Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks (Park et al, 2024)
Abstract
State-space models (SSMs), such as Mamba (Gu & Dao, 2023), have been proposed as alternatives to Transformer networks in language modeling, by incorporating gating, convolutions, and input-dependent token selection to mitigate the quadratic cost of multi-head attention. Although SSMs exhibit competitive performance, their in-context learning (ICL) capabilities, a remarkable emergent property of modern language models that enables task execution without parameter optimization, remain underexplored compared to Transformers. In this study, we evaluate the ICL performance of SSMs, focusing on Mamba, against Transformer models across various tasks. Our results show that SSMs perform comparably to Transformers in standard regression ICL tasks, while outperforming them in tasks like sparse parity learning. However, SSMs fall short in tasks involving non-standard retrieval functionality. To address these limitations, we introduce a hybrid model, MambaFormer, that combines Mamba with attention blocks, surpassing individual models in tasks where they struggle independently. Our findings suggest that hybrid architectures offer promising avenues for enhancing ICL in language models.
1. Introduction
In-context learning (ICL)

- Figure from "What Can Transformers Learn In-Context? A Case Study of Simple Function Classes (Garg et al, 2022)"
Transformer language models -> currently the only large models that are capable of ICL in practice.
Can attention-free models perform ICL?
ICL study
- ICL capabilities usually emerge(?) at scales beyond 3 billion parameters
- Testing this hypothesis usually requires models of 7B parameters or more.
Small-scale ICL capabilities
- specifically training a model to perform in-context learning, following Garg et al (2022)
- most SSMs match the performance of Transformers across multiple tasks
- Mamba shows limitations in learning decision trees and retrieval tasks
- Mamba outperforms Transformers in other complex tasks like sparse parity

- MambaFormer: interleaving SSM (Mamba) blocks with MHA blocks
- Leverages the strengths of both Mamba and Transformers (good at both sparse parity and retrieval); see the sketch below
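A rough PyTorch sketch of this interleaving idea. The MambaBlockPlaceholder is a stand-in (a simple gated residual block) rather than the real Mamba/S6 layer, and the block ordering and sizes are illustrative assumptions, not the paper's exact MambaFormer definition:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaBlockPlaceholder(nn.Module):
    """Stand-in for a real Mamba (S6) block; a simple gated residual block so the
    sketch runs without external packages."""
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                               # x: (batch, seq, d_model)
        h, gate = self.in_proj(self.norm(x)).chunk(2, dim=-1)
        return x + self.out_proj(F.silu(gate) * h)

class AttentionBlock(nn.Module):
    """Pre-norm causal multi-head self-attention block with a residual connection."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        seq = x.size(1)
        causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=x.device), 1)
        out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        return x + out

class HybridModel(nn.Module):
    """Interleave SSM-style blocks with attention blocks, starting with an SSM block
    (so no positional encoding is added)."""
    def __init__(self, d_model=64, n_heads=4, n_layers=4):
        super().__init__()
        blocks = [MambaBlockPlaceholder(d_model)]
        for _ in range(n_layers):
            blocks += [AttentionBlock(d_model, n_heads), MambaBlockPlaceholder(d_model)]
        self.blocks = nn.Sequential(*blocks)

    def forward(self, x):                               # x: (batch, seq, d_model)
        return self.blocks(x)

model = HybridModel()
print(model(torch.randn(2, 16, 64)).shape)              # torch.Size([2, 16, 64])
```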



- Figures from "What Can Transformers Learn In-Context? A Case Study of Simple Function Classes (Garg et al, 2022)"
2. Sub-quadratic architectures
S4
- family of sequence models with the discretized state-space recurrence $h_t = \bar{A} h_{t-1} + \bar{B} x_t,\; y_t = C h_t$
Mamba
- selection mechanism in $\bar{A}, \bar{B}, C$, making them dependent on the input $x_t$
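A minimal NumPy sketch of the recurrence above, contrasting a time-invariant (S4-style) scan with a toy input-dependent (Mamba-style) selection. The per-step Python loop and the way parameters are made input-dependent are illustrative only, not the paper's hardware-aware S6 implementation:

```python
import numpy as np

def s4_scan(x, A_bar, B_bar, C):
    """Time-invariant SSM: h_t = A_bar h_{t-1} + B_bar x_t,  y_t = C h_t."""
    h = np.zeros(A_bar.shape[0])
    ys = []
    for x_t in x:
        h = A_bar @ h + B_bar @ x_t      # state update
        ys.append(C @ h)                 # readout
    return np.stack(ys)

def selective_scan(x, A_bar_fn, B_bar_fn, C_fn):
    """Mamba-style selection: A_bar, B_bar, C are functions of the current input x_t."""
    h = None
    ys = []
    for x_t in x:
        A_bar, B_bar, C = A_bar_fn(x_t), B_bar_fn(x_t), C_fn(x_t)
        if h is None:
            h = np.zeros(A_bar.shape[0])
        h = A_bar @ h + B_bar @ x_t
        ys.append(C @ h)
    return np.stack(ys)

# Tiny usage example with random parameters (state size n=4, input/output dim d=3).
rng = np.random.default_rng(0)
n, d, T = 4, 3, 5
x = rng.standard_normal((T, d))
A, B, C = 0.9 * np.eye(n), rng.standard_normal((n, d)), rng.standard_normal((d, n))
print(s4_scan(x, A, B, C).shape)                         # (T, d)
print(selective_scan(x, lambda z: A * np.tanh(z[0]),     # toy input dependence
                     lambda z: B, lambda z: C).shape)    # (T, d)
```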
3. Experimental Setup
- Trained each model from scratch

3.1 Model Training in In-context Learning
- Train models to learn specific function classes $\mathcal{F}$ in-context
- Training Step
- Select a function $f \in \mathcal{F}$ from distribution $D_{\mathcal{F}}$
- Sample a sequence of random inputs $x_1, \cdots, x_N \in \mathbb{R}^d$ i.i.d. from $D_{\mathcal{X}}$
  ($N$: number of in-context examples, $d$: dimension of $x_i$)
- Prompt $P = (x_1, f(x_1), \cdots, x_N, f(x_N))$ from steps 1 & 2
- Train model $f_\theta$ with
  $\min_\theta \, \mathbb{E}_P \left[ \frac{1}{N} \sum_{i=1}^{N-1} \ell\big(f_\theta(P^i), f(x_{i+1})\big) \right]$,
  where $P^i := (x_1, f(x_1), \cdots, x_i, f(x_i), x_{i+1})$. For $f: \mathbb{R}^d \to \mathbb{R}$, append $d-1$ zeros to $f(x)$.
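To make the setup concrete, here is a minimal NumPy sketch of the prompt construction and objective above, using linear regression as the function class and a trivial placeholder model; the names and sizes are illustrative, not the paper's training code:

```python
import numpy as np

d, N = 20, 40                          # input dimension and number of in-context examples

def sample_linear_task(rng):
    """Sample f(x) = w^T x with w ~ N(0, I_d)."""
    w = rng.standard_normal(d)
    return lambda x: w @ x

def build_prompt(f, rng):
    """Prompt P = (x_1, f(x_1), ..., x_N, f(x_N)); scalar f(x) padded with d-1 zeros."""
    xs = rng.standard_normal((N, d))
    ys = np.array([f(x) for x in xs])
    y_tokens = np.zeros((N, d))
    y_tokens[:, 0] = ys
    prompt = np.empty((2 * N, d))
    prompt[0::2] = xs                  # interleave x_i and f(x_i) tokens
    prompt[1::2] = y_tokens
    return prompt, ys

def icl_loss(model, prompt, ys):
    """Mean squared error of predicting f(x_{i+1}) from each prefix P^i."""
    losses = []
    for i in range(N - 1):
        prefix = prompt[: 2 * i + 3]   # P^i = (x_1, f(x_1), ..., x_i, f(x_i), x_{i+1})
        pred = model(prefix)           # the sequence model's prediction for f(x_{i+1})
        losses.append((pred - ys[i + 1]) ** 2)
    return float(np.mean(losses))

rng = np.random.default_rng(0)
f = sample_linear_task(rng)
prompt, ys = build_prompt(f, rng)
print(icl_loss(lambda p: 0.0, prompt, ys))   # trivial model that always predicts 0
```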
Model architectures
- Mamba (state-of-the-art SSM)
- S4 (linear time-invariant counterpart to Mamba)
- S4-Mamba (Mamba's S6 replaced with S4)

- Figure from Mamba (Gu & Dao, 2023)
Model Training & Evaluation
- 500,000 iterations of training
- 1,280 prompts for evaluation, sampled from $D_{\mathcal{F}}, D_{\mathcal{X}}$ consistent with training
3.2 ICL tasks
3.2.1 Learning regression
In-context examples $x_i$: sampled from the Gaussian distribution $\mathcal{N}(0, I_d)$
Loss: squared error loss
- Linear regression: $\mathcal{F} = \{f \mid f(x) = w^\top x,\ w \in \mathbb{R}^d\}$, $w$ sampled from $\mathcal{N}(0, I_d)$.
- Sparse linear regression: identical to linear regression, except only $k$ randomly chosen coordinates of $w$ are used (the rest are set to 0)
- Two-layer neural network: $\mathcal{F} = \{f \mid f(x) = W^{(2)} \sigma(W^{(1)} x),\ W^{(2)} \in \mathbb{R}^{1 \times h},\ W^{(1)} \in \mathbb{R}^{h \times d}\}$, where $\sigma$ is ReLU.
- Decision tree: full binary tree with a fixed depth and input $x \in \mathbb{R}^d$
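A minimal NumPy sketch of how these function classes can be sampled; the values of d, h, k are illustrative rather than the paper's settings, and the decision-tree class is omitted for brevity:

```python
import numpy as np

d, h, k = 20, 100, 3        # input dim, hidden width, sparsity (illustrative values)

def sample_linear(rng):
    w = rng.standard_normal(d)
    return lambda x: w @ x

def sample_sparse_linear(rng):
    w = rng.standard_normal(d)
    mask = np.zeros(d)
    mask[rng.choice(d, size=k, replace=False)] = 1.0   # keep only k coordinates of w
    w = w * mask
    return lambda x: w @ x

def sample_two_layer_nn(rng):
    W1 = rng.standard_normal((h, d))
    W2 = rng.standard_normal((1, h))
    return lambda x: float(W2 @ np.maximum(W1 @ x, 0.0))  # ReLU hidden layer

rng = np.random.default_rng(0)
x = rng.standard_normal(d)             # in-context input x ~ N(0, I_d)
for sampler in (sample_linear, sample_sparse_linear, sample_two_layer_nn):
    f = sampler(rng)
    print(sampler.__name__, f(x))
```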
3.2.2 Learning with outliers
Each pair $(x_i, f(x_i))$ is replaced with "dummy" vectors with a fixed probability $p$
The loss is not computed for the replaced outliers during training
- Orthogonal-outlier regression
- Many-outlier regression: $x$ and $f(x)$ are randomly replaced with a $\{1\}^d$ (all-ones) vector and a one-hot vector, respectively, with 90% probability
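A sketch of the dummy-replacement and loss-masking scheme, assuming (for illustration) an all-ones dummy input and a one-hot dummy output token:

```python
import numpy as np

d, N, p = 20, 40, 0.5                  # dims / examples / outlier probability (illustrative)
rng = np.random.default_rng(0)

def add_outliers(xs, ys, p, rng):
    """Replace each (x_i, f(x_i)) pair with dummy vectors with probability p.

    Returns the corrupted tokens plus a mask marking positions where the loss
    is still computed (the untouched, non-outlier pairs)."""
    is_outlier = rng.random(len(xs)) < p
    xs = xs.copy()
    y_tokens = np.zeros((len(ys), d))
    y_tokens[:, 0] = ys
    dummy_x = np.ones(d)               # assumed all-ones dummy input
    dummy_y = np.eye(d)[0]             # assumed one-hot dummy output token
    xs[is_outlier] = dummy_x
    y_tokens[is_outlier] = dummy_y
    return xs, y_tokens, ~is_outlier   # loss mask: True where the loss is computed

xs = rng.standard_normal((N, d))
ys = xs @ rng.standard_normal(d)
xs_c, y_tok, loss_mask = add_outliers(xs, ys, p, rng)
print(loss_mask.sum(), "of", N, "pairs contribute to the loss")
```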
3.2.3 Learning discrete functions
- Sparse parity: $x_i$ sampled uniformly at random from $\{-1, 1\}^d$, $\mathcal{F} = \{f \mid f(x) = \prod_{j \in S} x[j],\ S \subset \{1, \dots, d\},\ |S| = k\}$; uses cross-entropy loss
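A short sketch of how sparse-parity tasks and inputs can be generated (sizes are illustrative):

```python
import numpy as np

d, k, N = 10, 2, 64                    # ambient dim, parity size, #examples (illustrative)
rng = np.random.default_rng(0)

def sample_parity_task(rng):
    """Pick a random size-k subset S; f(x) = prod_{j in S} x[j]."""
    S = rng.choice(d, size=k, replace=False)
    return lambda x: int(np.prod(x[S]))

f = sample_parity_task(rng)
xs = rng.choice([-1, 1], size=(N, d))  # inputs uniform over {-1, 1}^d
labels = np.array([f(x) for x in xs])  # each label is +1 or -1
print(labels[:10])
```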
3.2.4 Learning Chain-of-Thought
- Chain-of-Thought-I/O: $\mathcal{F} = \{f \mid f(x) = W^{(2)} \sigma(W^{(1)} x),\ W^{(2)} \in \mathbb{R}^{1 \times h},\ W^{(1)} \in \mathbb{R}^{h \times d}\}$; interleaves the intermediate hidden feature $s_i = \sigma(W^{(1)} x_i)$ to create the input sequence $(x_1, s_1, f(x_1), \cdots, x_N, s_N, f(x_N), x_{\text{test}})$
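A sketch of how the interleaved CoT-I/O prompt can be assembled; padding every token to a common width is an assumption made here so the sequence stacks into one array:

```python
import numpy as np

d, h, N = 20, 16, 8                    # illustrative sizes
rng = np.random.default_rng(0)
W1 = rng.standard_normal((h, d))
W2 = rng.standard_normal((1, h))

def pad(v, width):
    """Right-pad a vector with zeros so every token has the same width."""
    out = np.zeros(width)
    out[: v.size] = v
    return out

width = max(d, h)
tokens = []
for _ in range(N):
    x = rng.standard_normal(d)
    s = np.maximum(W1 @ x, 0.0)        # intermediate hidden feature s_i = ReLU(W1 x_i)
    y = W2 @ s                         # final output f(x_i)
    tokens += [pad(x, width), pad(s, width), pad(y, width)]
x_test = rng.standard_normal(d)
tokens.append(pad(x_test, width))      # (x_1, s_1, f(x_1), ..., x_N, s_N, f(x_N), x_test)
prompt = np.stack(tokens)
print(prompt.shape)                    # (3N + 1, max(d, h))
```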
3.2.5 Learning Retrieval
- Vector-valued multi-query associative recall (MQAR)
A model's associative recall ability is highly related to its ICL abilities
The model is given key-value pairs of vectors $\{k_1, v_1, \cdots, k_n, v_n\}$
Queries: $\{q_1, \cdots, q_m\}$, each sampled from the key set
For each query $q_j$, the model must output the value $v_l$ associated with the matching key $k_l$.
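A minimal sketch of constructing a vector-valued MQAR instance (key-value pairs followed by queries, with the associated values as targets); dimensions are illustrative:

```python
import numpy as np

d, n, m = 16, 8, 4                     # vector dim, #key-value pairs, #queries (illustrative)
rng = np.random.default_rng(0)

keys = rng.standard_normal((n, d))
values = rng.standard_normal((n, d))
query_idx = rng.choice(n, size=m, replace=False)

# Input sequence: the key-value pairs followed by the queries.
kv_tokens = np.empty((2 * n, d))
kv_tokens[0::2] = keys
kv_tokens[1::2] = values
queries = keys[query_idx]
sequence = np.concatenate([kv_tokens, queries])

# Targets: for each query q_j (a copy of key k_l), recall the associated value v_l.
targets = values[query_idx]
print(sequence.shape, targets.shape)   # (2n + m, d), (m, d)
```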
4. Experimental results
4.1 Mamba can in-context learn!

- Tasks of particular interest: decision tree, sparse parity, Chain-of-Thought
- Sparse parity

Filtering outliers in regression

Chain of Thought
- Mamba models excel over Transformers at smaller model sizes

MQAR

5. The Advantages of Hybrid Architectures for In-context Learning

5.1 Simultaneously learning parities and retrieval

6. Discussion
- SSMs are capable in-context learners
- Neither SSMs nor Transformers are great at all tasks.
- Hybrid architecture MambaFormer achieves best-of-both-worlds performance
Future research directions
- How performance on artificial ICL tasks correlates with general language modeling capabilities, such as perplexity on standard NLP benchmarks.
- The potential for developing more effective architectures by integrating elements from transformers, SSMs, and gating mechanisms.
- Identifying architectural features that contribute to effective in-context learning
- Assessing the impact of MambaFormer and other innovative architectures on language modeling performance.