Do we really need Mamba for Vision?
Selective SSM (token mixer of Mamba)
Four input-dependent parameters (Δ, A, B, C)
Transforms them to (Ā, B̄):
  Ā = exp(ΔA)
  B̄ = (ΔA)⁻¹(exp(ΔA) − I)·ΔB ≈ ΔB
Sequence-to-sequence transform of the SSM:
  h_t = Ā h_{t-1} + B̄ x_t
  y_t = C h_t
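A minimal NumPy sketch of this recurrence, just to make the shapes concrete — the layouts (L, D, N for sequence length, channels, state size) and the B̄ ≈ ΔB simplification are assumptions, and this is the plain sequential form, not the hardware-aware parallel scan Mamba actually uses:

```python
import numpy as np

def selective_ssm(x, A, delta, B, C):
    """Sequential (recurrent) form of a selective SSM.

    x:     (L, D) input tokens          A:    (D, N) state matrix parameter
    delta: (L, D) input-dependent step  B, C: (L, N) input-dependent projections
    Returns y of shape (L, D).
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                 # fixed-size hidden state = the SSM's "memory"
    y = np.zeros((L, D))
    for t in range(L):                   # causal: h at step t only sees tokens <= t
        A_bar = np.exp(delta[t][:, None] * A)          # Ā = exp(ΔA)
        B_bar = delta[t][:, None] * B[t][None, :]      # B̄ ≈ ΔB
        h = A_bar * h + B_bar * x[t][:, None]          # h_t = Ā h_{t-1} + B̄ x_t
        y[t] = h @ C[t]                                # y_t = C h_t
    return y

# toy usage
rng = np.random.default_rng(0)
L, D, N = 8, 4, 16
y = selective_ssm(rng.standard_normal((L, D)),
                  -np.exp(rng.standard_normal((D, N))),   # negative A keeps the state stable
                  0.1 * np.ones((L, D)),
                  rng.standard_normal((L, N)),
                  rng.standard_normal((L, N)))
print(y.shape)  # (8, 4)
```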
Causal attention: stores all previous keys and values (KV cache) as its memory — lossless, but memory grows with sequence length
RNN-like SSM: compresses the past into a fixed-size hidden state — constant memory, but inherently lossy
Limitation of Mamba: the hidden state can only access information from the current and previous timesteps (causal token mixing) — fine for autoregressive generation, but unnecessary for understanding tasks that see the whole input at once
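A toy illustration of the two memory models (per layer, counting floats only; D and the SSM state size N here are illustrative values, not tied to any specific model):

```python
def kv_cache_floats(L, D):
    # causal attention: keeps K and V for every previous token -> grows with L
    return 2 * L * D

def ssm_state_floats(D, N):
    # RNN-like SSM: a single (D, N) hidden state, regardless of sequence length
    return D * N

D, N = 768, 16
for L in (196, 2304, 4608, 16384):
    print(f"L={L:6d}  KV cache: {kv_cache_floats(L, D):10,d} floats   "
          f"SSM state: {ssm_state_floats(D, N):8,d} floats")
```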
Consider a Transformer block with MLP ratio of 4
input: X ∈ R^(L×D) (token length L, channel dimension D)
FLOPs: 24D²L + 4DL²
ratio of the quadratic term to the linear term: 4DL² / 24D²L = L / 6D
Metric for long-sequence tasks: the quadratic (attention) term dominates once L > 6D (worked example after the thresholds below)
ViT-S: 384 channels => threshold L > 6 × 384 = 2304 tokens
ViT-B: 768 channels => threshold L > 6 × 768 = 4608 tokens
ImageNet classification at 224² resolution with 16×16 patches gives only 196 tokens — far below either threshold, so it does not count as a long-sequence task
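A back-of-the-envelope check of the numbers above — a sketch assuming the 24D²L + 4DL² estimate; the per-component split in the comments uses the usual 2-FLOPs-per-MAC convention and ignores minor terms:

```python
def transformer_block_flops(L, D):
    """Approximate FLOPs of one Transformer block with MLP ratio 4."""
    qkv_out_proj  = 8 * L * D * D    # Q, K, V and output projections: 4 * (L*D*D) MACs
    attn_matrices = 4 * D * L * L    # Q K^T and scores*V: 2 * (L*L*D) MACs
    mlp           = 16 * L * D * D   # D -> 4D -> D: 2 * (L*D*4D) MACs
    return qkv_out_proj + attn_matrices + mlp   # = 24*D^2*L + 4*D*L^2

def long_sequence_threshold(D):
    """Token length where the quadratic term overtakes the linear one: L > 6D."""
    return 6 * D

for name, D in [("ViT-S", 384), ("ViT-B", 768)]:
    L_star = long_sequence_threshold(D)
    ratio_at_196 = 196 / (6 * D)     # ImageNet at 224^2 with 16x16 patches -> 196 tokens
    print(f"{name}: quadratic term dominates beyond L = {L_star}; "
          f"at L = 196 it is only {ratio_at_196:.1%} of the linear term")
```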