Transformers require $O(n)$ memory and $O(n)$ compute to predict the next token of a sequence of length $n$ (even when using Flash Attention).
Attempts to build similar architectures with only $O(1)$ memory per predicted token: state space models such as S4 or Mamba, RNNs, and models that can be trained in parallel like linear attention / parallel RNNs.
Recent work reports that GSSMs are competitive with transformers, but it is not clear what these models sacrifice for their efficiency.
Theoretical analysis of the copying task
In practice, a large GSSM may have enough capacity to represent the entire input in its latent state.
Dictionary $\mathbb{D}$: the alphabet tokens plus a few special tokens (beginning-of-sequence, copy).
Sequence-to-sequence model: $\mathbb{D}^* \to \mathbb{D}^*$.
Sequence-to-token model: $\mathbb{D}^* \to \mathbb{D}$ (applied autoregressively it induces a seq2seq model).
State space: a finite set $\mathcal{S}$ of states.
Memory of the state space: the number of bits required to encode the states of $\mathcal{S}$, i.e. $m(\mathcal{S}) = \log_2 |\mathcal{S}|$.
A GSSM is a sequence model defined by an update rule $u : \mathcal{S} \times \mathbb{D} \to \mathcal{S}$ and an output function $r : \mathcal{S} \to \mathbb{D}$.
Note that for any sequence model there are two types of memory considerations: input-dependent memory (the state carried while reading the input) and input-independent memory (the parameters / description of the model itself).
The GSSM definition constrains only the input-dependent memory.
It does not restrict in any way the amount of input-independent memory or the runtime of the state updates.
Leaving all other considerations unconstrained only strengthens the lower bound on the state-space memory.
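A minimal sketch of the GSSM abstraction above (class and method names are illustrative, not from the paper): the only input-dependent memory is the current state, whose capacity is $\log_2 |\mathcal{S}|$ bits no matter how long the input gets.

```python
import math
from typing import Callable, Hashable, Iterable, List, Set

class GSSM:
    """Generalized state space model: an update rule u: S x D -> S and an output map r: S -> D.
    The only input-dependent memory is the current state, an element of the finite set S."""

    def __init__(self, states: Set, update: Callable, output: Callable, init_state):
        self.states = states      # finite state space S
        self.update = update      # u(state, token) -> next state
        self.output = output      # r(state) -> predicted next token
        self.init_state = init_state

    def capacity_bits(self) -> float:
        # m(S) = log2 |S|: bits needed to encode one state
        return math.log2(len(self.states))

    def generate(self, prompt: Iterable[Hashable], n_steps: int) -> List:
        state = self.init_state
        for tok in prompt:        # reading the prompt only updates the fixed-size state
            state = self.update(state, tok)
        out = []
        for _ in range(n_steps):  # autoregressive generation: feed predictions back in
            tok = self.output(state)
            out.append(tok)
            state = self.update(state, tok)
        return out
```

The lower bound below only uses the fact that everything generated after the prompt is a function of the state alone.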
$n$: input length.
$d$: dimension.
$x_1, \dots, x_n$: input tokens, embedded as vectors in $\mathbb{R}^d$.
An attention head is parametrized by matrices $W_k, W_q, W_v \in \mathbb{R}^{d \times d}$.
The output of the head at token $i$ is $o_i = \sum_{j \le i} \operatorname{softmax}\big(\langle W_q x_i, W_k x_{1:i} \rangle\big)_j \, W_v x_j$ (causal: it attends only to positions $j \le i$).
With $H$ attention heads, the full (concatenated) output dimension is $H \cdot d$.
Embedding: maps each token to a vector in $\mathbb{R}^d$.
MLP: a feed-forward map applied to each position.
Both the embedding and the MLP are applied at the token level (position-wise).
An attention block is a set of heads applied in parallel.
A transformer block is an attention block followed by an MLP applied to the concatenated output of the heads.
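A NumPy sketch of these definitions (single sequence, no layer norm or residuals; variable names are generic rather than the paper's notation): each head is parametrized by key/query/value matrices, heads run in parallel, and the MLP acts position-wise on the concatenated head outputs.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(X, Wk, Wq, Wv):
    """Causal attention head. X: (n, d) embedded tokens; Wk, Wq, Wv: (d, d).
    The output at token i is a weighted average of the values at tokens j <= i."""
    n = X.shape[0]
    scores = (X @ Wq) @ (X @ Wk).T                                # scores[i, j] = <W_q x_i, W_k x_j>
    scores[np.triu(np.ones((n, n), dtype=bool), k=1)] = -np.inf   # mask j > i (causal)
    return softmax(scores) @ (X @ Wv)                             # (n, d)

def transformer_block(X, heads, W1, b1, W2, b2):
    """Attention block (heads applied in parallel), then a position-wise MLP
    on the concatenation of the head outputs."""
    concat = np.concatenate([attention_head(X, *h) for h in heads], axis=-1)  # (n, H*d)
    return np.maximum(concat @ W1 + b1, 0.0) @ W2 + b2                        # ReLU MLP, token-wise

# Example shapes: n tokens, dimension d, H heads.
n, d, H = 8, 16, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
heads = [tuple(rng.normal(size=(d, d)) for _ in range(3)) for _ in range(H)]
W1, b1 = rng.normal(size=(H * d, 4 * d)), np.zeros(4 * d)
W2, b2 = rng.normal(size=(4 * d, d)), np.zeros(d)
print(transformer_block(X, heads, W1, b1, W2, b2).shape)   # (8, 16)
```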
To perform the hashing described in the algorithm, the model needs to leverage local positional information to define a hash and then apply it globally over the entire input; this is done with a hard version of ALiBi (Hard-ALiBi).
ALiBi: biases the attention scores with a penalty proportional to the query-key distance, $-m \cdot (i - j)$, where $m$ is a head-specific slope fixed before training.
Hard-ALiBi: add a bias to the $h$-th attention head that hard-masks (sets to $-\infty$) all positions beyond a head-specific distance, so masked heads attend only to the most recent tokens, while some heads are left unmasked for global attention.
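A sketch of the two bias schemes, under my reading of the notes above (the exact head configuration, e.g. how many heads stay unmasked, is an assumption): ALiBi adds a soft penalty $-m\,(i-j)$ to the score of query $i$ attending to key $j$, while Hard-ALiBi replaces the soft penalty with a hard $-\infty$ cutoff so a masked head sees only the most recent tokens.

```python
import numpy as np

def alibi_bias(n, slope):
    """Standard ALiBi: linear penalty on the query-key distance, plus a causal mask."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    bias = -slope * (i - j).astype(float)   # slope m is head-specific and fixed before training
    bias[j > i] = -np.inf                   # causal: token i never attends to the future
    return bias

def hard_alibi_bias(n, window=None):
    """Hard-ALiBi (assumed form): a masked head with window w attends only to the last w
    positions; window=None leaves the head unmasked (global causal attention)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    bias = np.zeros((n, n))
    bias[j > i] = -np.inf                   # causal mask
    if window is not None:
        bias[i - j >= window] = -np.inf     # hard cutoff beyond the local window
    return bias

# The bias is added to the attention scores before the softmax:
# softmax(scores + hard_alibi_bias(n, window=3)) lets a head read only the last 3 tokens.
```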
Theorem 2.3.
For all $n$, there exists a depth-2 transformer whose dimension grows only logarithmically with $n$ s.t. for all $n' \le n$ and for any copy distribution $\mathcal{D}_{n'}$ over length-$n'$ inputs, the copying error is small, controlled by the probability that the input contains duplicated n-grams.
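The construction behind this theorem copies by storing n-grams and retrieving the token that follows each of them. A pure-Python emulation of that storage/retrieval idea (not a transformer, just the mechanism) makes the failure mode concrete: copying breaks exactly when some n-gram occurs twice in the input.

```python
def copy_via_ngram_lookup(seq, k):
    """Emulate the n-gram storage/retrieval mechanism: after <COPY>, repeatedly emit
    the token that followed the last k emitted tokens in the original sequence.
    This reproduces seq exactly unless some k-gram is duplicated."""
    table = {}
    for i in range(len(seq) - k):
        table[tuple(seq[i:i + k])] = seq[i + k]   # duplicate k-grams overwrite each other
    out = list(seq[:k])                           # for simplicity, seed with the first k tokens
    while len(out) < len(seq):
        out.append(table[tuple(out[-k:])])
    return out

seq = list("transformers_copy_long_strings")
assert copy_via_ngram_lookup(seq, k=4) == seq     # all 4-grams here are unique, so the copy is exact
```

With a duplicated k-gram the lookup would jump to the wrong continuation, which is why the next lemma bounds the probability of such collisions.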
Lemma 2.4.
Let $\mathcal{D}_n$ be the copy distribution generated by sampling each token i.i.d. from the uniform distribution over the non-special (alphabet) tokens. Then the probability of duplicated n-grams, and hence the error bound of Theorem 2.3, is small.
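A back-of-the-envelope version of the bound this lemma needs (assuming i.i.d. uniform tokens and a plain union bound over pairs of windows; the paper's exact constants may differ):

```latex
\Pr\!\big[\exists\, i \ne j:\; x_{i:i+\ell} = x_{j:j+\ell}\big]
\;\le\; \binom{n}{2}\,|\mathbb{D}|^{-\ell}
\;\le\; \tfrac{1}{2}\,n^{2}\,|\mathbb{D}|^{-\ell},
\qquad\text{so}\quad
\ell \;\ge\; \log_{|\mathbb{D}|}\!\big(n^{2}/(2\varepsilon)\big)
\;\Longrightarrow\;
\Pr[\text{duplicated } \ell\text{-gram}] \le \varepsilon .
```

So n-grams of length logarithmic in $n$ already make collisions rare, which is what lets the transformer in Theorem 2.3 stay small.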
Corollary 2.5.
Fix some $\varepsilon > 0$ and some $n$. There exists a depth-2 transformer of dimension logarithmic in $n$ s.t. for the uniform copy distribution $\mathcal{D}_n$, the copying error is at most $\varepsilon$.
Theorem 2.7.
Fix some GSSM $H$ over state space $\mathcal{S}$. Then for all $n$, for the uniform copy distribution $\mathcal{D}_n$, the model has error $\mathrm{err}_{\mathcal{D}_n}(H) \ge 1 - |\mathcal{S}| / |\mathbb{D}|^n$ (the output after the copy marker depends only on the state, so at most $|\mathcal{S}|$ of the $|\mathbb{D}|^n$ possible strings can be reproduced).
Corollary 2.8.
Fix some $n$. Then every GSSM with state space $\mathcal{S}$ s.t. $m(\mathcal{S}) < n \log_2 |\mathbb{D}| - 1$ has error greater than $1/2$ for the uniform copy distribution.
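Taking the bound of Theorem 2.7 as stated above, the corollary is a one-line calculation (the model can reproduce at most $|\mathcal{S}|$ of the $|\mathbb{D}|^n$ equally likely strings):

```latex
\mathrm{err}_{\mathcal{D}_n}(H) \;\ge\; 1 - \frac{|\mathcal{S}|}{|\mathbb{D}|^{n}} \;>\; \frac{1}{2}
\;\Longleftrightarrow\;
|\mathcal{S}| \;<\; \tfrac{1}{2}\,|\mathbb{D}|^{n}
\;\Longleftrightarrow\;
m(\mathcal{S}) \;=\; \log_2 |\mathcal{S}| \;<\; n \log_2 |\mathbb{D}| \;-\; 1 .
```

In other words, to copy length-$n$ strings with error below $1/2$, the state must carry roughly $n \log_2 |\mathbb{D}|$ bits, i.e., the information content of the string itself.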
Test whether the models generalize out-of-distribution (to lengths not seen during training).
Understand which function the model has learned
Trained all models on short sequences and tested them on sequences of up to 100 tokens (string-level accuracy).
Transformers show better generalization to longer inputs than GSSMs.
A transformer with Hard-ALiBi trained on sequences of length under 50 shows almost perfect generalization up to 1000 tokens.
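String-level accuracy, as used above, gives credit only for an exact, token-for-token reproduction of the target string; a minimal helper (names are illustrative):

```python
def string_level_accuracy(predictions, targets):
    """Fraction of examples where the generated sequence matches the target exactly
    (no partial credit for getting a prefix right)."""
    exact = sum(1 for p, t in zip(predictions, targets) if list(p) == list(t))
    return exact / len(targets)

# Only the first prediction is a perfect copy, so the accuracy is 0.5.
print(string_level_accuracy([[1, 2, 3], [1, 2, 9]], [[1, 2, 3], [1, 2, 3]]))
```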
To test whether the transformer really uses the n-gram storage-and-retrieval mechanism:
Train a Hard-ALiBi transformer on the copy task with a dataset that contains duplicate n-grams.
Draw uniform sequences of tokens and randomly replace some n-gram with another n-gram that already appears in the sequence (so each example always contains two copies of some n-gram).
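A sketch of this data construction (length, vocabulary size, and n are placeholder parameters): draw a uniform sequence, pick an n-gram that already occurs in it, and overwrite another non-overlapping window with a copy, so every example contains the same n-gram twice.

```python
import random

def duplicate_ngram_sequence(length, vocab_size, n, rng=None):
    """Uniform random sequence in which one n-gram is forced to appear twice.
    Assumes length >= 3 * n so a non-overlapping slot always exists."""
    rng = rng or random.Random(0)
    seq = [rng.randrange(vocab_size) for _ in range(length)]
    src = rng.randrange(length - n + 1)          # an n-gram that already appears in the sequence
    ngram = seq[src:src + n]
    while True:                                  # choose a non-overlapping window to overwrite
        dst = rng.randrange(length - n + 1)
        if abs(dst - src) >= n:
            break
    seq[dst:dst + n] = ngram
    return seq

print(duplicate_ngram_sequence(length=20, vocab_size=10, n=3))
```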
n-gram lookup task: the model must use a given n-gram as a key to look up the $k$ tokens that follow that n-gram in the sequence (data-generation sketch after the two versions below).
Suffix key version
Prefix key version
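A data-generation sketch for the lookup task, under my reading of the two variants (the exact prompt layout is my assumption, not taken verbatim from the paper): in the suffix-key version the query n-gram is appended after the sequence, so the model only learns at the very end what it must retrieve; in the prefix-key version the query comes first, so a fixed-state model knows in advance which n-gram to track.

```python
import random

def ngram_lookup_example(length, vocab_size, n, k, prefix_key=False, rng=None):
    """Build (prompt, answer) where the answer is the k tokens following the queried n-gram.
    prefix_key=True places the query before the sequence, otherwise after it.
    (Uniqueness of the query n-gram inside the sequence is not enforced in this sketch.)"""
    rng = rng or random.Random(0)
    seq = [rng.randrange(vocab_size) for _ in range(length)]
    pos = rng.randrange(length - n - k)      # leave room for the k-token answer
    query = seq[pos:pos + n]
    answer = seq[pos + n:pos + n + k]
    prompt = query + seq if prefix_key else seq + query
    return prompt, answer

prompt, answer = ngram_lookup_example(length=30, vocab_size=20, n=3, k=2)
```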
Pythia transformer models (410M to 2.8B parameters).
Mamba models of comparable size.
Both pretrained on the Pile with the same tokenizer.
Tasks: copy-based tasks / information retrieval (selective copying).
String-Level Accuracy
Phone-book lookup (setup sketch below)
QA
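The phone-book lookup task listed above can be scripted in a few lines (the names, number format, and question template here are invented for illustration, not the paper's exact prompt): show a synthetic phone book, then ask for one entry and check the answer by exact match.

```python
import random

def phone_book_example(n_entries, rng=None):
    """Synthetic phone-book lookup: n_entries 'name: number' lines followed by a query."""
    rng = rng or random.Random(0)
    names = [f"Person_{i}" for i in range(n_entries)]            # placeholder names
    numbers = {name: f"555-{rng.randrange(10000):04d}" for name in names}
    book = "\n".join(f"{name}: {numbers[name]}" for name in names)
    target = rng.choice(names)
    prompt = f"{book}\n\nWhat is {target}'s phone number?\nAnswer:"
    return prompt, numbers[target]

prompt, answer = phone_book_example(n_entries=20)
```

Accuracy is then just exact match of the generated number against the stored answer; increasing n_entries stretches how much of the context must be retrieved from.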
Transformers > GSSMs at copying from their input context.
SSMs have many advantages over transformers.
Future work is needed on hybrid architectures that combine SSMs with attention-like mechanisms to strengthen retrieval ability.
The title was provocative. The paper does demonstrate the transformer's strength on the retrieval side. Because of this, adopting SSMs for text, more than for other domains, probably won't be easy.