Decoder-only transformers can be conceptualized as infinite multi-state RNNs
* Multi-state RNNs (MSRNNs): an RNN variant whose hidden state is a matrix of multiple single-states; when the number of single-states is unbounded, the MSRNN is infinite
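A hedged sketch of the definition (notation is my reconstruction of the paper's formulation): at layer $l$ and step $t$, an MSRNN updates a matrix-valued state $H_t^l \in \mathbb{R}^{g(t) \times d}$,

$$\big(H_t^l,\; x_t^l\big) = f^l\big(H_{t-1}^l,\; x_t^{l-1}\big),$$

where each row of $H_t^l$ is a single-state; a standard RNN is the special case $g(t)=1$, and an infinite MSRNN is the case where $g(t)$ is unbounded.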
Pretrained transformers can be converted into finite multi-state RNNs by fixing the size of their hidden state
-> several existing transformer KV-cache compression techniques can be framed as such conversion policies
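As a hedged illustration of what a "conversion policy" means here (a minimal sketch; the function names and tensor shapes are my own assumptions, not the paper's code): once the KV cache exceeds a fixed budget, a policy selects which single-states to keep, so e.g. windowed attention becomes the "keep the most recent tokens" policy.

```python
import torch

def window_policy(keys, values, attn_weights, budget):
    # Windowed attention framed as a conversion policy:
    # keep only the `budget` most recent single-states.
    return keys[-budget:], values[-budget:]

def compress_cache(keys, values, attn_weights, budget, policy=window_policy):
    # keys, values: (t, d) cached single-states for one layer/head.
    # attn_weights: (t,) attention scores of the current query (unused by some policies).
    if keys.shape[0] <= budget:      # cache still fits the fixed state size
        return keys, values
    return policy(keys, values, attn_weights, budget)

# usage: K = V = torch.randn(10, 64); w = torch.rand(10)
# compress_cache(K, V, w, budget=4)[0].shape  -> torch.Size([4, 64])
```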
A novel and simple conversion policy is proposed: TOVA (Token Omission Via Attention)
TOVA outperforms all other baseline policies and is nearly on par with the full (infinite) model while using only a fraction of the cache size.
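Below is a hedged sketch of how I understand the TOVA policy: at each decoding step, once the cache exceeds the budget, the single-state with the lowest attention score from the current query (averaged over heads beforehand) is dropped. Function and variable names are illustrative, not the authors' implementation.

```python
import torch

def tova_step(keys, values, attn_weights, budget):
    # keys, values: (t, d) cached single-states after appending the newest token.
    # attn_weights: (t,) softmax attention of the current query over the cache.
    if keys.shape[0] <= budget:                 # still within the fixed state size
        return keys, values
    drop = torch.argmin(attn_weights).item()    # least-attended token is omitted
    keep = torch.cat([torch.arange(drop), torch.arange(drop + 1, keys.shape[0])])
    return keys[keep], values[keep]
```

Note that, unlike a recency window, the dropped token is not necessarily the oldest one; in the decode loop at most one token needs to be dropped per step.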
Transformer decoder LLMs often behave in practice as finite MSRNNs.
During autoregressive decoding, the per-token layer computation is (layer norm and residual connections omitted):

$$K_t^l = \begin{pmatrix} K_{t-1}^l \\ k_t^l \end{pmatrix}, \qquad V_t^l = \begin{pmatrix} V_{t-1}^l \\ v_t^l \end{pmatrix}, \qquad x_t^l = \mathrm{FF}^l\big(\mathrm{Attn}(q_t^l, K_t^l, V_t^l)\big),$$

where $q_t^l, k_t^l, v_t^l$ are the self-attention projections of $x_t^{l-1}$, and each single-state (row) of $K_t^l, V_t^l$ corresponds to a specific token.
MSRNN equation for transformers, with the KV cache playing the role of the multi-state:

$$\Big(\big(K_t^l, V_t^l\big),\; x_t^l\Big) = f^l\Big(\big(K_{t-1}^l, V_{t-1}^l\big),\; x_t^{l-1}\Big)$$
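To make the MSRNN view concrete, here is a minimal per-token sketch (my own simplification: single head, no output projection, feed-forward, layer norm, or residuals): the hidden state is the growing (K, V) cache, which gains one single-state per token.

```python
import math
import torch

def msrnn_step(K, V, q, k, v):
    # K, V: (t-1, d) multi-state from the previous step; q, k, v: (d,) projections
    # of the current token. The state grows by exactly one single-state per token,
    # which is what makes the vanilla transformer an *infinite* MSRNN.
    K = torch.cat([K, k[None]], dim=0)
    V = torch.cat([V, v[None]], dim=0)
    scores = torch.softmax(K @ q / math.sqrt(q.shape[0]), dim=0)   # (t,)
    out = scores @ V                                               # attention output
    return (K, V), out                                             # new state, layer output
```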
In practice, transformer models are trained up to a specific context length and often struggle to generalize to longer inputs.
In theory, though, they possess the capacity to handle infinite-length inputs, and thus correspond to an infinite-size MSRNN.