Linear RNNs and Attention

October 31, 2025

by Leonardo

1. How to Make Attention More Efficient?

Prefilling Phase: Process the input tokens to build the KV cache for generating the first output token
Decoding Phase: Generate each subsequent token based on the stored KV cache

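Size required for the KV cache: $L \times n \times H \times d$ entries for each of the keys and the values, where $L$ is the number of layers, $n$ the sequence length, $H$ the number of KV heads, and $d$ the head dimension.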
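To get a feel for the numbers, here is a minimal sketch that turns the formula into bytes. The function name, the 2-byte (fp16/bf16) element size, and the example configuration are all assumptions chosen for illustration, not values from any particular model.

```python
def kv_cache_bytes(num_layers, seq_len, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for a single sequence."""
    # L * n * H * d entries for one tensor (keys or values)
    per_tensor = num_layers * seq_len * num_kv_heads * head_dim
    # Factor 2: both keys and values are stored
    return 2 * per_tensor * bytes_per_elem


# Hypothetical configuration, for illustration only.
size = kv_cache_bytes(num_layers=32, seq_len=4096, num_kv_heads=32, head_dim=128)
print(f"{size / 2**30:.1f} GiB")  # 2.0 GiB
```

The cache grows linearly in every factor, which is why the techniques listed next each target one of them.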

Grouped-Query Attention (GQA)

Multi-Head Latent Attention (MLA)

Sliding Window Attention (SWA)

Cross-Layer Attention (CLA)
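As one concrete instance from the list above, the following is a rough NumPy sketch of Grouped-Query Attention: several query heads share a single cached KV head, so the cache shrinks by the ratio of query heads to KV heads. The function names, shapes, and random inputs are assumptions made for this sketch, and real implementations avoid materializing the repeated keys and values.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract row max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gqa(q, k, v):
    """q: (Hq, n, d); k, v: (Hkv, n, d), with Hq divisible by Hkv."""
    num_q_heads, n, d = q.shape
    group = num_q_heads // k.shape[0]
    # Each cached KV head serves `group` consecutive query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    causal = np.tril(np.ones((n, n), dtype=bool))
    scores = np.where(causal, scores, -np.inf)  # hide future positions
    return softmax(scores) @ v


rng = np.random.default_rng(0)
q = rng.normal(size=(8, 5, 4))  # 8 query heads
k = rng.normal(size=(2, 5, 4))  # only 2 KV heads are cached
v = rng.normal(size=(2, 5, 4))
out = gqa(q, k, v)
print(out.shape)  # (8, 5, 4)
```

With 8 query heads and 2 KV heads, the cache in this toy setup is 4 times smaller than with full multi-head attention.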

Yes! We can convert a linear RNN into a prefix sum (scan) problem and, given enough parallel processors, solve it in $O(\log n)$ parallel time.
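A small sketch may make the reduction concrete. It assumes a scalar recurrence $h_t = a_t h_{t-1} + b_t$ with $h_0 = 0$ (the names `a`, `b`, and both function names are illustrative). Each step is the affine map $h \mapsto a_t h + b_t$, and composing affine maps is associative, so a scan applies. The loop below simulates a Hillis-Steele scan: it runs about $\log_2 n$ rounds, and all updates within a round are independent, so they could execute in parallel.

```python
def combine(first, second):
    """Compose two affine maps h -> a*h + b, applying `first` and then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a1 * a2, a2 * b1 + b2)


def scan_linear_rnn(a, b):
    """Inclusive scan over (a_t, b_t); returns h_1..h_n for h_0 = 0."""
    elems = list(zip(a, b))
    n = len(elems)
    step = 1
    while step < n:  # about log2(n) rounds
        # Every position in a round is independent of the others.
        elems = [
            combine(elems[i - step], elems[i]) if i >= step else elems[i]
            for i in range(n)
        ]
        step *= 2
    # Applying the composed map to h_0 = 0 leaves only the offset term.
    return [offset for _, offset in elems]


def sequential_linear_rnn(a, b):
    """Reference: the usual step-by-step recurrence."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


a = [0.5, 2.0, -1.0, 0.25, 3.0]
b = [1.0, -2.0, 0.5, 4.0, 1.0]
parallel = scan_linear_rnn(a, b)
reference = sequential_linear_rnn(a, b)
print(all(abs(x - y) < 1e-9 for x, y in zip(parallel, reference)))
```

This variant does $O(n \log n)$ total work; work-efficient scans reduce that to $O(n)$ while keeping logarithmic depth. The same idea carries over when $a_t$ is a matrix, with products replaced by matrix products.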

Although the state transition is linear, nonlinearity can be introduced elsewhere.

Linear Attention can be seen as a special type of RNN.
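To illustrate the RNN view, here is a minimal sketch assuming unnormalized causal linear attention with an identity feature map, that is $O = (Q K^{T} \odot M) V$ with $M$ the lower-triangular causal mask. The function names are made up for this example. The recurrent form keeps a matrix state $S_t = S_{t-1} + k_t v_t^{T}$ and reads out $o_t = q_t^{T} S_t$, which should match the parallel form.

```python
import numpy as np


def linear_attention_parallel(Q, K, V):
    """Masked matrix form: (Q K^T ⊙ M) V, quadratic in sequence length."""
    n = Q.shape[0]
    M = np.tril(np.ones((n, n)))  # causal mask
    return ((Q @ K.T) * M) @ V


def linear_attention_recurrent(Q, K, V):
    """RNN form: constant-size state, one update per token."""
    S = np.zeros((K.shape[1], V.shape[1]))
    outputs = []
    for q_t, k_t, v_t in zip(Q, K, V):
        S = S + np.outer(k_t, v_t)  # linear state transition
        outputs.append(q_t @ S)     # readout for this position
    return np.stack(outputs)


rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 4))
K = rng.normal(size=(6, 4))
V = rng.normal(size=(6, 3))
same = np.allclose(linear_attention_parallel(Q, K, V),
                   linear_attention_recurrent(Q, K, V))
print(same)  # True
```

The state `S` has a fixed size regardless of how many tokens have been seen, which is what replaces the growing KV cache during decoding.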

Standard softmax attention has quadratic complexity: $O(n^{2} d)$ time and $O(n^{2})$ memory.

Linear Attention tries to rewrite softmax attention:

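Notice that softmax and the RMSNorm applied afterwards both perform normalization, so we can drop the softmax denominator and take $O = (\exp(Q K^{T}) \odot M) V$, with $M$ the causal (lower-triangular) mask.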
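The reason the denominator can go is that RMSNorm is invariant to rescaling each row by a positive constant, and the softmax denominator is exactly such a per-row constant. The sketch below checks this numerically under some simplifying assumptions: RMSNorm without a learned gain and with `eps = 0`, no $1/\sqrt{d}$ scaling, and small random inputs so that `exp` does not overflow. With a nonzero `eps` the match is only approximate.

```python
import numpy as np


def rms_norm(x, eps=0.0):
    """RMSNorm over the last axis, without a learned gain (assumption)."""
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)


def unnormalized_attention(Q, K, V):
    """(exp(Q K^T) ⊙ M) V, with no softmax denominator."""
    n = Q.shape[0]
    M = np.tril(np.ones((n, n)))
    return (np.exp(Q @ K.T) * M) @ V


def softmax_attention(Q, K, V):
    """Causal softmax attention, written with the same masked scores."""
    n = Q.shape[0]
    M = np.tril(np.ones((n, n)))
    scores = np.exp(Q @ K.T) * M
    return (scores / scores.sum(axis=-1, keepdims=True)) @ V


rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 4))
K = rng.normal(size=(6, 4))
V = rng.normal(size=(6, 4))
match = np.allclose(rms_norm(unnormalized_attention(Q, K, V)),
                    rms_norm(softmax_attention(Q, K, V)))
print(match)  # True
```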