Towards Long-Context Memory
Thank you to Dhruv Pai and Ben Keigwin for discussions that have contributed to this post.
Introduction
Modern-day Large Language Models (LLMs) are variants of the transformer architecture, which alternates sequence mixers (attention blocks) and feature mixers (MLPs). Attention moves information across tokens: a current token's query is compared against earlier keys, those similarities weight the corresponding values, and the weighted sum is added to the residual stream. MLPs, by contrast, only transform each token within its hidden dimension. Because attention computes pairwise similarities across the context, it is the primary computational bottleneck as sequence length grows. Long-context computation is essential for problem-solving tasks and code generation, where chain-of-thought models thrive.
This is one reason why LLMs have a maximum context length, typically ~250k tokens for frontier models. The other reason is that many models use Rotary Positional Embeddings (RoPE) to encode the position of a token by applying a rotation in the complex plane to the query and key vectors. However, because such rotations are inherently sinusoidal (i.e. they repeat with a period), past a certain context length the phase transformations begin to wrap. Approaches like YaRN gradually extend RoPE to longer context lengths by rescaling position indices, but they do not change the quadratic time/space cost of vanilla attention.
In autoregressive generation, KV caching — storing the newly generated keys and values at each token, and only comparing the new query with previously cached keys — decreases the per-step cost from $O(t^2)$ to $O(t)$. However, we still run into issues: for large $t$, there isn't enough VRAM to cache an ever-growing number of high-dimensional key and value vectors. Moreover, generating $n$ tokens still takes $O(n^2)$ total time, so KV caching alone isn't enough.
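To make the bookkeeping concrete, here is a purely illustrative decode step with a KV cache (toy Python lists, not a real model; `attend_with_cache` is a hypothetical name): each step appends one key-value pair and scores the new query against every cached key, so step $t$ costs $O(t)$ rather than recomputing full attention.

```python
# Minimal sketch of one KV-cached decode step (illustrative, not a real model).
import math

def attend_with_cache(q, k_new, v_new, cache):
    """Append (k_new, v_new) to the cache, then softmax-attend q over all cached keys."""
    cache.append((k_new, v_new))
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k, _ in cache]  # O(t) dot products
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]  # numerically stable softmax
    z = sum(weights)
    d = len(v_new)
    return [sum(w * v[j] for w, (_, v) in zip(weights, cache)) / z for j in range(d)]

cache = []
out = attend_with_cache([1.0, 0.0], [1.0, 0.0], [2.0, 3.0], cache)
# With a single cached pair, the output is exactly that value vector: [2.0, 3.0]
```

The cache grows by one entry per step, which is exactly the VRAM pressure described above.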
Linear Attention
Instead of caching every individual key and value, linear attention stores a fixed-size matrix $S_t$ that compresses all of the important past context. Concretely, if our keys are $k_1, \dots, k_t$ and our values are $v_1, \dots, v_t$, we can multiply $S_t$ by our computed query $q_t$ to derive the output $o_t = S_t q_t$. This yields a simple calculation that does not become more computationally intensive as $t$ increases. But how is $S_t$ updated?
Vanilla attention uses
$$o_t = \frac{\sum_{i=1}^{t} \exp(q_t^\top k_i)\, v_i}{\sum_{i=1}^{t} \exp(q_t^\top k_i)}$$
to generate the output for the next token. To turn this into the form $o_t = S_t\, \phi(q_t)$ for some updating matrix $S_t$, we want a feature map $\phi$ such that $\exp(q^\top k) \approx \phi(q)^\top \phi(k)$. This factorization rewrites the attention calculation as
$$o_t = \frac{\left( \sum_{i=1}^{t} v_i\, \phi(k_i)^\top \right) \phi(q_t)}{\left( \sum_{i=1}^{t} \phi(k_i) \right)^\top \phi(q_t)},$$
which means that a prefix sum now enables $O(1)$ updates to the state matrix $S_t = \sum_{i=1}^{t} v_i\, \phi(k_i)^\top$. A simple example of such a $\phi$ is the identity map.
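As a sanity check on this recurrence, here is a toy pure-Python sketch with the identity feature map: the state accumulates rank-one updates $v_i k_i^\top$, and reading out is a single matrix-vector product. Function names are illustrative.

```python
# Sketch of the linear-attention recurrence with the identity feature map phi(x) = x.
# The state S accumulates sum_i v_i k_i^T; the (unnormalized) output is S q.
def update_state(S, k, v):
    """Rank-one update S <- S + v k^T (S stored as a list of rows, shape d_v x d_k)."""
    return [[S[a][b] + v[a] * k[b] for b in range(len(k))] for a in range(len(v))]

def read_state(S, q):
    """Read-out o = S q."""
    return [sum(S[a][b] * q[b] for b in range(len(q))) for a in range(len(S))]

d_k, d_v = 2, 2
S = [[0.0] * d_k for _ in range(d_v)]
S = update_state(S, [1.0, 0.0], [5.0, 6.0])   # store (k1, v1)
S = update_state(S, [0.0, 1.0], [7.0, 8.0])   # store (k2, v2)
# With orthonormal keys, querying with k1 retrieves v1 exactly:
# read_state(S, [1.0, 0.0]) -> [5.0, 6.0]
```

The state stays $d_v \times d_k$ no matter how many pairs are stored, which is the whole point, and the cause of the interference discussed next.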
However, errors will accumulate with only finite memory. One significant challenge in long-context tasks is for our model to accurately retrieve specific values, since the quantity of information we want to remember increases linearly, but our state matrix remains fixed in size. For simplicity, assume $\phi$ is the identity and ignore the (normalizing) denominator. Then,
$$S_t = \sum_{i=1}^{t} v_i k_i^\top.$$
Suppose we want to retrieve a specific value $v_j$; if the keys are normalized, we can multiply $S_t$ by $k_j$ to obtain $v_j$. However, in practice, multiplying by the matrix also accumulates a residual error given by the second term in
$$S_t k_j = v_j + \sum_{i \neq j} (k_i^\top k_j)\, v_i.$$
Note that any two randomly selected normalized vectors in $d$-dimensional space have an expected dot-product magnitude on the order of $1/\sqrt{d}$, so the retrieval error grows on the order of $\sqrt{t/d}$, which becomes prohibitive for large $t$. This matches empirical findings: gated-convolution linear attention architectures underperform transformers on associative recall tasks (Arora et al., 2023); for example, predicting the next token in "Hakuna Matata means no worries Hakuna Matata it means no _." Linear attention will struggle with this kind of precision across many thousands of tokens.
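The interference argument can be simulated directly. The hypothetical `retrieval_error` below stores $t$ random unit key-value pairs in $S = \sum_i v_i k_i^\top$ and measures $\|S k_0 - v_0\|$; the error grows as more pairs are stored.

```python
# Simulating retrieval interference: store t random unit key-value pairs in
# S = sum_i v_i k_i^T, then measure how far the read-out S k_0 drifts from v_0.
import math, random

random.seed(0)

def unit(d):
    """Random unit vector in d dimensions."""
    x = [random.gauss(0, 1) for _ in range(d)]
    n = math.sqrt(sum(xi * xi for xi in x))
    return [xi / n for xi in x]

def retrieval_error(t, d):
    keys = [unit(d) for _ in range(t)]
    vals = [unit(d) for _ in range(t)]
    dots = [sum(keys[i][b] * keys[0][b] for b in range(d)) for i in range(t)]
    out = [sum(vals[i][a] * dots[i] for i in range(t)) for a in range(d)]  # S k_0
    return math.sqrt(sum((out[a] - vals[0][a]) ** 2 for a in range(d)))

# Error is ~0 with a single stored pair and grows roughly like sqrt(t/d):
# e.g. retrieval_error(1, 64) is ~0, while retrieval_error(500, 64) is not.
```

With $d$ fixed and $t$ growing, the residual term dominates, matching the $\sqrt{t/d}$ scaling above.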
Deriving a Regression
The retrieval objective suggested by the calculation above is: learn a matrix $S$ such that $S k_i \approx v_i$ for past pairs. Thus, to minimize retrieval loss, it seems reasonable to minimize $\sum_i \| S k_i - v_i \|_2^2$, which turns out to be what the Delta update rule minimizes (Yang, 2024). However, it would be nice to extend this more generally to ensure continued recall across all keys and values; the most general form of this would be
$$\min_S \sum_{i=1}^{t} \gamma_i \, \| S k_i - v_i \|_2^2,$$
where $\gamma_i$ are input-dependent parameters.
Why do we care so much about key-value retrieval in the first place? We'll revisit this assumption later, but there is substantial evidence that transformers naturally solve a similar regression during in-context learning. (von Oswald et al., 2023) finds that single attention layers implement one step of gradient descent on this loss, and that deeper Transformers approach multiple gradient steps on the same loss. Moreover, replacing the linear attention layers with a Mesa-layer that exactly solves the inner optimization improves the trained Transformer's in-context learning performance. But instead of multiple sub-optimal gradient steps towards the minimum, models would ideally solve the regression exactly at each step. MesaNet leverages this insight to achieve "optimal test-time regression" by computing the optimal solution to
$$\min_S \sum_{i=1}^{t} \gamma_i \, \| S k_i - v_i \|_2^2 + \lambda \| S \|_F^2$$
at each token. This weighted ridge regression has the closed-form solution
$$S_t = \left( \sum_{i=1}^{t} \gamma_i \, v_i k_i^\top \right) \left( \sum_{i=1}^{t} \gamma_i \, k_i k_i^\top + \lambda I \right)^{-1},$$
which enables per-token updates whose cost is independent of $t$, although despite MesaNet's parallelization, the update runs slightly slower than simpler linear attention rules.
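To illustrate the closed form (a pure-Python sketch, not MesaNet's actual kernel; `solve` and `ridge_readout` are hypothetical names), here is a weighted ridge-regression read-out at one token:

```python
# Sketch of the weighted ridge-regression read-out: given past (k_i, v_i) with weights
# gamma_i, form o = (sum gamma_i v_i k_i^T)(sum gamma_i k_i k_i^T + lam I)^{-1} q.
def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting for A x = b (small dense systems)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c and M[c][c] != 0:
                f = M[r][c] / M[c][c]
                M[r] = [M[r][j] - f * M[c][j] for j in range(n + 1)]
    return [M[i][n] / M[i][i] for i in range(n)]

def ridge_readout(keys, vals, gammas, q, lam=1e-3):
    d = len(q)
    # G = sum_i gamma_i k_i k_i^T + lam I
    G = [[lam * (a == b) + sum(g * k[a] * k[b] for g, k in zip(gammas, keys))
          for b in range(d)] for a in range(d)]
    x = solve(G, q)  # x = G^{-1} q
    # o = sum_i gamma_i v_i (k_i^T x)
    return [sum(g * v[a] * sum(k[b] * x[b] for b in range(d))
                for g, k, v in zip(gammas, keys, vals)) for a in range(len(vals[0]))]

keys = [[1.0, 0.0], [0.0, 1.0]]
vals = [[3.0, 4.0], [5.0, 6.0]]
o = ridge_readout(keys, vals, [1.0, 1.0], [1.0, 0.0], lam=0.0)
# With orthonormal keys and lam = 0, querying k_1 recovers v_1 exactly: [3.0, 4.0]
```

In a real implementation the two sums are maintained incrementally per token; only the solve is redone, which is why the per-token cost does not grow with $t$.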
So far, the memory has been stored in a matrix. But an MLP's non-linearity could potentially offer far more mixing and memorization ability, albeit with efficiency costs. ATLAS (Behrouz, 2025) uses the Muon optimizer to optimize the same regression with a deep memory $M$, albeit over a sliding window of the last $c$ tokens instead of globally:
$$\min_{M} \sum_{i=t-c+1}^{t} \| M(k_i) - v_i \|_2^2.$$
Importantly, the regression ATLAS minimizes is devoid of the regularizing trace term, since Muon, by design, makes near-optimal gradient updates while ensuring numerical stability by bounding the norm of each update, playing exactly the role of MesaNet's regularization term. Without such guarantees, the state matrices might overfit, becoming highly sensitive to small changes in the key space, underscoring the need for a regularizer to prevent gradient update explosions.
Regression Alternatives
Now, let's challenge the importance of the aforementioned key-value retrieval regression. Even though the exponentials in softmax attention do amplify differences between query-key similarities, attention layers also serve as sequence mixers which transfer information between tokens. The temperature in attention determines how much mixing vs. precise retrieval the model performs. Since the temperature that minimizes the aforementioned regression is the limit in which softmax collapses to a hard argmax over keys, and trained models never operate in that limit, we suspect that solely focusing on retrieval is not correct.
Unfortunately, there isn't a theoretically grounded answer to what the correct combination of mixing vs. retrieval is, and methods today are highly empirical. This section documents strategies that have improved long-context memory.
Memory Compression
We return to our first example of vanilla linear attention. One downside, which causes the previously observed retrieval error, is that keys and values are never deleted, causing interference. The secondary issue is that language models have maximum context limits, and we will necessarily need to reset our KV cache after that point, making further contextually-aware inference impossible. For linear attention, an intuitive way to perform compression is through a deep, persistent neural network that generates an output which is incorporated into attention, as in Titans (Behrouz, 2024), which stores contextual information well past the current context window. However, in vanilla attention, the KV cache is continually appended to, making the memorization of keys and values in the far past tricky. An Evolved Universal Transformer Memory employs a neural network to compress the KV cache to a fixed size: at each step, it takes in the generated attention matrices and determines whether to append to the cache, replace a previous key-value pair, or merge with existing pairs.
Another intuitive solution is to gradually diminish values in the more distant past. Hierarchical Memory Architectures (HMA) achieve this by organizing memory at multiple granularities: short-term, real memory stores a queue of key-value pairs, whereas synthetic memory comprises encoded "memory tokens" structured into RMA, mid-term, and long-term memory. At periodic intervals, a window of recent RMA memory tokens is consolidated into mid-term memory, and mid-term memory tokens are consolidated into long-term memory, creating coarser summaries of the distant past. MEGALODON (Ma, Wang, 2024) entirely removes the context window by weighting each value with an exponential decay term $\lambda^{t-i}$, where $0 < \lambda < 1$.
Thus, the weighted values become arbitrarily small for large enough $t - i$ and can be safely clipped, providing a more robust way to discard unneeded keys and values.
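A toy sketch of why clipping is safe under exponential decay (illustrative helper names, scalar values for simplicity): contributions shrink geometrically with age, so the cutoff age beyond which $\lambda^a < \epsilon$ has a closed form.

```python
# Toy sketch of exponential-decay weighting: a contribution of age a is scaled by
# lambda^a, so anything older than a computable horizon is negligible and clippable.
import math

def decayed_contributions(vals, lam):
    """Scale each (scalar) value by lam^(age), where the last value has age 0."""
    t = len(vals)
    return [lam ** (t - 1 - i) * v for i, v in enumerate(vals)]

def clip_horizon(lam, eps):
    """Smallest age a with lam^a < eps: contributions older than this can be dropped."""
    return math.ceil(math.log(eps) / math.log(lam))

# With lam = 0.9 and eps = 1e-6, anything older than ~132 steps is negligible:
# clip_horizon(0.9, 1e-6) -> 132
```

This is what makes the memory footprint bounded: the effective window length depends only on $\lambda$ and the tolerated error, not on the total sequence length.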
Persistent Memory
Across long horizons, the line between training and inference itself becomes blurred due to in-context learning, and important keys and values could be integrated into the slow weights, enabling a form of fine-tuning on the specific context provided. Under this lens, vanilla softmax attention begins without any stored keys and values and appends every key-value pair, memory compression starts from scratch and persists vectors as needed, and persistent memory initializes with slow weights that the model appends to during inference time. For example, Titans (Behrouz, 2024) leverages a set of learnable, input-independent parameters appended to the beginning of the sequence, where they are concatenated with contextual memory compressed from previous context windows and the current sequence's attention.
Many existing associative memory architectures lend themselves readily to such integration. Memory Layers at Scale (Berges, Oğuz, 2024) leverages extreme sparsity by learning millions of key-value pairs through product quantization to replace MLP layers, selecting around a hundred such pairs to incorporate into the residual stream. It is designed similarly to a Mixture of Experts layer, except with far greater sparsity, at the cost of reduced expressivity from retrieving vectors instead of applying MLPs. Memory Mosaics (Zhang, 2025) also use layers of memory modules comprising key-value pairs, but these vectors are autoregressively generated instead of learned during pre-training. Fortunately, associative memory modules are flexible enough to support persistent memory that adapts during inference.
We illustratively augment the Memory Layers at Scale architecture to support test-time modifications whilst improving its time complexity from $O(n)$ to $O(kb \log_b n)$, where $n$ is the number of stored pairs and $k$ is the number of retrieved pairs. Iteratively build a tree of depth $\log_b n$, using our $n$ vectors as leaf nodes, such that each group of $b$ nodes has a parent node with a learnable vector as its label, until we have a single root node.
Perform retrieval for a query vector $q$ in $O(kb \log_b n)$ time with the following algorithm:
- Start at the root node, and iteratively go down by a layer.
- At each layer, save the $k$ nodes with labels of highest cosine similarity to the query vector $q$, and recurse down those nodes in the next layer to evaluate the next $kb$ vectors, yielding the desired time complexity as we process at most $kb$ vectors per layer. At the last layer, take those keys and values as our retrieved top-$k$ values.
Note that inserts will be similarly easy to implement, by taking a greedy approach: go down the node with the highest cosine similarity to the key at each step. Load-balancing challenges can be addressed by periodically updating each node's label with an exponential moving average toward the mean of its child vectors (or keys, if the child nodes are leaf nodes), and by incorporating DeepSeek's Auxiliary-Loss-Free Load Balancing (Wang, 2024). Such integrative approaches serve as strict generalizations of attention, which is essential for saving key information across millions of tokens.
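The retrieval procedure above can be sketched in a few lines of Python. This is a hypothetical illustration of the beam-style tree search, not the Memory Layers at Scale implementation; `Node`, `cos`, and `retrieve` are assumed names.

```python
# Beam-search retrieval over a b-ary label tree: keep the k best-matching nodes per
# level and descend into their children, touching at most k*b candidates per layer.
import math

class Node:
    def __init__(self, label, children=None, value=None):
        self.label = label              # vector used for scoring
        self.children = children or []
        self.value = value              # payload stored at leaves

def cos(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def retrieve(root, q, k):
    frontier = [root]
    while any(n.children for n in frontier):
        candidates = [c for n in frontier for c in n.children]  # at most k*b nodes
        candidates.sort(key=lambda n: cos(n.label, q), reverse=True)
        frontier = candidates[:k]
    return [n.value for n in frontier]

leaves = [Node([1.0, 0.0], value="v_a"), Node([0.9, 0.1], value="v_b"),
          Node([0.0, 1.0], value="v_c"), Node([0.1, 0.9], value="v_d")]
root = Node([0.0, 0.0], children=[Node([1.0, 0.0], children=leaves[:2]),
                                  Node([0.0, 1.0], children=leaves[2:])])
# retrieve(root, [1.0, 0.0], k=1) -> ["v_a"]
```

A greedy insert would reuse the same descent with `k = 1`, attaching the new pair under the best-matching leaf's parent.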
Polynomial Feature Maps
- LMUFormer
- Fourier transform
Hybrid Architectures
Finally, we discuss hybrid architectures, where the model chooses how much of each of multiple module outputs to use in a final weighted summation. For two modules with outputs $y_1, y_2$, a gated combination $y = g \odot y_1 + (1 - g) \odot y_2$, with $g = \sigma(W x)$ computed from the input, is differentiable and expressive. Routing works similarly, computing probabilities with a Gumbel-softmax to select one output. The Gumbel-softmax ensures that selection over both modules remains differentiable.
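A minimal sketch of the two combination rules, with illustrative names (`gate`, `gumbel_softmax`); in a real model the gate logit would come from a learned projection of the input.

```python
# Gating: a soft, differentiable blend of two module outputs.
# Routing: sample (near-)one-hot weights via the Gumbel-softmax trick.
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate(y1, y2, g_logit):
    """Elementwise g*y1 + (1-g)*y2, with g = sigmoid(g_logit)."""
    g = sigmoid(g_logit)
    return [g * a + (1.0 - g) * b for a, b in zip(y1, y2)]

def gumbel_softmax(logits, tau=1.0):
    """Routing weights that approach a hard one-hot selection as tau -> 0."""
    gumbels = [-math.log(-math.log(random.random())) for _ in logits]
    z = [(l + g) / tau for l, g in zip(logits, gumbels)]
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

y = gate([1.0, 2.0], [3.0, 4.0], g_logit=0.0)  # g = 0.5 -> elementwise average
# y == [2.0, 3.0]
```

Gating always runs both modules (more compute, smoother gradients); routing runs one (cheaper, but noisier gradient signal), which is the trade-off discussed below.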
Note that gating provides greater expressivity: the 0.5B Falcon H-1 concatenates the outputs of a traditional attention block and the state space model Mamba to rival 7B models on benchmarks. However, this gating comes at the cost of increased compute, since both modules run on every token. Routing saves compute, but there is no free lunch either: we still need to store both modules in memory, and we can suffer from weak gradient signal (both issues that plague Mixture of Experts routers).
Gating and routing also commonly show up in sparse attention to decide which tokens the model attends to. Reformer: The Efficient Transformer uses locality-sensitive hashing (LSH), which hashes each query and key into a bucket and only compares tokens in the same bucket for attention.
Another example is DeepSeek's hardware-aligned NSA (Native Sparse Attention), which groups keys and values into blocks (say, of size $B$) and assigns each block a differentiable centroid, e.g. the mean of its constituents. Then, it scores the dot product between the query and each block's representative key and selects the blocks with the highest similarities. Additionally, a local sliding window of blocks is selected. Finally, NSA optionally gates the previous attention output with the attention output on the compressed stream (using the representative keys and values for each block).
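The block-selection step can be sketched as follows; this is an assumption-laden toy (mean-pooled centroids, dot-product scores, a trailing local window), not DeepSeek's kernel.

```python
# Toy block-sparse selection: score the query against per-block mean keys, keep the
# top-scoring blocks plus a local sliding window, and attend only over those blocks.
def block_means(keys, block_size):
    blocks = [keys[i:i + block_size] for i in range(0, len(keys), block_size)]
    d = len(keys[0])
    return [[sum(k[a] for k in blk) / len(blk) for a in range(d)] for blk in blocks]

def select_blocks(q, keys, block_size, top_n, local_n):
    means = block_means(keys, block_size)
    scores = [sum(qa * ma for qa, ma in zip(q, m)) for m in means]
    top = sorted(range(len(means)), key=lambda i: scores[i], reverse=True)[:top_n]
    local = list(range(max(0, len(means) - local_n), len(means)))  # sliding window
    return sorted(set(top) | set(local))

keys = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4 + [[0.5, 0.5]] * 4
# A query aligned with the first block selects block 0 plus the local block 2:
# select_blocks([1.0, 0.0], keys, block_size=4, top_n=1, local_n=1) -> [0, 2]
```

Attention is then computed only over the keys and values inside the selected blocks, which is where the sparsity savings come from.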
It's worth mentioning hybrid architectures that combine an autoregressive model with a masked diffusion model (MDM). MDMs are highly parallelizable, providing large upsides over autoregressive transformers, but they can only generate a fixed number of tokens. Moreover, since their attention is bidirectional, MDMs cannot keep a KV cache, and keys and values are recomputed at each denoising step. One hybrid solution is Block Diffusion, which is autoregressive across blocks of tokens, while tokens within each block are generated using discrete diffusion. Esoteric Language Models (Sahoo, Yang et al., 2025) go further and introduce KV caching to MDMs.
Future Directions
Back to the Regression
Inspired by approaches mentioned in the Persistent Memory section (a la Titans), concatenating the regression with auxiliary keys and values yields
$$\min_S \sum_{i=1}^{t} \gamma_i \, \| S k_i - v_i \|_2^2 + \sum_{j=1}^{m} \tilde{\gamma}_j \, \| S \tilde{k}_j - \tilde{v}_j \|_2^2,$$
where the additional keys and values are $\tilde{k}_j$ and $\tilde{v}_j$. This facilitates context-length extensions, since information from previous context windows can be compressed into the auxiliary pairs before $S$ is reset, and enables the retrieval of specific information learned during training (a la Memory Layers at Scale). Solely optimizing the regression, which corresponds to retrieval, isn't optimal; but just as varying the temperature balances mixing and retrieval, the additional keys and values can enable both. Moreover, this framing amounts to a "semi-parametric" memory across long contexts: new, important keys and values can be permanently inserted into the regression objective as valuable, persistent data points, surviving even after the current state matrix, keys, and values are reset.
Another area for improvement is the selection of norm. Traditionally, the L2 norm is preferred, as $\| x \|_2^2 = x^\top x$ and the dot product can exploit fused multiply-add paths on GPUs that the L1 norm cannot. However, the efficiency difference between the L2 and L1 norms is not major, and alternative losses have been proposed, e.g. the Huber loss (Behrouz, 2025), which is given by
$$\ell_\delta(r) = \begin{cases} \frac{1}{2} r^2 & |r| \le \delta, \\ \delta \left( |r| - \frac{\delta}{2} \right) & \text{otherwise}. \end{cases}$$
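For concreteness, here is a pure-Python Huber loss on a scalar residual: quadratic near zero and linear in the tails, with the standard threshold parameter $\delta$.

```python
# Huber loss: behaves like 0.5*r^2 for small residuals and like a (shifted) absolute
# value for large ones, making it less sensitive to outlier key-value pairs.
def huber(r, delta=1.0):
    a = abs(r)
    return 0.5 * r * r if a <= delta else delta * (a - 0.5 * delta)

# huber(0.5) == 0.125 ; huber(3.0) == 2.5 (with delta = 1.0)
```

The linear tails bound the gradient magnitude at `delta`, which is the robustness property that motivates it as an alternative to the L2 objective.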
More hardware-aligned characterizations of such losses serve as a key area for improvement in linear attention design.
Utilizing External Memory
- RAG
- Million dummies
- SSD
- kNN-LM
- Retro
Data-Architecture Co-design
Unfortunately, high-quality long-context data is scarce. Hierarchical strategies maximize existing data by concatenating shorter chunks and using curriculum learning schedules that gradually increase difficulty over time, similar to how YaRN expands context limits. Synthetic approaches have also seen promise: He et al. (2025) split long books into variable-length chunks, generate QA pairs to fine-tune on, and gradually increase the number of concatenated chunks up to 1M tokens.
Conclusion
Efficient long-context reasoning is one of the biggest bottlenecks towards generally intelligent AI. If we trace back the last decade of machine learning, we have evidence for The Bitter Lesson, which posits that in the limit, models that effectively scale with increased computation will win. Self-attention's genius lies in its remarkable parallelization, designed with specialized GPU hardware in mind. This enabled realized performance gains from increased compute and larger model sizes that were not replicated in RNNs and LSTMs. The Chinchilla scaling laws hammer the point home: despite architectural differences, model perplexity boils down to dataset size and compute (which governs parameter count), and past a certain point, compute is much easier to reliably continue to source than high-quality data. The recent innovation in chain-of-thought models again reiterates this philosophy: it enables the continued trading of compute for performance gains by using generated reasoning tokens as a "scratchpad", in a way that universal transformers of the past couldn't. It's likely that this trend will continue, and optimized architectures for ever-longer chain-of-thought traces will be critical for building fully autonomous ML engineers or researchers.
This is also my first go at technical writing, so I'd love to discuss any thoughts or feedback; apologies for the rough patches :)