Sources: arXiv:2501.00663 · alphaXiv overview
Core contribution
Titans proposes a family of sequence models that combine ordinary short-term attention with a learned long-term memory module that updates at test time. The central claim is not just that more inference compute helps, but that some of that compute can be spent on writing memory, not only on reranking or extra forward passes.
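The memory-writing idea can be made concrete with a toy linear associative memory whose write rule is a gradient step on a "surprise" loss. This is a minimal sketch of the framing, not the paper's architecture; the function names and the exact step size are illustrative assumptions.

```python
import numpy as np

def write(M, k, v):
    """One test-time write: a gradient step on the associative loss
    ||M k - v||^2. The gradient w.r.t. M is 2 (M k - v) k^T; the step
    size 1/(2 k.k) exactly zeroes the residual for this key, so a
    surprising (high-error) pair produces a large memory update."""
    err = M @ k - v          # "surprise": how badly memory predicts v from k
    return M - np.outer(err, k) / (k @ k)

rng = np.random.default_rng(0)
d = 8
M = np.zeros((d, d))                  # temporary memory, fresh per sequence
keys = rng.normal(size=(16, d))
vals = rng.normal(size=(16, d))

for k, v in zip(keys, vals):          # "reading" the sequence writes memory
    M = write(M, k, v)

# the most recent association is recalled exactly; older ones only
# approximately, since later writes interfere with earlier ones
recalled = M @ keys[-1]
```

The point of the sketch is that the forward pass itself performs the adaptation: no output is reranked, yet the state available for later predictions has changed.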
Why this matters for Parameter Golf
This paper sharpens the notion of evaluation-time compute in a way the existing shelf did not cover well. The interesting idea is that a hard artifact cap may be partly offset by behavioral memory formation at evaluation time: instead of storing more capacity in parameters, the model can build temporary task-specific state while reading the sequence.

What to import
- Evaluation-time compute can update memory, not just search over outputs.
- A compact core plus a learned memory interface may be a cleaner compute-for-storage trade than only widening the static trunk.
- Persistent memory and temporary memory should be thought of separately.
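The persistent/temporary split in the last point can be sketched with a toy interface. The class structure and names here are hypothetical, chosen only to make the separation explicit: persistent memory is learned offline and never written at evaluation time, while temporary memory is per-sequence state that is reset between inputs.

```python
import numpy as np

class ToyMemory:
    """Hypothetical sketch of the two memory kinds, not the Titans module."""

    def __init__(self, d, rng):
        self.d = d
        # persistent: fixed after training, shared across all inputs
        self.persistent = rng.normal(size=(4, d))
        self.reset()

    def reset(self):
        """Called between sequences: temporary state starts empty."""
        self.temporary = np.zeros((self.d, self.d))

    def write(self, k, v):
        """Evaluation-time writes touch only the temporary store."""
        self.temporary += np.outer(v, k)

    def read(self, q):
        """A read mixes the static lookup with sequence-specific state."""
        static = self.persistent.T @ (self.persistent @ q)
        return static + self.temporary @ q

mem = ToyMemory(8, np.random.default_rng(1))
snapshot = mem.persistent.copy()
mem.write(np.ones(8), np.ones(8))
mem.reset()                          # new sequence: temporary state is gone
```

Keeping the two stores separate is what makes the artifact cap meaningful: only `persistent` counts against stored capacity, while `temporary` is rebuilt for free at evaluation time.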
What not to over-import
Titans is a broad long-context architecture paper, not a challenge-ready recipe for a tiny artifact-constrained LM. Its memory module is more ambitious than what a tight runtime and code budget may tolerate. The durable import is the framing: test-time adaptation can be memory formation, not only decoding strategy.
Best synthesis links
- Extends Inference Scaling Laws from “better test-time allocation” to “learned test-time memory.”
- Gives sharper motivation to refinement loops as decompression by letting the extra compute update hidden state rather than only select among outputs.
- Pairs naturally with iterative refinement and recurrent wide architecture.
Parameter Golf translation
Titans suggests asking whether bounded evaluation-time passes should:
- revise token predictions,
- update a temporary memory state,
- or do both.
For this challenge, the valuable question is not whether Titans as written fits, but whether a much smaller memory-writing mechanism could buy more than another round of static parameter storage.
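One deliberately small version of a memory-writing mechanism, purely illustrative and not from the paper: instead of spending parameters on stored co-occurrence statistics, build them as temporary state while reading the evaluation sequence.

```python
from collections import defaultdict

def read_time_bigrams(text):
    """Hypothetical compute-for-storage trade: bigram counts are built
    while reading the input, so nothing about them lives in the artifact."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1

    def predict(prev):
        # memory-based next-token guess; None if the context is unseen
        nxt = counts.get(prev)
        return max(nxt, key=nxt.get) if nxt else None

    return predict

predict = read_time_bigrams("abracadabra")
guess = predict("a")   # most frequent successor of "a" in this sequence
```

The table costs zero stored parameters and a few lines of code; whether a mechanism this cheap beats another round of static parameter storage is exactly the question the note poses.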
Related
- Inference Scaling Laws
- Evaluation-time compute and inference scaling
- Refinement loops as decompression
- Iterative refinement
- Recurrent wide architecture