Sources: arXiv:2405.21060 · alphaXiv overview
Core contribution
The paper introduces structured state space duality (SSD), a view that connects linear attention, structured state space models, and transformer-style sequence transformations. It matters less as a pitch for a single model than as a bridge paper: it argues that the design space between transformers and recurrent/state-space models is more continuous than the field typically assumes.
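A minimal numpy sketch of the core identity, in the simplest scalar-decay case: the same causal sequence transformation can be computed either as a masked, attention-like matrix multiply or as a linear-time recurrence over a small state. Variable names (`q`, `k`, `v`, `decay`) are illustrative shorthand, not the paper's notation, and this is not its efficient blocked algorithm.

```python
# Sketch of the attention/state-space duality in the scalar-decay case.
import numpy as np

rng = np.random.default_rng(0)
T, d, p = 6, 4, 3                      # sequence length, key dim, value dim
q = rng.normal(size=(T, d))            # "query"-like inputs (C in SSM terms)
k = rng.normal(size=(T, d))            # "key"-like inputs  (B in SSM terms)
v = rng.normal(size=(T, p))            # values to be mixed across time
decay = rng.uniform(0.5, 1.0, size=T)  # per-step scalar state decay (A_t)

# Quadratic, attention-like form: materialize the masked T x T mixing matrix.
M = np.zeros((T, T))
for i in range(T):
    for j in range(i + 1):
        M[i, j] = q[i] @ k[j] * np.prod(decay[j + 1:i + 1])
y_quadratic = M @ v

# Linear, recurrent (state-space) form: carry a d x p state through time.
S = np.zeros((d, p))
y_recurrent = np.zeros((T, p))
for t in range(T):
    S = decay[t] * S + np.outer(k[t], v[t])
    y_recurrent[t] = q[t] @ S

assert np.allclose(y_quadratic, y_recurrent)  # same map, two compute regimes
```

The point of the sketch is the bridge itself: one object is a structured (semiseparable) matrix you can reason about like attention, the other is a recurrence you can run in linear time.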
Why this matters for Parameter Golf
Under Parameter Golf's hard byte cap, long-context behavior and repeated computation matter more, which makes alternative sequence mixers attractive. This paper is therefore useful as a conceptual permit: exploring state-space or recurrent alternatives is not a jump into an alien design family, but a route that still inherits some transformer logic and tooling.
What to import
- State-space and attention models are closer than they look.
- Linear-time sequence models can be understood through structured transformations, not only opaque recurrence.
- Architectural ideas can move between families once the duality is explicit.
What not to over-import
The paper does not prove that SSMs are automatically the right answer under a hard artifact cap. Nor does it settle the byte accounting of the surrounding machinery, the training dynamics, or the compression robustness of these models. The transferable lesson is the bridge, not a blanket claim that Mamba-style models dominate compact transformers.
Best synthesis links
- Connects Universal Transformers and Relaxed Recursive Transformers to newer state-space work.
- Gives context for Mamba-PTQ, because once SSMs enter the candidate set, their compression behavior becomes a first-class question.
- Strengthens the case for evaluation-time compute by widening the set of architectures that can spend inference compute linearly with context length.
Parameter Golf translation
This paper suggests a broader architectural search space:
- compact transformers with smarter recurrence,
- SSM-like cores with transformer-inspired parameterization,
- hybrids where linear-time sequence modeling is used to buy runtime headroom that can be spent elsewhere, as the rough cost sketch below illustrates.
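To make the headroom argument concrete, a back-of-envelope comparison, under the loose assumption that only the dominant token-mixing FLOPs are counted (projections, constants, and hardware effects ignored; the widths are illustrative, not tuned). The ratio grows linearly with context length, which is exactly the budget a hybrid could reallocate.

```python
# Rough cost sketch: quadratic attention mixing vs. linear state-space scan.
def attention_mixing_flops(T, d):
    # QK^T scores plus the weighted sum over values: both ~ T^2 * d
    return 2 * T * T * d

def ssm_scan_flops(T, d, n_state):
    # per-step update and readout of a d x n_state state: ~ T * d * n_state
    return 2 * T * d * n_state

d, n_state = 512, 64  # illustrative model width and state size
for T in (1_000, 8_000, 64_000):
    ratio = attention_mixing_flops(T, d) / ssm_scan_flops(T, d, n_state)
    print(f"T={T:>6}: attention / scan FLOP ratio ~ {ratio:,.0f}x")
```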
Related
- Mamba-PTQ
- Universal Transformers
- Relaxed Recursive Transformers
- Recursive and shared-parameter architectures
- Evaluation-time compute and inference scaling