(Dao & Gu, 2024)

Sources: arXiv:2405.21060 · alphaXiv overview

Core contribution

The paper develops structured state space duality (SSD), a framework that connects linear attention, structured state space models, and transformer-style sequence transformations through a shared class of structured (semiseparable) matrices; the same framework yields Mamba-2 and a more hardware-efficient training algorithm. It matters less as a pitch for a single model than as a bridge paper: it argues that the design space between transformers and recurrent/state-space models is more continuous than the field usually assumes.
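
Concretely, the duality can be sketched as follows (a simplified, single-channel case with scalar decay a_t, in the spirit of the paper's SSD framing rather than its exact notation). Unrolling a gated linear recurrence exposes a lower-triangular, semiseparable matrix, which is also what masked linear attention with a decay mask computes:

    h_t = a_t\,h_{t-1} + B_t\,x_t, \qquad y_t = C_t^{\top} h_t
    y_t = \sum_{s \le t} \Big(\prod_{r=s+1}^{t} a_r\Big)\, C_t^{\top} B_s\, x_s
    y = Mx, \qquad M_{ts} = \Big(\prod_{r=s+1}^{t} a_r\Big)\, C_t^{\top} B_s \ \ (t \ge s), \qquad M_{ts} = 0 \ \ (t < s)

Computing y through the recurrence is the SSM view (linear time, constant state); materializing M is the attention view (quadratic, but parallel and chunkable). The paper's SSD algorithm exploits both forms within one model.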

Why this matters for Parameter Golf

Parameter Golf makes alternative sequence mixers attractive because long-context behavior and the cost of repeated computation loom larger once the byte budget is hard-capped. This paper is therefore useful as a conceptual permit: exploring state-space or recurrent alternatives is not a jump into an alien design family but a route that still inherits some of the transformer's logic and tooling.

What to import

  • State-space and attention models are closer than they look.
  • Linear-time sequence models can be understood as structured matrix transformations, not only as opaque recurrences (see the sketch after this list).
  • Architectural ideas can move between families once the duality is explicit.
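
A minimal numerical sketch of the second bullet (toy NumPy code with illustrative names, not the paper's implementation): the same single-channel gated linear SSM is computed once as a left-to-right recurrence and once by materializing the attention-like matrix it implicitly defines.

    import numpy as np

    # Toy check of the SSM <-> masked-attention duality for a single channel.
    # Names and shapes are illustrative; this is not the paper's code.
    rng = np.random.default_rng(0)
    T, N = 8, 4                        # sequence length, state size
    x = rng.normal(size=T)             # scalar input per step (one channel)
    a = rng.uniform(0.5, 1.0, size=T)  # per-step scalar decay ("A_t")
    B = rng.normal(size=(T, N))        # input projections B_t
    C = rng.normal(size=(T, N))        # output projections C_t

    # 1) Recurrent, linear-time view: h_t = a_t * h_{t-1} + B_t * x_t,  y_t = <C_t, h_t>
    h = np.zeros(N)
    y_rec = np.zeros(T)
    for t in range(T):
        h = a[t] * h + B[t] * x[t]
        y_rec[t] = C[t] @ h

    # 2) Structured-matrix ("attention-like") view: y = M @ x, where M is
    #    lower triangular with M[t, s] = (a_{s+1} * ... * a_t) * <C_t, B_s>.
    M = np.zeros((T, T))
    for t in range(T):
        for s in range(t + 1):
            decay = np.prod(a[s + 1 : t + 1])  # empty product = 1 when s == t
            M[t, s] = decay * (C[t] @ B[s])
    y_mat = M @ x

    assert np.allclose(y_rec, y_mat)   # both views compute the same map

The recurrent loop is the linear-time view; M is the explicit structured transformation. The two agree exactly, which is the sense in which the recurrence is not opaque.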

What not to over-import

The paper does not prove that SSMs are automatically the right answer under a hard artifact cap. It also does not settle the byte accounting of the surrounding machinery, training dynamics, or compression robustness. The transferable lesson is the bridge, not a blanket claim that Mamba-style models dominate compact transformers.

Parameter Golf translation

This paper suggests a broader architectural search space (a crude parameter-and-byte accounting sketch follows the list):

  • compact transformers with smarter recurrence,
  • SSM-like cores with transformer-inspired parameterization,
  • hybrids where linear-time sequence modeling is used to buy runtime headroom that can be spent elsewhere.
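
To make that search space concrete, here is a crude per-mixer accounting under a byte cap. The layer shapes, the fp16 assumption, and the function names are illustrative assumptions, not figures from the paper.

    # Crude per-mixer parameter accounting under a byte cap.
    # Layer shapes and the fp16 assumption are illustrative, not from the paper.

    def attention_mixer_params(d_model: int) -> int:
        """Q, K, V and output projections: 4 * d_model^2 (biases ignored)."""
        return 4 * d_model * d_model

    def ssd_style_mixer_params(d_model: int, d_state: int, expand: int = 2) -> int:
        """In/out projections on an expanded width, plus per-channel state
        parameters (B, C, decay). A rough simplification of a Mamba-2-like block."""
        d_inner = expand * d_model
        return 2 * d_model * d_inner + d_inner * (2 * d_state + 1)

    if __name__ == "__main__":
        d_model, d_state, bytes_per_param = 512, 64, 2  # fp16 weights assumed
        for name, params in [
            ("attention mixer", attention_mixer_params(d_model)),
            ("SSD-style mixer", ssd_style_mixer_params(d_model, d_state)),
        ]:
            mib = params * bytes_per_param / 2**20
            print(f"{name}: {params:,} params ~ {mib:.2f} MiB at fp16")

Under this crude accounting the linear-time mixer is not automatically smaller; what it buys is constant-state, linear-time inference, which is a different budget than the artifact cap. That is exactly the caution raised above.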
Dao, T., & Gu, A. (2024). Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. arXiv Preprint arXiv:2405.21060. https://arxiv.org/abs/2405.21060