Sources: arXiv:2405.21060 · alphaXiv overview
Core contribution
The paper introduces structured state space duality (SSD), a view that connects linear attention, structured state space models, and transformer-style sequence transformations. It matters less as a pitch for a single model than as a bridge paper: it argues that the design space between transformers and recurrent/state-space models is more continuous than the field typically assumes.
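A minimal numpy sketch of the core identity, in the simplest scalar-decay case: the same causal sequence transformation can be computed either as a masked, attention-like matrix multiply or as a linear-time recurrence over a small state. Variable names (`q`, `k`, `v`, `decay`) are illustrative shorthand, not the paper's notation, and this is not its efficient blocked algorithm.

```python
# Sketch of the attention/state-space duality in the scalar-decay case.
import numpy as np

rng = np.random.default_rng(0)
T, d, p = 6, 4, 3                      # sequence length, key dim, value dim
q = rng.normal(size=(T, d))            # "query"-like inputs (C in SSM terms)
k = rng.normal(size=(T, d))            # "key"-like inputs  (B in SSM terms)
v = rng.normal(size=(T, p))            # values to be mixed across time
decay = rng.uniform(0.5, 1.0, size=T)  # per-step scalar state decay (A_t)

# Quadratic, attention-like form: materialize the masked T x T mixing matrix.
M = np.zeros((T, T))
for i in range(T):
    for j in range(i + 1):
        M[i, j] = q[i] @ k[j] * np.prod(decay[j + 1:i + 1])
y_quadratic = M @ v

# Linear, recurrent (state-space) form: carry a d x p state through time.
S = np.zeros((d, p))
y_recurrent = np.zeros((T, p))
for t in range(T):
    S = decay[t] * S + np.outer(k[t], v[t])
    y_recurrent[t] = q[t] @ S

assert np.allclose(y_quadratic, y_recurrent)  # same map, two compute regimes
```

The point of the sketch is the bridge itself: one object is a structured (semiseparable) matrix you can reason about like attention, the other is a recurrence you can run in linear time.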
Why this matters for Parameter Golf
Under Parameter Golf's hard byte cap, long-context behavior and repeated computation matter more, which makes alternative sequence mixers attractive. This paper is therefore useful as a conceptual permit: exploring state-space or recurrent alternatives is not a jump into an alien design family, but a route that still inherits some transformer logic and tooling.
What to import
- State-space and attention models are closer than they look.
- Linear-time sequence models can be understood through structured transformations, not only opaque recurrence.
- Architectural ideas can move between families once the duality is explicit.
What not to over-import
The paper does not prove that SSMs are automatically the right answer under a hard artifact cap. Nor does it settle the byte accounting of the surrounding machinery, the training dynamics, or the compression robustness of these models. The transferable lesson is the bridge, not a blanket claim that Mamba-style models dominate compact transformers.
Best synthesis links
- Connects Universal Transformers and Relaxed Recursive Transformers to newer state-space work.
- Gives context for Mamba-PTQ, because once SSMs enter the candidate set, their compression behavior becomes a first-class question.
- Strengthens the case for evaluation-time compute by widening the set of architectures that can spend inference compute linearly with context length.
Parameter Golf translation
This paper suggests a broader architectural search space:
- compact transformers with smarter recurrence,
- SSM-like cores with transformer-inspired parameterization,
- hybrids where linear-time sequence modeling is used to buy runtime headroom that can be spent elsewhere, as the rough cost sketch below illustrates.
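To make the headroom argument concrete, a back-of-envelope comparison, under the loose assumption that only the dominant token-mixing FLOPs are counted (projections, constants, and hardware effects ignored; the widths are illustrative, not tuned). The ratio grows linearly with context length, which is exactly the budget a hybrid could reallocate.

```python
# Rough cost sketch: quadratic attention mixing vs. linear state-space scan.
def attention_mixing_flops(T, d):
    # QK^T scores plus the weighted sum over values: both ~ T^2 * d
    return 2 * T * T * d

def ssm_scan_flops(T, d, n_state):
    # per-step update and readout of a d x n_state state: ~ T * d * n_state
    return 2 * T * d * n_state

d, n_state = 512, 64  # illustrative model width and state size
for T in (1_000, 8_000, 64_000):
    ratio = attention_mixing_flops(T, d) / ssm_scan_flops(T, d, n_state)
    print(f"T={T:>6}: attention / scan FLOP ratio ~ {ratio:,.0f}x")
```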
Related
- Mamba-PTQ
- Universal Transformers
- Relaxed Recursive Transformers
- Recursive and shared-parameter architectures
- Evaluation-time compute and inference scaling