(Dehghani et al., 2019)

Sources: arXiv:1807.03819 · alphaXiv overview

Core contribution

Universal Transformers reintroduce recurrence along the depth axis of the transformer. Instead of treating each layer as a wholly distinct learned transformation, the model repeatedly applies a single shared self-attentive transition function across steps, optionally with dynamic per-position halting via an Adaptive Computation Time (ACT) mechanism. The paper’s motivation is both empirical and conceptual: recurrence can improve generalization on algorithmic and compositional tasks while preserving much of the transformer’s parallelism across sequence positions.
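A minimal NumPy sketch of the recurrent-depth idea: one shared transition applied step after step, with a simplified ACT-style per-position halting rule. This is illustrative only, not the paper's exact formulation; the dense `tanh` transition stands in for the real self-attention block, and all dimensions and names are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, seq_len, max_steps = 8, 5, 6
threshold = 0.99  # a position halts once its cumulative probability crosses this

# One shared transition: the SAME weights are reused at every depth step
# (a stand-in for the paper's self-attention + transition function).
W = rng.normal(0.0, 0.1, (d_model, d_model))
w_halt = rng.normal(0.0, 0.1, d_model)  # per-position halting scorer

x = rng.normal(size=(seq_len, d_model))
state = x.copy()
weighted = np.zeros_like(state)   # ACT-style weighted average of step states
cum_p = np.zeros(seq_len)         # cumulative halting probability per position
alive = np.ones(seq_len, dtype=bool)

for _ in range(max_steps):
    state = np.tanh(state @ W + x)                 # shared transition, applied again
    p = 1.0 / (1.0 + np.exp(-(state @ w_halt)))    # halting probability per position
    # remainder trick: a position crossing the threshold spends what is left
    p = np.where(cum_p + p > threshold, 1.0 - cum_p, p)
    weighted += (alive * p)[:, None] * state
    cum_p += alive * p
    alive &= cum_p < threshold
    if not alive.any():                            # every position has halted
        break
```

The key point is that `W` is the only transition ever stored, no matter how many steps the loop runs; each position decides independently how many steps it takes.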

Why this matters for Parameter Golf

This is one of the foundational papers behind the idea that stored parameters and executed computation are exchangeable resources, which is exactly the challenge geometry of Parameter Golf. If a shared block can be applied many times, effective depth becomes compute the model performs rather than bytes the artifact must persist.

What to import

  • Depth can be recurrent rather than fully stored.
  • Dynamic compute allocation is compatible with transformer-style modeling.
  • Repeated computation can improve generalization and capacity without multiplying unique weights.
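The storage arithmetic behind these bullets can be made concrete. A rough Python sketch with illustrative dimensions; the per-block count covers only the four attention projections and a two-layer FFN, ignoring biases, norms, and embeddings:

```python
# Rough weight count for one transformer block:
# 4 attention projections (Q, K, V, O) plus a 2-layer feed-forward network.
d_model, d_ff, depth = 512, 2048, 12  # illustrative sizes, not from the paper

per_block = 4 * d_model * d_model + 2 * d_model * d_ff
unique_layers = depth * per_block  # standard: every layer stored separately
shared_block = per_block           # universal: one block, reused `depth` times

# Depth no longer multiplies storage: the ratio is exactly `depth`.
print(unique_layers // shared_block)
```

Executed depth stays at 12 in both cases; only the stored bytes differ.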

What not to over-import

The paper predates today’s decoder-only, extreme-compression setting and is not tuned for it. Dynamic halting in particular may add more implementation complexity than it justifies in some challenge submissions. The core import is the compute-for-storage principle, not every mechanism detail.

Parameter Golf translation

Universal Transformers encourage a simple but powerful reframing:

The artifact budget limits how many unique transformations we can store, not how many transformations we can execute.

That framing supports aggressive depth sharing, lightweight depth-specific relaxation, and modest test-time adaptation strategies.
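One way to realize “lightweight depth-specific relaxation” is to reuse a single stored weight matrix at every step and give each depth only a tiny scale-and-shift pair. This FiLM-style modulation is my choice of mechanism, not something the paper prescribes, and all names and sizes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_steps = 256, 12  # illustrative sizes

W = rng.normal(0.0, 0.05, (d, d))  # the one stored transformation
# Depth-specific relaxation: a small per-step scale/shift, so each step
# behaves slightly differently while sharing almost all of its weights.
scales = np.ones((n_steps, d))
shifts = np.zeros((n_steps, d))

def forward(x):
    h = x
    for t in range(n_steps):
        h = np.tanh(h @ W) * scales[t] + shifts[t]
    return h

y = forward(rng.normal(size=d))

shared_params = W.size
relax_params = scales.size + shifts.size
# Here the per-depth relaxation costs under 10% of the stored budget,
# while the executed network still has 12 distinct-behaving steps.
print(relax_params, shared_params)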

Dehghani, M., Gouws, S., Vinyals, O., Uszkoreit, J., & Kaiser, Ł. (2019). Universal Transformers. arXiv Preprint arXiv:1807.03819. https://arxiv.org/abs/1807.03819