(Dehghani et al., 2019)

Sources: arXiv:1807.03819 · alphaXiv overview

Core contribution

Universal Transformers reintroduce recurrence along the depth axis of the transformer. Instead of treating each layer as a wholly distinct learned transformation, the model repeatedly applies a single shared self-attentive transition function across steps, optionally with dynamic per-position halting via an Adaptive Computation Time (ACT) mechanism. The paper’s motivation is both empirical and conceptual: recurrence can improve generalization on algorithmic and compositional tasks while preserving much of the transformer’s parallelism across sequence positions.
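A minimal NumPy sketch of the recurrent-depth idea: one shared transition applied step after step, with a simplified ACT-style per-position halting rule. This is illustrative only, not the paper's exact formulation; the dense `tanh` transition stands in for the real self-attention block, and all dimensions and names are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, seq_len, max_steps = 8, 5, 6
threshold = 0.99  # a position halts once its cumulative probability crosses this

# One shared transition: the SAME weights are reused at every depth step
# (a stand-in for the paper's self-attention + transition function).
W = rng.normal(0.0, 0.1, (d_model, d_model))
w_halt = rng.normal(0.0, 0.1, d_model)  # per-position halting scorer

x = rng.normal(size=(seq_len, d_model))
state = x.copy()
weighted = np.zeros_like(state)   # ACT-style weighted average of step states
cum_p = np.zeros(seq_len)         # cumulative halting probability per position
alive = np.ones(seq_len, dtype=bool)

for _ in range(max_steps):
    state = np.tanh(state @ W + x)                 # shared transition, applied again
    p = 1.0 / (1.0 + np.exp(-(state @ w_halt)))    # halting probability per position
    # remainder trick: a position crossing the threshold spends what is left
    p = np.where(cum_p + p > threshold, 1.0 - cum_p, p)
    weighted += (alive * p)[:, None] * state
    cum_p += alive * p
    alive &= cum_p < threshold
    if not alive.any():                            # every position has halted
        break
```

The key point is that `W` is the only transition ever stored, no matter how many steps the loop runs; each position decides independently how many steps it takes.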

Why this matters for Parameter Golf

This is one of the foundational papers behind the idea that stored parameters and executed computation are exchangeable resources, which is exactly the challenge geometry of Parameter Golf. If a shared block can be applied many times, effective depth becomes compute the model performs rather than bytes the artifact must persist.

What to import

  • Depth can be recurrent rather than fully stored.
  • Dynamic compute allocation is compatible with transformer-style modeling.
  • Repeated computation can improve generalization and capacity without multiplying unique weights.
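The storage arithmetic behind these bullets can be made concrete. A rough Python sketch with illustrative dimensions; the per-block count covers only the four attention projections and a two-layer FFN, ignoring biases, norms, and embeddings:

```python
# Rough weight count for one transformer block:
# 4 attention projections (Q, K, V, O) plus a 2-layer feed-forward network.
d_model, d_ff, depth = 512, 2048, 12  # illustrative sizes, not from the paper

per_block = 4 * d_model * d_model + 2 * d_model * d_ff
unique_layers = depth * per_block  # standard: every layer stored separately
shared_block = per_block           # universal: one block, reused `depth` times

# Depth no longer multiplies storage: the ratio is exactly `depth`.
print(unique_layers // shared_block)
```

Executed depth stays at 12 in both cases; only the stored bytes differ.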

What not to over-import

The paper predates today’s decoder-only, extreme-compression setting and is not tuned for it. Dynamic halting in particular may add more implementation complexity than it justifies in some challenge submissions. The core import is the compute-for-storage principle, not every mechanism detail.

Parameter Golf translation

Universal Transformers encourage a simple but powerful reframing:

The artifact budget limits how many unique transformations we can store, not how many transformations we can execute.

That framing supports aggressive depth sharing, lightweight depth-specific relaxation, and modest test-time adaptation strategies.
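One way to realize “lightweight depth-specific relaxation” is to reuse a single stored weight matrix at every step and give each depth only a tiny scale-and-shift pair. This FiLM-style modulation is my choice of mechanism, not something the paper prescribes, and all names and sizes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_steps = 256, 12  # illustrative sizes

W = rng.normal(0.0, 0.05, (d, d))  # the one stored transformation
# Depth-specific relaxation: a small per-step scale/shift, so each step
# behaves slightly differently while sharing almost all of its weights.
scales = np.ones((n_steps, d))
shifts = np.zeros((n_steps, d))

def forward(x):
    h = x
    for t in range(n_steps):
        h = np.tanh(h @ W) * scales[t] + shifts[t]
    return h

y = forward(rng.normal(size=d))

shared_params = W.size
relax_params = scales.size + shifts.size
# Here the per-depth relaxation costs under 10% of the stored budget,
# while the executed network still has 12 distinct-behaving steps.
print(relax_params, shared_params)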

Dehghani, M., Gouws, S., Vinyals, O., Uszkoreit, J., & Kaiser, Ł. (2019). Universal Transformers. arXiv Preprint arXiv:1807.03819. https://arxiv.org/abs/1807.03819