(Csordás et al., 2024)

Sources: arXiv:2405.16039 · alphaXiv overview

Core contribution

MoEUT combines the Universal Transformer idea of recurrent/shared depth with sparse mixture-of-experts capacity. Its central message is that shared-depth models do not have to choose between byte efficiency and expressivity; sparse conditional capacity can restore much of what naive recurrence would otherwise lose.
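To make the combination concrete, here is a toy numpy sketch (not the paper's implementation; all shapes, the ReLU expert form, and the routing details are illustrative assumptions): a single layer whose feed-forward is a sparse top-k mixture of experts, applied recurrently at every depth step in Universal-Transformer style.

```python
import numpy as np

def top_k_moe_ffn(x, W_gate, experts, k=2):
    """Sparse MoE feed-forward: each token is routed to its top-k experts.

    x: (tokens, d); W_gate: (d, n_experts); experts: list of (W1, W2) pairs.
    Illustrative only -- real routers add load balancing, etc.
    """
    scores = x @ W_gate                            # (tokens, n_experts)
    topk = np.argsort(scores, axis=-1)[:, -k:]     # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = topk[t]
        w = np.exp(scores[t, sel])
        w /= w.sum()                               # softmax over selected experts only
        for weight, e in zip(w, sel):
            W1, W2 = experts[e]                    # small ReLU FFN per expert
            out[t] += weight * (np.maximum(x[t] @ W1, 0.0) @ W2)
    return out

def shared_depth_forward(x, W_gate, experts, n_steps=4, k=2):
    """Universal-Transformer-style recurrence: the SAME MoE layer at every depth."""
    for _ in range(n_steps):
        x = x + top_k_moe_ffn(x, W_gate, experts, k)   # residual update
    return x
```

The point of the sketch is the byte accounting: only one layer's worth of weights is stored, yet different tokens exercise different experts at different depth steps, so the recurrent stack behaves less uniformly than a plain shared layer would.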

Why this matters for Parameter Golf

This paper is valuable because it decomposes a promising but fragile idea into separable parts:

  • reuse weights aggressively to save bytes
  • spend compute conditionally where extra capacity matters

That is almost the ideal shape of a Parameter Golf strategy. It aligns naturally with the notion that stored parameters and per-token computation should be traded, not optimized independently.

What to import

  • Recurrence in depth is stronger when capacity is conditional rather than uniform.
  • Normalization and grouping details matter a lot in shared-depth models.
  • Sparse capacity is a better recovery tool than simply making every recurrent step heavier.

What not to over-import

Mixture-of-experts machinery can add routing complexity and may be difficult to justify under a highly constrained evaluation harness. The main import is the design principle: if sharing hurts specialization, conditional extra capacity is one of the cleanest ways to buy some of it back.

Connections

  • Extends Universal Transformers from “recurrent depth is possible” to “recurrent depth can be competitive with better capacity allocation.”
  • Complements Relaxed Recursive Transformers, which recover specialization through lightweight per-layer adaptation instead of sparse experts.
  • Strengthens the case for a recurrent wide architecture by legitimizing a compute-for-storage trade with selective extra capacity.

Parameter Golf translation

MoEUT suggests that saved storage from recursive sharing does not have to be spent only on width. Some of it can be spent on better allocation of compute, whether through sparse branches, adaptive refinement, or other conditional mechanisms that let a compact backbone act less uniformly.
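A back-of-the-envelope accounting makes the trade visible. The dimensions below are made up for illustration and are not figures from the paper: compare the FFN parameters of a 12-layer dense stack against one shared layer whose capacity comes from a bank of small experts.

```python
# Illustrative parameter accounting (all dimensions are assumptions, not from the paper).
d, ffn_mult, n_layers = 512, 4, 12

# Dense baseline: 12 distinct FFNs, each with up- and down-projections.
dense = n_layers * (2 * d * d * ffn_mult)

# Shared backbone: one layer reused 12 times; its FFN is a bank of
# 8 narrow experts plus a tiny router, and only top-k experts run per token.
n_experts, expert_mult = 8, 1
shared_moe = n_experts * (2 * d * d * expert_mult) + d * n_experts  # experts + router

print(dense, shared_moe)
```

Under these toy numbers the shared MoE backbone stores roughly a sixth of the dense stack's FFN parameters, and the bytes saved can be reallocated to width, more experts, or more recurrent steps, which is exactly the allocation freedom the paragraph above describes.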

Csordás, R., Irie, K., Schmidhuber, J., Potts, C., & Manning, C. D. (2024). MoEUT: Mixture-of-Experts Universal Transformers. arXiv preprint arXiv:2405.16039. https://arxiv.org/abs/2405.16039