(Bae et al., 2024)

Sources: arXiv:2410.20672 · alphaXiv overview

Core contribution

Relaxed Recursive Transformers shows that a pretrained transformer can be compressed into a recursive (shared-depth) model and then partially repaired with small layer-specific LoRA modules. The central idea is that strict sharing is often too rigid, but “almost shared” can be a strong quality-per-byte compromise.
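A minimal numpy sketch of the shape of the idea, not the paper's actual architecture (the tanh block, sizes, and random weights are all stand-ins): one shared weight matrix is applied at every depth, relaxed by a tiny depth-specific low-rank delta.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, depth = 16, 2, 4          # hidden size, LoRA rank, recursion depth (illustrative)

# One shared weight matrix stands in for the tied transformer block.
W_shared = rng.normal(scale=0.1, size=(d, d))

# Tiny depth-specific low-rank corrections: W_t = W_shared + B_t @ A_t.
loras = [(rng.normal(scale=0.01, size=(d, r)),
          rng.normal(scale=0.01, size=(r, d)))
         for _ in range(depth)]

def forward(x):
    # Apply the same shared block at every depth, relaxed by that
    # depth's LoRA delta, with a residual connection.
    for B, A in loras:
        x = x + np.tanh(x @ (W_shared + B @ A))
    return x

x = rng.normal(size=(1, d))
y = forward(x)
```

The byte story is visible in the shapes: the shared block costs d×d once, while each depth adds only 2×d×r.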

Why this matters for Parameter Golf

This is probably the strongest direct support for recursive depth sharing. It reframes the choice: not “fully unique layers versus full collapse,” but “how much specialization is worth paying for once a strong shared backbone exists?” That is exactly the kind of trade a hard artifact cap forces.

What to import

  • One strong shared block can carry a surprising amount of depth.
  • Tiny depth-specific adjustments can recover a lot of lost specialization.
  • Good initialization matters. Shared-depth models are easier to love in theory than to optimize in practice.
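The initialization point can be made concrete. The paper initializes the low-rank relaxations from a truncated SVD of the residual between each original layer and the shared backbone; a toy sketch with random stand-in weights and illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 16, 2                            # hidden size, LoRA rank (illustrative)

W_layer = rng.normal(size=(d, d))       # stand-in: one pretrained layer's weight
W_shared = rng.normal(size=(d, d))      # stand-in: the shared backbone weight

# Best rank-r approximation of the residual the LoRA must absorb.
U, S, Vt = np.linalg.svd(W_layer - W_shared)
B = U[:, :r] * S[:r]                    # (d, r)
A = Vt[:r, :]                           # (r, d)

# The relaxed layer starts as close to the original as rank r allows
# (Eckart–Young: no rank-r delta can do better in Frobenius norm).
err_with = np.linalg.norm(W_layer - (W_shared + B @ A))
err_without = np.linalg.norm(W_layer - W_shared)
assert err_with < err_without
```

This is why initialization matters here: a randomly initialized LoRA starts the repair from scratch, while the SVD start recovers as much of the lost specialization as the rank budget permits before any training.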

What not to over-import

The paper starts from pretrained transformers, which is not identical to this repo’s search setting. It also does not prove that LoRA-style relaxation is the best use of bytes in every artifact-capped scenario. The main durable lesson is the shape of the optimum: strict tying is often too harsh, but tiny deviations can be enough.

Parameter Golf translation

This paper argues for spending artifact bytes on:

  • one strong shared block
  • small per-step scales, adapters, or low-rank corrections

rather than on many fully unique blocks that each contribute only modest marginal value.
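To make that trade concrete, a back-of-envelope parameter count with illustrative sizes (hidden width 1024, 12 layers, rank-8 LoRA; attention/MLP structure and embeddings ignored):

```python
# Illustrative sizes only: hidden width, layer count, LoRA rank.
d, depth, r = 1024, 12, 8

unique = depth * d * d                    # fully unique blocks
shared_lora = d * d + depth * 2 * d * r   # one shared block + per-depth LoRA

ratio = shared_lora / unique              # roughly 0.10 of the unique budget
```

Under these toy numbers the shared-plus-LoRA layout spends about a tenth of the unique-block budget, which is the shape of the optimum the paper argues for.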

Bae, S., Fisch, A., Harutyunyan, H., Ji, Z., Kim, S., & Schuster, T. (2024). Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA. arXiv preprint arXiv:2410.20672. https://arxiv.org/abs/2410.20672