(Bae et al., 2024)

Sources: arXiv:2410.20672 · alphaXiv overview

Core contribution

Relaxed Recursive Transformers shows that a pretrained transformer can be compressed into a recursive (shared-depth) model and then partially repaired with small layer-specific LoRA modules. The central idea is that strict sharing is often too rigid, but “almost shared” can be a strong quality-per-byte compromise.
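A minimal numpy sketch of the shape of the idea, not the paper's actual architecture (the tanh block, sizes, and random weights are all stand-ins): one shared weight matrix is applied at every depth, relaxed by a tiny depth-specific low-rank delta.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, depth = 16, 2, 4          # hidden size, LoRA rank, recursion depth (illustrative)

# One shared weight matrix stands in for the tied transformer block.
W_shared = rng.normal(scale=0.1, size=(d, d))

# Tiny depth-specific low-rank corrections: W_t = W_shared + B_t @ A_t.
loras = [(rng.normal(scale=0.01, size=(d, r)),
          rng.normal(scale=0.01, size=(r, d)))
         for _ in range(depth)]

def forward(x):
    # Apply the same shared block at every depth, relaxed by that
    # depth's LoRA delta, with a residual connection.
    for B, A in loras:
        x = x + np.tanh(x @ (W_shared + B @ A))
    return x

x = rng.normal(size=(1, d))
y = forward(x)
```

The byte story is visible in the shapes: the shared block costs d×d once, while each depth adds only 2×d×r.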

Why this matters for Parameter Golf

This is probably the strongest direct support for recursive depth sharing. It reframes the choice: not “fully unique layers versus full collapse,” but “how much specialization is worth paying for once a strong shared backbone exists?” That is exactly the kind of trade a hard artifact cap forces.

What to import

  • One strong shared block can carry a surprising amount of depth.
  • Tiny depth-specific adjustments can recover a lot of lost specialization.
  • Good initialization matters. Shared-depth models are easier to love in theory than to optimize in practice.
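The initialization point can be made concrete. The paper initializes the low-rank relaxations from a truncated SVD of the residual between each original layer and the shared backbone; a toy sketch with random stand-in weights and illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 16, 2                            # hidden size, LoRA rank (illustrative)

W_layer = rng.normal(size=(d, d))       # stand-in: one pretrained layer's weight
W_shared = rng.normal(size=(d, d))      # stand-in: the shared backbone weight

# Best rank-r approximation of the residual the LoRA must absorb.
U, S, Vt = np.linalg.svd(W_layer - W_shared)
B = U[:, :r] * S[:r]                    # (d, r)
A = Vt[:r, :]                           # (r, d)

# The relaxed layer starts as close to the original as rank r allows
# (Eckart–Young: no rank-r delta can do better in Frobenius norm).
err_with = np.linalg.norm(W_layer - (W_shared + B @ A))
err_without = np.linalg.norm(W_layer - W_shared)
assert err_with < err_without
```

This is why initialization matters here: a randomly initialized LoRA starts the repair from scratch, while the SVD start recovers as much of the lost specialization as the rank budget permits before any training.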

What not to over-import

The paper starts from pretrained transformers, which is not identical to this repo’s search setting. It also does not prove that LoRA-style relaxation is the best use of bytes in every artifact-capped scenario. The main durable lesson is the shape of the optimum: strict tying is often too harsh, but tiny deviations can be enough.

Parameter Golf translation

This paper argues for spending artifact bytes on:

  • one strong shared block
  • small per-step scales, adapters, or low-rank corrections

rather than on many fully unique blocks that each contribute only modest marginal value.
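To make that trade concrete, a back-of-envelope parameter count with illustrative sizes (hidden width 1024, 12 layers, rank-8 LoRA; attention/MLP structure and embeddings ignored):

```python
# Illustrative sizes only: hidden width, layer count, LoRA rank.
d, depth, r = 1024, 12, 8

unique = depth * d * d                    # fully unique blocks
shared_lora = d * d + depth * 2 * d * r   # one shared block + per-depth LoRA

ratio = shared_lora / unique              # roughly 0.10 of the unique budget
```

Under these toy numbers the shared-plus-LoRA layout spends about a tenth of the unique-block budget, which is the shape of the optimum the paper argues for.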

Bae, S., Fisch, A., Harutyunyan, H., Ji, Z., Kim, S., & Schuster, T. (2024). Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA. arXiv preprint arXiv:2410.20672. https://arxiv.org/abs/2410.20672