Sources: arXiv:2410.20672 · alphaXiv overview
Core contribution
The Relaxed Recursive Transformers paper shows that a pretrained transformer can be compressed into a recursive, shared-depth model and then partially repaired with small layer-specific LoRA modules. The central idea is that strict sharing is often too rigid, but "almost shared" can be a strong quality-per-byte compromise.
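A minimal sketch of the "almost shared" idea, assuming plain weight matrices in place of full attention/MLP blocks and toy dimensions: one shared matrix is reused at every recursion step, and each step gets its own rank-r LoRA correction. With the corrections zero-initialized, the model starts out exactly strictly tied.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, depth = 64, 4, 3          # hidden size, LoRA rank, recursion steps

# One shared "block": a single weight matrix reused at every depth.
W_shared = rng.standard_normal((d, d)) / np.sqrt(d)

# Per-depth low-rank corrections: rank-r factors B_t @ A_t.
A = [rng.standard_normal((r, d)) * 0.01 for _ in range(depth)]
B = [np.zeros((d, r)) for _ in range(depth)]  # zero-init: start exactly tied

def forward(x):
    # Apply the shared block `depth` times, each step relaxed by its own
    # low-rank delta: W_t = W_shared + B_t @ A_t.
    for t in range(depth):
        x = np.tanh((W_shared + B[t] @ A[t]) @ x)
    return x

x = rng.standard_normal(d)
y = forward(x)
```

Because the deltas are rank-r, each step's specialization costs only 2·d·r parameters instead of d² for a fully unique matrix.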
Why this matters for Parameter Golf
This is probably the strongest direct support for recursive width scaling. It reframes the design question: not "fully unique layers or full collapse," but "how much per-layer specialization is worth paying for once a strong shared backbone exists?" That is exactly the kind of trade a hard artifact cap forces.
What to import
- One strong shared block can carry a surprising amount of depth.
- Tiny depth-specific adjustments can recover a lot of lost specialization.
- Good initialization matters: shared-depth models are easier to love in theory than to optimize in practice.
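The initialization point can be made concrete. One approach consistent with the paper is to initialize each layer's relaxation from a truncated SVD of the residual between that layer's original weights and the shared backbone, so the relaxed model starts as the best rank-r approximation of the original. The sketch below assumes plain weight matrices and uses the layer mean as a hypothetical shared-backbone init:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 32, 4

# Pretend original per-layer weights, and a shared backbone initialized
# as their mean (an illustrative choice, not the paper's only option).
W_layers = [rng.standard_normal((d, d)) for _ in range(3)]
W_shared = sum(W_layers) / len(W_layers)

def svd_init(W_orig, W_shared, r):
    # Best rank-r approximation of the residual W_orig - W_shared,
    # split into LoRA factors B (d x r) and A (r x d).
    U, s, Vt = np.linalg.svd(W_orig - W_shared, full_matrices=False)
    B = U[:, :r] * np.sqrt(s[:r])
    A = np.sqrt(s[:r])[:, None] * Vt[:r, :]
    return B, A

adapters = [svd_init(W, W_shared, r) for W in W_layers]
```

By construction, W_shared + B @ A is never farther from the original layer than W_shared alone, so the relaxed model starts strictly closer to the pretrained network than the strictly tied one.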
What not to over-import
The paper starts from pretrained transformers, which is not identical to this repo’s search setting. It also does not prove that LoRA-style relaxation is the best use of bytes in every artifact-capped scenario. The main durable lesson is the shape of the optimum: strict tying is often too harsh, but tiny deviations can be enough.
Best synthesis links
- Grounds recursive layer sharing and recurrent wide architecture.
- Sits naturally after Universal Transformers and ALBERT, which motivate recurrence and sharing more conceptually.
- Provides a cleaner specialization-per-byte compromise than naive depth duplication.
Parameter Golf translation
This paper argues for spending artifact bytes on:
- one strong shared block
- small per-step scales, adapters, or low-rank corrections
rather than on many fully unique blocks that each contribute only modest marginal value.
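The byte arithmetic behind this trade, with illustrative numbers (one d×d weight matrix per block, fp16 storage; real blocks carry several matrices, but the ratio is similar):

```python
d, L, r = 4096, 32, 8           # hidden size, original depth, adapter rank
bytes_per_param = 2             # fp16/bf16

unique = L * d * d * bytes_per_param                # L fully unique blocks
shared = (d * d + L * 2 * d * r) * bytes_per_param  # 1 block + L rank-r adapters

print(f"unique: {unique/2**20:.0f} MiB, shared+LoRA: {shared/2**20:.0f} MiB")
# prints "unique: 1024 MiB, shared+LoRA: 36 MiB"
```

Under these assumptions the shared-plus-adapters artifact is roughly 28x smaller, which is why tiny per-step corrections are such an attractive place to spend a capped byte budget.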
Related
- Recursive width scaling
- Recurrent wide architecture
- Universal Transformers
- MoEUT
- Fine-grained Parameter Sharing
- Recursive and shared-parameter architectures
- Recursive layer sharing