Core idea
A strict artifact cap changes the architecture question from
“What model is strongest for a given parameter count?”
to
“What model is strongest for a given number of stored unique weights?”
Recursive layer sharing is one of the cleanest responses to that shift.
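The core move can be shown in a few lines. Below is a minimal numpy sketch, not a real transformer: the "block" is a toy two-matrix MLP with a residual connection, and the sizes are illustrative. The point is that twelve depth steps of computation are served by one stored set of weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # model width (toy size)

# The shared block's unique weights are stored exactly once.
W1 = rng.standard_normal((d, 4 * d)) * 0.02
W2 = rng.standard_normal((4 * d, d)) * 0.02

def shared_block(x):
    # Toy MLP block with a residual connection (stand-in for a transformer layer).
    return x + np.maximum(x @ W1, 0.0) @ W2

def recursive_forward(x, depth):
    # Many depth steps, one set of stored weights.
    for _ in range(depth):
        x = shared_block(x)
    return x

x = rng.standard_normal((1, d))
y = recursive_forward(x, depth=12)

unique_params = W1.size + W2.size  # 2 * 4 * d**2 = 32768 stored weights for 12 layers of compute
```

A vanilla 12-layer stack of the same block would store twelve times as many unique weights while executing the same number of transformations.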
What the literature suggests
- Relaxed Recursive Transformers shows that pretrained transformers can be compressed into recursive forms, with much of the lost quality recovered by lightweight per-layer LoRA adapters. (Bae et al., 2024)
- MoEUT shows that recurrence in depth can become much more competitive once capacity is reallocated intelligently rather than uniformly. (Csordás et al., 2024)
- Fine-grained Parameter Sharing argues that sharing should not be treated as all-or-nothing; more granular sharing patterns can preserve more expressivity. (Üyük et al., 2024)
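The per-layer LoRA idea from the first entry can be sketched concretely. This is a toy single-matrix "layer" in the spirit of that recipe, not the paper's implementation; the names `A`, `B`, and the rank are illustrative assumptions. One shared weight matrix is stored once, and each depth step adds only a cheap low-rank correction.

```python
import numpy as np

rng = np.random.default_rng(1)
d, depth, rank = 64, 12, 4

W = rng.standard_normal((d, d)) * 0.02           # shared weights, stored once
# Per-depth low-rank adapters in the spirit of layer-wise LoRA:
A = [rng.standard_normal((d, rank)) * 0.02 for _ in range(depth)]
B = [np.zeros((rank, d)) for _ in range(depth)]  # zero-init: adapters start as no-ops

def step(x, i):
    # Effective weight at depth i is W + A[i] @ B[i], applied without materializing it.
    return x @ W + (x @ A[i]) @ B[i]

shared_cost = W.size
adapter_cost = sum(a.size + b.size for a, b in zip(A, B))
vanilla_cost = depth * W.size                    # a fully unshared stack of the same depth
```

Even with a distinct adapter at every depth, the stored total stays well under the unshared stack, which is exactly the "sharing is not all-or-nothing" point made by the third entry.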
The compact-LLM interpretation
Recursive sharing is best understood as a storage conversion:
- unique layers become repeated computation
- stored depth becomes reusable transformation capacity
- the saved bytes can be reinvested into width, cheap specialization, or selective precision
That is why this note sits near both recursive width scaling and compute-for-storage exchange.
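The "saved bytes reinvested into width" direction admits a back-of-envelope calculation. Assuming block parameters scale roughly as d² (a deliberate simplification that ignores attention-head granularity, embeddings, and norms), collapsing a stack into one shared block lets the whole unique-weight budget buy width instead:

```python
import math

d, layers = 1024, 24
# Unique-weight budget of a vanilla stack, under the block ~ d**2 assumption.
unique_budget = layers * d * d
# A single shared block can spend the entire budget on width instead:
d_wide = math.isqrt(unique_budget)  # == floor(d * sqrt(layers))
print(d, "->", d_wide)
```

Under these assumptions the shared block's width grows by a factor of roughly √layers, which is where the pull toward recursive width scaling comes from.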
Why this matters for Parameter Golf
If one shared block can play the role of many unique blocks, the saved bytes can be spent on:
- extra width
- more careful quantization or protected precision
- lightweight per-depth adaptation
- perhaps even evaluation-time refinement rather than stored depth
That is the logic behind a recurrent, wide architecture.
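The reinvestment options above can be priced in bytes. The per-block count of 12·d², the fp16 storage, and the rank are illustrative assumptions, but they show how little of the savings a full set of per-depth adapters actually consumes:

```python
d, layers, rank = 1024, 24, 8
block = 12 * d * d                  # rough per-block parameter count (illustrative)
bytes_vanilla = layers * block * 2  # fp16: 2 bytes per stored weight
bytes_shared = block * 2            # one shared block at the same precision
saved = bytes_vanilla - bytes_shared
# Spend a slice of the savings on per-depth LoRA pairs (4 adapted matrices per step):
bytes_lora = layers * 4 * (2 * d * rank) * 2
print(saved, bytes_lora)
```

Under these numbers the adapters cost well under one percent of what sharing freed up, leaving the rest for width or protected precision.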
Typical failure modes
Recursive sharing usually breaks for familiar reasons:
- the shared block is forced to serve too many incompatible functions
- early and late depth steps need different behaviors
- training can become unstable because gradients from every depth step accumulate on the same reused weights
- wider recurrent designs can worsen activation and outlier problems if the compression path is not improved too
This is why shared depth often pairs naturally with:
- lightweight per-depth adaptation (e.g. LoRA)
- phase-conditioned sharing
- careful quantization with protected precision
Better framing
Recursive sharing is not just “smaller depth.” It is a compute-for-storage exchange. The model can still execute many transformations; it just stops storing a unique set of weights for every one of them.
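One cheap mitigation for the early-versus-late mismatch listed under the failure modes is per-depth conditioning: a gain and bias vector per step (in the style of feature-wise modulation) lets each depth behave differently without unsharing the block. A minimal numpy sketch under toy sizes; the names and the tanh block are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
d, depth = 64, 8
W = rng.standard_normal((d, d)) * 0.02  # the one shared weight matrix
# Per-depth gain/bias vectors (2*d params per step) let early and late
# steps specialize without storing a separate W per depth.
gain = np.ones((depth, d))
bias = np.zeros((depth, d))

def step(x, i):
    return x + gain[i] * np.tanh(x @ W) + bias[i]

x = rng.standard_normal((1, d))
for i in range(depth):
    x = step(x, i)

extra = gain.size + bias.size  # 2 * depth * d = 1024 extra params, vs d*d = 4096 per unshared layer
```

The conditioning parameters are a rounding error next to the weights they replace, which is why this kind of pairing preserves the compute-for-storage exchange rather than undoing it.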
Related
- Recursive and shared-parameter architectures
- Recursive width scaling
- Phase-conditioned sharing
- Compute-for-storage exchange