Core idea

A strict artifact cap changes the architecture question from

“What model is strongest for a given parameter count?”

to

“What model is strongest for a given number of stored unique weights?”

Recursive layer sharing is one of the cleanest responses to that shift.

What the literature suggests

The compact-LLM interpretation

Recursive sharing is best understood as a storage conversion:

  • unique layers become repeated computation
  • stored depth becomes reusable transformation capacity
  • the saved bytes can be reinvested into width, cheap specialization, or selective precision

That is why this note sits near both recursive width scaling and compute-for-storage exchange.
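The conversion can be made concrete with a minimal numpy sketch (toy tanh blocks stand in for transformer layers; all sizes and names here are illustrative, not from any cited paper). Both models execute the same number of depth steps, but only the unique-layer model's stored weights grow with depth:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16      # hidden width (illustrative)
depth = 6   # number of depth steps executed

# Baseline: six unique blocks -> six stored weight matrices.
unique_blocks = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(depth)]

# Recursive sharing: ONE stored block, applied six times.
shared_block = rng.standard_normal((d, d)) / np.sqrt(d)

def run_unique(x):
    for w in unique_blocks:
        x = np.tanh(x @ w)             # stand-in for a full transformer block
    return x

def run_shared(x):
    for _ in range(depth):
        x = np.tanh(x @ shared_block)  # same weights reused at every depth step
    return x

x = rng.standard_normal(d)
run_unique(x); run_shared(x)  # both perform `depth` transformations

stored_unique = depth * d * d  # stored weights scale with depth
stored_shared = d * d          # stored weights independent of depth
print(stored_unique, stored_shared)  # 1536 vs 256
```

The executed compute is identical; only the number of stored unique weights differs, which is exactly the quantity a strict artifact cap measures.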

Why this matters for Parameter Golf

If one shared block can play the role of many unique blocks, the saved bytes can be spent on:

  • extra width
  • more careful quantization or protected precision
  • lightweight per-depth adaptation
  • perhaps even evaluation-time refinement rather than stored depth

That is the logic behind recurrent wide architecture.
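The "extra width" option can be sketched with toy budget arithmetic (the per-layer cost multiplier, depth, and width below are illustrative assumptions, not measurements). If a layer stores roughly c·d² weights, moving from `depth` unique layers to one shared block lets width grow by about √depth at the same stored-parameter budget:

```python
# Toy stored-parameter accounting; c, depth, d_unique are illustrative.
c = 12            # rough multiplier for attention + MLP weights per layer
depth = 24
d_unique = 1024   # width of a baseline with unique layers

budget = c * depth * d_unique ** 2   # stored weights of the baseline

# Spend the same budget on ONE shared block: width grows by ~sqrt(depth).
d_shared = int((budget / c) ** 0.5)
print(d_unique, d_shared)  # 1024 -> 5016 (sqrt(24) ~ 4.9x wider)
```

This is only first-order accounting (it ignores embeddings, norms, and adapters), but it shows why shared depth and recurrent width scaling are two halves of the same trade.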

Typical failure modes

Recursive sharing usually breaks for familiar reasons:

  • the shared block is forced to serve too many incompatible functions
  • early and late depth steps need different behaviors
  • training becomes unstable because the same weights are repeatedly reused
  • wider recurrent blocks can amplify activation outliers, which hurts quantization unless the compression path is adapted as well

This is why shared depth often pairs naturally with lightweight per-depth adaptation, in the spirit of the layer-wise LoRA of Bae et al. (2024), which restores depth-specific behavior without storing a full unique layer per step.
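A sketch of one such pairing, loosely modeled on the layer-wise LoRA idea in Bae et al. (2024) (ranks, sizes, and the tanh block are illustrative assumptions): each depth step adds a tiny low-rank correction to the one shared block, so early and late steps can behave differently while storage stays mostly shared.

```python
import numpy as np

rng = np.random.default_rng(1)
d, depth, rank = 64, 6, 4   # illustrative sizes; rank << d

shared = rng.standard_normal((d, d)) / np.sqrt(d)

# Per-depth low-rank adapters: W_t = shared + A_t @ B_t (A_t: d x r, B_t: r x d).
adapters = [(rng.standard_normal((d, rank)) * 0.01,
             rng.standard_normal((rank, d)) * 0.01) for _ in range(depth)]

def run(x):
    for a, b in adapters:
        w = shared + a @ b   # depth-specific behavior, mostly shared storage
        x = np.tanh(x @ w)   # toy stand-in for a transformer block
    return x

run(rng.standard_normal(d))

stored_full_unique = depth * d * d                 # unique layer per step
stored_shared_lora = d * d + depth * 2 * d * rank  # one block + small adapters
print(stored_full_unique, stored_shared_lora)  # 24576 vs 7168
```

The adapters cost 2·d·r weights per depth step, so the per-depth specialization stays cheap relative to the unique-layer baseline whenever the rank is small.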

Better framing

Recursive sharing is not just “smaller depth.” It is a compute-for-storage exchange. The model can still execute many transformations; it just stops storing a unique set of weights for every one of them.

Bae, S., Fisch, A., Harutyunyan, H., Ji, Z., Kim, S., & Schuster, T. (2024). Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA. arXiv Preprint arXiv:2410.20672. https://arxiv.org/abs/2410.20672
Csordás, R., Irie, K., Schmidhuber, J., Potts, C., & Manning, C. D. (2024). MoEUT: Mixture-of-Experts Universal Transformers. arXiv Preprint arXiv:2405.16039. https://arxiv.org/abs/2405.16039
Üyük, C., Lasby, M., Yassin, M., Evci, U., & Ioannou, Y. (2024). Learning Parameter Sharing with Tensor Decompositions and Sparsity. arXiv Preprint arXiv:2411.09816. https://arxiv.org/abs/2411.09816