Hypothesis

Replacing many unique layers with a smaller recurrent/shared core should free enough artifact budget to increase width, add cheap specialization, or improve protected precision, leading to a better size-quality tradeoff. (Bae et al., 2024; Csordás et al., 2024; Üyük et al., 2024)

Why this is plausible

  • Relaxed Recursive Transformers (Bae et al., 2024) shows that strong pretrained transformers can be converted into recursive ones with only modest quality losses.
  • MoEUT (Csordás et al., 2024) suggests that recurrence in depth can be competitive when paired with better capacity allocation.
  • Learning Parameter Sharing with Tensor Decompositions and Sparsity (Üyük et al., 2024) suggests that sharing need not be all-or-nothing; structured partial sharing can preserve expressivity.

The core exchange

This is not just “make the network smaller.” It is a specific exchange:

  • store fewer unique blocks
  • reuse them more times
  • spend saved bytes on whichever margin matters most

Possible places to reinvest the savings:

  • wider layers or larger MLPs
  • cheap specialization (per-layer scales, low-rank adapters)
  • higher precision for sensitive, protected tensors
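The exchange can be made concrete with back-of-envelope arithmetic. All dimensions and the reinvestment rule below are illustrative assumptions, not numbers from the cited papers; the point is only that sharing frees a large, quantifiable budget.

```python
# Back-of-envelope budget for trading unique depth for shared depth.
# All shapes and counts here are illustrative assumptions.

def block_params(d_model: int, d_ff: int) -> int:
    """Rough parameter count of one transformer block:
    attention (4 * d^2 for Q, K, V, O) plus a two-matrix MLP."""
    return 4 * d_model * d_model + 2 * d_model * d_ff

# Baseline: 24 unique blocks.
d, ff = 1024, 4096
baseline = 24 * block_params(d, ff)

# Shared: 4 unique blocks, each reused 6 times (applied depth still 24).
shared = 4 * block_params(d, ff)
saved = baseline - shared

# Reinvest the savings into width until the original budget is spent again.
d2 = d
while 4 * block_params(d2 + 64, 4 * (d2 + 64)) <= baseline:
    d2 += 64

print(f"baseline params: {baseline:,}")
print(f"shared params:   {shared:,} (saved {saved:,})")
print(f"width after reinvesting: {d2}")
```

Under these toy numbers, sharing 4 blocks across 24 applications frees roughly five sixths of the weight budget, enough to more than double the model width at the same stored size.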

Variants worth distinguishing

  • strict full block sharing
  • sharing with per-layer scales or LoRA-like relaxation
  • MLP-heavy sharing before attention-heavy sharing
  • few-block recurrence rather than single-block recurrence
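A minimal sketch of the "sharing with per-layer scales" variant, assuming a toy MLP-only block: one set of weights is stored once and reused T times, with a cheap unique scale vector per iteration standing in for the LoRA-like relaxation. Shapes, the scale placement, and the residual form are all illustrative assumptions.

```python
import numpy as np

# One shared block reused T times, relaxed by per-iteration scale vectors.
rng = np.random.default_rng(0)
d, T = 16, 6

W1 = rng.normal(0, 0.1, (d, 4 * d))   # shared weights (stored once)
W2 = rng.normal(0, 0.1, (4 * d, d))
scales = np.ones((T, d))              # unique per-iteration parameters (tiny)

def shared_block(x, t):
    """One reuse of the shared block, specialized by iteration index t."""
    h = np.maximum(x @ W1, 0.0)       # shared MLP with ReLU
    return x + (h @ W2) * scales[t]   # residual, scaled per iteration

x = rng.normal(size=(1, d))
for t in range(T):                    # few-block recurrence, depth T
    x = shared_block(x, t)

unique = W1.size + W2.size            # parameters stored once
relax = scales.size                   # cheap specialization overhead
print(unique, relax)
```

The asymmetry is the whole argument: here the relaxation costs 96 extra parameters against 2,048 shared ones, so per-iteration specialization is nearly free relative to storing six unique blocks.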

What must be true for it to win

  1. the saved bytes must be reinvested into something more valuable than unique depth
  2. repeated reuse must not collapse specialization
  3. post-compression quality must improve, not only pre-export behavior
  4. the compute overhead of reuse must remain worthwhile
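Condition 4 is worth spelling out: reuse does not reduce per-token compute, only stored bytes, so the win has to come from quality per byte (and possibly weight-bandwidth effects), not from speed. A toy comparison under the same illustrative block size as above:

```python
# Reuse keeps FLOPs fixed: 24 applied layers cost the same whether they
# are 24 unique blocks or 4 blocks applied 6 times each.
# Numbers are illustrative assumptions.

params_per_block = 12_582_912
flops_per_block = 2 * params_per_block        # ~2 FLOPs per weight per token
depth = 24

unshared_flops = depth * flops_per_block
shared_flops = depth * flops_per_block        # identical applied depth
unshared_bytes = 24 * params_per_block * 2    # fp16 weights
shared_bytes = 4 * params_per_block * 2

print(shared_flops == unshared_flops)         # no compute is saved
print(unshared_bytes // shared_bytes)         # weight bytes shrink 6x
```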

References

Bae, S., Fisch, A., Harutyunyan, H., Ji, Z., Kim, S., & Schuster, T. (2024). Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA. arXiv preprint arXiv:2410.20672. https://arxiv.org/abs/2410.20672
Csordás, R., Irie, K., Schmidhuber, J., Potts, C., & Manning, C. D. (2024). MoEUT: Mixture-of-Experts Universal Transformers. arXiv preprint arXiv:2405.16039. https://arxiv.org/abs/2405.16039
Üyük, C., Lasby, M., Yassin, M., Evci, U., & Ioannou, Y. (2024). Learning Parameter Sharing with Tensor Decompositions and Sparsity. arXiv preprint arXiv:2411.09816. https://arxiv.org/abs/2411.09816