Hypothesis

A single wide recurrent block, or a very small recurrent stack, may use the artifact budget more effectively than many thinner unique layers.

The basic bet is: for a fixed artifact size, depth bought by reusing one wide block beats depth stored as many distinct thin layers.

Why this is plausible

This is a concrete descendant of recursive width scaling.

The literature supports each part of the stack:

  • parameter sharing across depth, relaxed with layer-wise LoRA so tied layers can still specialize (Bae et al., 2024)
  • mixture-of-experts universal transformers that make layer sharing competitive at scale (Csordás et al., 2024)
  • normalization fixes that stabilize fine-tuning at extreme low bit widths (Steinmetz et al., 2025)
  • decoupled quantization-aware training for effective low-bit language models (Zhang et al., 2026)

Concrete design sketch

One family of designs looks like:

  • a wide shared transformer block rather than many thinner unique blocks
  • repeated application across depth steps
  • optional phase-conditioned sharing so each recurrence step is not forced to behave identically
  • selective higher-precision protection for the subset that proves most fragile

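A minimal numpy sketch of this family (the names, sizes, and the single-matrix stand-in for a full transformer block are all illustrative assumptions, not an implementation): one shared wide weight matrix is applied across depth steps, with zero-initialized low-rank per-phase deltas so sharing is strict at initialization but each recurrence step can cheaply specialize.

```python
import numpy as np

rng = np.random.default_rng(0)

D, STEPS, RANK = 512, 6, 8  # width, recurrence depth, adapter rank (illustrative)

# One wide shared weight matrix stands in for the whole transformer block.
W_shared = rng.standard_normal((D, D)) / np.sqrt(D)

# Phase-conditioned specialization: a cheap low-rank delta per recurrence step,
# so steps are not forced to behave identically (LoRA-style relaxation).
A = rng.standard_normal((STEPS, D, RANK)) / np.sqrt(D)
B = np.zeros((STEPS, RANK, D))  # zero-init: sharing is exactly strict at init

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def recurrent_forward(x):
    # Apply the SAME wide block STEPS times, plus the per-phase delta.
    for t in range(STEPS):
        W_t = W_shared + A[t] @ B[t]
        x = x + rmsnorm(x) @ W_t  # residual + pre-norm, as in standard blocks
    return x

x = rng.standard_normal((4, D))
y = recurrent_forward(x)
print(y.shape)  # (4, 512)
```

The stored artifact is dominated by the one shared matrix; here the per-phase adapters add under 20% on top of it, which is the sense in which specialization stays "cheap".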
This should be viewed as a compute-for-storage exchange rather than just a smaller network.
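The exchange rate is easy to work out under a standard cost model (the 12·d² per-block constant is the usual attention-plus-MLP estimate; all numbers are illustrative):

```python
import math

def block_params(d, c=12):
    # ~12*d^2 parameters per transformer block: 4*d^2 attention + 8*d^2 MLP
    return c * d * d

L, d_thin = 16, 1024
baseline = L * block_params(d_thin)   # 16 unique thin blocks, stored separately

d_wide = math.isqrt(L) * d_thin       # equal-storage width for ONE shared block
shared = block_params(d_wide)

print(d_wide, baseline == shared)     # 4096 True
```

At equal storage, reusing one block L times buys a √L-wider block; the price is paying L full-width passes of compute per forward, which is exactly the compute-for-storage exchange.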

What has to be true for it to win

  1. the wider shared block must outperform the narrower unique-depth baseline after compression
  2. repeated reuse must not destabilize training too badly
  3. extra width must survive roundtrip export rather than only helping pre-quant metrics
  4. any saved bytes must be spent where they matter most
  5. cheap specialization must recover enough phase-specific behavior when strict sharing is too rigid
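Condition 3 can be checked with a cheap quantize-dequantize roundtrip before any real export. The sketch below (per-tensor symmetric int4 and the injected outlier column are illustrative assumptions, not a real export path) also shows why condition 4's selective precision spending matters:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 256

# Shared-block weights with one injected outlier column: the failure mode
# where a few extreme channels dominate the quantization scale.
W = rng.standard_normal((D, D)) * 0.02
W[:, 0] *= 50.0

def int4_roundtrip(w):
    # Symmetric per-tensor int4: one scale, stretched to the largest weight.
    scale = np.abs(w).max() / 7.0
    return np.clip(np.round(w / scale), -7, 7) * scale

x = rng.standard_normal((32, D))
ref = x @ W

# Naive export: the outlier inflates the scale and flattens small weights.
err_plain = np.linalg.norm(x @ int4_roundtrip(W) - ref) / np.linalg.norm(ref)

# Selective protection: keep the fragile column in full precision, quantize
# the rest with a scale fitted to the well-behaved weights.
w_mixed = W.copy()
rest = np.arange(1, D)  # here we know column 0 is the outlier by construction
w_mixed[:, rest] = int4_roundtrip(W[:, rest])
err_mixed = np.linalg.norm(x @ w_mixed - ref) / np.linalg.norm(ref)

print(err_mixed < err_plain)  # True: a few protected bytes cut roundtrip error
```

In practice the fragile subset would be found by calibration rather than known in advance, but the budgeting logic is the same: saved bytes go to the channels that fail the roundtrip.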

Main risks

  • the shared block may be forced to do too many incompatible jobs
  • repeated reuse may amplify optimization instability
  • wider activations may worsen outlier behavior unless normalization and precision protection are good enough
  • early wins may disappear if the architecture is too fragile at longer horizons

References

Bae, S., Fisch, A., Harutyunyan, H., Ji, Z., Kim, S., & Schuster, T. (2024). Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA. arXiv preprint arXiv:2410.20672. https://arxiv.org/abs/2410.20672
Csordás, R., Irie, K., Schmidhuber, J., Potts, C., & Manning, C. D. (2024). MoEUT: Mixture-of-Experts Universal Transformers. arXiv preprint arXiv:2405.16039. https://arxiv.org/abs/2405.16039
Steinmetz, C., Childress, G., Herbst, A., Jones, G., Singh, J., Vang, E., & Weinstock, K. (2025). An Extra RMSNorm is All You Need for Fine Tuning to 1.58 Bits. arXiv preprint arXiv:2505.08823. https://arxiv.org/abs/2505.08823
Zhang, W., Liu, B., Hu, Y., Bai, X., Zhang, W., & Cui, B. (2026). pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training. arXiv preprint arXiv:2602.22592. https://arxiv.org/abs/2602.22592