Hypothesis
A single wide recurrent block, or a very small recurrent stack, may use the artifact budget more effectively than many thinner unique layers.
The basic bet is:
- share depth aggressively
- reinvest saved bytes into width
- stabilize the shared block with pre-projection normalization
- recover lost specialization with tiny conditioning or adapters
- use decoupled precision or outlier preservation where the cheap path fails
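The reinvestment arithmetic behind the bet can be made concrete. The sketch below uses the standard per-block parameter estimate for a pre-norm transformer block (roughly 4d² for attention projections plus 8d² for a 4x MLP); the layer count and width are illustrative assumptions, not a proposed configuration.

```python
import math

def block_params(d):
    # Approximate parameters in one transformer block:
    # attention (Q, K, V, output projections) ~ 4*d^2,
    # MLP with 4x expansion ~ 8*d^2; norms and biases are negligible here.
    return 12 * d * d

# Baseline: 12 unique (untied) layers of width 768 (illustrative numbers).
layers, d = 12, 768
baseline = layers * block_params(d)

# Shared alternative: one block applied 12 times, with all saved bytes
# reinvested into width at the same parameter budget.
d_shared = int(d * math.sqrt(layers))  # solve 12*w^2 = layers * 12*d^2 for w
shared = block_params(d_shared)

print(d_shared)            # 2660: roughly 3.5x wider at the same budget
print(shared <= baseline)  # True: stays within the baseline byte budget
```

The square-root relationship is the core of the exchange: tying L layers down to one block buys a factor of sqrt(L) in width before the budget is exhausted.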
Why this is plausible
This is a concrete descendant of recursive width scaling.
The literature supports each part of the stack:
- Relaxed Recursive Transformers suggests deep sharing can work if recurrence is not treated too rigidly. (Bae et al., 2024)
- MoEUT suggests recurrent/shared computation becomes stronger when capacity is allocated more intelligently. (Csordás et al., 2024)
- Extra RMSNorm suggests low-bit stability often depends on how activations are normalized before sensitive projections. (Steinmetz et al., 2025)
- pQuant suggests uniform low-bit treatment wastes quality on the most sensitive parameters. (Zhang et al., 2026)
Concrete design sketch
One family of designs looks like:
- a wide shared transformer block rather than many thinner unique blocks
- repeated application across depth steps
- optional phase-conditioned sharing so each recurrence step is not forced to behave identically
- selective higher-precision protection for the subset that proves most fragile
This should be viewed as a compute-for-storage exchange (extra forward-pass applications of the shared block in return for fewer stored parameters) rather than just a smaller network.
What has to be true for it to win
- the wider shared block must outperform the narrower unique-depth baseline after compression
- repeated reuse must not destabilize training too badly
- extra width must survive roundtrip export rather than only helping pre-quant metrics
- any saved bytes must be spent where they matter most
- cheap specialization must recover enough phase-specific behavior when strict sharing is too rigid
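The roundtrip and byte-allocation conditions can be probed with a toy experiment: symmetric int4 fake-quantization of a heavy-tailed weight matrix, once uniformly and once with the largest-magnitude outliers kept in full precision. The 1% outlier fraction and the synthetic weight distribution are assumptions chosen to mimic outlier-heavy activations, not measured values.

```python
import numpy as np

def fake_quant(w, bits=4):
    # Symmetric per-tensor quantize-dequantize roundtrip.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).clip(-qmax, qmax) * scale

def quant_keep_outliers(w, bits=4, frac=0.01):
    # Spend saved bytes on the most fragile weights: keep the
    # top-frac magnitudes in full precision, quantize the rest.
    k = max(1, int(frac * w.size))
    cut = np.sort(np.abs(w), axis=None)[-k]
    mask = np.abs(w) >= cut
    out = fake_quant(np.where(mask, 0.0, w), bits)
    return np.where(mask, w, out)

rng = np.random.default_rng(0)
# Heavy-tailed weights: mostly small values plus rare large outliers.
w = rng.standard_normal((256, 256))
w.flat[rng.choice(w.size, 64, replace=False)] *= 25.0

err_uniform = np.abs(w - fake_quant(w)).mean()
err_outlier = np.abs(w - quant_keep_outliers(w)).mean()
print(err_outlier < err_uniform)  # True: protection shrinks roundtrip error
```

The mechanism matters for the wider shared block: a handful of outliers inflates the uniform quantization scale for every weight, so the cheap path fails exactly where pQuant-style decoupled protection is expected to pay off.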
Main risks
- the shared block may be forced to do too many incompatible jobs
- repeated reuse may amplify optimization instability
- wider activations may worsen outlier behavior unless normalization and precision protection are good enough
- early wins may disappear if the architecture is too fragile at longer horizons
Related
- Recursive width scaling
- Phase-conditioned sharing
- Iterative refinement over stored depth
- Recursive layer sharing
- Shared depth needs cheap specialization
- Outlier-aware compression
Bae, S., Fisch, A., Harutyunyan, H., Ji, Z., Kim, S., & Schuster, T. (2024). Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA. arXiv Preprint arXiv:2410.20672. https://arxiv.org/abs/2410.20672
Csordás, R., Irie, K., Schmidhuber, J., Potts, C., & Manning, C. D. (2024). MoEUT: Mixture-of-Experts Universal Transformers. arXiv Preprint arXiv:2405.16039. https://arxiv.org/abs/2405.16039
Steinmetz, C., Childress, G., Herbst, A., Jones, G., Singh, J., Vang, E., & Weinstock, K. (2025). An Extra RMSNorm is All You Need for Fine Tuning to 1.58 Bits. arXiv Preprint arXiv:2505.08823. https://arxiv.org/abs/2505.08823
Zhang, W., Liu, B., Hu, Y., Bai, X., Zhang, W., & Cui, B. (2026). pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training. arXiv Preprint arXiv:2602.22592. https://arxiv.org/abs/2602.22592