Hypothesis
Replacing many unique layers with a smaller recurrent/shared core should free enough artifact budget to increase width, add cheap specialization, or improve protected precision, leading to a better size-quality tradeoff. (Bae et al., 2024; Csordás et al., 2024; Üyük et al., 2024)
Why this is plausible
- Relaxed Recursive Transformers shows that strong pretrained transformers can be converted into recursive ones with only modest quality loss.
- MoEUT suggests that recurrence in depth can be competitive when paired with mixture-of-experts capacity allocation.
- Fine-grained Parameter Sharing suggests that sharing need not be all-or-nothing; structured partial sharing can preserve expressivity.
The core exchange
This is not just “make the network smaller.” It is a specific exchange:
- store fewer unique blocks
- reuse them more times
- spend saved bytes on whichever margin matters most
Possible places to reinvest the savings:
- more width in the shared block
- phase-conditioned sharing
- selective precision or sparse outlier protection
- larger or better-conditioned attention/MLP sublayers
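The exchange above can be made concrete with rough byte arithmetic. A minimal sketch, assuming a standard transformer block costs about 12·d² parameters (4·d² for attention projections plus 8·d² for a 4x-expansion MLP); the function names and the per-block estimate are illustrative, not exact:

```python
import math

def shared_core_budget(n_layers, d_model, bytes_per_param=2, n_shared_blocks=1):
    """Bytes freed by replacing n_layers unique blocks with n_shared_blocks
    reused blocks. Assumes ~12 * d^2 params per block (4*d^2 attention
    projections + 8*d^2 for a 4x-expansion MLP); a rough estimate only."""
    per_block = 12 * d_model ** 2
    return (n_layers - n_shared_blocks) * per_block * bytes_per_param

def width_at_constant_bytes(d_model, n_layers, n_shared_blocks):
    """Width the freed bytes buy back if reinvested entirely in the shared
    core: solve n_shared * 12 * d_new^2 == n_layers * 12 * d_model^2."""
    return d_model * math.sqrt(n_layers / n_shared_blocks)
```

For example, collapsing a 24-layer model at d = 1024 with 2-byte weights down to 4 shared blocks frees roughly 0.5 GB, and spending all of it on the shared core would widen it to d ≈ 2508.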
Variants worth distinguishing
- strict full block sharing
- sharing with per-layer scales or LoRA-like relaxation
- MLP-heavy sharing before attention-heavy sharing
- few-block recurrence rather than single-block recurrence
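The per-layer-scales-or-LoRA variant can be sketched in a few lines. A toy forward pass, assuming one shared matrix relaxed at each depth step by a step-specific low-rank delta and a scalar gain; all names, sizes, and the tanh stand-in nonlinearity are illustrative:

```python
import numpy as np

def relaxed_shared_forward(x, w_shared, loras, gains):
    """Apply one shared weight matrix at every depth step, relaxed by a
    step-specific low-rank correction (a @ b) and a scalar gain, so each
    extra step costs only 2*d*r + 1 unique params instead of d^2."""
    for (a, b), gain in zip(loras, gains):
        w_eff = w_shared + a @ b          # cheap per-depth specialization
        x = np.tanh(gain * (x @ w_eff))   # stand-in for the real sublayer
    return x

rng = np.random.default_rng(0)
d, r, depth = 8, 2, 3
w = rng.normal(scale=0.1, size=(d, d))
loras = [(rng.normal(scale=0.05, size=(d, r)),
          rng.normal(scale=0.05, size=(r, d))) for _ in range(depth)]
y = relaxed_shared_forward(np.ones((1, d)), w, loras, [1.0] * depth)
```

Strict full block sharing is the r = 0 case; growing r interpolates toward fully unique layers, which is one way to read the point that sharing need not be all-or-nothing.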
What must be true for it to win
- the saved bytes must be reinvested into something more valuable than unique depth
- repeated reuse must not collapse specialization
- post-compression quality must improve, not only pre-export behavior
- the compute overhead of reuse must remain worthwhile
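The last condition is easy to quantify, because reuse decouples compute from storage. A rough sketch, assuming ~2 FLOPs per parameter per token and the same ~12·d² per-block estimate as the block-cost convention above; illustrative only:

```python
def reuse_cost(n_applications, n_unique_blocks, d_model, bytes_per_param=2):
    """Per-token compute scales with how many times blocks are applied;
    storage scales only with the unique blocks kept. Returns
    (approx_flops_per_token, weight_bytes)."""
    per_block = 12 * d_model ** 2
    flops = 2 * n_applications * per_block   # ~2 FLOPs per param per token
    weight_bytes = n_unique_blocks * per_block * bytes_per_param
    return flops, weight_bytes
```

Running 4 shared blocks for 12 total applications at d = 1024 costs the same per-token FLOPs as a unique 12-layer stack while storing a third of its weights; the exchange only wins if the reinvested bytes are worth more than the unique depth they replaced.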
Concrete descendants
- Recurrent wide architecture
- Phase-conditioned sharing
- possible pairing with iterative refinement over stored depth
Related
- Recursive and shared-parameter architectures
- Recursive layer sharing
- Compute-for-storage exchange
- Shared depth needs cheap specialization
- Hypothesis ledger
Bae, S., Fisch, A., Harutyunyan, H., Ji, Z., Kim, S., & Schuster, T. (2024). Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA. arXiv Preprint arXiv:2410.20672. https://arxiv.org/abs/2410.20672
Csordás, R., Irie, K., Schmidhuber, J., Potts, C., & Manning, C. D. (2024). MoEUT: Mixture-of-Experts Universal Transformers. arXiv Preprint arXiv:2405.16039. https://arxiv.org/abs/2405.16039
Üyük, C., Lasby, M., Yassin, M., Evci, U., & Ioannou, Y. (2024). Learning Parameter Sharing with Tensor Decompositions and Sparsity. arXiv Preprint arXiv:2411.09816. https://arxiv.org/abs/2411.09816