Hypothesis
Replacing many unique layers with a smaller recurrent/shared core should free enough artifact budget to increase width, add cheap specialization, or improve protected precision, leading to a better size-quality tradeoff. (Bae et al., 2024; Csordás et al., 2024; Üyük et al., 2024)
Why this is plausible
- Relaxed Recursive Transformers shows that strong pretrained transformers can be converted into recursive ones with only modest quality loss.
- MoEUT suggests that recurrence in depth can be competitive when paired with mixture-of-experts capacity allocation.
- Fine-grained Parameter Sharing suggests that sharing need not be all-or-nothing; structured partial sharing can preserve expressivity.
The core exchange
This is not just “make the network smaller.” It is a specific exchange:
- store fewer unique blocks
- reuse them more times
- spend saved bytes on whichever margin matters most
Possible places to reinvest the savings:
- more width in the shared block
- phase-conditioned sharing
- selective precision or sparse outlier protection
- larger or better-conditioned attention/MLP sublayers
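The exchange above can be made concrete with rough byte arithmetic. A minimal sketch, assuming a standard transformer block costs about 12·d² parameters (4·d² for attention projections plus 8·d² for a 4x-expansion MLP); the function names and the per-block estimate are illustrative, not exact:

```python
import math

def shared_core_budget(n_layers, d_model, bytes_per_param=2, n_shared_blocks=1):
    """Bytes freed by replacing n_layers unique blocks with n_shared_blocks
    reused blocks. Assumes ~12 * d^2 params per block (4*d^2 attention
    projections + 8*d^2 for a 4x-expansion MLP); a rough estimate only."""
    per_block = 12 * d_model ** 2
    return (n_layers - n_shared_blocks) * per_block * bytes_per_param

def width_at_constant_bytes(d_model, n_layers, n_shared_blocks):
    """Width the freed bytes buy back if reinvested entirely in the shared
    core: solve n_shared * 12 * d_new^2 == n_layers * 12 * d_model^2."""
    return d_model * math.sqrt(n_layers / n_shared_blocks)
```

For example, collapsing a 24-layer model at d = 1024 with 2-byte weights down to 4 shared blocks frees roughly 0.5 GB, and spending all of it on the shared core would widen it to d ≈ 2508.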
Variants worth distinguishing
- strict full block sharing
- sharing with per-layer scales or LoRA-like relaxation
- MLP-heavy sharing before attention-heavy sharing
- few-block recurrence rather than single-block recurrence
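The per-layer-scales-or-LoRA variant can be sketched in a few lines. A toy forward pass, assuming one shared matrix relaxed at each depth step by a step-specific low-rank delta and a scalar gain; all names, sizes, and the tanh stand-in nonlinearity are illustrative:

```python
import numpy as np

def relaxed_shared_forward(x, w_shared, loras, gains):
    """Apply one shared weight matrix at every depth step, relaxed by a
    step-specific low-rank correction (a @ b) and a scalar gain, so each
    extra step costs only 2*d*r + 1 unique params instead of d^2."""
    for (a, b), gain in zip(loras, gains):
        w_eff = w_shared + a @ b          # cheap per-depth specialization
        x = np.tanh(gain * (x @ w_eff))   # stand-in for the real sublayer
    return x

rng = np.random.default_rng(0)
d, r, depth = 8, 2, 3
w = rng.normal(scale=0.1, size=(d, d))
loras = [(rng.normal(scale=0.05, size=(d, r)),
          rng.normal(scale=0.05, size=(r, d))) for _ in range(depth)]
y = relaxed_shared_forward(np.ones((1, d)), w, loras, [1.0] * depth)
```

Strict full block sharing is the r = 0 case; growing r interpolates toward fully unique layers, which is one way to read the point that sharing need not be all-or-nothing.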
What must be true for it to win
- the saved bytes must be reinvested into something more valuable than unique depth
- repeated reuse must not collapse specialization
- post-compression quality must improve, not only pre-export behavior
- the compute overhead of reuse must remain worthwhile
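The last condition is easy to quantify, because reuse decouples compute from storage. A rough sketch, assuming ~2 FLOPs per parameter per token and the same ~12·d² per-block estimate as the block-cost convention above; illustrative only:

```python
def reuse_cost(n_applications, n_unique_blocks, d_model, bytes_per_param=2):
    """Per-token compute scales with how many times blocks are applied;
    storage scales only with the unique blocks kept. Returns
    (approx_flops_per_token, weight_bytes)."""
    per_block = 12 * d_model ** 2
    flops = 2 * n_applications * per_block   # ~2 FLOPs per param per token
    weight_bytes = n_unique_blocks * per_block * bytes_per_param
    return flops, weight_bytes
```

Running 4 shared blocks for 12 total applications at d = 1024 costs the same per-token FLOPs as a unique 12-layer stack while storing a third of its weights; the exchange only wins if the reinvested bytes are worth more than the unique depth they replaced.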
Concrete descendants
- Recurrent wide architecture
- Phase-conditioned sharing
- possible pairing with iterative refinement over stored depth
Related
- Recursive and shared-parameter architectures
- Recursive layer sharing
- Compute-for-storage exchange
- Shared depth needs cheap specialization
- Hypothesis ledger
Bae, S., Fisch, A., Harutyunyan, H., Ji, Z., Kim, S., & Schuster, T. (2024). Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA. arXiv Preprint arXiv:2410.20672. https://arxiv.org/abs/2410.20672
Csordás, R., Irie, K., Schmidhuber, J., Potts, C., & Manning, C. D. (2024). MoEUT: Mixture-of-Experts Universal Transformers. arXiv Preprint arXiv:2405.16039. https://arxiv.org/abs/2405.16039
Üyük, C., Lasby, M., Yassin, M., Evci, U., & Ioannou, Y. (2024). Learning Parameter Sharing with Tensor Decompositions and Sparsity. arXiv Preprint arXiv:2411.09816. https://arxiv.org/abs/2411.09816