Core idea
A strict artifact cap changes the architecture question from
“What model is strongest for a given parameter count?”
to
“What model is strongest for a given number of stored unique weights?”
Recursive layer sharing is one of the cleanest responses to that shift.
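The core move can be shown in a few lines. Below is a minimal numpy sketch, not a real transformer: the "block" is a toy two-matrix MLP with a residual connection, and the sizes are illustrative. The point is that twelve depth steps of computation are served by one stored set of weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # model width (toy size)

# The shared block's unique weights are stored exactly once.
W1 = rng.standard_normal((d, 4 * d)) * 0.02
W2 = rng.standard_normal((4 * d, d)) * 0.02

def shared_block(x):
    # Toy MLP block with a residual connection (stand-in for a transformer layer).
    return x + np.maximum(x @ W1, 0.0) @ W2

def recursive_forward(x, depth):
    # Many depth steps, one set of stored weights.
    for _ in range(depth):
        x = shared_block(x)
    return x

x = rng.standard_normal((1, d))
y = recursive_forward(x, depth=12)

unique_params = W1.size + W2.size  # 2 * 4 * d**2 = 32768 stored weights for 12 layers of compute
```

A vanilla 12-layer stack of the same block would store twelve times as many unique weights while executing the same number of transformations.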
What the literature suggests
- Relaxed Recursive Transformers shows that pretrained transformers can be compressed into recursive forms, with much of the lost quality recovered by lightweight per-layer LoRA adapters. (Bae et al., 2024)
- MoEUT shows that recurrence in depth can become much more competitive once capacity is reallocated intelligently rather than uniformly. (Csordás et al., 2024)
- Fine-grained Parameter Sharing argues that sharing should not be treated as all-or-nothing; more granular sharing patterns can preserve more expressivity. (Üyük et al., 2024)
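The per-layer LoRA idea from the first entry can be sketched concretely. This is a toy single-matrix "layer" in the spirit of that recipe, not the paper's implementation; the names `A`, `B`, and the rank are illustrative assumptions. One shared weight matrix is stored once, and each depth step adds only a cheap low-rank correction.

```python
import numpy as np

rng = np.random.default_rng(1)
d, depth, rank = 64, 12, 4

W = rng.standard_normal((d, d)) * 0.02           # shared weights, stored once
# Per-depth low-rank adapters in the spirit of layer-wise LoRA:
A = [rng.standard_normal((d, rank)) * 0.02 for _ in range(depth)]
B = [np.zeros((rank, d)) for _ in range(depth)]  # zero-init: adapters start as no-ops

def step(x, i):
    # Effective weight at depth i is W + A[i] @ B[i], applied without materializing it.
    return x @ W + (x @ A[i]) @ B[i]

shared_cost = W.size
adapter_cost = sum(a.size + b.size for a, b in zip(A, B))
vanilla_cost = depth * W.size                    # a fully unshared stack of the same depth
```

Even with a distinct adapter at every depth, the stored total stays well under the unshared stack, which is exactly the "sharing is not all-or-nothing" point made by the third entry.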
The compact-LLM interpretation
Recursive sharing is best understood as a storage conversion:
- unique layers become repeated computation
- stored depth becomes reusable transformation capacity
- the saved bytes can be reinvested into width, cheap specialization, or selective precision
That is why this note sits near both recursive width scaling and compute-for-storage exchange.
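The "saved bytes reinvested into width" direction admits a back-of-envelope calculation. Assuming block parameters scale roughly as d² (a deliberate simplification that ignores attention-head granularity, embeddings, and norms), collapsing a stack into one shared block lets the whole unique-weight budget buy width instead:

```python
import math

d, layers = 1024, 24
# Unique-weight budget of a vanilla stack, under the block ~ d**2 assumption.
unique_budget = layers * d * d
# A single shared block can spend the entire budget on width instead:
d_wide = math.isqrt(unique_budget)  # == floor(d * sqrt(layers))
print(d, "->", d_wide)
```

Under these assumptions the shared block's width grows by a factor of roughly √layers, which is where the pull toward recursive width scaling comes from.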
Why this matters for Parameter Golf
If one shared block can play the role of many unique blocks, the saved bytes can be spent on:
- extra width
- more careful quantization or protected precision
- lightweight per-depth adaptation
- perhaps even evaluation-time refinement rather than stored depth
That is the logic behind a recurrent, wide architecture.
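The reinvestment options above can be priced in bytes. The per-block count of 12·d², the fp16 storage, and the rank are illustrative assumptions, but they show how little of the savings a full set of per-depth adapters actually consumes:

```python
d, layers, rank = 1024, 24, 8
block = 12 * d * d                  # rough per-block parameter count (illustrative)
bytes_vanilla = layers * block * 2  # fp16: 2 bytes per stored weight
bytes_shared = block * 2            # one shared block at the same precision
saved = bytes_vanilla - bytes_shared
# Spend a slice of the savings on per-depth LoRA pairs (4 adapted matrices per step):
bytes_lora = layers * 4 * (2 * d * rank) * 2
print(saved, bytes_lora)
```

Under these numbers the adapters cost well under one percent of what sharing freed up, leaving the rest for width or protected precision.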
Typical failure modes
Recursive sharing usually breaks for familiar reasons:
- the shared block is forced to serve too many incompatible functions
- early and late depth steps need different behaviors
- training can become unstable because gradients from every depth step accumulate on the same reused weights
- wider recurrent designs can worsen activation and outlier problems if the compression path is not improved too
This is why shared depth often pairs naturally with:
- lightweight per-depth adaptation (e.g. LoRA)
- phase-conditioned sharing
- careful quantization with protected precision
Better framing
Recursive sharing is not just “smaller depth.” It is a compute-for-storage exchange. The model can still execute many transformations; it just stops storing a unique set of weights for every one of them.
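One cheap mitigation for the early-versus-late mismatch listed under the failure modes is per-depth conditioning: a gain and bias vector per step (in the style of feature-wise modulation) lets each depth behave differently without unsharing the block. A minimal numpy sketch under toy sizes; the names and the tanh block are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
d, depth = 64, 8
W = rng.standard_normal((d, d)) * 0.02  # the one shared weight matrix
# Per-depth gain/bias vectors (2*d params per step) let early and late
# steps specialize without storing a separate W per depth.
gain = np.ones((depth, d))
bias = np.zeros((depth, d))

def step(x, i):
    return x + gain[i] * np.tanh(x @ W) + bias[i]

x = rng.standard_normal((1, d))
for i in range(depth):
    x = step(x, i)

extra = gain.size + bias.size  # 2 * depth * d = 1024 extra params, vs d*d = 4096 per unshared layer
```

The conditioning parameters are a rounding error next to the weights they replace, which is why this kind of pairing preserves the compute-for-storage exchange rather than undoing it.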
Related
- Recursive and shared-parameter architectures
- Recursive width scaling
- Phase-conditioned sharing
- Compute-for-storage exchange