This lane asks whether the best way to spend the artifact budget is to store fewer unique blocks and reuse them across depth, phases, or refinement steps.
Core question
If stored bytes are the real bottleneck, should a compact LLM prefer:
- many unique thin layers, or
- a small number of stronger shared layers plus repeated application and cheap conditioning?
Parameter Golf makes that trade unusually central.
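The trade-off can be made concrete with rough budget arithmetic. The sketch below compares the two options under illustrative (not paper-sourced) dimensions: 24 unique thin layers versus one wider shared block reused for 24 steps with a small per-step scale vector as the only unique storage.

```python
# Hypothetical budget arithmetic for the core trade-off; all dimensions are
# illustrative assumptions, not taken from any of the cited papers.

def transformer_block_params(d_model, d_ff):
    """Rough per-block count: attention projections (4 * d^2) + MLP (2 * d * d_ff)."""
    return 4 * d_model ** 2 + 2 * d_model * d_ff

# Option (a): 24 unique thin layers.
unique = 24 * transformer_block_params(d_model=512, d_ff=2048)

# Option (b): one wider shared block, applied for 24 steps,
# plus a per-step gain vector as cheap conditioning.
shared_block = transformer_block_params(d_model=1024, d_ff=4096)
per_step_scales = 24 * 1024
shared = shared_block + per_step_scales

print(f"unique thin layers : {unique:,}")   # 75,497,472
print(f"shared wide block  : {shared:,}")   # 12,607,488
```

Under these toy numbers the shared wide block stores roughly a sixth of the bytes while offering twice the width per step, which is the slack the rest of this lane proposes to spend.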
Why it matters
Depth is expensive when every layer is unique. Sharing turns depth into recomputation: the same stored block is applied repeatedly, so added depth costs compute rather than stored bytes.
That can free bytes for:
- extra width in the shared block
- small per-step scales, gates, or adapters
- selective higher precision for fragile subsets
- more deliberate use of evaluation-time compute
This is the main logic behind the recursive width scaling and recurrent wide architecture hypotheses.
Central papers
- Relaxed Recursive Transformers (Bae et al., 2024)
- MoEUT (Csordás et al., 2024)
- Fine-grained Parameter Sharing (Üyük et al., 2024)
- ClusComp (Liao et al., 2025)
Important sub-mechanisms
1. Strict recurrence
Reuse one block many times and treat depth as repeated computation.
2. Relaxed sharing
Share most weights, but allow tiny per-depth adjustments such as scales, gates, or LoRA-like adapters.
3. Selective unsharing
Keep only the most role-specific tensors unique while sharing the rest.
4. Shared depth plus width
Spend saved bytes on a stronger recurrent block instead of on more unique stages.
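Mechanisms 1, 2, and 4 can be combined in a few lines. The sketch below is a minimal illustration, not any paper's implementation: one shared weight matrix is applied at every depth step (strict recurrence), and the only per-depth storage is a tiny gain vector on the residual path (relaxed sharing). All names and dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, steps = 64, 8

# Shared block weights: stored once, reused at every depth step.
W = rng.standard_normal((d, d)) / np.sqrt(d)

# Per-depth conditioning: one gain vector per step (steps * d extra params).
scales = np.ones((steps, d))

def relaxed_recurrent_forward(x):
    for t in range(steps):
        h = np.tanh(x @ W)        # shared computation, identical weights each step
        x = x + scales[t] * h     # cheap per-depth adjustment on the residual path
    return x

x = rng.standard_normal(d)
y = relaxed_recurrent_forward(x)

shared_params = W.size      # 4096 stored once
unique_params = scales.size # 512 across all depths
print(shared_params, unique_params)
```

The byte asymmetry is the point: the per-depth adapters here cost an eighth of one shared matrix, so deepening the model (more steps) grows unique storage only linearly in `d`, not in `d * d`.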
Why shared depth often pairs with other lanes
Shared blocks become more plausible when combined with:
- pre-projection normalization to stabilize repeated reuse
- sparse outlier preservation so the most fragile weights are not forced through the cheapest path
- iterative refinement over stored depth when the same block can serve both depth and test-time refinement roles
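The normalization pairing above can be demonstrated directly. The toy below, a sketch under assumed dimensions rather than a claim about any specific model, applies the same slightly expansive block 32 times with and without an RMSNorm before each reuse; without it, activation norms compound step over step.

```python
import numpy as np

rng = np.random.default_rng(1)
d, steps = 64, 32
W = rng.standard_normal((d, d)) * 0.3   # deliberately expansive shared block

def rmsnorm(x, eps=1e-6):
    # Rescale to unit root-mean-square so every reuse sees the same input scale.
    return x / np.sqrt(np.mean(x ** 2) + eps)

def recur(x, normalize):
    norms = []
    for _ in range(steps):
        if normalize:
            x = rmsnorm(x)
        x = x + x @ W               # same shared block at every step
        norms.append(np.linalg.norm(x))
    return norms

x0 = rng.standard_normal(d)
raw = recur(x0.copy(), normalize=False)
stab = recur(x0.copy(), normalize=True)
print(f"no norm, final step : {raw[-1]:.3e}")
print(f"rmsnorm, final step : {stab[-1]:.3e}")
```

The unnormalized trajectory explodes because repeated application compounds the block's gain, while the pre-normalized one stays bounded, which is why shared-depth designs lean on pre-projection normalization.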
Main risks
- shared blocks can lose layer specialization
- repeated reuse may amplify optimization pathologies
- attention and MLP sublayers may want different sharing granularity
- wider recurrent blocks can create harder outlier problems unless the compression path improves too
Most useful next hypotheses
- Recursive width scaling
- Recurrent wide architecture
- Phase-conditioned sharing
- Interaction with RMSNorm stabilized scaling
- Interaction with sparse outlier preservation