This lane asks whether the best way to spend the artifact budget is to store fewer unique blocks and reuse them across depth, phases, or refinement steps.

Core question

If stored bytes are the real bottleneck, should a compact LLM prefer:

  • many unique thin layers, or
  • a small number of stronger shared layers plus repeated application and cheap conditioning?

Parameter Golf makes that trade unusually central.
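To make the fork concrete, here is a back-of-envelope parameter count. All layer counts, widths, and the attention-plus-MLP cost model are illustrative assumptions, not figures from any specific model.

```python
def block_params(d_model: int, d_ff: int) -> int:
    """Parameters in one transformer block: attention projections
    (4 * d_model**2) plus a two-matrix MLP (2 * d_model * d_ff),
    ignoring biases and norms."""
    return 4 * d_model * d_model + 2 * d_model * d_ff

# Option A: 24 unique thin layers.
unique = 24 * block_params(512, 2048)           # 75,497,472 parameters

# Option B: one 2x-wider shared block applied 24 times, plus a small
# per-step conditioning vector (one scale per channel per step).
shared = block_params(1024, 4096) + 24 * 1024   # 12,607,488 parameters
```

Under these toy numbers the shared option stores roughly 6x fewer parameters at the same computational depth, which is the budget headroom the rest of this lane tries to spend.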

Why it matters

Depth is expensive when every layer is unique: each extra layer costs a full block of stored bytes. Sharing decouples the two, turning depth into recomputation over the same stored weights.

That can free bytes for:

  • extra width in the shared block
  • small per-step scales, gates, or adapters
  • selective higher precision for fragile subsets
  • more deliberate use of evaluation-time compute

This is the main logic behind recursive width scaling and recurrent wide architectures.

Central papers

See the reference list at the end of this note.

Important sub-mechanisms

1. Strict recurrence

Reuse one block many times and treat depth as repeated computation.
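A minimal sketch of strict recurrence, using a toy residual MLP in place of a full transformer block; all names and sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W1 = rng.normal(scale=0.1, size=(d, 4 * d))   # the only stored weights
W2 = rng.normal(scale=0.1, size=(4 * d, d))

def shared_block(x):
    # Toy residual MLP standing in for a full transformer block.
    return x + np.tanh(x @ W1) @ W2

def forward(x, depth=12):
    # Depth is repeated application of the same stored weights.
    for _ in range(depth):
        x = shared_block(x)
    return x

y = forward(rng.normal(size=(2, d)))
```

Stored parameters are independent of depth: running the loop 12 or 48 times costs more compute but not one extra byte.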

2. Relaxed sharing

Share most weights, but allow tiny per-depth adjustments such as scales, gates, or LoRA-like adapters.
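A sketch of relaxed sharing in the spirit of the Relaxed Recursive Transformers paper: one shared matrix, plus per-depth low-rank adapters and channel scales that cost far fewer bytes than unique layers would. Sizes and the single-matrix layer are illustrative simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, depth = 64, 2, 8    # r = adapter rank; all sizes illustrative

W = rng.normal(scale=0.1, size=(d, d))          # shared weight
A = rng.normal(scale=0.1, size=(depth, d, r))   # per-depth adapter (down)
B = np.zeros((depth, r, d))                     # per-depth adapter (up)
g = np.ones((depth, d))                         # per-depth channel scales

def layer(x, i):
    # Effective weight at depth i: shared W plus a rank-r correction.
    W_i = W + A[i] @ B[i]
    return g[i] * np.tanh(x @ W_i)

def forward(x):
    for i in range(depth):
        x = layer(x, i)
    return x

y = forward(rng.normal(size=(2, d)))
adapter_params = A.size + B.size + g.size   # 2,560
unique_equivalent = depth * W.size          # 32,768 if all layers were unique
```

The per-depth additions give each step its own effective weights while storing under a tenth of what fully unique layers would.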

3. Selective unsharing

Keep only the most role-specific tensors unique while sharing the rest.
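A sketch of selective unsharing: the large MLP matrices are shared across depth, while small role-specific tensors (here, per-layer RMSNorm gains) stay unique. Which tensors to keep unique is a design choice; this split is just one plausible assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 32, 6

W1 = rng.normal(scale=0.1, size=(d, 4 * d))   # shared across all layers
W2 = rng.normal(scale=0.1, size=(4 * d, d))   # shared across all layers
gains = np.ones((depth, d))                   # unique per layer, cheap

def rmsnorm(x, gain):
    return gain * x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + 1e-6)

def forward(x):
    for i in range(depth):
        x = x + np.tanh(rmsnorm(x, gains[i]) @ W1) @ W2
    return x

y = forward(rng.normal(size=(2, d)))
unique_fraction = gains.size / (W1.size + W2.size + gains.size)
```

In this toy configuration the unique tensors are under 3% of stored parameters, yet every layer still gets a distinct input transformation.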

4. Shared depth plus width

Spend saved bytes on a stronger recurrent block instead of on more unique stages.
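The width the freed budget can buy is easy to estimate. Using the same illustrative 4*d^2-per-block cost model as above (an assumption, not a measured figure), solve for the width of a single shared block whose storage matches a stack of unique thin layers:

```python
import math

L, d = 24, 512
budget = L * 4 * d * d        # parameter budget of the unique stack
D = math.isqrt(budget // 4)   # affordable width of one shared block
```

At the same storage, one shared block can be about sqrt(L) times wider than each unique layer (here roughly 2508 vs 512), which is the "stronger recurrent block" this sub-mechanism bets on.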

Why shared depth often pairs with other lanes

Shared blocks rarely stand alone; they become more plausible when combined with the other Parameter Golf lanes, which is also where the risks below are easiest to manage.

Main risks

  • shared blocks can lose layer specialization
  • repeated reuse may amplify optimization pathologies
  • attention and MLP sublayers may want different sharing granularity
  • wider recurrent blocks can create harder outlier problems unless the compression path improves too

Most useful next hypotheses

Bae, S., Fisch, A., Harutyunyan, H., Ji, Z., Kim, S., & Schuster, T. (2024). Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA. arXiv Preprint arXiv:2410.20672. https://arxiv.org/abs/2410.20672
Csordás, R., Irie, K., Schmidhuber, J., Potts, C., & Manning, C. D. (2024). MoEUT: Mixture-of-Experts Universal Transformers. arXiv Preprint arXiv:2405.16039. https://arxiv.org/abs/2405.16039
Liao, B., Herold, C., Hashemi, S. H., Vasilev, S., Khadivi, S., & Monz, C. (2025). ClusComp: A Simple Paradigm for Model Compression and Efficient Finetuning. arXiv Preprint arXiv:2503.13089. https://arxiv.org/abs/2503.13089
Üyük, C., Lasby, M., Yassin, M., Evci, U., & Ioannou, Y. (2024). Learning Parameter Sharing with Tensor Decompositions and Sparsity. arXiv Preprint arXiv:2411.09816. https://arxiv.org/abs/2411.09816