This lane asks whether the best way to spend the artifact budget is to store fewer unique blocks and reuse them across depth, phases, or refinement steps.
Core question
If stored bytes are the real bottleneck, should a compact LLM prefer:
- many unique thin layers, or
- a small number of stronger shared layers plus repeated application and cheap conditioning?
Parameter Golf makes that trade unusually central.
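The trade-off can be made concrete with rough budget arithmetic. The sketch below compares the two options under illustrative (not paper-sourced) dimensions: 24 unique thin layers versus one wider shared block reused for 24 steps with a small per-step scale vector as the only unique storage.

```python
# Hypothetical budget arithmetic for the core trade-off; all dimensions are
# illustrative assumptions, not taken from any of the cited papers.

def transformer_block_params(d_model, d_ff):
    """Rough per-block count: attention projections (4 * d^2) + MLP (2 * d * d_ff)."""
    return 4 * d_model ** 2 + 2 * d_model * d_ff

# Option (a): 24 unique thin layers.
unique = 24 * transformer_block_params(d_model=512, d_ff=2048)

# Option (b): one wider shared block, applied for 24 steps,
# plus a per-step gain vector as cheap conditioning.
shared_block = transformer_block_params(d_model=1024, d_ff=4096)
per_step_scales = 24 * 1024
shared = shared_block + per_step_scales

print(f"unique thin layers : {unique:,}")   # 75,497,472
print(f"shared wide block  : {shared:,}")   # 12,607,488
```

Under these toy numbers the shared wide block stores roughly a sixth of the bytes while offering twice the width per step, which is the slack the rest of this lane proposes to spend.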
Why it matters
Depth is expensive when every layer is unique. Sharing turns depth into recomputation: the same stored block is applied repeatedly, so added depth costs compute rather than stored bytes.
That can free bytes for:
- extra width in the shared block
- small per-step scales, gates, or adapters
- selective higher precision for fragile subsets
- more deliberate use of evaluation-time compute
This is the main logic behind the recursive width scaling and recurrent wide architecture hypotheses.
Central papers
- Relaxed Recursive Transformers (Bae et al., 2024)
- MoEUT (Csordás et al., 2024)
- Fine-grained Parameter Sharing (Üyük et al., 2024)
- ClusComp (Liao et al., 2025)
Important sub-mechanisms
1. Strict recurrence
Reuse one block many times and treat depth as repeated computation.
2. Relaxed sharing
Share most weights, but allow tiny per-depth adjustments such as scales, gates, or LoRA-like adapters.
3. Selective unsharing
Keep only the most role-specific tensors unique while sharing the rest.
4. Shared depth plus width
Spend saved bytes on a stronger recurrent block instead of on more unique stages.
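Mechanisms 1, 2, and 4 can be combined in a few lines. The sketch below is a minimal illustration, not any paper's implementation: one shared weight matrix is applied at every depth step (strict recurrence), and the only per-depth storage is a tiny gain vector on the residual path (relaxed sharing). All names and dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, steps = 64, 8

# Shared block weights: stored once, reused at every depth step.
W = rng.standard_normal((d, d)) / np.sqrt(d)

# Per-depth conditioning: one gain vector per step (steps * d extra params).
scales = np.ones((steps, d))

def relaxed_recurrent_forward(x):
    for t in range(steps):
        h = np.tanh(x @ W)        # shared computation, identical weights each step
        x = x + scales[t] * h     # cheap per-depth adjustment on the residual path
    return x

x = rng.standard_normal(d)
y = relaxed_recurrent_forward(x)

shared_params = W.size      # 4096 stored once
unique_params = scales.size # 512 across all depths
print(shared_params, unique_params)
```

The byte asymmetry is the point: the per-depth adapters here cost an eighth of one shared matrix, so deepening the model (more steps) grows unique storage only linearly in `d`, not in `d * d`.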
Why shared depth often pairs with other lanes
Shared blocks become more plausible when combined with:
- pre-projection normalization to stabilize repeated reuse
- sparse outlier preservation so the most fragile weights are not forced through the cheapest path
- iterative refinement over stored depth when the same block can serve both depth and test-time refinement roles
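The normalization pairing above can be demonstrated directly. The toy below, a sketch under assumed dimensions rather than a claim about any specific model, applies the same slightly expansive block 32 times with and without an RMSNorm before each reuse; without it, activation norms compound step over step.

```python
import numpy as np

rng = np.random.default_rng(1)
d, steps = 64, 32
W = rng.standard_normal((d, d)) * 0.3   # deliberately expansive shared block

def rmsnorm(x, eps=1e-6):
    # Rescale to unit root-mean-square so every reuse sees the same input scale.
    return x / np.sqrt(np.mean(x ** 2) + eps)

def recur(x, normalize):
    norms = []
    for _ in range(steps):
        if normalize:
            x = rmsnorm(x)
        x = x + x @ W               # same shared block at every step
        norms.append(np.linalg.norm(x))
    return norms

x0 = rng.standard_normal(d)
raw = recur(x0.copy(), normalize=False)
stab = recur(x0.copy(), normalize=True)
print(f"no norm, final step : {raw[-1]:.3e}")
print(f"rmsnorm, final step : {stab[-1]:.3e}")
```

The unnormalized trajectory explodes because repeated application compounds the block's gain, while the pre-normalized one stays bounded, which is why shared-depth designs lean on pre-projection normalization.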
Main risks
- shared blocks can lose layer specialization
- repeated reuse may amplify optimization pathologies
- attention and MLP sublayers may want different sharing granularity
- wider recurrent blocks can create harder outlier problems unless the compression path improves too
Most useful next hypotheses
- Recursive width scaling
- Recurrent wide architecture
- Phase-conditioned sharing
- Interaction with RMSNorm stabilized scaling
- Interaction with sparse outlier preservation