Hypothesis
A single wide recurrent block, or a very small recurrent stack, may use the artifact budget more effectively than many thinner unique layers.
The basic bet is:
- share depth aggressively
- reinvest saved bytes into width
- stabilize the shared block with pre-projection normalization
- recover lost specialization with tiny conditioning or adapters
- use decoupled precision or outlier preservation where the cheap path fails
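The reinvestment arithmetic behind the bet can be made concrete. The sketch below uses the standard per-block parameter estimate for a pre-norm transformer block (roughly 4d² for attention projections plus 8d² for a 4x MLP); the layer count and width are illustrative assumptions, not a proposed configuration.

```python
import math

def block_params(d):
    # Approximate parameters in one transformer block:
    # attention (Q, K, V, output projections) ~ 4*d^2,
    # MLP with 4x expansion ~ 8*d^2; norms and biases are negligible here.
    return 12 * d * d

# Baseline: 12 unique (untied) layers of width 768 (illustrative numbers).
layers, d = 12, 768
baseline = layers * block_params(d)

# Shared alternative: one block applied 12 times, with all saved bytes
# reinvested into width at the same parameter budget.
d_shared = int(d * math.sqrt(layers))  # solve 12*w^2 = layers * 12*d^2 for w
shared = block_params(d_shared)

print(d_shared)            # 2660: roughly 3.5x wider at the same budget
print(shared <= baseline)  # True: stays within the baseline byte budget
```

The square-root relationship is the core of the exchange: tying L layers down to one block buys a factor of sqrt(L) in width before the budget is exhausted.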
Why this is plausible
This is a concrete descendant of recursive width scaling.
The literature supports each part of the stack:
- Relaxed Recursive Transformers suggests deep sharing can work if recurrence is not treated too rigidly. (Bae et al., 2024)
- MoEUT suggests recurrent/shared computation becomes stronger when capacity is allocated more intelligently. (Csordás et al., 2024)
- Extra RMSNorm suggests low-bit stability often depends on how activations are normalized before sensitive projections. (Steinmetz et al., 2025)
- pQuant suggests uniform low-bit treatment wastes quality on the most sensitive parameters. (Zhang et al., 2026)
Concrete design sketch
One family of designs looks like:
- a wide shared transformer block rather than many thinner unique blocks
- repeated application across depth steps
- optional phase-conditioned sharing so each recurrence step is not forced to behave identically
- selective higher-precision protection for the subset that proves most fragile
This should be viewed as a compute-for-storage exchange (extra forward-pass applications of the shared block in return for fewer stored parameters) rather than just a smaller network.
What has to be true for it to win
- the wider shared block must outperform the narrower unique-depth baseline after compression
- repeated reuse must not destabilize training too badly
- extra width must survive roundtrip export rather than only helping pre-quant metrics
- any saved bytes must be spent where they matter most
- cheap specialization must recover enough phase-specific behavior when strict sharing is too rigid
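The roundtrip and byte-allocation conditions can be probed with a toy experiment: symmetric int4 fake-quantization of a heavy-tailed weight matrix, once uniformly and once with the largest-magnitude outliers kept in full precision. The 1% outlier fraction and the synthetic weight distribution are assumptions chosen to mimic outlier-heavy activations, not measured values.

```python
import numpy as np

def fake_quant(w, bits=4):
    # Symmetric per-tensor quantize-dequantize roundtrip.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).clip(-qmax, qmax) * scale

def quant_keep_outliers(w, bits=4, frac=0.01):
    # Spend saved bytes on the most fragile weights: keep the
    # top-frac magnitudes in full precision, quantize the rest.
    k = max(1, int(frac * w.size))
    cut = np.sort(np.abs(w), axis=None)[-k]
    mask = np.abs(w) >= cut
    out = fake_quant(np.where(mask, 0.0, w), bits)
    return np.where(mask, w, out)

rng = np.random.default_rng(0)
# Heavy-tailed weights: mostly small values plus rare large outliers.
w = rng.standard_normal((256, 256))
w.flat[rng.choice(w.size, 64, replace=False)] *= 25.0

err_uniform = np.abs(w - fake_quant(w)).mean()
err_outlier = np.abs(w - quant_keep_outliers(w)).mean()
print(err_outlier < err_uniform)  # True: protection shrinks roundtrip error
```

The mechanism matters for the wider shared block: a handful of outliers inflates the uniform quantization scale for every weight, so the cheap path fails exactly where pQuant-style decoupled protection is expected to pay off.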
Main risks
- the shared block may be forced to do too many incompatible jobs
- repeated reuse may amplify optimization instability
- wider activations may worsen outlier behavior unless normalization and precision protection are good enough
- early wins may disappear if the architecture is too fragile at longer horizons
Related
- Recursive width scaling
- Phase-conditioned sharing
- Iterative refinement over stored depth
- Recursive layer sharing
- Shared depth needs cheap specialization
- Outlier-aware compression
Bae, S., Fisch, A., Harutyunyan, H., Ji, Z., Kim, S., & Schuster, T. (2024). Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA. arXiv Preprint arXiv:2410.20672. https://arxiv.org/abs/2410.20672
Csordás, R., Irie, K., Schmidhuber, J., Potts, C., & Manning, C. D. (2024). MoEUT: Mixture-of-Experts Universal Transformers. arXiv Preprint arXiv:2405.16039. https://arxiv.org/abs/2405.16039
Steinmetz, C., Childress, G., Herbst, A., Jones, G., Singh, J., Vang, E., & Weinstock, K. (2025). An Extra RMSNorm is All You Need for Fine Tuning to 1.58 Bits. arXiv Preprint arXiv:2505.08823. https://arxiv.org/abs/2505.08823
Zhang, W., Liu, B., Hu, Y., Bai, X., Zhang, W., & Cui, B. (2026). pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training. arXiv Preprint arXiv:2602.22592. https://arxiv.org/abs/2602.22592