Hypothesis
A shared-depth transformer may recover much of the benefit of unique layers if the only step-specific parameters are:
- per-step RMSNorm gains
- tiny per-step channel gates before the most sensitive projections
while everything else stays shared.
This is more aggressive than Phase-conditioned sharing because it bets that norms and gates alone may capture most of the missing role information.
Mechanism sketch
A minimal implementation would use:
- one shared attention + MLP block repeated across depth
- step embeddings or step IDs only as inputs to a norm/gate path
- per-step RMSNorm parameters before attention and MLP projections
- optional rank-1 or diagonal channel gates instead of LoRA modules
The key claim is that the cheapest useful specialization might live in activation geometry, not in extra weight matrices.
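A minimal sketch of this wiring in PyTorch follows. All module and dimension names here (StepwiseRMSNorm, SharedBlock, d_model, n_steps) are illustrative assumptions rather than an existing implementation, and masking/caching details are omitted; the point is only that every weight matrix is shared while the norms and diagonal gates are indexed by step id.

```python
import torch
import torch.nn as nn


class StepwiseRMSNorm(nn.Module):
    """RMSNorm whose gain vector is selected by recurrence step."""

    def __init__(self, d_model: int, n_steps: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gain = nn.Parameter(torch.ones(n_steps, d_model))  # per-step gains only

    def forward(self, x: torch.Tensor, step: int) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.gain[step]


class SharedBlock(nn.Module):
    """One attention + MLP block reused at every depth step.

    All weight matrices are shared across steps; the only step-specific
    parameters are RMSNorm gains and diagonal channel gates applied just
    before the attention and MLP input projections.
    """

    def __init__(self, d_model: int, n_heads: int, n_steps: int):
        super().__init__()
        self.norm_attn = StepwiseRMSNorm(d_model, n_steps)
        self.norm_mlp = StepwiseRMSNorm(d_model, n_steps)
        # Diagonal gates: one vector per step, initialized to the identity.
        self.gate_attn = nn.Parameter(torch.ones(n_steps, d_model))
        self.gate_mlp = nn.Parameter(torch.ones(n_steps, d_model))
        # Shared weights (attention masking/caching omitted for brevity).
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor, step: int) -> torch.Tensor:
        h = self.norm_attn(x, step) * self.gate_attn[step]
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm_mlp(x, step) * self.gate_mlp[step]
        return x + self.mlp(h)


# The trunk is one block unrolled n_steps times with its step id:
# block = SharedBlock(d_model=512, n_heads=8, n_steps=12)
# for t in range(12):
#     x = block(x, step=t)
```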
Why this might work
This idea connects three observations that are usually discussed separately:
- Extra RMSNorm suggests low-bit success depends strongly on activation scale control before linear projections (Steinmetz et al., 2025)
- Relaxed Recursive Transformers suggests strict sharing is too rigid, but the recovery path does not need to be large (Bae et al., 2024)
- Fine-grained Parameter Sharing suggests efficient specialization may come from structured, low-dimensional corrections rather than full unsharing (Üyük et al., 2024)
Taken together, these observations suggest the missing ingredient in shared depth may be less “new weights per layer” and more “a cheap way to rotate or rescale features differently by phase.”
Evidence threads
- Recursive and shared-parameter architectures already treats light specialization as the key to making recurrence competitive.
- Quantization and outliers implies the best specialization path is one that also improves compression robustness.
- Normalization before projections supports the view that tiny scale changes can have outsized low-bit effects.
What would falsify it
This hypothesis should lose credibility if:
- norm-and-gate-only recurrence remains far behind even tiny LoRA-based relaxation at the same byte cost
- the gains appear before compression but disappear after roundtrip export
- different depth phases need genuine subspace changes that diagonal or norm-only controls cannot express
- the specialized norms themselves prove fragile under aggressive compression
Why it matters under the 16 MB cap
Per-step RMSNorm and diagonal gates cost kilobytes, not megabytes. If they recover even a modest fraction of the gap between strict sharing and unique layers, they may dominate the storage trade-off.
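A back-of-the-envelope check of that claim, where every number is an illustrative assumption (d_model, step count, fp16 storage) rather than a figure from this note:

```python
# Back-of-the-envelope byte cost; all dimensions below are assumed for
# illustration, not taken from this note.
d_model, n_steps, bytes_per_param = 2048, 24, 2  # fp16 storage

per_step = 2 * d_model + 2 * d_model            # two RMSNorm gains + two diagonal gates
extra_params = n_steps * per_step               # 196,608 parameters
extra_bytes = extra_params * bytes_per_param    # ~384 KiB

# Reference point for the falsification test "at the same byte cost":
# a rank-1 LoRA on a single d_model x d_model projection also costs
# 2 * d_model params per step, i.e. the same order of magnitude.
lora_rank1_bytes = n_steps * 2 * d_model * bytes_per_param  # ~192 KiB

print(f"norm+gate extras: {extra_bytes / 1024:.0f} KiB of a 16 MB cap")
print(f"rank-1 LoRA (one projection): {lora_rank1_bytes / 1024:.0f} KiB")
```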
This is exactly the kind of mechanism the cap rewards:
- almost no artifact overhead
- clean compatibility with shared-depth trunks
- likely good compressibility because the added parameters are tiny and structured
Related
- Phase-conditioned sharing
- RMSNorm stabilized scaling
- Recursive and shared-parameter architectures
- Quantization and outliers
- Shared depth needs cheap specialization