Hypothesis
A shared-depth transformer may recover much of the benefit of unique layers if the only step-specific parameters are:
- per-step RMSNorm gains
- tiny per-step channel gates before the most sensitive projections
while everything else stays shared.
This is more aggressive than Phase-conditioned sharing because it bets that norms and gates alone may capture most of the missing role information.
Mechanism sketch
A minimal implementation would use:
- one shared attention + MLP block repeated across depth
- step embeddings or step IDs only as inputs to a norm/gate path
- per-step RMSNorm parameters before attention and MLP projections
- optional rank-1 or diagonal channel gates instead of LoRA modules
The key claim is that the cheapest useful specialization might live in activation geometry, not in extra weight matrices.
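A minimal sketch of this wiring in PyTorch follows. All module and dimension names here (StepwiseRMSNorm, SharedBlock, d_model, n_steps) are illustrative assumptions rather than an existing implementation, and masking/caching details are omitted; the point is only that every weight matrix is shared while the norms and diagonal gates are indexed by step id.

```python
import torch
import torch.nn as nn


class StepwiseRMSNorm(nn.Module):
    """RMSNorm whose gain vector is selected by recurrence step."""

    def __init__(self, d_model: int, n_steps: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gain = nn.Parameter(torch.ones(n_steps, d_model))  # per-step gains only

    def forward(self, x: torch.Tensor, step: int) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.gain[step]


class SharedBlock(nn.Module):
    """One attention + MLP block reused at every depth step.

    All weight matrices are shared across steps; the only step-specific
    parameters are RMSNorm gains and diagonal channel gates applied just
    before the attention and MLP input projections.
    """

    def __init__(self, d_model: int, n_heads: int, n_steps: int):
        super().__init__()
        self.norm_attn = StepwiseRMSNorm(d_model, n_steps)
        self.norm_mlp = StepwiseRMSNorm(d_model, n_steps)
        # Diagonal gates: one vector per step, initialized to the identity.
        self.gate_attn = nn.Parameter(torch.ones(n_steps, d_model))
        self.gate_mlp = nn.Parameter(torch.ones(n_steps, d_model))
        # Shared weights (attention masking/caching omitted for brevity).
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor, step: int) -> torch.Tensor:
        h = self.norm_attn(x, step) * self.gate_attn[step]
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm_mlp(x, step) * self.gate_mlp[step]
        return x + self.mlp(h)


# The trunk is one block unrolled n_steps times with its step id:
# block = SharedBlock(d_model=512, n_heads=8, n_steps=12)
# for t in range(12):
#     x = block(x, step=t)
```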
Why this might work
This idea connects three observations that are usually discussed separately:
- Extra RMSNorm suggests low-bit success depends strongly on activation scale control before linear projections (Steinmetz et al., 2025)
- Relaxed Recursive Transformers suggests strict sharing is too rigid, but the recovery path does not need to be large (Bae et al., 2024)
- Fine-grained Parameter Sharing suggests efficient specialization may come from structured, low-dimensional corrections rather than full unsharing (Üyük et al., 2024)
Taken together, these observations suggest the missing ingredient in shared depth may be less “new weights per layer” and more “a cheap way to rotate or rescale features differently by phase.”
Evidence threads
- Recursive and shared-parameter architectures already treats light specialization as the key to making recurrence competitive.
- Quantization and outliers implies the best specialization path is one that also improves compression robustness.
- Normalization before projections supports the view that tiny scale changes can have outsized low-bit effects.
What would falsify it
This hypothesis should lose credibility if:
- norm-and-gate-only recurrence remains far behind even tiny LoRA-based relaxation at the same byte cost
- the gains appear before compression but disappear after roundtrip export
- different depth phases need genuine subspace changes that diagonal or norm-only controls cannot express
- the specialized norms themselves prove fragile under aggressive compression
Why it matters under the 16 MB cap
Per-step RMSNorm and diagonal gates cost kilobytes, not megabytes. If they recover even a modest fraction of the gap between strict sharing and unique layers, they may dominate the storage trade-off.
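A back-of-the-envelope check of that claim, where every number is an illustrative assumption (d_model, step count, fp16 storage) rather than a figure from this note:

```python
# Back-of-the-envelope byte cost; all dimensions below are assumed for
# illustration, not taken from this note.
d_model, n_steps, bytes_per_param = 2048, 24, 2  # fp16 storage

per_step = 2 * d_model + 2 * d_model            # two RMSNorm gains + two diagonal gates
extra_params = n_steps * per_step               # 196,608 parameters
extra_bytes = extra_params * bytes_per_param    # ~384 KiB

# Reference point for the falsification test "at the same byte cost":
# a rank-1 LoRA on a single d_model x d_model projection also costs
# 2 * d_model params per step, i.e. the same order of magnitude.
lora_rank1_bytes = n_steps * 2 * d_model * bytes_per_param  # ~192 KiB

print(f"norm+gate extras: {extra_bytes / 1024:.0f} KiB of a 16 MB cap")
print(f"rank-1 LoRA (one projection): {lora_rank1_bytes / 1024:.0f} KiB")
```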
This is exactly the kind of mechanism the cap rewards:
- almost no artifact overhead
- clean compatibility with shared-depth trunks
- likely good compressibility because the added parameters are tiny and structured
Related
- Phase-conditioned sharing
- RMSNorm stabilized scaling
- Recursive and shared-parameter architectures
- Quantization and outliers
- Shared depth needs cheap specialization