Hypothesis

A shared-depth transformer may recover much of the benefit of unique layers if the only step-specific parameters are:

  • per-step RMSNorm gains
  • tiny per-step channel gates before the most sensitive projections

and everything else stays shared.

This is more aggressive than phase-conditioned sharing because it bets that norms and gates alone may capture most of the missing role information.

Mechanism sketch

A minimal implementation would use:

  • one shared attention + MLP block repeated across depth
  • step embeddings or step IDs only as inputs to a norm/gate path
  • per-step RMSNorm parameters before attention and MLP projections
  • optional rank-1 or diagonal channel gates instead of LoRA modules

The key claim is that the cheapest useful specialization might live in activation geometry, not in extra weight matrices.
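The mechanism above can be sketched in a few lines. This is a minimal numpy illustration, not an implementation: the dimensions, variable names (`W_shared`, `norm_gains`, `gates`), and single shared projection are all illustrative assumptions standing in for a full attention + MLP block.

```python
import numpy as np

D, STEPS = 64, 4  # illustrative width and recurrence depth
rng = np.random.default_rng(0)

# Shared weights: one projection reused at every depth step.
W_shared = rng.standard_normal((D, D)) / np.sqrt(D)

# The only step-specific parameters: a gain vector and a gate vector per step.
norm_gains = np.ones((STEPS, D))  # per-step RMSNorm gains
gates = np.ones((STEPS, D))       # per-step diagonal channel gates

def rmsnorm(x, gain, eps=1e-6):
    # RMSNorm: rescale by root-mean-square, then apply a learned gain.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gain * x / rms

def forward(x):
    # Repeat the shared block across depth; only norms and gates vary by step.
    for t in range(STEPS):
        h = rmsnorm(x, norm_gains[t])
        h = gates[t] * h              # cheap per-step specialization
        x = x + h @ W_shared          # residual through shared weights
    return x

x = rng.standard_normal((2, D))
y = forward(x)
print(y.shape)  # (2, 64)
```

Note that the step-specific state is `2 * STEPS * D` parameters in total, versus `STEPS * D * D` for unique projections, which is the whole storage argument in miniature.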

Why this might work

This idea connects three observations that are usually discussed separately:

  • recursive, layer-shared transformers recover much of the quality of unique layers once small layer-wise deltas (e.g., LoRA) are allowed
  • an extra RMSNorm alone can be enough to stabilize fine-tuning to 1.58-bit weights, which suggests norm parameters carry outsized adaptation capacity
  • learned parameter sharing works best when paired with small, structured per-layer factors such as tensor decompositions and sparsity masks

Put together, these observations suggest the missing ingredient in shared depth may be less “new weights per layer” and more “a cheap way to rotate or rescale features differently by phase.”

Evidence threads

  • Bae et al. (2024): Relaxed Recursive Transformers show that layer-wise LoRA recovers much of the gap between shared and unique layers
  • Steinmetz et al. (2025): an extra RMSNorm is enough to make fine-tuning to 1.58-bit weights stable
  • Üyük et al. (2024): parameter sharing can be learned jointly with tensor decompositions and sparsity

What would falsify it

This hypothesis should lose credibility if:

  1. norm-and-gate-only recurrence remains far behind even tiny LoRA-based relaxation at the same byte cost
  2. the gains appear before compression but disappear after roundtrip export
  3. different depth phases need genuine subspace changes that diagonal or norm-only controls cannot express
  4. the specialized norms themselves prove fragile under aggressive compression
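Falsification test 1 hinges on comparing against tiny LoRA "at the same byte cost." A quick back-of-envelope sketch shows what byte parity implies; the width, norm/gate counts, and fp16 storage here are all illustrative assumptions, not measured values.

```python
# What LoRA rank fits in the same byte budget as per-step norms and gates?
D = 2048             # assumed model width
BYTES = 2            # fp16 storage per parameter
NORMS_PER_STEP = 2   # pre-attention and pre-MLP RMSNorm gains
GATES_PER_STEP = 2   # diagonal gates before the two sensitive projections

gate_norm_params = (NORMS_PER_STEP + GATES_PER_STEP) * D
budget_bytes = gate_norm_params * BYTES

# A rank-r LoRA on one D x D projection costs 2 * r * D parameters.
lora_rank_at_parity = budget_bytes // (2 * D * BYTES)
print(gate_norm_params, budget_bytes, lora_rank_at_parity)  # 8192 16384 2
```

Under these assumptions, the norm-and-gate budget buys only a rank-2 LoRA on a single projection, which is why losing to tiny LoRA at parity would be a meaningful negative result.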

Why it matters under the 16 MB cap

Per-step RMSNorm gains and diagonal gates cost kilobytes, not megabytes. If they recover even a modest fraction of the gap between strict sharing and unique layers, they may dominate the storage trade-off.
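The kilobytes-versus-megabytes claim is easy to check arithmetically. The configuration below (width 2048, 8 recurrence steps, fp16 storage) is an illustrative assumption chosen to make the contrast concrete.

```python
# Back-of-envelope storage cost of per-step norms and gates vs. unique weights.
D, STEPS, BYTES = 2048, 8, 2          # assumed width, depth, fp16 bytes/param

per_step = 4 * D                      # 2 norm-gain + 2 gate vectors per step
total_params = STEPS * per_step
total_kib = total_params * BYTES / 1024

unique_mib = D * D * BYTES / 2**20    # one unique D x D projection, for contrast
print(total_params, total_kib, unique_mib)  # 65536 128.0 8.0
```

Even at this width, all step-specific norms and gates together (128 KiB) cost less than 2% of a single unique projection matrix (8 MiB), before any compression.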

This is exactly the kind of mechanism the cap rewards:

  • almost no artifact overhead
  • clean compatibility with shared-depth trunks
  • likely good compressibility because the added parameters are tiny and structured

References

Bae, S., Fisch, A., Harutyunyan, H., Ji, Z., Kim, S., & Schuster, T. (2024). Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA. arXiv Preprint arXiv:2410.20672. https://arxiv.org/abs/2410.20672
Steinmetz, C., Childress, G., Herbst, A., Jones, G., Singh, J., Vang, E., & Weinstock, K. (2025). An Extra RMSNorm is All You Need for Fine Tuning to 1.58 Bits. arXiv Preprint arXiv:2505.08823. https://arxiv.org/abs/2505.08823
Üyük, C., Lasby, M., Yassin, M., Evci, U., & Ioannou, Y. (2024). Learning Parameter Sharing with Tensor Decompositions and Sparsity. arXiv Preprint arXiv:2411.09816. https://arxiv.org/abs/2411.09816