A surprising cross-paper pattern is emerging between two literatures that are often discussed separately:
- low-bit stability papers like Extra RMSNorm and QuEST
- shared-depth papers like Relaxed Recursive Transformers, MoEUT, and Fine-grained Parameter Sharing
They appear to be circling the same deeper problem:
if one block must serve many roles and survive harsh compression, then the model needs a better interface between activations, repeated reuse, and stored weights.
Why this seam matters now
The basic promise of recursive sharing is clear: store fewer unique blocks, spend the savings on width, light specialization, or protected precision. But the failure mode is equally clear (see Recursive layer sharing): one shared block is forced to play incompatible roles at different depths.
At the same time, the low-bit papers keep finding that the easiest way to make fragile weights behave is to stabilize what flows into them:
- Extra RMSNorm says pre-projection normalization can rescue extremely low-bit fine-tuning. (Steinmetz et al., 2025)
- QuEST says distribution fit and backward trust matter because low-bit training is a dynamics problem, not just a storage problem. (Panferov et al., 2025)
- BitNet b1.58 treats constrained linear structure and RMSNorm-like choices as core design, not garnish. (Wang et al., 2024)
The seam is the interaction: repeated shared blocks likely amplify precisely the scale drift and role drift that low-bit methods hate.
The central synthesis
A recursive/shared block may need two things at once:
- activation discipline so repeated application does not produce scale chaos
- cheap role hints so the same stored weights can act a little differently at different depths or phases
That suggests a stronger version of recursive width scaling:
shared depth is not mainly limited by lack of capacity; it is limited by lack of a robust compression interface.
The phrase “compression interface” matters here. The relevant question is not only whether the full-precision shared model trains. It is whether the repeated block still behaves after final compression.
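That post-compression question can be made concrete with a toy check: quantize the shared projection with round-to-nearest (a crude stand-in for whatever final compression is actually used; the dimensions, seed, and per-row scheme here are illustrative, not any paper's recipe) and watch how far the repeated block drifts from its full-precision trajectory.

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    # Scale the vector to unit root-mean-square.
    return x / np.sqrt(np.mean(x * x) + eps)

def rtn_quantize(w, bits=4):
    # Per-row symmetric round-to-nearest quantization:
    # a stand-in for the final compression step.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
d = 64
w = rng.normal(0, d ** -0.5, (d, d))
wq = rtn_quantize(w)

# The interface question: after compression, does the *repeated*
# block still track the full-precision trajectory?
xf = xq = rng.normal(size=d)
divergence = []
for _ in range(8):
    xf, xq = w @ rmsnorm(xf), wq @ rmsnorm(xq)
    divergence.append(float(np.linalg.norm(xf - xq) / np.linalg.norm(xf)))

print("per-reuse relative divergence:", [round(e, 3) for e in divergence])
```

A single application of `wq` looks almost identical to `w`; the point of the check is that reuse turns a small per-step error into trajectory-level drift.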
What the papers imply together
Extra RMSNorm + Relaxed Recursive Transformers
If pre-projection normalization reduces fragility in low-bit settings, then recursive models may need it even more, because the same projections are reused many times: a scale mismatch that is negligible in a single application can compound across reuses.
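The compounding claim is easy to demonstrate numerically. In the sketch below (the dimensions, the 1.3 miscalibration factor, and the seed are arbitrary choices, not values from either paper), a shared projection whose scale is slightly off explodes under repeated application, while an RMSNorm inserted before the projection keeps activations bounded at every step.

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    # Scale the vector to unit root-mean-square.
    return x / np.sqrt(np.mean(x * x) + eps)

rng = np.random.default_rng(1)
d = 128
# Shared projection with a small scale miscalibration (factor 1.3):
# harmless once, compounding when the same block is reused.
w = rng.normal(0, d ** -0.5, (d, d)) * 1.3

x_raw = rng.normal(size=d)
x_norm = x_raw.copy()
raw_scale, norm_scale = [], []
for _ in range(32):
    x_raw = w @ x_raw                 # no interface: drift compounds
    x_norm = w @ rmsnorm(x_norm)      # pre-projection norm: stays bounded
    raw_scale.append(float(np.linalg.norm(x_raw)))
    norm_scale.append(float(np.linalg.norm(x_norm)))

print(f"no norm:  {raw_scale[-1]:.2e}")
print(f"pre-norm: {norm_scale[-1]:.2e}")
```

The same 1.3 factor that a 32-layer non-shared stack could absorb layer by layer becomes a geometric blowup when it belongs to the one block everything reuses.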
QuEST + MoEUT
QuEST says low-bit robustness depends on forward/backward distribution control. MoEUT says shared-depth models become viable when they get better normalization and sparse extra capacity. Put together: recurrence is not just a parameter-sharing problem; it is a quantization-stability problem.
Fine-grained Parameter Sharing + Relaxed Recursive Transformers
Both papers weaken the naive “share everything identically” story. They imply that the missing ingredient is often tiny, structured deviation from pure sharing.
A falsifiable thesis
Thesis: recursive models will benefit disproportionately from a combination of pre-projection normalization and tiny phase-specific adaptation, compared with non-shared baselines at the same byte budget.
That means the key interaction is not “recurrence vs no recurrence” in isolation. It is:
- recurrence
- plus normalization that tames repeated reuse
- plus micro-specialization that resolves role conflict
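As a rough sense of scale for the micro-specialization term (a hypothetical configuration, not a measured model): per-step diagonal scale vectors on one shared projection cost about one percent of the shared block's parameters, which is exactly the kind of byte cost the thesis bets on.

```python
import numpy as np

d, n_steps = 1024, 12

# Shared block: one stored d x d projection, reused n_steps times.
shared_params = d * d
# Micro-specialization: one diagonal scale vector per reuse (a "role hint").
per_step_params = n_steps * d
overhead = per_step_params / shared_params
print(f"role-hint overhead: {overhead:.2%} of the shared block")

# Forward sketch: identical stored weights, per-step role hints.
rng = np.random.default_rng(2)
w = rng.normal(0, d ** -0.5, (d, d))
step_scale = np.ones((n_steps, d))   # would be learned; identity here
x = rng.normal(size=d)
for t in range(n_steps):
    x = np.tanh(w @ (step_scale[t] * x))
```

Low-rank adapters or gates change the constant but not the conclusion: the deviation from pure sharing is tiny in bytes.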
What would support it
- adding extra RMSNorm helps shared-depth models more than equally sized non-shared models
- tiny per-step scales, gates, or low-rank adapters recover much more quality in shared models than their byte cost would suggest
- the compressed shared model degrades less than expected once these interface pieces are present
What would falsify it
- shared models still collapse even with explicit normalization and micro-specialization
- benefits appear only before compression, not after
- wider shared blocks plus interface tricks still lose to simply storing more unique depth
The strongest new idea hiding here
The most interesting direction is not “recursive models with adapters.” It is phase-conditioned compression interfaces.
That would treat a recurrent block as a common core with three extremely cheap surrounding structures:
- pre-projection normalization to keep inputs in range
- a phase/depth signal to resolve role ambiguity
- a tiny protected pathway for the few tensors that cannot survive pure sharing
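A minimal sketch of how those three structures could wrap one shared core, assuming a toy 4-bit round-to-nearest core and a "keep the largest-magnitude columns in full precision" rule for the protected pathway (the class name, the quantizer, and the protection rule are all hypothetical illustrations, not a published design):

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    # Scale the vector to unit root-mean-square.
    return x / np.sqrt(np.mean(x * x) + eps)

class PhaseConditionedBlock:
    """One shared core plus the three cheap surrounding structures."""

    def __init__(self, d, n_phases, n_protected=8, bits=4, seed=0):
        rng = np.random.default_rng(seed)
        w = rng.normal(0, d ** -0.5, (d, d))
        # (3) tiny protected pathway: a few columns stay full precision.
        self.protected = np.argsort(np.abs(w).max(axis=0))[-n_protected:]
        qmax = 2 ** (bits - 1) - 1
        scale = np.abs(w).max(axis=1, keepdims=True) / qmax
        self.w = np.round(w / scale) * scale          # compressed core
        self.w[:, self.protected] = w[:, self.protected]
        # (2) phase signal: one cheap scale vector per depth/phase.
        self.phase_scale = np.ones((n_phases, d))     # would be learned

    def __call__(self, x, phase):
        # (1) pre-projection normalization keeps inputs in range.
        h = rmsnorm(x) * self.phase_scale[phase]
        return np.tanh(self.w @ h)

block = PhaseConditionedBlock(d=64, n_phases=6)
x = np.random.default_rng(1).normal(size=64)
for t in range(6):
    x = block(x, phase=t)
```

An ablation would toggle each of the three structures independently, which is what makes the complements-versus-competing-tweaks prediction testable.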
This connects directly to:
- RMSNorm stabilized scaling
- Recursive width scaling
- Recurrent wide architecture
- Sparse outlier preservation
The key prediction is that these mechanisms are complements, not competing tweaks.
Why this seam is better than a generic recursion pitch
A naive recursion pitch says: “depth is free if weights are shared.” The frontier version is stricter:
- depth is only partly free
- repeated reuse creates new instability and specialization problems
- those problems may be solvable with tiny interface structures rather than by abandoning sharing
That is a much more testable and realistic claim.
Experiments this frontier suggests
- compare shared-depth vs non-shared models with and without extra pre-projection normalization
- add tiny phase-conditioned scales or gates and measure post-compression recovery per byte
- test whether protected precision is especially valuable on the shared block rather than on the whole model
- measure whether recursive models show steeper sensitivity to activation scale drift than non-shared baselines
- compare “one shared block + interface” against “more unique thin blocks” at equal final bytes
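The last comparison needs a concrete byte accounting. A hypothetical example (all sizes illustrative): at 4 bits per weight, one shared d x d core plus fp16 per-step scale vectors fixes a budget, and matching that budget with twelve unique blocks forces each unique block to be much thinner.

```python
# Equal-final-bytes accounting: one shared 4-bit block reused 12 times,
# plus fp16 interface pieces, versus 12 unique thin 4-bit blocks.
d, depth = 2048, 12

shared_bytes = d * d * 4 // 8             # one 4-bit d x d core
interface_bytes = depth * d * 2           # fp16 per-step scale vectors
budget = shared_bytes + interface_bytes

# Width of a unique square block that fits the same per-layer bytes
# (4-bit weights = 0.5 bytes per parameter).
unique_bytes_per_block = budget // depth
d_thin = int((unique_bytes_per_block / 0.5) ** 0.5)

print(f"budget bytes: {budget}, matched unique width: {d_thin}")
# → budget bytes: 2146304, matched unique width: 598
```

The experiment then asks which side of that accounting wins on quality: one wide shared block with its interface, or twelve unique blocks at width 598.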
Why this could matter a lot for Parameter Golf
If this frontier is right, then the most powerful architecture move may not be storing radically new mechanisms. It may be storing one stronger repeated mechanism with a smarter interface.
That is exactly the sort of move a hard artifact cap should reward.
Related
- Recursive and shared-parameter architectures
- Recursive layer sharing
- Normalization before projections
- Recursive width scaling
- RMSNorm stabilized scaling
- Recurrent wide architecture