Hypothesis
A shared-depth transformer can recover much of the benefit of unique layers if each recurrence step receives a small amount of phase-specific conditioning, such as scales, gates, embeddings, or very small adapters.
Why this is plausible
Strict recurrence often fails because early, middle, and late depth steps want different behavior, yet fully unique layers are a very expensive way to buy that specialization.
The middle ground is:
- keep the heavy weights shared
- let each step receive a small role signal
- spend only a tiny fraction of the bytes that full unsharing would require
This is conceptually aligned with both Relaxed Recursive Transformers and Fine-grained Parameter Sharing.
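The shape of this middle ground can be made concrete with a minimal sketch (illustrative NumPy, not any particular implementation): one heavy weight matrix is reused at every recurrence step, and the only per-step parameters are tiny learned step embeddings that act as the role signal. The sizes and the embedding-injection point are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D, R = 64, 4  # hidden width, number of recurrence steps (illustrative)

# One shared "heavy" block: a single weight matrix reused at every step.
W_shared = rng.normal(0, 0.02, size=(D, D))

# Tiny per-step role signal: one learned embedding vector per step.
step_emb = rng.normal(0, 0.02, size=(R, D))

def block(x, step):
    # Same heavy weights at every step; only the added embedding differs.
    h = x + step_emb[step]        # inject the step's role signal
    return np.tanh(h @ W_shared)  # shared transformation

x = rng.normal(size=(D,))
for r in range(R):
    x = block(x, r)

# Byte overhead: R*D embedding floats vs D*D floats per unique layer.
extra = step_emb.size / W_shared.size
print(f"conditioning adds {extra:.1%} of one layer's parameters in total")
```

At these toy sizes the embeddings cost about 6% of a single layer's parameters across all four steps; at realistic widths the fraction shrinks further, since embeddings grow linearly in width while weight matrices grow quadratically.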
What counts as “phase conditioning”
Potentially useful cheap specialization mechanisms include:
- per-step learned scales or biases
- recurrence-step embeddings injected into attention or MLP paths
- tiny LoRA-like adapters attached only to the most role-sensitive projections
- different normalization gains across steps
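The four mechanisms above can all be attached to one shared block, as in this hedged NumPy sketch. Every size (width, step count, LoRA rank) and the RMS-norm placement are assumptions for illustration, not a prescription.

```python
import numpy as np

rng = np.random.default_rng(1)
D, R, RANK = 64, 4, 2  # width, steps, adapter rank (illustrative)

W = rng.normal(0, 0.02, (D, D))  # shared, role-sensitive projection

# Cheap per-step specialization, one flavor per bullet above:
scale = np.ones((R, D)); bias = np.zeros((R, D))  # per-step scale/bias
A = rng.normal(0, 0.02, (R, D, RANK))             # tiny LoRA adapters,
B = np.zeros((R, RANK, D))                        #   applied as W + A[r] @ B[r]
norm_gain = np.ones((R, D))                       # per-step norm gains

def step(x, r):
    # Per-step gain on a shared RMS-style normalization.
    h = norm_gain[r] * x / np.sqrt(np.mean(x**2) + 1e-6)
    # Shared projection plus this step's low-rank correction.
    h = h @ (W + A[r] @ B[r])
    # FiLM-style per-step scale and bias on the output path.
    return scale[r] * h + bias[r]

x = rng.normal(size=(D,))
for r in range(R):
    x = step(x, r)

per_step = (scale[0].size + bias[0].size + A[0].size
            + B[0].size + norm_gain[0].size)
print(per_step, W.size)  # 448 conditioning params per step vs 4096 shared
```

Even stacking all four mechanisms, each step's conditioning here is about a tenth of the shared projection's parameter count, and the LoRA rank is the main knob for trading bytes against specialization capacity.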
Why it matters under a hard artifact cap
If this works, it creates a better trade-off than either extreme:
- cheaper than fully unique layers
- more expressive than perfectly identical recurrence
That makes it a natural bridge between recursive width scaling and recurrent wide architectures.
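A rough byte accounting makes the "better than either extreme" claim concrete. All numbers below are assumptions (a 12*D^2 parameter count per layer and fp16 storage are common ballpark figures, not measurements from the source):

```python
# Illustrative byte accounting under a hard artifact cap.
D, R = 1024, 8              # hidden width, number of depth steps (assumed)
layer = 12 * D**2           # rough params in one transformer layer
unique = R * layer          # fully unique depth: R distinct layers
shared = layer              # strict recurrence: one layer, reused
cond = layer + R * (4 * D)  # shared + per-step scales/biases/gains/embeddings

print(f"unique:      {unique * 2 / 2**20:.0f} MiB (fp16)")
print(f"shared:      {shared * 2 / 2**20:.0f} MiB")
print(f"conditioned: {cond * 2 / 2**20:.2f} MiB "
      f"(+{(cond - shared) / shared:.3%} over strict sharing)")
```

Under these assumptions the conditioned model pays well under one percent extra over strict sharing, while full unsharing costs eight times the bytes; that asymmetry is the entire bet.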
What would support it
- shared-depth models recovering most of the gap to unique-depth baselines with very small extra bytes
- better behavior at deeper recurrence counts than strict sharing achieves
- improved post-compression performance when the conditioning parameters are themselves cheap to store
Main risks
- the conditioning parameters grow large or numerous enough to erode the byte savings
- gains come from the extra capacity the conditioning adds rather than from genuinely efficient specialization
- step-specific parameters become fragile under aggressive compression