Hypothesis
A shared-depth transformer can recover much of the benefit of unique layers if each recurrence step receives a small amount of phase-specific conditioning, such as scales, gates, embeddings, or very small adapters.
Why this is plausible
Strict recurrence often fails because early, middle, and late depth steps want different behavior, yet fully unique layers are a very expensive way to buy that specialization.
The middle ground is:
- keep the heavy weights shared
- let each step receive a small role signal
- spend only a tiny fraction of the bytes that full unsharing would require
This is conceptually aligned with both Relaxed Recursive Transformers and Fine-grained Parameter Sharing.
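The shape of this middle ground can be made concrete with a minimal sketch (illustrative NumPy, not any particular implementation): one heavy weight matrix is reused at every recurrence step, and the only per-step parameters are tiny learned step embeddings that act as the role signal. The sizes and the embedding-injection point are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D, R = 64, 4  # hidden width, number of recurrence steps (illustrative)

# One shared "heavy" block: a single weight matrix reused at every step.
W_shared = rng.normal(0, 0.02, size=(D, D))

# Tiny per-step role signal: one learned embedding vector per step.
step_emb = rng.normal(0, 0.02, size=(R, D))

def block(x, step):
    # Same heavy weights at every step; only the added embedding differs.
    h = x + step_emb[step]        # inject the step's role signal
    return np.tanh(h @ W_shared)  # shared transformation

x = rng.normal(size=(D,))
for r in range(R):
    x = block(x, r)

# Byte overhead: R*D embedding floats vs D*D floats per unique layer.
extra = step_emb.size / W_shared.size
print(f"conditioning adds {extra:.1%} of one layer's parameters in total")
```

At these toy sizes the embeddings cost about 6% of a single layer's parameters across all four steps; at realistic widths the fraction shrinks further, since embeddings grow linearly in width while weight matrices grow quadratically.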
What counts as “phase conditioning”
Potentially useful cheap specialization mechanisms include:
- per-step learned scales or biases
- recurrence-step embeddings injected into attention or MLP paths
- tiny LoRA-like adapters attached only to the most role-sensitive projections
- different normalization gains across steps
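The four mechanisms above can all be attached to one shared block, as in this hedged NumPy sketch. Every size (width, step count, LoRA rank) and the RMS-norm placement are assumptions for illustration, not a prescription.

```python
import numpy as np

rng = np.random.default_rng(1)
D, R, RANK = 64, 4, 2  # width, steps, adapter rank (illustrative)

W = rng.normal(0, 0.02, (D, D))  # shared, role-sensitive projection

# Cheap per-step specialization, one flavor per bullet above:
scale = np.ones((R, D)); bias = np.zeros((R, D))  # per-step scale/bias
A = rng.normal(0, 0.02, (R, D, RANK))             # tiny LoRA adapters,
B = np.zeros((R, RANK, D))                        #   applied as W + A[r] @ B[r]
norm_gain = np.ones((R, D))                       # per-step norm gains

def step(x, r):
    # Per-step gain on a shared RMS-style normalization.
    h = norm_gain[r] * x / np.sqrt(np.mean(x**2) + 1e-6)
    # Shared projection plus this step's low-rank correction.
    h = h @ (W + A[r] @ B[r])
    # FiLM-style per-step scale and bias on the output path.
    return scale[r] * h + bias[r]

x = rng.normal(size=(D,))
for r in range(R):
    x = step(x, r)

per_step = (scale[0].size + bias[0].size + A[0].size
            + B[0].size + norm_gain[0].size)
print(per_step, W.size)  # 448 conditioning params per step vs 4096 shared
```

Even stacking all four mechanisms, each step's conditioning here is about a tenth of the shared projection's parameter count, and the LoRA rank is the main knob for trading bytes against specialization capacity.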
Why it matters under a hard artifact cap
If this works, it creates a better trade-off than either extreme:
- cheaper than fully unique layers
- more expressive than perfectly identical recurrence
That makes it a natural bridge between recursive width scaling and recurrent wide architectures.
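A rough byte accounting makes the "better than either extreme" claim concrete. All numbers below are assumptions (a 12*D^2 parameter count per layer and fp16 storage are common ballpark figures, not measurements from the source):

```python
# Illustrative byte accounting under a hard artifact cap.
D, R = 1024, 8              # hidden width, number of depth steps (assumed)
layer = 12 * D**2           # rough params in one transformer layer
unique = R * layer          # fully unique depth: R distinct layers
shared = layer              # strict recurrence: one layer, reused
cond = layer + R * (4 * D)  # shared + per-step scales/biases/gains/embeddings

print(f"unique:      {unique * 2 / 2**20:.0f} MiB (fp16)")
print(f"shared:      {shared * 2 / 2**20:.0f} MiB")
print(f"conditioned: {cond * 2 / 2**20:.2f} MiB "
      f"(+{(cond - shared) / shared:.3%} over strict sharing)")
```

Under these assumptions the conditioned model pays well under one percent extra over strict sharing, while full unsharing costs eight times the bytes; that asymmetry is the entire bet.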
What would support it
- shared-depth models recovering most of the gap to unique-depth baselines with very small extra bytes
- better behavior at deeper recurrence counts than strict sharing achieves
- improved post-compression performance when the conditioning parameters are themselves cheap to store
Main risks
- the conditioning parameters grow large or numerous enough to erode the byte savings
- gains come from the extra capacity the conditioning adds rather than from genuinely efficient specialization
- step-specific parameters become fragile under aggressive compression