Core idea
The main weakness of strict layer sharing is not that reuse is wrong. It is that different depth positions often want different roles.
A recurrent transformer therefore often needs a small amount of depth-specific behavior even if most weights remain shared.
Why this matters
Without some role signal, the shared block may be forced to perform incompatible jobs:
- early-step feature formation
- mid-step mixing and routing
- late-step cleanup or prediction shaping
Making every layer unique solves that problem, but it spends bytes aggressively: N distinct depth positions cost N copies of every weight tensor. The more compact alternative keeps the heavy tensors shared and makes specialization cheap.
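To make the byte tradeoff concrete, here is a small parameter-count sketch. All sizes (`d_model`, `n_steps`, the per-block cost) are illustrative assumptions, not figures from this document:

```python
# Illustrative parameter accounting (all sizes are assumed for the example).
d_model = 1024
n_steps = 12
block_params = 12 * d_model * d_model       # rough cost of one block's big matrices

unique = n_steps * block_params             # every depth position gets its own weights
shared = block_params                       # one block reused at every step
cheap = shared + n_steps * 2 * d_model      # shared block + per-step scale and bias vectors

overhead = (cheap - shared) / shared        # specialization cost relative to the shared block
print(f"unique/shared = {unique / shared:.0f}x, cheap overhead = {overhead:.4%}")
```

Under these assumed sizes, per-step scale and bias vectors add well under one percent to the shared block, while fully unique layers multiply it twelvefold.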
Cheap specialization mechanisms
Examples of low-byte specialization include:
- per-step learned scales or biases
- recurrence-step embeddings
- tiny adapters on only the most sensitive projections
- normalization gains that vary with depth step
These mechanisms add a role signal for a tiny fraction of the shared block's byte cost, and they are the core intuition behind phase-conditioned sharing.
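The first two mechanisms can be sketched together: one shared weight matrix reused at every recurrence step, modulated by a learned per-step scale and bias. This is a minimal NumPy sketch under assumed sizes; names like `recurrent_forward` and `step_scale` are illustrative, not from the source:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_steps = 16, 4

# Shared heavy tensor: one projection reused at every recurrence step.
W_shared = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

# Cheap per-step specialization: a scale and bias vector per step.
# These would be learned in practice; identity/zero init shown here.
step_scale = np.ones((n_steps, d_model))
step_bias = np.zeros((n_steps, d_model))

def recurrent_forward(x):
    """Apply the shared block n_steps times with per-step modulation."""
    h = x
    for t in range(n_steps):
        h = np.tanh(h @ W_shared)
        h = h * step_scale[t] + step_bias[t]  # per-step role signal
    return h

out = recurrent_forward(rng.standard_normal((2, d_model)))
```

The design point is the parameter asymmetry: `step_scale` and `step_bias` together hold `2 * n_steps * d_model` values, far fewer than the `d_model**2` in the shared matrix they specialize.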
Why it composes with compression work
Cheap specialization is especially attractive when paired with:
- pre-projection normalization so repeated activations remain well behaved
- outlier-aware compression so the tiny role-specific parameters do not become the new fragile bottleneck
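The first pairing above can be illustrated with an RMS-style pre-projection normalization whose gain vector varies with the depth step, combining both ideas in one place. This is a sketch under assumed shapes, not a specific library's API:

```python
import numpy as np

def rms_norm(h, gain, eps=1e-6):
    """Normalize activations by their RMS before the shared projection,
    so repeated application of the block keeps activations well scaled."""
    rms = np.sqrt(np.mean(h * h, axis=-1, keepdims=True) + eps)
    return (h / rms) * gain

rng = np.random.default_rng(1)
d_model, n_steps = 8, 6
W = rng.standard_normal((d_model, d_model))  # deliberately unscaled

# Depth-varying normalization gains: a cheap per-step role signal.
gains = np.ones((n_steps, d_model))

h = rng.standard_normal((2, d_model))
for t in range(n_steps):
    h = rms_norm(h, gains[t]) @ W  # norm before the shared projection each step
```

Because the normalization resets the activation scale before every reuse of `W`, the repeated matrix multiply cannot compound into a blow-up, and the per-step `gains` remain the only depth-specific parameters.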