Core idea

The main weakness of strict layer sharing is not that reuse is wrong. It is that different depth positions often want different roles.

A recurrent transformer therefore often needs a small amount of depth-specific behavior even if most weights remain shared.

Why this matters

Without some role signal, the shared block may be forced to perform incompatible jobs:

  • early-step feature formation
  • mid-step mixing and routing
  • late-step cleanup or prediction shaping

Making every layer unique solves that problem, but it spends bytes aggressively. The more compact alternative is to keep the heavy tensors shared and make specialization cheap.
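As a rough illustration of the byte trade-off, the snippet below compares the parameter count of fully unique layers against one shared block plus per-step vectors. The dimensions are illustrative assumptions, not taken from any particular model:

```python
# Rough parameter-count comparison: fully unique layers vs. one shared
# block plus cheap per-step specialization. All dimensions below are
# illustrative assumptions.
d_model = 1024          # hidden width (assumed)
d_ff = 4 * d_model      # feed-forward width (common convention)
n_steps = 12            # recurrence depth / number of layers

# Heavy tensors of one transformer block: attention projections (Q, K, V, O)
# plus the two feed-forward matrices.
block_params = 4 * d_model * d_model + 2 * d_model * d_ff

# Cheap per-step specialization: one scale vector and one bias vector per step.
per_step_params = 2 * d_model

unique = n_steps * block_params
shared = block_params + n_steps * per_step_params

print(f"unique layers:  {unique:,}")
print(f"shared + cheap: {shared:,}")
print(f"ratio: {unique / shared:.1f}x")
```

Under these assumed dimensions, the specialization vectors add well under one percent to the shared block's size, while unique layers multiply the budget by the full depth.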

Cheap specialization mechanisms

Examples of low-byte specialization include:

  • per-step learned scales or biases
  • recurrence-step embeddings
  • tiny adapters on only the most sensitive projections
  • normalization gains that vary with depth step

These mechanisms capture the main intuition behind phase-conditioned sharing: the heavy tensors stay shared, and a small depth-indexed signal tells the block which role to play at each step.
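One way these pieces might fit together is sketched below in NumPy. All names, shapes, and the placeholder nonlinearity are assumptions for illustration; a real block would include attention and feed-forward sublayers:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_steps = 64, 4

# Heavy tensor: one shared projection reused at every depth step.
W_shared = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

# Cheap per-step specialization (assumed forms, a few vectors per step):
step_scale = np.ones((n_steps, d_model))                      # per-step gains
step_bias = np.zeros((n_steps, d_model))                      # per-step biases
step_embed = rng.standard_normal((n_steps, d_model)) * 0.02   # step embeddings

def recurrent_forward(x):
    """Apply the shared block n_steps times with per-step conditioning."""
    h = x
    for t in range(n_steps):
        h = h + step_embed[t]                 # tell the block which step it is
        h = h @ W_shared                      # shared heavy computation
        h = step_scale[t] * h + step_bias[t]  # per-step scale and bias
        h = np.tanh(h)                        # placeholder nonlinearity
    return h

x = rng.standard_normal((2, d_model))  # batch of 2 token vectors
out = recurrent_forward(x)
print(out.shape)  # (2, d_model), independent of recurrence depth
```

The design point is that only `W_shared` is large: the per-step vectors grow linearly in depth and width, not quadratically, so deepening the recurrence stays cheap.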

Why it composes with compression work

Cheap specialization is especially attractive when paired with: