Core idea
The main weakness of strict layer sharing is not that reuse is wrong. It is that different depth positions often want different roles.
A recurrent transformer therefore often needs a small amount of depth-specific behavior even if most weights remain shared.
Why this matters
Without some role signal, the shared block may be forced to perform incompatible jobs:
- early-step feature formation
- mid-step mixing and routing
- late-step cleanup or prediction shaping
Making every layer unique solves that problem, but it spends bytes aggressively: N distinct depth positions cost N copies of every weight tensor. The more compact alternative keeps the heavy tensors shared and makes specialization cheap.
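To make the byte tradeoff concrete, here is a small parameter-count sketch. All sizes (`d_model`, `n_steps`, the per-block cost) are illustrative assumptions, not figures from this document:

```python
# Illustrative parameter accounting (all sizes are assumed for the example).
d_model = 1024
n_steps = 12
block_params = 12 * d_model * d_model       # rough cost of one block's big matrices

unique = n_steps * block_params             # every depth position gets its own weights
shared = block_params                       # one block reused at every step
cheap = shared + n_steps * 2 * d_model      # shared block + per-step scale and bias vectors

overhead = (cheap - shared) / shared        # specialization cost relative to the shared block
print(f"unique/shared = {unique / shared:.0f}x, cheap overhead = {overhead:.4%}")
```

Under these assumed sizes, per-step scale and bias vectors add well under one percent to the shared block, while fully unique layers multiply it twelvefold.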
Cheap specialization mechanisms
Examples of low-byte specialization include:
- per-step learned scales or biases
- recurrence-step embeddings
- tiny adapters on only the most sensitive projections
- normalization gains that vary with depth step
These mechanisms add a role signal for a tiny fraction of the shared block's byte cost, and they are the core intuition behind phase-conditioned sharing.
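The first two mechanisms can be sketched together: one shared weight matrix reused at every recurrence step, modulated by a learned per-step scale and bias. This is a minimal NumPy sketch under assumed sizes; names like `recurrent_forward` and `step_scale` are illustrative, not from the source:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_steps = 16, 4

# Shared heavy tensor: one projection reused at every recurrence step.
W_shared = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

# Cheap per-step specialization: a scale and bias vector per step.
# These would be learned in practice; identity/zero init shown here.
step_scale = np.ones((n_steps, d_model))
step_bias = np.zeros((n_steps, d_model))

def recurrent_forward(x):
    """Apply the shared block n_steps times with per-step modulation."""
    h = x
    for t in range(n_steps):
        h = np.tanh(h @ W_shared)
        h = h * step_scale[t] + step_bias[t]  # per-step role signal
    return h

out = recurrent_forward(rng.standard_normal((2, d_model)))
```

The design point is the parameter asymmetry: `step_scale` and `step_bias` together hold `2 * n_steps * d_model` values, far fewer than the `d_model**2` in the shared matrix they specialize.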
Why it composes with compression work
Cheap specialization is especially attractive when paired with:
- pre-projection normalization so repeated activations remain well behaved
- outlier-aware compression so the tiny role-specific parameters do not become the new fragile bottleneck
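The first pairing above can be illustrated with an RMS-style pre-projection normalization whose gain vector varies with the depth step, combining both ideas in one place. This is a sketch under assumed shapes, not a specific library's API:

```python
import numpy as np

def rms_norm(h, gain, eps=1e-6):
    """Normalize activations by their RMS before the shared projection,
    so repeated application of the block keeps activations well scaled."""
    rms = np.sqrt(np.mean(h * h, axis=-1, keepdims=True) + eps)
    return (h / rms) * gain

rng = np.random.default_rng(1)
d_model, n_steps = 8, 6
W = rng.standard_normal((d_model, d_model))  # deliberately unscaled

# Depth-varying normalization gains: a cheap per-step role signal.
gains = np.ones((n_steps, d_model))

h = rng.standard_normal((2, d_model))
for t in range(n_steps):
    h = rms_norm(h, gains[t]) @ W  # norm before the shared projection each step
```

Because the normalization resets the activation scale before every reuse of `W`, the repeated matrix multiply cannot compound into a blow-up, and the per-step `gains` remain the only depth-specific parameters.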