Hypothesis
A shared-depth model may compress better if many of its matrices are reconstructed from a single global codebook bank plus sparse residual corrections, rather than each tensor being quantized independently.
The concrete bet is that recursive architectures already create repeated structure. A compression scheme should exploit that repetition directly instead of pretending every matrix is unrelated.
Mechanism sketch
A concrete version would use:
- one or a few learned codebook banks shared across most recurrent-block matrices
- small per-matrix index tensors or mixing coefficients
- optional sparse residuals only for the worst channels or blocks
- the same codebooks reused across attention, MLP, and possibly the LM head where shape permits
This is not standard scalar quantization. It is a shared representation format for a shared-weight model.
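The mechanism above can be sketched concretely. This is a minimal illustration with assumed shapes and names (`codebook`, `reconstruct`, the residual layout are all hypothetical), not a proposed implementation: a shared bank of sub-vector atoms, a small per-matrix index tensor, and a sparse residual for the worst entries.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed shapes: a shared bank of K atoms, each a d-dim sub-vector.
K, d = 256, 8          # 256 atoms -> 1-byte indices
rows, cols = 64, 64    # one recurrent-block matrix; cols divisible by d

codebook = rng.normal(size=(K, d)).astype(np.float32)  # shared across matrices

# Per-matrix metadata: one uint8 index per d-wide sub-vector.
indices = rng.integers(0, K, size=(rows, cols // d), dtype=np.uint8)

# Optional sparse residual: corrections for only the worst entries.
n_resid = 32
resid_pos = rng.integers(0, rows * cols, size=n_resid)
resid_val = rng.normal(size=n_resid).astype(np.float32)

def reconstruct(codebook, indices, resid_pos, resid_val, rows, cols):
    """Rebuild a matrix from shared atoms plus sparse corrections."""
    W = codebook[indices].reshape(rows, cols).copy()  # gather atoms, lay out as matrix
    W.flat[resid_pos] += resid_val                    # patch the flagged entries
    return W

W = reconstruct(codebook, indices, resid_pos, resid_val, rows, cols)

# Per-matrix cost excludes the codebook, which is amortized across all matrices.
per_matrix_bytes = indices.nbytes + resid_pos.nbytes + resid_val.nbytes
```

The key property is that `codebook` is paid for once, while each additional matrix costs only its indices and residuals.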
Why this might work
The literature hints at this from multiple angles:
- ClusComp argues clustering-style compression is especially attractive when outliers make scalar quantization brittle (Liao et al., 2025)
- Additive Quantization argues extreme compression is often a representation-design problem, not just a bitwidth problem (Egiazarian et al., 2024)
- Fine-grained Parameter Sharing suggests reuse should be expressed through shared bases rather than only whole-layer tying (Üyük et al., 2024)
The novel connection is that a recursive model and a shared codebook are the same strategic move at two different levels:
- the architecture reuses computation
- the codec reuses representational atoms
Evidence threads
- Recursive and shared-parameter architectures already value repeated structure.
- Quantization and outliers says non-uniform formats become more attractive once scalar rounding stops behaving well.
- Outlier-aware compression suggests structured exceptions can be better than blanket precision increases.
What would falsify it
This hypothesis should be downgraded if:
- global codebooks are too rigid and lose too much post-roundtrip quality
- per-matrix indices and metadata eat the expected savings
- the gains disappear once actual artifact compression is measured rather than raw tensor bytes
- independent per-tensor quantization plus a tiny residual outperforms the shared-codebook approach at the same final size
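The third falsifier is easy to underestimate: raw tensor bytes and final artifact bytes can diverge, because index tensors themselves compress further under a generic entropy coder. A small sketch of why the measurement must be post-compression (the arrays here are synthetic stand-ins, not real index distributions):

```python
import zlib
import numpy as np

rng = np.random.default_rng(1)

# Two index tensors with identical raw size but different entropy.
uniform_idx = rng.integers(0, 256, size=100_000, dtype=np.uint8)  # near-incompressible
skewed_idx = rng.integers(0, 16, size=100_000, dtype=np.uint8)    # low entropy, compresses well

raw_bytes = uniform_idx.nbytes  # same for both arrays
uniform_artifact = len(zlib.compress(uniform_idx.tobytes()))
skewed_artifact = len(zlib.compress(skewed_idx.tobytes()))
```

If a scheme's savings vanish once the serialized artifact is measured this way, the hypothesis should be downgraded per the bullet above.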
Why it matters under the 16 MB cap
The 16 MB limit punishes repeated metadata and repeated exceptions. A global codebook approach could win precisely by amortizing structure across many tensors.
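A back-of-envelope version of the amortization argument, with all numbers assumed for illustration (codebook size, matrix count, and index width are placeholders, not measurements):

```python
# Assumed: one fp16 codebook of K atoms of width d, shared across n_matrices.
K, d = 4096, 8
codebook_bytes = K * d * 2                       # paid once in the shared scheme

n_matrices, rows, cols = 40, 1024, 1024
idx_bytes_per_matrix = rows * (cols // d) * 2    # uint16 index per sub-vector (K > 256)

# Shared: one codebook plus per-matrix indices.
shared_total = codebook_bytes + n_matrices * idx_bytes_per_matrix
# Independent: every matrix carries its own codebook of the same size.
indep_total = n_matrices * (codebook_bytes + idx_bytes_per_matrix)

print(f"shared: {shared_total / 2**20:.1f} MiB, independent: {indep_total / 2**20:.1f} MiB")
```

Under these assumed numbers the shared scheme lands around 10.1 MiB versus 12.5 MiB for independent codebooks, and the gap widens as matrices are added, which is exactly the amortization the 16 MB cap rewards.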
If successful, it would mean the best codec for shared-depth models is itself shared-depth in spirit: a small reusable dictionary plus targeted corrections, not a separate quantization story for every matrix.
Related
- Recursive width scaling
- Sparse outlier preservation
- Recursive and shared-parameter architectures
- Quantization and outliers
- Outlier-aware compression