Hypothesis
A shared-depth model may compress better if many of its matrices are reconstructed from a single global codebook bank plus sparse residual corrections, rather than each tensor being quantized independently.
The concrete bet is that recursive architectures already create repeated structure. A compression scheme should exploit that repetition directly instead of pretending every matrix is unrelated.
Mechanism sketch
A concrete version would use:
- one or a few learned codebook banks shared across most recurrent-block matrices
- small per-matrix index tensors or mixing coefficients
- optional sparse residuals only for the worst channels or blocks
- the same codebooks reused across attention, MLP, and possibly the LM head where shape permits
This is not standard scalar quantization. It is a shared representation format for a shared-weight model.
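The mechanism above can be sketched concretely. This is a minimal illustration with assumed shapes and names (`codebook`, `reconstruct`, the residual layout are all hypothetical), not a proposed implementation: a shared bank of sub-vector atoms, a small per-matrix index tensor, and a sparse residual for the worst entries.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed shapes: a shared bank of K atoms, each a d-dim sub-vector.
K, d = 256, 8          # 256 atoms -> 1-byte indices
rows, cols = 64, 64    # one recurrent-block matrix; cols divisible by d

codebook = rng.normal(size=(K, d)).astype(np.float32)  # shared across matrices

# Per-matrix metadata: one uint8 index per d-wide sub-vector.
indices = rng.integers(0, K, size=(rows, cols // d), dtype=np.uint8)

# Optional sparse residual: corrections for only the worst entries.
n_resid = 32
resid_pos = rng.integers(0, rows * cols, size=n_resid)
resid_val = rng.normal(size=n_resid).astype(np.float32)

def reconstruct(codebook, indices, resid_pos, resid_val, rows, cols):
    """Rebuild a matrix from shared atoms plus sparse corrections."""
    W = codebook[indices].reshape(rows, cols).copy()  # gather atoms, lay out as matrix
    W.flat[resid_pos] += resid_val                    # patch the flagged entries
    return W

W = reconstruct(codebook, indices, resid_pos, resid_val, rows, cols)

# Per-matrix cost excludes the codebook, which is amortized across all matrices.
per_matrix_bytes = indices.nbytes + resid_pos.nbytes + resid_val.nbytes
```

The key property is that `codebook` is paid for once, while each additional matrix costs only its indices and residuals.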
Why this might work
The literature hints at this from multiple angles:
- ClusComp argues clustering-style compression is especially attractive when outliers make scalar quantization brittle (Liao et al., 2025)
- Additive Quantization argues extreme compression is often a representation-design problem, not just a bitwidth problem (Egiazarian et al., 2024)
- Fine-grained Parameter Sharing suggests reuse should be expressed through shared bases rather than only whole-layer tying (Üyük et al., 2024)
The novel connection is that a recursive model and a shared codebook are the same strategic move at two different levels:
- the architecture reuses computation
- the codec reuses representational atoms
Evidence threads
- Recursive and shared-parameter architectures already value repeated structure.
- Quantization and outliers says non-uniform formats become more attractive once scalar rounding stops behaving well.
- Outlier-aware compression suggests structured exceptions can be better than blanket precision increases.
What would falsify it
This hypothesis should be downgraded if:
- global codebooks are too rigid and lose too much post-roundtrip quality
- per-matrix indices and metadata eat the expected savings
- the gains disappear once actual artifact compression is measured rather than raw tensor bytes
- independent per-tensor quantization plus a tiny residual outperforms the shared-codebook approach at the same final size
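The third falsifier is easy to underestimate: raw tensor bytes and final artifact bytes can diverge, because index tensors themselves compress further under a generic entropy coder. A small sketch of why the measurement must be post-compression (the arrays here are synthetic stand-ins, not real index distributions):

```python
import zlib
import numpy as np

rng = np.random.default_rng(1)

# Two index tensors with identical raw size but different entropy.
uniform_idx = rng.integers(0, 256, size=100_000, dtype=np.uint8)  # near-incompressible
skewed_idx = rng.integers(0, 16, size=100_000, dtype=np.uint8)    # low entropy, compresses well

raw_bytes = uniform_idx.nbytes  # same for both arrays
uniform_artifact = len(zlib.compress(uniform_idx.tobytes()))
skewed_artifact = len(zlib.compress(skewed_idx.tobytes()))
```

If a scheme's savings vanish once the serialized artifact is measured this way, the hypothesis should be downgraded per the bullet above.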
Why it matters under the 16 MB cap
The 16 MB limit punishes repeated metadata and repeated exceptions. A global codebook approach could win precisely by amortizing structure across many tensors.
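A back-of-envelope version of the amortization argument, with all numbers assumed for illustration (codebook size, matrix count, and index width are placeholders, not measurements):

```python
# Assumed: one fp16 codebook of K atoms of width d, shared across n_matrices.
K, d = 4096, 8
codebook_bytes = K * d * 2                       # paid once in the shared scheme

n_matrices, rows, cols = 40, 1024, 1024
idx_bytes_per_matrix = rows * (cols // d) * 2    # uint16 index per sub-vector (K > 256)

# Shared: one codebook plus per-matrix indices.
shared_total = codebook_bytes + n_matrices * idx_bytes_per_matrix
# Independent: every matrix carries its own codebook of the same size.
indep_total = n_matrices * (codebook_bytes + idx_bytes_per_matrix)

print(f"shared: {shared_total / 2**20:.1f} MiB, independent: {indep_total / 2**20:.1f} MiB")
```

Under these assumed numbers the shared scheme lands around 10.1 MiB versus 12.5 MiB for independent codebooks, and the gap widens as matrices are added, which is exactly the amortization the 16 MB cap rewards.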
If successful, it would mean the best codec for shared-depth models is itself shared-depth in spirit: a small reusable dictionary plus targeted corrections, not a separate quantization story for every matrix.
Related
- Recursive width scaling
- Sparse outlier preservation
- Recursive and shared-parameter architectures
- Quantization and outliers
- Outlier-aware compression