Hypothesis

A shared-depth model may compress better if many of its matrices are reconstructed from a single global codebook bank plus sparse residual corrections, rather than each tensor being quantized independently.

The concrete bet is that recursive architectures already create repeated structure. A compression scheme should exploit that repetition directly instead of pretending every matrix is unrelated.

Mechanism sketch

A concrete version would use:

  • one or a few learned codebook banks shared across most recurrent-block matrices
  • small per-matrix index tensors or mixing coefficients
  • optional sparse residuals only for the worst channels or blocks
  • the same codebooks reused across attention, MLP, and possibly the LM head where shape permits

This is not standard scalar quantization. It is a shared representation format for a shared-weight model.
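The mechanism above can be sketched in a few lines. Everything here is illustrative: the shapes, group size, codebook size, and the `reconstruct` helper are assumptions for the sketch, and a real scheme would learn the codebook and indices rather than sample them randomly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen for the sketch only.
D_IN, D_OUT = 256, 256       # shape shared across recurrent-block matrices
GROUP = 8                    # each row is split into groups of 8 values
K = 256                      # codebook entries, so one byte per index

# One global codebook bank shared by every matrix.
codebook = rng.standard_normal((K, GROUP)).astype(np.float32)

def reconstruct(indices, scale, residual=None):
    """Rebuild a weight matrix from shared codebook entries.

    indices : (D_OUT, D_IN // GROUP) uint8 lookups into the global bank
    scale   : per-matrix mixing coefficient
    residual: optional sparse correction as (rows, cols, vals) arrays
    """
    w = codebook[indices].reshape(D_OUT, D_IN) * scale
    if residual is not None:
        rows, cols, vals = residual
        w[rows, cols] += vals
    return w

# Per-matrix storage is just the index tensor plus a scale.
indices = rng.integers(0, K, size=(D_OUT, D_IN // GROUP), dtype=np.uint8)
w = reconstruct(indices, scale=0.02)
print(w.shape)  # (256, 256)
# Per-matrix cost: 256*32 index bytes = 8 KB vs 256 KB for fp32 weights.
```

The codebook is paid for once; each additional matrix costs only its indices and whatever sparse residual it needs.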

Why this might work

The literature hints at this from multiple angles: additive quantization reconstructs weights from sums of shared learned codebook entries, clustering-based compression replaces similar weight blocks with shared centroids, and parameter-sharing work learns reuse across layers directly.

The novel connection is that a recursive model and a shared codebook are the same strategic move at two different levels:

  • the architecture reuses computation
  • the codec reuses representational atoms

Evidence threads

  • Egiazarian et al. (2024): extreme LLM compression via additive quantization, i.e. weights built from learned codebooks
  • Liao et al. (2025): ClusComp, clustering weight blocks into shared codebooks as a simple compression paradigm
  • Üyük et al. (2024): parameter sharing learned via tensor decompositions plus sparsity

What would falsify it

This hypothesis should be downgraded if:

  1. global codebooks are too rigid and lose too much post-roundtrip quality
  2. per-matrix indices and metadata eat the expected savings
  3. the gains disappear once actual artifact compression is measured rather than raw tensor bytes
  4. independent per-tensor quantization plus a tiny residual outperforms the shared-codebook approach at the same final size
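Criterion 3 is cheap to check early. A minimal harness, with random stand-in tensors, compares the compressed artifact size of index bytes against raw float bytes; `zlib` here is only a stand-in for whatever codec the final artifact actually uses, and the tensors are placeholders, not trained weights.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: a raw fp32 matrix vs its hypothetical codebook indices.
w = rng.standard_normal((256, 256)).astype(np.float32)
idx = rng.integers(0, 256, size=(256, 32), dtype=np.uint8)  # 1 byte per 8-value group

# Measure what actually ships: the compressed artifact, not raw tensor bytes.
raw = len(zlib.compress(w.tobytes(), 9))
packed = len(zlib.compress(idx.tobytes(), 9))
print(raw, packed)
```

The hypothesis survives only if the gap holds on real weights after entropy coding, which is exactly what raw-byte accounting can hide.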

Why it matters under the 16 MB cap

The 16 MB limit punishes repeated metadata and repeated exceptions. A global codebook approach could win precisely by amortizing structure across many tensors.
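A back-of-envelope comparison makes the amortization concrete. The counts below are assumptions chosen for the sketch (48 shared-block matrices of shape 256×256, a 256-entry codebook of 8-value groups), but the shape of the saving does not depend on them:

```python
# Shared global codebook vs one codebook per tensor, in bytes.
# All sizes are illustrative assumptions, not measured values.
N = 48                         # matrices in the recurrent block stack
K, GROUP = 256, 8
CODEBOOK = K * GROUP * 4       # fp32 entries: 8 KB
INDICES = 256 * 256 // GROUP   # one byte per 8-value group: 8 KB per matrix

shared = CODEBOOK + N * INDICES        # codebook paid once
per_tensor = N * (CODEBOOK + INDICES)  # codebook repeated per matrix
print(shared / 1024, per_tensor / 1024)  # ~392 KB vs ~768 KB
```

The per-tensor repetition is exactly the kind of duplicated metadata the cap punishes; the shared scheme converts it into a one-time cost.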

If successful, it would mean the best codec for shared-depth models is itself shared-depth in spirit: a small reusable dictionary plus targeted corrections, not a separate quantization story for every matrix.

References

Egiazarian, V., Panferov, A., Kuznedelev, D., Frantar, E., Babenko, A., & Alistarh, D. (2024). Extreme Compression of Large Language Models via Additive Quantization. arXiv Preprint arXiv:2401.06118. https://arxiv.org/abs/2401.06118
Liao, B., Herold, C., Hashemi, S. H., Vasilev, S., Khadivi, S., & Monz, C. (2025). ClusComp: A Simple Paradigm for Model Compression and Efficient Finetuning. arXiv Preprint arXiv:2503.13089. https://arxiv.org/abs/2503.13089
Üyük, C., Lasby, M., Yassin, M., Evci, U., & Ioannou, Y. (2024). Learning Parameter Sharing with Tensor Decompositions and Sparsity. arXiv Preprint arXiv:2411.09816. https://arxiv.org/abs/2411.09816