This is the lane most directly tied to the real objective: the compressed artifact must still behave like a good language model.

Core question

How do we make the final stored model behave more like the trained model without spending too many bytes on precision, metadata, or special cases?

Why this lane matters

A compact LLM can fail in at least three different ways:

  1. the weights themselves do not survive low-bit export
  2. a small number of outlier weights or channels dominates the degradation
  3. training learns representations that look good before export but are hostile to the final codec

This lane treats compression as part of the model design, not a last-mile afterthought.

Central papers

Four recurring mechanisms

1. Activation shaping

Keep the inputs to fragile projections well behaved.

Main page:
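A minimal sketch of one common shaping tool: an orthogonal Hadamard rotation that spreads an outlier channel's energy across all channels, so a single shared quantization scale wastes fewer levels on the rest. The 4-channel activation, the `crest` metric, and the fixed 4x4 transform are illustrative choices, not any paper's exact procedure.

```python
import math

H4 = [[1, 1, 1, 1],
      [1, -1, 1, -1],
      [1, 1, -1, -1],
      [1, -1, -1, 1]]

def hadamard_rotate(x):
    """Multiply by a normalized 4x4 Hadamard matrix. It is orthogonal,
    so the inverse rotation can be folded into the next layer's weights
    at zero runtime cost."""
    n = math.sqrt(len(x))
    return [sum(h * v for h, v in zip(row, x)) / n for row in H4]

def crest(x):
    """Peak-to-RMS ratio: a rough measure of how outlier-dominated a
    tensor is, and hence how much resolution per-tensor scaling wastes."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return max(abs(v) for v in x) / rms

act = [0.1, -0.2, 0.05, 8.0]   # one outlier channel sets the scale
flat = hadamard_rotate(act)
# crest(act) is about 2.0; crest(flat) is about 1.03. The rotation
# preserves the vector's energy but flattens its profile.
```

The rotation is exact and invertible, which is why this style of shaping costs no quality by itself; the win or loss shows up only after the rounding step.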

2. Selective precision

Protect the tiny subset of weights or channels that break under uniform compression.

Main page:
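A toy sketch of the idea: quantize everything except a small protected set that stays exact. Magnitude-based selection is purely illustrative here; practical schemes typically select outliers by sensitivity, not raw size, and store them in a sparse format.

```python
def quantize_sym(w, bits):
    """Symmetric round-to-nearest with one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in w) / qmax
    return [round(v / scale) * scale for v in w]

def quantize_selective(w, bits=3, keep_frac=0.05):
    """Quantize all weights except the largest-magnitude keep_frac,
    which are kept exact. keep_frac and the magnitude criterion are
    illustrative stand-ins for a real selection rule."""
    k = max(1, int(len(w) * keep_frac))
    keep = set(sorted(range(len(w)), key=lambda i: -abs(w[i]))[:k])
    rest = [w[i] for i in range(len(w)) if i not in keep]
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in rest) / qmax   # scale set by inliers only
    return [w[i] if i in keep else round(w[i] / scale) * scale
            for i in range(len(w))]

w = [0.1, -0.3, 0.2, -0.15, 0.25, 6.0]   # one outlier weight
err_naive = sum(abs(a - b) for a, b in zip(w, quantize_sym(w, 3)))
err_sel = sum(abs(a - b) for a, b in zip(w, quantize_selective(w, 3)))
```

The mechanism is visible in the scale: with the outlier in range, every inlier rounds to zero; with it protected, the shared scale shrinks and the inliers survive. The cost is the storage for the protected indices and values, which is exactly the metadata concern raised below.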

3. Structured non-uniform compression

Use codebooks, clustering, or additive schemes when scalar rounding is too blunt.

Main page:
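A toy scalar k-means codebook makes the storage trade concrete: each weight becomes a log2(k)-bit index, and the codebook itself is shared metadata. This is a 1-D stand-in for the richer vector and additive codebooks in methods such as AQLM (Egiazarian et al., 2024) or ClusComp (Liao et al., 2025), not their actual algorithms.

```python
import random

def kmeans_codebook(w, k=4, iters=20, seed=0):
    """Fit a k-entry scalar codebook to a weight list and return
    (codebook, per-weight indices). Initialization and iteration count
    are illustrative choices."""
    centers = random.Random(seed).sample(w, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in w:
            buckets[min(range(k), key=lambda j: abs(v - centers[j]))].append(v)
        centers = [sum(b) / len(b) if b else centers[j]
                   for j, b in enumerate(buckets)]
    codes = [min(range(k), key=lambda j: abs(v - centers[j])) for v in w]
    return centers, codes

# Clustered weights are where codebooks beat uniform rounding: the
# centers land on the clusters instead of on an evenly spaced grid.
w = [-1.05, -1.0, -0.95, 0.0, 0.02, 0.95, 1.0, 1.05]
centers, codes = kmeans_codebook(w)
decoded = [centers[c] for c in codes]
```

With k=4 this spends 2 bits per weight plus the shared codebook; the advantage over scalar rounding appears only when the weight distribution is genuinely non-uniform.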

4. Compression-aware architecture

Prefer model structures whose important information is easier to quantize in the first place.

Natural bridge:
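One concrete instance from the reading list is the extra RMSNorm of Steinmetz et al. (2025): normalize the input of a projection so the ternary (1.58-bit) weights always see a well-scaled tensor. The sketch below is a hedged illustration of that structural idea; the ternary threshold and scale rule are common conventions, not their exact training recipe.

```python
import math

def rms_norm(x, eps=1e-6):
    """Scale a vector to unit RMS."""
    r = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / r for v in x]

def ternary(w):
    """1.58-bit weights: each value in {-s, 0, +s} with a per-row scale.
    Threshold at half the mean magnitude; an illustrative rule."""
    s = sum(abs(v) for v in w) / len(w)
    return [0.0 if abs(v) < 0.5 * s else math.copysign(s, v) for v in w]

def linear_with_extra_norm(x, rows):
    """A linear layer whose input passes through an extra RMSNorm before
    hitting ternary weights, so the codec-facing tensors stay in a
    predictable range by construction."""
    xn = rms_norm(x)
    return [sum(a * b for a, b in zip(ternary(row), xn)) for row in rows]

out = linear_with_extra_norm([1.0, 2.0, 3.0],
                             [[0.5, -0.5, 0.1], [1.0, 1.0, 1.0]])
```

The point of the architectural framing: the normalization is part of the model, so it is trained through, rather than bolted on after the fact.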

Important caution

A method can look excellent in generic quantization papers and still fail here if it:

  • adds too much metadata
  • protects so many weights that the savings disappear
  • assumes calibration or deployment conditions we do not control
  • improves nominal perplexity while leaving the size-quality frontier unchanged
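The metadata point can be made concrete with simple bookkeeping. The function below counts effective bits per weight for a hypothetical format with one fp16 scale per group and explicitly indexed protected outliers; every parameter is an illustrative assumption, not a description of any particular scheme.

```python
def effective_bits(bits, group_size=128, scale_bits=16,
                   outlier_frac=0.0, outlier_bits=16, index_bits=32):
    """Effective storage cost per weight once metadata is counted:
    the nominal bits, plus the amortized per-group scale, plus the
    amortized cost of each stored outlier (value + explicit index).
    All defaults are illustrative bookkeeping assumptions."""
    return (bits
            + scale_bits / group_size
            + outlier_frac * (outlier_bits + index_bits))

# A nominal 3-bit scheme with 128-wide groups and 1% protected outliers:
# 3 + 16/128 + 0.01 * (16 + 32) = 3.605 bits per weight.
cost = effective_bits(3, group_size=128, outlier_frac=0.01)
```

This is why "3-bit" headlines can mislead: at 1% outliers the real cost is about 3.6 bits per weight, so the honest comparison is against a plain scheme at that same budget, on the size-quality frontier rather than on nominal precision.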

Most relevant open questions

  • when does activation shaping matter more than weight selection?
  • which tensors are so fragile that selective precision pays for itself?
  • when do structured codebooks outperform simple low-bit plus sparse residuals?
  • which architecture choices make outliers less severe rather than merely easier to patch?
References

Egiazarian, V., Panferov, A., Kuznedelev, D., Frantar, E., Babenko, A., & Alistarh, D. (2024). Extreme Compression of Large Language Models via Additive Quantization. arXiv Preprint arXiv:2401.06118. https://arxiv.org/abs/2401.06118
Liao, B., Herold, C., Hashemi, S. H., Vasilev, S., Khadivi, S., & Monz, C. (2025). ClusComp: A Simple Paradigm for Model Compression and Efficient Finetuning. arXiv Preprint arXiv:2503.13089. https://arxiv.org/abs/2503.13089
Panferov, A., Chen, J., Tabesh, S., Castro, R. L., Nikdan, M., & Alistarh, D. (2025). QuEST: Stable Training of LLMs with 1-Bit Weights and Activations. arXiv Preprint arXiv:2502.05003. https://arxiv.org/abs/2502.05003
Steinmetz, C., Childress, G., Herbst, A., Jones, G., Singh, J., Vang, E., & Weinstock, K. (2025). An Extra RMSNorm is All You Need for Fine Tuning to 1.58 Bits. arXiv Preprint arXiv:2505.08823. https://arxiv.org/abs/2505.08823
Zhang, W., Liu, B., Hu, Y., Bai, X., Zhang, W., & Cui, B. (2026). pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training. arXiv Preprint arXiv:2602.22592. https://arxiv.org/abs/2602.22592