Observation

At moderate precision, uniform quantization is often good enough. At extreme compression, roughly two bits per weight and below, it often stops being the right abstraction.

The problem is not just average error. Because a uniform grid sets its scale from the largest magnitude in the tensor, a small set of outlier values can stretch the grid, starve the typical values of resolution, and dominate downstream degradation.
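A minimal numpy sketch makes the mechanism concrete (function and variable names are illustrative, not from any cited paper): symmetric per-tensor quantization derives its scale from the largest magnitude, so a single outlier inflates the scale and collapses the resolution available to everything else.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_uniform(w, bits=4):
    """Symmetric uniform quantization with one scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 levels each side for 4-bit signed
    scale = np.abs(w).max() / qmax      # scale is dictated by the largest magnitude
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                    # dequantized values

w = rng.normal(0, 0.02, size=10_000)    # typical narrow weight distribution
w_out = w.copy()
w_out[0] = 1.0                          # a single injected outlier

err_clean = np.abs(quantize_uniform(w) - w).mean()
err_outlier = np.abs(quantize_uniform(w_out) - w_out).mean()

print(f"mean abs error, no outlier:  {err_clean:.6f}")
print(f"mean abs error, one outlier: {err_outlier:.6f}")
```

With the outlier present, the scale jumps from roughly max|w|/7 of the narrow bulk to 1/7, so most weights round to zero and the mean error is several times larger even though only one value changed.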

Evidence across papers

Practical lesson

When the bit budget is extremely tight, the right question is often not “what global bit-width should we use?” but:

“which tiny subset of the model cannot survive the cheap path?”

References

Egiazarian, V., Panferov, A., Kuznedelev, D., Frantar, E., Babenko, A., & Alistarh, D. (2024). Extreme Compression of Large Language Models via Additive Quantization. arXiv preprint arXiv:2401.06118. https://arxiv.org/abs/2401.06118
Liao, B., Herold, C., Hashemi, S. H., Vasilev, S., Khadivi, S., & Monz, C. (2025). ClusComp: A Simple Paradigm for Model Compression and Efficient Finetuning. arXiv preprint arXiv:2503.13089. https://arxiv.org/abs/2503.13089
Ramachandran, A., Kundu, S., & Krishna, T. (2024). MicroScopiQ: Accelerating Foundational Models through Outlier-Aware Microscaling Quantization. arXiv preprint arXiv:2411.05282. https://arxiv.org/abs/2411.05282
Zhang, W., Liu, B., Hu, Y., Bai, X., Zhang, W., & Cui, B. (2026). pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training. arXiv preprint arXiv:2602.22592. https://arxiv.org/abs/2602.22592
Zhao, J., Zhang, M., Wang, M., Shang, Y., Zhang, K., Guan, W., Wang, Y., & Zhang, M. (2025). PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models. arXiv preprint arXiv:2502.13179. https://arxiv.org/abs/2502.13179