Observation
At moderate precision, uniform quantization can be good enough. At extreme compression, it often stops being the right abstraction.
The problem is not just average error. A small set of outlier values can dominate downstream degradation, because a uniform grid must stretch its range to cover them, which coarsens the step size for every other value.
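A minimal numeric sketch of this effect (the weight distribution, sizes, and bit-width here are illustrative, not taken from any of the cited papers): one large outlier forces the uniform scale wide enough that the remaining small weights all round to zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# A weight vector that is mostly small, plus a single large outlier.
w = rng.normal(0.0, 0.02, size=1024)
w[0] = 2.0

def uniform_quantize(x, bits):
    """Symmetric uniform quantization; the scale is set by the max magnitude."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

# With the outlier present, the 4-bit step size is ~2.0/7, so nearly
# every small weight rounds to zero: the outlier sets everyone's grid.
err_with_outlier = np.abs(w[1:] - uniform_quantize(w, bits=4)[1:]).mean()

# The same small weights quantized on their own get a much finer grid.
err_without_outlier = np.abs(w[1:] - uniform_quantize(w[1:], bits=4)).mean()

ratio = err_with_outlier / err_without_outlier
print(ratio)  # well above 1: one value inflates the error of all the others
```

The average error over the small weights is several times larger when the outlier shares their quantization grid, even though the outlier itself is a single parameter.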
Evidence across papers
- ClusComp argues that newer LLMs are getting harder to quantize largely because outliers are becoming more common. (Liao et al., 2025)
- pQuant names the failure mode “parameter democratization”: important parameters lose their special status under uniform low-bit treatment. (Zhang et al., 2026)
- PTQ1.61 and MicroScopiQ show that post-training low-bit methods also end up needing structured ways to preserve salient channels or outlier values. (Ramachandran et al., 2024; Zhao et al., 2025)
- Additive Quantization and ClusComp both point toward structured, non-uniform formats when scalar quantization becomes too lossy. (Egiazarian et al., 2024; Liao et al., 2025)
Practical lesson
When the budget is extremely tight, the right question is often not “what global bit-width should we use?” but:
which tiny subset of the model cannot survive the cheap path?
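That question can be made concrete with a small sketch of sparse outlier preservation: flag the few largest-magnitude weights as the subset that cannot survive the cheap path, keep them at full precision, and fit the uniform grid to the inliers only. The split, the keep fraction, and the bit-width below are illustrative assumptions, not any cited paper's recipe.

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(0.0, 0.02, size=4096)
w[:4] = [2.0, -1.5, 1.2, -0.9]   # a handful of planted outliers

def uniform_quantize(x, bits=4):
    """Plain symmetric uniform quantization over the whole tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def quantize_keep_outliers(x, bits=4, keep_frac=0.001):
    """Exempt the keep_frac largest-magnitude weights from quantization."""
    k = max(1, int(len(x) * keep_frac))
    outlier_idx = np.argsort(np.abs(x))[-k:]
    mask = np.zeros(len(x), dtype=bool)
    mask[outlier_idx] = True

    # The grid now only has to cover the inliers, so it is much finer.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x[~mask]).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax) * scale
    return np.where(mask, x, q)            # outliers bypass quantization

err_uniform = np.abs(w - uniform_quantize(w)).mean()
err_mixed = np.abs(w - quantize_keep_outliers(w)).mean()
print(err_mixed < err_uniform)  # True: exempting ~0.1% of weights helps
```

Storing roughly 0.1% of the weights in full precision changes the error profile of the other 99.9%, which is the structural idea behind the mixed formats the papers above converge on.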
Related
- Sparse outlier preservation
- Recurrent wide architecture
- Decoupled precision
- Quantization and outliers
Egiazarian, V., Panferov, A., Kuznedelev, D., Frantar, E., Babenko, A., & Alistarh, D. (2024). Extreme Compression of Large Language Models via Additive Quantization. arXiv Preprint arXiv:2401.06118. https://arxiv.org/abs/2401.06118
Liao, B., Herold, C., Hashemi, S. H., Vasilev, S., Khadivi, S., & Monz, C. (2025). ClusComp: A Simple Paradigm for Model Compression and Efficient Finetuning. arXiv Preprint arXiv:2503.13089. https://arxiv.org/abs/2503.13089
Ramachandran, A., Kundu, S., & Krishna, T. (2024). MicroScopiQ: Accelerating Foundational Models through Outlier-Aware Microscaling Quantization. arXiv Preprint arXiv:2411.05282. https://arxiv.org/abs/2411.05282
Zhang, W., Liu, B., Hu, Y., Bai, X., Zhang, W., & Cui, B. (2026). pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training. arXiv Preprint arXiv:2602.22592. https://arxiv.org/abs/2602.22592
Zhao, J., Zhang, M., Wang, M., Shang, Y., Zhang, K., Guan, W., Wang, Y., & Zhang, M. (2025). PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models. arXiv Preprint arXiv:2502.13179. https://arxiv.org/abs/2502.13179