This is the lane most directly tied to the real objective: the compressed artifact must still behave like a good language model.

Core question

How do we make the final stored model behave more like the trained model without spending too many bytes on precision, metadata, or special cases?

Why this lane matters

A compact LLM can fail in at least three different ways:

  1. the weights themselves do not survive low-bit export
  2. a small number of outlier weights or channels dominates the degradation
  3. training learns representations that look good before export but are hostile to the final codec

This lane treats compression as part of the model design, not a last-mile afterthought.

Central papers

Four recurring mechanisms

1. Activation shaping

Keep the inputs to fragile projections well behaved.

Main page:
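A minimal sketch of one common shaping tool: an orthogonal Hadamard rotation that spreads an outlier channel's energy across all channels, so a single shared quantization scale wastes fewer levels on the rest. The 4-channel activation, the `crest` metric, and the fixed 4x4 transform are illustrative choices, not any paper's exact procedure.

```python
import math

H4 = [[1, 1, 1, 1],
      [1, -1, 1, -1],
      [1, 1, -1, -1],
      [1, -1, -1, 1]]

def hadamard_rotate(x):
    """Multiply by a normalized 4x4 Hadamard matrix. It is orthogonal,
    so the inverse rotation can be folded into the next layer's weights
    at zero runtime cost."""
    n = math.sqrt(len(x))
    return [sum(h * v for h, v in zip(row, x)) / n for row in H4]

def crest(x):
    """Peak-to-RMS ratio: a rough measure of how outlier-dominated a
    tensor is, and hence how much resolution per-tensor scaling wastes."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return max(abs(v) for v in x) / rms

act = [0.1, -0.2, 0.05, 8.0]   # one outlier channel sets the scale
flat = hadamard_rotate(act)
# crest(act) is about 2.0; crest(flat) is about 1.03. The rotation
# preserves the vector's energy but flattens its profile.
```

The rotation is exact and invertible, which is why this style of shaping costs no quality by itself; the win or loss shows up only after the rounding step.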

2. Selective precision

Protect the tiny subset of weights or channels that break under uniform compression.

Main page:
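A toy sketch of the idea: quantize everything except a small protected set that stays exact. Magnitude-based selection is purely illustrative here; practical schemes typically select outliers by sensitivity, not raw size, and store them in a sparse format.

```python
def quantize_sym(w, bits):
    """Symmetric round-to-nearest with one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in w) / qmax
    return [round(v / scale) * scale for v in w]

def quantize_selective(w, bits=3, keep_frac=0.05):
    """Quantize all weights except the largest-magnitude keep_frac,
    which are kept exact. keep_frac and the magnitude criterion are
    illustrative stand-ins for a real selection rule."""
    k = max(1, int(len(w) * keep_frac))
    keep = set(sorted(range(len(w)), key=lambda i: -abs(w[i]))[:k])
    rest = [w[i] for i in range(len(w)) if i not in keep]
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in rest) / qmax   # scale set by inliers only
    return [w[i] if i in keep else round(w[i] / scale) * scale
            for i in range(len(w))]

w = [0.1, -0.3, 0.2, -0.15, 0.25, 6.0]   # one outlier weight
err_naive = sum(abs(a - b) for a, b in zip(w, quantize_sym(w, 3)))
err_sel = sum(abs(a - b) for a, b in zip(w, quantize_selective(w, 3)))
```

The mechanism is visible in the scale: with the outlier in range, every inlier rounds to zero; with it protected, the shared scale shrinks and the inliers survive. The cost is the storage for the protected indices and values, which is exactly the metadata concern raised below.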

3. Structured non-uniform compression

Use codebooks, clustering, or additive schemes when scalar rounding is too blunt.

Main page:
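A toy scalar k-means codebook makes the storage trade concrete: each weight becomes a log2(k)-bit index, and the codebook itself is shared metadata. This is a 1-D stand-in for the richer vector and additive codebooks in methods such as AQLM (Egiazarian et al., 2024) or ClusComp (Liao et al., 2025), not their actual algorithms.

```python
import random

def kmeans_codebook(w, k=4, iters=20, seed=0):
    """Fit a k-entry scalar codebook to a weight list and return
    (codebook, per-weight indices). Initialization and iteration count
    are illustrative choices."""
    centers = random.Random(seed).sample(w, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in w:
            buckets[min(range(k), key=lambda j: abs(v - centers[j]))].append(v)
        centers = [sum(b) / len(b) if b else centers[j]
                   for j, b in enumerate(buckets)]
    codes = [min(range(k), key=lambda j: abs(v - centers[j])) for v in w]
    return centers, codes

# Clustered weights are where codebooks beat uniform rounding: the
# centers land on the clusters instead of on an evenly spaced grid.
w = [-1.05, -1.0, -0.95, 0.0, 0.02, 0.95, 1.0, 1.05]
centers, codes = kmeans_codebook(w)
decoded = [centers[c] for c in codes]
```

With k=4 this spends 2 bits per weight plus the shared codebook; the advantage over scalar rounding appears only when the weight distribution is genuinely non-uniform.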

4. Compression-aware architecture

Prefer model structures whose important information is easier to quantize in the first place.

Natural bridge:
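One concrete instance from the reading list is the extra RMSNorm of Steinmetz et al. (2025): normalize the input of a projection so the ternary (1.58-bit) weights always see a well-scaled tensor. The sketch below is a hedged illustration of that structural idea; the ternary threshold and scale rule are common conventions, not their exact training recipe.

```python
import math

def rms_norm(x, eps=1e-6):
    """Scale a vector to unit RMS."""
    r = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / r for v in x]

def ternary(w):
    """1.58-bit weights: each value in {-s, 0, +s} with a per-row scale.
    Threshold at half the mean magnitude; an illustrative rule."""
    s = sum(abs(v) for v in w) / len(w)
    return [0.0 if abs(v) < 0.5 * s else math.copysign(s, v) for v in w]

def linear_with_extra_norm(x, rows):
    """A linear layer whose input passes through an extra RMSNorm before
    hitting ternary weights, so the codec-facing tensors stay in a
    predictable range by construction."""
    xn = rms_norm(x)
    return [sum(a * b for a, b in zip(ternary(row), xn)) for row in rows]

out = linear_with_extra_norm([1.0, 2.0, 3.0],
                             [[0.5, -0.5, 0.1], [1.0, 1.0, 1.0]])
```

The point of the architectural framing: the normalization is part of the model, so it is trained through, rather than bolted on after the fact.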

Important caution

A method can look excellent in generic quantization papers and still fail here if it:

  • adds too much metadata
  • protects so many weights that the savings disappear
  • assumes calibration or deployment conditions we do not control
  • improves nominal perplexity while leaving the size-quality frontier unchanged
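The metadata point can be made concrete with simple bookkeeping. The function below counts effective bits per weight for a hypothetical format with one fp16 scale per group and explicitly indexed protected outliers; every parameter is an illustrative assumption, not a description of any particular scheme.

```python
def effective_bits(bits, group_size=128, scale_bits=16,
                   outlier_frac=0.0, outlier_bits=16, index_bits=32):
    """Effective storage cost per weight once metadata is counted:
    the nominal bits, plus the amortized per-group scale, plus the
    amortized cost of each stored outlier (value + explicit index).
    All defaults are illustrative bookkeeping assumptions."""
    return (bits
            + scale_bits / group_size
            + outlier_frac * (outlier_bits + index_bits))

# A nominal 3-bit scheme with 128-wide groups and 1% protected outliers:
# 3 + 16/128 + 0.01 * (16 + 32) = 3.605 bits per weight.
cost = effective_bits(3, group_size=128, outlier_frac=0.01)
```

This is why "3-bit" headlines can mislead: at 1% outliers the real cost is about 3.6 bits per weight, so the honest comparison is against a plain scheme at that same budget, on the size-quality frontier rather than on nominal precision.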

Most relevant open questions

  • when does activation shaping matter more than weight selection?
  • which tensors are so fragile that selective precision pays for itself?
  • when do structured codebooks outperform simple low-bit plus sparse residuals?
  • which architecture choices make outliers less severe rather than merely easier to patch?
References

Egiazarian, V., Panferov, A., Kuznedelev, D., Frantar, E., Babenko, A., & Alistarh, D. (2024). Extreme Compression of Large Language Models via Additive Quantization. arXiv Preprint arXiv:2401.06118. https://arxiv.org/abs/2401.06118
Liao, B., Herold, C., Hashemi, S. H., Vasilev, S., Khadivi, S., & Monz, C. (2025). ClusComp: A Simple Paradigm for Model Compression and Efficient Finetuning. arXiv Preprint arXiv:2503.13089. https://arxiv.org/abs/2503.13089
Panferov, A., Chen, J., Tabesh, S., Castro, R. L., Nikdan, M., & Alistarh, D. (2025). QuEST: Stable Training of LLMs with 1-Bit Weights and Activations. arXiv Preprint arXiv:2502.05003. https://arxiv.org/abs/2502.05003
Steinmetz, C., Childress, G., Herbst, A., Jones, G., Singh, J., Vang, E., & Weinstock, K. (2025). An Extra RMSNorm is All You Need for Fine Tuning to 1.58 Bits. arXiv Preprint arXiv:2505.08823. https://arxiv.org/abs/2505.08823
Zhang, W., Liu, B., Hu, Y., Bai, X., Zhang, W., & Cui, B. (2026). pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training. arXiv Preprint arXiv:2602.22592. https://arxiv.org/abs/2602.22592