This is the lane most directly tied to the real objective: the compressed artifact must still behave like a good language model.
Core question
How do we make the final stored model match the trained model's behavior without spending too many bytes on precision, metadata, or special cases?
Why this lane matters
A compact LLM can fail in at least three different ways:
- the weights themselves do not survive low-bit export
- a small number of outliers dominate degradation
- training learns representations that look good before export but are hostile to the final codec
This lane treats compression as part of the model design, not a last-mile afterthought.
Central papers
- Extra RMSNorm (Steinmetz et al., 2025)
- pQuant (Zhang et al., 2026)
- QuEST (Panferov et al., 2025)
- AQLM (Egiazarian et al., 2024)
- ClusComp (Liao et al., 2025)
Four recurring mechanisms
1. Activation shaping
Keep the inputs to fragile projections well behaved.
Main page:
2. Selective precision
Protect the tiny subset of weights or channels that break under uniform compression.
Main page:
3. Structured non-uniform compression
Use codebooks, clustering, or additive schemes when scalar rounding is too blunt.
Main page:
4. Compression-aware architecture
Prefer model structures whose important information is easier to quantize in the first place.
Natural bridge:
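Mechanism 2 (selective precision) is the easiest to make concrete. A minimal sketch, not the method of any paper cited above: quantize a weight matrix to a low-bit symmetric grid per row, but first pull the largest-magnitude entries into a sparse fp16 side table so they no longer stretch the quantization scale. The function names and the 0.5% outlier fraction are illustrative assumptions.

```python
import numpy as np

def quantize_with_outliers(w, bits=4, outlier_frac=0.005):
    """Round-to-nearest low-bit quantization with a sparse fp16 outlier residual.

    The largest-magnitude entries are removed before the per-row scale
    is fit, so a handful of outliers no longer stretches the grid.
    """
    w = w.astype(np.float32)
    k = max(1, int(outlier_frac * w.size))
    # indices of the k largest-magnitude weights (the "fragile" subset)
    flat_idx = np.argpartition(np.abs(w).ravel(), -k)[-k:]
    mask = np.zeros(w.shape, dtype=bool)
    mask.ravel()[flat_idx] = True

    dense = np.where(mask, 0.0, w)      # outliers zeroed before fitting scales
    qmax = 2 ** (bits - 1) - 1          # symmetric grid, e.g. +/-7 for 4-bit
    scale = np.abs(dense).max(axis=1, keepdims=True) / qmax
    scale = np.maximum(scale, 1e-12)
    q = np.clip(np.round(dense / scale), -qmax - 1, qmax).astype(np.int8)

    # sparse outliers kept at higher precision (fp16 values + flat indices)
    outliers = (flat_idx, w.ravel()[flat_idx].astype(np.float16))
    return q, scale, outliers

def dequantize(q, scale, outliers, shape):
    w_hat = q.astype(np.float32) * scale
    flat_idx, vals = outliers
    w_hat.ravel()[flat_idx] = vals.astype(np.float32)
    return w_hat.reshape(shape)
```

The caution below applies directly: each outlier costs an index plus a value, so the scheme only pays for itself if the protected subset stays tiny.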
Important caution
A method can look excellent in generic quantization papers and still fail here if it:
- adds too much metadata
- protects so many weights that the size savings evaporate
- assumes calibration or deployment conditions we do not control
- improves nominal perplexity while leaving the size-quality frontier unchanged
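The metadata and outlier concerns above reduce to simple byte accounting. A sketch under illustrative assumptions (the function and its parameter names are mine, not from any cited paper): effective bits per weight is the quantized bit width plus amortized scale storage plus the per-outlier cost of a value and an index.

```python
def effective_bits_per_weight(q_bits, group_size, scale_bits=16,
                              outlier_frac=0.0, outlier_bits=16,
                              index_bits=32):
    """Effective storage cost per weight, counting metadata.

    q_bits       : bits per quantized weight
    group_size   : number of weights sharing one scale
    scale_bits   : bits per stored scale (e.g. fp16)
    outlier_frac : fraction of weights kept at high precision
    outlier_bits : bits per outlier value
    index_bits   : bits per outlier index (sparse coordinate)
    """
    dense = q_bits + scale_bits / group_size
    sparse = outlier_frac * (outlier_bits + index_bits)
    return dense + sparse

# "4-bit" with fp16 scales per 128-weight group and 0.5% fp16 outliers
# actually costs about 4.37 bits per weight, not 4.
bpw = effective_bits_per_weight(4, 128, outlier_frac=0.005)
```

This is why a nominally 4-bit method with generous outlier protection can sit at the same point on the size-quality frontier as a plain 4.5-bit scheme.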
Most relevant open questions
- when does activation shaping matter more than weight selection?
- which tensors are so fragile that selective precision pays for itself?
- when do structured codebooks outperform simple low-bit plus sparse residuals?
- which architecture choices make outliers less severe rather than merely easier to patch?
Related
- RMSNorm stabilized scaling
- Sparse outlier preservation
- Normalization before projections
- Outlier-aware compression
- Decoupled precision