A new seam is opening beyond ordinary post-training quantization (PTQ) debates:
the strongest next gains may come not from a better handcrafted low-bit format, but from making the model itself more naturally compressible and letting learned codecs exploit that structure.
This seam is where several recent papers unexpectedly meet.
Why this seam matters now
Recent work is attacking compression from directions that used to feel separate:
- BackSlash pushes rate constraints into training. (Wu et al., 2025)
- NuMuon pushes optimizer dynamics toward low-rank, compressible weights. (Dolatabadi et al., 2026)
- Neural Weight Compression treats weights as a learned-codec modality. (Ryu et al., 2025)
- LittleBit shows that ultra-low-bit success may require factorized latent structure, not only better scalar quantization. (Lee et al., 2025)
- Getting Free Bits Back from Rotational Symmetries in LLMs reminds us that some savings may come from removing redundant descriptions before distortion even enters. (He et al., 2024)
This is more than a paper list. It suggests a deeper shift in how to think about the artifact.
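Several of the papers above lean on the same mechanism: structure that shrinks the description length. A minimal illustration (my own toy arithmetic, not any paper's accounting) is the parameter count of a rank-r factorization versus a dense matrix:

```python
def dense_params(m, n):
    # A dense weight matrix W stores m * n values.
    return m * n

def factored_params(m, n, r):
    # A rank-r factorization W ~= U @ V stores U: (m, r) and V: (r, n).
    return m * r + r * n

m, n, r = 4096, 4096, 256
assert factored_params(m, n, r) < dense_params(m, n)
# The savings only materialize when r << min(m, n) -- i.e. when training
# or the optimizer has actually produced low-rank structure to exploit.
```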
The central synthesis
Older compression thinking often assumes a pipeline like:
- train a standard model
- quantize or prune it cleverly
- pay whatever storage layout that implies
The new seam suggests a stronger alternative:
- train a model to become structurally compressible
- represent it with a learned or symmetry-aware storage format
- spend bytes only where the structure truly breaks
That reframes compression as an artifact co-design problem.
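As a sketch of what "train to become structurally compressible" can mean (illustrative only; `rate_proxy` is a hypothetical stand-in for whatever rate model the downstream codec assumes, not the objective from any cited paper), the task loss gains a differentiable rate penalty:

```python
import numpy as np

def rate_proxy(w, delta=1e-2):
    # Hypothetical differentiable stand-in for "bits to encode": weights
    # near zero relative to a quantization step delta are cheap, large
    # weights are costly.
    return float(np.sum(np.log2(1.0 + np.abs(w) / delta)))

def compressibility_aware_loss(task_loss, w, lam=1e-4):
    # Joint objective: fit the task, but pay a rate penalty so training
    # itself pushes the weights toward a cheaply encodable distribution.
    return task_loss + lam * rate_proxy(w)

rng = np.random.default_rng(0)
w_dense = rng.normal(0.0, 1.0, size=1000)
w_sparse = w_dense * (np.abs(w_dense) > 2.0)  # same shape, mostly zeros
assert rate_proxy(w_sparse) < rate_proxy(w_dense)
```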
Why this is different from ordinary low-bit work
The important change is where the intelligence lives.
Old emphasis
- better rounding
- better saliency heuristics
- better exception handling after the model already exists
New emphasis
- better weight geometry during training
- better learned or factorized representations
- better elimination of redundant descriptions
- better alignment between optimizer, structure, and codec
This does not make older PTQ work obsolete. It says the next frontier may lie one layer upstream.
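The "elimination of redundant descriptions" point can be made concrete with a toy version of the rotational symmetry the bits-back paper exploits: for a two-layer linear map, inserting any orthogonal rotation between the layers leaves the function unchanged, so the choice of rotation is pure description redundancy that never needs to be stored.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # first layer
W2 = rng.normal(size=(2, 4))   # second layer
x = rng.normal(size=3)

# Any orthogonal Q gives a different parameterization of the same function:
# (W2 @ Q) @ (Q.T @ W1) == W2 @ W1, since Q @ Q.T is the identity.
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
y_orig = W2 @ W1 @ x
y_rotated = (W2 @ Q) @ (Q.T @ W1) @ x
assert np.allclose(y_orig, y_rotated)
```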
A falsifiable thesis
Thesis: once a model is already near the edge of aggressive low-bit compression, the next meaningful gains come from shaping the compressibility manifold during training and exploiting it with richer artifact formats.
What would support it
- models trained with compressibility-aware objectives beat equally sized post-hoc compressed baselines
- learned codecs outperform handcrafted formats once codec overhead is honestly counted
- structured/factorized representations keep improving at equal final bytes where scalar methods saturate
- symmetry-aware or bits-back style tricks recover nontrivial free savings on top of already strong pipelines
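"Honestly counted" just means charging every byte the decoder needs against the format. A minimal accounting sketch (all numbers hypothetical):

```python
def artifact_bytes(payload_bytes, decoder_param_bytes=0,
                   exception_table_bytes=0, metadata_bytes=0):
    # Total cost of a compressed checkpoint: the encoded weights plus
    # everything required to reconstruct them at load time.
    return (payload_bytes + decoder_param_bytes
            + exception_table_bytes + metadata_bytes)

# A learned codec must win *after* its own decoder parameters are
# charged to it (hypothetical numbers, not from any paper).
handcrafted = artifact_bytes(payload_bytes=120_000_000,
                             exception_table_bytes=3_000_000)
learned = artifact_bytes(payload_bytes=100_000_000,
                         decoder_param_bytes=15_000_000)
assert learned < handcrafted
```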
What would falsify it
- codec or optimizer overhead dominates the saved bytes
- learned representations win only at bitrates irrelevant to the challenge
- handcrafted quantizers plus tiny exception paths still dominate at the real artifact scale
- training-induced compressibility improves proxies but not post-roundtrip val_bpb
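That last falsifier is why evaluation has to happen after a full encode/decode roundtrip rather than on proxy statistics. A minimal harness, with a toy uniform quantizer standing in for the real codec and reconstruction error standing in for val_bpb:

```python
import numpy as np

def roundtrip(w, bits):
    # Toy stand-in for the real encode/decode path: uniform quantization
    # over the weight range, then dequantization.
    levels = 2 ** bits
    lo, hi = w.min(), w.max()
    q = np.round((w - lo) / (hi - lo) * (levels - 1))
    return q / (levels - 1) * (hi - lo) + lo

w = np.random.default_rng(0).normal(size=10_000)
# The thesis is judged on the metric computed from roundtripped weights,
# not on pre-codec statistics of the original weights.
err = {b: float(np.mean((w - roundtrip(w, b)) ** 2)) for b in (2, 4, 8)}
assert err[8] < err[4] < err[2]
```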
Why this connects directly to the moonshots
This frontier is the strongest current evidence base behind the moonshot pages.
Those pages were not written in a vacuum. The newest literature is starting to point in the same direction, even if each paper only sees one slice of the shift.
Bottom line
The bleeding edge is starting to treat weight storage less like “apply a smaller number format” and more like:
- shape the weights during training
- encode them as structured objects
- remove redundant descriptions
- and pay explicit byte cost only where structure fails
If that seam is real, it is one of the best places to search for genuinely non-obvious Parameter Golf gains.
Related
- BackSlash
- NuMuon
- Neural Weight Compression
- LittleBit
- Getting Free Bits Back from Rotational Symmetries in LLMs
- Moonshots