A recurring mistake in compression discussions is to compare methods mainly by nominal bit-width or reconstruction quality before the cost of the final artifact format is counted.

The more interesting seam across Additive Quantization, ClusComp, Fine-grained Parameter Sharing, PTQ1.61, and MicroScopiQ is this:

the winning model may be the one whose structure is easiest for the whole storage pipeline to exploit, not the one with the prettiest local quantizer.

Why this seam matters now

Several recent papers are converging on “non-uniformity,” but they do not all mean the same thing by it.

Put together, these papers point past “low-bit weights” toward a different question:

what kinds of model structure naturally compress well once values, indices, bases, masks, and repeated patterns all enter the artifact?

The key synthesis

There are at least three kinds of useful regularity:

1. Value regularity

Weights fall onto a small set of repeated or codebook-like values.

Natural paper bridge: Additive Quantization and ClusComp, both of which reconstruct weights from shared codebook-like values.
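Value regularity can be sketched with a toy codebook quantizer in the spirit of these methods (not either paper's actual algorithm; all sizes are invented): a numpy-only k-means fits 16 shared values, and the artifact cost becomes 4-bit indices plus the codebook.

```python
import numpy as np

def kmeans_codebook(weights, k=16, iters=10, seed=0):
    """Fit a small 1-D k-means codebook over a flat weight vector (toy Lloyd's)."""
    rng = np.random.default_rng(seed)
    w = weights.ravel()
    centers = rng.choice(w, size=k, replace=False)
    for _ in range(iters):
        idx = np.abs(w[:, None] - centers[None, :]).argmin(axis=1)
        for j in range(k):
            members = w[idx == j]
            if members.size:
                centers[j] = members.mean()
    # final assignment against the final centers
    idx = np.abs(w[:, None] - centers[None, :]).argmin(axis=1)
    return centers, idx

rng = np.random.default_rng(1)
w = rng.normal(size=4096).astype(np.float32)
centers, idx = kmeans_codebook(w, k=16)

# Artifact accounting: 4-bit indices plus a 16-entry fp32 codebook,
# versus 4 bytes per weight stored raw.
index_bytes = idx.size * 4 / 8
codebook_bytes = centers.size * 4
print(index_bytes + codebook_bytes, w.nbytes)  # codebook form vs raw fp32
```

The point is not the quantizer itself but the accounting: indices and codebook both enter the artifact, and repeated indices are exactly what downstream coding can exploit.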

2. Basis regularity

Multiple tensors are explained by a small shared basis plus cheap corrections.

Natural paper bridge: Fine-grained Parameter Sharing, which explains many tensors via shared decompositions plus sparsity.
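A minimal sketch of basis regularity, with made-up shapes rather than anything from the cited paper: several synthetic tensors share a hidden basis, one SVD fits a shared basis across all of them, and each tensor is then stored as per-tensor coefficients plus a small residual correction.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = rng.normal(size=(64, 8))                 # shared structure we hope to recover
tensors = [hidden @ rng.normal(size=(8, 32)) + 0.01 * rng.normal(size=(64, 32))
           for _ in range(4)]

# Fit ONE basis across all tensors (SVD of the concatenation), then explain
# each tensor as basis @ coeffs plus a residual.
stacked = np.concatenate(tensors, axis=1)         # 64 x 128
U, _, _ = np.linalg.svd(stacked, full_matrices=False)
basis = U[:, :8]                                  # shared 8-dim basis

rels = []
for t in tensors:
    coeffs = basis.T @ t                          # cheap per-tensor coefficients
    resid = t - basis @ coeffs                    # what the basis cannot explain
    rels.append(float(np.linalg.norm(resid) / np.linalg.norm(t)))
print([round(r, 3) for r in rels])
```

The storage win comes from the basis being paid for once while each tensor pays only for coefficients and a (hopefully compressible) residual.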

3. Exception regularity

The “special cases” that need better treatment form patterns that can themselves be compressed well.

Natural paper bridge: PTQ1.61 and MicroScopiQ, which both pay structured metadata for outliers and other special cases.
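A toy illustration of why exception patterns matter, with invented sizes: two exception masks with identical sparsity, one scattered at arbitrary positions and one confined to contiguous blocks, cost very different amounts once a generic codec (zlib here) sees the metadata.

```python
import numpy as np
import zlib

n, k = 65536, 1024                      # weights, exceptions (made-up sizes)
rng = np.random.default_rng(0)

# Scattered exceptions: arbitrary positions.
scattered = np.zeros(n, dtype=np.uint8)
scattered[rng.choice(n, size=k, replace=False)] = 1

# Clustered exceptions: the same count, confined to 16 aligned 64-wide blocks.
clustered = np.zeros(n, dtype=np.uint8)
for s in rng.choice(n // 64, size=16, replace=False) * 64:
    clustered[s:s + 64] = 1

# Same nominal sparsity, very different metadata entropy after a generic codec.
s_len = len(zlib.compress(scattered.tobytes()))
c_len = len(zlib.compress(clustered.tobytes()))
print(s_len, c_len)
```

Both masks mark exactly 1024 exceptions, but the clustered one compresses to a fraction of the scattered one's size, which is the "exception regularity" being claimed.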

The frontier claim is that these three regularities may compose better than they look in separate literatures.

A falsifiable thesis

Thesis: under a strict size cap, models whose deviations from cheap structure are regular and clustered will beat models with slightly better local quantization error but more irregular metadata.

In other words, artifact success may depend more on entropy of exceptions than on average quantization error.

What would support it

  • clustered or basis-based exception formats beat equally sized random sparse exceptions after final compression
  • repeated bases and repeated codebooks survive the artifact pipeline unusually well
  • methods with modestly worse pre-artifact error win after the full codec is applied

What would falsify it

  • final compression tracks nominal quantizer quality closely, with little bonus for regular structure
  • metadata costs stay negligible even for irregular exception patterns
  • regularity constraints hurt model quality too much to pay back

The strongest new idea hiding here

A promising Parameter Golf direction is storage-native model design.

That means we stop thinking of compression as the final stage applied to a finished model. Instead we ask whether the model can be designed so its learned structure already looks favorable to downstream coding.

Examples of what that might mean:

  • a shared recurrent block with tiny regular phase adapters instead of many unique layers
  • grouped or codebook-like residuals whose indices repeat heavily
  • protected subsets chosen in coarse blocks or rows rather than arbitrary masks
  • basis-plus-delta decompositions where both the basis and the correction formats repeat across tensors
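The first bullet can be sketched concretely; the width, depth, rank, and the `forward` helper below are all invented for illustration, not a proposed architecture: one shared matrix is reused at every depth step with a tiny low-rank "phase adapter" per step, and the parameter count is compared against a fresh matrix per layer.

```python
import numpy as np

d, depth, r = 256, 12, 4                 # width, unrolled depth, adapter rank (made up)
rng = np.random.default_rng(0)

# One shared block reused at every depth step...
shared = rng.normal(size=(d, d)) / np.sqrt(d)
# ...plus a tiny low-rank adapter per step.
adapters = [(rng.normal(size=(d, r)) / d, rng.normal(size=(r, d)) / d)
            for _ in range(depth)]

def forward(x):
    for a, b in adapters:
        x = np.tanh(x @ (shared + a @ b))   # shared weights + cheap per-step delta
    return x

unique_params = depth * d * d               # baseline: a fresh matrix per layer
shared_params = d * d + depth * 2 * d * r   # storage-native: one matrix + adapters
print(shared_params / unique_params)        # fraction of the baseline's parameters
```

The shared matrix is stored once and the adapters are both small and structurally repetitive, which is exactly the kind of repetition a storage pipeline can exploit.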

This is where the quantization-and-outliers line of work meets recursive sharing in a way the current graph only hints at.

Why this frontier is different from “use better compression”

“Use better compression” usually means a smarter algorithm at the end. This frontier instead says the model should be judged by how well it cooperates with compression.

That shifts the research question from:

  • which post-hoc codec is best?

to:

  • what learned structure keeps both values and metadata low-entropy?
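That second question can at least be measured. A small sketch with an assumed `stream_entropy` helper: heavily repeated codebook indices carry far fewer bits per symbol than uniform indices over the same alphabet, so the same nominal 4-bit format can have very different final cost.

```python
import numpy as np

def stream_entropy(symbols):
    """Empirical Shannon entropy in bits per symbol of a discrete stream."""
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
# Low-entropy structure: heavily repeated codebook indices.
repeated = rng.choice(16, size=8192, p=np.array([0.5] + [0.5 / 15] * 15))
# High-entropy structure: uniform indices over the same 16-symbol alphabet.
uniform = rng.integers(0, 16, size=8192)

print(round(stream_entropy(repeated), 2), round(stream_entropy(uniform), 2))
```

Both streams are "4-bit indices" on paper; the entropy gap is the headroom an entropy coder in the artifact pipeline can actually collect.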

The second question is much closer to a true Parameter Golf objective.

Important cautions

This seam is easy to romanticize. It can fail in boring ways:

  • codebooks and indices may cost more than they save
  • regularity constraints may destroy important rare structure
  • the “entropy-friendly” model may become harder to train than a less structured baseline
  • gains may be specific to one artifact pipeline and not robust

So the point is not to assume regularity helps. The point is to measure whether regularity survives accounting.

Experiments this frontier suggests

  1. compare coarse-block exceptions against fine-grained sparse exceptions at equal final bytes
  2. measure pre-artifact error versus post-artifact score to find ranking reversals
  3. test basis-plus-delta formats where the basis is shared across layers or tensor families
  4. compare repeated small codebooks against larger tensor-specific codebooks
  5. analyze whether shared-depth models create more compressible repetition than non-shared models with the same score before compression
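Experiment 1 can be prototyped in a few lines; the per-exception byte costs and all sizes below are assumptions for illustration, not measurements from any cited paper. Both exception formats get the same byte budget, and we compare reconstruction error against a crude global-scale quantizer.

```python
import numpy as np

rng = np.random.default_rng(0)
n, bs = 16384, 64
w = rng.standard_normal(n).astype(np.float32)
w[rng.choice(n, size=256, replace=False)] *= 8     # inject scattered outliers

def quant(x, bits=4):
    """Crude uniform symmetric quantizer with a global scale (illustration only)."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

budget = 2048                                      # exception budget in bytes (made up)

# Fine-grained format: 2-byte index + 4-byte fp32 value per exception.
k_fine = budget // 6
fine_mask = np.zeros(n, dtype=bool)
fine_mask[np.argsort(-np.abs(w))[:k_fine]] = True

# Coarse format: 2-byte block id + 64 fp32 values per block.
k_blk = budget // (2 + 4 * bs)
blk_mask = np.zeros(n, dtype=bool)
for b in np.argsort(-np.abs(w).reshape(-1, bs).max(axis=1))[:k_blk]:
    blk_mask[b * bs:(b + 1) * bs] = True

def mean_err(mask):
    approx = quant(w).astype(np.float32)
    approx[mask] = w[mask]                         # exceptions stored exactly
    return float(np.abs(approx - w).mean())

print(mean_err(fine_mask), mean_err(blk_mask))     # equal final bytes, two layouts
```

Which layout wins depends on how clustered the injected outliers are, which is exactly the ranking-reversal question experiments 1 and 2 are probing.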

Bottom line

The important question may no longer be “how few bits per weight?”

It may be:

how much useful regularity can the whole model expose to the storage pipeline without giving away too much quality?

Egiazarian, V., Panferov, A., Kuznedelev, D., Frantar, E., Babenko, A., & Alistarh, D. (2024). Extreme Compression of Large Language Models via Additive Quantization. arXiv Preprint arXiv:2401.06118. https://arxiv.org/abs/2401.06118
Liao, B., Herold, C., Hashemi, S. H., Vasilev, S., Khadivi, S., & Monz, C. (2025). ClusComp: A Simple Paradigm for Model Compression and Efficient Finetuning. arXiv Preprint arXiv:2503.13089. https://arxiv.org/abs/2503.13089
Ramachandran, A., Kundu, S., & Krishna, T. (2024). MicroScopiQ: Accelerating Foundational Models through Outlier-Aware Microscaling Quantization. arXiv Preprint arXiv:2411.05282. https://arxiv.org/abs/2411.05282
Üyük, C., Lasby, M., Yassin, M., Evci, U., & Ioannou, Y. (2024). Learning Parameter Sharing with Tensor Decompositions and Sparsity. arXiv Preprint arXiv:2411.09816. https://arxiv.org/abs/2411.09816
Zhao, J., Zhang, M., Wang, M., Shang, Y., Zhang, K., Guan, W., Wang, Y., & Zhang, M. (2025). PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models. arXiv Preprint arXiv:2502.13179. https://arxiv.org/abs/2502.13179