Hypothesis

Most of the post-compression damage in the embedding / LM-head path may come from a small subset of vocabulary rows rather than from the whole matrix.

If so, the right move is not to protect the entire head, but to identify and preserve only rows associated with tokens that are simultaneously:

  • high frequency or high loss contribution
  • high entropy / easily confused in context
  • unusually sensitive to quantization
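As an illustrative sketch of how the three criteria could be combined into one rescue score (function name, equal weighting, and z-score normalization are all assumptions, not part of the proposal):

```python
import numpy as np

def select_rescue_rows(freq, entropy, quant_err, k=512):
    """Combine the three per-token signals above into one score and
    return indices of the top-k rows to protect. Equal weights and
    z-score normalization are illustrative choices, not prescriptions."""
    def z(x):  # normalize each signal so the scales are comparable
        return (x - x.mean()) / (x.std() + 1e-8)
    score = z(np.log1p(freq)) + z(entropy) + z(quant_err)
    return np.argsort(score)[-k:]  # indices of the k highest-scoring rows
```

In practice the weights would themselves be tuned on validation loss, since the three signals are unlikely to matter equally.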

Mechanism sketch

A disciplined version of the idea would:

  • quantize the full embedding / head normally
  • compute a token-row sensitivity score from validation loss change, row error, or gradient statistics
  • keep only the top-ranked rows in a higher-precision residual path
  • spend the saved bytes on a larger trunk, better norms, or simply more headroom under the cap

This is basically pQuant logic applied at the vocabulary-row level rather than the whole-tensor level.
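A minimal numpy sketch of the mechanism, assuming symmetric per-row uniform quantization and a full-precision residual as the rescue path (all names illustrative):

```python
import numpy as np

def quantize_with_row_rescue(W, rescue_idx, bits=4):
    """Quantize the full embedding/head matrix per-row, then keep a
    full-precision residual only for the rescued rows. Returns the
    reconstructed matrix the model would actually use."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                  # guard all-zero rows
    Wq = np.clip(np.round(W / scale), -qmax, qmax)
    W_hat = Wq * scale                       # dequantized base path
    residual = np.zeros_like(W)
    residual[rescue_idx] = (W - W_hat)[rescue_idx]  # rescue path only
    return W_hat + residual
```

Note that rescued rows are reconstructed exactly here; a real residual path would itself be stored at some reduced precision, trading a little accuracy for fewer bytes.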

Why this might work

The supporting evidence comes from three directions: tokenizer evaluation across scales (Lotz et al., 2025), vocabulary compression for low-compute environments (Vennam et al., 2024), and decoupled quantization-aware training that protects selected components at higher precision (pQuant; Zhang et al., 2026).

The new connection is that tokenizer difficulty and quantization sensitivity may align: if a small set of tokens already dominates uncertainty, those tokens' rows may also be the ones worth rescuing.
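That alignment claim is directly measurable. A dependency-free Spearman rank correlation between per-token predictive entropy and per-row quantization error (both inputs hypothetical here) would do:

```python
import numpy as np

def rank_alignment(entropy, quant_err):
    """Spearman rank correlation between per-token predictive entropy
    and per-row quantization error. Values near +1 would support the
    alignment claim; values near 0 would undercut it. Assumes no ties."""
    def ranks(x):  # position of each element in the sorted order
        r = np.empty(len(x))
        r[np.argsort(x)] = np.arange(len(x))
        return r
    return np.corrcoef(ranks(entropy), ranks(quant_err))[0, 1]
```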


What would falsify it

This idea should be rejected if:

  1. quantization harm is diffuse across the vocabulary rather than concentrated in a small row set
  2. the selected rows are unstable across data slices, making the rescue set brittle
  3. row metadata and exception handling cost too many bytes relative to the gain
  4. protecting rows helps token-level loss but does not improve final bits-per-byte on the scored artifact
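Falsifier 1 in particular has a cheap diagnostic: measure how much of the total harm the top slice of rows carries (using per-row error as the harm proxy is an assumption; validation-loss attribution would be stricter):

```python
import numpy as np

def harm_concentration(row_err, top_frac=0.01):
    """Fraction of total per-row error carried by the top `top_frac`
    of vocabulary rows. Near 1.0: concentrated harm, hypothesis
    survives. Near `top_frac`: diffuse harm, falsifier 1 holds."""
    err = np.sort(np.asarray(row_err, dtype=float))[::-1]
    k = max(1, int(len(err) * top_frac))
    return float(err[:k].sum() / err.sum())
```

Falsifier 2 has an equally cheap check: compute the rescue set on two disjoint data slices and compare the overlap of the selected indices.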

Why it matters under the 16 MB cap

Protecting even a few hundred vocabulary rows can be dramatically cheaper than raising precision for the full head or embedding matrix.

That matters because the cap punishes broad generosity. If head damage is truly concentrated, row-level rescue could be one of the highest-leverage uses of leftover bytes in the entire submission.
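To make "dramatically cheaper" concrete under illustrative assumptions (vocab size 50,304, hidden width 768, fp16 rescue rows plus int32 row indices; none of these numbers come from the proposal):

```python
# Byte budget for row-level rescue vs. a full-precision head.
V, d = 50_304, 768                         # assumed vocab size and width
full_fp16 = V * d * 2                      # whole head at fp16: ~73.7 MiB
n_rescue = 512
rescue = n_rescue * d * 2 + n_rescue * 4   # fp16 rows + int32 indices: ~0.75 MiB
print(f"full head: {full_fp16 / 2**20:.1f} MiB, rescue: {rescue / 2**20:.2f} MiB")
```

Under these assumptions the rescue path costs roughly 1% of a full fp16 head, i.e. about 0.75 MiB against a 16 MB cap.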

Lotz, J. F., Lopes, A. V., Peitz, S., Setiawan, H., & Emili, L. (2025). Beyond Text Compression: Evaluating Tokenizers Across Scales. arXiv Preprint arXiv:2506.03101. https://arxiv.org/abs/2506.03101
Vennam, S., Joishy, A., & Kumaraguru, P. (2024). LLM Vocabulary Compression for Low-Compute Environments. arXiv Preprint arXiv:2411.06371. https://arxiv.org/abs/2411.06371
Zhang, W., Liu, B., Hu, Y., Bai, X., Zhang, W., & Cui, B. (2026). pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training. arXiv Preprint arXiv:2602.22592. https://arxiv.org/abs/2602.22592