Hypothesis

Most of the post-compression damage in the embedding / LM-head path may come from a small subset of vocabulary rows rather than from the whole matrix.

If so, the right move is not to protect the entire head, but to identify and preserve only rows associated with tokens that are simultaneously:

  • high frequency or high loss contribution
  • high entropy / easily confused in context
  • unusually sensitive to quantization
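As an illustrative sketch of how the three criteria could be combined into one rescue score (function name, equal weighting, and z-score normalization are all assumptions, not part of the proposal):

```python
import numpy as np

def select_rescue_rows(freq, entropy, quant_err, k=512):
    """Combine the three per-token signals above into one score and
    return indices of the top-k rows to protect. Equal weights and
    z-score normalization are illustrative choices, not prescriptions."""
    def z(x):  # normalize each signal so the scales are comparable
        return (x - x.mean()) / (x.std() + 1e-8)
    score = z(np.log1p(freq)) + z(entropy) + z(quant_err)
    return np.argsort(score)[-k:]  # indices of the k highest-scoring rows
```

In practice the weights would themselves be tuned on validation loss, since the three signals are unlikely to matter equally.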

Mechanism sketch

A disciplined version of the idea would:

  • quantize the full embedding / head normally
  • compute a token-row sensitivity score from validation loss change, row error, or gradient statistics
  • keep only the top-ranked rows in a higher-precision residual path
  • spend the saved bytes on a larger trunk, better norms, or simply more headroom under the cap

This is basically pQuant logic applied at the vocabulary-row level rather than the whole-tensor level.
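A minimal numpy sketch of the mechanism, assuming symmetric per-row uniform quantization and a full-precision residual as the rescue path (all names illustrative):

```python
import numpy as np

def quantize_with_row_rescue(W, rescue_idx, bits=4):
    """Quantize the full embedding/head matrix per-row, then keep a
    full-precision residual only for the rescued rows. Returns the
    reconstructed matrix the model would actually use."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                  # guard all-zero rows
    Wq = np.clip(np.round(W / scale), -qmax, qmax)
    W_hat = Wq * scale                       # dequantized base path
    residual = np.zeros_like(W)
    residual[rescue_idx] = (W - W_hat)[rescue_idx]  # rescue path only
    return W_hat + residual
```

Note that rescued rows are reconstructed exactly here; a real residual path would itself be stored at some reduced precision, trading a little accuracy for fewer bytes.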

Why this might work

The supporting evidence comes from three directions: tokenizer evaluation across scales (Lotz et al., 2025), vocabulary compression for low-compute environments (Vennam et al., 2024), and decoupled quantization-aware training that protects selected components at higher precision (pQuant; Zhang et al., 2026).

The new connection is that tokenizer difficulty and quantization sensitivity may align: if a small set of tokens already dominates uncertainty, those tokens' rows may also be the ones worth rescuing.
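That alignment claim is directly measurable. A dependency-free Spearman rank correlation between per-token predictive entropy and per-row quantization error (both inputs hypothetical here) would do:

```python
import numpy as np

def rank_alignment(entropy, quant_err):
    """Spearman rank correlation between per-token predictive entropy
    and per-row quantization error. Values near +1 would support the
    alignment claim; values near 0 would undercut it. Assumes no ties."""
    def ranks(x):  # position of each element in the sorted order
        r = np.empty(len(x))
        r[np.argsort(x)] = np.arange(len(x))
        return r
    return np.corrcoef(ranks(entropy), ranks(quant_err))[0, 1]
```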


What would falsify it

This idea should be rejected if:

  1. quantization harm is diffuse across the vocabulary rather than concentrated in a small row set
  2. the selected rows are unstable across data slices, making the rescue set brittle
  3. row metadata and exception handling cost too many bytes relative to the gain
  4. protecting rows helps token-level loss but does not improve final bits-per-byte on the scored artifact
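Falsifier 1 in particular has a cheap diagnostic: measure how much of the total harm the top slice of rows carries (using per-row error as the harm proxy is an assumption; validation-loss attribution would be stricter):

```python
import numpy as np

def harm_concentration(row_err, top_frac=0.01):
    """Fraction of total per-row error carried by the top `top_frac`
    of vocabulary rows. Near 1.0: concentrated harm, hypothesis
    survives. Near `top_frac`: diffuse harm, falsifier 1 holds."""
    err = np.sort(np.asarray(row_err, dtype=float))[::-1]
    k = max(1, int(len(err) * top_frac))
    return float(err[:k].sum() / err.sum())
```

Falsifier 2 has an equally cheap check: compute the rescue set on two disjoint data slices and compare the overlap of the selected indices.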

Why it matters under the 16 MB cap

Protecting even a few hundred vocabulary rows can be dramatically cheaper than raising precision for the full head or embedding matrix.

That matters because the cap punishes broad generosity. If head damage is truly concentrated, row-level rescue could be one of the highest-leverage uses of leftover bytes in the entire submission.
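To make "dramatically cheaper" concrete under illustrative assumptions (vocab size 50,304, hidden width 768, fp16 rescue rows plus int32 row indices; none of these numbers come from the proposal):

```python
# Byte budget for row-level rescue vs. a full-precision head.
V, d = 50_304, 768                         # assumed vocab size and width
full_fp16 = V * d * 2                      # whole head at fp16: ~73.7 MiB
n_rescue = 512
rescue = n_rescue * d * 2 + n_rescue * 4   # fp16 rows + int32 indices: ~0.75 MiB
print(f"full head: {full_fp16 / 2**20:.1f} MiB, rescue: {rescue / 2**20:.2f} MiB")
```

Under these assumptions the rescue path costs roughly 1% of a full fp16 head, i.e. about 0.75 MiB against a 16 MB cap.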

Lotz, J. F., Lopes, A. V., Peitz, S., Setiawan, H., & Emili, L. (2025). Beyond Text Compression: Evaluating Tokenizers Across Scales. arXiv Preprint arXiv:2506.03101. https://arxiv.org/abs/2506.03101
Vennam, S., Joishy, A., & Kumaraguru, P. (2024). LLM Vocabulary Compression for Low-Compute Environments. arXiv Preprint arXiv:2411.06371. https://arxiv.org/abs/2411.06371
Zhang, W., Liu, B., Hu, Y., Bai, X., Zhang, W., & Cui, B. (2026). pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training. arXiv Preprint arXiv:2602.22592. https://arxiv.org/abs/2602.22592