(Shao et al., 2025)

Sources: arXiv:2505.10202 · alphaXiv overview

Core contribution

VQ-Logits attacks the output bottleneck directly: it replaces the full vocabulary-sized logits projection with a compact vector-quantized codebook. The key claim is that the output head can often be compressed more structurally than plain low-rank factorization or tied embeddings suggest, because many vocabulary items can share a small predictive basis.
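As a concrete sketch of the idea: instead of a dense V × d output projection, keep a K × d codebook (K ≪ V) plus a length-V index mapping each vocabulary token to a code, so tokens assigned to the same code share a logit. Everything below is illustrative, not the paper's implementation:

```python
def vq_logits(h, codebook, code_index):
    """Toy vector-quantized output head (hypothetical, for intuition only).

    h          : hidden state, a list of d floats
    codebook   : K code vectors, each a list of d floats (K << vocab size)
    code_index : length-V list mapping each vocab id to a code id
    """
    # One dot product per code instead of one per vocabulary token.
    code_logits = [sum(hi * ci for hi, ci in zip(h, code)) for code in codebook]
    # Scatter code logits back to the full vocabulary: tokens that share
    # a code receive the same logit.
    return [code_logits[c] for c in code_index]
```

Parameter cost drops from V·d weights to K·d weights plus V small integers for the mapping; whether that wins after all overheads are counted is exactly the bookkeeping caveat discussed later in this note.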

Why this matters for Parameter Golf

This is one of the clearest papers on the shelf for output-head compression. It makes the output side feel like a real artifact-budget lever rather than a theoretical annoyance. In a tiny model, saving head bytes can be as meaningful as shaving another fraction of a bit from the trunk.

What to import

  • The logits projection is a compressible structure, not just a fixed tax.
  • A compact codebook may buy better head tradeoffs than naive low-rank factorization alone.
  • Vocabulary semantics and output parameterization should be designed together.

What not to over-import

The paper does not prove that codebook-style output compression wins after all bookkeeping, mapping, and implementation overheads are counted inside a strict challenge artifact. The stable lesson is that the head deserves its own structured compression search, not that any specific VQ scheme is automatically byte-optimal.

Parameter Golf translation

VQ-Logits motivates asking:

  • how many bytes are tied up in the head versus the trunk,
  • whether head restructuring can buy more than another round of backbone compression,
  • and whether tokenizer changes should be evaluated jointly with output-head compression schemes.
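A back-of-envelope sketch of the first question, with all sizes, dtypes, and the code count purely illustrative:

```python
def full_head_bytes(vocab_size, d_model, bytes_per_param=2):
    """Bytes for a dense V x d logits projection (fp16 weights assumed)."""
    return vocab_size * d_model * bytes_per_param

def vq_head_bytes(num_codes, d_model, vocab_size,
                  bytes_per_param=2, bytes_per_index=2):
    """Bytes for a K x d codebook plus a uint16 token-to-code index."""
    return num_codes * d_model * bytes_per_param + vocab_size * bytes_per_index

# Hypothetical numbers: 50k vocab, d = 512, K = 1024 codes.
full = full_head_bytes(50_000, 512)     # 51,200,000 bytes (~51 MB)
vq = vq_head_bytes(1_024, 512, 50_000)  # 1,148,576 bytes (~1.1 MB)
```

At these (made-up) sizes the dense head would dominate a small artifact's budget, which is why the head merits its own structured compression search before another round of backbone squeezing.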

Reference

Shao, J., Huang, H., Wu, J., Cheng, Y., Wu, Z., Shan, Y., & Zheng, M. (2025). VQ-Logits: Compressing the Output Bottleneck of Large Language Models via Vector Quantized Logits. arXiv preprint arXiv:2505.10202. https://arxiv.org/abs/2505.10202