This index is a working research shelf for Parameter Golf, not a comprehensive bibliography. The goal is to keep the papers that most directly sharpen current lanes, hypotheses, and implementation notes.
Most central to the current search
- Extra RMSNorm — cheap architectural stabilization before low-bit projections; strongest support for RMSNorm-stabilized scaling (Steinmetz et al., 2025)
- pQuant — explicit argument that uniform low-bit treatment wastes bytes on the wrong parameters; central to sparse outlier preservation (Zhang et al., 2026)
- ClusComp — clustering and structured reuse as an alternative to scalar uniform quantization (Liao et al., 2025)
- Relaxed Recursive Transformers — strongest direct bridge from standard transformers to recursive/shared-depth models (Bae et al., 2024)
- Radio — the clearest information-theoretic framing of byte allocation under a hard storage budget (Young, 2025)
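The hard-budget framing in the last item reduces to back-of-envelope arithmetic: a storage cap in bytes fixes the average bit-width across all parameters, so any tensor held at higher precision must be paid for by the rest of the model. A toy illustration — all tensor names and counts below are made up, not taken from any paper:

```python
# Toy byte-budget arithmetic: average bits/param under a hard cap.
# Tensor names and sizes are illustrative, not from any paper.
params = {                      # tensor name -> parameter count
    "embeddings": 32_000 * 256,
    "blocks":     12 * 3_000_000,
    "lm_head":    32_000 * 256,
}
budget_bytes = 20_000_000       # hard storage cap: 20 MB

total_params = sum(params.values())
avg_bits = budget_bytes * 8 / total_params
print(f"{total_params:,} params, {avg_bits:.2f} bits/param on average")

# Holding embeddings at 8-bit forces every remaining parameter
# below the global average.
emb = params["embeddings"]
rest = total_params - emb
rest_bits = (budget_bytes * 8 - emb * 8) / rest
print(f"rest of model: {rest_bits:.2f} bits/param")
```

The point of the exercise is the one Radio makes formally: under a hard cap, precision is a zero-sum allocation problem, not a per-tensor choice.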
Quantization, outliers, and compression-aware training
- Extra RMSNorm
- pQuant
- QuEST
- Radio
- OWQ
- ReALLM
- BackSlash
- NuMuon
- LittleBit
- Neural Weight Compression
- AQLM
- ClusComp
- PTQ1.61
- MicroScopiQ
- AWQ
- QuaRot
- BitNet b1.58
- Getting Free Bits Back from Rotational Symmetries in LLMs
Best read alongside the lanes on quantization and outlier handling, outlier-aware compression, decoupled precision, and normalization before projections.
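The outlier-handling thread running through this cluster rests on one shared mechanic: keep a small fraction of large-magnitude weights at full precision and uniformly quantize the rest to a few bits. A minimal NumPy sketch of that idea — a toy illustration, not the exact method of pQuant, OWQ, or any other paper here:

```python
import numpy as np

def quantize_with_outliers(w, bits=4, outlier_frac=0.01):
    """Toy mixed-precision quantizer: keep the largest-magnitude
    weights exact, symmetrically quantize the rest to `bits` bits."""
    w = np.asarray(w, dtype=np.float64)
    k = max(1, int(outlier_frac * w.size))
    # Indices of the k largest-magnitude entries: the "outliers".
    outlier_idx = np.argsort(np.abs(w).ravel())[-k:]
    mask = np.zeros(w.size, dtype=bool)
    mask[outlier_idx] = True

    body = w.ravel()[~mask]
    # Symmetric uniform quantization of the non-outlier body.
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(body).max() / levels if body.size else 1.0
    q = np.round(body / scale).clip(-levels, levels)

    out = np.empty_like(w.ravel())
    out[~mask] = q * scale        # dequantized low-bit weights
    out[mask] = w.ravel()[mask]   # outliers stored in full precision
    return out.reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=1024)
w[:3] = [8.0, -9.0, 7.5]          # inject a few large outliers
w_hat = quantize_with_outliers(w, bits=4, outlier_frac=0.01)
```

Without the outlier split, the injected values would stretch the quantization scale and crush the resolution available to the other 99% of weights; with it, the body quantizes against its own, much smaller range.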
Recursive sharing, recurrent depth, and parameter reuse
- Relaxed Recursive Transformers
- Dynamic Layer Tying
- MoEUT
- Fine-grained Parameter Sharing
- Universal Transformers
- ALBERT
- ClusComp
Best read alongside the lanes on recursive and shared-parameter architectures, recursive width scaling, recurrent wide architectures, and recursive layer sharing.
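The shared-depth idea that connects this cluster is simple to state: instead of L distinct layers, apply one parameter block K times, cutting stored parameters by a factor of K while keeping the same effective depth. A toy sketch, with a stand-in layer function — the names and the layer itself are illustrative, not any paper's architecture:

```python
import numpy as np

def make_layer(d, rng):
    """One toy 'layer': a linear map plus nonlinearity.
    Stands in for a full attention + MLP transformer block."""
    W = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))
    return lambda x: np.tanh(x @ W)

def untied_forward(x, layers):
    # Standard depth-L stack: L distinct parameter blocks.
    for layer in layers:
        x = layer(x)
    return x

def tied_forward(x, layer, depth):
    # Recursive / weight-tied stack: one block reused `depth`
    # times, so stored parameters shrink by a factor of `depth`.
    for _ in range(depth):
        x = layer(x)
    return x

rng = np.random.default_rng(0)
d, depth = 16, 6
x = rng.normal(size=(4, d))
shared = make_layer(d, rng)
y = tied_forward(x, shared, depth)
```

The papers in this cluster differ mainly in how far they relax the tying — e.g. low-rank per-iteration deltas (Relaxed Recursive Transformers) or learned layer-to-layer assignments (Dynamic Layer Tying) — but all start from this reuse loop.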
Tokenizer, vocabulary, and output-head efficiency
- ReTok
- Fast Vocabulary Transfer
- Vocabulary Compression for Low-Compute Environments
- Beyond Text Compression
- Plan Early
- ALBERT
This cluster connects most strongly to the tokenizer-and-vocabulary-efficiency and tokenizer-efficiency lanes.
Compute budgeting and inference-time tradeoffs
- Computational Bottlenecks of Training SLMs
- Inference Scaling Laws
- Plan Early
- Universal Transformers
- MoEUT
This is the paper trail behind training economics and evaluation-time compute.
Useful synthesis paths
- If you are exploring low-bit stability first: Extra RMSNorm → QuEST → BitNet b1.58
- If you are exploring selective precision: pQuant → OWQ → AWQ → PTQ1.61 → MicroScopiQ
- If you are exploring explicit byte-allocation logic: Radio → OWQ → ReALLM
- If you are exploring bleeding-edge artifact-native compression: BackSlash → NuMuon → Neural Weight Compression → LittleBit → Getting Free Bits Back from Rotational Symmetries in LLMs
- If you are exploring structured non-uniform compression: AQLM → ClusComp → QuaRot
- If you are exploring shared-depth architectures: Universal Transformers → ALBERT → Dynamic Layer Tying → Relaxed Recursive Transformers → MoEUT
- If you are exploring tokenizer / output-side levers: Beyond Text Compression → Fast Vocabulary Transfer → ReTok → Vocabulary Compression for Low-Compute Environments