This index is a working research shelf for Parameter Golf, not a comprehensive bibliography. The goal is to keep the papers that most directly sharpen current lanes, hypotheses, and implementation notes.
Most central to the current search
- Extra RMSNorm — cheap architectural stabilization before low-bit projections; strongest support for RMSNorm-stabilized scaling (Steinmetz et al., 2025)
- pQuant — explicit argument that uniform low-bit treatment wastes bytes on the wrong parameters; central to sparse outlier preservation (Zhang et al., 2026)
- ClusComp — clustering and structured reuse as an alternative to scalar uniform quantization (Liao et al., 2025)
- Relaxed Recursive Transformers — strongest direct bridge from standard transformers to recursive/shared-depth models (Bae et al., 2024)
- Radio — the clearest information-theoretic framing of byte allocation under a hard storage budget (Young, 2025)
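The hard-budget framing in the last item reduces to back-of-envelope arithmetic: a storage cap in bytes fixes the average bit-width across all parameters, so any tensor held at higher precision must be paid for by the rest of the model. A toy illustration — all tensor names and counts below are made up, not taken from any paper:

```python
# Toy byte-budget arithmetic: average bits/param under a hard cap.
# Tensor names and sizes are illustrative, not from any paper.
params = {                      # tensor name -> parameter count
    "embeddings": 32_000 * 256,
    "blocks":     12 * 3_000_000,
    "lm_head":    32_000 * 256,
}
budget_bytes = 20_000_000       # hard storage cap: 20 MB

total_params = sum(params.values())
avg_bits = budget_bytes * 8 / total_params
print(f"{total_params:,} params, {avg_bits:.2f} bits/param on average")

# Holding embeddings at 8-bit forces every remaining parameter
# below the global average.
emb = params["embeddings"]
rest = total_params - emb
rest_bits = (budget_bytes * 8 - emb * 8) / rest
print(f"rest of model: {rest_bits:.2f} bits/param")
```

The point of the exercise is the one Radio makes formally: under a hard cap, precision is a zero-sum allocation problem, not a per-tensor choice.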
Quantization, outliers, and compression-aware training
- Extra RMSNorm
- pQuant
- QuEST
- Radio
- OWQ
- ReALLM
- BackSlash
- NuMuon
- LittleBit
- Neural Weight Compression
- AQLM
- ClusComp
- PTQ1.61
- MicroScopiQ
- AWQ
- QuaRot
- BitNet b1.58
- Getting Free Bits Back from Rotational Symmetries in LLMs
Best read alongside the lanes on quantization and outlier handling, outlier-aware compression, decoupled precision, and normalization before projections.
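The outlier-handling thread running through this cluster rests on one shared mechanic: keep a small fraction of large-magnitude weights at full precision and uniformly quantize the rest to a few bits. A minimal NumPy sketch of that idea — a toy illustration, not the exact method of pQuant, OWQ, or any other paper here:

```python
import numpy as np

def quantize_with_outliers(w, bits=4, outlier_frac=0.01):
    """Toy mixed-precision quantizer: keep the largest-magnitude
    weights exact, symmetrically quantize the rest to `bits` bits."""
    w = np.asarray(w, dtype=np.float64)
    k = max(1, int(outlier_frac * w.size))
    # Indices of the k largest-magnitude entries: the "outliers".
    outlier_idx = np.argsort(np.abs(w).ravel())[-k:]
    mask = np.zeros(w.size, dtype=bool)
    mask[outlier_idx] = True

    body = w.ravel()[~mask]
    # Symmetric uniform quantization of the non-outlier body.
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(body).max() / levels if body.size else 1.0
    q = np.round(body / scale).clip(-levels, levels)

    out = np.empty_like(w.ravel())
    out[~mask] = q * scale        # dequantized low-bit weights
    out[mask] = w.ravel()[mask]   # outliers stored in full precision
    return out.reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=1024)
w[:3] = [8.0, -9.0, 7.5]          # inject a few large outliers
w_hat = quantize_with_outliers(w, bits=4, outlier_frac=0.01)
```

Without the outlier split, the injected values would stretch the quantization scale and crush the resolution available to the other 99% of weights; with it, the body quantizes against its own, much smaller range.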
Recursive sharing, recurrent depth, and parameter reuse
- Relaxed Recursive Transformers
- Dynamic Layer Tying
- MoEUT
- Fine-grained Parameter Sharing
- Universal Transformers
- ALBERT
- ClusComp
Best read alongside the lanes on recursive and shared-parameter architectures, recursive width scaling, recurrent wide architectures, and recursive layer sharing.
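The shared-depth idea that connects this cluster is simple to state: instead of L distinct layers, apply one parameter block K times, cutting stored parameters by a factor of K while keeping the same effective depth. A toy sketch, with a stand-in layer function — the names and the layer itself are illustrative, not any paper's architecture:

```python
import numpy as np

def make_layer(d, rng):
    """One toy 'layer': a linear map plus nonlinearity.
    Stands in for a full attention + MLP transformer block."""
    W = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))
    return lambda x: np.tanh(x @ W)

def untied_forward(x, layers):
    # Standard depth-L stack: L distinct parameter blocks.
    for layer in layers:
        x = layer(x)
    return x

def tied_forward(x, layer, depth):
    # Recursive / weight-tied stack: one block reused `depth`
    # times, so stored parameters shrink by a factor of `depth`.
    for _ in range(depth):
        x = layer(x)
    return x

rng = np.random.default_rng(0)
d, depth = 16, 6
x = rng.normal(size=(4, d))
shared = make_layer(d, rng)
y = tied_forward(x, shared, depth)
```

The papers in this cluster differ mainly in how far they relax the tying — e.g. low-rank per-iteration deltas (Relaxed Recursive Transformers) or learned layer-to-layer assignments (Dynamic Layer Tying) — but all start from this reuse loop.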
Tokenizer, vocabulary, and output-head efficiency
- ReTok
- Fast Vocabulary Transfer
- Vocabulary Compression for Low-Compute Environments
- Beyond Text Compression
- Plan Early
- ALBERT
This cluster connects most strongly to the tokenizer-and-vocabulary-efficiency and tokenizer-efficiency lanes.
Compute budgeting and inference-time tradeoffs
- Computational Bottlenecks of Training SLMs
- Inference Scaling Laws
- Plan Early
- Universal Transformers
- MoEUT
This is the paper trail behind training economics and evaluation-time compute.
Useful synthesis paths
- If you are exploring low-bit stability first: Extra RMSNorm → QuEST → BitNet b1.58
- If you are exploring selective precision: pQuant → OWQ → AWQ → PTQ1.61 → MicroScopiQ
- If you are exploring explicit byte-allocation logic: Radio → OWQ → ReALLM
- If you are exploring bleeding-edge artifact-native compression: BackSlash → NuMuon → Neural Weight Compression → LittleBit → Getting Free Bits Back from Rotational Symmetries in LLMs
- If you are exploring structured non-uniform compression: AQLM → ClusComp → QuaRot
- If you are exploring shared-depth architectures: Universal Transformers → ALBERT → Dynamic Layer Tying → Relaxed Recursive Transformers → MoEUT
- If you are exploring tokenizer / output-side levers: Beyond Text Compression → Fast Vocabulary Transfer → ReTok → Vocabulary Compression for Low-Compute Environments