Sources: arXiv:2408.00724 · alphaXiv overview
Core contribution
The paper argues that inference has its own scaling laws: under a fixed test-time compute budget, the best system is often not the largest model you can store, but a smaller model paired with more intelligent search or compute allocation. In other words, parameter count and inference quality do not scale independently of evaluation strategy.
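The tradeoff can be made concrete with a toy cost model (the numbers and the cost formula below are illustrative assumptions, not figures from the paper): if generating one token costs roughly 2 FLOPs per parameter, then a fixed inference budget buys far more best-of-N samples from a small model than from a large one.

```python
# Toy illustration of a fixed inference-FLOP budget (all numbers hypothetical).
# Assumed cost model: FLOPs per generated token ~ 2 * params, so
# samples_affordable = budget // (2 * params * tokens).

BUDGET = 2e15          # total inference FLOPs available (hypothetical)
TOKENS = 1_000         # tokens generated per sample

for params in (1e9, 7e9, 70e9):
    flops_per_sample = 2 * params * TOKENS
    n_samples = int(BUDGET // flops_per_sample)
    print(f"{params / 1e9:>5.0f}B params -> best-of-{n_samples} under the same budget")
# Under these assumptions: 1B affords best-of-1000, 70B only best-of-14.
```

Whether best-of-1000 from a 1B model beats best-of-14 from a 70B model is exactly the empirical question the paper studies; the sketch only shows that the budget, not the model, fixes how many samples are on the table.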
Why this matters for Parameter Golf
This is the anchor paper for evaluation-time compute. Parameter Golf is unusual because the artifact budget is a hard constraint, while evaluation still leaves room for extra computation. That makes this paper more than a generic inference note: it directly reframes the objective from “store the strongest model” to “spend bytes and runtime jointly.”
What to import
- Inference compute is a design variable.
- Better search can dominate more stored parameters.
- Harder examples benefit more from intelligent inference. This matters because benchmark averages can hide tails where extra compute pays off disproportionately.
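The last point suggests spending a fixed sample budget unevenly across examples. A minimal sketch, assuming per-example difficulty scores are available from some estimator (the source does not specify one):

```python
# Hypothetical sketch: split a fixed best-of-N sample budget across
# examples in proportion to estimated difficulty, so hard tail examples
# get more inference compute than easy ones.
def allocate_samples(difficulties, total_samples, min_samples=1):
    """Allocate `total_samples` proportionally to `difficulties`,
    guaranteeing each example at least `min_samples`.
    Flooring may leave a few samples unassigned."""
    n = len(difficulties)
    remaining = total_samples - min_samples * n
    total_diff = sum(difficulties)
    return [min_samples + int(remaining * d / total_diff) for d in difficulties]

print(allocate_samples([0.1, 0.3, 0.6], total_samples=100))
# -> [10, 30, 59]: the hardest example gets ~6x the samples of the easiest.
```

A uniform split would give every example 33 samples; the skewed split is where, per the bullet above, the benchmark average can hide the gain.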
What not to over-import
The paper is not a license to throw complicated search into every submission. Parameter Golf still has runtime, reproducibility, and simplicity constraints. Some search strategies also help reasoning tasks more than generic language modeling. The value here is the budgeting principle, not a promise that any search heuristic will win.
Best synthesis links
- Sharpens recurrent wide architecture by making the compute-for-storage exchange explicit.
- Pairs naturally with MoEUT and Universal Transformers, where recurrent/shared models may spend compute more flexibly than static deep stacks.
- Complements Plan Early, which makes a similar argument from the perspective of task specialization and low inference budgets.
Parameter Golf translation
This paper motivates asking:
- when should saved bytes be spent on more width versus more evaluation-time passes?
- can a compact recurrent model plus light test-time refinement dominate a larger frozen baseline?
- which benchmark slices reward smarter inference rather than only bigger storage?
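The first two questions can be framed with the same toy accounting as above (again an assumed cost model, not a result from the paper): a recurrent model reuses its weights, so extra evaluation-time passes cost runtime but zero extra stored bytes.

```python
# Toy accounting for "width versus passes" (assumed cost model:
# ~2 FLOPs per parameter per generated token per pass).
def runtime_flops(params, tokens, passes=1):
    return 2 * params * tokens * passes

def affordable_passes(small_params, large_params):
    """How many passes of the small model fit in one pass of the large one?"""
    return int(large_params // small_params)

# A 1B recurrent model could run 4 refinement passes within the
# single-pass runtime of a 4B baseline, at a quarter of the stored bytes.
assert affordable_passes(1e9, 4e9) == 4
assert runtime_flops(1e9, 1000, passes=4) == runtime_flops(4e9, 1000)
```

The open question, per the paper, is on which benchmark slices those four refinement passes actually recover (or exceed) the larger model's quality.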
Related
- Inference-time compute
- Constraints and scoring
- Plan Early
- MoEUT
- Universal Transformers
- Recurrent wide architecture