Parameter Golf leaves room for evaluation-time ingenuity. That means the right question is not only “what should we store?” but also “what useful computation can a compact model perform after it has been loaded?”

Core question

Can a smaller artifact, allowed a bounded budget of extra refinement, planning, or reranking at evaluation time, beat a larger static model that stores more knowledge directly?

Why this lane matters

Under a hard artifact cap, evaluation-time compute is one of the few remaining levers once compression has gone far enough. It is especially natural for compact recurrent models, where the same core block can be reused for both representation and refinement.

Central papers

  • Grangier et al. (2024), Need a Small Specialized Language Model? Plan Early!
  • Wu et al. (2024), Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models

Main patterns to watch

1. Iterative refinement

Run a compact model for multiple passes instead of storing a larger one.
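A minimal sketch of the idea, using repeated gradient-descent passes on a toy linear system as a stand-in for a compact model's refinement step (the system `A`, `b`, the function `refine`, and the step size `lr` are illustrative assumptions, not taken from any cited paper):

```python
import numpy as np

def refine(A, b, x, n_steps, lr=0.1):
    # one small stored "block": a gradient step on ||Ax - b||^2,
    # reused for every refinement pass instead of storing a direct solver
    for _ in range(n_steps):
        x = x - lr * A.T @ (A @ x - b)
    return x

rng = np.random.default_rng(0)
A = np.eye(8) + rng.normal(size=(8, 8)) / 8.0   # well-conditioned toy system
b = rng.normal(size=8)
x0 = np.zeros(8)

# same stored parameters, different evaluation-time budgets
few  = np.linalg.norm(A @ refine(A, b, x0, n_steps=5)  - b)
many = np.linalg.norm(A @ refine(A, b, x0, n_steps=50) - b)
# more passes drive the residual lower without storing anything extra
```

The artifact (here, just `A` and the update rule) is fixed; only the number of evaluation-time passes changes.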

2. Recurrent reasoning

Use a shared block that can spend more compute on difficult cases without changing stored bytes.
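A sketch of the shape this takes, assuming a toy contraction map as the shared block; `recurrent_block`, `run_until_stable`, and all sizes are hypothetical names chosen for illustration:

```python
import numpy as np

def recurrent_block(h, x, W):
    # one shared block: the same weights W are reused at every step
    return np.tanh(W @ h + x)

def run_until_stable(x, W, tol=1e-8, max_steps=200):
    """Apply the shared block until the hidden state stops changing.
    Inputs farther from the rest state take more passes, but the
    stored parameters (W) never grow."""
    h = np.zeros_like(x)
    for step in range(1, max_steps + 1):
        h_next = recurrent_block(h, x, W)
        if np.linalg.norm(h_next - h) < tol:
            return h_next, step
        h = h_next
    return h, max_steps

rng = np.random.default_rng(1)
W = 0.08 * rng.normal(size=(6, 6))   # small spectral norm → contraction
x_easy = 0.001 * rng.normal(size=6)  # toy "easy" input: near the rest state
x_hard = 100.0 * x_easy              # toy "hard" input: much farther away
_, steps_easy = run_until_stable(x_easy, W)
_, steps_hard = run_until_stable(x_hard, W)
# the hard case consumes more iterations of the same block, at zero byte cost
```

The halting rule here is a simple convergence check; learned halting (ACT-style) is the trainable analogue, but the byte accounting is the same.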

3. Planning or reranking

Use a small model to generate candidates, then spend extra compute choosing among them.
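A sketch of generate-then-rerank, with a random proposer and a variance-based scorer standing in for the small model and the verifier (both `propose` and `score` are invented for illustration):

```python
import random

def propose(rng):
    # cheap candidate generator standing in for a small model's sampler
    return [rng.random() for _ in range(4)]

def score(candidate):
    # verifier/reranker: here, prefer candidates whose entries are balanced
    mean = sum(candidate) / len(candidate)
    return -sum((c - mean) ** 2 for c in candidate)

def best_of_n(n, seed=0):
    # spend extra evaluation-time compute on n proposals, keep the best
    rng = random.Random(seed)
    candidates = [propose(rng) for _ in range(n)]
    return max(candidates, key=score)

single   = score(best_of_n(1))
reranked = score(best_of_n(16))
# best-of-16 can never score worse than the single greedy draw
```

Because the first draw is shared (same seed), best-of-N is guaranteed to match or beat the single-sample baseline under the reranker's own metric; the open question is whether that metric tracks true task quality.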

Practical constraints

  • extra inference steps still have to fit within wall-clock limits
  • gains must survive realistic task distributions rather than only toy prompts
  • the method should remain reproducible and architecturally coherent, not a brittle pile of special cases
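The wall-clock constraint can be handled with an anytime loop that refines a best-so-far answer until a deadline expires; this sketch uses Newton steps on a toy objective as the refinement (the function names and the 50 ms budget are illustrative assumptions):

```python
import time

def refine_under_budget(x0, step, loss, budget_s):
    """Anytime refinement: keep improving until the wall-clock budget
    runs out, always holding a valid best-so-far answer."""
    deadline = time.monotonic() + budget_s
    best, best_loss = x0, loss(x0)
    x = x0
    while time.monotonic() < deadline:
        x = step(x)
        current = loss(x)
        if current < best_loss:
            best, best_loss = x, current
    return best

# toy task: refine an estimate of sqrt(2) with Newton steps
step = lambda x: 0.5 * (x + 2.0 / x)
loss = lambda x: abs(x * x - 2.0)
answer = refine_under_budget(1.0, step, loss, budget_s=0.05)
```

The deadline, not an iteration count, terminates the loop, so the same code degrades gracefully on slower hardware instead of blowing the latency budget.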

Most relevant questions

  • when is extra test-time compute better than extra stored depth?
  • which compact architectures can make best use of repeated refinement?
  • how much of the benefit comes from true reasoning versus simple reranking?
  • can evaluation-time compute compensate for smaller vocabularies or more aggressive compression?

References

Grangier, D., Katharopoulos, A., Ablin, P., & Hannun, A. (2024). Need a Small Specialized Language Model? Plan Early! arXiv preprint arXiv:2402.01093. https://arxiv.org/abs/2402.01093
Wu, Y., Sun, Z., Li, S., Welleck, S., & Yang, Y. (2024). Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models. arXiv preprint arXiv:2408.00724. https://arxiv.org/abs/2408.00724