The usual reading of Inference Scaling Laws is that smaller models can sometimes beat larger ones under fixed inference compute if they use better search. That is already interesting.
But for Parameter Golf there is a sharper interpretation:
extra evaluation-time compute may be able to act like a partial decompressor for a strongly storage-limited model.
That makes Inference Scaling Laws, Plan Early, MoEUT, and Relaxed Recursive Transformers part of the same frontier rather than separate conversations.
Why this seam matters now
Once a hard artifact cap is binding, many standard levers are exhausted. We cannot simply store more unique weights. That leaves a short list of remaining options:
- reuse stored weights more intelligently
- spend more compute at evaluation
- use iterative refinement instead of storing all capacity explicitly
The existing graph already points toward this through evaluation-time compute and recurrent wide architecture, but the recent papers justify a more concrete frontier thesis.
The central synthesis
Inference Scaling Laws
This paper says compute-optimal inference is not the same as “largest model you can afford.” Smarter inference can beat a larger model under fixed compute. (Wu et al., 2024)
Plan Early
This paper says low inference budgets should be designed for from the start, with asymmetric train/inference choices and budget-aware specialization. (Grangier et al., 2024)
MoEUT and Relaxed Recursive Transformers
These papers say repeated/shared computation can remain competitive when the model keeps just enough extra flexibility. (Bae et al., 2024; Csordás et al., 2024)
Put together, they imply a deeper possibility:
a compact recurrent model may use bounded refinement passes to recover some of the function that extra stored depth or width would otherwise provide.
That is why the title says “decompression.” The model is not literally decompressing bytes. It is behaviorally reconstructing capacity through repeated computation.
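A minimal sketch of that operating principle, with illustrative names and shapes: one shared block, stored once, is applied repeatedly, so depth of compute grows while stored bytes do not. The residual update form is an assumption chosen to keep repeated passes stable, not a claim about any specific paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Stored once: a single block's weights (the "compact prior").
W = rng.standard_normal((d, d)) / np.sqrt(d)

def shared_block(h):
    """One refinement pass: revise the hidden state with the shared weights."""
    return h + np.tanh(h @ W)  # residual update keeps repeated passes stable

def run(h, passes):
    """Spend compute (repeated passes) instead of bytes (unique layers)."""
    for _ in range(passes):
        h = shared_block(h)
    return h

h0 = rng.standard_normal(d)
shallow = run(h0, passes=1)  # baseline: a single pass
refined = run(h0, passes=4)  # same stored bytes, 4x the compute

# The stored parameter count is identical in both cases.
print(W.size)
```

The point of the sketch is only the accounting: `refined` uses four times the computation of `shallow`, but the artifact on disk is the same `W` either way.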
A falsifiable thesis
Thesis: for storage-limited compact models, a small number of targeted refinement passes on uncertain predictions can beat spending the same byte budget on extra unique parameters.
The word targeted matters. Blindly running more passes everywhere may just waste time. The promising version is selective:
- more compute on hard tokens
- more compute where entropy or disagreement is high
- more compute where a recurrent block can revise its own coarse first pass
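The selective version above can be sketched with a simple entropy gate. Everything here is a hypothetical illustration: `probs` stands in for first-pass token distributions, and the threshold and extra-pass budget are arbitrary placeholders.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy of each row of a (positions, vocab) array."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def passes_per_position(probs, threshold=1.0, extra=3):
    """Easy positions get 1 pass; uncertain ones get a bounded extra budget."""
    h = entropy(probs)
    return np.where(h > threshold, 1 + extra, 1)

# A confident and an uncertain first-pass prediction over a 4-token vocab.
probs = np.array([
    [0.97, 0.01, 0.01, 0.01],  # low entropy  -> no extra compute
    [0.25, 0.25, 0.25, 0.25],  # high entropy -> refine
])
budget = passes_per_position(probs)
print(budget)  # [1 4]
```

The same gate could use disagreement between passes instead of entropy; the structural point is that the refinement budget is bounded and concentrated on hard positions rather than spread uniformly.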
What would support it
- a shared or recurrent model gains more from bounded refinement than a non-shared model of similar final size
- selective refinement on hard positions beats uniform extra passes
- the best refinement budget is small and saturates quickly, which would make it practically usable
What would falsify it
- extra passes mostly amplify existing errors instead of correcting them
- gains come only from generic reranking that is too slow or brittle to matter
- the same compute would be better spent on a slightly larger static model whenever bytes allow it
The strongest new idea hiding here
A powerful way to think about this seam is behavioral decompression:
- stored weights define a compact prior
- refinement passes reconstruct context-specific detail on demand
- the model spends computation instead of bytes on the cases that need it most
That is especially natural for recursive sharing because the same block is already designed to be reused. In that setting, test-time refinement is not an awkward add-on. It is an extension of the model’s basic operating principle.
This also connects to Compression interfaces for shared depth. If the recurrent/shared block has stable normalization and light specialization, it is more plausible that extra passes will refine rather than drift.
Why this frontier is risky but real
This seam has high upside and high nonsense risk.
It is easy to imagine elaborate inference-time schemes that look clever but collapse under the actual constraints of the challenge. The useful frontier is narrower:
- few passes, not many
- deterministic or reproducible behavior, not exotic search towers
- genuine error correction, not benchmark gaming
The reason it still deserves attention is that the storage cap makes some form of compute-for-bytes substitution almost unavoidable if the simpler compression lanes saturate.
Experiments this frontier suggests
- compare extra unique depth against extra recurrent refinement at equal final bytes
- gate refinement by token entropy or disagreement and compare against uniform multi-pass inference
- test whether shared-depth models with normalization/interfacing tricks gain more from extra passes than plain baselines
- measure where refinement helps: early tokens, rare tokens, long-range dependencies, or locally ambiguous continuations
- check whether refinement still helps after aggressive compression or whether it only rescues cleaner models
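The first experiment's equal-bytes condition is just parameter accounting, sketched below under simplifying assumptions: only the per-layer dense matrix is counted (no embeddings or norms), and fp16 storage is assumed.

```python
def layer_params(d):
    """Parameter count of one dense block of width d (illustrative)."""
    return d * d

def unique_bytes(d, layers, bytes_per_param=2):
    """Stored bytes for `layers` distinct layers at fp16."""
    return layers * layer_params(d) * bytes_per_param

def recurrent_bytes(d, bytes_per_param=2):
    """Stored bytes for one shared layer, however many passes it runs."""
    return layer_params(d) * bytes_per_param  # stored once, reused

d = 1024
print(unique_bytes(d, layers=8))  # 8 layers of unique weights
print(recurrent_bytes(d))         # same depth of compute, 1/8 the bytes
```

At equal final bytes, the recurrent model could instead spend its saved budget on width or light per-pass specialization; the experiment is to see which allocation wins.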
A useful failure criterion
If refinement only helps when the base model is already strong and lightly compressed, then it is not really a Parameter Golf frontier. It is just ordinary inference-time search wearing a compact-model costume.
The frontier survives only if bounded refinement remains useful because storage is tight, not despite it.
Bottom line
The question is not just whether inference-time compute helps.
It is whether a compact model can use a little extra compute to reconstruct missing capacity on demand, making stored bytes and evaluation-time computation partly interchangeable.
If that works, it opens a qualitatively different path than “compress the same static model harder.”
Related
- Evaluation-time compute and inference scaling
- Recursive and shared-parameter architectures
- Recurrent wide architecture
- Compression interfaces for shared depth
- Inference Scaling Laws