The usual reading of Inference Scaling Laws is that smaller models can sometimes beat larger ones under fixed inference compute if they use better search. That is already interesting.

But for Parameter Golf there is a sharper interpretation:

extra evaluation-time compute may be able to act like a partial decompressor for a strongly storage-limited model.

That makes Inference Scaling Laws, Plan Early, MoEUT, and Relaxed Recursive Transformers part of the same frontier rather than separate conversations.

Why this seam matters now

Once a hard artifact cap is binding, many standard levers are exhausted. We cannot simply store more unique weights. That leaves a short list of remaining options:

  • reuse stored weights more intelligently
  • spend more compute at evaluation
  • use iterative refinement instead of storing all capacity explicitly

The existing graph already points toward this through evaluation-time compute and recurrent wide architectures, but the recent papers justify a more concrete frontier thesis.

The central synthesis

Inference Scaling Laws

This paper says compute-optimal inference is not the same as “largest model you can afford.” Smarter inference can beat a larger model under fixed compute. (Wu et al., 2024)

Plan Early

This paper says low inference budgets should be designed for from the start, with asymmetric train/inference choices and budget-aware specialization. (Grangier et al., 2024)

MoEUT and Relaxed Recursive Transformers

These papers say repeated/shared computation can remain competitive when the model keeps just enough extra flexibility. (Bae et al., 2024; Csordás et al., 2024)

Put together, they imply a deeper possibility:

a compact recurrent model may use bounded refinement passes to recover some of the function that extra stored depth or width would otherwise provide.

That is why the title says “decompression.” The model is not literally decompressing bytes. It is behaviorally reconstructing capacity through repeated computation.

A falsifiable thesis

Thesis: for storage-limited compact models, a small number of targeted refinement passes on uncertain predictions can beat spending the same byte budget on extra unique parameters.

The word targeted matters. Blindly running more passes everywhere may just waste time. The promising version is selective, as in the sketch after this list:

  • more compute on hard tokens
  • more compute where entropy or disagreement is high
  • more compute where a recurrent block can revise its own coarse first pass
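A minimal sketch of that kind of gating, assuming a model whose shared block can be re-applied to its own hidden states. The names `model.shared_block`, `model.lm_head`, and the entropy threshold are placeholders for illustration, not anything prescribed by the cited papers.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def selective_refine(model, hidden, max_extra_passes=2, entropy_threshold=2.5):
    """Re-run the shared block only where predictive entropy is high.

    hidden: [batch, seq, d_model] hidden states from the base forward pass.
    Returns logits of shape [batch, seq, vocab] after at most
    max_extra_passes bounded refinement steps.
    """
    logits = model.lm_head(hidden)                                 # first-pass predictions
    for _ in range(max_extra_passes):
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)   # [batch, seq]
        hard = entropy > entropy_threshold                         # uncertain positions only
        if not hard.any():
            break                                                  # budget saturates early
        refined = model.shared_block(hidden)                       # reuse the stored weights
        hidden = torch.where(hard.unsqueeze(-1), refined, hidden)  # revise hard positions only
        logits = model.lm_head(hidden)
    return logits
```

The early-exit check is what keeps the budget bounded: once no position clears the threshold, refinement stops instead of spending passes everywhere.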

What would support it

  • a shared or recurrent model gains more from bounded refinement than a non-shared model of similar final size
  • selective refinement on hard positions beats uniform extra passes
  • the best refinement budget is small and saturates quickly, which would make it practically usable

What would falsify it

  • extra passes mostly amplify existing errors instead of correcting them
  • gains come only from generic reranking that is too slow or brittle to matter
  • the same compute would be better spent on a slightly larger static model whenever bytes allow it

The strongest new idea hiding here

A powerful way to think about this seam is behavioral decompression:

  • stored weights define a compact prior
  • refinement passes reconstruct context-specific detail on demand
  • the model spends computation instead of bytes on the cases that need it most

That is especially natural for recursive sharing because the same block is already designed to be reused. In that setting, test-time refinement is not an awkward add-on. It is an extension of the model’s basic operating principle.

This also connects to Compression interfaces for shared depth. If the recurrent/shared block has stable normalization and light specialization, it is more plausible that extra passes will refine rather than drift.
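To make that interface point concrete, here is a minimal sketch of a shared block with a stable normalization boundary and light per-pass specialization, in the spirit of layer-wise low-rank deltas. The sizes, names, and exact residual form are assumptions for illustration, not the recipe from Bae et al. (2024) or Csordás et al. (2024).

```python
import torch
import torch.nn as nn

class SharedRefinementBlock(nn.Module):
    """One stored block, reused for every refinement pass.

    The shared weights act as the compact prior; tiny low-rank deltas give
    each pass just enough specialization that reapplication refines instead
    of drifting. All sizes here are arbitrary.
    """
    def __init__(self, d_model=512, d_ff=2048, max_passes=3, rank=8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)                    # stable interface between passes
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                nn.Linear(d_ff, d_model))    # shared weights, stored once
        # light per-pass specialization: a few kilobytes of extra storage per pass
        self.lora_down = nn.ParameterList(
            [nn.Parameter(torch.randn(d_model, rank) * 0.01) for _ in range(max_passes)])
        self.lora_up = nn.ParameterList(
            [nn.Parameter(torch.zeros(rank, d_model)) for _ in range(max_passes)])
        self.max_passes = max_passes

    def forward(self, x, n_passes=None):
        # more passes = more evaluation-time compute from the same stored bytes
        for i in range(min(n_passes or self.max_passes, self.max_passes)):
            h = self.norm(x)
            delta = (h @ self.lora_down[i]) @ self.lora_up[i]
            x = x + self.ff(h) + delta                       # residual refinement, not replacement
        return x

block = SharedRefinementBlock()
x = torch.randn(2, 16, 512)
coarse = block(x, n_passes=1)    # cheap first pass
refined = block(x, n_passes=3)   # same stored weights, more evaluation-time compute
```

Because the per-pass deltas are low-rank, varying the pass count trades compute for capacity while the unique stored bytes stay essentially fixed.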

Why this frontier is risky but real

This seam has high upside and high nonsense risk.

It is easy to imagine elaborate inference-time schemes that look clever and collapse under actual constraints from the challenge. The useful frontier is narrower:

  • few passes, not many
  • deterministic or reproducible behavior, not exotic search towers
  • genuine error correction, not benchmark gaming

The reason it still deserves attention is that the storage cap makes some form of compute-for-bytes substitution almost unavoidable if the simpler compression lanes saturate.

Experiments this frontier suggests

  1. compare extra unique depth against extra recurrent refinement at equal final bytes (see the byte-matching sketch after this list)
  2. gate refinement by token entropy or disagreement and compare against uniform multi-pass inference
  3. test whether shared-depth models with normalization/interfacing tricks gain more from extra passes than plain baselines
  4. measure where refinement helps: early tokens, rare tokens, long-range dependencies, or locally ambiguous continuations
  5. check whether refinement still helps after aggressive compression or whether it only rescues cleaner models
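A minimal harness for the first experiment, under the assumption that "equal final bytes" means equal unique stored parameter bytes. The two model constructors and the evaluation function are placeholders the experimenter supplies, not real APIs.

```python
import torch

def unique_param_bytes(model: torch.nn.Module) -> int:
    """Bytes of unique stored parameters, counting tied/shared tensors once."""
    seen, total = set(), 0
    for p in model.parameters():
        if id(p) not in seen:
            seen.add(id(p))
            total += p.numel() * p.element_size()
    return total

def compare_at_equal_bytes(make_static_deep, make_recurrent, eval_ppl, tolerance=0.02):
    """Extra unique depth vs. extra recurrent refinement at matched storage.

    make_static_deep / make_recurrent: callables that build the two models.
    eval_ppl: callable scoring a model on held-out data (placeholder).
    """
    static_model, recurrent_model = make_static_deep(), make_recurrent()
    b_static = unique_param_bytes(static_model)
    b_recur = unique_param_bytes(recurrent_model)
    # refuse the comparison unless the byte budgets genuinely match
    if abs(b_static - b_recur) / max(b_static, b_recur) > tolerance:
        raise ValueError(f"byte budgets differ: {b_static} vs {b_recur}")
    return {"static_ppl": eval_ppl(static_model),
            "recurrent_ppl": eval_ppl(recurrent_model),
            "unique_bytes": (b_static, b_recur)}
```

Evaluation-time compute should be reported alongside the two scores, since the recurrent model is spending more of it by design.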

A useful failure criterion

If refinement only helps when the base model is already strong and lightly compressed, then it is not really a Parameter Golf frontier. It is just ordinary inference-time search wearing a compact-model costume.

The frontier survives only if bounded refinement remains useful because storage is tight, not despite it.

Bottom line

The question is not just whether inference-time compute helps.

It is whether a compact model can use a little extra compute to reconstruct missing capacity on demand, making stored bytes and evaluation-time computation partly interchangeable.

If that works, it opens a qualitatively different path than “compress the same static model harder.”

Bae, S., Fisch, A., Harutyunyan, H., Ji, Z., Kim, S., & Schuster, T. (2024). Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA. arXiv Preprint arXiv:2410.20672. https://arxiv.org/abs/2410.20672
Csordás, R., Irie, K., Schmidhuber, J., Potts, C., & Manning, C. D. (2024). MoEUT: Mixture-of-Experts Universal Transformers. arXiv Preprint arXiv:2405.16039. https://arxiv.org/abs/2405.16039
Grangier, D., Katharopoulos, A., Ablin, P., & Hannun, A. (2024). Need a Small Specialized Language Model? Plan Early! arXiv Preprint arXiv:2402.01093. https://arxiv.org/abs/2402.01093
Wu, Y., Sun, Z., Li, S., Welleck, S., & Yang, Y. (2024). Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models. arXiv Preprint arXiv:2408.00724. https://arxiv.org/abs/2408.00724