Hypothesis
A compact recurrent model may outperform a larger static model if it can spend a small, bounded amount of extra evaluation-time compute on refinement, planning, or reranking.
Why this is plausible
Under a hard artifact cap, stored parameters and evaluation-time compute are partly substitutable resources. A model that already reuses shared blocks across depth is especially well positioned to exploit this exchange.
This makes iterative refinement a natural extension of weight sharing across depth rather than a bolt-on mechanism: the same block that defines the model's depth can simply be applied a few more times.
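The parameter/compute exchange can be sketched minimally: one shared weight matrix reused across passes, so evaluation-time depth becomes a knob rather than a fixed architectural choice. All names here (`shared_block`, `run`) are illustrative, not from any particular system.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8
# Shared weights: the only stored parameters, reused on every pass.
W = rng.normal(scale=0.3, size=(D, D))

def shared_block(h):
    # One refinement pass: residual update through the shared weights.
    return h + np.tanh(h @ W)

def run(x, n_passes):
    # Same parameter count regardless of n_passes; only
    # evaluation-time compute changes.
    h = x
    for _ in range(n_passes):
        h = shared_block(h)
    return h

x = rng.normal(size=(D,))
shallow = run(x, 2)  # cheap evaluation
deep = run(x, 8)     # same artifact, more test-time compute
```

The point of the sketch is only that `shallow` and `deep` come from an identical artifact; whether the extra passes buy quality is exactly what the hypothesis asks.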
Candidate forms
- extra recurrent passes on difficult examples
- shallow self-refinement loops before final prediction
- generate-then-rerank behavior inside the same compact model family
- planning-style intermediate computation rather than a single forward path
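Two of the candidate forms above, confidence-gated extra passes and generate-then-rerank, can be sketched with stand-in functions. `predict_with_confidence` is a toy surrogate for a recurrent model (more passes sharpen its logits); it is an assumption for illustration, not a real model.

```python
import numpy as np

def predict_with_confidence(x, n_passes):
    # Stand-in for a recurrent model: more passes -> sharper logits.
    logits = x * n_passes
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return int(p.argmax()), float(p.max())

def adaptive_predict(x, base_passes=2, max_passes=8, threshold=0.9):
    # Extra recurrent passes on difficult examples: keep iterating
    # only while the confidence signal stays below a threshold.
    n = base_passes
    label, conf = predict_with_confidence(x, n)
    while conf < threshold and n < max_passes:
        n += 1
        label, conf = predict_with_confidence(x, n)
    return label, n

def rerank(candidates, score_fn):
    # Generate-then-rerank: keep the best candidate under a scorer.
    return max(candidates, key=score_fn)

# Usage: an "easy" direction converges in few passes; a hypothetical
# scorer (here just len) picks among generated candidates.
label, n_used = adaptive_predict(np.array([2.0, 1.0, 0.0]))
best = rerank(["aa", "b", "ccc"], score_fn=len)
```

The gating keeps the extra compute bounded (`max_passes`), which matches the "small, bounded" framing of the hypothesis.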
What would support it
- a smaller artifact overtaking a larger static baseline once limited extra compute is allowed
- recurrent architectures benefiting more than non-recurrent ones from extra inference steps
- the quality gain per extra inference step staying favorable for at least a short range
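The last criterion can be operationalized as a sweep over inference steps with the marginal gain per step inspected. `model_quality` below is a toy saturating curve standing in for measured quality; the 0.01 floor for "favorable" is an arbitrary illustrative threshold.

```python
def model_quality(n_steps):
    # Toy stand-in: quality improves with diminishing returns.
    return 1.0 - 0.5 * (0.7 ** n_steps)

def marginal_gains(max_steps):
    # Gain from adding one more inference step, for each step count.
    quality = [model_quality(n) for n in range(1, max_steps + 1)]
    return [b - a for a, b in zip(quality, quality[1:])]

gains = marginal_gains(6)
# "Favorable for a short range" read as: each extra step still buys
# a gain above some floor for the first few steps.
favorable_range = sum(g > 0.01 for g in gains)
```

With a real model the same harness would plot quality against steps and wall-clock, directly testing the first risk below.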
Main risks
- wall-clock costs overwhelm the quality gain
- improvements come from simple reranking tricks that do not generalize
- the underlying model is too weak for refinement to rescue it