Sources: arXiv:2505.08823 · alphaXiv overview
Core contribution
The paper’s central result is intentionally simple: inserting an extra RMSNorm before each linear layer can materially improve stability and quality when finetuning into the 1.58-bit regime. Its importance lies not just in the specific metric gain but in the argument that a tiny architectural change can relieve a major low-bit failure mode.
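A minimal sketch of that intervention, assuming the usual RMSNorm definition; this is illustrative NumPy, not the paper’s code, and `gamma`, the toy inputs, and the identity weight matrix are all made up for the demo:

```python
import numpy as np

def rmsnorm(x, gamma, eps=1e-6):
    # Scale each row by its root-mean-square, then apply a learned gain.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * gamma

def linear_with_prenorm(x, W, gamma):
    # The paper's change, as summarized above: one extra RMSNorm
    # inserted immediately before the linear projection.
    return rmsnorm(x, gamma) @ W.T

x = np.array([[0.1, 0.2, 0.3, 0.4],
              [10.0, 20.0, 30.0, 40.0]])  # same direction, 100x scale gap
W = np.eye(4)
gamma = np.ones(4)
y = linear_with_prenorm(x, W, gamma)
# Both rows now reach the projection at the same scale, which is the
# property that makes cheap low-bit treatment of the input plausible.
```

The point of the toy input is that a 100x difference in activation scale disappears before the projection ever sees it.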
Why this matters for Parameter Golf
This is arguably the cleanest paper-level support for RMSNorm stabilized scaling. It is attractive because it is:
- small enough to compose with other ideas
- local enough to test without redesigning the whole system
- mechanistically plausible for a challenge that scores post-roundtrip compressed quality
If activation-scale volatility is part of why low-bit export hurts, then improving the compression interface in front of the projections is an unusually high-leverage fix in this setting.
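To make the volatility claim concrete, here is a hedged toy experiment (the per-tensor int8 quantizer, shapes, and outlier construction are my assumptions, not the paper’s setup): one volatile row forces a coarse shared step size, and normalizing first restores precision for every other row.

```python
import numpy as np

def quantize_roundtrip(x, bits=8):
    # Symmetric per-tensor quantization: one step size for the whole
    # tensor, so a single large row dictates it for everyone.
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def rel_err(a, b):
    return np.linalg.norm(a - b) / np.linalg.norm(a)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 64))
x[0] *= 100.0  # one volatile row blows up the shared scale

# Roundtrip error on the well-behaved rows, with and without pre-norm.
raw_err = rel_err(x[1:], quantize_roundtrip(x)[1:])
xn = rmsnorm(x)
norm_err = rel_err(xn[1:], quantize_roundtrip(xn)[1:])
# norm_err comes out far smaller: the interface was fixed, not the quantizer.
```

This is the "compression interface" framing in miniature: the quantizer is unchanged in both runs; only the distribution it sees differs.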
What to import
- Input normalization can matter more than quantizer cleverness. Better-behaved activations may let cheap quantization do its job.
- Low-bit stability is often architectural, not merely optimizer-level.
- Simple interventions compose well. This kind of change can support recursive sharing, selective precision, or aggressive export.
What not to over-import
The paper does not prove that every extra normalization layer is good, nor that any local improvement will persist under stronger workloads. It also leaves open whether gains come from a generally better compression interface or from a narrower optimization benefit under the studied setup.
Best synthesis links
- Directly anchors normalization before projections.
- Complements QuEST, which attacks similar low-bit instability from the training-dynamics side.
- Serves as a stabilizing partner for recursive width scaling, where repeated reuse of the same block may amplify scale problems.
Parameter Golf translation
This paper argues for prioritizing experiments that change the distribution seen by fragile projections before spending time on more elaborate export machinery. In practice, that means trying normalization-side fixes before assuming the answer must be smarter codebooks, more protected residuals, or more complex training heuristics.
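The export baseline being de-risked here is itself tiny to write down. A hedged sketch of a BitNet b1.58-style absmean ternary roundtrip, where the exact scaling rule is my assumption rather than the paper’s export path:

```python
import numpy as np

def ternary_roundtrip(W):
    # Absmean ternary quantizer in the BitNet b1.58 style: map weights
    # to {-1, 0, +1} times a single scale. 1.58 bits = log2(3).
    gamma = np.abs(W).mean() + 1e-8
    Wq = np.clip(np.round(W / gamma), -1, 1)
    return Wq * gamma, gamma

rng = np.random.default_rng(1)
W = rng.normal(scale=0.02, size=(16, 16))
Wt, gamma = ternary_roundtrip(W)
codes = Wt / gamma  # every entry is -1, 0, or +1
```

If post-roundtrip quality with this baseline improves once the incoming activations are normalized, that supports the normalization-first ordering argued above, before reaching for smarter codebooks.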
Related
- RMSNorm stabilized scaling
- Normalization before projections
- QuEST
- BitNet b1.58
- Quantization and outliers
- Recursive width scaling