Hypothesis
Adding an extra RMSNorm before quantization-sensitive projections should improve the quality of the final compressed model, because it reduces activation-scale volatility entering the most fragile low-bit operations (Steinmetz et al., 2025).
Why this is plausible
- Extra RMSNorm shows that a simple normalization change can make ternary / 1.58-bit training much more stable.
- In a roundtrip-compressed setting, any change that reduces quantization damage has outsized leverage.
- This hypothesis composes naturally with shared-parameter models, where repeated reuse of the same block may make stable feature scales even more valuable.
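The proposed change can be sketched concretely. A minimal numpy sketch follows; `rmsnorm`, `fake_quantize`, and `quantized_projection` are illustrative helpers invented here (not code from Steinmetz et al.), and the symmetric per-tensor absmax scheme is one assumed quantizer among many:

```python
import numpy as np

def rmsnorm(x, gamma=1.0, eps=1e-6):
    # Scale each row to (approximately) unit RMS, then apply a learned gain.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

def fake_quantize(x, bits=4):
    # Symmetric per-tensor absmax quantize-dequantize ("fake quant").
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    return np.round(x / scale).clip(-qmax, qmax) * scale

def quantized_projection(x, W, gamma=1.0, bits=4):
    # The hypothesized change: normalize immediately before the
    # quantization-sensitive projection, so rows share a common scale
    # rather than letting a few large activations dominate the absmax range.
    h = rmsnorm(x, gamma)
    return fake_quantize(h, bits=bits) @ W
```

The intuition the sketch makes visible: without the norm, one large-magnitude row sets `scale` for the whole tensor and small rows collapse toward zero at low bit widths; after `rmsnorm`, every row occupies a similar fraction of the quantization range.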
What would count as support
- Consistent gains in post-roundtrip quality.
- Reduced sensitivity to learning-rate or clipping choices.
- Smaller pre-to-post quantization degradation.
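The third criterion can be operationalized directly. A hedged sketch, assuming a `quant_degradation` helper defined here for illustration (any quantize-dequantize function can be plugged in):

```python
import numpy as np

def quant_degradation(x, W, quantize):
    # Relative error between the full-precision projection and the same
    # projection computed from quantized activations: the pre-to-post gap
    # the hypothesis predicts should shrink with the extra RMSNorm.
    full = x @ W
    post = quantize(x) @ W
    return np.linalg.norm(full - post) / np.linalg.norm(full)
```

Running this with and without a preceding RMSNorm, across bit widths, gives the degradation curves the hypothesis predicts should separate.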
What would weaken it
- Gains vanish as the model or compression setup becomes more demanding.
- Normalization improves training loss but not the quality of the compressed artifact.
- Extra normalization helps only by changing optimization dynamics, not by making low-bit operation genuinely more robust.
Related
- Extra RMSNorm
- Quantization and outliers
- Normalization before projections
- Recurrent wide architecture
- Hypothesis ledger
Steinmetz, C., Childress, G., Herbst, A., Jones, G., Singh, J., Vang, E., & Weinstock, K. (2025). An Extra RMSNorm is All You Need for Fine Tuning to 1.58 Bits. arXiv preprint arXiv:2505.08823. https://arxiv.org/abs/2505.08823