Hypothesis
Adding an extra RMSNorm before quantization-sensitive projections should improve the quality of the final compressed model, because it reduces activation-scale volatility entering the most fragile low-bit operations (Steinmetz et al., 2025).
Why this is plausible
- Extra RMSNorm shows that a simple normalization change can make ternary / 1.58-bit training much more stable.
- In a roundtrip-compressed setting, any change that reduces quantization damage has outsized leverage.
- This hypothesis composes naturally with shared-parameter models, where repeated reuse of the same block may make stable feature scales even more valuable.
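The proposed change can be sketched concretely. A minimal numpy sketch follows; `rmsnorm`, `fake_quantize`, and `quantized_projection` are illustrative helpers invented here (not code from Steinmetz et al.), and the symmetric per-tensor absmax scheme is one assumed quantizer among many:

```python
import numpy as np

def rmsnorm(x, gamma=1.0, eps=1e-6):
    # Scale each row to (approximately) unit RMS, then apply a learned gain.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

def fake_quantize(x, bits=4):
    # Symmetric per-tensor absmax quantize-dequantize ("fake quant").
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    return np.round(x / scale).clip(-qmax, qmax) * scale

def quantized_projection(x, W, gamma=1.0, bits=4):
    # The hypothesized change: normalize immediately before the
    # quantization-sensitive projection, so rows share a common scale
    # rather than letting a few large activations dominate the absmax range.
    h = rmsnorm(x, gamma)
    return fake_quantize(h, bits=bits) @ W
```

The intuition the sketch makes visible: without the norm, one large-magnitude row sets `scale` for the whole tensor and small rows collapse toward zero at low bit widths; after `rmsnorm`, every row occupies a similar fraction of the quantization range.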
What would count as support
- Consistent gains in post-roundtrip quality.
- Reduced sensitivity to learning-rate or clipping choices.
- Smaller pre-to-post quantization degradation.
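The third criterion can be operationalized directly. A hedged sketch, assuming a `quant_degradation` helper defined here for illustration (any quantize-dequantize function can be plugged in):

```python
import numpy as np

def quant_degradation(x, W, quantize):
    # Relative error between the full-precision projection and the same
    # projection computed from quantized activations: the pre-to-post gap
    # the hypothesis predicts should shrink with the extra RMSNorm.
    full = x @ W
    post = quantize(x) @ W
    return np.linalg.norm(full - post) / np.linalg.norm(full)
```

Running this with and without a preceding RMSNorm, across bit widths, gives the degradation curves the hypothesis predicts should separate.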
What would weaken it
- Gains vanish as the model or compression setup becomes more demanding.
- Normalization improves training loss but not the quality of the compressed artifact.
- Extra normalization helps only by changing optimization dynamics, not by making low-bit operation genuinely more robust.
Related
- Extra RMSNorm
- Quantization and outliers
- Normalization before projections
- Recurrent wide architecture
- Hypothesis ledger
Steinmetz, C., Childress, G., Herbst, A., Jones, G., Singh, J., Vang, E., & Weinstock, K. (2025). An Extra RMSNorm is All You Need for Fine Tuning to 1.58 Bits. arXiv preprint arXiv:2505.08823. https://arxiv.org/abs/2505.08823