The pattern
One recurring theme in compact-model papers is that the projection itself is not always the fragile part; the fragility often starts earlier, with the scale of the activations reaching it.
When activations swing too widely, low-bit weights or coarse quantizers have to absorb scale variation they were never good at representing. Pre-projection normalization reduces that burden.
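As a concrete sketch (pure Python, hypothetical helper name), RMSNorm makes the input to a projection invariant to the overall scale of the incoming activation, so the downstream weights see the same vector whether upstream activity was small or large:

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    """Scale x to unit root-mean-square before it reaches a projection.

    Minimal sketch; real implementations usually apply a learned
    per-channel gain after the rescaling.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    y = [v / rms for v in x]
    if gain is not None:
        y = [g * v for g, v in zip(gain, y)]
    return y

# Two activations pointing the same way but at wildly different scales...
a = [0.1, -0.2, 0.05, 0.3]
b = [40.0, -80.0, 20.0, 120.0]  # same direction, 400x larger

# ...map to (essentially) the same normalized input for the projection.
na, nb = rms_norm(a), rms_norm(b)
print(max(abs(u - v) for u, v in zip(na, nb)))  # tiny, identical up to eps
```

This is the sense in which the normalization sits in front of the quantized weights as a shield: the weights never have to represent the scale swing at all.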
Why this matters in Parameter Golf
The challenge rewards the behavior of the final compressed artifact, not the prettiest full-precision checkpoint. That means any change that narrows the gap between train-time representations and post-compression behavior is unusually valuable.
Evidence trail
- Extra RMSNorm argues that a single additional RMSNorm inserted before linear layers is enough to stabilize 1.58-bit fine-tuning. (Steinmetz et al., 2025)
- QuEST reaches a similar conclusion from a different angle: distribution fitting and stable low-bit dynamics matter as much as raw model size. (Panferov et al., 2025)
- BitNet b1.58 also sits in the same family of ideas, where RMSNorm and constrained projection-friendly representations become central rather than incidental. (Wang et al., 2024)
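To make the 1.58-bit setting concrete, here is a sketch of absmean ternarization in the spirit of BitNet b1.58: weights are divided by their mean absolute value, rounded, and clipped to {-1, 0, +1}. This is an illustrative reimplementation with a hypothetical function name, not the paper's kernel:

```python
def ternarize_absmean(w, eps=1e-6):
    """Map weights to {-1, 0, +1} with an absmean scale (BitNet b1.58 style).

    Sketch of the idea: scale = mean(|w|); q = clip(round(w / scale), -1, 1).
    """
    scale = sum(abs(v) for v in w) / len(w) + eps
    return [max(-1, min(1, round(v / scale))) for v in w], scale

w = [0.9, -1.1, 0.05, -0.02, 1.0, -0.95]
q, s = ternarize_absmean(w)
print(q)  # [1, -1, 0, 0, 1, -1]: large weights saturate, small ones vanish
```

With only three weight levels available, any residual scale variation has to be absorbed by the activations, which is exactly why normalizing them before the projection matters in this family of models.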
Working takeaway
Pre-projection normalization is best viewed as a compression interface:
- it reduces activation outliers
- it makes small or ternary weights easier to optimize
- it composes well with shared-depth models and outlier-aware compression
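One facet of the first bullet can be shown directly (a hypothetical pure-Python sketch): per-token RMS normalization collapses the cross-token scale variation that a single shared quantization grid would otherwise have to cover:

```python
import math, random

random.seed(0)

def token_rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

def rms_norm(x, eps=1e-6):
    r = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / r for v in x]

def make_token(dim):
    # Per-token scale spread over ~3 orders of magnitude: the situation
    # a shared low-bit quantization grid handles worst.
    scale = 10 ** random.uniform(-1, 2)
    return [random.gauss(0, scale) for _ in range(dim)]

tokens = [make_token(64) for _ in range(32)]

before = [token_rms(t) for t in tokens]
after = [token_rms(rms_norm(t)) for t in tokens]

print(max(before) / min(before))  # large dynamic range across tokens
print(max(after) / min(after))    # ~1.0: one quant grid now fits every token
```

The point is not the toy numbers but the shape of the effect: after normalization, a single set of quantizer levels is adequate for every token.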
What to watch for
This pattern can still fail if:
- it adds compute or architectural complexity without helping compressed quality
- it merely masks an optimizer problem that stronger training runs (better hyperparameters or longer schedules) would fix anyway
- it improves pre-quant metrics but not final artifact metrics
Related
- RMSNorm stabilized scaling
- Recurrent wide architecture
- Outlier-aware compression
- Quantization and outliers