(Wang et al., 2024)

Sources: arXiv:2402.17764 · alphaXiv overview

Core contribution

BitNet b1.58 makes the strongest high-level claim on this shelf: language models need not merely survive extreme quantization after training; they can be trained so that ternary weights are the intended operating regime from the beginning. The paper combines constrained weights, RMSNorm-heavy design choices, and training recipes that let {-1, 0, 1} weights remain competitive at useful scale.
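The core weight constraint can be sketched concisely. Below is a minimal NumPy version of absmean ternarization in the spirit of the paper: scale by the mean absolute value, round, and clip to {-1, 0, 1}. Details such as per-tensor vs. per-group scaling and the paired activation quantization are simplified away here.

```python
import numpy as np

def ternarize_absmean(w: np.ndarray, eps: float = 1e-8):
    """Snap a weight tensor to {-1, 0, 1} with an absmean scale.

    gamma is the mean absolute value of the tensor; weights are divided
    by gamma, rounded, and clipped to the ternary set. gamma is kept as
    a floating-point per-tensor scale for dequantization.
    """
    gamma = np.abs(w).mean() + eps
    w_q = np.clip(np.round(w / gamma), -1, 1)
    return w_q.astype(np.int8), float(gamma)

w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
w_q, gamma = ternarize_absmean(w)
# Every quantized entry is drawn from the ternary set.
assert set(np.unique(w_q).tolist()) <= {-1, 0, 1}
```

In actual BitNet training this quantizer sits inside the forward pass with a straight-through-style gradient, so the full-precision latent weights keep learning while the model only ever "sees" ternary values.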

Why this matters for Parameter Golf

Parameter Golf strongly rewards any design that treats byte pressure as a first-class architectural constraint. BitNet is one of the clearest demonstrations that the low-bit regime has its own recipes, failure modes, and scaling laws. That makes it a valuable baseline for deciding when post-hoc export fundamentally leaves too much on the table.

What to import

  • Native low-bit design beats low-bit afterthoughts. Some constraints should shape architecture and optimization from the start.
  • Normalization is part of the core recipe. BitNet makes the same broad point as Extra RMSNorm: scale control before projections is not cosmetic.
  • Scale changes the answer. The viability of very low-bit training is not fixed; it shifts with width, depth, and training setup.
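The normalization point above is concrete enough to sketch. A minimal RMSNorm follows, illustrating the "scale control before projections" idea: the activations entering a quantized projection are rescaled to unit RMS, so ternary weights see inputs of bounded scale. Exact placement and gain handling vary by implementation.

```python
import numpy as np

def rmsnorm(x: np.ndarray, g: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm: rescale activations to unit RMS, then apply a learned gain g.

    Placed before a (quantized) linear projection, this bounds the input
    scale regardless of how activations drift during training.
    """
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * g

x = np.random.default_rng(1).normal(size=(2, 8)).astype(np.float32) * 10.0
y = rmsnorm(x, np.ones(8, dtype=np.float32))
# After normalization, each row has RMS approximately 1.
```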

What not to over-import

The paper is not proof that a tiny local research loop can simply jump to 1.58-bit training and win. Much of the BitNet story is about the co-design of training recipe and architecture at substantial scale. For this garden, the main value is conceptual: it resets the default assumption about what counts as a realistic target.

Connections to neighboring notes

  • Extends normalization before projections by showing that RMS-friendly signal flow is central to ultra-low-bit viability.
  • Provides a stronger “native regime” framing than QuEST, which focuses more directly on stabilizing very low-bit dynamics.
  • Serves as an outer bound for RMSNorm stabilized scaling: if normalization is essential even in native ternary training, it may be even more leveraged in compressed export settings.

Parameter Golf translation

BitNet suggests three research postures:

  1. treat aggressive quantization as an architecture problem, not only an export problem
  2. expect normalization and projection design to matter disproportionately
  3. evaluate whether a model family is inherently friendly to harsh weight constraints before piling on compression tricks
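Posture 3 admits a cheap, purely diagnostic probe. The sketch below is a hypothetical check (not from the paper): measure the relative L2 error when a weight tensor is snapped to gamma * {-1, 0, 1}. Tensors that fit the ternary grid poorly, e.g. because of heavy-tailed weight distributions, are weaker candidates for harsh weight constraints.

```python
import numpy as np

def ternary_fit_error(w: np.ndarray, eps: float = 1e-8) -> float:
    """Relative L2 error after snapping w to gamma * {-1, 0, 1}.

    Hypothetical diagnostic: lower means the tensor is more amenable
    to ternary constraints under a simple absmean scale.
    """
    gamma = np.abs(w).mean() + eps
    w_hat = gamma * np.clip(np.round(w / gamma), -1, 1)
    return float(np.linalg.norm(w - w_hat) / (np.linalg.norm(w) + eps))

rng = np.random.default_rng(2)
err_gauss = ternary_fit_error(rng.normal(size=(256, 256)))          # well-behaved weights
err_heavy = ternary_fit_error(rng.standard_cauchy(size=(256, 256)))  # heavy-tailed weights
```

Comparing such scores across layers of an existing checkpoint gives a rough map of where ternary constraints would bite hardest, before committing to any compression recipe.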

Wang, H., Ma, S., Ma, L., Wang, L., Wang, W., Huang, S., Dong, L., Wang, R., Xue, J., & Wei, F. (2024). The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits. arXiv preprint arXiv:2402.17764. https://arxiv.org/abs/2402.17764