31 items with this tag.
papers
Paper note on integrating rate-constrained compression pressure directly into LLM training rather than treating compression only as a post-training step.
Paper note on pushing beyond 1-bit LLM compression by factorizing weight matrices into binary latent factors with learned multi-scale compensation.
Paper note on using a single learned neural codec to compress whole LLM-scale weight sets instead of relying only on handcrafted quantization formats.
Paper note on making LLM training explicitly produce more low-rank, compressible weights by constraining Muon updates with a nuclear-norm budget.
Paper note on exploiting weight-space symmetries with bits-back coding so some model bytes can be saved without changing predictions.
Paper note on using reinforcement learning during training to decide which transformer layers should share weights and which should remain independent.
Paper note on shrinking and retargeting the tokenizer and embedding table to a domain so the model uses fewer vocabulary bytes and shorter sequences.
Paper note on preserving a tiny set of outlier-sensitive weight columns in high precision while quantizing the rest of the model aggressively.
Paper note on applying rate-distortion theory directly to language-model compression instead of treating bit allocation as a heuristic afterthought.
Paper note on compressing language-model matrices into residual low-rank structure plus a shared neural decoder over vector-quantized latent representations.
Paper note on AQLM and why codebook-style additive quantization becomes attractive once scalar low-bit methods start wasting error budget on the wrong directions.
Paper note on cross-layer parameter sharing and factorized embeddings as two clean ways to reduce stored parameters without simply shrinking hidden capacity.
Paper note on activation-aware weight quantization and the claim that a tiny set of salient channels dominates low-bit error.
Paper note on training ternary 1.58-bit language models from scratch and why ultra-low-bit modeling should be treated as a native design regime.
Paper note on clustering-based compression as a way to exploit weight structure and outlier concentration when uniform quantization gets brittle.
Paper note on the claim that an extra RMSNorm before linear projections is a disproportionately strong stabilizer for extreme low-bit finetuning.
Paper note on learning structured parameter sharing with tensor decompositions and sparsity instead of treating sharing as all-or-nothing layer tying.
Paper note on compute-optimal inference and why smaller models plus better evaluation-time search can beat larger models under fixed budgets.
Paper note on hardware-aware outlier-preserving quantization and why selective protection must still respect deployment efficiency.
Paper note on making Universal Transformers competitive through parameter sharing plus sparse expert capacity.
Paper note on designing for low inference budgets from the beginning rather than shrinking a generic large-model plan at the end.
Paper note on decoupled low-bit training with a tiny high-precision branch for the parameters that matter most.
Paper note on pushing post-training quantization below 2 bits by preserving salient structure with unusually low overhead.
Paper note on using rotations to remove hidden-state outliers so that weights, activations, and KV cache can all be quantized more uniformly.
Paper note on stabilizing 1-bit weight-and-activation training through better low-bit distribution fitting and more reliable gradient estimates.
Paper note on converting pretrained transformers into recursive/shared-parameter models with lightweight depth-specific relaxation.
Paper note on replacing a pretrained model's tokenizer while retraining only the embeddings and the LM head.
Paper note on why compact-model training has a different systems bottleneck profile than many big-model intuitions suggest.
Paper note on tokenizer evaluation across scales and why compression alone is not enough to rank tokenizers.
Paper note on recurrent self-attentive depth, dynamic halting, and the idea that transformers can trade stored depth for repeated computation.
Paper note on reducing output-layer memory and logits cost by restructuring vocabulary prediction.