31 items with this tag.
papers
Paper note on integrating rate-constrained compression pressure directly into LLM training rather than treating compression only as a post-training step.
Paper note on pushing beyond 1-bit LLM compression by factorizing weight matrices into binary latent factors with learned multi-scale compensation.
Paper note on using a single learned neural codec to compress whole LLM-scale weight sets instead of relying only on handcrafted quantization formats.
Paper note on making LLM training explicitly produce more low-rank, compressible weights by constraining Muon updates with a nuclear-norm budget.
Paper note on exploiting weight-space symmetries with bits-back coding so some model bytes can be saved without changing predictions.
Paper note on using reinforcement learning during training to decide which transformer layers should share weights and which should remain independent.
Paper note on shrinking and retargeting the tokenizer and embedding table to a domain so the model uses fewer vocabulary bytes and shorter sequences.
Paper note on preserving a tiny set of outlier-sensitive weight columns in high precision while quantizing the rest of the model aggressively.
Paper note on applying rate-distortion theory directly to language-model compression instead of treating bit allocation as a heuristic afterthought.
Paper note on compressing language-model matrices into residual low-rank structure plus a shared neural decoder over vector-quantized latent representations.
Paper note on AQLM and why codebook-style additive quantization becomes attractive once scalar low-bit methods start wasting error budget on the wrong directions.
Paper note on cross-layer parameter sharing and factorized embeddings as two clean ways to reduce stored parameters without simply shrinking hidden capacity.
Paper note on activation-aware weight quantization and the claim that a tiny set of salient channels dominates low-bit error.
Paper note on training ternary 1.58-bit language models from scratch and why ultra-low-bit modeling should be treated as a native design regime.
Paper note on clustering-based compression as a way to exploit weight structure and outlier concentration when uniform quantization gets brittle.
Paper note on the claim that an extra RMSNorm before linear projections is a disproportionately strong stabilizer for extreme low-bit finetuning.
Paper note on learning structured parameter sharing with tensor decompositions and sparsity instead of treating sharing as all-or-nothing layer tying.
Paper note on compute-optimal inference and why smaller models plus better evaluation-time search can beat larger models under fixed budgets.
Paper note on hardware-aware outlier-preserving quantization and why selective protection must still respect deployment efficiency.
Paper note on making Universal Transformers competitive through parameter sharing plus sparse expert capacity.
Paper note on designing for low inference budgets from the beginning rather than shrinking a generic large-model plan at the end.
Paper note on decoupled low-bit training with a tiny high-precision branch for the parameters that matter most.
Paper note on pushing post-training quantization below 2 bits by preserving salient structure with unusually low overhead.
Paper note on using rotations to remove hidden-state outliers so that weights, activations, and KV cache can all be quantized more uniformly.
Paper note on stabilizing 1-bit weight-and-activation training through better low-bit distribution fitting and more reliable gradient estimates.
Paper note on converting pretrained transformers into recursive/shared-parameter models with lightweight depth-specific relaxation.
Paper note on replacing a pretrained model's tokenizer while retraining only the embeddings and the LM head.
Paper note on why compact-model training has a different systems bottleneck profile than many big-model intuitions suggest.
Paper note on tokenizer evaluation across scales and why compression alone is not enough to rank tokenizers.
Paper note on recurrent self-attentive depth, dynamic halting, and the idea that transformers can trade stored depth for repeated computation.
Paper note on reducing output-layer memory and logits cost by restructuring vocabulary prediction.