This page summarizes the local research chronology already visible in this repository’s kept commits and autoresearch.jsonl.

It is intentionally distinct from the challenge history, which summarizes only the thin public record.

Phase 0: make the loop real at all

The earliest local work was not about squeezing the last few basis points out of the model. It was about making a reproducible local loop exist.

Important turning points included:

  • artifact-phase observability
  • benchmark supervision / timeouts
  • enough schedule simplification to get valid runs instead of repeated timeouts
  • fixing the Metal graph / evaluation plumbing so the trainer stopped stalling

This is the phase that made later research possible at all.

Phase 1: proxy-loop shaping won before fancy modeling did

Once the pipeline ran, one of the biggest early sources of improvement was simply making the local proxy useful.

The broad direction was:

  • add capped validation prefixes for cheap breadth runs
  • make breadth fast enough to iterate on
  • then step up the schedule gradually rather than guessing the final shape all at once
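The capped-prefix idea can be sketched minimally. This is an illustrative fragment, not code from the repo: the names `capped_val` and `VAL_PREFIX_BYTES` are assumptions, and the cap value is a placeholder.

```python
# Hypothetical sketch: cap the validation data to a fixed byte prefix so
# breadth runs stay cheap; confirm runs would use the full set.
VAL_PREFIX_BYTES = 1_000_000  # placeholder breadth cap

def capped_val(data: bytes, cap: int = VAL_PREFIX_BYTES) -> bytes:
    """Return a fixed prefix of the validation bytes for cheap proxy evals."""
    return data[:cap]
```

The point is only that a deterministic prefix keeps breadth comparisons consistent run-to-run while cutting evaluation cost.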

Representative kept improvements from autoresearch.jsonl:

  • breadth improved from roughly 2.88 → 2.75 → 2.64 → 2.58 → 2.53
  • confirm then improved from roughly 2.55 to 2.51

That phase matters because it established a core local lesson:

better research throughput and a saner proxy loop were worth more than early speculative architecture changes.

Phase 2: compression-aware training tweaks beat many export-only tweaks

After the loop stabilized, the strongest local wins came from training-side robustness rather than from broad artifact-format surgery.

Especially important kept changes were:

  • GRAD_CLIP_NORM=1.0
  • lowering TIED_EMBED_LR to 0.04
  • lowering SCALAR_LR to 0.035
  • optimizer smoothing via BETA2 rising through 0.97, 0.98, 0.99, 0.995

These moves repeatedly beat many small export-only variants.
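The kept training-side values above can be collected in one place. Only the numbers come from the log; the dict layout and how it would be wired into the trainer are assumptions.

```python
# Illustrative summary of the kept training-side tweaks (values from the
# log entries above; integration details are not recorded here).
TRAIN_CONFIG = {
    "GRAD_CLIP_NORM": 1.0,    # global gradient-norm clip
    "TIED_EMBED_LR": 0.04,    # lowered tied-embedding learning rate
    "SCALAR_LR": 0.035,       # lowered scalar-parameter learning rate
    "BETA2": 0.995,           # final step of the optimizer-smoothing ladder
}
```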

Phase 3: many selective export tweaks mostly failed or stayed flat

The logs also show a useful negative result: a lot of obvious compression-side interventions did not clearly beat the stronger training-side setup.

Examples that mostly failed, regressed, or stayed flat:

  • clip-percentile nudges with little material benefit
  • fp16 passthrough for medium-size KV projections
  • fp16 passthrough for tied embeddings
  • no-clipping variants
  • several narrowly targeted protected-tensor tests

This phase is important because it sharpens the real research question. The challenge is not solved by sprinkling high precision onto arbitrary tensors.

Phase 4: optimizer smoothing became a real local frontier

One of the cleanest confirmed local trajectories was the optimizer-smoothing ladder:

  • BETA2 0.97
  • BETA2 0.98
  • BETA2 0.99
  • BETA2 0.995

Each step improved confirm val_bpb in the logs while holding bytes roughly steady or slightly better.
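Why raising BETA2 smooths training can be shown with a toy second-moment update. This is a generic Adam-style sketch under the assumption the optimizer uses an exponential moving average of squared gradients; it is not the repo's optimizer code.

```python
# Adam's second moment v is an EMA of squared gradients; a larger BETA2
# averages over a longer effective window of roughly 1 / (1 - BETA2) steps.
def second_moment(grads, beta2):
    v = 0.0
    for g in grads:
        v = beta2 * v + (1.0 - beta2) * g * g
    return v

# Effective averaging windows for the ladder:
windows = {b2: round(1.0 / (1.0 - b2)) for b2 in (0.97, 0.98, 0.99, 0.995)}
# → roughly 33, 50, 100, and 200 steps respectively
```

So each rung of the ladder roughly doubles the smoothing window, which is consistent with the gradual, monotone confirm improvements seen in the logs.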

That makes optimizer smoothing part of the actual local lineage, not just an implementation detail.

Phase 5: speculative shared-depth work was promising but fragile

Local history also includes a clear speculative branch around shared depth and recurrence.

What this branch produced is not captured in the kept run lineage. The important correction is that the lane is not imaginary: it has already been explored locally, even if it has not yet yielded the cleanest kept runs.

Phase 6: AlphaXiv-driven architecture reading produced a concrete win

A later local phase explicitly mined recent papers and converted them into experiment ideas.

The clearest success so far is Extra RMSNorm, which produced:

  • a breadth improvement from about 2.6491 to 2.6108
  • then a confirm improvement from about 2.4398 to 2.4260

This is one of the strongest examples in the repo of the KB actually feeding the benchmark loop.
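For reference, RMSNorm itself is a standard layer; a minimal sketch follows. Where exactly the "extra" norm was inserted in the architecture is not recorded here, so this only illustrates the operation, not the winning placement.

```python
import math

# Standard RMSNorm: normalize by the root-mean-square of the activations,
# then apply a learned elementwise scale.
def rmsnorm(x, weight, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]
```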

What this chronology says overall

The local lineage so far suggests a sequence:

  1. make the loop runnable
  2. make the proxy meaningful
  3. win with training-side robustness before exotic export tricks
  4. keep recurrence/shared-depth alive as a real but fragile lane
  5. use literature synthesis to guide architecture tweaks with better priors