Goal

Run a continuous local research loop against benchmark/train_gpt_mlx.py and improve post-roundtrip val_bpb while keeping artifact size and runtime visible.

This loop is for local discovery, falsification, and idea triage. It is not a claim of leaderboard performance.

Primary metric

  • val_bpb (lower is better)

Secondary metrics

  • total_bytes
  • artifact_headroom_bytes
  • artifact_cap_used_pct
  • step_ms
  • eval_ms
  • pre_quant_val_bpb
  • delta_bpb_roundtrip
  • train_tokens_consumed
  • train_shards_touched_est
  • train_shard_coverage_pct
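Several of the secondary metrics are derived from the primary ones. A minimal sketch of the assumed relationships (the harness's exact formulas may differ; the function name is hypothetical, and the 16,000,000-byte cap comes from the operating rules below):

```python
ARTIFACT_CAP_BYTES = 16_000_000  # hard cap from the operating rules

def derived_metrics(total_bytes: int, pre_quant_val_bpb: float, val_bpb: float) -> dict:
    """Compute headroom, cap usage, and roundtrip degradation
    from the raw size and bpb measurements."""
    return {
        "artifact_headroom_bytes": ARTIFACT_CAP_BYTES - total_bytes,
        "artifact_cap_used_pct": 100.0 * total_bytes / ARTIFACT_CAP_BYTES,
        "delta_bpb_roundtrip": val_bpb - pre_quant_val_bpb,
    }
```

Reading runs this way makes it obvious when a val_bpb win was bought with headroom that a later change will need.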

Benchmark entry point

Always run the benchmark through:

bash autoresearch.sh

That script uses the cached sp1024 dataset/tokenizer from .upstream/parameter-golf, runs the tracked local trainer snapshot, and prints parseable METRIC name=value lines for Pi.
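Assuming the METRIC lines really are a bare name=value pair per line (adjust the regex if the script's actual output differs), they can be scraped like this:

```python
import re

# Assumed shape: "METRIC val_bpb=1.234" — one metric per line.
METRIC_RE = re.compile(r"^METRIC\s+(\w+)=(\S+)$")

def parse_metrics(log_text: str) -> dict:
    """Collect the last value seen for each METRIC name; numeric
    values become floats, anything else stays a string."""
    metrics = {}
    for line in log_text.splitlines():
        m = METRIC_RE.match(line.strip())
        if m:
            name, raw = m.groups()
            try:
                metrics[name] = float(raw)
            except ValueError:
                metrics[name] = raw
    return metrics
```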

Research profiles

The local benchmark has three profiles:

  • PROFILE=breadth: cheap proxy runs for early idea search
  • PROFILE=confirm: stronger proxy runs for candidates that survive breadth
  • PROFILE=full: expensive local runs on the full validation split

Default profile is breadth.

Important: breadth and confirm are proxy measurements. Only full is intended to approximate the real local final metric. Do not compare results across profiles as if they were the same experiment.
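Assuming the script reads PROFILE from the environment (the names above suggest so, but this is an assumption), a hypothetical helper that builds one run's argv and environment, defaulting to breadth:

```python
import os

VALID_PROFILES = {"breadth", "confirm", "full"}

def benchmark_command(profile: str = "breadth"):
    """Return (argv, env) for a single benchmark run.

    benchmark_command is a hypothetical name; the caller passes the
    result to subprocess.run so that launching stays explicit and
    only one MLX process runs at a time."""
    if profile not in VALID_PROFILES:
        raise ValueError(f"unknown profile: {profile!r}")
    env = {**os.environ, "PROFILE": profile}
    return ["bash", "autoresearch.sh"], env
```

Keeping the default at breadth in code mirrors the rule that promotion to confirm or full is a deliberate step, not a default.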

Files Pi may edit

  • benchmark/train_gpt_mlx.py
  • autoresearch.sh
  • autoresearch.checks.sh
  • autoresearch.md
  • autoresearch.ideas.md
  • scripts/extract_parameter_golf_metrics.py

Files Pi must not edit

  • .upstream/
  • .autogolf/
  • .pi-sessions/
  • logs/
  • dataset/tokenizer contents
  • README.md
  • Justfile
  • scripts/bootstrap-parameter-golf.sh
  • scripts/run-pi-autoresearch.sh
  • scripts/tmux-pi-autoresearch.sh

Operating rules

  • Prefer small, falsifiable changes over rewrites.
  • Favor fast local loops and directional signal.
  • Default to PROFILE=breadth.
  • Promote to PROFILE=confirm only after a breadth run clearly improves recent results or opens a strong new direction.
  • Use PROFILE=full sparingly, only for top candidates.
  • Do not run multiple MLX benchmark processes at once on this machine unless explicitly asked.
  • Treat the 16,000,000-byte artifact cap as a hard constraint, not a soft preference.
  • Keep a run only if its primary metric is materially better than the current best; repo config currently sets minImprovement=0.002.
  • Always look at artifact headroom and roundtrip degradation, not only raw val_bpb.
  • Treat train-token budget and shard coverage as fidelity metrics; longer runs are only more meaningful if they also improve workload fidelity.
  • Preserve the parseable log lines:
    • final_int8_zlib_roundtrip_exact
    • Total submission size int8+zlib
    • the last step: validation line
  • Do not introduce network access into the benchmark itself.
  • If a run is worse, ambiguous, or crashes, log it and move on.
  • Update autoresearch.ideas.md when a direction looks promising, exhausted, or invalid.
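The keep/discard rules above can be sketched as a single gate. keep_run is a hypothetical name; the 0.002 margin and the 16,000,000-byte hard cap are the values stated on this page:

```python
MIN_IMPROVEMENT = 0.002          # minImprovement from repo config
ARTIFACT_CAP_BYTES = 16_000_000  # hard constraint, not a soft preference

def keep_run(new_val_bpb: float, best_val_bpb: float, total_bytes: int) -> bool:
    """Return True only if the run respects the hard artifact cap
    AND beats the current best val_bpb by the configured margin."""
    if total_bytes > ARTIFACT_CAP_BYTES:
        return False  # cap violations are never kept
    return (best_val_bpb - new_val_bpb) >= MIN_IMPROVEMENT
```

A fuller gate would also inspect delta_bpb_roundtrip and headroom per the rules above; this sketch shows only the hard cut.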

Current benchmark assumptions

  • Cached data path: .upstream/parameter-golf/data/datasets/fineweb10B_sp1024
  • Cached tokenizer path: .upstream/parameter-golf/data/tokenizers/fineweb_1024_bpe.model
  • Upstream source snapshot reference: benchmark/UPSTREAM_COMMIT.txt
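A hypothetical preflight check for the cached assets listed above, useful before burning a run on a missing path:

```python
from pathlib import Path

# Paths taken verbatim from the benchmark assumptions above.
EXPECTED_PATHS = [
    ".upstream/parameter-golf/data/datasets/fineweb10B_sp1024",
    ".upstream/parameter-golf/data/tokenizers/fineweb_1024_bpe.model",
    "benchmark/UPSTREAM_COMMIT.txt",
]

def missing_assets(root: str = ".") -> list[str]:
    """Return the expected cached-asset paths absent under root."""
    return [p for p in EXPECTED_PATHS if not (Path(root) / p).exists()]
```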

Search order

  1. Small schedule and runtime changes in autoresearch.sh
  2. Small trainer edits in benchmark/train_gpt_mlx.py
  3. Compression-aware export and clipping ideas
  4. Shared-weight, recurrent, or other structural ideas only after the loop is stable

Research method

  1. Run cheap breadth experiments to rank ideas quickly.
  2. Record promising directions and dead ends in autoresearch.ideas.md.
  3. Promote only the best few candidates to confirm.
  4. Use full sparingly.
  5. Avoid spending many expensive runs on one idea before it has won in breadth.

Why this page lives in meta

This page describes the research harness and operating procedure, not an LLM mechanism. Keep it in meta so the main garden can stay focused on challenge framing, hypotheses, lanes, and papers.