## Goal

Run a continuous local research loop against `benchmark/train_gpt_mlx.py` and improve post-roundtrip `val_bpb` while keeping artifact size and runtime visible.

This loop is for local discovery, falsification, and idea triage. It is not a claim of leaderboard performance.
## Primary metric

- `val_bpb` (lower is better)
## Secondary metrics

- `total_bytes`
- `artifact_headroom_bytes`
- `artifact_cap_used_pct`
- `step_ms`
- `eval_ms`
- `pre_quant_val_bpb`
- `delta_bpb_roundtrip`
- `train_tokens_consumed`
- `train_shards_touched_est`
- `train_shard_coverage_pct`
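Two of these are presumably simple derivations from `total_bytes` and the 16,000,000-byte artifact cap noted under the operating rules. A minimal sketch of the assumed relationships; the harness's exact formulas may differ:

```python
ARTIFACT_CAP_BYTES = 16_000_000  # hard cap stated in the operating rules

def artifact_headroom_bytes(total_bytes):
    # Bytes left under the cap; negative means the artifact is over budget.
    return ARTIFACT_CAP_BYTES - total_bytes

def artifact_cap_used_pct(total_bytes):
    # Share of the cap consumed by the artifact, as a percentage.
    return 100.0 * total_bytes / ARTIFACT_CAP_BYTES
```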
## Benchmark entry point

Always run the benchmark through `bash autoresearch.sh`. That script uses the cached sp1024 dataset/tokenizer from `.upstream/parameter-golf`, runs the tracked local trainer snapshot, and prints parseable `METRIC name=value` lines for Pi.
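Those `METRIC name=value` lines can be scraped with a few lines of Python. A sketch of a parser, assuming one metric per line with a numeric value; the tracked `scripts/extract_parameter_golf_metrics.py` is the authoritative version and may differ:

```python
import re

# Assumed line shape: "METRIC <name>=<number>", one metric per line.
METRIC_RE = re.compile(r"^METRIC\s+(\w+)=([-+0-9.eE]+)\s*$")

def parse_metrics(lines):
    """Collect METRIC name=value lines into a dict of floats.

    Later occurrences win, so the dict reflects the final reported values.
    """
    out = {}
    for line in lines:
        m = METRIC_RE.match(line.strip())
        if m:
            out[m.group(1)] = float(m.group(2))
    return out
```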
## Research profiles

The local benchmark has three profiles:

- `PROFILE=breadth`: cheap proxy runs for early idea search
- `PROFILE=confirm`: stronger proxy runs for candidates that survive breadth
- `PROFILE=full`: expensive local runs on the full validation split

The default profile is `breadth`.

Important: `breadth` and `confirm` are proxy measurements. Only `full` is intended to approximate the real local final metric. Do not compare results across profiles as if they were the same experiment.
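One way to make the no-cross-profile rule mechanical is to only ever rank runs within their own profile bucket. A hypothetical helper; run records and field names are illustrative, not the harness's actual schema:

```python
def best_per_profile(runs):
    """Best (lowest) val_bpb per profile, never mixing profiles.

    runs: illustrative dicts like {"profile": "breadth", "val_bpb": 1.35}.
    """
    best = {}
    for run in runs:
        p = run["profile"]
        if p not in best or run["val_bpb"] < best[p]["val_bpb"]:
            best[p] = run
    return best
```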
## Files Pi may edit

- `benchmark/train_gpt_mlx.py`
- `autoresearch.sh`
- `autoresearch.checks.sh`
- `autoresearch.md`
- `autoresearch.ideas.md`
- `scripts/extract_parameter_golf_metrics.py`
## Files Pi must not edit

- `.upstream/`
- `.autogolf/`
- `.pi-sessions/`
- `logs/`
- dataset/tokenizer contents
- `README.md`
- `Justfile`
- `scripts/bootstrap-parameter-golf.sh`
- `scripts/run-pi-autoresearch.sh`
- `scripts/tmux-pi-autoresearch.sh`
## Operating rules
- Prefer small, falsifiable changes over rewrites.
- Favor fast local loops and directional signal.
- Default to `PROFILE=breadth`.
- Promote to `PROFILE=confirm` only after a breadth run clearly improves recent results or opens a strong new direction.
- Use `PROFILE=full` sparingly, only for top candidates.
- Do not run multiple MLX benchmark processes at once on this machine unless explicitly asked.
- Treat the 16,000,000-byte artifact cap as a hard constraint, not a soft preference.
- Require materially better primary metrics before keeping a run. Repo config currently sets `minImprovement=0.002`.
- Always look at artifact headroom and roundtrip degradation, not only raw `val_bpb`.
- Treat train-token budget and shard coverage as fidelity metrics; longer runs are only more meaningful if they also improve workload fidelity.
- Preserve the parseable log lines:
  - `final_int8_zlib_roundtrip_exact`
  - `Total submission size int8+zlib`
  - the last `step:validation` line
- Do not introduce network access into the benchmark itself.
- If a run is worse, ambiguous, or crashes, log it and move on.
- Update `autoresearch.ideas.md` when a direction looks promising, exhausted, or invalid.
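The artifact-cap and minimum-improvement rules combine into one keep/discard decision per run. A sketch with illustrative field names; only the 16,000,000-byte cap and `minImprovement=0.002` values come from this page:

```python
ARTIFACT_CAP_BYTES = 16_000_000  # hard constraint from the rules above
MIN_IMPROVEMENT = 0.002          # repo's minImprovement setting

def keep_run(run, best_val_bpb):
    """Decide whether a finished run replaces the current best.

    run is an illustrative dict with total_bytes and val_bpb; a run over
    the artifact cap is rejected no matter how good its val_bpb is.
    """
    if run["total_bytes"] > ARTIFACT_CAP_BYTES:
        return False  # hard constraint, not a soft preference
    return best_val_bpb - run["val_bpb"] >= MIN_IMPROVEMENT
```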
## Current benchmark assumptions

- Cached data path: `.upstream/parameter-golf/data/datasets/fineweb10B_sp1024`
- Cached tokenizer path: `.upstream/parameter-golf/data/tokenizers/fineweb_1024_bpe.model`
- Upstream source snapshot reference: `benchmark/UPSTREAM_COMMIT.txt`
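These assumptions can be checked before launching a run. A preflight sketch using only the paths listed above; the function name is hypothetical:

```python
from pathlib import Path

# Paths taken from this page: cached dataset, cached tokenizer, and the
# upstream snapshot reference. Fail fast if any of them is missing.
REQUIRED_ASSETS = (
    ".upstream/parameter-golf/data/datasets/fineweb10B_sp1024",
    ".upstream/parameter-golf/data/tokenizers/fineweb_1024_bpe.model",
    "benchmark/UPSTREAM_COMMIT.txt",
)

def missing_assets(repo_root="."):
    # Return the subset of required paths that do not exist under repo_root.
    root = Path(repo_root)
    return [p for p in REQUIRED_ASSETS if not (root / p).exists()]
```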
## Search order

1. Small schedule and runtime changes in `autoresearch.sh`
2. Small trainer edits in `benchmark/train_gpt_mlx.py`
3. Compression-aware export and clipping ideas
4. Shared-weight, recurrent, or other structural ideas, only after the loop is stable
## Research method

- Run cheap breadth experiments to rank ideas quickly.
- Record promising directions and dead ends in `autoresearch.ideas.md`.
- Promote only the best few candidates to `confirm`.
- Use `full` sparingly.
- Avoid spending many expensive runs on one idea before it has won in breadth.
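The breadth-to-confirm funnel amounts to ranking breadth results and promoting a small top-k. An illustrative sketch; the field names and the default k are assumptions, not harness behavior:

```python
def promote_to_confirm(breadth_runs, k=3):
    """Pick the k best breadth results (lowest val_bpb) for confirm runs.

    breadth_runs: illustrative dicts like {"idea": "clip-export", "val_bpb": 1.31}.
    """
    ranked = sorted(breadth_runs, key=lambda r: r["val_bpb"])
    return [r["idea"] for r in ranked[:k]]
```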
## Why this page lives in meta
This page describes the research harness and operating procedure, not an LLM mechanism. Keep it in meta so the main garden can stay focused on challenge framing, hypotheses, lanes, and papers.