Goal

Run a continuous local research loop against benchmark/train_gpt_mlx.py and improve post-roundtrip val_bpb while keeping artifact size and runtime visible.

This loop is for local discovery, falsification, and idea triage. It is not a claim of leaderboard performance.

Primary metric

  • val_bpb (lower is better)

Secondary metrics

  • total_bytes
  • artifact_headroom_bytes
  • artifact_cap_used_pct
  • step_ms
  • eval_ms
  • pre_quant_val_bpb
  • delta_bpb_roundtrip
  • train_tokens_consumed
  • train_shards_touched_est
  • train_shard_coverage_pct
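Several of the secondary metrics are derived from the primary ones. A minimal sketch of the assumed relationships (the harness's exact formulas may differ; the function name is hypothetical, and the 16,000,000-byte cap comes from the operating rules below):

```python
ARTIFACT_CAP_BYTES = 16_000_000  # hard cap from the operating rules

def derived_metrics(total_bytes: int, pre_quant_val_bpb: float, val_bpb: float) -> dict:
    """Compute headroom, cap usage, and roundtrip degradation
    from the raw size and bpb measurements."""
    return {
        "artifact_headroom_bytes": ARTIFACT_CAP_BYTES - total_bytes,
        "artifact_cap_used_pct": 100.0 * total_bytes / ARTIFACT_CAP_BYTES,
        "delta_bpb_roundtrip": val_bpb - pre_quant_val_bpb,
    }
```

Reading runs this way makes it obvious when a val_bpb win was bought with headroom that a later change will need.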

Benchmark entry point

Always run the benchmark through:

bash autoresearch.sh

That script uses the cached sp1024 dataset/tokenizer from .upstream/parameter-golf, runs the tracked local trainer snapshot, and prints parseable METRIC name=value lines for Pi.
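Assuming the METRIC lines really are a bare name=value pair per line (adjust the regex if the script's actual output differs), they can be scraped like this:

```python
import re

# Assumed shape: "METRIC val_bpb=1.234" — one metric per line.
METRIC_RE = re.compile(r"^METRIC\s+(\w+)=(\S+)$")

def parse_metrics(log_text: str) -> dict:
    """Collect the last value seen for each METRIC name; numeric
    values become floats, anything else stays a string."""
    metrics = {}
    for line in log_text.splitlines():
        m = METRIC_RE.match(line.strip())
        if m:
            name, raw = m.groups()
            try:
                metrics[name] = float(raw)
            except ValueError:
                metrics[name] = raw
    return metrics
```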

Research profiles

The local benchmark has three profiles:

  • PROFILE=breadth: cheap proxy runs for early idea search
  • PROFILE=confirm: stronger proxy runs for candidates that survive breadth
  • PROFILE=full: expensive local runs on the full validation split

Default profile is breadth.

Important: breadth and confirm are proxy measurements. Only full is intended to approximate the real local final metric. Do not compare results across profiles as if they were the same experiment.
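Assuming the script reads PROFILE from the environment (the names above suggest so, but this is an assumption), a hypothetical helper that builds one run's argv and environment, defaulting to breadth:

```python
import os

VALID_PROFILES = {"breadth", "confirm", "full"}

def benchmark_command(profile: str = "breadth"):
    """Return (argv, env) for a single benchmark run.

    benchmark_command is a hypothetical name; the caller passes the
    result to subprocess.run so that launching stays explicit and
    only one MLX process runs at a time."""
    if profile not in VALID_PROFILES:
        raise ValueError(f"unknown profile: {profile!r}")
    env = {**os.environ, "PROFILE": profile}
    return ["bash", "autoresearch.sh"], env
```

Keeping the default at breadth in code mirrors the rule that promotion to confirm or full is a deliberate step, not a default.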

Files Pi may edit

  • benchmark/train_gpt_mlx.py
  • autoresearch.sh
  • autoresearch.checks.sh
  • autoresearch.md
  • autoresearch.ideas.md
  • scripts/extract_parameter_golf_metrics.py

Files Pi must not edit

  • .upstream/
  • .autogolf/
  • .pi-sessions/
  • logs/
  • dataset/tokenizer contents
  • README.md
  • Justfile
  • scripts/bootstrap-parameter-golf.sh
  • scripts/run-pi-autoresearch.sh
  • scripts/tmux-pi-autoresearch.sh

Operating rules

  • Prefer small, falsifiable changes over rewrites.
  • Favor fast local loops and directional signal.
  • Default to PROFILE=breadth.
  • Promote to PROFILE=confirm only after a breadth run clearly improves recent results or opens a strong new direction.
  • Use PROFILE=full sparingly, only for top candidates.
  • Do not run multiple MLX benchmark processes at once on this machine unless explicitly asked.
  • Treat the 16,000,000-byte artifact cap as a hard constraint, not a soft preference.
  • Keep a run only if its primary metric is materially better than the current best; repo config currently sets minImprovement=0.002.
  • Always look at artifact headroom and roundtrip degradation, not only raw val_bpb.
  • Treat train-token budget and shard coverage as fidelity metrics; longer runs are only more meaningful if they also improve workload fidelity.
  • Preserve the parseable log lines:
    • final_int8_zlib_roundtrip_exact
    • Total submission size int8+zlib
    • the last step: validation line
  • Do not introduce network access into the benchmark itself.
  • If a run is worse, ambiguous, or crashes, log it and move on.
  • Update autoresearch.ideas.md when a direction looks promising, exhausted, or invalid.
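The keep/discard rules above can be sketched as a single gate. keep_run is a hypothetical name; the 0.002 margin and the 16,000,000-byte hard cap are the values stated on this page:

```python
MIN_IMPROVEMENT = 0.002          # minImprovement from repo config
ARTIFACT_CAP_BYTES = 16_000_000  # hard constraint, not a soft preference

def keep_run(new_val_bpb: float, best_val_bpb: float, total_bytes: int) -> bool:
    """Return True only if the run respects the hard artifact cap
    AND beats the current best val_bpb by the configured margin."""
    if total_bytes > ARTIFACT_CAP_BYTES:
        return False  # cap violations are never kept
    return (best_val_bpb - new_val_bpb) >= MIN_IMPROVEMENT
```

A fuller gate would also inspect delta_bpb_roundtrip and headroom per the rules above; this sketch shows only the hard cut.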

Current benchmark assumptions

  • Cached data path: .upstream/parameter-golf/data/datasets/fineweb10B_sp1024
  • Cached tokenizer path: .upstream/parameter-golf/data/tokenizers/fineweb_1024_bpe.model
  • Upstream source snapshot reference: benchmark/UPSTREAM_COMMIT.txt
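A hypothetical preflight check for the cached assets listed above, useful before burning a run on a missing path:

```python
from pathlib import Path

# Paths taken verbatim from the benchmark assumptions above.
EXPECTED_PATHS = [
    ".upstream/parameter-golf/data/datasets/fineweb10B_sp1024",
    ".upstream/parameter-golf/data/tokenizers/fineweb_1024_bpe.model",
    "benchmark/UPSTREAM_COMMIT.txt",
]

def missing_assets(root: str = ".") -> list[str]:
    """Return the expected cached-asset paths absent under root."""
    return [p for p in EXPECTED_PATHS if not (Path(root) / p).exists()]
```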

Search order

  1. Small schedule and runtime changes in autoresearch.sh
  2. Small trainer edits in benchmark/train_gpt_mlx.py
  3. Compression-aware export and clipping ideas
  4. Shared-weight, recurrent, or other structural ideas only after the loop is stable

Research method

  1. Run cheap breadth experiments to rank ideas quickly.
  2. Record promising directions and dead ends in autoresearch.ideas.md.
  3. Promote only the best few candidates to confirm.
  4. Use full sparingly.
  5. Avoid spending many expensive runs on one idea before it has won in breadth.

Why this page lives in meta

This page describes the research harness and operating procedure, not an LLM mechanism. Keep it in meta so the main garden can stay focused on challenge framing, hypotheses, lanes, and papers.