Run one Autolab benchmark experiment safely on Hugging Face Jobs. Use when a planner, reviewer, or experiment worker is preparing, auditing, launching, or reviewing a single train.py hypothesis against the current local promoted master.
Use this for any single Autolab experiment that should result in exactly one managed benchmark run.
1. Load credentials: `. ~/.autolab/credentials`
2. Refresh the local master: `uv run scripts/refresh_master.py --fetch-dag`
3. Edit `train.py` for the single intended hypothesis.
4. Run preflight: `uv run scripts/hf_job.py preflight`
5. Launch the experiment: `uv run scripts/hf_job.py launch --mode experiment`
6. Follow the logs and save them: `uv run scripts/hf_job.py logs <JOB_ID> --follow --output /tmp/autolab-run.log`
7. Parse the metric: `uv run scripts/parse_metric.py /tmp/autolab-run.log`
8. Submit with `uv run scripts/submit_patch.py --comment "..."` only if `val_bpb` beats current master.

- Treat `train_orig.py` as the refreshed local-master base. If preflight reports multiple known hypothesis categories, stop and inspect the diff before launching.
- Do not rely on `main` and `origin/main` when judging freshness. In this rig repo those refs describe control-plane history, not the benchmark master. The comparable base is whatever `refresh_master.py` just wrote into `train_orig.py`, `research/live/master.json`, and `research/results.tsv`.
- Do not run `uv run scripts/hf_job.py launch --mode prepare` from an experiment-scoped worktree. `prepare` is shared bootstrap work, not per-experiment work.

- `uv run scripts/hf_job.py preflight --json`
  Use this when you need to inspect the diff preview, active conflicts, or detected change categories programmatically.
- `uv run scripts/trackio_reporter.py summary --max-jobs 25`
  Use this to confirm the experiment id or hypothesis is not already active.
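
The two decision points in this skill (stop when preflight reports multiple hypothesis categories, and submit only when `val_bpb` beats current master) can be sketched as small pure functions. This is a minimal illustration, not part of the Autolab scripts: the function names, the category strings, and the metric values are hypothetical, and it assumes that a lower `val_bpb` is better. How the categories and metric values are read from `preflight --json`, `parse_metric.py`, or `research/live/master.json` is left out because their output formats are not described here.

```python
import math
from typing import Iterable


def needs_diff_inspection(categories: Iterable[str]) -> bool:
    """Return True when more than one distinct hypothesis category was reported.

    Hypothetical helper: the caller supplies the category names however
    preflight exposes them; this function only counts distinct values.
    """
    return len(set(categories)) > 1


def should_submit(candidate_val_bpb: float, master_val_bpb: float) -> bool:
    """Return True only when the candidate strictly beats master.

    Assumption: lower val_bpb is better. A tie is not a win.
    """
    values = (candidate_val_bpb, master_val_bpb)
    if not all(math.isfinite(v) for v in values):
        return False  # a NaN or infinite metric never counts as a win
    return candidate_val_bpb < master_val_bpb


if __name__ == "__main__":
    # Hypothetical category names and metric values, for illustration only.
    print(needs_diff_inspection(["optimizer", "schedule"]))
    print(should_submit(candidate_val_bpb=0.98, master_val_bpb=1.00))
```

Keeping the checks free of I/O means they can be unit tested without launching a job. Treating a tie as "do not submit" is a conservative choice made for this sketch.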