Run one Autolab benchmark experiment safely on Hugging Face Jobs. Use when a planner, reviewer, or experiment worker is preparing, auditing, launching, or reviewing a single train.py hypothesis against the current local promoted master.
87
83%
Does it follow best practices?
Impact
99%
1.94x average score across 3 eval scenarios
Passed
No known issues
Full experiment workflow script
Source credentials | 100% | 100%
Refresh master with dag | 100% | 100%
Only train.py edited | 100% | 75%
Preflight before launch | 100% | 100%
Launch mode experiment | 100% | 100%
Log output to /tmp | 100% | 100%
Parse metric from log | 100% | 100%
Record run with submit_patch | 100% | 100%
Promotion condition val_bpb | 100% | 100%
Correct step ordering | 100% | 100%
Exactly one launch call | 100% | 100%
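The "Parse metric from log" and "Promotion condition val_bpb" criteria above could be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the log line shape (`val_bpb: <number>` or `val_bpb=<number>`), the helper names `parse_val_bpb` and `should_promote`, and the rule that a lower val_bpb than master qualifies for promotion are all assumptions.

```python
import re
from typing import Optional

# Assumed log line shape, e.g. "val_bpb: 0.9876" or "val_bpb=0.9876".
VAL_BPB_RE = re.compile(
    r"val_bpb\s*[:=]\s*([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)"
)


def parse_val_bpb(log_text: str) -> Optional[float]:
    """Return the last val_bpb reported in the log, or None if absent."""
    matches = VAL_BPB_RE.findall(log_text)
    return float(matches[-1]) if matches else None


def should_promote(candidate: Optional[float], master: float) -> bool:
    """Assumption: lower val_bpb is better; a missing metric never promotes."""
    return candidate is not None and candidate < master
```

Taking the last match treats the final reported value as the run's result, and returning `None` for a missing metric keeps a failed or truncated run from being promoted by accident.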
Guardrails and pre-launch safety checks
Correct freshness baseline | 0% | 100%
Ignore git main | 10% | 100%
Stop on multiple hypothesis categories | 91% | 100%
Stale workspace: stop and rewrite | 0% | 100%
Prepare mode prohibited | 0% | 100%
Credentials sourced in script | 0% | 100%
Refresh master in script | 0% | 100%
Preflight in script | 0% | 100%
Script stops before launch | 100% | 100%
train_orig.py named as base | 0% | 100%
Duplicate detection and programmatic preflight
trackio_reporter used | 0% | 100%
max-jobs flag present | 0% | 100%
Preflight json flag used | 0% | 100%
JSON output captured | 50% | 100%
Duplicate check purpose | 41% | 100%
Preflight json purpose | 50% | 100%
No duplicate launch rule | 100% | 100%
uv run used for scripts | 0% | 100%
Non-zero exit on duplicate | 100% | 100%
Single-experiment scope | 100% | 100%
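The "JSON output captured" and "Non-zero exit on duplicate" criteria above suggest a small gate between preflight and launch. The sketch below is one way such a gate might look. The JSON schema (a top-level `duplicate` boolean) and the function name `exit_code_for_preflight` are hypothetical, and the real preflight output may use different fields.

```python
import json


def exit_code_for_preflight(preflight_json: str) -> int:
    """Map a captured preflight JSON document to a process exit code.

    Hypothetical schema: {"duplicate": bool}. The real preflight output
    may use different field names.
    """
    try:
        report = json.loads(preflight_json)
    except json.JSONDecodeError:
        return 2  # unreadable report: refuse to launch
    if not isinstance(report, dict):
        return 2  # unexpected shape: refuse to launch
    if report.get("duplicate", False):
        return 1  # duplicate hypothesis: non-zero exit, no launch
    return 0  # no duplicate reported: safe to proceed
```

A wrapper script could pass this value to `sys.exit`, so that a duplicate or an unreadable report stops the workflow before any launch call is made.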
0448a7c