Run bounded automated experiment iterations by recording baselines, applying hypothesis patches, comparing metrics, protecting regression guards, and deciding keep, discard, rollback, or block. Use when $autoresearch is named or a repo/skill needs evidence-backed research, metric tracking, or safe optimisation loops.
69
85%
Does it follow best practices?
Impact
—
No eval scenarios have been run
Passed
No known issues
Bounded evidence loop: baseline, hypothesize, patch, score, decide, record. Humans set goal, metric, scope, and stop condition; the agent runs reversible hypotheses inside those bounds.
$autoresearch.Owns the experiment contract, ledger, and keep/discard/block recommendation; parent thread owns final decision. Fixed surfaces are benchmark harness, evaluator, data prep, datasets, tokenizer files, and guard commands. Block on unclear metric, boundary, runtime, guard semantics, network/dependency/destructive approvals, contract edits, or unbounded runs.
Target path, boundaries, run tag, metric direction, verify/guard commands, stop condition, evidence path, train/selection/test split policy, and optional evaluator contract or min_delta policy.
Ledger plus closeout: hypotheses, patches, commands, scores, baseline, best delta, guard status, changed files, blockers, and schema_version when schema-bound. For skill optimization contracts, also produce best_skill.md, rejected-edits.jsonl, and promotion.json before recommending a canonical edit.
references/discovery-interview.md when the request is underspecified.jscraik/autoresearch, read README.md, program.md, prepare.py, and train.py; normally edit only train.py.noise_runs, aggregation, min_delta, and confirmation rule.git status, commits, and last kept diff.Verify, optional Guard, then keep/discard/crash/block with evidence and update the ledger.references/contract.yaml declares optimization.enabled, treat that block as the authority for split visibility, edit budget, protected paths, anti-cheat checks, and promotion. Write candidates under the evidence root; do not overwrite canonical SKILL.md until the promotion contract passes review.noise_runs, aggregation or median policy, min_delta, and the confirmation rule before keep/discard.blocked or not ready, recommend rewrite or eval-design work, then stop.run_tag: 2026-05-16-skill-quality
hypothesis: "Adding binary expected_signals improves smoke eval pass rate."
patch: "references/evals.yaml only"
baseline: {command: "./bin/ask evals run Skills/agent-ops/foo --mode smoke --runner discovery-smoke --json --robot", score: "6/8"}
verify: {command: "./bin/ask evals run Skills/agent-ops/foo --mode smoke --runner discovery-smoke --json --robot", score: "8/8"}
guard: {command: "./bin/ask skills audit Skills/agent-ops/foo --level strict --json --robot", status: pass}
decision: keep
reason: "delta >= min_delta and guard passed"$ uv run train.py --steps 200 --json
{"val_bpb":1.742,"status":"pass"}
$ apply_patch # hypothesis: smaller learning-rate warmup
$ uv run train.py --steps 200 --json
{"val_bpb":1.719,"status":"pass"}
$ uv run pytest tests/regression_guard.py
1 passedDecision: keep only if baseline - candidate >= min_delta, guard passes, and the ledger records the patch.
Repair the smallest failing hypothesis, parser, command, or ledger entry first; rerun that gate before broad validation. Preserve fixed evaluator/data surfaces and provenance. Mark blocked with the exact missing permission, runtime, credential, metric, corpus, or toolchain.
Baseline exists before any kept change; every decision has command output, metric evidence, ledger status, guard status, and residual risk. Skill optimization additionally requires rejected-edit buffer evidence, protected-path anti-cheat status, a best-candidate artifact, and a reviewed promotion manifest before canonical source changes are recommended.
min_delta, or accepting subjective claims without a metric/binary rubric.uv run train.py, and keep only lower val_bpb changes."8e7e19d
If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.