Run bounded automated experiment iterations by recording baselines, applying hypothesis patches, comparing metrics, protecting regression guards, and deciding keep, discard, rollback, or block. Use when $autoresearch is named or a repo/skill needs evidence-backed research, metric tracking, or safe optimisation loops.
63
73%
Does it follow best practices?
Impact
—
No eval scenarios have been run
Advisory
Suggest reviewing before use
Optimize this skill with Tessl
npx tessl skill review --optimize ./Skills/agent-ops/autoresearch/SKILL.mdBounded evidence loop: baseline, hypothesize, patch, score, decide, record. Humans set goal, metric, scope, and stop condition; the agent runs reversible hypotheses inside those bounds.
$autoresearch.Owns the experiment contract, ledger, and keep/discard/block recommendation; parent thread owns final decision. Fixed surfaces are benchmark harness, evaluator, data prep, datasets, tokenizer files, and guard commands. Block on unclear metric, boundary, runtime, guard semantics, network/dependency/destructive approvals, contract edits, or unbounded runs.
Target path, boundaries, run tag, metric direction, verify/guard commands, stop condition, evidence path, and optional evaluator contract or min_delta policy.
Ledger plus closeout: hypotheses, patches, commands, scores, baseline, best delta, guard status, changed files, blockers, and schema_version when schema-bound.
references/discovery-interview.md when the request is underspecified.jscraik/autoresearch, read README.md, program.md, prepare.py, and train.py; normally edit only train.py.noise_runs, aggregation, min_delta, and confirmation rule.git status, commits, and last kept diff.Verify, optional Guard, then keep/discard/crash/block with evidence and update the ledger.noise_runs, aggregation or median policy, min_delta, and the confirmation rule before keep/discard.blocked or not ready, recommend rewrite or eval-design work, then stop.run_tag: 2026-05-16-skill-quality
hypothesis: "Adding binary expected_signals improves smoke eval pass rate."
patch: "references/evals.yaml only"
baseline: {command: "./bin/ask evals run Skills/agent-ops/foo --mode smoke --runner discovery-smoke --json --robot", score: "6/8"}
verify: {command: "./bin/ask evals run Skills/agent-ops/foo --mode smoke --runner discovery-smoke --json --robot", score: "8/8"}
guard: {command: "./bin/ask skills audit Skills/agent-ops/foo --level strict --json --robot", status: pass}
decision: keep
reason: "delta >= min_delta and guard passed"$ uv run train.py --steps 200 --json
{"val_bpb":1.742,"status":"pass"}
$ apply_patch # hypothesis: smaller learning-rate warmup
$ uv run train.py --steps 200 --json
{"val_bpb":1.719,"status":"pass"}
$ uv run pytest tests/regression_guard.py
1 passedDecision: keep only if baseline - candidate >= min_delta, guard passes, and the ledger records the patch.
Repair the smallest failing hypothesis, parser, command, or ledger entry first; rerun that gate before broad validation. Preserve fixed evaluator/data surfaces and provenance. Mark blocked with the exact missing permission, runtime, credential, metric, corpus, or toolchain.
Baseline exists before any kept change; every decision has command output, metric evidence, ledger status, guard status, and residual risk.
min_delta, or accepting subjective claims without a metric/binary rubric.uv run train.py, and keep only lower val_bpb changes."4c78f98
If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.