autoresearch

Run bounded automated experiment iterations by recording baselines, applying hypothesis patches, comparing metrics, protecting regression guards, and deciding keep, discard, rollback, or block. Use when $autoresearch is named or a repo/skill needs evidence-backed research, metric tracking, or safe optimisation loops.

Quality

73%

Does it follow best practices?

Impact

—

No eval scenarios have been run

Security

Advisory

Suggest reviewing before use

Optimize this skill with Tessl

npx tessl skill review --optimize ./Skills/agent-ops/autoresearch/SKILL.md

Quality

Discovery

85%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is a strong description that clearly articulates a specific workflow (bounded experiment iterations) with concrete actions and an explicit 'Use when' clause. Its main weakness is that some trigger terms are technical or domain-specific ('hypothesis patches', 'regression guards') rather than natural user language, which could reduce discoverability for users who don't know the exact terminology.

Suggestions

Add more natural-language trigger terms users might say, such as 'run experiment', 'benchmark', 'A/B test', 'try different approaches and measure results', or 'optimize with data'.

Dimension	Reasoning	Score
Specificity	Lists multiple specific concrete actions: 'recording baselines', 'applying hypothesis patches', 'comparing metrics', 'protecting regression guards', and 'deciding keep, discard, rollback, or block'. These are detailed, actionable steps.	3 / 3
Completeness	Clearly answers both 'what' (run bounded automated experiment iterations with specific steps) and 'when' (explicit 'Use when' clause covering '$autoresearch' naming, evidence-backed research needs, metric tracking, or safe optimisation loops).	3 / 3
Trigger Term Quality	Includes some useful trigger terms like '$autoresearch', 'metric tracking', 'evidence-backed research', and 'optimisation loops', but many are somewhat technical. Natural user phrases like 'run experiment', 'A/B test', 'try and measure', or 'benchmark' are missing. The '$autoresearch' keyword is a specific command trigger but not something users would naturally say.	2 / 3
Distinctiveness Conflict Risk	The combination of experiment iteration loops, hypothesis patches, regression guards, and the specific '$autoresearch' trigger creates a very distinct niche. This is unlikely to conflict with general coding, testing, or research skills.	3 / 3
	Total	11 / 12 Passed

Implementation

62%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a well-structured skill for a complex, multi-step experimental workflow with strong workflow clarity and good safety constraints. Its main weaknesses are moderate verbosity with some overlapping sections (Gotchas/Anti-Patterns/Constraints/Avoid) and actionability that leans more toward policy statements than fully executable guidance. The progressive disclosure structure is present but would benefit from verified bundle files and clearer navigation cues.

Suggestions

Consolidate overlapping sections (Avoid, Constraints, Gotchas, Anti-Patterns) into a single 'Constraints & Pitfalls' section to reduce redundancy and improve conciseness.

Make the workflow steps more actionable by providing concrete command examples or templates for abstract steps like 'Define parser contract, guard command, held-out checks, noise_runs, aggregation, min_delta, and confirmation rule.'

Add brief one-line descriptions to each progressive disclosure reference explaining when/why to consult it (e.g., 'references/contract.yaml — machine-readable version of the experiment contract; read when automating or auditing runs').
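As an illustration of the second suggestion, here is a minimal sketch of what a concrete template for the quoted terms (noise_runs, aggregation, min_delta, guard command) might look like. It is not taken from the skill itself: the `ExperimentContract` class, the `decide` function, the choice of the median as the aggregation, and the mapping of a failed guard to "rollback" and of too few runs to "block" are all assumptions made for this example.

```python
from dataclasses import dataclass
from statistics import median
from typing import Sequence


@dataclass(frozen=True)
class ExperimentContract:
    """Hypothetical contract; field names echo terms quoted in the review."""
    noise_runs: int                # repeated measurements required per variant
    min_delta: float               # smallest improvement considered worth keeping
    higher_is_better: bool = True  # direction of the tracked metric


def decide(
    contract: ExperimentContract,
    baseline_runs: Sequence[float],
    candidate_runs: Sequence[float],
    guard_passed: bool,
) -> str:
    """Return 'keep', 'discard', 'rollback', or 'block' for one iteration.

    Assumed semantics (not confirmed by the skill):
    - a failed regression guard means the patch is rolled back;
    - too few measurements means there is not enough evidence, so block.
    """
    if not guard_passed:
        return "rollback"
    if min(len(baseline_runs), len(candidate_runs)) < contract.noise_runs:
        return "block"

    # Aggregate repeated runs with the median to dampen measurement noise.
    delta = median(candidate_runs) - median(baseline_runs)
    if not contract.higher_is_better:
        delta = -delta

    return "keep" if delta >= contract.min_delta else "discard"


if __name__ == "__main__":
    contract = ExperimentContract(noise_runs=3, min_delta=0.5)
    print(decide(contract, [10.0, 10.2, 9.9], [11.0, 10.9, 11.2], guard_passed=True))
    # prints "keep"
```

A template along these lines would give an agent something to fill in and execute, rather than a directive to interpret. The real skill may define its decision rule differently.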

Dimension	Reasoning	Score
Conciseness	Generally efficient and avoids explaining basic concepts, but some sections are somewhat verbose or redundant (e.g., 'Avoid', 'Anti-Patterns', and 'Gotchas' overlap in intent; the Discovery Interview section adds little actionable value). Several bullet points could be tightened.	2 / 3
Actionability	Provides concrete examples (ledger YAML, shell iteration example, decision criteria), but much of the guidance remains at the policy/principle level rather than executable steps. The workflow steps mix concrete commands with abstract directives like 'Define parser contract, guard command, held-out checks' without showing how.	2 / 3
Workflow Clarity	The 8-step workflow is clearly sequenced with explicit validation checkpoints (baseline first, verify + guard at each iteration, fail-fast at first failed gate, repair before proceeding). Feedback loops are well-defined: keep/discard/crash/block decisions with evidence, and the repair behavior section adds error recovery guidance.	3 / 3
Progressive Disclosure	The Progressive Disclosure section at the end references several external files (references/autoresearch-project.md, references/contract.yaml, references/evals.yaml, references/task-profile.json, references/discovery-interview.md), which is good structure. However, no bundle files are provided to verify these exist, and the main body still contains substantial inline detail that could be offloaded. The references are listed but not clearly signaled with context about when to consult each one.	2 / 3
	Total	9 / 12 Passed

Validation

100%


Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation — 11 / 11 Passed

Validation for skill structure

No warnings or errors.

Repository: jscraik/Agent-Skills
Commit: 4c78f98

Reviewed: about 22 hours ago
