autoresearch

Run bounded automated experiment iterations by recording baselines, applying hypothesis patches, comparing metrics, protecting regression guards, and deciding keep, discard, rollback, or block. Use when $autoresearch is named or a repo/skill needs evidence-backed research, metric tracking, or safe optimisation loops.

69

Quality

85%

Does it follow best practices?

Impact

No eval scenarios have been run

Security by Snyk

Passed

No known issues


Quality

Content

85%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a strong, well-structured skill that provides clear executable guidance for running bounded experiment loops. Its main strengths are the explicit workflow with validation checkpoints, concrete ledger and iteration examples, and well-organized progressive disclosure. The primary weakness is moderate verbosity with some redundancy across sections (Avoid/Anti-Patterns, Constraints/Decision Language/Gotchas) that could be consolidated to save tokens.

Suggestions

Consolidate overlapping sections: merge 'Avoid' into 'Anti-Patterns', and deduplicate the guard/regression rules that are repeated across 'Decision Language', 'Constraints', and 'Gotchas'.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The skill is fairly dense and avoids explaining basic concepts, but some sections are verbose or redundant (e.g., 'Decision Language' overlaps with 'Constraints' and 'Gotchas'; the 'Avoid' and 'Anti-Patterns' sections partially overlap). Several bullet points could be tightened without losing meaning. | 2 / 3 |
| Actionability | The skill provides concrete executable examples (shell commands, YAML ledger entries, specific verify/guard commands), a clear iteration example with real command output, and specific decision criteria (baseline - candidate >= min_delta). The ledger entry template and iteration example are copy-paste ready. | 3 / 3 |
| Workflow Clarity | The workflow is clearly sequenced (9 numbered steps) with explicit validation checkpoints (baseline first, verify then guard, keep/discard/crash/block with evidence). Feedback loops are well-defined: fail fast at first failed gate, repair smallest failing unit, rerun before proceeding. The decision criteria include explicit guard regression checks. | 3 / 3 |
| Progressive Disclosure | The skill ends with a clear 'Progressive Disclosure' section pointing to one-level-deep references (references/autoresearch-project.md, references/contract.yaml, references/evals.yaml, references/task-profile.json). The main body serves as an effective overview, and the discovery interview section also references references/discovery-interview.md. Navigation is well-signaled. | 3 / 3 |
| Total | | 11 / 12 |

Passed
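
The decision criterion quoted under Actionability (baseline - candidate >= min_delta), combined with the verify-then-guard ordering noted under Workflow Clarity, can be illustrated with a minimal Python sketch. This is not taken from the skill itself: the `IterationResult` type, the `decide` function, and the mapping of a failed verify step to "crash" and a failed guard to "discard" are assumptions made for illustration, as is the reading that a lower metric is better.

```python
from dataclasses import dataclass


@dataclass
class IterationResult:
    """Hypothetical record of one experiment iteration (names are illustrative)."""
    baseline: float   # metric recorded before the patch (assumed: lower is better)
    candidate: float  # metric measured after applying the hypothesis patch
    verify_ok: bool   # whether the verify command succeeded
    guard_ok: bool    # whether every regression guard passed


def decide(result: IterationResult, min_delta: float) -> str:
    """Return a decision label, failing fast at the first failed gate."""
    if not result.verify_ok:
        # Assumption: a failed verify run is recorded as a crash.
        return "crash"
    if not result.guard_ok:
        # Assumption: a guard regression discards the patch regardless of the metric.
        return "discard"
    # Criterion quoted in the review: baseline - candidate >= min_delta.
    if result.baseline - result.candidate >= min_delta:
        return "keep"
    return "discard"
```

Under these assumptions, an improvement from 10.0 to 8.0 with `min_delta` of 1.0 is kept only if both gates passed; the same improvement with a failed guard is discarded.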

Description

85%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is a strong description that clearly articulates specific capabilities and includes an explicit 'Use when' clause with a distinctive trigger term ($autoresearch). The main weakness is that some trigger terms are jargon-heavy and may not match how users naturally phrase requests for experimentation or optimization workflows. The description is concise and uses proper third-person voice.

Suggestions

Add more natural user-facing trigger terms like 'experiment', 'A/B test', 'benchmark comparison', or 'iterate on performance' to improve discoverability for users who don't know the $autoresearch keyword.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Lists multiple specific concrete actions: 'recording baselines', 'applying hypothesis patches', 'comparing metrics', 'protecting regression guards', and 'deciding keep, discard, rollback, or block'. These are detailed, actionable steps. | 3 / 3 |
| Completeness | Clearly answers both 'what' (run bounded automated experiment iterations with specific steps) and 'when' (explicit 'Use when' clause covering '$autoresearch', evidence-backed research, metric tracking, or safe optimisation loops). | 3 / 3 |
| Trigger Term Quality | Includes some useful trigger terms like '$autoresearch', 'metric tracking', 'evidence-backed research', and 'optimisation loops', but many terms are specialized jargon. A user might naturally say 'experiment', 'A/B test', 'benchmark', or 'iterate' which are missing. The '$autoresearch' keyword is a good explicit trigger but not a natural user term. | 2 / 3 |
| Distinctiveness Conflict Risk | The combination of automated experimentation, hypothesis patches, regression guards, and the specific '$autoresearch' trigger creates a very distinct niche. This is unlikely to conflict with general coding, testing, or research skills. | 3 / 3 |
| Total | | 11 / 12 |

Passed

Validation

90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 10 / 11 Passed

Validation for skill structure

| Criteria | Description | Result |
| --- | --- | --- |
| metadata_field | 'metadata' should map string keys to string values | Warning |
| Total | | 10 / 11 |

Passed
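
The single warning above comes from the metadata_field check, which expects 'metadata' to map string keys to string values. The following Python sketch shows what such a check might look like; it is not the registry's actual validator, and the function name `metadata_warnings` and the message wording are hypothetical.

```python
def metadata_warnings(metadata: object) -> list[str]:
    """Return one warning per entry that is not a string-to-string pair.

    Illustrative only: mirrors the rule "'metadata' should map string keys
    to string values" as stated in the validation table.
    """
    if not isinstance(metadata, dict):
        return ["metadata is not a mapping"]

    warnings = []
    for key, value in metadata.items():
        if not isinstance(key, str):
            warnings.append(f"key {key!r} is not a string")
        elif not isinstance(value, str):
            warnings.append(f"value for {key!r} is not a string")
    return warnings
```

With this sketch, a parsed value such as a number or a nested list under 'metadata' would produce a warning, and quoting the value so that it parses as a string would clear it.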

Repository
jscraik/Agent-Skills
Reviewed

