
agent-benchmark-suite

Agent skill for benchmark-suite - invoke with $agent-benchmark-suite

41

2.17x
Quality

Does it follow best practices?

Impact

89%

2.17x

Average score across 3 eval scenarios

Security by Snyk

Passed

No known issues


Quality

Content

27%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The skill body is a long, monolithic catalog of non-executable JavaScript class stubs plus a few usable CLI commands, with no real workflow sequencing or validation checkpoints. It is token-heavy for the actionable value it delivers and makes no use of progressive disclosure.

Suggestions

Replace the stub class blocks with lean, executable examples (or a single runnable script in scripts/) and cut restating comments to respect the token budget.

Add an explicit ordered workflow with validation checkpoints for the risky batch operations (e.g., run baseline -> execute suite -> validate against SLA -> only gate deployment on pass).

Move the large benchmark definitions and reference material into references/ files and have SKILL.md point to them one level deep instead of inlining everything.
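The ordered workflow suggested above could look roughly like the following sketch. It is illustrative only: `runGatedBenchmark`, the injected `runBaseline` and `executeSuite` callbacks, and the `sla` fields (`maxP95LatencyMs`, `maxRegressionRatio`) are hypothetical names, not part of the skill or of claude-flow. The point is that each step has a checkpoint and the deployment decision is only reached when every earlier step passes.

```javascript
// Hypothetical sketch: none of these names come from the skill or from claude-flow.
// Steps are injected as callbacks so the ordering and checkpoints are easy to see.
function runGatedBenchmark({ runBaseline, executeSuite, sla }) {
  const completed = [];

  // Step 1: capture a baseline before any heavy run. Stop if it is unusable.
  const baseline = runBaseline();
  if (!baseline || typeof baseline.p95LatencyMs !== "number") {
    return { deploy: false, stoppedAt: "baseline", completed };
  }
  completed.push("baseline");

  // Step 2: execute the suite. Stop if it produced no usable measurement.
  const result = executeSuite();
  if (!result || typeof result.p95LatencyMs !== "number") {
    return { deploy: false, stoppedAt: "suite", completed };
  }
  completed.push("suite");

  // Step 3: validate against the SLA and against the baseline.
  const withinSla = result.p95LatencyMs <= sla.maxP95LatencyMs;
  const regressed =
    result.p95LatencyMs > baseline.p95LatencyMs * (1 + sla.maxRegressionRatio);
  if (!withinSla || regressed) {
    return { deploy: false, stoppedAt: "validate", completed };
  }
  completed.push("validate");

  // Step 4: every checkpoint passed, so the deployment gate opens.
  return { deploy: true, stoppedAt: null, completed };
}

// Example usage with made-up numbers.
const outcome = runGatedBenchmark({
  runBaseline: () => ({ p95LatencyMs: 100 }),
  executeSuite: () => ({ p95LatencyMs: 105 }),
  sla: { maxP95LatencyMs: 200, maxRegressionRatio: 0.1 },
});
console.log(outcome.deploy, outcome.completed.join(" -> "));
```

Returning the step at which the run stopped, rather than throwing, gives an agent a concrete point to retry from instead of restarting the whole batch.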

Dimension | Reasoning | Score

Conciseness

The body is ~670 lines dominated by large JavaScript class blocks padded with restating comments like '// Advanced benchmarking system' and '// Comprehensive regression detection system', explaining structure Claude already understands. This matches the verbose/padded anchor; it is not level 2 because nearly every section is over-elaborated rather than 'mostly efficient'.

1 / 3

Actionability

Concrete elements exist — executable 'npx claude-flow' commands and real benchmark target definitions — but the JavaScript classes reference undefined collaborators (ThroughputBenchmark, MLRegressionDetector, mcp.*) and are effectively stubs/pseudocode. This fits 'some concrete guidance but incomplete; pseudocode instead of executable code', and falls short of level 3 because the core code is not copy-paste executable.

2 / 3

Workflow Clarity

Content is organized into named capabilities and a commands section, giving a loose sequence, but there is no ordered multi-step workflow and no validation checkpoints for the destructive/batch operations (load/stress tests, regression gating) the rubric flags. It is above level 1 because some structure exists, but capped below 3 by the missing feedback loops.

2 / 3

Progressive Disclosure

No references/, scripts/, or assets/ bundle exists and the entire body is a single monolithic file with all class definitions inline, matching the 'monolithic wall' anchor. It is not level 2 because nothing is split out or signaled as a separate reference, even though the file far exceeds 50 lines.

1 / 3

Total: 6 / 12

Passed

Description

0%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description is a placeholder-style string with no concrete capabilities and no usage triggers, scoring at the floor on every dimension. It reads as scaffolding rather than a meaningful skill description.

Suggestions

Replace the description with concrete actions, e.g. 'Runs performance benchmark suites, detects regressions against baselines, and validates SLA/scalability targets.'

Add an explicit trigger clause such as 'Use when the user asks to benchmark performance, check for regressions, or validate scalability/load-test results.'

Drop the 'invoke with $agent-benchmark-suite' invocation syntax in favor of natural trigger terms users would actually say.
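As a rough illustration of the three suggestions above, the sketch below checks a description string for a 'Use when' trigger clause, for invocation syntax such as `$agent-benchmark-suite`, and for a few natural trigger terms. It is an assumption-laden heuristic, not the scoring logic behind this review: `checkDescription` and the `naturalTerms` list are hypothetical and chosen only to mirror the examples given in the suggestions.

```javascript
// Hypothetical heuristic for illustration only; this is not how the review is scored.
function checkDescription(description) {
  const text = description.toLowerCase();

  // "When should the agent use it?": look for an explicit trigger clause.
  const hasTrigger = /\buse when\b/.test(text);

  // Invocation syntax like "$agent-benchmark-suite" is not something users say.
  const usesInvocationSyntax = /\$[a-z0-9-]+/.test(text);

  // Assumed list of natural terms, taken from the examples in the suggestions.
  const naturalTerms = ["performance", "regression", "scalability", "load-test"];
  const matchedTerms = naturalTerms.filter((term) => text.includes(term));

  return { hasTrigger, usesInvocationSyntax, matchedTerms };
}

// The current description versus the two suggested sentences combined.
const current = "Agent skill for benchmark-suite - invoke with $agent-benchmark-suite";
const suggested =
  "Runs performance benchmark suites, detects regressions against baselines, " +
  "and validates SLA/scalability targets. Use when the user asks to benchmark " +
  "performance, check for regressions, or validate scalability/load-test results.";

console.log(checkDescription(current));
console.log(checkDescription(suggested));
```

Under this heuristic the current description has no trigger clause and no natural terms, while the suggested wording has both and drops the invocation syntax.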

Dimension | Reasoning | Score

Specificity

The phrase 'Agent skill for benchmark-suite' names a domain but lists no concrete actions, matching the vague/abstract anchor ("Helps with documents"). It does not reach level 2 because it specifies no operations like 'detect regressions' or 'validate performance'.

1 / 3

Completeness

It answers neither 'what does this do' (no concrete actions) nor 'when should Claude use it' (no 'Use when...' trigger), satisfying the missing-both anchor. It cannot be level 2 since neither the what nor the when is present.

1 / 3

Trigger Term Quality

The only trigger-like phrase is 'invoke with $agent-benchmark-suite', which is technical invocation syntax rather than natural keywords a user would say; this matches the jargon/overly-generic anchor. It is not level 2 because no common natural variations (e.g., 'performance', 'benchmarks', 'regression') appear.

1 / 3

Distinctiveness Conflict Risk

'Agent skill for benchmark-suite' is generic and gives no distinct triggers, so it would overlap with other optimization/monitoring skills. It falls short of level 2 because there is no niche-specific language to set it apart.

1 / 3

Total: 4 / 12

Passed

Validation

93%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 15 / 16 Passed

Validation for skill structure

Criteria | Description | Result

skill_md_line_count

SKILL.md is long (670 lines); consider splitting into references/ and linking

Warning

Total: 15 / 16

Passed

Repository
ruvnet/claude-flow
Reviewed
