agent-benchmark-suite

Agent skill for benchmark-suite - invoke with $agent-benchmark-suite

Quality

—

Does it follow best practices?

Impact

89%

2.17x

Average score across 3 eval scenarios

Security

Passed

No known issues

Quality

Content

27%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The skill body is a long, monolithic catalog of non-executable JavaScript class stubs plus a few usable CLI commands, with no real workflow sequencing or validation checkpoints. It is token-heavy for the actionable value it delivers and makes no use of progressive disclosure.

Suggestions

Replace the stub class blocks with lean, executable examples (or a single runnable script in scripts/) and cut restating comments to respect the token budget.

Add an explicit ordered workflow with validation checkpoints for the risky batch operations (e.g., run baseline -> execute suite -> validate against SLA -> only gate deployment on pass).

Move the large benchmark definitions and reference material into references/ files and have SKILL.md point to them one level deep instead of inlining everything.
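
The ordered workflow suggested above (run baseline -> execute suite -> validate against SLA -> gate deployment only on pass) could look roughly like the following. This is a minimal sketch, not code from the skill: `evaluateGate`, `runGatedBenchmark`, the `p95Ms` field, and the `sla` object are all hypothetical names chosen for illustration, and the runners are stubbed rather than wired to any real claude-flow command.

```javascript
// Illustrative sketch only: the function names, the p95Ms field, and the sla
// object are hypothetical and are not part of claude-flow or the reviewed skill.

// Checkpoint logic: decide whether a run may gate a deployment.
function evaluateGate(baseline, result, sla) {
  // Checkpoint 1: a usable baseline must exist before anything is compared.
  if (!baseline || !(baseline.p95Ms > 0)) {
    return { deploy: false, stage: "baseline", reason: "no usable baseline" };
  }
  // Checkpoint 2: the suite must have produced a measurement.
  if (!result || typeof result.p95Ms !== "number") {
    return { deploy: false, stage: "suite", reason: "suite produced no p95Ms" };
  }
  // Checkpoint 3: absolute SLA limit.
  if (result.p95Ms > sla.maxP95Ms) {
    return { deploy: false, stage: "sla", reason: "p95 latency exceeds the SLA limit" };
  }
  // Checkpoint 4: relative regression against the baseline.
  const regression = (result.p95Ms - baseline.p95Ms) / baseline.p95Ms;
  if (regression > sla.maxRegression) {
    return { deploy: false, stage: "regression", reason: "regression budget exceeded" };
  }
  return { deploy: true, stage: "gate", reason: "all checkpoints passed" };
}

// Ordered workflow: baseline -> suite -> validate -> gate.
async function runGatedBenchmark({ runBaseline, runSuite, sla }) {
  const baseline = await runBaseline();
  // Stop before the expensive suite if the baseline is unusable.
  const early = evaluateGate(baseline, null, sla);
  if (early.stage === "baseline") return early;
  const result = await runSuite();
  return evaluateGate(baseline, result, sla);
}

// Usage with stubbed runners (hypothetical numbers).
runGatedBenchmark({
  runBaseline: async () => ({ p95Ms: 100 }),
  runSuite: async () => ({ p95Ms: 130 }),
  sla: { maxP95Ms: 200, maxRegression: 0.1 },
}).then((outcome) => console.log(outcome.stage, outcome.deploy));
```

Keeping the checkpoint logic in a synchronous function separate from the runners makes each failure stage explicit and easy to test without running a real load or stress test.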

Dimension	Reasoning	Score
Conciseness	The body is ~670 lines dominated by large JavaScript class blocks padded with restating comments like '// Advanced benchmarking system' and '// Comprehensive regression detection system', explaining structure Claude already understands. This matches the verbose/padded anchor; it is not level 2 because nearly every section is over-elaborated rather than 'mostly efficient'.	1 / 3
Actionability	Concrete elements exist — executable 'npx claude-flow' commands and real benchmark target definitions — but the JavaScript classes reference undefined collaborators (ThroughputBenchmark, MLRegressionDetector, mcp.*) and are effectively stubs/pseudocode. This fits 'some concrete guidance but incomplete; pseudocode instead of executable code', and falls short of level 3 because the core code is not copy-paste executable.	2 / 3
Workflow Clarity	Content is organized into named capabilities and a commands section, giving a loose sequence, but there is no ordered multi-step workflow and no validation checkpoints for the destructive/batch operations (load/stress tests, regression gating) the rubric flags. It is above level 1 because some structure exists, but capped below 3 by the missing feedback loops.	2 / 3
Progressive Disclosure	No references/, scripts/, or assets/ bundle exists, and the entire body is a single monolithic file with all class definitions inline, matching the 'monolithic wall' anchor. It is not level 2 because nothing is split out or signaled as a separate reference, even though the body runs far beyond 50 lines.	1 / 3
	Total	6 / 12 Passed

Description

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description is a placeholder-style string with no concrete capabilities and no usage triggers, scoring at the floor on every dimension. It reads as scaffolding rather than a meaningful skill description.

Suggestions

Replace the description with concrete actions, e.g. 'Runs performance benchmark suites, detects regressions against baselines, and validates SLA/scalability targets.'

Add an explicit trigger clause such as 'Use when the user asks to benchmark performance, check for regressions, or validate scalability/load-test results.'

Drop the 'invoke with $agent-benchmark-suite' invocation syntax in favor of natural trigger terms users would actually say.
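
The three suggestions above can be approximated as a quick self-check before publishing a description. The sketch below is a rough heuristic under stated assumptions, not the rubric this review applies: `checkDescription` is a hypothetical helper, and the verb list is illustrative only.

```javascript
// Hypothetical heuristic, not the reviewer's actual rubric.
function checkDescription(description) {
  const problems = [];

  // "When": look for an explicit trigger clause.
  if (!/\buse when\b/i.test(description)) {
    problems.push("no 'Use when...' trigger clause");
  }

  // Invocation syntax such as "$agent-benchmark-suite" is not a natural trigger term.
  if (/\$[a-z0-9-]+/i.test(description)) {
    problems.push("contains invocation syntax instead of natural trigger terms");
  }

  // "What": expect at least one concrete action verb (illustrative list only).
  const actionVerbs = ["runs", "detects", "validates", "measures", "compares"];
  const lower = description.toLowerCase();
  if (!actionVerbs.some((verb) => lower.includes(verb))) {
    problems.push("no concrete action verbs");
  }

  return problems;
}

// The current description trips all three checks.
console.log(checkDescription(
  "Agent skill for benchmark-suite - invoke with $agent-benchmark-suite"
));
```

A description that combines the suggested action sentence with the suggested 'Use when...' clause returns an empty list from this check.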

Dimension	Reasoning	Score
Specificity	The phrase 'Agent skill for benchmark-suite' names a domain but lists no concrete actions, matching the vague/abstract anchor ("Helps with documents"). It does not reach level 2 because it specifies no operations like 'detect regressions' or 'validate performance'.	1 / 3
Completeness	It answers neither 'what does this do' (no concrete actions) nor 'when should Claude use it' (no 'Use when...' trigger), satisfying the missing-both anchor. It cannot be level 2 since neither the what nor the when is present.	1 / 3
Trigger Term Quality	The only trigger-like phrase is 'invoke with $agent-benchmark-suite', which is technical invocation syntax rather than natural keywords a user would say; this matches the jargon/overly-generic anchor. It is not level 2 because no common natural variations (e.g., 'performance', 'benchmarks', 'regression') appear.	1 / 3
Distinctiveness / Conflict Risk	'Agent skill for benchmark-suite' is generic and gives no distinct triggers, so it would overlap with other optimization/monitoring skills. It falls short of level 2 because there is no niche-specific language to set it apart.	1 / 3
	Total	4 / 12 Passed

Validation

93%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation — 15 / 16 Passed

Validation for skill structure

Criteria	Description	Result
skill_md_line_count	SKILL.md is long (670 lines); consider splitting into references/ and linking	Warning

	Total	15 / 16 Passed
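
A check like skill_md_line_count is simple to reproduce locally before submitting. The following is a minimal sketch: `checkLineCount` is a hypothetical helper, and the 500-line default is an assumption for illustration, since the validator's actual threshold is not stated on this page.

```javascript
// Hypothetical local check; the 500-line default is an assumed threshold,
// not the validator's documented limit.
function checkLineCount(skillMdText, maxLines = 500) {
  // Ignore a single trailing newline so it is not counted as an extra line.
  const trimmed = skillMdText.replace(/\r?\n$/, "");
  const lineCount = trimmed === "" ? 0 : trimmed.split(/\r?\n/).length;
  return {
    criterion: "skill_md_line_count",
    lineCount,
    result: lineCount > maxLines ? "Warning" : "Passed",
  };
}

// Usage with synthetic text standing in for a 670-line SKILL.md.
console.log(checkLineCount("line\n".repeat(670)));
```

In practice the text would come from reading SKILL.md from disk; it is passed in as a string here so the check stays self-contained.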

Repository: ruvnet/claude-flow
Commit: 03d4f84

Reviewed: 2 days ago
