
agent-benchmark-suite

Agent skill for benchmark-suite - invoke with $agent-benchmark-suite

Quality: 0% (does it follow best practices?)

Impact: 89% (2.17x average score across 3 eval scenarios)

Security by Snyk: Passed (no known issues)

Optimize this skill with Tessl

npx tessl skill review --optimize ./.agents/skills/agent-benchmark-suite/SKILL.md

Quality

Discovery

0%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is an extremely minimal description that fails on all dimensions. It provides no information about what the skill does, when it should be used, or what triggers should activate it. It reads more like an internal label than a functional description that Claude could use for skill selection.

Suggestions

Describe the concrete actions the skill performs (e.g., 'Runs performance benchmarks, generates comparison reports, tracks regression metrics').

Add an explicit 'Use when...' clause with natural trigger terms users would say (e.g., 'Use when the user asks to run benchmarks, measure performance, compare test results, or check for regressions').

Specify the domain or technology scope to make the skill distinguishable from other potential testing or analysis skills.
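Applied together, these suggestions might yield a frontmatter description along these lines (a hypothetical sketch, not the maintainer's wording; the exact frontmatter fields depend on the agent-skill convention in use):

```markdown
---
name: agent-benchmark-suite
description: >
  Runs performance benchmarks for claude-flow projects, generates
  comparison reports, and tracks regression metrics. Use when the user
  asks to run benchmarks, measure performance, compare test results,
  or check for regressions.
---
```

A description in this shape answers both "what does this do" and "when should Claude use it", and carries natural trigger terms a user would actually say.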

Dimension scores

Specificity: 1 / 3
The description contains no concrete actions whatsoever. 'Agent skill for benchmark-suite' is entirely vague and does not describe what the skill actually does.

Completeness: 1 / 3
Neither 'what does this do' nor 'when should Claude use it' is answered. The description only states it's an 'agent skill' and how to invoke it, providing no functional or contextual information.

Trigger Term Quality: 1 / 3
The only keyword is 'benchmark-suite', which is a technical/internal name rather than a natural term a user would say. There are no natural language trigger terms like 'run benchmarks', 'performance testing', etc.

Distinctiveness / Conflict Risk: 1 / 3
The description is so generic that it provides almost no distinguishing information. 'Agent skill for benchmark-suite' could overlap with any benchmarking, testing, or performance-related skill without clear differentiation.

Total: 4 / 12


Implementation

0%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill is an architectural design document masquerading as an actionable skill. It presents hundreds of lines of non-executable pseudocode defining hypothetical class hierarchies and MCP integrations without any concrete, runnable instructions. The content fails on all dimensions: it's extremely verbose, entirely non-actionable, lacks workflow clarity, and has no progressive disclosure structure.

Suggestions

Replace pseudocode class hierarchies with actual executable commands or scripts that Claude can run to perform benchmarking (e.g., real CLI commands with expected outputs).

Add a clear step-by-step workflow: 1) Configure benchmark, 2) Run benchmark, 3) Validate results, 4) Compare with baseline—with explicit validation checkpoints and error handling.

Cut content by 80%+ by removing illustrative class definitions and focusing only on what Claude needs to know that it doesn't already (specific tool APIs, project-specific configurations, concrete thresholds).

Split into a concise SKILL.md overview with references to separate files for benchmark definitions, regression detection configuration, and CLI reference.
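For illustration, a condensed workflow section along the lines of these suggestions might read as follows (step names and file references are placeholders, not verified claude-flow commands):

```markdown
## Workflow

1. **Configure** the benchmark: pick the target suite and baseline
   (e.g. in a hypothetical `benchmark.config.json`).
2. **Run** the benchmark command and capture its output.
3. **Validate** results: confirm every scenario completed; if one
   failed, re-run it once before reporting an error.
4. **Compare** against the stored baseline and flag any metric that
   regressed beyond the configured threshold.
```

A sequence like this gives the agent explicit checkpoints and an error-recovery path, which the current pseudocode never provides.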

Dimension scores

Conciseness: 1 / 3
Extremely verbose with ~500+ lines of non-executable pseudocode. Defines entire class hierarchies (ComprehensiveBenchmarkSuite, RegressionDetector, AutomatedPerformanceTester, PerformanceValidator) that are illustrative rather than functional. Explains concepts like warmup phases, CUSUM algorithms, and load testing patterns that Claude already understands. The 'Agent Profile' section with bullet points about specialization adds zero actionable value.

Actionability: 1 / 3
Despite the massive amount of code, none of it is executable: classes reference undefined constructors (new ThroughputBenchmark(), new MLRegressionDetector(), etc.), MCP calls use hypothetical APIs (mcp.benchmark_run, mcp.metrics_collect), and the CLI commands (npx claude-flow benchmark-run) are unverifiable. This is architectural pseudocode dressed up as implementation, not copy-paste ready guidance.

Workflow Clarity: 1 / 3
There is no clear step-by-step workflow for actually running benchmarks. The content describes what classes and methods would do conceptually but never provides a concrete sequence like 'first do X, then validate Y, then proceed to Z.' No validation checkpoints or error recovery steps are defined for the user/agent to follow.

Progressive Disclosure: 1 / 3
Monolithic wall of code with no references to external files and no meaningful content hierarchy. All sections dump large code blocks inline without any navigation structure. The document is ~500 lines with no indication of what's essential vs. advanced, and no links to separate reference materials.

Total: 4 / 12


Validation

90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 10 / 11 checks passed

Validation for skill structure

skill_md_line_count: Warning
SKILL.md is long (670 lines); consider splitting into references/ and linking.

Total: 10 / 11 (Passed)
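One way to act on this warning is the file split already suggested under Implementation. As a sketch (file names illustrative, not the repository's actual layout):

```
.agents/skills/agent-benchmark-suite/
├── SKILL.md          # concise overview and step-by-step workflow
└── references/
    ├── benchmarks.md # benchmark definitions
    ├── regression.md # regression detection configuration
    └── cli.md        # CLI reference
```

Keeping SKILL.md short and linking out to reference files would clear the line-count warning and also address the progressive-disclosure score.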

Repository: ruvnet/claude-flow (Reviewed)

