Agent skill for benchmark-suite - invoke with $agent-benchmark-suite
Impact — 89%
2.17x average score across 3 eval scenarios · Passed · No known issues
Optimize this skill with Tessl:
npx tessl skill review --optimize ./.agents/skills/agent-benchmark-suite/SKILL.md

Quality
Discovery — 0%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is an extremely minimal description that fails on all dimensions. It provides no information about what the skill does, when it should be used, or what distinguishes it from other skills. It reads more like an internal label than a functional description.
Suggestions
- Describe the concrete actions this skill performs (e.g., 'Runs performance benchmarks, generates comparison reports, tracks regression metrics').
- Add an explicit 'Use when...' clause with natural trigger terms (e.g., 'Use when the user asks to run benchmarks, measure performance, compare test results, or evaluate system throughput').
- Specify the domain or type of benchmarks to distinguish this from other testing or evaluation skills (e.g., 'for code performance benchmarks' vs 'for ML model evaluation').
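Applied together, these suggestions might yield frontmatter along the following lines. This is a hypothetical sketch — the action list, trigger terms, and benchmark domain are assumptions, since the skill's actual scope is not stated anywhere in its current description:

```yaml
# Hypothetical SKILL.md frontmatter; wording is illustrative, not confirmed behavior.
name: agent-benchmark-suite
description: >
  Runs code performance benchmarks, generates comparison reports, and
  tracks regression metrics against a stored baseline. Use when the user
  asks to run benchmarks, measure performance, compare test results, or
  evaluate system throughput. For code performance benchmarks, not ML
  model evaluation.
```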
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description contains no concrete actions whatsoever. 'Agent skill for benchmark-suite' is entirely vague and does not describe what the skill actually does. | 1 / 3 |
| Completeness | Neither 'what does this do' nor 'when should Claude use it' is answered. The description only states it's an 'agent skill' and how to invoke it, providing no functional or contextual information. | 1 / 3 |
| Trigger Term Quality | The only keyword is 'benchmark-suite', which is a technical/internal name rather than a natural term a user would say. There are no natural language trigger terms like 'run benchmarks', 'performance testing', etc. | 1 / 3 |
| Distinctiveness / Conflict Risk | The description is so generic that it provides no distinguishing characteristics. 'Agent skill for benchmark-suite' could overlap with any benchmarking, testing, or performance-related skill without clear differentiation. | 1 / 3 |
| Total | | 4 / 12 Passed |
Implementation — 0%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill is an architectural design document masquerading as an actionable skill. It presents hundreds of lines of non-executable pseudocode defining hypothetical class hierarchies and API calls that cannot be used as-is. It lacks any concrete workflow, validation steps, or progressive disclosure, making it essentially unusable as guidance for Claude.
Suggestions
- Replace pseudocode class hierarchies with a concrete, executable example showing how to actually run a benchmark (e.g., a real script using real tools like autocannon, k6, or built-in Node.js perf_hooks).
- Add a clear step-by-step workflow: 1) Configure benchmark, 2) Run benchmark, 3) Validate results against thresholds, 4) Compare with baseline, with explicit validation checkpoints at each step.
- Cut content by 80%+ — remove the illustrative class definitions and focus on the actual CLI commands and their expected inputs/outputs, with one concrete end-to-end example.
- Split advanced topics (regression detection algorithms, scalability analysis) into separate referenced files, keeping SKILL.md as a concise overview with quick-start instructions.
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | Extremely verbose with ~500+ lines of non-executable pseudocode. Defines entire class hierarchies (ComprehensiveBenchmarkSuite, RegressionDetector, AutomatedPerformanceTester, PerformanceValidator) that are illustrative rather than functional. Explains concepts like warmup phases, CUSUM algorithms, and load testing patterns that Claude already understands. The 'Agent Profile' section with bullet points about specialization adds zero actionable value. | 1 / 3 |
| Actionability | Despite the massive amount of code, none of it is executable — classes reference undefined constructors (new ThroughputBenchmark(), new MLRegressionDetector(), etc.), MCP calls use hypothetical APIs (mcp.benchmark_run, mcp.metrics_collect), and the CLI commands (npx claude-flow benchmark-run) are unverifiable. This is architectural pseudocode dressed up as implementation, not copy-paste ready guidance. | 1 / 3 |
| Workflow Clarity | There is no clear step-by-step workflow for actually running benchmarks. The content describes what classes and methods would do conceptually but never provides a concrete sequence like 'first do X, then validate Y, then proceed to Z.' No validation checkpoints or error recovery steps are defined for the user/agent to follow. | 1 / 3 |
| Progressive Disclosure | Monolithic wall of code with no references to external files and no meaningful content hierarchy. All sections dump large code blocks inline without any navigation structure. The document is ~500 lines with no indication of what's essential vs. advanced, and no links to separate reference materials. | 1 / 3 |
| Total | | 4 / 12 Passed |
Validation — 90%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| skill_md_line_count | SKILL.md is long (670 lines); consider splitting into references/ and linking | Warning |
| Total | 10 / 11 Passed | |
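One way to resolve this warning, following the split suggested above, is a layout along these lines (file names are hypothetical; the advanced topics listed are the ones the review flags for extraction):

```
.agents/skills/agent-benchmark-suite/
├── SKILL.md                          # concise overview + quick-start, linking below
└── references/
    ├── regression-detection.md       # regression detection algorithms
    └── scalability-analysis.md       # scalability analysis
```

SKILL.md would then stay well under the line-count threshold while pointing agents to the reference files only when those topics are needed.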