Agent skill for benchmark-suite - invoke with $agent-benchmark-suite
Impact — 89%
2.17x average score across 3 eval scenarios · Passed · No known issues
Optimize this skill with Tessl:
npx tessl skill review --optimize ./.agents/skills/agent-benchmark-suite/SKILL.md

Quality
Discovery — 0%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is an extremely minimal description that fails on all dimensions. It provides no information about what the skill does, when it should be used, or what distinguishes it from other skills. It reads more like an internal label than a functional description.
Suggestions
- Describe the concrete actions this skill performs (e.g., 'Runs performance benchmarks, generates comparison reports, tracks regression metrics').
- Add an explicit 'Use when...' clause with natural trigger terms (e.g., 'Use when the user asks to run benchmarks, measure performance, compare test results, or evaluate system throughput').
- Specify the domain or type of benchmarks to distinguish this from other testing or evaluation skills (e.g., 'for code performance benchmarks' vs 'for ML model evaluation').
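Applied together, these suggestions might yield frontmatter along the following lines. This is a hypothetical sketch — the action list, trigger terms, and benchmark domain are assumptions, since the skill's actual scope is not stated anywhere in its current description:

```yaml
# Hypothetical SKILL.md frontmatter; wording is illustrative, not confirmed behavior.
name: agent-benchmark-suite
description: >
  Runs code performance benchmarks, generates comparison reports, and
  tracks regression metrics against a stored baseline. Use when the user
  asks to run benchmarks, measure performance, compare test results, or
  evaluate system throughput. For code performance benchmarks, not ML
  model evaluation.
```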
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description contains no concrete actions whatsoever. 'Agent skill for benchmark-suite' is entirely vague and does not describe what the skill actually does. | 1 / 3 |
| Completeness | Neither 'what does this do' nor 'when should Claude use it' is answered. The description only states it's an 'agent skill' and how to invoke it, providing no functional or contextual information. | 1 / 3 |
| Trigger Term Quality | The only keyword is 'benchmark-suite', which is a technical/internal name rather than a natural term a user would say. There are no natural language trigger terms like 'run benchmarks', 'performance testing', etc. | 1 / 3 |
| Distinctiveness / Conflict Risk | The description is so generic that it provides no distinguishing characteristics. 'Agent skill for benchmark-suite' could overlap with any benchmarking, testing, or performance-related skill without clear differentiation. | 1 / 3 |
| Total | | 4 / 12 Passed |
Implementation — 0%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill is an architectural design document masquerading as an actionable skill. It presents hundreds of lines of non-executable pseudocode defining hypothetical class hierarchies and API calls that cannot be used as-is. It lacks any concrete workflow, validation steps, or progressive disclosure, making it essentially unusable as guidance for Claude.
Suggestions
- Replace pseudocode class hierarchies with a concrete, executable example showing how to actually run a benchmark (e.g., a real script using real tools like autocannon, k6, or built-in Node.js perf_hooks).
- Add a clear step-by-step workflow: 1) Configure benchmark, 2) Run benchmark, 3) Validate results against thresholds, 4) Compare with baseline, with explicit validation checkpoints at each step.
- Cut content by 80%+ — remove the illustrative class definitions and focus on the actual CLI commands and their expected inputs/outputs, with one concrete end-to-end example.
- Split advanced topics (regression detection algorithms, scalability analysis) into separate referenced files, keeping SKILL.md as a concise overview with quick-start instructions.
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | Extremely verbose with ~500+ lines of non-executable pseudocode. Defines entire class hierarchies (ComprehensiveBenchmarkSuite, RegressionDetector, AutomatedPerformanceTester, PerformanceValidator) that are illustrative rather than functional. Explains concepts like warmup phases, CUSUM algorithms, and load testing patterns that Claude already understands. The 'Agent Profile' section with bullet points about specialization adds zero actionable value. | 1 / 3 |
| Actionability | Despite the massive amount of code, none of it is executable — classes reference undefined constructors (new ThroughputBenchmark(), new MLRegressionDetector(), etc.), MCP calls use hypothetical APIs (mcp.benchmark_run, mcp.metrics_collect), and the CLI commands (npx claude-flow benchmark-run) are unverifiable. This is architectural pseudocode dressed up as implementation, not copy-paste ready guidance. | 1 / 3 |
| Workflow Clarity | There is no clear step-by-step workflow for actually running benchmarks. The content describes what classes and methods would do conceptually but never provides a concrete sequence like 'first do X, then validate Y, then proceed to Z.' No validation checkpoints or error recovery steps are defined for the user/agent to follow. | 1 / 3 |
| Progressive Disclosure | Monolithic wall of code with no references to external files and no meaningful content hierarchy. All sections dump large code blocks inline without any navigation structure. The document is ~500 lines with no indication of what's essential vs. advanced, and no links to separate reference materials. | 1 / 3 |
| Total | | 4 / 12 Passed |
Validation — 90%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| skill_md_line_count | SKILL.md is long (670 lines); consider splitting into references/ and linking | Warning |
| Total | 10 / 11 Passed | |
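One way to resolve this warning, following the split suggested above, is a layout along these lines (file names are hypothetical; the advanced topics listed are the ones the review flags for extraction):

```
.agents/skills/agent-benchmark-suite/
├── SKILL.md                          # concise overview + quick-start, linking below
└── references/
    ├── regression-detection.md       # regression detection algorithms
    └── scalability-analysis.md       # scalability analysis
```

SKILL.md would then stay well under the line-count threshold while pointing agents to the reference files only when those topics are needed.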