Agent skill for benchmark-suite - invoke with $agent-benchmark-suite
33
0%
Does it follow best practices?
Impact
89%
2.17x
Average score across 3 eval scenarios
Passed
No known issues
Optimize this skill with Tessl
npx tessl skill review --optimize ./.agents/skills/agent-benchmark-suite/SKILL.md
Quality
Discovery
0%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is an extremely weak description that fails on every dimension. It provides no information about what the skill does, when it should be used, or what distinguishes it from other skills. It reads more like a label than a functional description.
Suggestions
Describe the concrete actions this skill performs (e.g., 'Runs performance benchmarks, compares results across test suites, generates benchmark reports').
Add an explicit 'Use when...' clause with natural trigger terms (e.g., 'Use when the user asks to run benchmarks, measure performance, compare test results, or evaluate system metrics').
Remove the invocation command from the description and replace it with capability-focused language that helps Claude distinguish this skill from others.
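The three suggestions above can be turned into a quick self-check before resubmitting the description. The sketch below is a rough heuristic only, not Tessl's scoring logic: the function name `description_issues`, the regular expressions, and the 12-word threshold are all assumptions made for illustration. The `improved` text is assembled from the example phrases in the suggestions.

```python
import re


def description_issues(description: str) -> list[str]:
    """Return rough, heuristic problems found in a skill description."""
    issues = []
    # A "Use when..." clause tells the agent when to select the skill.
    if not re.search(r"\buse when\b", description, re.IGNORECASE):
        issues.append("missing 'Use when...' clause")
    # Invocation commands such as $agent-benchmark-suite are not trigger terms.
    if re.search(r"\$[\w-]+", description):
        issues.append("contains an invocation command")
    # Very short descriptions rarely name concrete actions (arbitrary threshold).
    if len(description.split()) < 12:
        issues.append("too short to describe concrete actions")
    return issues


current = "Agent skill for benchmark-suite - invoke with $agent-benchmark-suite"
improved = (
    "Runs performance benchmarks, compares results across test suites, and "
    "generates benchmark reports. Use when the user asks to run benchmarks, "
    "measure performance, compare test results, or evaluate system metrics."
)

print(description_issues(current))   # all three heuristics fire
print(description_issues(improved))  # no issues reported
```

A check like this only catches the mechanical problems; whether the description is distinctive enough to avoid conflicts with other skills still needs a human read.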
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description provides no concrete actions whatsoever. 'Agent skill for benchmark-suite' is entirely vague and does not describe what the skill actually does. | 1 / 3 |
| Completeness | Neither 'what does this do' nor 'when should Claude use it' is answered. There is no description of capabilities and no 'Use when...' clause or equivalent guidance. | 1 / 3 |
| Trigger Term Quality | The only keyword is 'benchmark-suite', which is a technical/internal term. There are no natural language terms a user would say when needing this skill. The invocation command '$agent-benchmark-suite' is not a trigger term. | 1 / 3 |
| Distinctiveness Conflict Risk | The description is so vague that it provides no distinguishing characteristics. 'Agent skill for benchmark-suite' could overlap with any benchmarking, testing, or evaluation-related skill. | 1 / 3 |
| Total | 4 / 12 Passed | |
Implementation
0%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill is an extremely verbose, non-executable specification document masquerading as an actionable skill. It defines elaborate class hierarchies with undefined dependencies, provides no working code, lacks any clear workflow or validation steps, and dumps everything into a single monolithic file. It reads more like an aspirational architecture document than practical guidance Claude could follow.
Suggestions
Replace the pseudo-class definitions with actual executable code or concrete CLI command sequences that Claude can run, including expected outputs and error handling.
Add a clear step-by-step workflow (e.g., 1. Run baseline benchmark → 2. Make changes → 3. Run comparison benchmark → 4. Validate no regressions) with explicit validation checkpoints.
Reduce content by 80%+ — remove class boilerplate, agent profile metadata, and architectural abstractions; focus on the 5-10 concrete commands/actions Claude needs to perform benchmarking.
Split detailed benchmark definitions and validation criteria into separate reference files, keeping SKILL.md as a concise overview with clear navigation links.
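The baseline, change, compare, validate workflow suggested above could be as small as the following sketch. It is an illustration under stated assumptions, not the skill's actual code: `measure`, `is_regression`, the 10% tolerance, and the placeholder workload are hypothetical names and values chosen for the example.

```python
import statistics
import time
from typing import Callable


def measure(fn: Callable[[], object], runs: int = 5) -> float:
    """Return the median wall-clock time in seconds over `runs` calls of fn."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def is_regression(baseline_s: float, candidate_s: float, tolerance: float = 0.10) -> bool:
    """True if the candidate is slower than the baseline by more than `tolerance`."""
    return candidate_s > baseline_s * (1.0 + tolerance)


def workload() -> int:
    # Placeholder workload (assumption); replace with the code under test.
    return sum(range(100_000))


# 1. Run the baseline benchmark.
baseline = measure(workload)
# 2. Make changes to the code under test (omitted in this sketch).
# 3. Run the comparison benchmark.
candidate = measure(workload)
# 4. Validate that no regression was introduced.
print("regression" if is_regression(baseline, candidate) else "ok")
```

Using the median rather than the mean keeps a single slow run from triggering a false regression. The tolerance would need tuning for the noise level of the real environment.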
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | Extremely verbose at ~500+ lines. Massive amounts of non-executable pseudocode defining class structures Claude already understands conceptually. Explains obvious patterns (warmup/cooldown, ramp-up/ramp-down), repeats structural boilerplate extensively, and includes an 'Agent Profile' section that adds no actionable value. | 1 / 3 |
| Actionability | Despite the volume of code, none of it is executable — all classes reference undefined constructors (ThroughputBenchmark, LatencyBenchmark, etc.), making everything pseudocode dressed as real code. The CLI commands (npx claude-flow benchmark-run) are plausible but unverifiable with no bundle files. There's no concrete, copy-paste-ready example that would actually run. | 1 / 3 |
| Workflow Clarity | There is no clear step-by-step workflow for how to actually perform benchmarking. The code shows class methods but never sequences them into a usable workflow with validation checkpoints. The operational commands section lists commands but doesn't explain when to use each or how they connect. No feedback loops or error recovery guidance. | 1 / 3 |
| Progressive Disclosure | Monolithic wall of text with no references to external files and no bundle files to support it. All content is inline with no logical separation. The massive code blocks for regression detection, performance validation, and automated testing should be in separate reference files, with the SKILL.md providing a concise overview and navigation. | 1 / 3 |
| Total | 4 / 12 Passed | |
Validation
90%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| skill_md_line_count | SKILL.md is long (670 lines); consider splitting into references/ and linking | Warning |
| Total | 10 / 11 Passed | |
d29d87f