
tdg-personal/benchmark

Use this skill to measure performance baselines, detect regressions before/after PRs, and compare stack alternatives.

Quality: 53% (Does it follow best practices?)

Impact: Pending (No eval scenarios have been run)

Security (by Snyk): Passed (No known issues)


Quality

Discovery: 50%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description conveys a reasonable sense of purpose around performance measurement and regression detection, but lacks concrete specificity about what tools or operations are involved. It also lacks an explicit 'Use when...' clause with natural trigger terms, which limits Claude's ability to reliably select this skill from a large pool. The domain is somewhat distinct but could benefit from sharper differentiation.

Suggestions

- Add an explicit 'Use when...' clause with natural trigger terms like 'benchmark', 'profiling', 'load test', 'latency', 'throughput', or 'performance testing'.
- List more concrete actions such as 'run benchmarks', 'generate performance reports', 'compare response times across implementations' to increase specificity.
- Clarify the technology scope (e.g., web APIs, database queries, frontend rendering) to reduce overlap with generic testing or CI/CD skills.
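
Taken together, the suggestions above point toward a description like the following sketch. The wording and frontmatter keys are illustrative, not the skill's actual metadata:

```yaml
# Hypothetical SKILL.md frontmatter -- wording is an illustration only
name: tdg-personal/benchmark
description: >
  Run benchmarks, measure performance baselines, and generate performance
  reports for web APIs and build pipelines. Compare response times, latency,
  and throughput across stack alternatives, and detect regressions
  before/after PRs. Use when the user asks about benchmarking, profiling,
  load testing, latency, throughput, or performance testing.
```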

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Names some actions ('measure performance baselines', 'detect regressions', 'compare stack alternatives') but they remain somewhat abstract—no concrete operations like 'run benchmarks', 'generate flame graphs', or 'profile memory usage' are mentioned. | 2 / 3 |
| Completeness | It answers 'what' (measure baselines, detect regressions, compare stacks) reasonably well, but the 'when' guidance is only implied through the 'Use this skill to...' framing rather than providing explicit trigger conditions like 'Use when the user asks about benchmarking or performance testing'. | 2 / 3 |
| Trigger Term Quality | Includes some relevant terms like 'performance baselines', 'regressions', 'PRs', and 'stack alternatives', but misses common natural-language triggers users would say such as 'benchmark', 'profiling', 'load test', 'latency', 'throughput', or 'speed'. | 2 / 3 |
| Distinctiveness / Conflict Risk | The performance/regression/PR focus gives it some distinctiveness, but terms like 'detect regressions' and 'compare stack alternatives' are broad enough to overlap with CI/CD skills, testing skills, or general code review skills. | 2 / 3 |
| Total | | 8 / 12 |

Passed

Implementation: 35%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill reads more like a feature specification or product description than an actionable skill for Claude. It describes what should be measured across four modes but provides no executable code, no concrete tool usage instructions, and no actual implementation details. The Before/After comparison mode is the most actionable section but still relies on undefined commands.

Suggestions

- Add executable code or concrete tool-usage instructions for at least one mode (e.g., actual JavaScript for measuring Core Web Vitals via browser MCP, or actual shell commands for build performance measurement).
- Define what '/benchmark baseline' and '/benchmark compare' actually do — provide the implementation or reference a script that Claude should create/use.
- Add validation steps: what to do when metrics are outside expected ranges, how to verify measurements are reliable (e.g., run multiple iterations, discard outliers).
- Remove explanations of well-known acronyms (LCP, CLS, FCP, TTFB) and instead focus on project-specific thresholds and how to act on results.
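
The iteration-and-outlier suggestion can be made concrete with a small runner. Below is a minimal sketch of what a '/benchmark baseline' step could execute; the workload command, iteration count, and 1.5 IQR cutoff are illustrative choices, not the skill's actual implementation:

```python
# Hypothetical benchmark runner sketch: time a command several times,
# discard outliers, and report a stable summary.
import json
import statistics
import subprocess
import sys
import time


def time_command(cmd, iterations=7):
    """Run `cmd` repeatedly and return per-run wall-clock seconds."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Discard samples beyond 1.5 IQR, then report median and spread."""
    q1, _, q3 = statistics.quantiles(samples, n=4)
    iqr = q3 - q1
    kept = [s for s in samples if q1 - 1.5 * iqr <= s <= q3 + 1.5 * iqr]
    return {
        "median_s": statistics.median(kept),
        "stdev_s": statistics.stdev(kept) if len(kept) > 1 else 0.0,
        "runs": len(kept),
    }


if __name__ == "__main__":
    # Stand-in workload: replace with the real build/test command.
    samples = time_command([sys.executable, "-c", "pass"])
    print(json.dumps(summarize(samples), indent=2))
```

Running multiple iterations and trimming outliers is what makes the resulting median usable as a baseline for later comparisons.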

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The content is reasonably structured but includes some unnecessary explanation (e.g., spelling out what each Core Web Vital acronym stands for, which Claude already knows). The four modes are clearly laid out but could be tighter. | 2 / 3 |
| Actionability | The skill describes what to measure but provides no executable code, no actual commands, no scripts, and no concrete implementation. The numbered lists read as high-level descriptions rather than actionable instructions Claude can follow. '/benchmark baseline' and '/benchmark compare' are referenced but never defined or implemented. | 1 / 3 |
| Workflow Clarity | Mode 4 (Before/After Comparison) provides a clear sequence, and the numbered steps in each mode give some structure. However, there are no validation checkpoints, no error handling, and no feedback loops for when measurements fail or produce unexpected results. | 2 / 3 |
| Progressive Disclosure | The content is organized into clear sections (modes, output, integration), which is good. However, it mentions storing baselines in `.ecc/benchmarks/` and pairing with other tools without linking to any reference files. The four modes could benefit from being split into separate detailed files with the SKILL.md serving as an overview. | 2 / 3 |
| Total | | 7 / 12 |

Passed
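
The review notes that the skill stores baselines in `.ecc/benchmarks/` without defining a format. A minimal sketch of what store-and-compare could look like; the JSON layout, file naming, and 10% regression threshold are assumptions, not the skill's documented behavior:

```python
# Hypothetical baseline store for `.ecc/benchmarks/` -- the path comes
# from the skill, but the format and threshold are illustrative.
import json
from pathlib import Path

BASELINE_DIR = Path(".ecc/benchmarks")


def save_baseline(name, metrics):
    """Persist a metrics dict (e.g. {'median_s': 1.02}) as the named baseline."""
    BASELINE_DIR.mkdir(parents=True, exist_ok=True)
    (BASELINE_DIR / f"{name}.json").write_text(json.dumps(metrics, indent=2))


def compare_to_baseline(name, metrics, tolerance=0.10):
    """Return metrics more than `tolerance` (10%) worse than the baseline."""
    baseline = json.loads((BASELINE_DIR / f"{name}.json").read_text())
    regressions = {}
    for key, current in metrics.items():
        old = baseline.get(key)
        if old is not None and current > old * (1 + tolerance):
            regressions[key] = {"baseline": old, "current": current}
    return regressions
```

A '/benchmark compare' step could then call `compare_to_baseline` after a fresh run and flag a regression whenever the returned dict is non-empty.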

Validation: 90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 10 / 11 Passed

Validation for skill structure

| Criteria | Description | Result |
| --- | --- | --- |
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | | 10 / 11 |

Passed
