benchmark-e2e

End-to-end benchmark suite for vercel-plugin. Runs realistic projects through skill injection, launches dev servers, verifies everything works, analyzes conversation logs, and produces an improvement report for overnight self-improvement loops.

62

Quality

72%

Does it follow best practices?

Impact

No eval scenarios have been run

Security by Snyk

Passed

No known issues

Optimize this skill with Tessl

npx tessl skill review --optimize ./.claude/skills/benchmark-e2e/SKILL.md

Quality

Content

77%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a well-structured skill with strong actionability and clear workflow sequencing. The pipeline stages, data contracts, and self-improvement cycle are all concrete and executable. The main weakness is mild verbosity: the TypeScript interfaces and some explanatory sections could be tightened or split into reference files for better progressive disclosure.

Suggestions

Consider moving the TypeScript interfaces (run-manifest.json, events.jsonl, report.json) into a separate CONTRACTS.md reference file, keeping only brief descriptions inline.

Trim the 'Self-Improvement Cycle' section: steps 1-5 largely restate what is already evident from the report.json structure and the pipeline description above it.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The content is mostly efficient but includes some sections that could be tightened. The TypeScript interfaces are useful but the 'Overnight Automation Loop' section (a simple while loop) and the 'Self-Improvement Cycle' section explain concepts that are fairly obvious from the report structure. The events.jsonl examples are somewhat verbose. | 2 / 3 |
| Actionability | Provides fully executable commands for running the suite, concrete CLI flags with defaults, copy-paste ready bash for automation, and precise TypeScript interfaces for all data contracts. The prompt table with expected skills is specific and actionable. | 3 / 3 |
| Workflow Clarity | The four pipeline stages are clearly sequenced with explicit abort-on-failure behavior. The self-improvement cycle provides a clear feedback loop (run → read gaps → apply fixes → re-run → compare). The events.jsonl contract shows how errors propagate and cause pipeline aborts. | 3 / 3 |
| Progressive Disclosure | The content is well-structured with clear sections and headers, but it's somewhat monolithic: the TypeScript interfaces and detailed contract specifications could be split into a separate CONTRACTS.md or REFERENCE.md file. No bundle files are provided to reference, but the skill doesn't reference any external files either, keeping everything inline. | 2 / 3 |
| Total | | 10 / 12 |

Passed
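
To make the Content rubric easier to act on, the per-dimension scores above can be treated as data. The following is a minimal sketch, not part of the Tessl tooling: the `DimensionScore` type and the variable names are hypothetical, and only the scores themselves are taken from the review. It reproduces the raw total and lists the dimensions that scored below their maximum. It does not attempt to reproduce the percentage shown for the section, since this page does not state how that figure is derived.

```typescript
// Illustrative only: type and variable names are assumptions, scores are from the review.
interface DimensionScore {
  dimension: string;
  score: number;
  max: number;
}

const contentRubric: DimensionScore[] = [
  { dimension: "Conciseness", score: 2, max: 3 },
  { dimension: "Actionability", score: 3, max: 3 },
  { dimension: "Workflow Clarity", score: 3, max: 3 },
  { dimension: "Progressive Disclosure", score: 2, max: 3 },
];

// Raw rubric total and maximum.
const total = contentRubric.reduce((sum, d) => sum + d.score, 0);
const maxTotal = contentRubric.reduce((sum, d) => sum + d.max, 0);

// Dimensions that still have room to improve.
const headroom = contentRubric
  .filter((d) => d.score < d.max)
  .map((d) => d.dimension);

console.log(`${total} / ${maxTotal}`); // 10 / 12
console.log(headroom); // [ 'Conciseness', 'Progressive Disclosure' ]
```

The two dimensions with headroom are the same ones the suggestions above target: moving the interfaces out addresses Progressive Disclosure, and trimming the restated steps addresses Conciseness.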

Description

67%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description excels at specificity and distinctiveness, clearly outlining a unique benchmark suite for vercel-plugin with concrete actions. Its main weaknesses are the lack of an explicit 'Use when...' clause and somewhat technical trigger terms that may not match natural user language. Adding explicit trigger guidance would significantly improve skill selection accuracy.

Suggestions

Add a 'Use when...' clause, e.g., 'Use when the user wants to run benchmarks on the vercel-plugin, test skill injection, or trigger overnight self-improvement evaluation loops.'

Include more natural trigger terms such as 'test', 'run benchmarks', 'evaluate', 'verify plugin', or 'quality check' to improve matching with user requests.
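
The two suggestions above can be sanity-checked before re-running the review. The sketch below is a hypothetical lint, not Tessl's actual scorer: `checkDescription` and `NATURAL_TRIGGERS` are names invented for illustration, and the check is a crude case-insensitive substring match. It compares the current description with a revised one that appends the 'Use when...' clause proposed in the first suggestion.

```typescript
// Hypothetical description lint; this is NOT how Tessl scores descriptions.
// Trigger terms are the ones listed in the review's second suggestion.
const NATURAL_TRIGGERS = ["test", "run benchmarks", "evaluate", "verify plugin", "quality check"];

interface DescriptionCheck {
  hasUseWhenClause: boolean;
  matchedTriggers: string[];
}

function checkDescription(description: string): DescriptionCheck {
  const text = description.toLowerCase();
  return {
    // Crude substring checks: "test" would also match words such as "latest".
    hasUseWhenClause: text.includes("use when"),
    matchedTriggers: NATURAL_TRIGGERS.filter((term) => text.includes(term)),
  };
}

// The description as currently published for the skill.
const current =
  "End-to-end benchmark suite for vercel-plugin. Runs realistic projects through skill injection, " +
  "launches dev servers, verifies everything works, analyzes conversation logs, and produces an " +
  "improvement report for overnight self-improvement loops.";

// The same description with the review's suggested clause appended.
const revised =
  current +
  " Use when the user wants to run benchmarks on the vercel-plugin, test skill injection, " +
  "or trigger overnight self-improvement evaluation loops.";

console.log(checkDescription(current));
console.log(checkDescription(revised));
```

Under this check the current description has no 'Use when' clause and matches none of the listed terms, while the revised one gains the clause and matches "test" and "run benchmarks". Whether that is enough to raise the Completeness and Trigger Term Quality scores can only be confirmed by re-running the review.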

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Lists multiple specific concrete actions: runs realistic projects through skill injection, launches dev servers, verifies functionality, analyzes conversation logs, and produces improvement reports. | 3 / 3 |
| Completeness | Clearly answers 'what does this do' with detailed actions, but lacks an explicit 'Use when...' clause or equivalent trigger guidance for when Claude should select this skill. | 2 / 3 |
| Trigger Term Quality | Contains some relevant keywords like 'benchmark', 'vercel-plugin', 'dev servers', 'conversation logs', and 'improvement report', but these are fairly technical/internal terms. Missing natural user-facing trigger terms like 'test', 'run benchmarks', 'evaluate plugin'. | 2 / 3 |
| Distinctiveness Conflict Risk | Highly specific to 'vercel-plugin' benchmarking with a clear niche involving skill injection, dev server verification, and self-improvement loops. Very unlikely to conflict with other skills. | 3 / 3 |
| Total | | 10 / 12 |

Passed

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 11 / 11 Passed

Validation for skill structure

No warnings or errors.

Repository: vercel/vercel-plugin
Reviewed
