
benchmark-e2e

End-to-end benchmark suite for vercel-plugin. Runs realistic projects through skill injection, launches dev servers, verifies everything works, analyzes conversation logs, and produces an improvement report for overnight self-improvement loops.


Quality

72%

Does it follow best practices?

Impact

No eval scenarios have been run

Security (by Snyk)

Passed

No known issues

Optimize this skill with Tessl

npx tessl skill review --optimize ./.claude/skills/benchmark-e2e/SKILL.md

Quality

Discovery

67%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description excels at specificity and distinctiveness, clearly outlining a unique benchmark suite for vercel-plugin with concrete actions. However, it lacks an explicit 'Use when...' clause, which caps completeness, and the trigger terms are somewhat technical rather than natural language a user might employ.

Suggestions

- Add an explicit 'Use when...' clause, e.g., 'Use when running benchmarks on the vercel-plugin, testing skill injection, or generating self-improvement reports.'
- Include more natural trigger terms users might say, such as 'test vercel plugin', 'run benchmark suite', 'evaluate plugin performance', or 'overnight testing'.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Lists multiple specific concrete actions: runs realistic projects through skill injection, launches dev servers, verifies functionality, analyzes conversation logs, and produces improvement reports. | 3 / 3 |
| Completeness | Clearly answers 'what does this do' with detailed actions, but lacks an explicit 'Use when...' clause or equivalent trigger guidance. The 'when' is only implied (overnight self-improvement loops). | 2 / 3 |
| Trigger Term Quality | Contains some relevant keywords like 'benchmark', 'vercel-plugin', 'dev servers', 'conversation logs', and 'improvement report', but these are fairly technical/internal terms. Missing natural user-facing trigger terms like 'test', 'run benchmarks', 'evaluate skills'. | 2 / 3 |
| Distinctiveness / Conflict Risk | Highly specific niche targeting 'vercel-plugin' benchmarking with a clear scope around skill injection, dev server verification, and self-improvement loops. Very unlikely to conflict with other skills. | 3 / 3 |
| Total | | 10 / 12 |

Passed

Implementation

77%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a well-structured, highly actionable skill that clearly documents a benchmark pipeline with concrete commands, data contracts, and a feedback loop. Its main weakness is moderate verbosity — some explanatory prose and the full TypeScript interfaces inline make it longer than necessary for a SKILL.md overview. The workflow is clearly sequenced with appropriate validation and error handling patterns.
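The stage sequencing the reviewer describes (runner → verify → analyze → report, aborting on the first failure) could be sketched roughly as below. The `Stage` type and `runPipeline` helper are illustrative assumptions, not the skill's actual API:

```typescript
// Hypothetical sketch of the abort-on-failure pipeline described in
// the review; the real skill's orchestration may differ.
type Stage = { name: string; run: () => Promise<boolean> };

// Runs stages in order; returns the name of the first failing stage,
// or null when every stage succeeds.
async function runPipeline(stages: Stage[]): Promise<string | null> {
  for (const stage of stages) {
    if (!(await stage.run())) return stage.name;
  }
  return null;
}
```

With stages `[runner, verify, analyze, report]`, a dev server that fails to come up would stop the run at `verify`, matching the abort-on-failure behavior the reviewer highlights.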

Suggestions

- Move the detailed TypeScript interfaces (RunManifest, ReportJson, events.jsonl schema) to a separate CONTRACTS.md reference file and link to it from the main skill.
- Remove explanatory sentences that restate what the structure already shows (e.g., 'The analyzer and verifier read this manifest to correlate sessions precisely instead of guessing from directory listings').
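As a rough illustration of what an extracted CONTRACTS.md could hold, a trimmed manifest contract plus a runtime guard might look like the sketch below. Every field name here is an assumption for illustration; the skill's real RunManifest shape is not reproduced on this page:

```typescript
// Hypothetical, trimmed data contract; the skill's actual RunManifest
// interface may define different fields.
interface RunManifest {
  runId: string;              // unique id for one benchmark run
  startedAt: string;          // ISO-8601 timestamp
  sessions: {
    project: string;          // fixture project exercised by the runner
    sessionLog: string;       // path to that session's conversation log
    expectedSkills: string[]; // skills the prompt should trigger
  }[];
}

// A narrow runtime guard so downstream stages can fail fast on a
// malformed manifest instead of guessing from directory listings.
function isRunManifest(value: unknown): value is RunManifest {
  const v = value as RunManifest;
  return (
    typeof v === "object" &&
    v !== null &&
    typeof v.runId === "string" &&
    Array.isArray(v.sessions)
  );
}
```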

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The content is mostly efficient but includes some sections that could be tightened. The TypeScript interfaces are useful, but the explanatory prose around them (e.g., 'The analyzer and verifier read this manifest to correlate sessions precisely instead of guessing from directory listings') adds little value for Claude. The Self-Improvement Cycle section restates what's already clear from the pipeline description. | 2 / 3 |
| Actionability | Provides fully executable commands (bun run scripts/benchmark-e2e.ts with flags), complete TypeScript interfaces for all data contracts, a concrete prompt table with expected skills, and copy-paste-ready automation loops. Everything needed to run and interpret the benchmark is present. | 3 / 3 |
| Workflow Clarity | The pipeline stages are clearly sequenced (runner → verify → analyze → report) with explicit abort-on-failure behavior. The Self-Improvement Cycle provides a clear feedback loop (run → read gaps → apply fixes → re-run → compare) with specific validation criteria (verdict trending from fail to pass). The events.jsonl contract shows error-handling patterns. | 3 / 3 |
| Progressive Disclosure | The content is well structured with clear sections (Quick Start, Pipeline Stages, Contracts, etc.), but it is somewhat monolithic: the detailed TypeScript interfaces and event schemas could be in separate reference files. No bundle files are provided, so there is no external reference structure to leverage, and the inline contracts section is lengthy for a SKILL.md overview. | 2 / 3 |
| Total | | 10 / 12 |

Passed
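The feedback loop the reviewer praises (run → read gaps → apply fixes → re-run → compare, checking that verdicts trend from fail to pass) could be checked with a small comparison helper like this hypothetical sketch. The `Verdict` and `ReportJson` shapes are assumed for illustration, not taken from the skill:

```typescript
// Hypothetical report shape for comparing two benchmark runs.
type Verdict = "pass" | "fail";

interface ReportJson {
  verdicts: Record<string, Verdict>; // verdict per benchmark session
  gaps: string[];                    // improvement suggestions to apply
}

// Did at least one session trend from fail to pass, with no regressions?
function improved(before: ReportJson, after: ReportJson): boolean {
  const ids = Object.keys(before.verdicts);
  const regressed = ids.some(
    (id) => before.verdicts[id] === "pass" && after.verdicts[id] === "fail"
  );
  const progressed = ids.some(
    (id) => before.verdicts[id] === "fail" && after.verdicts[id] === "pass"
  );
  return progressed && !regressed;
}
```

An overnight loop could re-run the benchmark after each applied fix and stop (or alert) once `improved` returns true for the latest pair of reports.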

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 11 / 11 Passed

Validation for skill structure

No warnings or errors.

Repository: vercel/vercel-plugin (Reviewed)
