benchmark-agents

Advanced AI agent benchmark scenarios that push Vercel's cutting-edge platform features — Workflow DevKit, AI Gateway, MCP, Chat SDK, Queues, Flags, Sandbox, and multi-agent orchestration. Designed to stress-test skill injection for complex, multi-system builds.

56

Quality

Does it follow best practices?

Impact

No eval scenarios have been run

Security by Snyk

Advisory

Suggest reviewing before use


Quality

Content

77%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The content is highly actionable and workflow-structured with concrete commands and validation checkpoints, but it is a long monolithic document that could be tightened and would benefit from splitting reference material into separate bundle files.

Suggestions

Move the "Scenario Table", "Complexity Tiers", and "Common Issues Found in Evals" tables into separate reference files (e.g. references/scenarios.md, references/issues.md) and link them from the body to improve progressive disclosure and reduce token load.

Consolidate the 11 redundant DO NOT rules and consider relocating the version-stamped Common Issues table to a deprecated/changelog section to reduce time-sensitive noise.

Tighten the "How Evals Work" rationale into fewer bullets since the exact commands already convey the method.
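The first suggestion above, moving a table section into a reference file and linking it from the body, could be sketched as follows. This is a minimal illustration, not part of the reviewed skill or of any registry tooling: `extract_section` is a hypothetical helper, and it assumes the sections in SKILL.md are introduced by ATX-style `#` headings whose titles match the names quoted in the suggestion.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def _headings(lines):
    """Yield (index, level, title) for ATX headings outside fenced code blocks."""
    in_fence = False
    for i, line in enumerate(lines):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        m = HEADING_RE.match(line)
        if m:
            yield i, len(m.group(1)), m.group(2)


def extract_section(markdown: str, heading: str, ref_path: str) -> tuple[str, str]:
    """Move one headed section out of `markdown`, leaving a link behind.

    Returns (trimmed_body, extracted_section). If the heading is not found,
    the body is returned unchanged and the extracted text is empty.
    """
    lines = markdown.splitlines()
    heads = list(_headings(lines))
    for pos, (start, level, title) in enumerate(heads):
        if title == heading:
            break
    else:
        return markdown, ""

    # The section ends at the next heading of the same or a higher level.
    end = len(lines)
    for idx, lvl, _ in heads[pos + 1:]:
        if lvl <= level:
            end = idx
            break

    extracted = "\n".join(lines[start:end]).strip() + "\n"
    # Keep the heading in the body and replace its content with a link.
    link = [lines[start], "", f"See [{heading}]({ref_path}).", ""]
    trimmed = "\n".join(lines[:start] + link + lines[end:])
    return trimmed, extracted
```

Calling `extract_section(body, "Scenario Table", "references/scenarios.md")` would return the shortened body plus the text to write into the reference file. Writing the files is left to the caller so the helper stays side-effect free.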

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The body is mostly efficient with copy-paste commands, but it is padded by a 15-row version-stamped "Common Issues" table (time-sensitive v0.8.0–v0.9.9 entries) and redundant DO NOT rules that could be tightened, so it does not reach the lean level 3. | 2 / 3 |
| Actionability | It provides fully executable, copy-paste-ready bash commands ("npx add-plugin...", "wezterm cli spawn...", grep verification one-liners) with explicit "Copy the exact commands below. Do not improvise." guidance, matching the level 3 anchor. | 3 / 3 |
| Workflow Clarity | The eval loop is clearly sequenced (setup → launch → monitor → verify → fix → release → repeat) with explicit validation checkpoints (claim dirs, hook-firing greps, code-pattern verification) and a feedback loop, matching the level 3 anchor. | 3 / 3 |
| Progressive Disclosure | Sections are well-organized, but with no references/scripts/assets bundle present, all content is inline in a ~305-line file — the Scenario Table, Common Issues table, and Complexity Tiers are content that should be split into separate referenced files, fitting the level 2 anchor. | 2 / 3 |
| Total | | 10 / 12 |

Passed

Description

50%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description conveys a clear niche and names specific platform features, but it relies on abstract marketing language, lacks natural trigger terms users would say, and omits any explicit "Use when..." guidance. It is adequate but not exemplary.

Suggestions

Add an explicit "Use when..." trigger clause naming the situations that call for this skill (e.g. "Use when benchmarking a Vercel plugin via skill injection, running eval sessions, or verifying hook/skill coverage").

Replace abstract verbs ("push", "stress-test") with concrete actions (e.g. "launch Claude Code eval sessions, verify skill injection claims, monitor hook firing, and produce a coverage report").

Trim marketing fluff ("cutting-edge", "complex, multi-system builds") and include natural user phrasing like "run evals" or "test the plugin".
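The three suggestions above can be turned into a quick self-check before resubmitting a description. The sketch below is only an illustration under stated assumptions: `lint_description` and both word lists are hypothetical, drawn from the phrases quoted in this review, and do not reproduce the rubric the reviewer actually applies.

```python
import re

# Assumed word lists, taken from the phrases this review calls out.
MARKETING_TERMS = ("cutting-edge", "stress-test", "push")
NATURAL_TRIGGERS = ("run evals", "test the plugin", "verify skill injection")


def lint_description(description: str) -> list[str]:
    """Return a list of findings; an empty list means no issue was detected."""
    findings = []
    text = description.lower()

    # Suggestion 1: an explicit trigger clause.
    if "use when" not in text:
        findings.append('missing an explicit "Use when..." clause')

    # Suggestions 2 and 3: abstract verbs and marketing wording.
    for term in MARKETING_TERMS:
        if re.search(rf"\b{re.escape(term)}\b", text):
            findings.append(f'abstract or marketing term: "{term}"')

    # Suggestion 3: phrasing a user would naturally type.
    if not any(trigger in text for trigger in NATURAL_TRIGGERS):
        findings.append("no natural trigger phrases")

    return findings
```

A description that names concrete actions and ends with a clause such as "Use when you need to run evals or test the plugin" yields an empty list under these assumed checks. A clean result here does not guarantee a higher review score.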

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | It names the domain ("AI agent benchmark scenarios") and a concrete feature list ("Workflow DevKit, AI Gateway, MCP, Chat SDK..."), but the actions ("push", "stress-test skill injection") are abstract rather than a comprehensive list of concrete actions, so it falls short of level 3. | 2 / 3 |
| Completeness | It answers "what" (benchmark scenarios that stress-test skill injection) but provides no "Use when..." clause or equivalent explicit trigger for when Claude should use it, which caps completeness at 2 per the rubric guideline. | 2 / 3 |
| Trigger Term Quality | "benchmark scenarios" and "AI agent" are natural terms, but common variations a user would say ("run evals", "test agents", "verify skill injection") are missing, and the phrase is padded with marketing jargon ("cutting-edge platform features"), keeping it below level 3. | 2 / 3 |
| Distinctiveness Conflict Risk | The Vercel plugin eval/benchmark niche is fairly specific, but with no explicit distinct triggers and marketing-heavy language ("complex, multi-system builds") it could still overlap with general testing or benchmark skills, so it does not reach level 3. | 2 / 3 |
| Total | | 8 / 12 |

Passed

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 16 / 16 Passed

Validation for skill structure

No warnings or errors.

Repository: vercel/vercel-plugin
Reviewed
