
benchmark-agents

Advanced AI agent benchmark scenarios that push Vercel's cutting-edge platform features — Workflow DevKit, AI Gateway, MCP, Chat SDK, Queues, Flags, Sandbox, and multi-agent orchestration. Designed to stress-test skill injection for complex, multi-system builds.

46

Quality

47%

Does it follow best practices?

Impact

No eval scenarios have been run

Security by Snyk

Advisory

Suggest reviewing before use

Optimize this skill with Tessl

npx tessl skill review --optimize ./.claude/skills/benchmark-agents/SKILL.md

Quality

Content

62%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill excels at actionability and workflow clarity — every step has exact, executable commands and the eval loop is well-defined with validation checkpoints. However, it is significantly over-budget on tokens, containing extensive historical context (version-by-version bug fix tables), 12 detailed scenarios, and explanatory content that could be split into reference files or condensed. The monolithic structure hurts both conciseness and progressive disclosure.

Suggestions

Move the 'Common Issues Found in Evals' table, Scenario Table, and Complexity Tiers into separate reference files (e.g., ISSUES.md, SCENARIOS.md) and link to them from the main skill.

Condense the 'DO NOT' section into a compact checklist without explanations — Claude can infer why from the correct commands shown above.

Remove the explanations of why --print mode doesn't work to save ~15 lines of context. Claude needs only the correct commands, not an understanding of plugin internals.
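The first suggestion (moving large sections into reference files and linking to them) can be sketched in a few lines. This is a minimal illustration, not part of the reviewed skill or of Tessl's tooling: `split_sections` is a hypothetical helper, and it assumes the sections to move are level-2 `## ` headings whose titles match the names used in the review.

```python
import re


def split_sections(text: str, moves: dict[str, str]) -> tuple[str, dict[str, str]]:
    """Move named '## ' sections out of `text` (hypothetical helper).

    `moves` maps a heading title to a target filename. Returns the trimmed
    main document, with a link left in place of each moved section, and a
    dict of filename -> extracted content.
    Limitation: lines starting with '## ' inside fenced code blocks are
    treated as headings too.
    """
    # Split just before every level-2 heading; the first chunk is the preamble.
    chunks = re.split(r"^(?=## )", text, flags=re.MULTILINE)
    main_parts: list[str] = []
    extracted: dict[str, str] = {}
    for chunk in chunks:
        first_line = chunk.split("\n", 1)[0]
        title = first_line[3:].strip() if first_line.startswith("## ") else None
        target = moves.get(title) if title else None
        if target:
            # Several sections may share one reference file.
            extracted[target] = extracted.get(target, "") + chunk
            main_parts.append(f"## {title}\n\nSee [{target}]({target}).\n\n")
        else:
            main_parts.append(chunk)
    return "".join(main_parts), extracted


# Toy document; the real SKILL.md layout is assumed, not known.
skill = (
    "# benchmark-agents\n\nIntro.\n\n"
    "## Setup\n\nRun it.\n\n"
    "## Scenario Table\n\n| a | b |\n\n"
    "## Complexity Tiers\n\nTier 1.\n"
)
main, files = split_sections(
    skill,
    {"Scenario Table": "SCENARIOS.md", "Complexity Tiers": "SCENARIOS.md"},
)
print(main)
print(sorted(files))
```

Keeping a stub heading with a link in the main file preserves the outline, so an agent can still see that the material exists and load it only when needed.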

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | This skill is extremely verbose at ~350+ lines. It includes extensive tables of historical bug fixes (Common Issues Found in Evals), detailed scenario tables, complexity tiers, and lengthy explanations of why certain approaches don't work. Much of this context (e.g., explaining what --print mode does, version history of plugin fixes) could be dramatically condensed. The 'DO NOT' section alone has 11 items with explanations that could be a compact checklist. | 1 / 3 |
| Actionability | The skill provides fully executable, copy-paste-ready bash commands for every step: setup, launch, monitoring, verification, and cleanup. Commands include exact flags, environment variables, and path conventions. The verification section has concrete grep/test commands for checking generated code quality. | 3 / 3 |
| Workflow Clarity | The eval loop is clearly sequenced: setup → launch → monitor → verify → fix → release → repeat. Each phase has explicit commands and validation checkpoints (checking debug logs after 25s, verifying skill claims, inspecting generated code patterns). The Release → Eval Loop section provides a clear 8-step improvement cycle with gates before release. | 3 / 3 |
| Progressive Disclosure | The content is a monolithic document with no references to external files for detailed content. The historical issues table, 12-scenario table, complexity tiers, and prompt design rules could all be split into separate reference files. The coverage report format and scenario details are inline when they could be referenced. However, the sections are well-organized with clear headers. | 2 / 3 |

Total: 9 / 12 (Passed)

Description

32%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description reads like a promotional blurb rather than a functional skill description. It lists Vercel platform features but fails to specify concrete actions the skill performs and completely lacks a 'Use when...' clause, making it difficult for Claude to know when to select this skill. The buzzword-heavy language ('cutting-edge', 'stress-test skill injection') adds noise without clarity.

Suggestions

Add an explicit 'Use when...' clause describing the scenarios that should trigger this skill, e.g., 'Use when the user asks to build or test multi-agent systems on Vercel, or mentions Workflow DevKit, AI Gateway, or MCP integration.'

Replace vague phrases like 'push cutting-edge platform features' and 'stress-test skill injection' with concrete actions, e.g., 'Generates benchmark configurations, scaffolds multi-agent workflows, and tests AI Gateway routing.'

Remove marketing language ('cutting-edge', 'stress-test') and use third-person action verbs to describe what the skill actually does rather than what it is 'designed' for.
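The checks behind these suggestions can be approximated mechanically. The following is a minimal sketch, not Tessl's actual scoring logic: `lint_description` and `MARKETING_TERMS` are hypothetical names introduced for illustration, the term list contains only the phrases this review calls out, and the sample description is abridged from the listing.

```python
import re

# Hypothetical list: only the phrases this review flags as marketing language.
MARKETING_TERMS = ("cutting-edge", "stress-test")


def lint_description(description: str) -> list[str]:
    """Return human-readable findings for a skill description."""
    findings = []
    # The review asks for an explicit 'Use when...' clause.
    if not re.search(r"\buse when\b", description, flags=re.IGNORECASE):
        findings.append("missing 'Use when...' clause")
    lowered = description.lower()
    for term in MARKETING_TERMS:
        if term in lowered:
            findings.append(f"marketing term: {term}")
    return findings


# Abridged version of the description under review.
current = (
    "Advanced AI agent benchmark scenarios that push Vercel's cutting-edge "
    "platform features. Designed to stress-test skill injection for complex, "
    "multi-system builds."
)
for finding in lint_description(current):
    print(finding)
```

A description that opens with concrete actions and ends with a 'Use when...' clause, such as the example wording in the first suggestion, produces no findings under this sketch.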

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | It names a domain (AI agent benchmarks on Vercel) and lists specific platform features (Workflow DevKit, AI Gateway, MCP, etc.), but does not describe concrete actions the skill performs; it describes what the scenarios are rather than what the skill does with them. | 2 / 3 |
| Completeness | The 'what' is vague (benchmark scenarios that 'push' features) and there is no 'when' clause or explicit trigger guidance at all. The description reads more like a marketing tagline than actionable selection criteria. | 1 / 3 |
| Trigger Term Quality | Includes some relevant technical keywords like 'Vercel', 'AI Gateway', 'MCP', 'Chat SDK', 'multi-agent orchestration', but these are highly specialized jargon rather than natural terms a user would say. Missing common user-facing trigger phrases. | 2 / 3 |
| Distinctiveness Conflict Risk | The mention of specific Vercel features and 'benchmark scenarios' provides some distinctiveness, but the broad scope ('complex, multi-system builds') and overlap with general Vercel or AI agent skills could cause conflicts. | 2 / 3 |

Total: 7 / 12 (Passed)

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 11 / 11 Passed

Validation for skill structure

No warnings or errors.

Repository
vercel/vercel-plugin
Reviewed
