Name: benchmark-agents
Rating: 44 (1 review)
Author: vercel

Advanced AI agent benchmark scenarios that push Vercel's cutting-edge platform features — Workflow DevKit, AI Gateway, MCP, Chat SDK, Queues, Flags, Sandbox, and multi-agent orchestration. Designed to stress-test skill injection for complex, multi-system builds.

Quality

43%

Does it follow best practices?

Impact

—

No eval scenarios have been run

Security

Advisory

Suggest reviewing before use

Optimize this skill with Tessl

npx tessl skill review --optimize ./.claude/skills/benchmark-agents/SKILL.md

Quality

Discovery

32%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description reads more like a marketing tagline than a functional skill description. It lists Vercel platform features but fails to specify what concrete actions Claude should perform and completely lacks explicit trigger guidance ('Use when...'). The buzzword-heavy language ('cutting-edge', 'stress-test skill injection') adds fluff without aiding skill selection.

Suggestions

Add an explicit 'Use when...' clause specifying the scenarios that should trigger this skill, e.g., 'Use when the user asks to build or test multi-system applications using Vercel's Workflow DevKit, AI Gateway, or multi-agent orchestration.'

Replace vague phrases like 'push cutting-edge platform features' and 'stress-test skill injection' with concrete actions such as 'Generates benchmark test configurations, scaffolds multi-agent workflows, and validates integration across Vercel services.'

Narrow the scope or clarify the distinct niche to reduce overlap with other Vercel-related skills — specify whether this is for generating benchmarks, running tests, or building demo applications.
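
The first suggestion can be checked locally before re-running the review. The sketch below assumes the skill file carries YAML frontmatter with a single-line `description:` field; the function name `has_trigger_clause` is hypothetical and is not part of Tessl's tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a Tessl command): succeeds if the description
# line of a skill file contains a "Use when" trigger clause.
# Assumption: the frontmatter has a single-line `description:` field.
has_trigger_clause() {
  local skill_file="$1"
  # Pick out the description line, then look for the trigger phrase
  # (case-insensitive). The exit status is that of the final grep.
  grep -i '^description:' "$skill_file" | grep -qi 'use when'
}
```

Calling `has_trigger_clause ./.claude/skills/benchmark-agents/SKILL.md` returns a zero exit status only when the clause is present, so it can gate a pre-commit step. It says nothing about whether the clause is well written; that still needs a human or a full review.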

Dimension	Reasoning	Score
Specificity	It names a domain (Vercel platform features) and lists specific technologies (Workflow DevKit, AI Gateway, MCP, Chat SDK, Queues, Flags, Sandbox, multi-agent orchestration), but does not describe concrete actions the skill performs — it describes what the scenarios are 'designed to' do rather than what actions Claude should take.	2 / 3
Completeness	The description loosely addresses 'what' (benchmark scenarios for Vercel features) but has no explicit 'when' clause or trigger guidance. There is no 'Use when...' or equivalent, and the 'what' itself is vague about actionable capabilities, capping this at 1.	1 / 3
Trigger Term Quality	Includes some relevant keywords like 'Vercel', 'AI Gateway', 'MCP', 'Chat SDK', 'multi-agent orchestration' that a user might mention, but many terms are highly technical jargon ('skill injection', 'Workflow DevKit') and common user phrasings are absent. The term 'benchmark scenarios' is niche and unlikely to be naturally said by users.	2 / 3
Distinctiveness Conflict Risk	The mention of Vercel-specific features and 'benchmark scenarios' provides some distinctiveness, but the broad scope covering many different platform features (queues, flags, sandbox, etc.) could overlap with individual skills targeting any one of those areas.	2 / 3
	Total	7 / 12 (Passed)

Implementation

55%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill is highly actionable with excellent workflow clarity — every step has exact, executable commands and the eval loop is well-defined. However, it is severely over-long and monolithic: historical bug fix tables, 12-scenario reference tables, complexity tiers, and extensive rationale for prohibitions all belong in separate reference files. The content would benefit enormously from splitting into a concise SKILL.md overview with references to supporting documents.

Suggestions

Move the Common Issues table, Scenario Table, and Complexity Tiers into separate reference files (e.g., ISSUES.md, SCENARIOS.md) and link to them from the main skill.

Condense the DO NOT list to just the commands/flags to avoid, removing the explanatory rationale (e.g., 'DO NOT use claude --print' is sufficient without explaining hook internals)

Remove the 'How Evals Work' explanatory section and fold its essential points into the Setup & Launch commands as brief inline comments

Cut the 'Why --print doesn't work' paragraph; a single-line prohibition is sufficient for Claude to follow.
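
To track progress on these suggestions, a rough size check may help. The sketch below reports how long the skill file is and how many of its lines link to another Markdown file. The function name `skill_size_report` is hypothetical, and the link pattern only recognises inline Markdown links ending in `.md`, which is an assumption about how reference files such as ISSUES.md would be linked.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a Tessl command): prints the line count of a
# skill file and the number of lines containing an inline Markdown link
# to another .md file, e.g. [issues](ISSUES.md).
skill_size_report() {
  local skill_file="$1"
  local lines links
  lines=$(wc -l < "$skill_file")
  # grep -c prints 0 when nothing matches; only its output is used here.
  links=$(grep -cE ']\([^)]*\.md\)' "$skill_file")
  # Arithmetic expansion strips any padding that wc adds on some systems.
  echo "lines=$((lines)) reference_links=$((links))"
}
```

Running `skill_size_report ./.claude/skills/benchmark-agents/SKILL.md` before and after a refactor gives a quick signal on whether content actually moved out of the main file. The numbers are only a proxy for the Conciseness and Progressive Disclosure scores, not a reproduction of how the review computes them.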

Dimension	Reasoning	Score
Conciseness	At 300+ lines, this skill is extremely verbose. It includes extensive tables of historical bug fixes (with version numbers), lengthy DO NOT lists explaining rationale Claude could infer, scenario tables that could live in a separate file, and overly detailed explanations of why certain approaches don't work. Much of this content (e.g., the explanation of why --print doesn't work, the Common Issues table) could be drastically condensed or moved to reference files.	1 / 3
Actionability	The skill provides fully executable, copy-paste-ready bash commands for every step: setup, launch, monitoring, verification, and cleanup. Commands include exact flags, environment variables, and grep patterns. The verification section has concrete bash checks for project structure, model usage, and gateway patterns.	3 / 3
Workflow Clarity	The workflow is clearly sequenced: setup → launch → monitor → verify → fix → release → repeat. Each phase has numbered steps with explicit commands. The monitoring section provides specific validation checks (skill claims, hook firing, PostToolUse catches). The Release → Eval Loop section provides a clear 8-step improvement cycle with gates before release.	3 / 3
Progressive Disclosure	This is a monolithic wall of text with no references to external files. The Common Issues table (15+ rows), Scenario Table (12 rows), Complexity Tiers, and detailed monitoring commands could all be split into separate reference files. Everything is inlined in a single massive document with no bundle files to support it.	1 / 3
	Total	8 / 12 (Passed)

Validation

100%

Warnings & errors only

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation — 11 / 11 Passed

Validation for skill structure

No warnings or errors.

Repository: vercel/vercel-plugin
Commit: 61f1903

Reviewed: 4 days ago
