Name: benchmark-sandbox
Rating: 73.6 (1 reviews)
Author: vercel

benchmark-sandbox

Run vercel-plugin eval scenarios in Vercel Sandboxes instead of local WezTerm panels. Provisions ephemeral microVMs with Claude Code + plugin pre-installed, runs benchmark prompts, extracts hook artifacts, and produces coverage reports.

2.09x

Quality

61%

Does it follow best practices?

Impact

92%

2.09x

Average score across 3 eval scenarios

Securityby

Advisory

Suggest reviewing before use

Optimize this skill with Tessl

npx tessl skill review --optimize ./.claude/skills/benchmark-sandbox/SKILL.md

Quality

Discovery

67%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description excels at specificity and distinctiveness, clearly articulating a narrow, well-defined set of actions for a specific toolchain. However, it lacks an explicit 'Use when...' clause, which caps completeness, and the trigger terms are heavily technical jargon that may not match how users naturally phrase requests.

Suggestions

Add an explicit 'Use when...' clause, e.g., 'Use when the user wants to run vercel-plugin evaluations in cloud sandboxes instead of locally, or mentions eval scenarios, plugin benchmarks, or sandbox testing.'

Include more natural language trigger variations such as 'run plugin tests remotely', 'cloud-based eval', or 'sandbox evaluation' to improve discoverability.

Dimension	Reasoning	Score
Specificity	Lists multiple specific concrete actions: provisions ephemeral microVMs, runs benchmark prompts, extracts hook artifacts, and produces coverage reports. Very detailed about what it does.	3 / 3
Completeness	Clearly answers 'what does this do' with specific actions, but lacks an explicit 'Use when...' clause or equivalent trigger guidance. The 'when' is only implied by the description of capabilities.	2 / 3
Trigger Term Quality	Contains relevant domain-specific terms like 'vercel-plugin', 'eval scenarios', 'Vercel Sandboxes', 'microVMs', 'benchmark prompts', 'coverage reports', but these are highly technical. Missing common natural language variations a user might say (e.g., 'run tests in sandbox', 'evaluate plugin').	2 / 3
Distinctiveness Conflict Risk	Highly specific niche: running vercel-plugin eval scenarios in Vercel Sandboxes with ephemeral microVMs. Very unlikely to conflict with other skills due to the narrow, well-defined domain.	3 / 3
	Total	10 / 12 Passed

Implementation

55%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill is extremely thorough and actionable, providing concrete commands, schemas, and workflows for running sandbox-based evaluations. However, it is severely over-long and monolithic — it reads like comprehensive engineering documentation rather than a focused skill file. The content would benefit enormously from splitting into referenced sub-files (environment facts, scoring schemas, historical results, comparison tables) while keeping the SKILL.md as a concise overview with clear navigation.

Suggestions

Extract the 'Critical Sandbox Environment Facts', 'Known Limitations', 'Proven Results', and comparison table into separate referenced files (e.g., ENVIRONMENT.md, LIMITATIONS.md, RESULTS.md) and link from a concise overview section.

Remove or relocate the 'Proven Results (2026-03-10)' section entirely — historical benchmark data is not actionable instruction and wastes significant token budget.

Trim the 'Key Discoveries (Hard-Won)' section to only the items that affect how Claude should act (e.g., keep items 2, 3, 6) and remove narrative framing like 'Hard-Won'.

Consolidate the multiple command example blocks (appears in both 'Proven Working Script' and 'Commands' sections) into a single reference to avoid redundancy.

Dimension	Reasoning	Score
Conciseness	Extremely verbose at ~400+ lines. Contains extensive historical context ('Proven Results 2026-03-10'), hard-won discovery narratives, comparison tables with other tools, detailed polling flow diagrams, and known limitations that read more like engineering notes than actionable skill instructions. Much of this (e.g., 'Key findings', 'Proven Results', the full sandbox session flow ASCII diagram) is documentation rather than instruction.	1 / 3
Actionability	Highly actionable with concrete CLI commands, exact flags, executable TypeScript monitoring snippets, specific file paths, JSON schemas for scoring, and copy-paste ready examples throughout. The CLI flags table and command examples are immediately usable.	3 / 3
Workflow Clarity	The 3-phase pipeline is clearly sequenced with explicit validation at each phase (haiku scoring), error recovery (deploy retries up to 3x, verify fix-and-retest loops), and a detailed ASCII flow diagram showing the exact sequence including conditional gates (e.g., '>1 project file exists'). The DO NOT section provides clear guardrails for destructive/risky operations.	3 / 3
Progressive Disclosure	Monolithic wall of text with no bundle files or external references. All content — CLI reference, JSON schemas, environment facts, session flow diagrams, historical results, known limitations, comparison tables — is inlined in a single massive document. Content like the scoring schemas, scenario format, environment facts table, and proven results could easily be split into separate referenced files.	1 / 3
	Total	8 / 12 Passed

Validation

100%

Warnings & errors only

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation — 11 / 11 Passed

Validation for skill structure

No warnings or errors.

Repository: vercel/vercel-plugin
Commit: 61f1903

Reviewed: 4 days ago

Table of Contents

Discovery Implementation Validation

Is this your skill?

If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.