Name: benchmark-sandbox
Rating: 80.8 (1 review)
Author: vercel

benchmark-sandbox

Run vercel-plugin eval scenarios in Vercel Sandboxes instead of local WezTerm panels. Provisions ephemeral microVMs with Claude Code + plugin pre-installed, runs benchmark prompts, extracts hook artifacts, and produces coverage reports.

Quality

72%

Does it follow best practices?

Impact

92%

2.09x

Average score across 3 eval scenarios

Security

Advisory

Suggest reviewing before use

Fix and improve this skill with Tessl

tessl review fix ./.claude/skills/benchmark-sandbox/SKILL.md

Quality

Content

77%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The body is highly actionable, with a clearly sequenced, validated multi-phase workflow. It is also verbose, repeating the phase narrative in several sections, and monolithic, keeping inline content that could be split into reference files.

Suggestions

Deduplicate the phase narrative: keep one canonical sequence (How It Works or Session Flow) and have the phase-detail sections cover only what differs.

Move the JSON scoring schemas, artifact-export layout, and 'Proven Results (2026-03-10)' into reference files under references/ and link to them one level deep.

Actually ship the referenced run-eval.ts (and session-end-cleanup.mjs) in a scripts/ bundle, or remove the references, so progressive disclosure points at real files.
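The last suggestion can be checked mechanically. The following is a minimal sketch, assuming the skill body is available as a string and the bundle contents as a set of relative paths. The function name `findMissingReferences` and the file-name regex are hypothetical illustrations, not part of Tessl or vercel-plugin.

```typescript
// Hypothetical helper: reports script names mentioned in a skill body
// that no bundled path backs. The regex is an illustrative heuristic.
function findMissingReferences(
  skillBody: string,
  bundledPaths: ReadonlySet<string>,
): string[] {
  // Match bare file names ending in .ts or .mjs, e.g. run-eval.ts.
  const pattern = /\b[\w-]+\.(?:ts|mjs)\b/g;
  const referenced = Array.from(new Set(skillBody.match(pattern) ?? []));
  const bundled = Array.from(bundledPaths);

  // A reference is backed if some bundled path is, or ends with, that name.
  return referenced.filter(
    (name) => !bundled.some((p) => p === name || p.endsWith("/" + name)),
  );
}

// Example: only one of the two referenced scripts is shipped.
const exampleBody =
  "Run scripts/run-eval.ts, then session-end-cleanup.mjs handles teardown.";
const exampleMissing = findMissingReferences(
  exampleBody,
  new Set(["scripts/run-eval.ts"]),
);
console.log(exampleMissing); // [ 'session-end-cleanup.mjs' ]
```

A check like this could run before publishing, so that every file the skill points to exists in the bundle.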

Dimension	Reasoning	Score
Conciseness	Operational detail mostly earns its place, but the ~388 lines contain heavy duplication (How It Works, Sandbox Session Flow, the phase details, and Commands all repeat the same flags and flows) and inline time-sensitive data (version pins, 2026-03-10 results) that is not isolated in a deprecated section.	2 / 3
Actionability	Concrete copy-paste-ready guidance throughout: exact `bun run` commands, a CLI flags table, TypeScript monitoring snippets, JSON scoring schemas, and precise env var names and paths.	3 / 3
Workflow Clarity	The 3-phase pipeline is explicitly sequenced with per-phase haiku scoring checkpoints, gating ('if >1 project file exists'), deploy retry feedback loops (up to 3x), and crash-safe incremental result writes.	3 / 3
Progressive Disclosure	Well-headed but monolithic: scoring schemas, session-flow diagrams, proven results, and known limitations all live inline in one file, and referenced files like run-eval.ts / session-end-cleanup.mjs have no bundle directory to back them.	2 / 3
	Total	10 / 12 Passed
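The Total row is the sum of the four dimension scores above it. This short sketch reproduces that arithmetic; the `DimensionScore` type and `rubricTotal` function are hypothetical names introduced for illustration. The review does not state how the 77% headline figure is derived, so it is not computed here.

```typescript
// Hypothetical shape for one rubric row; the values are copied from the table.
interface DimensionScore {
  dimension: string;
  score: number;
  max: number;
}

const contentScores: DimensionScore[] = [
  { dimension: "Conciseness", score: 2, max: 3 },
  { dimension: "Actionability", score: 3, max: 3 },
  { dimension: "Workflow Clarity", score: 3, max: 3 },
  { dimension: "Progressive Disclosure", score: 2, max: 3 },
];

// Sum the awarded and maximum points across all dimensions.
function rubricTotal(scores: DimensionScore[]): { score: number; max: number } {
  return scores.reduce(
    (acc, d) => ({ score: acc.score + d.score, max: acc.max + d.max }),
    { score: 0, max: 0 },
  );
}

const total = rubricTotal(contentScores);
console.log(`${total.score} / ${total.max}`); // prints "10 / 12"
```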

Description

67%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description is specific and well-differentiated with strong third-person action verbs, but it omits any explicit trigger guidance, capping completeness and weakening natural trigger terms.

Suggestions

Add a 'Use when...' clause naming natural triggers, e.g. 'Use when running vercel-plugin benchmark/eval scenarios that need parallel remote sandboxes.'

Soften internal jargon ('microVMs', 'hook artifacts') or pair it with user-facing terms ('benchmark runs', 'skill coverage reports') so the description matches what users actually say.
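The first suggestion can be illustrated by appending the proposed clause to the current description. This is a sketch under stated assumptions: `hasTriggerClause` is a hypothetical heuristic and not the reviewer's actual rubric check. The description text and the clause are taken from this review.

```typescript
// Hypothetical heuristic: treats any "use when" phrase as trigger guidance.
function hasTriggerClause(description: string): boolean {
  return /\buse when\b/i.test(description);
}

// The description as currently published.
const currentDescription =
  "Run vercel-plugin eval scenarios in Vercel Sandboxes instead of local " +
  "WezTerm panels. Provisions ephemeral microVMs with Claude Code + plugin " +
  "pre-installed, runs benchmark prompts, extracts hook artifacts, and " +
  "produces coverage reports.";

// The clause proposed in the suggestion above.
const suggestedClause =
  "Use when running vercel-plugin benchmark/eval scenarios that need " +
  "parallel remote sandboxes.";

const revisedDescription = `${currentDescription} ${suggestedClause}`;

console.log(hasTriggerClause(currentDescription)); // false
console.log(hasTriggerClause(revisedDescription)); // true
```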

Dimension	Reasoning	Score
Specificity	Lists multiple concrete actions — 'Provisions ephemeral microVMs', 'runs benchmark prompts', 'extracts hook artifacts', and 'produces coverage reports' — matching the multiple-specific-actions anchor.	3 / 3
Completeness	The 'what' is clearly stated across several verbs, but there is no explicit 'when to use' trigger guidance, which the rubric caps at 2.	2 / 3
Trigger Term Quality	Relevant terms like 'benchmark prompts' and 'coverage reports' appear, but phrasing leans on internal jargon ('microVMs', 'hook artifacts', 'WezTerm panels') with no 'Use when...' clause, so common natural variations are missing.	2 / 3
Distinctiveness Conflict Risk	The 'instead of local WezTerm panels' framing carves out a clear niche distinct from benchmark-agents, making accidental triggering unlikely.	3 / 3
	Total	10 / 12 Passed

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation — 16 / 16 Passed

Validation for skill structure

No warnings or errors.

Repository: vercel/vercel-plugin
Path: .claude/skills/benchmark-sandbox/SKILL.md
Commit: 19606ac

Reviewed: 1 day ago
