
benchmark-sandbox

Run vercel-plugin eval scenarios in Vercel Sandboxes instead of local WezTerm panels. Provisions ephemeral microVMs with Claude Code + plugin pre-installed, runs benchmark prompts, extracts hook artifacts, and produces coverage reports.

80

2.09x
Quality

Does it follow best practices?

Impact

92% (2.09x), average score across 3 eval scenarios

Security by Snyk

Advisory: suggest reviewing before use


Quality

Content

77%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The content is highly actionable with concrete commands, schemas, and a well-sequenced validation-gated workflow. It loses points for redundancy between overlapping sections, a dated results block, and a monolithic structure that fails to push detail into reference files.

Suggestions

De-duplicate the 'How It Works' numbered list and the 'Sandbox Session Flow' ASCII diagram, and merge the overlapping 'Proven Working Script' and 'Commands' examples into one section.

Move the dated 'Proven Results (2026-03-10)' metrics into a separate reference file or a clearly marked historical section so the main skill body stays evergreen.

Extract the JSON scoring schemas and the detailed per-phase flow into reference files (e.g., SCORING.md, FLOW.md) referenced one level deep from SKILL.md to improve progressive disclosure.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The body is information-dense with genuinely hard-won operational knowledge, but it duplicates content across 'How It Works' and 'Sandbox Session Flow', and across 'Proven Working Script' and 'Commands', and carries a dated 'Proven Results (2026-03-10)' section that penalizes conciseness. | 2 / 3 |
| Actionability | Provides fully executable, copy-paste-ready guidance: concrete `bun run run-eval.ts` invocations, a CLI flags table, JSON schemas, and `Sandbox.create({...})` / `sandbox.writeFiles()` code snippets. | 3 / 3 |
| Workflow Clarity | The 3-phase pipeline is clearly sequenced with explicit validation checkpoints: haiku scoring after each phase, a deploy retry loop (up to 3x), and a verify loop that fixes issues until all stories pass. | 3 / 3 |
| Progressive Disclosure | No bundle files exist and the ~388-line body is monolithic: JSON schemas, detailed flow diagrams, and limitations that could live in separate reference files are all inline, though sections are well-headed. | 2 / 3 |
| Total | | 10 / 12 |

Passed
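
To make the "validation-gated" pattern credited under Workflow Clarity concrete, here is a minimal TypeScript sketch of a step that is retried until a validation check passes. Only the limit of 3 attempts comes from the review; `retryWithGate`, `step`, and `validate` are hypothetical names, and this is not the skill's actual deploy or verify code.

```typescript
// Hypothetical helper, not taken from the skill: run a step, gate it on a
// validation check, and retry up to maxAttempts times before giving up.
type GateResult<T> = { ok: boolean; attempts: number; result: T | undefined };

async function retryWithGate<T>(
  step: (attempt: number) => Promise<T>,
  validate: (result: T) => boolean,
  maxAttempts = 3, // the review mentions a deploy retry loop of up to 3x
): Promise<GateResult<T>> {
  let last: T | undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      last = await step(attempt);
      // Only a validated result ends the loop early.
      if (validate(last)) {
        return { ok: true, attempts: attempt, result: last };
      }
    } catch {
      // A thrown error counts as a failed attempt; fall through and retry.
    }
  }
  // All attempts were used without a result that passed validation.
  return { ok: false, attempts: maxAttempts, result: last };
}
```

A caller would pass the deploy (or verify) action as `step` and the pass/fail check as `validate`, then branch on `ok` rather than assuming the step succeeded.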

Description

67%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description is specific and well-differentiated, naming several concrete actions and contrasting itself with the WezTerm alternative. Its main weakness is the absence of an explicit 'Use when' trigger and a reliance on technical jargon over natural user phrasing.

Suggestions

Add an explicit 'Use when...' clause naming the natural situations that should trigger this skill (e.g., when the user wants to run plugin evals in parallel, or needs automated verification and deploy).

Soften the jargon with at least one common user-facing phrasing (e.g., 'run benchmark tests', 'test the plugin') alongside the technical terms to broaden trigger coverage.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Lists multiple concrete actions ('Provisions ephemeral microVMs', 'runs benchmark prompts', 'extracts hook artifacts', 'produces coverage reports'), matching the multi-action anchor. | 3 / 3 |
| Completeness | Clearly answers 'what' the skill does but lacks any 'Use when...' or equivalent explicit trigger clause, which per the guidelines caps completeness at 2. | 2 / 3 |
| Trigger Term Quality | Contains some natural terms ('benchmark prompts', 'eval scenarios') but leans on technical jargon ('microVMs', 'hook artifacts', 'WezTerm panels') and omits common user-facing variations, so it does not reach full coverage. | 2 / 3 |
| Distinctiveness Conflict Risk | 'instead of local WezTerm panels' explicitly distinguishes it from the sibling benchmark-agents skill, carving a clear niche unlikely to trigger the wrong skill. | 3 / 3 |
| Total | | 10 / 12 |

Passed
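
The Completeness finding comes down to whether the description carries an explicit trigger clause. The TypeScript sketch below shows one way such a check might look. The regular expression and the name `hasTriggerClause` are assumptions made for illustration, not the reviewer's actual rubric. The revised description simply appends the first trigger situation suggested above.

```typescript
// Hypothetical lint, not the reviewer's rubric: treat "Use when ..." or
// "Use this skill when ..." as an explicit trigger clause.
const TRIGGER_CLAUSE = /\buse (this skill )?when\b/i;

function hasTriggerClause(description: string): boolean {
  return TRIGGER_CLAUSE.test(description);
}

// The skill's current description, as quoted at the top of this page.
const currentDescription =
  "Run vercel-plugin eval scenarios in Vercel Sandboxes instead of local WezTerm panels. " +
  "Provisions ephemeral microVMs with Claude Code + plugin pre-installed, runs benchmark prompts, " +
  "extracts hook artifacts, and produces coverage reports.";

// Illustrative revision using a trigger situation named in the suggestions.
const revisedDescription =
  currentDescription + " Use when the user wants to run plugin evals in parallel.";

console.log(hasTriggerClause(currentDescription)); // false
console.log(hasTriggerClause(revisedDescription)); // true
```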

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 16 / 16 passed

Validation for skill structure

No warnings or errors.

Repository
vercel/vercel-plugin
Reviewed
