Run vercel-plugin eval scenarios in Vercel Sandboxes instead of local WezTerm panels. Provisions ephemeral microVMs with Claude Code + plugin pre-installed, runs benchmark prompts, extracts hook artifacts, and produces coverage reports.
Run benchmark scenarios inside Vercel Sandboxes — ephemeral Firecracker microVMs with node24. Each sandbox gets a fresh Claude Code + Vercel CLI + agent-browser install, the local vercel-plugin uploaded, and runs a 3-phase eval pipeline:
1. **BUILD** — a full Claude Code session (`--dangerously-skip-permissions --debug`) builds the app
2. **VERIFY** — `agent-browser` walks through user stories, fixing issues until all pass (20 min timeout)
3. **DEPLOY** — runs `vercel deploy` and fixes build errors (up to 3 retries). Deployed apps have deployment protection enabled by default.

Skills are tracked across all 3 phases — each phase may trigger additional skill injections as new files/patterns are created. After each phase, a haiku structured scoring step (`claude -p --json-schema --model haiku`) evaluates the results as structured JSON.
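A rough sketch of that per-scenario loop, assuming hypothetical `runClaude`/`scoreWithHaiku`/`extractSkills` helpers (the real orchestration lives in run-eval.ts):

```typescript
// Hypothetical sketch of the 3-phase eval loop — not the actual run-eval.ts code.
type Phase = "build" | "verify" | "deploy";

interface PhaseResult {
  phase: Phase;
  skills: string[]; // skills claimed so far (re-extracted each phase, since each may claim more)
  score: unknown;   // haiku structured JSON for this phase
}

async function runPipeline(
  runClaude: (phase: Phase) => Promise<string>,
  scoreWithHaiku: (phase: Phase, output: string) => Promise<unknown>,
  extractSkills: () => Promise<string[]>,
): Promise<PhaseResult[]> {
  const results: PhaseResult[] = [];
  for (const phase of ["build", "verify", "deploy"] as Phase[]) {
    const output = await runClaude(phase);             // full Claude Code session (hooks fire)
    const skills = await extractSkills();              // each phase may trigger more skills
    const score = await scoreWithHaiku(phase, output); // quick -p pass: no tools, no hooks
    results.push({ phase, skills, score });
  }
  return results;
}

// Smoke-run with stubs
const res = await runPipeline(
  async (p) => `${p} done`,
  async (_p, out) => ({ summary: out }),
  async () => ["nextjs"],
);
console.log(res.length); // → 3
```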
Use run-eval.ts — the proven eval runner:

```bash
# Run default scenarios with full 3-phase pipeline
bun run .claude/skills/benchmark-sandbox/run-eval.ts

# With dynamic scenarios from a JSON file (recommended — see "Dynamic Scenarios" below)
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/my-scenarios.json

# Keep sandboxes alive overnight with public URLs
bun run .claude/skills/benchmark-sandbox/run-eval.ts --keep-alive --keep-hours 8

# Build-only (skip verification and deploy)
bun run .claude/skills/benchmark-sandbox/run-eval.ts --skip-verify --skip-deploy

# Run specific scenarios by slug
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios splitwise-clone,calendly-clone
```

| Flag | Default | Description |
|---|---|---|
| `--concurrency N` | 5 | Max parallel sandboxes (max 10) |
| `--timeout MS` | 1800000 (30 min) | Per-phase timeout in ms |
| `--keep-alive` | off | Keep sandboxes running after eval |
| `--keep-hours N` | 8 | Hours to keep alive (with `--keep-alive`) |
| `--skip-verify` | off | Skip the agent-browser verification phase |
| `--skip-deploy` | off | Skip the Vercel deploy phase |
| `--scenarios a,b,c` | all | Only run specific scenarios by slug |
| `--scenarios-file path` | — | Load scenarios from a JSON file instead of built-in defaults |
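For illustration, a parser for these flags could be sketched with `node:util`'s `parseArgs` — the actual parsing inside run-eval.ts may differ; the defaults mirror the table above:

```typescript
// Sketch of flag parsing for the table above (assumed implementation, not run-eval.ts itself).
import { parseArgs } from "node:util";

function parseEvalFlags(argv: string[]) {
  const { values } = parseArgs({
    args: argv,
    options: {
      concurrency: { type: "string" },            // --concurrency N (capped at 10)
      timeout: { type: "string" },                // per-phase timeout in ms
      "keep-alive": { type: "boolean", default: false },
      "keep-hours": { type: "string", default: "8" },
      "skip-verify": { type: "boolean", default: false },
      "skip-deploy": { type: "boolean", default: false },
      scenarios: { type: "string" },              // comma-separated slugs
      "scenarios-file": { type: "string" },
    },
  });
  return {
    concurrency: Math.min(Number(values.concurrency ?? 5), 10),
    timeoutMs: Number(values.timeout ?? 1_800_000), // 30 min default
    keepAlive: values["keep-alive"],
    keepHours: Number(values["keep-hours"]),
    skipVerify: values["skip-verify"],
    skipDeploy: values["skip-deploy"],
    scenarios: values.scenarios?.split(",") ?? null,
    scenariosFile: values["scenarios-file"] ?? null,
  };
}

const flags = parseEvalFlags(["--keep-alive", "--scenarios", "splitwise-clone,calendly-clone"]);
console.log(flags.keepAlive, flags.scenarios); // → true [ 'splitwise-clone', 'calendly-clone' ]
```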
Instead of hardcoding tech-specific prompts, generate scenarios dynamically as a JSON file. Prompts should describe real-world apps people want to build using user stories — no tech name-dropping. Let the plugin figure out what Vercel tech to inject.
```json
[
  {
    "slug": "pet-adoption-board",
    "prompt": "Build me a pet adoption listing board where shelters can post animals...",
    "expectedSkills": ["ai-sdk", "nextjs", "shadcn", "vercel-functions"],
    "userStories": [
      "As a visitor, I can see a grid of pet listings with photos and names",
      "As a visitor, I can click a pet card to see a detail page",
      "As a visitor, I can filter pets by type"
    ]
  }
]
```

Each scenario needs: `slug` (string), `prompt` (string), `expectedSkills` (string[]), `userStories` (tuple of exactly 3 strings).
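A minimal validator for this shape might look like the following sketch (run-eval.ts may validate differently):

```typescript
// Sketch: runtime validation of the scenario shape described above.
interface Scenario {
  slug: string;
  prompt: string;
  expectedSkills: string[];
  userStories: [string, string, string]; // exactly 3 user stories
}

function isScenario(x: unknown): x is Scenario {
  if (typeof x !== "object" || x === null) return false;
  const s = x as Record<string, unknown>;
  return (
    typeof s.slug === "string" &&
    typeof s.prompt === "string" &&
    Array.isArray(s.expectedSkills) &&
    s.expectedSkills.every((k) => typeof k === "string") &&
    Array.isArray(s.userStories) &&
    s.userStories.length === 3 &&
    s.userStories.every((u) => typeof u === "string")
  );
}

console.log(isScenario({
  slug: "pet-adoption-board",
  prompt: "Build me a pet adoption listing board...",
  expectedSkills: ["nextjs"],
  userStories: ["a", "b", "c"],
})); // → true
console.log(isScenario({ slug: "x", prompt: "y", expectedSkills: [], userStories: ["a", "b"] })); // → false
```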
"Link the project to my vercel-labs team. After building all files, start the dev server on port 3000 with `npx next dev --port 3000`."

Each phase gets a structured JSON score via `claude -p --json-schema --model haiku --setting-sources ""` running inside the sandbox. This is a separate quick pass — no tools, no hooks — it just reads the phase output and returns structured data.
Build score (Phase 1):

```json
{
  "completeness": "complete|partial|minimal|empty",
  "hasApiRoutes": true,
  "hasUIComponents": true,
  "hasAIFeature": true,
  "devServerRunning": true,
  "missingFeatures": ["feature1"],
  "summary": "Brief assessment"
}
```

Verify score (Phase 2):

```json
{
  "stories": [
    { "index": 1, "status": "pass|fail", "reason": "Evidence from output" }
  ]
}
```

Deploy score (Phase 3):

```json
{
  "deployed": true,
  "url": "https://xxx.vercel.app",
  "buildSucceeded": true,
  "errors": [],
  "summary": "Brief assessment"
}
```

Important: The `claude -p --output-format json` response wraps results — the actual schema data is in `parsed.structured_output`, not the top-level object.
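A sketch of unwrapping that envelope — the envelope fields other than `structured_output` are illustrative, not the documented response shape:

```typescript
// Sketch: pulling the structured score out of `claude -p --output-format json` stdout.
// The schema data lives in `structured_output`, not at the top level.
interface ClaudeJsonEnvelope {
  structured_output?: unknown;
  [key: string]: unknown;
}

function extractScore<T>(stdout: string): T {
  const parsed = JSON.parse(stdout) as ClaudeJsonEnvelope;
  if (parsed.structured_output === undefined) {
    throw new Error("no structured_output in claude -p response");
  }
  return parsed.structured_output as T;
}

// Example envelope (illustrative shape)
const stdout = JSON.stringify({
  result: "free-text summary",
  structured_output: { deployed: true, url: "https://xxx.vercel.app", buildSucceeded: true, errors: [], summary: "ok" },
});
const score = extractScore<{ deployed: boolean; url: string }>(stdout);
console.log(score.deployed, score.url); // → true https://xxx.vercel.app
```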
| Property | Value |
|---|---|
| Home directory | /home/vercel-sandbox (NOT /home/user/ or /root/) |
| User | vercel-sandbox (NOT root) |
| Claude binary | /home/vercel-sandbox/.global/npm/bin/claude |
| PATH (via sh -c) | Includes ~/.global/npm/bin — claude findable by name |
| Port exposure | sandbox.domain(3000) → https://subdomain.vercel.run |
| Snapshot persistence | Files AND npm globals survive snapshot restore — use sandbox.snapshot() → Sandbox.create({ source: { type: "snapshot", snapshotId } }) |
| SDK version | @vercel/sandbox@1.8.0 (v2 beta's named sandbox endpoint returns 404 for this team) |
| Team tier | Enterprise (vercel-labs) — no known sandbox time cap |
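The snapshot/restore flow from the table above can be sketched as follows — `SandboxLike` and `SandboxFactory` are structural stand-ins (assumptions) for the parts of `@vercel/sandbox` v1.8.0 used here, so the flow can be exercised without real credentials:

```typescript
// Sketch of snapshot → restore. Snapshotting STOPS the source sandbox,
// so always continue work in the sandbox created from the snapshot.
interface SandboxLike {
  snapshot(): Promise<{ snapshotId: string }>;
}
interface SandboxFactory {
  create(opts: { source: { type: "snapshot"; snapshotId: string } }): Promise<SandboxLike>;
}

async function snapshotAndRestore(sandbox: SandboxLike, Sandbox: SandboxFactory): Promise<SandboxLike> {
  const { snapshotId } = await sandbox.snapshot(); // source sandbox stops here
  // Files AND npm globals survive into the restored sandbox
  return Sandbox.create({ source: { type: "snapshot", snapshotId } });
}

// Exercise the flow with fakes standing in for the real SDK
const fakeFactory: SandboxFactory = {
  create: async (opts) => {
    console.log("restored from", opts.source.snapshotId); // prints "restored from snap-1"
    return { snapshot: async () => ({ snapshotId: "snap-2" }) };
  },
};
const restored = await snapshotAndRestore(
  { snapshot: async () => ({ snapshotId: "snap-1" }) },
  fakeFactory,
);
```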
- `sandbox.snapshot()` preserves files AND npm globals. Use it after build to create a restore point before verify/deploy. Note: snapshotting stops the source sandbox — create a new one from the snapshot to continue.
- Install the plugin with `npx add-plugin <path> -s project -y --target claude-code` — works because `claude` is in PATH after `npm install -g`. The `--target claude-code` flag is required because add-plugin can't auto-detect Claude Code without an initialized `~/.claude/` dir.
- Write file content with `sandbox.writeFiles([{ path, content: Buffer }])` — NOT `runCommand` heredocs. Heredocs with special characters cause 400 errors from the sandbox API.
- Run Claude with `--dangerously-skip-permissions --debug`. The `--debug` flag writes to `~/.claude/debug/`.
- The Anthropic key comes from the macOS Keychain (`ANTHROPIC_AUTH_TOKEN` — a `vck_*` Vercel Claude Key for AI Gateway), the Vercel token from `~/.local/share/com.vercel.cli/auth.json` (a `vca_*` token).
- Run `npx vercel link --scope vercel-labs -y` + `npx vercel env pull` once before first use.
- Pass `ports: [3000]` in `Sandbox.create()` to get a public URL immediately via `sandbox.domain(3000)`. Works on v1.8.0 — the URL is assigned at creation time, before anything listens.
- Use `sandbox.extendTimeout(ms)` to keep sandboxes alive past their initial timeout. Verified working — extends by the requested duration. Use this for overnight keep-alive.
- `runCommand` with backgrounded processes (`&` or `nohup`) may throw ZodError on v1. Write a script file first, then execute it.
- The `session-end-cleanup.mjs` hook deletes `/tmp/vercel-plugin-*-seen-skills.d/` on session end. Extract artifacts BEFORE the session completes, or rely on poll history data.
- `npm install -g agent-browser` so Claude Code can use it for browser-based verification inside the sandbox.
- `claude -p --json-schema --output-format json --model haiku` works for structured scoring passes. No nesting issue when running inside a sandbox (it only fails when running Claude inside Claude on the same machine).
- Use timestamped project names (e.g. `pet-adoption-board-202603101853`) to avoid collisions when linking to vercel-labs team projects. These are demo projects — we generate many per day. Format: `<slug>-<YYYYMMDDHHMM>`.

| | benchmark-agents (WezTerm) | benchmark-sandbox |
|---|---|---|
| Environment | Local macOS terminal panes | Remote Vercel Sandboxes (Amazon Linux) |
| Parallelism | Limited by local resources | Up to 10 (Hobby) or 2,000 (Pro) concurrent |
| Session type | Interactive TTY via /bin/zsh -ic | Direct sh -c invocation (PTY not required) |
| Artifact access | Direct filesystem (~/.claude/debug/) | sandbox.readFile() / poll via runCommand |
| Port exposure | localhost:3000 | Public https://sb-XXX.vercel.run URLs |
| Verification | Manual browser check | Automated agent-browser in Phase 2 |
| Deploy | Manual | Automated Phase 3 → permanent *.vercel.app URLs |
| Scoring | Manual review | Haiku structured JSON scoring per phase |
| Best for | Manual eval + iteration loop | Automated parallel coverage + verification + deploy runs |
- `Sandbox.create({ runtime: "node24", ports: [3000], env: { ANTHROPIC_API_KEY, ... } })` — no snapshot
- `npm install -g @anthropic-ai/claude-code vercel agent-browser` (~20s per sandbox)
- Vercel CLI auth written to `~/.local/share/com.vercel.cli/auth.json`
- `sandbox.writeFiles()` for 80 plugin files, then `npx add-plugin`
- Dev server via `npx next dev --port 3000`
- `sandbox.extendTimeout()` for verify + deploy + keep-alive
- `agent-browser` to test user stories (20 min timeout). Prompt tells Claude to start the dev server itself if not running.
- `vercel link` + `vercel deploy`, fixes build errors (30 min timeout)
- `result.json` written immediately on completion (survives crashes)
- `source.tar.gz` of project files saved locally

```
Sandbox.create({ runtime: "node24", ports: [3000], env: { ANTHROPIC_API_KEY, ANTHROPIC_BASE_URL, VERCEL_PLUGIN_LOG_LEVEL: "trace" } })
│
├─ npm install -g @anthropic-ai/claude-code vercel agent-browser (~20s)
├─ Write Vercel CLI auth token to ~/.local/share/com.vercel.cli/auth.json
├─ mkdir -p /home/vercel-sandbox/<slug> && npm init -y
├─ sandbox.writeFiles() → /home/vercel-sandbox/vercel-plugin/ (80 files, ~945KB)
├─ npx add-plugin /home/vercel-sandbox/vercel-plugin -s project -y --target claude-code
│
├─ Phase 1: BUILD
│  ├─ sandbox.writeFiles() → /tmp/prompt.txt
│  ├─ claude --dangerously-skip-permissions --debug --settings <path> "$(cat /tmp/prompt.txt)"
│  │  (with AbortSignal.timeout(TIMEOUT_MS))
│  ├─ Poll every 20s:
│  │  ├─ ls /tmp/vercel-plugin-*-seen-skills.d/ (claimed skills)
│  │  ├─ cat /tmp/vercel-plugin-*-seen-skills.txt (seen skills snapshot)
│  │  ├─ find ~/.claude/debug -type f (debug log count)
│  │  ├─ find <project> -newer /tmp/prompt.txt (new project files)
│  │  └─ curl localhost:3000 (port status)
│  ├─ Extract build artifacts
│  └─ Haiku build score (structured JSON)
│
├─ Start dev server (if not already running)
├─ sandbox.extendTimeout(...)
│
├─ Phase 2: VERIFY (if >1 project file exists)
│  ├─ sandbox.writeFiles() → /tmp/verify.txt (agent-browser verification prompt)
│  ├─ claude --dangerously-skip-permissions --debug "$(cat /tmp/verify.txt)"
│  │  (with AbortSignal.timeout(1_200_000) — 20 min)
│  ├─ Re-extract skills (verify phase triggers more)
│  └─ Haiku verify score (per-story pass/fail JSON)
│
├─ Phase 3: DEPLOY (if >3 project files)
│  ├─ sandbox.writeFiles() → /tmp/deploy.txt
│  ├─ claude --dangerously-skip-permissions --debug "$(cat /tmp/deploy.txt)"
│  │  (links to vercel-labs, deploys, fixes build errors up to 3x)
│  ├─ Extract deploy URL from output (*.vercel.app)
│  ├─ Re-extract skills (deploy phase triggers more)
│  └─ Haiku deploy score (structured JSON)
│
├─ Write <slug>/result.json immediately (crash-safe)
├─ Update aggregate results.json (complete: false until all done)
├─ Extract source.tar.gz
└─ sandbox.stop() (skipped if --keep-alive)
```

The verify phase is the "closer" — its job is to make the app work and prove it. Key behaviors:
- Checks `localhost:3000` and runs `npx next dev --port 3000` if needed
- agent-browser loop: `open` → `wait --load networkidle` → `screenshot --annotate` → `snapshot -i` → interact → fix → re-verify
- Emits `STORY_1: PASS`-style verdicts so pass/fail can be parsed from free text

The deploy phase uses a full Claude Code session (for skill tracking) to:
- `vercel link --yes --scope vercel-labs --project <slug>-YYYYMMDD`
- `vercel deploy --yes`
- Run without a `VERCEL_TOKEN` env var so the CLI falls back to `~/.local/share/com.vercel.cli/auth.json`

The deploy URL is extracted by regex from Claude's output, with haiku as a fallback URL extractor.
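A minimal sketch of that regex-first extraction — the actual pattern in run-eval.ts is not shown here, so this is an assumed approximation:

```typescript
// Sketch: pull the first *.vercel.app URL out of Claude's free-text output.
// When this returns null, the haiku scoring step serves as the fallback extractor.
function extractDeployUrl(output: string): string | null {
  const match = output.match(/https:\/\/[a-z0-9-]+(?:\.[a-z0-9-]+)*\.vercel\.app/i);
  return match ? match[0] : null;
}

console.log(extractDeployUrl("Deployed! https://pet-adoption-board-202603101853.vercel.app is live"));
// → https://pet-adoption-board-202603101853.vercel.app
console.log(extractDeployUrl("build failed")); // → null
```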
Same rules as benchmark-agents, plus sandbox-specific:
- Never use `claude --print` or the `-p` flag for BUILD/VERIFY/DEPLOY phases — hooks don't fire without tool-calling sessions (use `-p` only for haiku scoring passes)
- Never inject env vars by writing files with `writeFiles()` — use `Sandbox.create({ env: { ... } })`
- Never use `runCommand` heredocs to write file content — use `sandbox.writeFiles()` instead
- Never assume `/home/user/` exists — the home dir is `/home/vercel-sandbox/`
- Always suffix project names with `-YYYYMMDDHHMM` to avoid collisions across runs

```bash
# One-time setup: link project for OIDC sandbox auth
npx vercel link --scope vercel-labs -y
npx vercel env pull .env.local

# Auth (auto-resolved from macOS Keychain + Vercel CLI auth):
# - ANTHROPIC_API_KEY: from Keychain "ANTHROPIC_AUTH_TOKEN" (vck_* key) or env var
# - VERCEL_TOKEN: from ~/.local/share/com.vercel.cli/auth.json (vca_* token) or env var
# - ANTHROPIC_BASE_URL: defaults to https://ai-gateway.vercel.sh
```

```bash
# Generate scenarios as JSON, then run
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/my-scenarios.json

# With all phases + keep-alive for overnight
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/scenarios.json --keep-alive --keep-hours 8

# Build-only, no verification or deploy
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/scenarios.json --skip-verify --skip-deploy

# Filter to specific slugs from file or defaults
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios splitwise-clone,calendly-clone
```

The orchestrator prints live status. For manual checks on a running sandbox:
```typescript
// List claimed skills
const claims = await sandbox.runCommand("sh", ["-c",
  "ls /tmp/vercel-plugin-*-seen-skills.d/ 2>/dev/null"
]);

// Check hook firing count
const hooks = await sandbox.runCommand("sh", ["-c",
  "find /home/vercel-sandbox/.claude/debug -name '*.txt' -exec grep -c 'executePreToolHooks' {} +"
]);

// Check port 3000
const port = await sandbox.runCommand("sh", ["-c",
  "curl -s -o /dev/null -w '%{http_code}' http://localhost:3000"
]);

// Get public URL (after ports: [3000] in Sandbox.create)
const url = sandbox.domain(3000);
```

Results are written to `~/dev/vercel-plugin-testing/sandbox-results/<run-id>/`:
```
<run-id>/
  results.json        # Aggregate results (complete: false until all done, then true)
  report.md           # Markdown report with scores, coverage, URLs
  <slug>/
    result.json       # Per-scenario result (written immediately on completion)
    source.tar.gz     # Project source archive
```

Each scenario result includes:
- `slug`, `sandboxId`, `success`, `durationMs`
- `claimedSkills[]`, `expectedSkills[]`, `projectFiles[]`
- `appUrl` — public `https://sb-XXX.vercel.run` URL (sandbox lifetime only)
- `deployUrl` — permanent `https://xxx.vercel.app` URL (if deploy succeeded)
- `pollHistory[]` — timestamped skill/file/port snapshots
- `verification` — `{ ran, exitCode, stories: [{ index, status }], output }`
- `buildScore` — haiku structured completeness assessment
- `deployScore` — haiku structured deploy assessment

The markdown report (`report.md` / `.reports/<timestamp>.md`) includes:
Across 34 scenarios run in 5 batches:
| Metric | Best | Typical |
|---|---|---|
| Skills per scenario | 31 (ai-interior-designer) | 12-24 |
| Expected skill coverage | 100% (pet-adoption-board 4/4, apartment-hunting-copilot 7/7, splitwise-clone 6/6) | 50-86% |
| User stories verified | 3/3 PASS (ai-dream-journal, ai-gift-finder, ai-resume-roaster, ai-music-mood-radio, team-standup-bot, pet-adoption-board) | varies |
| Files built per scenario | 37 (student-study-groups) | 6-25 |
| Build time | 5-11 min | 5-7 min |
Key findings:
- `ai-sdk`, `shadcn`, `nextjs`, `vercel-functions` are the most consistently detected skills
- `cron-jobs`, `routing-middleware` need Claude to write specific file patterns to trigger
- `session-end-cleanup` deletes claim dirs — use poll history for final skill counts
- `sandbox.snapshot()` stops the original sandbox. Create a new sandbox from the snapshot to continue. Files and npm globals DO survive.
- `@vercel/sandbox@2.0.0-beta.3`'s named sandbox endpoint returns 404 for this team. Stick with v1.8.0.
- Extract everything before `sandbox.stop()` — the filesystem is ephemeral. The session cleanup hook may delete claim dirs before extraction.
- The sandbox user is `vercel-sandbox` (home at `/home/vercel-sandbox/`). NOT `/home/user/` or `/root/`.
- `--dangerously-skip-permissions` parity: sandbox evals auto-approve all tool calls, while WezTerm evals use the normal permission flow. Coverage results may differ.
- `runCommand` timeout: use `{ signal: AbortSignal.timeout(ms) }` — the `{ timeout }` option is silently ignored.
- Deploy URL extraction depends on a `*.vercel.app` URL appearing in the output; the haiku scoring step provides a fallback URL extraction attempt.