benchmark-sandbox

Run vercel-plugin eval scenarios in Vercel Sandboxes instead of local WezTerm panels. Provisions ephemeral microVMs with Claude Code + plugin pre-installed, runs benchmark prompts, extracts hook artifacts, and produces coverage reports.

2.09x

Quality

61%

Does it follow best practices?

Impact

92%

2.09x

Average score across 3 eval scenarios

Securityby

Advisory

Suggest reviewing before use

Optimize this skill with Tessl

npx tessl skill review --optimize ./.claude/skills/benchmark-sandbox/SKILL.md

Benchmark Sandbox — Remote Eval via Vercel Sandboxes

Name: benchmark-sandbox
Rating: 73.6 (1 reviews)
Author: vercel

Run benchmark scenarios inside Vercel Sandboxes — ephemeral Firecracker microVMs with node24. Each sandbox gets a fresh Claude Code + Vercel CLI + agent-browser install, the local vercel-plugin uploaded, and runs a 3-phase eval pipeline:

Phase 1 (BUILD): Claude Code builds the app with --dangerously-skip-permissions --debug
Phase 2 (VERIFY): A follow-up Claude Code session uses agent-browser to walk through user stories, fixing issues until all pass (20 min timeout)
Phase 3 (DEPLOY): A third Claude Code session links to vercel-labs, runs vercel deploy, and fixes build errors (up to 3 retries). Deployed apps have deployment protection enabled by default.

Skills are tracked across all 3 phases — each phase may trigger additional skill injections as new files/patterns are created. After each phase, a haiku structured scoring step (claude -p --json-schema --model haiku) evaluates the results as structured JSON.

Proven Working Script

Use run-eval.ts — the proven eval runner:

# Run default scenarios with full 3-phase pipeline
bun run .claude/skills/benchmark-sandbox/run-eval.ts

# With dynamic scenarios from a JSON file (recommended — see "Dynamic Scenarios" below)
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/my-scenarios.json

# Keep sandboxes alive overnight with public URLs
bun run .claude/skills/benchmark-sandbox/run-eval.ts --keep-alive --keep-hours 8

# Build-only (skip verification and deploy)
bun run .claude/skills/benchmark-sandbox/run-eval.ts --skip-verify --skip-deploy

# Run specific scenarios by slug
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios splitwise-clone,calendly-clone

CLI Flags

Flag	Default	Description
`--concurrency N`	5	Max parallel sandboxes (max 10)
`--timeout MS`	1800000 (30 min)	Per-phase timeout in ms
`--keep-alive`	off	Keep sandboxes running after eval
`--keep-hours N`	8	Hours to keep alive (with `--keep-alive`)
`--skip-verify`	off	Skip the agent-browser verification phase
`--skip-deploy`	off	Skip the Vercel deploy phase
`--scenarios a,b,c`	all	Only run specific scenarios by slug
`--scenarios-file path`	—	Load scenarios from a JSON file instead of built-in defaults

Dynamic Scenarios (Recommended Approach)

Instead of hardcoding tech-specific prompts, generate scenarios dynamically as a JSON file. Prompts should describe real-world apps people want to build using user stories — no tech name-dropping. Let the plugin figure out what Vercel tech to inject.

Scenario JSON Format

[
  {
    "slug": "pet-adoption-board",
    "prompt": "Build me a pet adoption listing board where shelters can post animals...",
    "expectedSkills": ["ai-sdk", "nextjs", "shadcn", "vercel-functions"],
    "userStories": [
      "As a visitor, I can see a grid of pet listings with photos and names",
      "As a visitor, I can click a pet card to see a detail page",
      "As a visitor, I can filter pets by type"
    ]
  }
]

Each scenario needs: slug (string), prompt (string), expectedSkills (string[]), userStories (tuple of exactly 3 strings).

Prompt Design Guidelines

Focus on what the user wants, not what tech to use
Describe real-world apps that solve real problems with friendly, stylish UX
Include AI features naturally (recommendations, analysis, generation)
Always end with: "Link the project to my vercel-labs team. After building all files, start the dev server on port 3000 with \npx next dev --port 3000`."`
Include storage needs (photos, uploads) to trigger vercel-storage
Include scheduled tasks (reminders, cleanup) to trigger cron-jobs
Include auth/middleware to trigger routing-middleware

Structured Scoring (Haiku)

Each phase gets a structured JSON score via claude -p --json-schema --model haiku --setting-sources "" running inside the sandbox. This is a separate quick pass — no tools, no hooks — just reads the phase output and returns structured data.

Build Score Schema

{
  "completeness": "complete|partial|minimal|empty",
  "hasApiRoutes": true,
  "hasUIComponents": true,
  "hasAIFeature": true,
  "devServerRunning": true,
  "missingFeatures": ["feature1"],
  "summary": "Brief assessment"
}

Verify Score Schema (per user story)

{
  "stories": [
    { "index": 1, "status": "pass|fail", "reason": "Evidence from output" }
  ]
}

Deploy Score Schema

{
  "deployed": true,
  "url": "https://xxx.vercel.app",
  "buildSucceeded": true,
  "errors": [],
  "summary": "Brief assessment"
}

Important: The claude -p --output-format json response wraps results — the actual schema data is in parsed.structured_output, not the top-level object.

Critical Sandbox Environment Facts

Property	Value
Home directory	`/home/vercel-sandbox` (NOT `/home/user/` or `/root/`)
User	`vercel-sandbox` (NOT `root`)
Claude binary	`/home/vercel-sandbox/.global/npm/bin/claude`
PATH (via sh -c)	Includes `~/.global/npm/bin` — claude findable by name
Port exposure	`sandbox.domain(3000)` → `https://subdomain.vercel.run`
Snapshot persistence	Files AND npm globals survive snapshot restore — use `sandbox.snapshot()` → `Sandbox.create({ source: { type: "snapshot", snapshotId } })`
SDK version	`@vercel/sandbox@1.8.0` (v2 beta's named sandbox endpoint returns 404 for this team)
Team tier	Enterprise (vercel-labs) — no known sandbox time cap

Key Discoveries (Hard-Won)

Snapshots work: sandbox.snapshot() preserves files AND npm globals. Use it after build to create a restore point before verify/deploy. Note: snapshotting stops the source sandbox — create a new one from the snapshot to continue.
Plugin install: Use npx add-plugin <path> -s project -y --target claude-code — works because claude is in PATH after npm install -g. The --target claude-code flag is required because add-plugin can't auto-detect Claude Code without an initialized ~/.claude/ dir.
File uploads: Use sandbox.writeFiles([{ path, content: Buffer }]) — NOT runCommand heredocs. Heredocs with special characters cause 400 errors from the sandbox API.
Claude flags: Always use --dangerously-skip-permissions --debug. The --debug flag writes to ~/.claude/debug/.
Auth: API key from macOS Keychain (ANTHROPIC_AUTH_TOKEN — a vck_* Vercel Claude Key for AI Gateway), Vercel token from ~/.local/share/com.vercel.cli/auth.json (a vca_* token).
OIDC for sandbox SDK: Run npx vercel link --scope vercel-labs -y + npx vercel env pull once before first use.
Port exposure: Pass ports: [3000] in Sandbox.create() to get a public URL immediately via sandbox.domain(3000). Works on v1.8.0 — URL is assigned at creation time, before anything listens.
extendTimeout: Use sandbox.extendTimeout(ms) to keep sandboxes alive past their initial timeout. Verified working — extends by the requested duration. Use this for overnight keep-alive.
Background commands: runCommand with backgrounded processes (& or nohup) may throw ZodError on v1. Write a script file first, then execute it.
Session cleanup race: The session-end-cleanup.mjs hook deletes /tmp/vercel-plugin-*-seen-skills.d/ on session end. Extract artifacts BEFORE the session completes, or rely on poll history data.
agent-browser works in sandboxes: Install via npm install -g agent-browser. Claude Code can use it for browser-based verification inside the sandbox.
No hobby tier cap: Early 301s timeouts were from lower default timeout values in earlier script iterations, not a tier limitation. Enterprise (vercel-labs) has no known sandbox time cap — sandboxes ran 10+ minutes successfully.
claude -p works inside sandboxes: claude -p --json-schema --output-format json --model haiku works for structured scoring passes. No nesting issue when running inside a sandbox (only fails when running Claude inside Claude on the same machine).
Deploy project naming: ALWAYS use timestamped slugs with minute precision (e.g., pet-adoption-board-202603101853) to avoid collisions when linking to vercel-labs team projects. These are demo projects — we generate many per day. Format: <slug>-<YYYYMMDDHHMM>.

When to Use This vs benchmark-agents

	benchmark-agents (WezTerm)	benchmark-sandbox
Environment	Local macOS terminal panes	Remote Vercel Sandboxes (Amazon Linux)
Parallelism	Limited by local resources	Up to 10 (Hobby) or 2,000 (Pro) concurrent
Session type	Interactive TTY via `/bin/zsh -ic`	Direct `sh -c` invocation (PTY not required)
Artifact access	Direct filesystem (`~/.claude/debug/`)	`sandbox.readFile()` / poll via `runCommand`
Port exposure	`localhost:3000`	Public `https://sb-XXX.vercel.run` URLs
Verification	Manual browser check	Automated agent-browser in Phase 2
Deploy	Manual	Automated Phase 3 → permanent `*.vercel.app` URLs
Scoring	Manual review	Haiku structured JSON scoring per phase
Best for	Manual eval + iteration loop	Automated parallel coverage + verification + deploy runs

How It Works

Create fresh sandbox: Sandbox.create({ runtime: "node24", ports: [3000], env: { ANTHROPIC_API_KEY, ... } }) — no snapshot
Install tools: npm install -g @anthropic-ai/claude-code vercel agent-browser (~20s per sandbox)
Auth Vercel CLI: Write token to ~/.local/share/com.vercel.cli/auth.json
Upload plugin: sandbox.writeFiles() for 80 plugin files, then npx add-plugin
Phase 1 — BUILD: Claude Code builds the app (30 min timeout)
Score build: Haiku evaluates completeness, API routes, UI, AI features
Start dev server: If not already running, start npx next dev --port 3000
Extend timeout: sandbox.extendTimeout() for verify + deploy + keep-alive
Phase 2 — VERIFY: Claude Code uses agent-browser to test user stories (20 min timeout). Prompt tells Claude to start dev server itself if not running.
Score verify: Haiku evaluates each user story as pass/fail with reasons
Re-extract skills: Skills re-collected after verify phase (agent-browser + code fixes trigger more)
Phase 3 — DEPLOY: Claude Code runs vercel link + vercel deploy, fixes build errors (30 min timeout)
Score deploy: Haiku evaluates deploy success, URL extraction, errors
Re-extract skills: Skills re-collected after deploy phase
Write incremental results: Each scenario writes its own result.json immediately on completion (survives crashes)
Extract source archive: source.tar.gz of project files saved locally
Generate report: Markdown report with build/verify/deploy scores, skill coverage, URLs

Sandbox Session Flow (Per Scenario)

Sandbox.create({ runtime: "node24", ports: [3000], env: { ANTHROPIC_API_KEY, ANTHROPIC_BASE_URL, VERCEL_PLUGIN_LOG_LEVEL: "trace" } })
  │
  ├─ npm install -g @anthropic-ai/claude-code vercel agent-browser   (~20s)
  ├─ Write Vercel CLI auth token to ~/.local/share/com.vercel.cli/auth.json
  ├─ mkdir -p /home/vercel-sandbox/<slug> && npm init -y
  ├─ sandbox.writeFiles() → /home/vercel-sandbox/vercel-plugin/  (80 files, ~945KB)
  ├─ npx add-plugin /home/vercel-sandbox/vercel-plugin -s project -y --target claude-code
  │
  ├─ Phase 1: BUILD
  │   ├─ sandbox.writeFiles() → /tmp/prompt.txt
  │   ├─ claude --dangerously-skip-permissions --debug --settings <path> "$(cat /tmp/prompt.txt)"
  │   │   (with AbortSignal.timeout(TIMEOUT_MS))
  │   ├─ Poll every 20s:
  │   │   ├─ ls /tmp/vercel-plugin-*-seen-skills.d/     (claimed skills)
  │   │   ├─ cat /tmp/vercel-plugin-*-seen-skills.txt    (seen skills snapshot)
  │   │   ├─ find ~/.claude/debug -type f                (debug log count)
  │   │   ├─ find <project> -newer /tmp/prompt.txt       (new project files)
  │   │   └─ curl localhost:3000                         (port status)
  │   ├─ Extract build artifacts
  │   └─ Haiku build score (structured JSON)
  │
  ├─ Start dev server (if not already running)
  ├─ sandbox.extendTimeout(...)
  │
  ├─ Phase 2: VERIFY (if >1 project file exists)
  │   ├─ sandbox.writeFiles() → /tmp/verify.txt  (agent-browser verification prompt)
  │   ├─ claude --dangerously-skip-permissions --debug "$(cat /tmp/verify.txt)"
  │   │   (with AbortSignal.timeout(1_200_000) — 20 min)
  │   ├─ Re-extract skills (verify phase triggers more)
  │   └─ Haiku verify score (per-story pass/fail JSON)
  │
  ├─ Phase 3: DEPLOY (if >3 project files)
  │   ├─ sandbox.writeFiles() → /tmp/deploy.txt
  │   ├─ claude --dangerously-skip-permissions --debug "$(cat /tmp/deploy.txt)"
  │   │   (links to vercel-labs, deploys, fixes build errors up to 3x)
  │   ├─ Extract deploy URL from output (*.vercel.app)
  │   ├─ Re-extract skills (deploy phase triggers more)
  │   └─ Haiku deploy score (structured JSON)
  │
  ├─ Write <slug>/result.json immediately (crash-safe)
  ├─ Update aggregate results.json (complete: false until all done)
  ├─ Extract source.tar.gz
  └─ sandbox.stop()  (skipped if --keep-alive)

Verification Phase Details

The verify phase is the "closer" — its job is to make the app work and prove it. Key behaviors:

Always runs if >1 project file exists (no longer gated on port 3000 being up)
Starts dev server itself if not already running — the prompt tells Claude to check localhost:3000 and run npx next dev --port 3000 if needed
20 minute timeout — enough for agent-browser to open pages, screenshot, interact, fix broken code, restart server, and re-verify
Triggers skill injection — the verify session creates/edits files, triggering PreToolUse and PostToolUse hooks
Uses agent-browser workflow: open → wait --load networkidle → screenshot --annotate → snapshot -i → interact → fix → re-verify
Results scored by haiku — no more parsing STORY_1: PASS from free text

Deploy Phase Details

The deploy phase uses a full Claude Code session (for skill tracking) to:

Run vercel link --yes --scope vercel-labs --project <slug>-YYYYMMDD
Run vercel deploy --yes
If build fails, fix code and retry (up to 3 attempts)
Important: unsets VERCEL_TOKEN env var so CLI falls back to ~/.local/share/com.vercel.cli/auth.json
Deployment protection is enabled by default on vercel-labs team

Deploy URL is extracted by regex from Claude's output, with haiku as fallback URL extractor.

DO NOT (Hard Rules)

Same rules as benchmark-agents, plus sandbox-specific:

DO NOT use claude --print or -p flag for BUILD/VERIFY/DEPLOY phases — hooks don't fire without tool-calling sessions (use -p only for haiku scoring passes)
DO NOT let sandboxes run without extracting artifacts — ephemeral filesystem is lost on stop
DO NOT pass API keys via writeFiles() — use Sandbox.create({ env: { ... } })
DO NOT skip snapshotting after build — it's your safety net if verify/deploy kills the sandbox
DO NOT use v2 beta SDK — named sandbox endpoint returns 404 for this team; use v1.8.0
DO NOT use runCommand heredocs to write file content — use sandbox.writeFiles() instead
DO NOT assume /home/user/ exists — the home dir is /home/vercel-sandbox/
DO NOT use simple project names without timestamps — always append -YYYYMMDDHHMM to avoid collisions across runs

Prerequisites

# One-time setup: link project for OIDC sandbox auth
npx vercel link --scope vercel-labs -y
npx vercel env pull .env.local

# Auth (auto-resolved from macOS Keychain + Vercel CLI auth):
# - ANTHROPIC_API_KEY: from Keychain "ANTHROPIC_AUTH_TOKEN" (vck_* key) or env var
# - VERCEL_TOKEN: from ~/.local/share/com.vercel.cli/auth.json (vca_* token) or env var
# - ANTHROPIC_BASE_URL: defaults to https://ai-gateway.vercel.sh

Commands

Run eval with dynamic scenarios (recommended)

# Generate scenarios as JSON, then run
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/my-scenarios.json

# With all phases + keep-alive for overnight
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/scenarios.json --keep-alive --keep-hours 8

# Build-only, no verification or deploy
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/scenarios.json --skip-verify --skip-deploy

# Filter to specific slugs from file or defaults
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios splitwise-clone,calendly-clone

Monitoring While Running

The orchestrator prints live status. For manual checks on a running sandbox:

// List claimed skills
const claims = await sandbox.runCommand("sh", ["-c",
  "ls /tmp/vercel-plugin-*-seen-skills.d/ 2>/dev/null"
]);

// Check hook firing count
const hooks = await sandbox.runCommand("sh", ["-c",
  "find /home/vercel-sandbox/.claude/debug -name '*.txt' -exec grep -c 'executePreToolHooks' {} +"
]);

// Check port 3000
const port = await sandbox.runCommand("sh", ["-c",
  "curl -s -o /dev/null -w '%{http_code}' http://localhost:3000"
]);

// Get public URL (after ports: [3000] in Sandbox.create)
const url = sandbox.domain(3000);

Artifact Export Layout

Results are written to ~/dev/vercel-plugin-testing/sandbox-results/<run-id>/:

<run-id>/
  results.json             # Aggregate results (complete: false until all done, then true)
  report.md                # Markdown report with scores, coverage, URLs
  <slug>/
    result.json            # Per-scenario result (written immediately on completion)
    source.tar.gz          # Project source archive

Each scenario result includes:

slug, sandboxId, success, durationMs
claimedSkills[], expectedSkills[], projectFiles[]
appUrl — public https://sb-XXX.vercel.run URL (sandbox lifetime only)
deployUrl — permanent https://xxx.vercel.app URL (if deploy succeeded)
pollHistory[] — timestamped skill/file/port snapshots
verification — { ran, exitCode, stories: [{ index, status }], output }
buildScore — haiku structured completeness assessment
deployScore — haiku structured deploy assessment

The markdown report (report.md / .reports/<timestamp>.md) includes:

Summary table — slug, build status, skills, files, verify results, deploy URL, duration
Per-scenario details — build score, deploy score, verification per-story pass/fail
Skill coverage — expected vs actual per scenario, missing/bonus breakdown
Total unique skills across all scenarios

Proven Results (2026-03-10)

Across 34 scenarios run in 5 batches:

Metric	Best	Typical
Skills per scenario	31 (ai-interior-designer)	12-24
Expected skill coverage	100% (pet-adoption-board 4/4, apartment-hunting-copilot 7/7, splitwise-clone 6/6)	50-86%
User stories verified	3/3 PASS (ai-dream-journal, ai-gift-finder, ai-resume-roaster, ai-music-mood-radio, team-standup-bot, pet-adoption-board)	varies
Files built per scenario	37 (student-study-groups)	6-25
Build time	5-11 min	5-7 min

Key findings:

User-story-focused prompts (no tech name-dropping) work — plugin detects patterns from actual code
ai-sdk, shadcn, nextjs, vercel-functions are the most consistently detected skills
cron-jobs, routing-middleware need Claude to write specific file patterns to trigger
Lexical prompt inject (UserPromptSubmit) working — skills injected before any files written
session-end-cleanup deletes claim dirs — use poll history for final skill counts
Enterprise tier (vercel-labs) — no sandbox time cap; builds ran 10+ minutes

Known Limitations

Snapshot stops the source sandbox: sandbox.snapshot() stops the original sandbox. Create a new sandbox from the snapshot to continue. Files and npm globals DO survive.
v2 beta incompatible: @vercel/sandbox@2.0.0-beta.3's named sandbox endpoint returns 404 for this team. Stick with v1.8.0.
Artifact window: Must extract before sandbox.stop() — filesystem is ephemeral. Session cleanup hook may delete claim dirs before extraction.
Amazon Linux paths: User is vercel-sandbox (home at /home/vercel-sandbox/). NOT /home/user/ or /root/.
--dangerously-skip-permissions parity: Sandbox evals auto-approve all tool calls. WezTerm evals use normal permission flow. Coverage results may differ.
runCommand timeout: Use { signal: AbortSignal.timeout(ms) } — the { timeout } option is silently ignored.
BrotliDecompressionError: Transient Vercel API errors can kill sandbox creation. Retry logic recommended for production runs.
Deploy reliability: Claude Code deploy sessions sometimes fail to output a parseable *.vercel.app URL. The haiku scoring step provides a fallback URL extraction attempt.
Verify timeout: Complex apps may need the full 20 minutes for agent-browser to test all stories. Simpler apps finish in 2-5 minutes.

Repository: vercel/vercel-plugin
Commit: 61f1903

Last updated: 4 days ago
Created: 4 days ago

Is this your skill?

If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.