# Benchmark E2E

Single-command pipeline that creates projects, exercises skill injection via claude --print, launches dev servers, verifies they work, analyzes conversation logs, and generates actionable improvement reports.

## Quick Start

```shell
# Full suite (9 projects, ~2-3 hours)
bun run scripts/benchmark-e2e.ts

# Quick mode (first 3 projects, ~30-45 min)
bun run scripts/benchmark-e2e.ts --quick
```

Options:

| Flag | Description | Default |
| --- | --- | --- |
| `--quick` | Run only the first 3 projects | `false` |
| `--base <path>` | Override base directory | `~/dev/vercel-plugin-testing` |
| `--timeout <ms>` | Per-project timeout (forwarded to runner) | `900000` (15 min) |
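
The flag handling can be sketched as follows. This is a hypothetical parser, not the script's actual implementation — only the flag names and defaults come from the table above; `BenchmarkOptions` and `parseArgs` are illustrative names.

```typescript
// Illustrative sketch of parsing the three documented flags,
// with the defaults listed in the options table.
interface BenchmarkOptions {
  quick: boolean;
  base: string;
  timeoutMs: number;
}

function parseArgs(argv: string[]): BenchmarkOptions {
  const opts: BenchmarkOptions = {
    quick: false,
    base: "~/dev/vercel-plugin-testing",
    timeoutMs: 900_000, // 15 minutes
  };
  for (let i = 0; i < argv.length; i++) {
    if (argv[i] === "--quick") opts.quick = true;
    else if (argv[i] === "--base") opts.base = argv[++i];
    else if (argv[i] === "--timeout") opts.timeoutMs = Number(argv[++i]);
  }
  return opts;
}
```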

## Pipeline Stages

The orchestrator chains four stages sequentially, aborting on failure:

  1. runner — Creates test dirs, installs plugin, runs claude --print with VERCEL_PLUGIN_LOG_LEVEL=trace
  2. verify — Detects package manager, launches dev server, polls until it returns HTTP 200 with a non-empty HTML body
  3. analyze — Matches JSONL sessions to projects via run-manifest.json, extracts metrics
  4. report — Generates report.md and report.json with scorecards and recommendations
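
The chaining behavior above can be sketched like this. The `Stage` shape and `runPipeline` are illustrative stand-ins, not the orchestrator's actual API; in the real pipeline each stage spawns a script and the exit code comes from the child process.

```typescript
// Minimal sketch of sequential stage chaining with abort-on-failure:
// each stage runs in order, and the first non-zero exit code stops
// the pipeline and reports which stage failed.
interface Stage {
  name: string;
  run: () => Promise<number>; // resolves to an exit code
}

async function runPipeline(
  stages: Stage[],
): Promise<{ ok: boolean; failedStage?: string }> {
  for (const stage of stages) {
    const exitCode = await stage.run();
    if (exitCode !== 0) return { ok: false, failedStage: stage.name };
  }
  return { ok: true };
}
```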

## Contracts

### run-manifest.json

Written by the runner at <base>/results/run-manifest.json. Links all downstream stages to the same run.

```typescript
interface BenchmarkRunManifest {
  runId: string;           // UUID for this pipeline run
  timestamp: string;       // ISO 8601
  baseDir: string;         // Absolute path to base directory
  projects: Array<{
    slug: string;          // e.g. "01-recipe-platform"
    cwd: string;           // Absolute path to project dir
    promptHash: string;    // SHA hash of the prompt text
    expectedSkills: string[];
  }>;
}
```

The analyzer and verifier read this manifest to correlate sessions precisely instead of guessing from directory listings.
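
The correlation step can be sketched as an exact match on `cwd`. The project fields mirror `BenchmarkRunManifest` above; the `SessionLog` shape and `matchSessions` are assumptions for illustration.

```typescript
// Sketch: match each manifest project to its session by exact cwd,
// rather than guessing from directory listings.
interface ManifestProject {
  slug: string;
  cwd: string;
  expectedSkills: string[];
}

interface SessionLog {
  cwd: string;
  injectedSkills: string[];
}

function matchSessions(
  projects: ManifestProject[],
  sessions: SessionLog[],
): Array<{ slug: string; session: SessionLog | null }> {
  return projects.map((p) => ({
    slug: p.slug,
    session: sessions.find((s) => s.cwd === p.cwd) ?? null,
  }));
}
```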

### events.jsonl

The orchestrator writes NDJSON events to <base>/results/events.jsonl tracking pipeline lifecycle:

```jsonc
// Each line is one JSON object:
{ "stage": "pipeline", "event": "start", "timestamp": "...", "data": { "baseDir": "...", "quick": false } }
{ "stage": "runner",   "event": "start", "timestamp": "...", "data": { "script": "...", "args": [...] } }
{ "stage": "runner",   "event": "complete", "timestamp": "...", "data": { "exitCode": 0, "durationMs": 120000 } }
// On failure:
{ "stage": "verify",   "event": "error", "timestamp": "...", "data": { "exitCode": 1, "durationMs": 5000, "slug": "04-conference-tickets" } }
{ "stage": "pipeline", "event": "abort", "timestamp": "...", "data": { "failedStage": "verify", "exitCode": 1, "slug": "04-conference-tickets" } }
```
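
Serializing one such line can be sketched as below. Only serialization is shown — the real orchestrator appends each line to `<base>/results/events.jsonl`; `toEventLine` is an illustrative name, and the field shapes follow the examples above.

```typescript
// Sketch: build one NDJSON event line in the documented shape.
interface PipelineEvent {
  stage: string;
  event: string;
  timestamp: string;
  data: Record<string, unknown>;
}

function toEventLine(
  stage: string,
  event: string,
  data: Record<string, unknown>,
): string {
  const e: PipelineEvent = {
    stage,
    event,
    timestamp: new Date().toISOString(),
    data,
  };
  return JSON.stringify(e) + "\n"; // one JSON object per line
}
```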

### report.json

Machine-readable report at <base>/results/report.json for programmatic consumption:

```typescript
interface ReportJson {
  runId: string | null;
  timestamp: string;
  verdict: "pass" | "partial" | "fail";
  gaps: Array<{
    slug: string;
    expected: string[];
    actual: string[];
    missing: string[];
  }>;
  recommendations: string[];
  suggestedPatterns: Array<{
    skill: string;   // Skill that was expected but not injected
    glob: string;    // Suggested pathPattern glob
    tool: string;    // Tool name that should trigger injection
  }>;
}
```
<imports>
</imports>
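
A small consumer might flatten the gaps into readable lines, e.g. for a chat summary or CI log. The `Gap` shape mirrors `ReportJson.gaps` above; `summarizeGaps` itself is illustrative, not part of the report contract.

```typescript
// Sketch: turn the gaps array from report.json into one line per
// project that is still missing expected skills.
interface Gap {
  slug: string;
  expected: string[];
  actual: string[];
  missing: string[];
}

function summarizeGaps(gaps: Gap[]): string[] {
  return gaps
    .filter((g) => g.missing.length > 0)
    .map((g) => `${g.slug}: missing ${g.missing.join(", ")}`);
}
```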

## Overnight Automation Loop

Run the pipeline repeatedly with a cooldown between iterations:

```shell
while true; do
  bun run scripts/benchmark-e2e.ts
  sleep 3600
done
```

Each run produces timestamped report.json and report.md files. Compare across runs to track improvement.

## Self-Improvement Cycle

The pipeline enables a closed feedback loop:

  1. Run — `bun run scripts/benchmark-e2e.ts` exercises the plugin against realistic projects
  2. Read gaps — `report.json` lists which skills were expected but never injected, with exact slugs
  3. Apply fixes — Use `suggestedPatterns` entries (copy-pasteable YAML) to add missing frontmatter patterns; use `recommendations` to fix hook logic
  4. Re-run — Execute the pipeline again to verify the gaps are closed
  5. Compare — Diff `report.json` across runs: the verdict should trend from `"fail"` → `"partial"` → `"pass"`

For overnight automation, combine with the loop above. Wake up to reports showing exactly what improved and what still needs work.
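
The cross-run comparison in step 5 can be sketched as an ordering over verdicts. The ordering `fail < partial < pass` comes from the cycle above; `verdictImproved` is an illustrative helper, not part of the pipeline.

```typescript
// Sketch: compare verdicts from two report.json files to check the
// fail → partial → pass trend.
const VERDICT_ORDER = { fail: 0, partial: 1, pass: 2 } as const;
type Verdict = keyof typeof VERDICT_ORDER;

function verdictImproved(prev: Verdict, next: Verdict): boolean {
  return VERDICT_ORDER[next] > VERDICT_ORDER[prev];
}
```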

## Prompt Table

Prompts never name specific technologies — they describe the product and features, letting the plugin infer which skills to inject.

| # | Slug | Expected Skills |
| --- | --- | --- |
| 01 | recipe-platform | auth, vercel-storage, nextjs |
| 02 | trivia-game | vercel-storage, nextjs |
| 03 | code-review-bot | ai-sdk, nextjs |
| 04 | conference-tickets | payments, email, auth |
| 05 | content-aggregator | cron-jobs, ai-sdk |
| 06 | finance-tracker | cron-jobs, email |
| 07 | multi-tenant-blog | routing-middleware, cms, auth |
| 08 | status-page | cron-jobs, vercel-storage, observability |
| 09 | dog-walking-saas | payments, auth, vercel-storage, env-vars |

## Cleanup

```shell
rm -rf ~/dev/vercel-plugin-testing
```

Repository: vercel/vercel-plugin
