Verification
Prove your own changes work on real surfaces. The agent that wrote the code must not verify it in the same context.
Before You Start
- Check readiness grade: C+ = proceed; D/F = invoke agent-readiness setup first
- Can you boot the app?
- Can you interact with it? (Playwright CLI for UI, curl for APIs, CLI invocation)
- Can you verify your own work from a fresh evaluator context or separate subagent?
- If not, flag as a readiness gap — don't improvise a one-off check
Checks
Real Surface
- Run the shipped CLI, service, or UI flow with representative inputs
- For UI: Playwright CLI or CDP — inspect behavior, structure, legibility, responsiveness
- For services: hit the real local endpoint, confirm full round trip
- Treat this as stronger evidence than a pile of unit tests that mock the seam under change
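A minimal sketch of a real-surface check, exercising the shipped command rather than a mocked seam. The helper name and the stand-in invocation are illustrative; substitute your project's actual CLI or endpoint call:

```python
import subprocess
import sys

def check_real_surface(cmd, expect_substring):
    """Run the shipped command with representative inputs; fail loudly on drift."""
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    assert result.returncode == 0, f"{cmd} exited {result.returncode}: {result.stderr}"
    assert expect_substring in result.stdout, f"unexpected output: {result.stdout!r}"

# Stand-in invocation so the sketch runs anywhere; replace with your real CLI and args.
check_real_surface([sys.executable, "-c", "print('pong')"], "pong")
print("real-surface check passed")
```

The same shape works for services: swap the subprocess call for a request to the real local endpoint and assert on the full round trip.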
Deterministic Guardrails
- Run the repo's built-in verify entrypoint first when it exists
- Prefer targeted checks over full-suite context floods during iteration
- Prefer integration, contract, smoke, and e2e checks over unit tests that mostly stub dependencies
- Mock-heavy unit tests that control the very behavior being claimed are supporting evidence, not primary proof
- If a deterministic check fails, fix that failure before claiming runtime success
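One way to honor "built-in verify entrypoint first" is a small lookup before improvising checks. The marker table below is a hypothetical heuristic (presence of a marker file does not guarantee the target exists), not a standard convention:

```python
import pathlib

# Hypothetical marker table: prefer the repo's own verify entrypoint when it exists.
VERIFY_ENTRYPOINTS = [
    (["make", "verify"], "Makefile"),
    (["npm", "run", "verify", "--silent"], "package.json"),
    (["./scripts/verify.sh"], "scripts/verify.sh"),
]

def find_verify_entrypoint(repo: pathlib.Path):
    """Return the first plausible verify command, or None to fall back to targeted checks."""
    for cmd, marker in VERIFY_ENTRYPOINTS:
        if (repo / marker).exists():
            return cmd
    return None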
Code Shape
- Review the changed files for clarity, duplication, and maintainability after behavior is proven
- Delete dead code, stale branches, and unused helpers when they no longer protect a real boundary
- Treat `any`, unsafe casts, and boundary-leaking `unknown` as verification failures unless explicitly allowed
- Prefer parsing external data once at the boundary over scattering validation and casts through core logic
- Classify failures intentionally: validation, not-found, auth, dependency, and programmer errors should not collapse into one vague catch-all
- Prefer user-facing errors that explain what happened and what to try next, while preserving richer diagnostics for logs or operators
- Prefer matching existing language/framework patterns over inventing a new local style
- Delete comments that only compensate for unclear code; keep only durable context the code cannot express
- Ask whether a fresh agent could extend the changed path without reverse-engineering hidden intent
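The parse-once-at-the-boundary bullet can be sketched concretely. All names here are illustrative; the point is that untrusted data is validated exactly once and core logic only ever sees the typed result:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class User:
    id: int
    name: str

def parse_user(raw: dict) -> User:
    """Parse untrusted input once, at the boundary; core logic only ever sees User."""
    try:
        return User(id=int(raw["id"]), name=str(raw["name"]))
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"invalid user payload: {exc!r}") from exc

print(parse_user({"id": "42", "name": "Ada"}))
```

Downstream code never needs casts or re-validation, which is exactly the property a fresh agent can extend without reverse-engineering hidden intent.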
External Contracts
- Verify field names, enums, and response shapes against docs or real responses
- Can't verify a contract detail? Stop and surface the gap
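A sketch of a contract check. The field names and enum values here are hypothetical; in practice they come from the provider's docs or a captured real response, never from memory:

```python
# Hypothetical contract details sourced from docs or a real captured response.
EXPECTED_FIELDS = {"id", "status", "created_at"}
ALLOWED_STATUS = {"pending", "active", "closed"}

def check_contract(response: dict) -> list:
    """Return contract violations; an empty list means the shape matches."""
    problems = [f"missing field: {name}" for name in sorted(EXPECTED_FIELDS - response.keys())]
    status = response.get("status")
    if status is not None and status not in ALLOWED_STATUS:
        problems.append(f"unexpected enum value: {status!r}")
    return problems

print(check_contract({"id": 1, "status": "archived"}))
```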
State and Config
- Verify public interfaces end to end
- Verify persistence/state round trips with real data
- Verify config changes by starting the program with the new config
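A minimal persistence round trip, using a real file and real data rather than an in-memory mock. The state shape is a placeholder:

```python
import json
import pathlib
import tempfile

def save_state(path: pathlib.Path, state: dict) -> None:
    path.write_text(json.dumps(state))

def load_state(path: pathlib.Path) -> dict:
    return json.loads(path.read_text())

# Round trip through the real serialization boundary, then compare exactly.
state = {"cursor": 1042, "seen_ids": [7, 9, 13]}
path = pathlib.Path(tempfile.mkdtemp()) / "state.json"
save_state(path, state)
assert load_state(path) == state
print("state round trip ok")
```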
Failure Quality
- Exercise at least one real failure path when the change touches validation, IO, auth, network, or external dependencies
- Confirm the surfaced error is actionable: clear cause, stable classification or code where appropriate, and a useful recovery step when the user can do something about it
- Reject swallowed failures, vague "something went wrong" responses, and raw internals dumped directly to end users
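The classification and message-quality bullets can be sketched as a small error hierarchy. The class names, codes, and messages are illustrative; the shape to copy is a stable code plus a user-facing message, with raw diagnostics kept for logs:

```python
class AppError(Exception):
    """Carries a stable code, a user-facing message, and operator diagnostics."""
    def __init__(self, code: str, user_message: str, detail: str):
        super().__init__(detail)
        self.code = code
        self.user_message = user_message
        self.detail = detail

class ValidationError(AppError): pass
class DependencyError(AppError): pass

def render_for_user(err: AppError) -> str:
    # Users see cause and next step; raw internals stay in logs for operators.
    return f"[{err.code}] {err.user_message}"

err = ValidationError(
    "invalid_email",
    "That email address looks malformed; check for typos and try again.",
    detail="regex mismatch at offset 12",
)
print(render_for_user(err))
```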
CI Integration
- If the project has CI, push and wait for results before declaring done
- CI failures after verify = verify missed something. Investigate
Smell Test
- Check outputs look plausible to a human
- Investigate anything odd instead of rationalizing it; rationalizing away oddities is the specific failure mode this check exists to catch
- Look for: unexpected empty states, wrong user names, stale data, truncated responses, hardcoded test values in production output
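A rough mechanization of the smell test. The smell list is hypothetical and domain-specific; extend it with whatever implausible values your project tends to leak:

```python
# Hypothetical smell list; extend with your domain's implausible values.
SMELLS = ["test@example.com", "lorem ipsum", "undefined", "[object Object]"]

def smell_check(output: str) -> list:
    """Flag output a human would find implausible instead of rationalizing it."""
    findings = [s for s in SMELLS if s in output]
    if not output.strip():
        findings.append("unexpected empty output")
    return findings

print(smell_check("Welcome back, test@example.com!"))
```

Any finding is a prompt to investigate, not a verdict; the point is to force the "that looks odd" moment instead of skipping it.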
Proof of Work
- Query structured logs, health endpoints, error traces
- Keep screenshots, response logs, traces, sample responses
- Evidence should be reproducible — include exact commands
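One way to keep evidence reproducible is to record the exact invocation alongside its result. The record shape below is a sketch, not a standard format:

```python
import datetime
import subprocess
import sys

def record_evidence(cmd: list) -> dict:
    """Run a check and keep a reproducible record: exact command, exit code, output."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return {
        "command": cmd,                  # the exact invocation, so anyone can re-run it
        "exit_code": result.returncode,
        "stdout": result.stdout[:2000],  # truncate rather than flood the record
        "captured_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }

# Stand-in command so the sketch runs anywhere; substitute a real health check.
evidence = record_evidence([sys.executable, "-c", "print('healthy')"])
print(evidence["command"], evidence["exit_code"])
```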
Evaluator Pattern
Anthropic's GAN-inspired approach: separate generation from evaluation entirely.
Three roles: Planner → Generator → Evaluator
- Evaluator is always independent from the builder
- Uses Playwright/CDP, curl, or the shipped CLI to inspect the live result
- Can reject work and send it back with specific feedback
- Success criteria are defined before running the check
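The generate/evaluate/reject loop can be sketched with stubs. Everything here is a placeholder: a real generator and evaluator are separate agent contexts, and the evaluator's criteria are fixed before the run, not negotiated during it:

```python
def generate(task: str, feedback):
    # Stub builder: a real one runs in its own context and never evaluates itself.
    return "draft-v2" if feedback else "draft-v1"

def evaluate(artifact: str, criteria: dict):
    # Stub evaluator: independent context, success criteria defined up front.
    if artifact in criteria["accepted"]:
        return True, None
    return False, "does not meet the stated criteria; revise and resubmit"

def run_loop(task: str, criteria: dict, max_rounds: int = 3) -> str:
    feedback = None
    for _ in range(max_rounds):
        artifact = generate(task, feedback)
        ok, feedback = evaluate(artifact, criteria)
        if ok:
            return artifact
    raise RuntimeError("evaluator rejected every attempt; escalate to a human")

print(run_loop("build feature", {"accepted": {"draft-v2"}}))
```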
Why it works: LLMs are terrible self-evaluators — they confidently praise mediocre work. Tuning a standalone evaluator to be skeptical is far more tractable than making a generator self-critical.
When to use: complex features, subjective quality, UI work. Not for simple CRUD or config changes.
Evaluator tuning: out of the box, Claude identifies issues then talks itself into approving. Multiple rounds of prompt tuning needed — read evaluator logs, find judgment divergences from human expectations, update QA prompt.
Context Flooding Problem
HumanLayer identified this: running a full test suite floods the context window, causing agents to lose track and hallucinate about test files. Verification must be context-efficient:
- Swallow passing output, only surface errors
- Run targeted subsets (< 30 seconds), not the full suite every iteration
- Use hooks that run silently on success, exit with error only on failure
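A hook of that shape is straightforward to sketch: capture everything, stay silent on success, and surface output only when the check fails. The stand-in command is a placeholder for your real targeted check:

```python
import subprocess
import sys

def quiet_check(cmd: list) -> None:
    """Hook-style check: silent on success, surfaces output only on failure."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Only failures reach the agent's context window.
        sys.stderr.write(result.stdout + result.stderr)
        raise SystemExit(result.returncode)

quiet_check([sys.executable, "-c", "print('all 200 tests passed')"])
print("success was swallowed; no context consumed")
```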
Check Selection
Pick the smallest set of checks that can honestly disprove the change:
- UI change → targeted UI flow, screenshot, responsive spot-check, console scan
- API/backend → representative request, error request, schema or contract check; prefer this over handler-level mocks
- Error-handling change → exercise one real failure path, inspect classification, and inspect the user/operator message quality
- CLI/tooling → shipped command invocation, representative args, exit code, stdout/stderr sanity check
- State/config → write/read round trip, restart, boot with changed config, migrate existing state
- Pure refactor → deterministic tests plus one surface check that proves behavior parity, then delete stale paths and duplicates exposed by the refactor
- Generated-looking or overly busy code → add a code-shape pass on the touched files: clarity, dedupe, dead code, abstraction pressure, comment necessity, and escape-hatch types
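The mapping above can live as data so check selection is deliberate rather than ad hoc. The change-type keys and check names here are hypothetical labels, not a fixed taxonomy:

```python
# Hypothetical mapping from change type to the smallest honest check set.
CHECK_SETS = {
    "ui": ["targeted_ui_flow", "screenshot", "responsive_spot_check", "console_scan"],
    "api": ["representative_request", "error_request", "contract_check"],
    "cli": ["shipped_invocation", "exit_code", "stdout_sanity"],
    "refactor": ["deterministic_tests", "surface_parity_check"],
}

def select_checks(change_type: str) -> list:
    """Pick the smallest set that could disprove the change; unknown types escalate."""
    return CHECK_SETS.get(change_type, ["full_verify_entrypoint"])

print(select_checks("api"))
```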
Model Selection
Match model capability to lane complexity:
- Strong reasoning (e.g. Opus, GPT-5.4): evaluator orchestration, complex UI judgment, contract-sensitive checks
- Balanced (e.g. Sonnet, GPT-5.4-mini): targeted runtime checks, API verification, state/config passes
- Fast/cheap (e.g. Haiku, flash): repeated smoke checks, screenshot capture, command re-runs
Use your strongest model for planning/orchestration. Use cheaper models for workers and surface checks.
Cost Awareness
Anthropic's numbers:
- Solo agent: $9 / 20 min
- Full 3-agent setup: $200 / 6 hours (22x more)
- Simplified (Opus 4.6): $125 / 4 hours
The evaluator is expensive. Use it for:
- Complex features at the model's capability edge
- Subjective quality (UI design, UX flows)
- High-stakes changes where self-evaluation is dangerous
Skip it for:
- Tasks within the model's comfort zone
- Simple CRUD, config, or migration work
- When deterministic checks (lint, tests, CI) are sufficient
Key principle: "Every component in a harness encodes an assumption about what the model can't do on its own, and those assumptions are worth stress testing." As models improve, simplify the infrastructure.