Verify your own completed code changes using the repo's existing infrastructure and an independent evaluator context. Use after implementing a change when you need to run unit or integration tests, check build or lint gates, prove the real surface works with evidence, and challenge the changed code for clarity, deduplication, and maintainability. If the repo is not verifiable yet, hand off to `agent-readiness`; if you are reviewing someone else's code, use `review`.
Score: 97 · Best practices: 100% · Impact: 89% · 1.02× average score across 3 eval scenarios · Passed · No known issues
{
"context": "Tests whether the agent actually starts the server and makes real HTTP requests to verify behavior — not just reads/analyzes the source code — and whether the report follows evidence rules: exact commands recorded, real responses captured, failure paths exercised, surfaces named, success kept terse, findings tied to impact, and a valid verdict produced.",
"type": "weighted_checklist",
"checklist": [
{
"name": "Server started and queried",
"description": "Report shows evidence that the server was actually started (e.g. references to starting the process) and that HTTP requests were made to localhost:9127 — not just a static code review",
"max_score": 15
},
{
"name": "Real HTTP client used",
"description": "Verification used curl or an equivalent real HTTP client (e.g. httpx, requests, wget) to exercise the server — not just calling Python functions directly or importing the module",
"max_score": 12
},
{
"name": "Actual response captured",
"description": "Report includes actual HTTP response body content and status codes (e.g. '200 OK', '404', JSON response body) received from the live server",
"max_score": 10
},
{
"name": "Error/failure path exercised",
"description": "At least one error or failure path was tested (e.g. requesting a non-existent user ID, sending malformed JSON to POST /users, or requesting an unknown route) and the response is documented",
"max_score": 12
},
{
"name": "Exact surfaces named",
"description": "Report names the specific endpoints exercised (e.g. GET /users, GET /users/1, POST /users, GET /users/999) — not just 'the API was tested'",
"max_score": 10
},
{
"name": "Exact commands recorded",
"description": "Report includes the actual commands run (e.g. the curl command lines with flags and URLs), not just a prose description of what was done",
"max_score": 10
},
{
"name": "Success kept terse",
"description": "Report does NOT include verbose output from successful requests (e.g. does not dump full response headers for every passing call); successful checks are summarized briefly",
"max_score": 8
},
{
"name": "Finding tied to impact",
"description": "At least one finding or observation in the report explains what could break, who is affected, or why it matters — not just a description of the behavior observed",
"max_score": 8
},
{
"name": "Valid verdict",
"description": "Report contains a verdict using exactly one of: 'ship it', 'needs review', or 'blocked'",
"max_score": 15
}
]
}
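The checklist above can be sketched end to end: start the real server, hit it with a real HTTP client, capture status codes and bodies, exercise a failure path, and keep successful checks terse. This is a minimal self-contained sketch, not the eval's actual server — the `/users` routes mirror the endpoints the checklist names, but the in-process `http.server` handler, the `USERS` data, and the ephemeral port (standing in for `localhost:9127`) are all illustrative assumptions.

```python
import json
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in data store for the hypothetical /users API.
USERS = {1: {"id": 1, "name": "Ada"}}

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/users":
            body = json.dumps(list(USERS.values())).encode()
            self.send_response(200)
        elif self.path.startswith("/users/"):
            try:
                user = USERS[int(self.path.rsplit("/", 1)[1])]
                body = json.dumps(user).encode()
                self.send_response(200)
            except (KeyError, ValueError):
                body = b'{"error": "not found"}'
                self.send_response(404)
        else:
            body = b'{"error": "unknown route"}'
            self.send_response(404)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # suppress per-request noise; keep success terse
        pass

# Actually start the server (ephemeral port), don't just read its source.
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_address[1]}"

# Success path: GET /users/1 — record the exact URL, status, and body.
with urllib.request.urlopen(f"{base}/users/1") as resp:
    ok_status, ok_body = resp.status, resp.read().decode()
print(f"GET /users/1 -> {ok_status} {ok_body}")

# Failure path: GET /users/999 must return a clean 404, not a crash.
# If it did crash, that finding has impact: any client requesting a
# deleted user would take down their request, not just get an error.
try:
    urllib.request.urlopen(f"{base}/users/999")
    err_status = None
except urllib.error.HTTPError as err:
    err_status = err.code
print(f"GET /users/999 -> {err_status}")

server.shutdown()
```

The printed lines are the kind of evidence the checklist rewards: each exercised surface is named exactly, the live response is captured, and one error path is documented alongside the success path.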