Use when the user wants an adversarial double-check of a code or config change. Run the strongest checks available, try to break the claim, look for edge cases and hidden regressions, and return PASS, PARTIAL, or FAIL with evidence. Good triggers include "poke holes in this", "stress test this change", "double check this fix", and "try to break it".
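As a rough illustration of how the three verdicts might be chosen, here is a minimal sketch. The `Check` record, its fields, and `decide_verdict` are hypothetical names introduced for this example. The rule that any successful break attempt yields FAIL is an assumption; the source only states that blocked verification should yield PARTIAL rather than PASS.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PASS = "PASS"
    PARTIAL = "PARTIAL"
    FAIL = "FAIL"


@dataclass
class Check:
    """Hypothetical record of one attempt to break the claim."""
    name: str
    executed: bool     # False when the environment blocked the check
    broke_claim: bool  # True when the check produced a counterexample


def decide_verdict(checks: list[Check]) -> Verdict:
    # Assumption: one executed check that breaks the claim is decisive.
    if any(c.executed and c.broke_claim for c in checks):
        return Verdict.FAIL
    # No checks, or any blocked check, means the claim was not fully exercised.
    if not checks or any(not c.executed for c in checks):
        return Verdict.PARTIAL
    return Verdict.PASS


checks = [
    Check("static review of the change", executed=True, broke_claim=False),
    Check("run against a live database", executed=False, broke_claim=False),
]
print(decide_verdict(checks).value)  # PARTIAL
```

Checking FAIL before PARTIAL means a confirmed counterexample is never hidden behind an unrelated environment limitation.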
84
94%
Does it follow best practices?
Impact
81%
1.30x
Average score across 8 eval scenarios
Passed
No known issues
{
"context": "Tests whether the agent returns PARTIAL (not PASS) when the environment blocks strong verification, such as when a live PostgreSQL database is required to truly execute the script, and whether it is explicit about that limitation.",
"type": "weighted_checklist",
"checklist": [
{
"name": "Verdict is PARTIAL not PASS",
"description": "The verdict is PARTIAL (or FAIL) — not PASS — because the agent cannot run the script against a real PostgreSQL database",
"max_score": 18
},
{
"name": "Environment block cited",
"description": "The report explicitly states that it could not run the script because no DATABASE_URL or live PostgreSQL instance is available",
"max_score": 14
},
{
"name": "Static review acknowledged",
"description": "The report explicitly acknowledges that verification was limited to static/code inspection, using phrasing like 'static review', 'code reading only', or 'could not execute'",
"max_score": 10
},
{
"name": "Remaining Risks present",
"description": "A Remaining Risks section is present with at least two untested risks related to the live database scenario",
"max_score": 10
},
{
"name": "Identifies batching gap",
"description": "The report identifies that the subquery approach (SELECT id WHERE user_tier IS NULL LIMIT N) might not correctly paginate on a live table as rows are updated concurrently",
"max_score": 10
},
{
"name": "Lock risk examined",
"description": "The report examines whether the batched UPDATE approach actually avoids row-level locks or introduces lock contention risks on a 50M row table",
"max_score": 8
},
{
"name": "Idempotency claim checked",
"description": "The report assesses whether the 'skips already backfilled rows' and column-exists check actually make the script safe to re-run",
"max_score": 8
},
{
"name": "Valid verdict value",
"description": "Verdict is exactly PASS, PARTIAL, or FAIL",
"max_score": 8
},
{
"name": "Correct output sections",
"description": "Report contains Claim, Attempts To Break It, Evidence, and Verdict sections",
"max_score": 6
},
{
"name": "One-sentence claim",
"description": "The Claim section states the verification claim in a single sentence",
"max_score": 8
}
]
}
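The rubric above has type `weighted_checklist`, with a `max_score` per criterion. The source does not say how the scores are combined, so the following is only a sketch under the assumption that the result is the awarded points divided by the total `max_score`. The function name `weighted_score` and the `awarded` mapping are hypothetical, and the rubric is abbreviated to its first two criteria.

```python
import json


def weighted_score(rubric: dict, awarded: dict[str, float]) -> float:
    """Return the awarded share of the rubric's total max_score as a percentage."""
    total = sum(item["max_score"] for item in rubric["checklist"])
    earned = 0.0
    for item in rubric["checklist"]:
        points = awarded.get(item["name"], 0)
        # Clamp so a criterion cannot exceed its weight or go negative.
        earned += min(max(points, 0), item["max_score"])
    return 100.0 * earned / total if total else 0.0


# Abbreviated copy of the rubric: first two criteria only.
rubric = json.loads("""
{
  "type": "weighted_checklist",
  "checklist": [
    {"name": "Verdict is PARTIAL not PASS", "max_score": 18},
    {"name": "Environment block cited", "max_score": 14}
  ]
}
""")

# Hypothetical grading: full marks on the first criterion, half on the second.
awarded = {"Verdict is PARTIAL not PASS": 18, "Environment block cited": 7}
print(weighted_score(rubric, awarded))  # 78.125
```

Criteria missing from `awarded` count as zero, so an ungraded item lowers the score instead of being ignored.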