Use when the user wants an adversarial double-check of a code or config change. Run the strongest checks available, try to break the claim, look for edge cases and hidden regressions, and return PASS, PARTIAL, or FAIL with evidence. Good triggers include "poke holes in this", "stress test this change", "double check this fix", and "try to break it".
{
  "context": "Tests whether the agent follows the required output format when performing adversarial verification. The report must use the exact section structure (Claim, Attempts To Break It, Evidence, Verdict, Remaining Risks) and state the claim in a single sentence with a valid verdict.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Claim section present",
      "description": "The verification report contains a section titled 'Claim' (e.g. '### Claim' or '## Claim')",
      "max_score": 8
    },
    {
      "name": "One-sentence claim",
      "description": "The content under the Claim section is a single sentence (not a paragraph or bullet list)",
      "max_score": 8
    },
    {
      "name": "Attempts section present",
      "description": "The report contains a section titled 'Attempts To Break It' or equivalent",
      "max_score": 8
    },
    {
      "name": "Evidence section present",
      "description": "The report contains a section titled 'Evidence' with at least one bullet point of observed proof",
      "max_score": 8
    },
    {
      "name": "Verdict section present",
      "description": "The report contains a section titled 'Verdict'",
      "max_score": 8
    },
    {
      "name": "Valid verdict value",
      "description": "The verdict is exactly one of: PASS, PARTIAL, or FAIL (not a combination or alternative word)",
      "max_score": 12
    },
    {
      "name": "Remaining Risks conditional",
      "description": "Remaining Risks section is present only when the verdict is PARTIAL or FAIL (not added as a filler section for a PASS verdict with no real risks)",
      "max_score": 8
    },
    {
      "name": "Functional proof attempted",
      "description": "The Evidence section contains output from actually running code (e.g. test results, command output, script execution) rather than only describing code by inspection",
      "max_score": 14
    },
    {
      "name": "Attempts are falsification-oriented",
      "description": "The attempts listed focus on trying to break or disprove the claim (e.g. testing edge cases, boundary inputs, counterexamples) rather than confirming it works for normal cases only",
      "max_score": 14
    },
    {
      "name": "Covers exact-multiple boundary",
      "description": "The attempts explicitly test the case where total_items is an exact multiple of page_size (the specific scenario the fix targets)",
      "max_score": 12
    }
  ]
}
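The last three checklist items ask for executed, falsification-oriented attempts that include the exact-multiple boundary. As a rough illustration only, the sketch below shows what such an attempt might look like in Python. The function `page_count` is a hypothetical stand-in for the code under test (the actual fix is not shown in this rubric), and `brute_force_pages` and `run_attempts` are helper names invented here for illustration.

```python
def page_count(total_items: int, page_size: int) -> int:
    """Hypothetical function under test: number of pages needed."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # Integer ceiling division; an exact multiple must not add an extra page.
    return (total_items + page_size - 1) // page_size


def brute_force_pages(total_items: int, page_size: int) -> int:
    """Independent oracle: count pages by repeatedly removing one page of items."""
    pages = 0
    remaining = total_items
    while remaining > 0:
        remaining -= page_size
        pages += 1
    return pages


def run_attempts(page_sizes=(1, 2, 10)):
    """Probe totals just below, at, and just above each exact multiple."""
    failures = []
    for page_size in page_sizes:
        for k in range(0, 4):
            multiple = k * page_size
            for total in (multiple - 1, multiple, multiple + 1):
                if total < 0:
                    continue  # negative totals are outside this sketch
                expected = brute_force_pages(total, page_size)
                actual = page_count(total, page_size)
                if actual != expected:
                    failures.append((total, page_size, expected, actual))
    return failures


if __name__ == "__main__":
    failures = run_attempts()
    # The printed line is the kind of executed output the Evidence section expects.
    print("PASS" if not failures else f"FAIL: {failures}")
```

Comparing against a separately written oracle, rather than re-deriving the expected value with the same formula, keeps the attempt aimed at disproving the claim instead of restating it.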