Use when the user wants an adversarial double-check of a code or config change. Run the strongest checks available, try to break the claim, look for edge cases and hidden regressions, and return PASS, PARTIAL, or FAIL with evidence. Good triggers include "poke holes in this", "stress test this change", "double check this fix", and "try to break it".
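A minimal sketch of how the three verdicts could be derived from a list of break attempts. The names `Attempt` and `decide_verdict` are hypothetical, and the mapping rule is an assumption for illustration, not the skill's actual logic: any counterexample gives FAIL, any check that could not be executed gives PARTIAL, and PASS otherwise.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PASS = "PASS"
    PARTIAL = "PARTIAL"
    FAIL = "FAIL"


@dataclass
class Attempt:
    """One attempt to break the claim (hypothetical structure)."""
    description: str
    executed: bool      # True if the check actually ran; False for static review only
    broke_claim: bool   # True if this attempt produced a counterexample
    evidence: str = ""  # command output, log excerpt, or reasoning


def decide_verdict(attempts: list[Attempt]) -> Verdict:
    """Assumed rule: a counterexample wins; unexecuted checks cap the result at PARTIAL."""
    if any(a.broke_claim for a in attempts):
        return Verdict.FAIL
    if not attempts or not all(a.executed for a in attempts):
        return Verdict.PARTIAL
    return Verdict.PASS
```

Under this assumed rule, a review that only read the code, with no executed checks, can never return PASS.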
84
94%
Does it follow best practices?
Impact
81%
1.30x
Average score across 8 eval scenarios
Passed
No known issues
Output format compliance
Claim section present | 0% | 100%
One-sentence claim | 0% | 100%
Attempts section present | 0% | 100%
Evidence section present | 0% | 100%
Verdict section present | 0% | 100%
Valid verdict value | 0% | 100%
Remaining Risks conditional | 100% | 100%
Functional proof attempted | 50% | 28%
Attempts are falsification-oriented | 85% | 85%
Covers exact-multiple boundary | 100% | 100%
Falsification bias in verification plan
Falsification-framed attempts | 78% | 85%
No confirmation bias | 90% | 70%
Addresses incomplete edge cases | 25% | 33%
Addresses stale test assumptions | 90% | 90%
Shortest path to disproof | 60% | 80%
Functional proof attempted | 66% | 66%
Evidence contains execution output | 70% | 40%
Valid verdict | 0% | 37%
Correct output format | 12% | 25%
Claim graded not code quality | 83% | 100%
Failure mode identification
Identifies race condition risk | 100% | 100%
Addresses environment-specific breakage | 100% | 100%
Addresses edge cases | 100% | 100%
Addresses stale assumptions | 80% | 100%
Happy-path behavior examined | 50% | 50%
Failure modes framed adversarially | 80% | 100%
Valid verdict assigned | 0% | 100%
Verdict matches evidence | 100% | 100%
Correct output sections | 25% | 100%
Claim graded not style | 100% | 100%
Static review acknowledgment
Acknowledges static review | 100% | 100%
Verdict is not PASS | 100% | 100%
Environment limitation cited | 0% | 100%
Remaining Risks section present | 60% | 100%
BroadcastChannel compatibility noted | 100% | 100%
Race condition identified | 100% | 100%
Correct output sections | 37% | 100%
Valid verdict value | 0% | 100%
One-sentence claim | 0% | 100%
Token leak risk considered | 0% | 100%
PARTIAL verdict for environment-blocked verification
Verdict is PARTIAL not PASS | 77% | 100%
Environment block cited | 0% | 35%
Static review acknowledged | 70% | 100%
Remaining Risks present | 100% | 100%
Identifies batching gap | 40% | 40%
Lock risk examined | 100% | 100%
Idempotency claim checked | 100% | 100%
Valid verdict value | 0% | 100%
Correct output sections | 16% | 100%
One-sentence claim | 0% | 100%
Grade the claim not code quality
No code quality criticism in verdict | 100% | 100%
Claim-based verdict | 91% | 100%
Functional proof executed | 100% | 100%
Backoff calculation verified | 100% | 100%
Verdict matches actual behavior | 80% | 100%
Attempts are falsification-oriented | 40% | 40%
Valid verdict value | 0% | 0%
Correct output format | 0% | 12%
One-sentence claim | 0% | 0%
Evidence contains execution output | 83% | 100%
FAIL verdict with counterexample
FAIL verdict issued | 71% | 71%
Specific counterexample documented | 100% | 100%
Counterexample found by running code | 83% | 41%
Shortest path used | 80% | 90%
International formats tested | 100% | 100%
Correct output sections | 25% | 0%
One-sentence claim | 0% | 0%
Verdict not based on code quality | 100% | 87%
Remaining Risks or gaps described | 87% | 87%
Falsification-oriented attempts | 90% | 70%
Functional proof over code inspection
Code actually executed | 62% | 81%
Adversarial inputs tested | 100% | 100%
Python bool edge case tested | 100% | 100%
Boundary values tested | 100% | 100%
Evidence is concrete | 70% | 80%
Correct output sections | 25% | 37%
Valid verdict value | 0% | 100%
Verdict grounded in test results | 87% | 100%
One-sentence claim | 0% | 25%
Functional proof preferred over inspection | 62% | 87%