Use when the user wants an adversarial double-check of a code or config change. Run the strongest checks available, try to break the claim, look for edge cases and hidden regressions, and return PASS, PARTIAL, or FAIL with evidence. Good triggers include "poke holes in this", "stress test this change", "double check this fix", and "try to break it".
Determine whether a claimed fix or feature actually works under adversarial scrutiny.
PASS: evidence supports the claim and no material counterexample was found
PARTIAL: some evidence exists, but important risks remain untested or unresolved
FAIL: a counterexample, regression, or missing requirement was found
Use this structure:
PASS | PARTIAL | FAIL
PARTIAL if empty string passes but malformed JSON handling was not checked
PARTIAL, not PASS.
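The verdict rules above can be read as a small decision procedure. The following is a minimal sketch in Python, not part of the skill itself: the `Check` record, its `material` flag, and the `decide` function are hypothetical names introduced for illustration. It also assumes that a run with no checks at all counts as PARTIAL rather than PASS, which the rules do not state explicitly.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PASS = "PASS"
    PARTIAL = "PARTIAL"
    FAIL = "FAIL"


@dataclass
class Check:
    """Hypothetical record of one adversarial check (illustrative only)."""
    name: str              # what was tested, e.g. "empty string input"
    ran: bool              # False if the risk was identified but never exercised
    passed: bool = False   # only meaningful when ran is True
    material: bool = True  # assumed flag: does this risk matter to the claim?


def decide(checks: list[Check]) -> Verdict:
    # Any executed check that failed is a counterexample or regression.
    if any(c.ran and not c.passed for c in checks):
        return Verdict.FAIL
    # Assumption: no evidence at all is treated as PARTIAL, never PASS.
    if not any(c.ran for c in checks):
        return Verdict.PARTIAL
    # An important risk that was never exercised blocks a PASS.
    if any(c.material and not c.ran for c in checks):
        return Verdict.PARTIAL
    return Verdict.PASS


if __name__ == "__main__":
    checks = [
        Check("empty string input", ran=True, passed=True),
        Check("malformed JSON input", ran=False),
    ]
    print(decide(checks).value)
```

With the two checks in the usage example, the empty-string case passes but malformed JSON handling was never run, so the sketch returns PARTIAL, matching the example in the text.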