Use when the user wants an adversarial double-check of a code or config change. Run the strongest checks available, try to break the claim, look for edge cases and hidden regressions, and return PASS, PARTIAL, or FAIL with evidence. Good triggers include "poke holes in this", "stress test this change", "double check this fix", and "try to break it".
84
94%
Does it follow best practices?
Impact
81%
1.30xAverage score across 8 eval scenarios
Passed
No known issues
{
"context": "Tests whether the agent explicitly acknowledges when verification was mostly static review (code reading) rather than functional execution, and correctly returns PARTIAL rather than PASS when the environment prevents running browser-based JavaScript.",
"type": "weighted_checklist",
"checklist": [
{
"name": "Acknowledges static review",
"description": "The report explicitly states that verification was mostly or entirely static (code reading/inspection), using language such as 'static review', 'code inspection only', 'could not execute', or 'no browser runtime available'",
"max_score": 16
},
{
"name": "Verdict is not PASS",
"description": "The verdict is PARTIAL or FAIL — not PASS — given that browser-based execution was not possible in a non-browser environment",
"max_score": 14
},
{
"name": "Environment limitation cited",
"description": "The report identifies the lack of a running browser or DOM environment as a specific constraint that limited verification",
"max_score": 12
},
{
"name": "Remaining Risks section present",
"description": "A Remaining Risks section is present and lists at least one untested risk (e.g. BroadcastChannel Safari compatibility, concurrent refresh race, iOS behavior)",
"max_score": 10
},
{
"name": "BroadcastChannel compatibility noted",
"description": "The report identifies BroadcastChannel API browser support (specifically Safari/iOS) as an untested or risky factor",
"max_score": 10
},
{
"name": "Race condition identified",
"description": "The report identifies a potential race condition: two tabs checking refreshInProgress simultaneously before either sets it to true",
"max_score": 10
},
{
"name": "Correct output sections",
"description": "Report contains Claim, Attempts To Break It, Evidence, and Verdict sections",
"max_score": 8
},
{
"name": "Valid verdict value",
"description": "Verdict is exactly PASS, PARTIAL, or FAIL",
"max_score": 8
},
{
"name": "One-sentence claim",
"description": "The Claim section states the verification claim in a single sentence",
"max_score": 6
},
{
"name": "Token leak risk considered",
"description": "The report considers whether broadcasting the token value via postMessage exposes it to other tabs or origins (a security consideration beyond functional correctness)",
"max_score": 6
}
]
}evals
scenario-1
scenario-2
scenario-3
scenario-4
scenario-5
scenario-6
scenario-7
scenario-8
skills
skeptic-verifier