
coding-agent-helpers/skeptic-verifier

Use when the user wants an adversarial double-check of a code or config change. Run the strongest checks available, try to break the claim, look for edge cases and hidden regressions, and return PASS, PARTIAL, or FAIL with evidence. Good triggers include "poke holes in this", "stress test this change", "double check this fix", and "try to break it".

84 · 1.30x

Quality: 94% (Does it follow best practices?)

Impact: 81% · 1.30x (Average score across 8 eval scenarios)

Security by Snyk: Passed (No known issues)


evals/scenario-7/criteria.json

{
  "context": "Tests whether the agent correctly issues a FAIL verdict when a concrete counterexample is found, and whether it uses functional execution (running code with adversarial inputs) rather than code inspection to find the failure.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "FAIL verdict issued",
      "description": "The verdict is FAIL (the function has real gaps — e.g. a number with international dial prefix '00' instead of '+' raises ValueError rather than normalizing correctly)",
      "max_score": 14
    },
    {
      "name": "Specific counterexample documented",
      "description": "The Evidence section includes at least one specific input value and the actual output or error produced (not just a description of what might fail)",
      "max_score": 14
    },
    {
      "name": "Counterexample found by running code",
      "description": "The counterexample was discovered by actually executing the function with adversarial inputs, not only by reasoning about it from code inspection",
      "max_score": 12
    },
    {
      "name": "Shortest path used",
      "description": "The agent does not build elaborate test infrastructure before trying simple adversarial inputs — it tries direct inputs quickly to find failure",
      "max_score": 10
    },
    {
      "name": "International formats tested",
      "description": "Attempts include at least one non-US international number format (e.g. using '00' instead of '+' prefix, numbers from non-US/UK countries)",
      "max_score": 10
    },
    {
      "name": "Correct output sections",
      "description": "Report contains Claim, Attempts To Break It, Evidence, and Verdict sections",
      "max_score": 8
    },
    {
      "name": "One-sentence claim",
      "description": "The Claim section is a single sentence",
      "max_score": 6
    },
    {
      "name": "Verdict not based on code quality",
      "description": "The FAIL verdict references a behavioral failure (wrong output or unhandled case), not code style or structural concerns",
      "max_score": 8
    },
    {
      "name": "Remaining Risks or gaps described",
      "description": "Given a FAIL verdict, the report describes what specifically fails or what remaining gaps were found",
      "max_score": 8
    },
    {
      "name": "Falsification-oriented attempts",
      "description": "The Attempts To Break It section lists inputs designed to break the normalizer, not to confirm it works",
      "max_score": 10
    }
  ]
}
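To illustrate the behavior this checklist rewards, here is a minimal sketch of a shortest-path adversarial probe. The normalizer below is hypothetical (the scenario's actual function is not shown); it reproduces the gap the first checklist item describes: '+'-prefixed international numbers are accepted, but the '00' international dial prefix raises ValueError. The probe finds the counterexample by executing the function on hostile inputs rather than inspecting the code.

```python
# Hypothetical normalizer with the gap scenario-7 describes: it handles
# '+' prefixes and bare 10-digit US numbers, but not the '00' dial prefix.
def normalize_phone(raw: str) -> str:
    digits = raw.strip().replace(" ", "").replace("-", "")
    if digits.startswith("+"):
        return digits
    if digits.isdigit() and len(digits) == 10:  # assume bare US number
        return "+1" + digits
    raise ValueError(f"unrecognized phone format: {raw!r}")

# Shortest path: no test harness, just run adversarial inputs directly
# and record each concrete failure as evidence for the report.
adversarial_inputs = [
    "+44 20 7946 0958",    # '+' international prefix (expected to pass)
    "0044 20 7946 0958",   # '00' dial prefix instead of '+'
    "4155550123",          # bare US number (expected to pass)
]
failures = []
for raw in adversarial_inputs:
    try:
        normalize_phone(raw)
    except ValueError as err:
        failures.append((raw, repr(err)))

for raw, err in failures:
    print(f"FAIL counterexample: {raw!r} -> {err}")
```

The printed line pairs a specific input with the actual error produced, which is exactly the shape of evidence the "Specific counterexample documented" and "Counterexample found by running code" items score.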
