
coding-agent-helpers/skeptic-verifier

Use when the user wants an adversarial double-check of a code or config change. Run the strongest checks available, try to break the claim, look for edge cases and hidden regressions, and return PASS, PARTIAL, or FAIL with evidence. Good triggers include "poke holes in this", "stress test this change", "double check this fix", and "try to break it".

Score: 84 (1.30x)
Quality: 94% (Does it follow best practices?)
Impact: 81% (1.30x), average score across 8 eval scenarios
Security (by Snyk): Passed, no known issues


evals/scenario-2/criteria.json

{
  "context": "Tests whether the agent builds a verification plan biased toward falsification (not confirmation), identifies the specific failure-mode categories the skill enumerates, and prefers the shortest path to disproof.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Falsification-framed attempts",
      "description": "The 'Attempts To Break It' section contains attempts designed to disprove the claim (bypass attempts, edge inputs, adversarial cases) — not merely re-running the existing passing tests",
      "max_score": 14
    },
    {
      "name": "No confirmation bias",
      "description": "The verification plan does NOT list 'confirm it works for normal inputs' or 'verify the happy path succeeds' as its primary goal — it focuses on finding failure cases",
      "max_score": 10
    },
    {
      "name": "Addresses incomplete edge cases",
      "description": "Attempts include inputs that are valid by the regex but might still cause problems (e.g. reserved SQL keywords like SELECT, OR, NULL used as column names)",
      "max_score": 12
    },
    {
      "name": "Addresses stale test assumptions",
      "description": "The report considers whether the existing test suite might have gaps or false confidence (stale assumptions), rather than treating passing tests as proof",
      "max_score": 10
    },
    {
      "name": "Shortest path to disproof",
      "description": "The first meaningful attempt targets the most direct way to break the claim (e.g. directly trying a SQL-keyword column name or Unicode bypass), rather than building up elaborate scaffolding first",
      "max_score": 10
    },
    {
      "name": "Functional proof attempted",
      "description": "The agent actually runs the sanitization code with test inputs rather than only reasoning about the regex by visual inspection",
      "max_score": 12
    },
    {
      "name": "Evidence contains execution output",
      "description": "The Evidence section includes concrete output from running code (error messages, return values, or test results), not just paraphrased descriptions",
      "max_score": 10
    },
    {
      "name": "Valid verdict",
      "description": "The Verdict is exactly one of PASS, PARTIAL, or FAIL",
      "max_score": 8
    },
    {
      "name": "Correct output format",
      "description": "Report contains all required sections: Claim, Attempts To Break It, Evidence, Verdict",
      "max_score": 8
    },
    {
      "name": "Claim graded not code quality",
      "description": "The verdict is based on whether the sanitization claim holds, NOT on comments about the code style, regex readability, or engineering quality",
      "max_score": 6
    }
  ]
}
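The `max_score` values in the checklist sum to 100, so a report's total can be read directly as a percentage. The file does not say how a `weighted_checklist` is aggregated; the sketch below assumes a plain sum of per-criterion points, each clamped to its `max_score`. The `score_report` helper and the example grades are hypothetical and only illustrate that assumption.

```python
# Criterion names and max_score values copied from criteria.json.
CHECKLIST = [
    ("Falsification-framed attempts", 14),
    ("No confirmation bias", 10),
    ("Addresses incomplete edge cases", 12),
    ("Addresses stale test assumptions", 10),
    ("Shortest path to disproof", 10),
    ("Functional proof attempted", 12),
    ("Evidence contains execution output", 10),
    ("Valid verdict", 8),
    ("Correct output format", 8),
    ("Claim graded not code quality", 6),
]

MAX_TOTAL = sum(max_score for _, max_score in CHECKLIST)


def score_report(awarded):
    """Sum awarded points (assumed aggregation: simple sum).

    Each value is clamped to [0, max_score]; criteria missing
    from `awarded` count as 0.
    """
    total = 0
    for name, max_score in CHECKLIST:
        points = awarded.get(name, 0)
        total += max(0, min(points, max_score))
    return total


# Hypothetical grades: full marks everywhere, except the agent never
# ran the code and included only partial execution output.
example = {name: max_score for name, max_score in CHECKLIST}
example["Functional proof attempted"] = 0
example["Evidence contains execution output"] = 4

print(f"{score_report(example)} / {MAX_TOTAL}")
```

Under this assumption, a report that reasons about the regex without executing anything loses up to 22 points from the two execution-related criteria alone.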
