
coding-agent-helpers/skeptic-verifier

Use when the user wants an adversarial double-check of a code or config change. Run the strongest checks available, try to break the claim, look for edge cases and hidden regressions, and return PASS, PARTIAL, or FAIL with evidence. Good triggers include "poke holes in this", "stress test this change", "double check this fix", and "try to break it".

84 · 1.30x
Quality: 94% (Does it follow best practices?)
Impact: 81%, 1.30x (average score across 8 eval scenarios)
Security by Snyk: Passed, no known issues


evals/scenario-3/criteria.json

{
  "context": "Tests whether the agent systematically identifies the specific categories of failure modes enumerated in the skill: wrong happy-path behavior, hidden regressions, environment-specific breakage, incomplete edge-case handling, and stale assumptions in tests or manual verification.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Identifies race condition risk",
      "description": "The report explicitly identifies the window between the redis.exists() check and the redis.setex() call as a potential failure mode (hidden regression / incomplete fix for the original race condition)",
      "max_score": 14
    },
    {
      "name": "Addresses environment-specific breakage",
      "description": "The report identifies at least one environment-specific risk (e.g. Redis connection failure, Redis being unavailable, network partition between check and set) that could cause incorrect behavior",
      "max_score": 12
    },
    {
      "name": "Addresses edge cases",
      "description": "The report considers at least one edge case beyond the happy path (e.g. empty recipient string, TTL expiring mid-campaign, same email sent across campaign boundary)",
      "max_score": 10
    },
    {
      "name": "Addresses stale assumptions",
      "description": "The report notes that the absence of visible tests does not confirm correctness, or explicitly calls out what assumptions the fix relies on that could be wrong",
      "max_score": 10
    },
    {
      "name": "Happy-path behavior examined",
      "description": "The report also addresses whether the basic happy-path claim holds (first send goes through, duplicate is blocked)",
      "max_score": 8
    },
    {
      "name": "Failure modes framed adversarially",
      "description": "The Attempts To Break It section is organized around ways the claim could fail, not a list of positive confirmations",
      "max_score": 10
    },
    {
      "name": "Valid verdict assigned",
      "description": "The verdict is exactly PASS, PARTIAL, or FAIL",
      "max_score": 8
    },
    {
      "name": "Verdict matches evidence",
      "description": "If the race condition risk is identified (check-then-act pattern), the verdict is NOT PASS — it is at minimum PARTIAL",
      "max_score": 14
    },
    {
      "name": "Correct output sections",
      "description": "Report has Claim, Attempts To Break It, Evidence, and Verdict sections",
      "max_score": 8
    },
    {
      "name": "Claim graded not style",
      "description": "The verdict discussion focuses on whether duplicates are prevented, not on code style, naming, or engineering choices",
      "max_score": 6
    }
  ]
}
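
The check-then-act window that the first and eighth checklist items refer to can be illustrated with a minimal sketch. The scenario's actual code is not shown on this page, so everything here is an assumption: `FakeRedis` is a hypothetical in-memory stand-in that mimics only the `exists`, `setex`, and `set(..., nx=True, ex=...)` calls of redis-py, and the key name is invented. The interleaving is replayed deterministically rather than with real threads.

```python
class FakeRedis:
    """Hypothetical in-memory stand-in mimicking three redis-py calls."""

    def __init__(self):
        self._data = {}

    def exists(self, key):
        return 1 if key in self._data else 0

    def setex(self, key, ttl_seconds, value):
        self._data[key] = value  # TTL is ignored in this sketch
        return True

    def set(self, key, value, nx=False, ex=None):
        if nx and key in self._data:
            return None  # redis-py returns None when NX blocks the write
        self._data[key] = value
        return True


def racy_interleaving(redis, key):
    """Two workers both pass the exists() check before either one writes."""
    sends = 0
    a_sees_key = redis.exists(key)  # worker A checks
    b_sees_key = redis.exists(key)  # worker B checks inside A's window
    if not a_sees_key:
        redis.setex(key, 3600, "1")
        sends += 1  # worker A sends
    if not b_sees_key:
        redis.setex(key, 3600, "1")
        sends += 1  # worker B sends the duplicate
    return sends


def atomic_claims(redis, key):
    """Each worker claims the key with a single SET NX EX call."""
    sends = 0
    for _worker in ("A", "B"):
        if redis.set(key, "1", nx=True, ex=3600):
            sends += 1  # only the first claim succeeds
    return sends


KEY = "sent:campaign-1:user@example.com"  # hypothetical key format
racy_sends = racy_interleaving(FakeRedis(), KEY)
atomic_sends = atomic_claims(FakeRedis(), KEY)
```

In this replay `racy_sends` is 2 and `atomic_sends` is 1. A report that spots the first pattern has grounds for a verdict of PARTIAL or FAIL rather than PASS, which is what the "Verdict matches evidence" item checks.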

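The ten `max_score` values in this file sum to 100. How the eval harness combines them is not documented on this page; the sketch below assumes a `weighted_checklist` is scored by summing per-item awards, each clamped to its `max_score`. The function `total_score` and the awarded values are hypothetical, and only three of the ten items are reproduced to keep the example short.

```python
import json

# Excerpt of the checklist: names and max_score values are copied from
# criteria.json; the awards passed in below are invented for illustration.
CRITERIA = json.loads("""
{
  "type": "weighted_checklist",
  "checklist": [
    {"name": "Identifies race condition risk", "max_score": 14},
    {"name": "Valid verdict assigned", "max_score": 8},
    {"name": "Claim graded not style", "max_score": 6}
  ]
}
""")


def total_score(criteria, awarded):
    """Return (earned, possible), clamping each award to [0, max_score]."""
    earned = 0
    possible = 0
    for item in criteria["checklist"]:
        cap = item["max_score"]
        possible += cap
        # Items missing from `awarded` count as zero (an assumption).
        earned += max(0, min(awarded.get(item["name"], 0), cap))
    return earned, possible


earned, possible = total_score(
    CRITERIA,
    {
        "Identifies race condition risk": 14,
        "Valid verdict assigned": 8,
        "Claim graded not style": 3,
    },
)
```

With these invented awards, `earned` is 25 out of a `possible` 28 for the excerpt.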