coding-agent-helpers/skeptic-verifier

Use when the user wants an adversarial double-check of a code or config change. Run the strongest checks available, try to break the claim, look for edge cases and hidden regressions, and return PASS, PARTIAL, or FAIL with evidence. Good triggers include "poke holes in this", "stress test this change", "double check this fix", and "try to break it".

84

1.30x
Quality: 94% (Does it follow best practices?)

Impact: 81%, 1.30x (average score across 8 eval scenarios)

Security by Snyk: Passed, no known issues

evals/scenario-1/criteria.json

{
  "context": "Tests whether the agent follows the required output format when performing adversarial verification. The report must use the exact section structure (Claim, Attempts To Break It, Evidence, Verdict, Remaining Risks) and state the claim in a single sentence with a valid verdict.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Claim section present",
      "description": "The verification report contains a section titled 'Claim' (e.g. '### Claim' or '## Claim')",
      "max_score": 8
    },
    {
      "name": "One-sentence claim",
      "description": "The content under the Claim section is a single sentence (not a paragraph or bullet list)",
      "max_score": 8
    },
    {
      "name": "Attempts section present",
      "description": "The report contains a section titled 'Attempts To Break It' or equivalent",
      "max_score": 8
    },
    {
      "name": "Evidence section present",
      "description": "The report contains a section titled 'Evidence' with at least one bullet point of observed proof",
      "max_score": 8
    },
    {
      "name": "Verdict section present",
      "description": "The report contains a section titled 'Verdict'",
      "max_score": 8
    },
    {
      "name": "Valid verdict value",
      "description": "The verdict is exactly one of: PASS, PARTIAL, or FAIL (not a combination or alternative word)",
      "max_score": 12
    },
    {
      "name": "Remaining Risks conditional",
      "description": "Remaining Risks section is present only when the verdict is PARTIAL or FAIL (not added as a filler section for a PASS verdict with no real risks)",
      "max_score": 8
    },
    {
      "name": "Functional proof attempted",
      "description": "The Evidence section contains output from actually running code (e.g. test results, command output, script execution) rather than only describing code by inspection",
      "max_score": 14
    },
    {
      "name": "Attempts are falsification-oriented",
      "description": "The attempts listed focus on trying to break or disprove the claim (e.g. testing edge cases, boundary inputs, counterexamples) rather than confirming it works for normal cases only",
      "max_score": 14
    },
    {
      "name": "Covers exact-multiple boundary",
      "description": "The attempts explicitly test the case where total_items is an exact multiple of page_size (the specific scenario the fix targets)",
      "max_score": 12
    }
  ]
}
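
The structural items in this checklist can be checked mechanically. The following is a minimal sketch, not the eval harness itself: it assumes the report is Markdown with `##` or `###` headings, that each check is scored all-or-nothing against its `max_score`, and that helper names such as `check_structure` and `earned_score` are hypothetical. The sample report is invented for illustration. Criteria that need judgment (one-sentence claim, falsification-oriented attempts, functional proof, exact-multiple boundary) are left out.

```python
import re

VALID_VERDICTS = {"PASS", "PARTIAL", "FAIL"}

# Checklist name -> required section title (names taken from criteria.json).
SECTION_CHECKS = {
    "Claim section present": "Claim",
    "Attempts section present": "Attempts To Break It",
    "Evidence section present": "Evidence",
    "Verdict section present": "Verdict",
}

# max_score values copied from criteria.json for the checks covered here.
MAX_SCORES = {
    "Claim section present": 8,
    "Attempts section present": 8,
    "Evidence section present": 8,
    "Verdict section present": 8,
    "Valid verdict value": 12,
    "Remaining Risks conditional": 8,
}


def split_sections(report: str) -> dict[str, str]:
    """Map each '##' or '###' heading title to the text beneath it."""
    sections: dict[str, str] = {}
    current = None
    for line in report.splitlines():
        match = re.match(r"^#{2,3}\s+(.+?)\s*$", line)
        if match:
            current = match.group(1)
            sections[current] = ""
        elif current is not None:
            sections[current] += line + "\n"
    return {title: body.strip() for title, body in sections.items()}


def check_structure(report: str) -> dict[str, bool]:
    """Return pass/fail for each structural checklist item."""
    sections = split_sections(report)
    results = {name: title in sections for name, title in SECTION_CHECKS.items()}
    # The Evidence criterion also asks for at least one bullet point.
    evidence_lines = sections.get("Evidence", "").splitlines()
    has_bullet = any(l.lstrip().startswith(("-", "*")) for l in evidence_lines)
    results["Evidence section present"] = results["Evidence section present"] and has_bullet
    verdict = sections.get("Verdict", "")
    results["Valid verdict value"] = verdict in VALID_VERDICTS
    # Approximation: Remaining Risks may appear only with PARTIAL or FAIL.
    results["Remaining Risks conditional"] = (
        "Remaining Risks" not in sections or verdict in {"PARTIAL", "FAIL"}
    )
    return results


def earned_score(results: dict[str, bool]) -> int:
    """Sum max_score over passed checks (all-or-nothing is an assumption)."""
    return sum(MAX_SCORES[name] for name, passed in results.items() if passed)


# Invented sample report, for illustration only.
sample_report = """### Claim
The fix returns the correct page count.

### Attempts To Break It
- total_items=20, page_size=10 (exact multiple)

### Evidence
- pytest: 4 passed

### Verdict
PASS
"""

results = check_structure(sample_report)
print(results, earned_score(results))
```

A check like this only confirms the shape of the report. Whether the evidence comes from actually running code, or whether the attempts genuinely try to disprove the claim, still needs a reviewer or a model-based grader.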