
coding-agent-helpers/skeptic-verifier

Use when the user wants an adversarial double-check of a code or config change. Run the strongest checks available, try to break the claim, look for edge cases and hidden regressions, and return PASS, PARTIAL, or FAIL with evidence. Good triggers include "poke holes in this", "stress test this change", "double check this fix", and "try to break it".

Quality: 94% (Does it follow best practices?)
Impact: 81% (1.30x), average score across 8 eval scenarios
Security by Snyk: Passed, no known issues


evals/scenario-8/criteria.json

{
  "context": "Tests whether the agent prefers functional proof over code inspection, actually running the validator with adversarial inputs rather than reasoning about the code by reading it.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Code actually executed",
      "description": "The agent runs the validator function with concrete inputs — evidence includes actual return values or printed output from executing the code, not just descriptions of what the code appears to do",
      "max_score": 16
    },
    {
      "name": "Adversarial inputs tested",
      "description": "The agent tests at least 3 distinct adversarial inputs (e.g. boundary values, wrong types, missing fields, boolean True as integer) — not just the described valid/invalid cases",
      "max_score": 12
    },
    {
      "name": "Python bool edge case tested",
      "description": "The agent tests whether a boolean value (True or False) passes the isinstance(x, int) check; in Python, bool is a subclass of int, so memory_mb=True passes the type check even though its value (1) is below the 64 minimum",
      "max_score": 12
    },
    {
      "name": "Boundary values tested",
      "description": "The agent tests exact boundary values (64, 65536, 1, 1024, 50) to confirm they are accepted, and values just outside boundaries (63, 65537, 0, 1025, 51) to confirm they are rejected",
      "max_score": 10
    },
    {
      "name": "Evidence is concrete",
      "description": "The Evidence section shows actual function return values or output (e.g. [] for valid, ['memory_mb must be ...'] for invalid) rather than saying 'the function correctly handles this case'",
      "max_score": 10
    },
    {
      "name": "Correct output sections",
      "description": "Report contains Claim, Attempts To Break It, Evidence, and Verdict sections",
      "max_score": 8
    },
    {
      "name": "Valid verdict value",
      "description": "Verdict is exactly PASS, PARTIAL, or FAIL",
      "max_score": 8
    },
    {
      "name": "Verdict grounded in test results",
      "description": "The verdict is justified by the test results found during execution, not by how the code reads",
      "max_score": 8
    },
    {
      "name": "One-sentence claim",
      "description": "The Claim section states the claim in a single sentence",
      "max_score": 8
    },
    {
      "name": "Functional proof preferred over inspection",
      "description": "The report demonstrates a preference for running code (even writing a small test script) rather than describing what the code appears to do from reading it; the Evidence section reflects actual outputs",
      "max_score": 8
    }
  ]
}
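
The checklist above rewards running a validator against boundary values and the `bool` edge case rather than reading it. The following is a minimal sketch of what such a probe script could look like. The validator itself is not part of this page, so `validate`, `BOUNDS`, the `reject_bool` flag, and the `replicas` field are all hypothetical stand-ins; only the `memory_mb` range of 64 to 65536 and the `isinstance(x, int)` check come from the criteria.

```python
# Hypothetical stand-in for the validator under test (not the real one).
# Only the memory_mb bounds (64..65536) are taken from the criteria;
# "replicas" and its 1..50 range are invented for illustration.
BOUNDS = {"memory_mb": (64, 65536), "replicas": (1, 50)}


def validate(cfg, reject_bool=False):
    """Return a list of error strings; an empty list means the config is valid."""
    errors = []
    for field, (lo, hi) in BOUNDS.items():
        value = cfg.get(field)
        # bool is a subclass of int, so isinstance(True, int) is True.
        is_int = isinstance(value, int) and not (reject_bool and isinstance(value, bool))
        if not is_int:
            errors.append(f"{field} must be an integer")
        elif not lo <= value <= hi:
            errors.append(f"{field} must be between {lo} and {hi}")
    return errors


# Adversarial probes: exact boundaries, just outside, bools, wrong type, missing field.
probes = [
    {"memory_mb": 64, "replicas": 1},
    {"memory_mb": 65536, "replicas": 50},
    {"memory_mb": 63, "replicas": 51},
    {"memory_mb": 512, "replicas": True},
    {"memory_mb": True, "replicas": 1},
    {"memory_mb": "512"},
]

for cfg in probes:
    # Print real return values so the evidence is concrete output, not a description.
    print(cfg, "->", validate(cfg), "| strict:", validate(cfg, reject_bool=True))
```

In this sketch the naive check accepts `replicas=True` outright, because `True` equals 1 and 1 is inside the range, while `memory_mb=True` is caught only by the range check and reports a misleading range error. Both facts show up only when the function is actually run.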
