coding-agent-helpers/skeptic-verifier

Use when the user wants an adversarial double-check of a code or config change. Run the strongest checks available, try to break the claim, look for edge cases and hidden regressions, and return PASS, PARTIAL, or FAIL with evidence. Good triggers include "poke holes in this", "stress test this change", "double check this fix", and "try to break it".

84

1.30x
Quality: 94% (Does it follow best practices?)
Impact: 81% (1.30x), average score across 8 eval scenarios
Security by Snyk: Passed, no known issues

evals/scenario-6/criteria.json

{
  "context": "Tests whether the agent grades the behavioral claim (does the retry logic work?) rather than the code quality (is it well-written?), and whether it runs functional tests rather than relying on code inspection.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "No code quality criticism in verdict",
      "description": "The verdict and the evidence section do NOT base the conclusion on code style, verbosity, naming conventions, or readability — they focus only on whether the retry behavior is correct",
      "max_score": 14
    },
    {
      "name": "Claim-based verdict",
      "description": "The verdict explicitly addresses whether the claimed behaviors hold (retries on 503, exponential backoff, gives up after N attempts) — not whether the code is good",
      "max_score": 12
    },
    {
      "name": "Functional proof executed",
      "description": "The agent actually runs the code (e.g. runs the test file or writes and executes additional tests) rather than only reasoning about it by reading",
      "max_score": 12
    },
    {
      "name": "Backoff calculation verified",
      "description": "The agent checks whether the exponential backoff delays are computed correctly (the implementation uses attempt_number after increment, so delays are 2x, 4x, 8x rather than 1x, 2x, 4x)",
      "max_score": 12
    },
    {
      "name": "Verdict matches actual behavior",
      "description": "If the agent discovers the backoff calculation is off-by-one (or confirms it is correct), the verdict reflects this finding accurately",
      "max_score": 10
    },
    {
      "name": "Attempts are falsification-oriented",
      "description": "Attempts include at least one adversarial test beyond the provided test suite (e.g. verifying exact delay values, testing with max_attempts=1, testing non-503 error codes)",
      "max_score": 10
    },
    {
      "name": "Valid verdict value",
      "description": "Verdict is exactly PASS, PARTIAL, or FAIL",
      "max_score": 8
    },
    {
      "name": "Correct output format",
      "description": "Report contains Claim, Attempts To Break It, Evidence, and Verdict sections",
      "max_score": 8
    },
    {
      "name": "One-sentence claim",
      "description": "The Claim section contains a single sentence summarizing what is being verified",
      "max_score": 8
    },
    {
      "name": "Evidence contains execution output",
      "description": "The Evidence section shows actual output from running the code (test results, printed values) rather than paraphrased descriptions",
      "max_score": 6
    }
  ]
}
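
The "Backoff calculation verified" item describes an off-by-one in which the delay is derived from the attempt counter after it has been incremented. The implementation under test is not shown on this page, so the following is only a minimal sketch of that pattern. The function names, the `base_delay` parameter, and the loop shape are hypothetical, and the sketch assumes no sleep follows the final attempt.

```python
def delays_after_increment(base_delay: float, max_attempts: int) -> list[float]:
    """Delays when the exponent uses the counter AFTER incrementing it.

    Hypothetical reconstruction of the behavior the checklist describes.
    """
    delays = []
    attempt = 0
    while attempt < max_attempts - 1:  # assumed: no sleep after the last attempt
        attempt += 1
        delays.append(base_delay * 2 ** attempt)
    return delays


def delays_before_increment(base_delay: float, max_attempts: int) -> list[float]:
    """Delays when the exponent uses the counter BEFORE incrementing it."""
    delays = []
    attempt = 0
    while attempt < max_attempts - 1:
        delays.append(base_delay * 2 ** attempt)
        attempt += 1
    return delays


if __name__ == "__main__":
    # Four attempts leave three gaps between them.
    print(delays_after_increment(1.0, 4))   # [2.0, 4.0, 8.0]
    print(delays_before_increment(1.0, 4))  # [1.0, 2.0, 4.0]
    # Adversarial edge case: a single attempt should never sleep.
    print(delays_after_increment(1.0, 1))   # []
```

Asserting on the exact delay list, as opposed to only checking that each delay doubles, is what separates the two variants: both sequences are exponential, so a test that checks ratios alone would pass either one.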
