coding-agent-helpers/skeptic-verifier

Use when the user wants an adversarial double-check of a code or config change. Run the strongest checks available, try to break the claim, look for edge cases and hidden regressions, and return PASS, PARTIAL, or FAIL with evidence. Good triggers include "poke holes in this", "stress test this change", "double check this fix", and "try to break it".

Score: 84 (1.30x)
Quality: 94% (Does it follow best practices?)
Impact: 81%, 1.30x (average score across 8 eval scenarios)
Security (by Snyk): Passed, no known issues


evals/scenario-5/criteria.json

{
  "context": "Tests whether the agent returns PARTIAL (not PASS) when the environment blocks strong verification, such as when a live PostgreSQL database is required to truly execute the script, and whether it is explicit about that limitation.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Verdict is PARTIAL not PASS",
      "description": "The verdict is PARTIAL (or FAIL) — not PASS — because the agent cannot run the script against a real PostgreSQL database",
      "max_score": 18
    },
    {
      "name": "Environment block cited",
      "description": "The report explicitly states that it could not run the script because no DATABASE_URL or live PostgreSQL instance is available",
      "max_score": 14
    },
    {
      "name": "Static review acknowledged",
      "description": "The report explicitly acknowledges that verification was limited to static/code inspection, using phrasing like 'static review', 'code reading only', or 'could not execute'",
      "max_score": 10
    },
    {
      "name": "Remaining Risks present",
      "description": "A Remaining Risks section is present with at least two untested risks related to the live database scenario",
      "max_score": 10
    },
    {
      "name": "Identifies batching gap",
      "description": "The report identifies that the subquery approach (SELECT id WHERE user_tier IS NULL LIMIT N) might not correctly paginate on a live table as rows are updated concurrently",
      "max_score": 10
    },
    {
      "name": "Lock risk examined",
      "description": "The report examines whether the batched UPDATE approach actually avoids row-level locks or introduces lock contention risks on a 50M row table",
      "max_score": 8
    },
    {
      "name": "Idempotency claim checked",
      "description": "The report assesses whether the 'skips already backfilled rows' and column-exists check actually make the script safe to re-run",
      "max_score": 8
    },
    {
      "name": "Valid verdict value",
      "description": "Verdict is exactly PASS, PARTIAL, or FAIL",
      "max_score": 8
    },
    {
      "name": "Correct output sections",
      "description": "Report contains Claim, Attempts To Break It, Evidence, and Verdict sections",
      "max_score": 6
    },
    {
      "name": "One-sentence claim",
      "description": "The Claim section states the verification claim in a single sentence",
      "max_score": 8
    }
  ]
}
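
The page does not show how a "weighted_checklist" is tallied. What follows is a minimal sketch, assuming the grader awards each item between 0 and its max_score and sums the results. The function name score_report and the awarded mapping are hypothetical, introduced only for illustration. The embedded JSON is a trimmed copy of the checklist above (names and max_score values only).

```python
import json

# Trimmed copy of the checklist above: only the fields this sketch needs.
CRITERIA_JSON = """
{
  "type": "weighted_checklist",
  "checklist": [
    {"name": "Verdict is PARTIAL not PASS", "max_score": 18},
    {"name": "Environment block cited", "max_score": 14},
    {"name": "Static review acknowledged", "max_score": 10},
    {"name": "Remaining Risks present", "max_score": 10},
    {"name": "Identifies batching gap", "max_score": 10},
    {"name": "Lock risk examined", "max_score": 8},
    {"name": "Idempotency claim checked", "max_score": 8},
    {"name": "Valid verdict value", "max_score": 8},
    {"name": "Correct output sections", "max_score": 6},
    {"name": "One-sentence claim", "max_score": 8}
  ]
}
"""


def score_report(criteria, awarded):
    """Hypothetical scorer: sum per-item awards, clamped to [0, max_score].

    `awarded` maps checklist item names to the points a grader gave.
    Items missing from `awarded` count as 0.
    Returns a (earned, possible) tuple.
    """
    earned = 0
    possible = 0
    for item in criteria["checklist"]:
        cap = item["max_score"]
        possible += cap
        earned += max(0, min(awarded.get(item["name"], 0), cap))
    return earned, possible


if __name__ == "__main__":
    criteria = json.loads(CRITERIA_JSON)

    # Illustrative case: a report that is otherwise complete but returns
    # PASS, so it earns nothing on the first (heaviest) item.
    awarded = {item["name"]: item["max_score"] for item in criteria["checklist"]}
    awarded["Verdict is PARTIAL not PASS"] = 0

    earned, possible = score_report(criteria, awarded)
    print(f"{earned}/{possible}")
```

Under these assumptions the weights total 100, so a report that wrongly returns PASS loses the 18 points attached to the first item even if every other item is satisfied.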
