
uinaf/verify

Verify your own completed code changes using the repo's existing infrastructure and an independent evaluator context. Use after implementing a change when you need to run unit or integration tests, check build or lint gates, prove the real surface works with evidence, and challenge the changed code for clarity, deduplication, and maintainability. If the repo is not verifiable yet, hand off to `agent-readiness`; if you are reviewing someone else's code, use `review`.

Overall score: 97 (1.02x)

Quality: 100% — Does it follow best practices?

Impact: 89% (1.02x) — Average score across 3 eval scenarios

Security (by Snyk): Passed — No known issues


evals/scenario-2/criteria.json

{
  "context": "Tests whether the agent runs the repo's built-in verification infrastructure (make verify or pytest) before making any claims, produces a complete report with all required sections, exercises a real failure path, and correctly records exact evidence — rather than just reading the code and opining.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Guardrails run first",
      "description": "Report shows that the repo's built-in test command (make verify, make test, or pytest on test_calculator.py) was executed before any verdict or surface claim was made",
      "max_score": 15
    },
    {
      "name": "Test output recorded",
      "description": "Report includes actual pytest output (e.g. test names, PASSED/FAILED indicators, or summary line like '7 passed') from running the test suite — not just a statement that tests were run",
      "max_score": 10
    },
    {
      "name": "Failure path exercised",
      "description": "Report demonstrates that at least one error path was exercised (e.g. divide by zero, unknown operation) either through the test suite or by direct invocation, and the behavior is documented",
      "max_score": 10
    },
    {
      "name": "Verdict field present",
      "description": "Report contains a 'Verdict' section using exactly one of: 'ship it', 'needs review', or 'blocked'",
      "max_score": 10
    },
    {
      "name": "Change Verified section",
      "description": "Report includes a 'Change Verified' section (or equivalent) describing what was confirmed to work",
      "max_score": 8
    },
    {
      "name": "Surfaces Exercised section",
      "description": "Report includes a 'Surfaces Exercised' section (or equivalent) naming the specific commands or functions that were invoked",
      "max_score": 8
    },
    {
      "name": "Code-Shape Findings section",
      "description": "Report includes a 'Code-Shape Findings' section (or equivalent) containing observations about code quality, clarity, error handling, or duplication in the changed files",
      "max_score": 8
    },
    {
      "name": "Exact Evidence section",
      "description": "Report includes an 'Exact Evidence' section (or equivalent) containing the actual commands run and representative output received",
      "max_score": 8
    },
    {
      "name": "Recommended Follow-up section",
      "description": "Report includes a 'Recommended Follow-up' section (or equivalent) suggesting concrete next steps",
      "max_score": 8
    },
    {
      "name": "Error actionability assessed",
      "description": "Report comments on whether the error messages raised (e.g. 'Cannot divide by zero', 'Unknown operation: X') are clear, actionable, and help the caller understand what to do next",
      "max_score": 8
    },
    {
      "name": "No unverified claims",
      "description": "Report does NOT claim behavior is correct without running the code (e.g. does not say 'the divide function looks correct' without having executed it or its tests)",
      "max_score": 7
    }
  ]
}
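The `max_score` values above sum to 100, so the eval score can be read directly as a percentage. A minimal sketch of how such a `weighted_checklist` might be scored — the scoring function and the `awarded` dict are hypothetical illustrations; only the checklist shape comes from `criteria.json`:

```python
# Hypothetical scorer for a weighted_checklist criteria file.
# Only the {"checklist": [{"name", "max_score"}, ...]} shape is taken
# from criteria.json; the grading interface itself is an assumption.

def score_checklist(criteria: dict, awarded: dict) -> float:
    """Return the percentage score, given points awarded per item name."""
    total_max = sum(item["max_score"] for item in criteria["checklist"])
    total_awarded = sum(
        # Clamp so an item can never earn more than its max_score.
        min(awarded.get(item["name"], 0), item["max_score"])
        for item in criteria["checklist"]
    )
    return 100.0 * total_awarded / total_max

# Abbreviated example using three of the items defined above.
criteria = {
    "checklist": [
        {"name": "Guardrails run first", "max_score": 15},
        {"name": "Test output recorded", "max_score": 10},
        {"name": "No unverified claims", "max_score": 7},
    ]
}
awarded = {"Guardrails run first": 15, "Test output recorded": 10}
print(round(score_checklist(criteria, awarded), 3))  # 25 of 32 points -> 78.125
```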

Files:

- evals/
- SKILL.md
- tile.json