
uinaf/verify

Verify your own completed code changes using the repo's existing infrastructure and an independent evaluator context. Use after implementing a change when you need to run unit or integration tests, check build or lint gates, prove the real surface works with evidence, and challenge the changed code for clarity, deduplication, and maintainability. If the repo is not verifiable yet, hand off to `agent-readiness`; if you are reviewing someone else's code, use `review`.
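As a rough illustration of "prove the real surface works with evidence", the sketch below runs a command and keeps the exact command line, the exit code, and a short tail of output. The helper name `run_and_record` and the record layout are assumptions made for this example, not part of the skill. The command is a stand-in so the sketch runs anywhere; in a real repo it would be the repo's own gate, such as `make verify` or `pytest`.

```python
import subprocess
import sys


def run_and_record(cmd, tail_lines=5):
    """Run a command and return a small evidence record (hypothetical layout):
    the exact command, its exit code, and the last few lines of output."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    lines = (proc.stdout + proc.stderr).strip().splitlines()
    return {
        "command": " ".join(cmd),
        "exit_code": proc.returncode,
        "output_tail": lines[-tail_lines:],
    }


# Stand-in for the repo's real test command, used only so this sketch is runnable.
evidence = run_and_record([sys.executable, "-c", "print('7 passed')"])
print(evidence["exit_code"], evidence["output_tail"])
```

Keeping only a short tail of output matches the intent of terse evidence for successful runs, while a non-zero `exit_code` signals that the fuller output is worth quoting.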

97

1.02x
Quality: 98% (Does it follow best practices?)

Impact: 94%, 1.02x (average score across 3 eval scenarios)

Security by Snyk: Passed, no known issues


evals/scenario-2/criteria.json

{
  "context": "Tests whether the agent runs the repo's built-in verification infrastructure (make verify or pytest) before making any claims, produces a compact evidence-backed verification footer, exercises a real failure path, and correctly records exact evidence — rather than just reading the code and opining.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Guardrails run first",
      "description": "Report shows that the repo's built-in test command (make verify, make test, or pytest on test_calculator.py) was executed before any verdict or surface claim was made",
      "max_score": 15
    },
    {
      "name": "Test output recorded",
      "description": "Report includes actual pytest output (e.g. test names, PASSED/FAILED indicators, or summary line like '7 passed') from running the test suite — not just a statement that tests were run",
      "max_score": 10
    },
    {
      "name": "Failure path exercised",
      "description": "Report demonstrates that at least one error path was exercised (e.g. divide by zero, unknown operation) either through the test suite or by direct invocation, and the behavior is documented",
      "max_score": 10
    },
    {
      "name": "Verdict field present",
      "description": "Report contains a 'Verdict' section using exactly one of: 'ready for review', 'needs more work', or 'blocked'. Does NOT use 'ship it' (that verdict belongs to review, not verify).",
      "max_score": 10
    },
    {
      "name": "Change Verified section",
      "description": "Report states what was confirmed to work either in a compact footer or in nearby verification notes",
      "max_score": 8
    },
    {
      "name": "Surfaces Exercised section",
      "description": "Report names the specific commands, functions, or runtime surfaces that were invoked without requiring a verbose section",
      "max_score": 8
    },
    {
      "name": "Self-Corrections section",
      "description": "Report lists cheap, obvious fixes made during verification when they happened, and omits this line when none happened. Substantive code-shape concerns are deferred to review, not pre-fixed here.",
      "max_score": 8
    },
    {
      "name": "Exact Evidence section",
      "description": "Report includes exact commands run and representative output received where failures or runtime responses matter, while keeping successful output terse",
      "max_score": 8
    },
    {
      "name": "Recommended Follow-up section",
      "description": "Report includes a compact `next` or equivalent follow-up line suggesting concrete next steps",
      "max_score": 8
    },
    {
      "name": "Compact verification footer",
      "description": "Report keeps the final verification footer to no more than 5 labeled lines, avoids repeated logs or screenshots, and summarizes successful evidence by command or surface name",
      "max_score": 8
    },
    {
      "name": "Error actionability assessed",
      "description": "Report comments on whether the error messages raised (e.g. 'Cannot divide by zero', 'Unknown operation: X') are clear and actionable and help the caller understand what to do next",
      "max_score": 8
    },
    {
      "name": "No unverified claims",
      "description": "Report does NOT claim behavior is correct without running the code (e.g. does not say 'the divide function looks correct' without having executed it or its tests)",
      "max_score": 7
    }
  ]
}
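
A `weighted_checklist` like the one above can be turned into a single score by summing the points awarded per item and dividing by the sum of the `max_score` values. The sketch below shows one way to do that. The function name `score_checklist`, the clamping rule, and the sample awarded points are assumptions for illustration; the page does not show how the actual grader aggregates scores.

```python
import json

# Excerpt with two entries copied from the checklist above (descriptions omitted).
CRITERIA_JSON = """
{
  "type": "weighted_checklist",
  "checklist": [
    {"name": "Guardrails run first", "max_score": 15},
    {"name": "No unverified claims", "max_score": 7}
  ]
}
"""


def score_checklist(criteria, awarded):
    """Return (points, max_points, percent) for a weighted checklist.

    `awarded` maps item name -> points from a grader (hypothetical input).
    Missing items count as 0 and values are clamped to [0, max_score].
    """
    total = 0
    maximum = 0
    for item in criteria["checklist"]:
        cap = item["max_score"]
        got = awarded.get(item["name"], 0)
        total += max(0, min(got, cap))
        maximum += cap
    percent = 100.0 * total / maximum if maximum else 0.0
    return total, maximum, percent


criteria = json.loads(CRITERIA_JSON)
# Made-up awarded points, purely for demonstration.
result = score_checklist(criteria, {"Guardrails run first": 15, "No unverified claims": 4})
print(result)
```

The twelve `max_score` values in the full file sum to 108 rather than 100, so some normalizing step of this kind is needed to report the result as a percentage.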
