
uinaf/review

Review existing code, diffs, branches, or pull requests using concern-specific reviewer personas and evidence. Use when auditing someone else's work, triaging risk in a PR, or producing a ship-it / needs-review / blocked verdict. Do not use to verify your own completed change; use `verify` for that.

98 · 1.31x

Quality: 100% (Does it follow best practices?)

Impact: 92% · 1.31x (average score across 3 eval scenarios)

Security by Snyk: Passed, no known issues


evals/scenario-1/criteria.json

{
  "context": "Tests whether the agent produces an evidence-backed review report with correct verdict shape, ordered findings, explicit unverified markers, file references with line numbers, and incorporates repo guidance from CLAUDE.md. Also tests correct silent-failures and cleanup persona application.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "CLAUDE.md loaded",
      "description": "The report references or applies at least one rule from CLAUDE.md (e.g. MAX_RETRIES constant, dead-letter requirement, logging with payment ID, error classification)",
      "max_score": 10
    },
    {
      "name": "File references in findings",
      "description": "At least two findings cite a specific file name (e.g. retry.ts, retry.test.ts) rather than only vague descriptions",
      "max_score": 8
    },
    {
      "name": "Line-level evidence",
      "description": "At least one finding references a specific line number or code snippet within a file",
      "max_score": 8
    },
    {
      "name": "Verdict present",
      "description": "The report contains exactly one verdict label: 'ship it', 'needs review', or 'blocked'",
      "max_score": 8
    },
    {
      "name": "Findings by severity",
      "description": "Findings are explicitly labeled or ordered by severity (e.g. high/medium/low, or ranked from most to least critical)",
      "max_score": 8
    },
    {
      "name": "Error classification finding",
      "description": "The report identifies that transient vs permanent payment errors are not classified before retrying (all errors lead to PENDING_RETRY), which violates the CLAUDE.md rule",
      "max_score": 10
    },
    {
      "name": "Silent failure finding",
      "description": "The report identifies that the catch block in scheduleRetry only logs a vague warning without the error message or type, losing failure signal",
      "max_score": 8
    },
    {
      "name": "Mock-heavy test concern",
      "description": "The report notes that the tests mock the database, charge provider, and logger — meaning the tests would pass even if real integrations broke",
      "max_score": 6
    },
    {
      "name": "Dead code identified",
      "description": "The report identifies retry.old.ts as dead/deprecated code that should be removed",
      "max_score": 6
    },
    {
      "name": "Unverified surfaces marked",
      "description": "The report explicitly marks at least one area as unverified (e.g. actual runtime behavior, payment provider behavior, monitoring integration)",
      "max_score": 8
    },
    {
      "name": "Recommended follow-up",
      "description": "The report recommends a specific follow-up from: implementation, verify, agent-readiness, or docs",
      "max_score": 6
    },
    {
      "name": "No nit inflation",
      "description": "The report does NOT elevate purely stylistic issues (naming conventions, formatting) to the same severity as functional defects",
      "max_score": 6
    },
    {
      "name": "Personas listed",
      "description": "The report explicitly names which reviewer personas were used",
      "max_score": 8
    }
  ]
}
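The file above declares `"type": "weighted_checklist"` but does not say how the per-item scores are combined. The 13 `max_score` values sum to 100, so one plausible reading is that awarded points are simply added up. The sketch below illustrates that reading only; `score_checklist`, the `awarded` mapping, and the clamping behaviour are hypothetical and not taken from the source.

```python
import json


def score_checklist(criteria: dict, awarded: dict) -> float:
    """Return the percentage of available points earned (0 to 100).

    Assumption: a weighted checklist is aggregated as a plain sum of
    per-item points divided by the sum of max_score values. The source
    does not confirm this rule.
    """
    checklist = criteria["checklist"]
    total = sum(item["max_score"] for item in checklist)
    earned = 0.0
    for item in checklist:
        # Items missing from `awarded` count as 0; values are clamped
        # to the range [0, max_score] so a grader cannot over-award.
        points = awarded.get(item["name"], 0)
        earned += min(max(points, 0), item["max_score"])
    return 100.0 * earned / total if total else 0.0


# Abbreviated copy of two entries from criteria.json (descriptions omitted).
criteria = json.loads("""
{
  "type": "weighted_checklist",
  "checklist": [
    {"name": "CLAUDE.md loaded", "max_score": 10},
    {"name": "Verdict present", "max_score": 8}
  ]
}
""")

# Hypothetical grader output: full marks on one item, half on the other.
awarded = {"CLAUDE.md loaded": 10, "Verdict present": 4}
print(score_checklist(criteria, awarded))
```

Under this assumption, an item's `max_score` acts as its weight: missing the 10-point "Error classification finding" costs more than missing the 6-point "Dead code identified".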
