CtrlK
BlogDocsLog inGet started
Tessl Logo

jbaruch/coding-policy

General-purpose coding policy for Baruch's AI agents

95

1.31x
Quality

91%

Does it follow best practices?

Impact

96%

1.31x

Average score across 10 eval scenarios

SecuritybySnyk

Advisory

Suggest reviewing before use

Overview
Quality
Evals
Security
Files

criteria.jsonevals/scenario-6/

{
  "context": "Tests whether the script follows this tile's prescribed mechanisms for polling CI status and scraping reviews. The tile prescribes a specific trio — `gh pr checks <N> --json name,bucket` for CI (structured output the caller parses in a polling loop), `gh api repos/<o>/<r>/pulls/<N>/reviews` for review state, `gh api repos/<o>/<r>/pulls/<N>/comments` (pull-request review comments, NOT `/issues/<N>/comments`) for inline comments. Baseline agents typically pick a different combination: ad-hoc greps of `gh pr view` text output, `gh run list` polling, or the issue-comments endpoint (which returns wrong data). The specific-mechanism criteria below measure whether the tile was applied.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Uses `gh pr checks` with structured output",
      "description": "The script retrieves CI state by invoking `gh pr checks <N>` with `--json` (the prescribed mechanism returns structured `name`/`bucket` fields the caller can parse into a terminal state). `gh run list`, ad-hoc greps of `gh pr view` text output, or UI polling score zero — those are different approaches that do not give the caller a structured terminal-state signal",
      "max_score": 15
    },
    {
      "name": "Uses `gh api .../pulls/<N>/reviews` for review state",
      "description": "The script hits the pull-request reviews endpoint (`gh api repos/<owner>/<repo>/pulls/<N>/reviews`) to retrieve per-reviewer state. Other endpoints (`gh pr view`, `gh api` on a different path) score zero — the tile prescribes this endpoint specifically because it returns the per-review structure needed for multi-bot setups",
      "max_score": 15
    },
    {
      "name": "Uses `gh api .../pulls/<N>/comments` for inline comments",
      "description": "The script retrieves inline review comments via the pull-request review-comments endpoint (`gh api repos/<owner>/<repo>/pulls/<N>/comments`). This is a tile-prescribed endpoint choice that avoids the issue-comments trap",
      "max_score": 15
    },
    {
      "name": "Does NOT use `/issues/<N>/comments`",
      "description": "The script must NOT use the issue-comments endpoint (`gh api repos/<o>/<r>/issues/<N>/comments`) for inline review comments. That endpoint returns a different object shape (PR-level comments, not inline review comments) and surfaces wrong data. A script that conflates the two scores zero here — this is a known GitHub-API gotcha the tile exists to prevent",
      "max_score": 10
    },
    {
      "name": "Retrieves per-reviewer state distinctly",
      "description": "The script captures the latest review state for EACH reviewer distinctly, not a single aggregate. Multi-bot setups (Copilot + gh-aw) each have their own state, and the tile prescribes exposing them separately so the developer sees which bot is gating the merge",
      "max_score": 10
    },
    {
      "name": "No hardcoded PR, owner, or repo in the script body",
      "description": "OBSERVABLE: the script body contains no literal for the target PR, owner, or repo — all three are pulled from runtime inputs (args, env vars). Running against a new target requires changing only the invocation, not the script source",
      "max_score": 10
    },
    {
      "name": "Waits for CI to finish before surfacing state",
      "description": "The script (or the prescribed caller loop around it) polls until CI is in a terminal state before declaring a final verdict. Surfacing mid-run `pending` as the summary outcome scores zero — any reasonable wait mechanism (the caller looping on the script's output, `gh run watch` on the CI run, etc.) is acceptable",
      "max_score": 4
    },
    {
      "name": "Surfaces CI state in the summary",
      "description": "The final summary prints a terminal CI verdict (success / failure / cancelled / timed-out), parsed from `gh pr checks` JSON output. `pending` is not a valid final verdict",
      "max_score": 4
    },
    {
      "name": "Surfaces review states in the summary",
      "description": "The final summary prints each reviewer's latest state — not just raw JSON the developer has to parse",
      "max_score": 8
    },
    {
      "name": "Surfaces inline comment content or count",
      "description": "The final summary includes inline comment bodies (for low counts) or a count with a pointer to read them, so the developer can tell whether there is unaddressed feedback",
      "max_score": 9
    }
  ]
}

evals

README.md

tile.json