General-purpose coding policy for Baruch's AI agents
Score: 95
Follows best practices: 91%
Impact: 96%
Average score across 10 eval scenarios: 1.31x
Advisory: suggest reviewing before use
{
  "context": "Tests whether the agent applies this tile's prescribed review-reply conventions — not whether a baseline agent reasons to some equivalent reply. The tile invents specific short phrases (`Fixed in <sha>` for accepts, `Declining — <reason>` for declines) so review threads scan consistently across a team. Baseline agents would produce ad-hoc replies (`done`, `no thanks`, `I disagree`). The specific-format criteria below measure whether the tile's convention was applied; the substance criteria measure whether the reply has the content the convention is meant to carry.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "CI failure: fix required",
      "description": "For the CI failure (failing unit test), the guide states the test must be fixed. Does NOT suggest skipping, ignoring, deferring, or `[skip ci]`",
      "max_score": 10
    },
    {
      "name": "Applies the reasonable suggestion",
      "description": "For the clearly-correct suggestion (simplify the nested conditional), the guide says to apply it — does NOT suggest declining or deferring",
      "max_score": 10
    },
    {
      "name": "Declines the over-engineered suggestion",
      "description": "For the over-engineered suggestion (unnecessary abstraction for current scope), the guide says to push back — does NOT say to blindly implement it",
      "max_score": 10
    },
    {
      "name": "All three threads get replies",
      "description": "The guide produces a reply for EVERY thread (CI failure, accept, decline) — not two of three. An empty reply or 'no response needed' on any thread scores zero",
      "max_score": 12
    },
    {
      "name": "Accept reply uses the `Fixed in <sha>` format",
      "description": "The accept reply follows the tile's prescribed format — it begins with the literal phrase 'Fixed in' followed by a SHA (either a real commit hash or, per the task's placeholder instruction, the literal `<sha>` marker). A reply that says 'Done' or 'Resolved' or 'Applied the fix' scores zero even if semantically equivalent, because the tile exists to standardize the specific phrasing across the team",
      "max_score": 15
    },
    {
      "name": "Decline reply uses the `Declining — <reason>` format",
      "description": "The decline reply follows the tile's prescribed format — it begins with the literal word 'Declining', an em dash (—), then a reason (either prose or, per the task's placeholder instruction, the literal `<reason>` marker). A reply that says 'Disagree' or 'No thanks' or 'Not doing this' scores zero — the tile prescribes this specific opening precisely so declines are unambiguous in thread scans",
      "max_score": 15
    },
    {
      "name": "Decline reply cites a verifiable reference",
      "description": "The decline reply names at least one externally-verifiable reference for the rejection: a file:line, quoted spec/docs clause, linked prior PR/issue, or named scope constraint from the PR description. This complements the format check — the format alone is not enough; the tile also requires substance",
      "max_score": 8
    },
    {
      "name": "Fixes pushed to the same branch",
      "description": "The guide states any code changes go on the EXISTING feature branch of the PR, not a new branch. Opening a second PR for review fixes scores zero",
      "max_score": 10
    },
    {
      "name": "No dangling threads before merge",
      "description": "The guide explicitly states that EVERY thread must have a reply before the PR is merged — nothing left unresolved",
      "max_score": 10
    }
  ]
}
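The two format criteria above are mechanical enough to check with regexes. A minimal sketch of how a grader might classify a reply against the tile's conventions; the function name and patterns here are assumptions for illustration, not part of the rubric itself:

```python
import re

# Accept replies must open with the literal phrase "Fixed in" followed by
# either a hex commit hash or the literal "<sha>" placeholder (per the rubric).
ACCEPT_RE = re.compile(r"^Fixed in (?:[0-9a-f]{7,40}|<sha>)")

# Decline replies must open with "Declining", an em dash, then a reason
# (prose or the literal "<reason>" placeholder).
DECLINE_RE = re.compile(r"^Declining — (?:.+|<reason>)")

def classify_reply(reply: str) -> str:
    """Hypothetical helper: return which convention a review reply matches."""
    if ACCEPT_RE.match(reply):
        return "accept"
    if DECLINE_RE.match(reply):
        return "decline"
    return "ad-hoc"  # e.g. "done", "no thanks" — scores zero on the format criteria
```

Replies like `Fixed in 3f2a1bc` or `Declining — out of scope per PR description` pass, while semantically equivalent ad-hoc phrasings ("Done", "Disagree") do not, which is exactly the distinction the format criteria are scoring.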