General-purpose coding policy for Baruch's AI agents
Overall score: 95
Does it follow best practices? 91%
Impact: 96%
Average score across 10 eval scenarios: 1.31x
Advisory: Suggest reviewing before use
{
"context": "Tests whether the agent can audit an existing eval suite for coverage gaps and author new scenarios that follow this tile's eval-authoring conventions. Weight concentrates on tile-specific quality properties (correct criteria format, no bleeding, no leaking, negative-case coverage, meaningful failure descriptions) rather than on whether the agent spotted the gaps — basic QA reasoning is enough to identify gaps from the skill's functional description, so that axis contributes little lift on its own.",
"type": "weighted_checklist",
"checklist": [
{
"name": "Identifies at least two uncovered decision branches",
"description": "A written coverage analysis exists (as a file in the working directory; the task lets the agent pick the filename) and it names at least two concrete branches the existing two scenarios don't exercise (any combination of: failure paths, refusals, alternative inputs, edge conditions). Weighted low because identifying gaps from a functional skill description is basic QA reasoning and is not tile-specific",
"max_score": 10
},
{
"name": "Writes new scenario directories",
"description": "At least two new directories appear under `evals/`, each containing both a `task.md` and a `criteria.json`. Scenario names must not recreate the two existing happy-path scenarios",
"max_score": 10
},
{
"name": "Criteria files follow the weighted_checklist format prescribed by the tile",
"description": "Each new `criteria.json` is a valid JSON object with a `context` string, `type: \"weighted_checklist\"`, and a `checklist` array where every entry has `name`, `description`, and `max_score`. Missing any of these fields drops the score proportionally. Tile-specific — the format is prescribed by plugin-evals guidance, not guessable from general practice",
"max_score": 15
},
{
"name": "Criteria weights sum to 100 and are not equally distributed",
"description": "Each new scenario's weights sum to exactly 100 AND are not all identical. This tile's convention is that weights sum to 100 and reflect importance, not a flat split — uniform weights or sums other than 100 score zero",
"max_score": 5
},
{
"name": "New task.md files pass the no-bleeding check",
"description": "No criterion's expected literal value appears verbatim in its own scenario's task description. For each criterion with a concrete expected value (specific strings, flags, API names), grep the task text — a match is bleeding, per this tile's plugin-evals rule",
"max_score": 15
},
{
"name": "New criteria don't leak tile internals",
"description": "Criteria describe observable behaviour, not tile-internal file paths, action names, or invented string conventions. Referencing public tool/API surfaces is fine; referencing `.tessl/tiles/...` paths, internal routing, or prescribed reply templates is leaking",
"max_score": 10
},
{
"name": "Failure descriptions are specific",
"description": "Every new criterion's `description` explains what went wrong on failure in a way a grader can act on — not just `mismatch`, `fails`, or a restatement of the criterion name. Generic descriptions score zero on this axis",
"max_score": 10
},
{
"name": "At least one new scenario exercises a negative case",
"description": "At least one new scenario tests refusal-to-proceed, produce-silence, or a failure-path recovery — not another happy-path variation. This tile's plugin-evals rule requires both positive and negative cases, and the grading is on whether the author knows it",
"max_score": 10
},
{
"name": "Coverage analysis justifies each gap",
"description": "The written coverage analysis doesn't only list gaps — for each one it explains why the missing case matters (e.g., `without this, a production deploy without approval would pass evals despite being a policy violation`). Each gap without justification docks this criterion proportionally",
"max_score": 15
}
]
}
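
Several of the checklist properties above are mechanically checkable: the weighted_checklist shape (a `context` string, the exact `type` value, and `name`/`description`/`max_score` on every entry) and the weight convention (sum to exactly 100, not a flat split). Below is a minimal validator sketch in Python; it is not part of the tile. The field names come from the format described above, while the script name and CLI shape are assumptions.

#!/usr/bin/env python3
"""check_criteria.py -- hypothetical validator for this tile's criteria files."""
import json
import sys
from pathlib import Path

REQUIRED_ENTRY_FIELDS = ("name", "description", "max_score")

def validate_criteria(path: Path) -> list[str]:
    """Return human-readable problems; an empty list means the file passes."""
    problems: list[str] = []
    data = json.loads(path.read_text())

    # Top-level shape: context string, weighted_checklist type, checklist array.
    if not isinstance(data.get("context"), str):
        problems.append("missing or non-string 'context'")
    if data.get("type") != "weighted_checklist":
        problems.append("'type' must be exactly 'weighted_checklist'")
    checklist = data.get("checklist")
    if not isinstance(checklist, list) or not checklist:
        return problems + ["'checklist' must be a non-empty array"]

    # Every entry needs name, description, and max_score.
    for i, entry in enumerate(checklist):
        if not isinstance(entry, dict):
            problems.append(f"checklist[{i}] is not an object")
            continue
        for field in REQUIRED_ENTRY_FIELDS:
            if field not in entry:
                problems.append(f"checklist[{i}] is missing '{field}'")

    # Weights must sum to exactly 100 and must not be a flat split.
    weights = [e.get("max_score", 0) for e in checklist if isinstance(e, dict)]
    if sum(weights) != 100:
        problems.append(f"weights sum to {sum(weights)}, not 100")
    if len(set(weights)) == 1:
        problems.append("weights are uniform; the convention is that they reflect importance")
    return problems

if __name__ == "__main__":
    for arg in sys.argv[1:]:
        for problem in validate_criteria(Path(arg)):
            print(f"{arg}: {problem}")

Run as `python check_criteria.py evals/*/criteria.json`. The criteria object on this page passes: its weights are 10, 10, 15, 5, 15, 10, 10, 10, 15, which sum to 100 and are not uniform.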
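The no-bleeding rule is also grep-able, with the caveat that "concrete expected value" is a judgment call. A heuristic sketch follows, assuming that backtick-quoted tokens in a criterion's description are its expected literals; the helper name and the scenario-directory layout (`task.md` plus `criteria.json`, as required above) are otherwise invented for illustration.

#!/usr/bin/env python3
"""check_bleeding.py -- hypothetical no-bleeding scan for one scenario directory."""
import json
import re
import sys
from pathlib import Path

# Heuristic: treat backtick-quoted tokens as a criterion's concrete expected values.
BACKTICKED = re.compile(r"`([^`]+)`")

def bleeding_hits(scenario_dir: Path) -> list[tuple[str, str]]:
    """Return (criterion name, leaked literal) pairs found verbatim in task.md."""
    task_text = (scenario_dir / "task.md").read_text()
    criteria = json.loads((scenario_dir / "criteria.json").read_text())
    hits = []
    for entry in criteria["checklist"]:
        for literal in BACKTICKED.findall(entry["description"]):
            if literal in task_text:
                hits.append((entry["name"], literal))
    return hits

if __name__ == "__main__":
    for arg in sys.argv[1:]:
        for name, literal in bleeding_hits(Path(arg)):
            print(f"{arg}: criterion '{name}' bleeds `{literal}` into task.md")

Expect false positives: benign file-name mentions such as `task.md` itself will match, so treat hits as candidates for review rather than automatic failures.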
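The leaking rule is only partly mechanical. The sketch below catches the one unambiguous symptom named above (tile-internal `.tessl/tiles/...` paths appearing in criterion text); internal action names and prescribed reply templates still need a human reviewer.

#!/usr/bin/env python3
"""check_leaks.py -- hypothetical scan for tile-internal paths in criteria text."""
import json
import sys
from pathlib import Path

if __name__ == "__main__":
    for arg in sys.argv[1:]:
        criteria = json.loads(Path(arg).read_text())
        for entry in criteria["checklist"]:
            if ".tessl/tiles/" in entry["description"]:
                print(f"{arg}: criterion '{entry['name']}' references a tile-internal path")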