{
  "context": "Negative case for the eval-curation skill. Tests whether the agent correctly identifies that NO scenarios need curation in a clean suite, and produces a silent / minimal summary rather than fabricating diagnoses to fill the gap. The positive cases all sit in the healthy lift band (>=40) and the one negative case's 0-lift is explicitly acceptable per LIFT_ANALYSIS.md's Negative Cases section (universal-knowledge refusal). The tile's contribution being measured is the discriminating judgment between 'clean suite, finish' and 'always-find-something-to-prune', and the refusal to produce false-positive diagnoses on a healthy suite.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "identifies suite as clean",
      "description": "The curation-summary.md states that no scenarios need curation (or unmistakable paraphrase: 'suite is healthy', 'no action needed', etc.). A summary that diagnoses one or more healthy scenarios as needing curation does not satisfy this criterion",
      "max_score": 40
    },
    {
      "name": "does not fabricate diagnoses",
      "description": "The curation-summary.md does not contain a three-cause diagnosis or a `retire` / `fix-task` / `rewrite-criteria` recommendation for any of the listed scenarios. Fabricating a diagnosis on the healthy positive cases or the acceptable-0-lift negative case is the failure mode this criterion catches",
      "max_score": 35
    },
    {
      "name": "recognizes negative-case acceptability",
      "description": "The summary does not flag the `refuse-publish-with-uncommitted-changes` scenario as problematic despite its 0-lift, either implicitly (by including it in a 'clean suite' verdict) or explicitly (by citing the universal-knowledge refusal carve-out). Treating the 0-lift negative case as a weak-lift positive case is the specific mistake this criterion catches",
      "max_score": 15
    },
    {
      "name": "output is appropriately minimal",
      "description": "The curation-summary.md is concise — a one-line or one-paragraph statement of the no-action verdict, not a multi-section report listing every scenario individually with elaborate justification. Padding the output to look like work is the failure mode this criterion catches",
      "max_score": 10
    }
  ]
}

rules

README.md

tile.json

jbaruch/coding-policy

criteria.json.css-3qkkll{font-size:var(--chakra-font-sizes-sm);font-weight:var(--chakra-font-weights-normal);color:var(--chakra-colors-gray-300);}evals/scenario-11/

criteria.jsonevals/scenario-11/