General-purpose coding policy for Baruch's AI agents
96
90%
Does it follow best practices?
Impact
97%
1.24xAverage score across 14 eval scenarios
Passed
No known issues
{
"context": "Negative case for the eval-curation skill. Tests whether the agent correctly identifies that NO scenarios need curation in a clean suite, and produces a silent / minimal summary rather than fabricating diagnoses to fill the gap. The positive cases all sit in the healthy lift band (>=40) and the one negative case's 0-lift is explicitly acceptable per LIFT_ANALYSIS.md's Negative Cases section (universal-knowledge refusal). The tile's contribution being measured is the discriminating judgment between 'clean suite, finish' and 'always-find-something-to-prune', and the refusal to produce false-positive diagnoses on a healthy suite.",
"type": "weighted_checklist",
"checklist": [
{
"name": "identifies suite as clean",
"description": "The curation-summary.md states that no scenarios need curation (or unmistakable paraphrase: 'suite is healthy', 'no action needed', etc.). A summary that diagnoses one or more healthy scenarios as needing curation does not satisfy this criterion",
"max_score": 40
},
{
"name": "does not fabricate diagnoses",
"description": "The curation-summary.md does not contain a three-cause diagnosis or a `retire` / `fix-task` / `rewrite-criteria` recommendation for any of the listed scenarios. Fabricating a diagnosis on the healthy positive cases or the acceptable-0-lift negative case is the failure mode this criterion catches",
"max_score": 35
},
{
"name": "recognizes negative-case acceptability",
"description": "The summary does not flag the `refuse-publish-with-uncommitted-changes` scenario as problematic despite its 0-lift, either implicitly (by including it in a 'clean suite' verdict) or explicitly (by citing the universal-knowledge refusal carve-out). Treating the 0-lift negative case as a weak-lift positive case is the specific mistake this criterion catches",
"max_score": 15
},
{
"name": "output is appropriately minimal",
"description": "The curation-summary.md is concise — a one-line or one-paragraph statement of the no-action verdict, not a multi-section report listing every scenario individually with elaborate justification. Padding the output to look like work is the failure mode this criterion catches",
"max_score": 10
}
]
}evals
scenario-1
scenario-2
scenario-3
scenario-4
scenario-5
scenario-6
scenario-7
scenario-8
scenario-9
scenario-10
scenario-11
scenario-12
scenario-13
scenario-14
rules
skills
eval-curation
install-reviewer