General-purpose coding policy for Baruch's AI agents
{
"context": "Tests whether the agent applies the three-cause diagnosis to a near-zero-lift scenario where the task description leaks the tile-prescribed technique (the task literally states 'use `--ff-only`' and the criterion checks for `--ff-only`). The correct cause is 'Task leaked the technique' and the prescribed action is `fix-task` — strip the leak from the task, keep the criterion. The rule explicitly forbids dropping the criterion in this case. The tile's contribution being measured is the discriminating judgment between fix-task (correct) and drop-criterion (a wrong intervention that baseline agents sometimes pick).",
"type": "weighted_checklist",
"checklist": [
{
"name": "names canonical cause",
"description": "The diagnosis identifies the cause as 'Task leaked the technique' (or an unmistakable paraphrase that names the task text revealing the technique-under-test as the mechanism). Generic 'the test is gameable' without naming the canonical cause does not satisfy this criterion.",
"max_score": 30
},
{
"name": "prescribes fix-task",
"description": "The recommended action is `fix-task` — edit the task to strip the leak. `retire` and `rewrite-criteria` are both wrong for this cause and do not satisfy this criterion",
"max_score": 25
},
{
"name": "preserves the criterion",
"description": "The diagnosis explicitly states that the criterion should be kept (not dropped, not weakened). Dropping the criterion is the wrong intervention here per the rule, even though it would also kill the bleed — the criterion is what measures tile value once the leak is stripped",
"max_score": 25
},
{
"name": "task rewrite strips technique, keeps situation",
"description": "The proposed rewritten task removes the `--ff-only` literal (and any equivalent technique hint) while preserving the situation the user needs done (merge PR #42 cleanly). A rewrite that strips so much the task becomes unsolvable, or that leaves the leak in, does not satisfy this criterion",
"max_score": 20
}
]
}
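
The leak this scenario describes can be checked mechanically: if the literal that a criterion looks for also appears in the task text, a baseline agent can satisfy the criterion without the tile. The sketch below is only an illustration under stated assumptions. The function name `find_leaked_literals` and both task wordings are hypothetical and are not part of the tile or its eval harness; only the `--ff-only` literal, "merge PR #42 cleanly", and the four `max_score` values come from the scenario above.

```python
def find_leaked_literals(task_text: str, technique_literals: list[str]) -> list[str]:
    """Return the technique literals that appear verbatim in the task text.

    Hypothetical helper: a plain case-insensitive substring check, which
    catches literal leaks only, not paraphrased hints.
    """
    lowered = task_text.lower()
    return [lit for lit in technique_literals if lit.lower() in lowered]


# Hypothetical task wordings; the real task text is not shown in the scenario.
leaky_task = "Merge PR #42 cleanly; use `--ff-only`."
fixed_task = "Merge PR #42 cleanly."

# The literal the criterion checks for, per the scenario context.
criterion_literals = ["--ff-only"]

leaks_before = find_leaked_literals(leaky_task, criterion_literals)
leaks_after = find_leaked_literals(fixed_task, criterion_literals)

# The checklist weights from the scenario; they should add up to a full score.
max_scores = {
    "names canonical cause": 30,
    "prescribes fix-task": 25,
    "preserves the criterion": 25,
    "task rewrite strips technique, keeps situation": 20,
}
total_weight = sum(max_scores.values())

print(leaks_before)   # literals still present in the leaky task
print(leaks_after)    # literals present after the fix-task rewrite
print(total_weight)   # combined weight of all four criteria
```

A non-empty result for the original task and an empty result for the rewritten task matches the `fix-task` outcome: the task changes and the criterion stays untouched. A substring check cannot tell whether the rewritten task is still solvable, so the fourth criterion still needs human or model judgment.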