
jbaruch/coding-policy

General-purpose coding policy for Baruch's AI agents

96 (1.24x)

Quality: 90% (Does it follow best practices?)

Impact: 97% (1.24x), average score across 14 eval scenarios

Security (by Snyk): Passed, no known issues


evals/scenario-13/criteria.json

{
  "context": "Tests whether the agent applies the three-cause diagnosis to a near-zero-lift scenario where the task description leaks the tile-prescribed technique (the task literally states 'use `--ff-only`' and the criterion checks for `--ff-only`). The correct cause is 'Task leaked the technique' and the prescribed action is `fix-task` — strip the leak from the task, keep the criterion. The rule explicitly forbids dropping the criterion in this case. The tile's contribution being measured is the discriminating judgment between fix-task (correct) and drop-criterion (a wrong intervention that baseline agents sometimes pick).",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "names canonical cause",
      "description": "The diagnosis identifies the cause as 'Task leaked the technique' (or an unmistakable paraphrase that names the task text revealing the technique-under-test as the mechanism). Generic 'the test is gameable' without naming the canonical cause does not satisfy this criterion",
      "max_score": 30
    },
    {
      "name": "prescribes fix-task",
      "description": "The recommended action is `fix-task` — edit the task to strip the leak. `retire` and `rewrite-criteria` are both wrong for this cause and do not satisfy this criterion",
      "max_score": 25
    },
    {
      "name": "preserves the criterion",
      "description": "The diagnosis explicitly states that the criterion should be kept (not dropped, not weakened). Dropping the criterion is the wrong intervention here per the rule, even though it would also kill the bleed — the criterion is what measures tile value once the leak is stripped",
      "max_score": 25
    },
    {
      "name": "task rewrite strips technique, keeps situation",
      "description": "The proposed rewritten task removes the `--ff-only` literal (and any equivalent technique hint) while preserving the situation the user needs done (merge PR #42 cleanly). A rewrite that strips so much the task becomes unsolvable, or that leaves the leak in, does not satisfy this criterion",
      "max_score": 20
    }
  ]
}
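
The checklist above is a weighted rubric whose `max_score` values add up to 100. A minimal sketch of how such a rubric could be totalled, and of how a task-text leak like the one in this scenario might be detected, follows. The clamping rule, the function names `total_score` and `leaks_technique`, and the sample task strings are assumptions for illustration only; the source does not show how the eval harness actually scores or detects leaks.

```python
import json

# Abbreviated copy of the checklist above: names and max_score values only.
CRITERIA = json.loads("""
{
  "type": "weighted_checklist",
  "checklist": [
    {"name": "names canonical cause", "max_score": 30},
    {"name": "prescribes fix-task", "max_score": 25},
    {"name": "preserves the criterion", "max_score": 25},
    {"name": "task rewrite strips technique, keeps situation", "max_score": 20}
  ]
}
""")


def total_score(criteria, awarded):
    """Sum awarded points, clamping each to [0, max_score].

    The clamping is an assumed scoring rule, not one stated in the source.
    """
    total = 0
    for item in criteria["checklist"]:
        points = awarded.get(item["name"], 0)
        total += max(0, min(points, item["max_score"]))
    return total


def leaks_technique(task_text, technique_literal):
    """Hypothetical leak check: is the literal under test present in the task text?"""
    return technique_literal in task_text


# Maximum achievable score across all four criteria.
max_total = sum(item["max_score"] for item in CRITERIA["checklist"])

# Hypothetical task wordings: one that leaks the technique, one that does not.
leaky_task = "Merge PR #42 cleanly; use --ff-only."
clean_task = "Merge PR #42 cleanly."
```

Under these assumptions, `leaks_technique(leaky_task, "--ff-only")` is true and `leaks_technique(clean_task, "--ff-only")` is false, which mirrors the `fix-task` action: the literal is stripped from the task while the criterion that checks for it stays in place. A plain substring check would miss paraphrased hints, so it is only a first-pass signal.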
