Reference material for Step 9 of the eval-authoring skill. Pulled out of the SKILL.md body so the main workflow stays scannable.
```
lift = with_context_score - baseline_score
```
Lift is the number that matters. Aggregate attainment on its own is a vanity metric: a tile scoring 99% with-context and 73% baseline is contributing 26 points of real value, not 99. A scenario that scores 100/100 with the tile loaded but also 100/100 at baseline is delivering zero tile value, no matter how good the score looks.
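A minimal sketch of the per-scenario computation, using made-up scenario names and scores (nothing below comes from a real run):

```python
# Illustrative scores on a 0-1 scale; real numbers come from your eval harness.
scenarios = {
    "bot-id-discovery": {"with_context": 0.99, "baseline": 0.73},  # 26 points of real value
    "red-ci-refusal":   {"with_context": 1.00, "baseline": 1.00},  # zero tile value
}

for name, s in scenarios.items():
    lift = (s["with_context"] - s["baseline"]) * 100  # lift, in points
    print(f"{name}: with-context {s['with_context']:.0%}, baseline {s['baseline']:.0%}, lift {lift:.0f}")
```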
| Lift band | Verdict | Action |
|---|---|---|
| ≥ 40 | Healthy signal | The tile is doing real work; these scenarios usually check specific tile-prescribed choices (a particular bot-ID discovery step, a particular reply template, a particular CLI sequence). Keep them; do NOT soften criteria toward "testing reasoning" the baseline already handles |
| 10–39 | Weak | Audit for the three causes below — the scenario likely tests baseline competence rather than tile value |
| < 10 | No signal | Apply the three-cause diagnosis below before retiring; many "no lift" scenarios are actually "wrong target" |
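Read as code, the bands become a small triage helper; a sketch with thresholds taken directly from the table above:

```python
def lift_band(lift_points: float) -> str:
    """Classify a scenario's lift, in points, into the bands above."""
    if lift_points >= 40:
        return "healthy signal: keep the scenario"
    if lift_points >= 10:
        return "weak: audit for the three causes"
    return "no signal: run the three-cause diagnosis before retiring"
```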
Near-zero lift on a negative case is acceptable only when the baseline refusal is driven by universal knowledge (obvious error cases such as "do not push secrets" or "do not merge red CI"). Tile-specific refusal reasoning (e.g., "refuses to overwrite an existing reviewer workflow because the tile's overwrite-guard says so") must still show lift; if it doesn't, apply the three-cause diagnosis below.
For each scenario with non-zero lift but with-context below 100%, identify the failing criteria and decide where the problem lives:
- Skill: `skills/<name>/SKILL.md`
- Task: `evals/<scenario>/task.md`
- Criteria: `evals/<scenario>/criteria.json`

The skill / task / criteria triage is the most common reason a fix lands in the wrong place; re-asking "what artifact owns this concern?" before editing anything saves a round trip.
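A minimal sketch of that triage as data, assuming the layout above; the inline notes on what each artifact owns are illustrative, not taken from the skill:

```python
# Paths mirror the layout above; the ownership comments are assumptions.
FIX_LOCATIONS = {
    "skill":    "skills/{name}/SKILL.md",          # the tile's own guidance
    "task":     "evals/{scenario}/task.md",        # the scenario setup
    "criteria": "evals/{scenario}/criteria.json",  # the grading criteria
}

def fix_path(artifact: str, name: str, scenario: str) -> str:
    """Return the file to edit for a failing criterion owned by `artifact`."""
    return FIX_LOCATIONS[artifact].format(name=name, scenario=scenario)
```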