General-purpose coding policy for Baruch's AI agents
Reference material for Step 9 of the eval-authoring skill. Pulled out of the SKILL.md body so the main workflow stays scannable.
lift = with_context_score - baseline_score
Lift is the number that matters. Aggregate attainment on its own is a vanity metric — a tile scoring 99% with-context and 73% baseline is contributing 26 points of real value, not 99. A 100/100 scenario with the tile loaded but 100/100 baseline is delivering zero tile value, no matter how good the score looks.
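The arithmetic above can be sketched in a few lines of Python. The function name `lift` and its parameter names are illustrative assumptions that mirror the formula; they are not part of any eval tooling.

```python
def lift(with_context_score: float, baseline_score: float) -> float:
    """Return the points of score attributable to the tile (hypothetical helper)."""
    return with_context_score - baseline_score


# The two cases described in the text.
print(lift(99, 73))    # 26: the tile contributes 26 points, not 99
print(lift(100, 100))  # 0: a perfect score, but zero tile value
```

Reporting lift next to the raw with-context score keeps a high aggregate from masking a scenario that baseline already solves.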
glm-5.1 (the CI publish-run solver); a curation decision reached on a stronger solver such as claude-opus-4-7 must be re-checked against the floor before any retire

| Lift band | Verdict | Action |
|---|---|---|
| ≥ 40 | Healthy signal | The tile is doing real work: the scenario usually checks specific tile-prescribed choices (particular bot-ID discovery, particular reply template, particular CLI sequence). Keep these scenarios; do NOT soften the criteria toward "testing reasoning" that baseline already does |
| 10–39 | Weak | Audit for the three causes below — the scenario likely tests baseline competence rather than tile value |
| < 10 | No signal | Apply the three-cause diagnosis below before retiring; many "no lift" scenarios are actually "wrong target" |
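As a minimal sketch, the band table can be expressed as a small classifier. The name `lift_verdict` is hypothetical, and treating fractional lift between 39 and 40 as "Weak" is an assumption, since the table only lists whole-number bands.

```python
def lift_verdict(lift: float) -> str:
    """Map a lift value (in points) to the verdict column of the band table.

    Assumption: values strictly below 40 fall into the 10-39 band, and
    values strictly below 10 (including negative lift) count as no signal.
    """
    if lift >= 40:
        return "Healthy signal"
    if lift >= 10:
        return "Weak"
    return "No signal"


print(lift_verdict(26))  # Weak
print(lift_verdict(0))   # No signal
```

The verdict only selects which action row applies; for the two lower bands the three-cause diagnosis still has to be run before anything is retired.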
Cause #1 vs #3 discriminator — both present as "baseline already does it," and conflating them sends the action the wrong way. Ask: does the tile prescribe a more specific behaviour than the criteria currently grade, one baseline would not produce by default?
retire or rewrite-criteria: the salvageable-replacement test decides; there is no default action.
Near-zero lift on a negative case is acceptable only when the baseline refusal is driven by universal knowledge (obvious error cases — e.g., "do not push secrets," "do not merge red CI"). Tile-specific refusal reasoning (e.g., "refuses to overwrite an existing reviewer workflow because the tile's overwrite-guard says so") must still show lift; if it doesn't, apply the three-cause diagnosis above.
For each scenario with non-zero lift but with-context below 100%, identify the failing criteria and decide where the problem lives:
- `skills/<name>/SKILL.md`
- `evals/<scenario>/task.md`
- `evals/<scenario>/criteria.json`

Getting the skill / task / criteria triage wrong is the most common reason a fix lands in the wrong place; re-asking "what artifact owns this concern?" before editing anything saves a round trip.