General-purpose coding policy for Baruch's AI agents
Prune an existing eval suite down to the scenarios that actually pull weight. Process steps in order. Do not skip ahead.
The diagnostic vocabulary (lift bands, three-cause diagnosis) lives in:
skills/eval-authoring/LIFT_ANALYSIS.md
Read it before Step 3. The obligation to prune is set by rules/plugin-evals.md "Lift, Not Attainment".
tessl eval run .
Wait until the run completes. Record the run ID; Step 2 needs it.
Proceed immediately to Step 2.
tessl eval view --json <run-id> | python3 .tessl/plugins/jbaruch/coding-policy/skills/eval-curation/compute-lift.py
The script reads a tessl eval view --json payload from stdin (or a path argument), pairs each scenario's with-context/usage-spec variant against its baseline/without-context variant, sums the assessmentResults scores per side, and emits the per-scenario lift trio as JSON:
{
"lifts": [{ "scenario_id": "<uuid>", "lift": <float>, "with_context_total": <float>, "baseline_total": <float> }],
"skipped": [{ "scenario_id": "<uuid>", "reason": "<diagnostic>" }]
}
Scenarios missing a paired variant land in skipped with a diagnostic; the script does not silently drop them. The deterministic JSON walk and arithmetic live in the script per rules/script-delegation.md, so this step is reproducible and testable independently of any agent.
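The output shape above can be consumed with a few lines of Python. This is a minimal sketch, not part of compute-lift.py: the function name `summarize_lifts` is hypothetical, the sample scenario IDs and numbers are invented, and the sample treats lift as with-context total minus baseline total, which is an assumption about how the script defines it.

```python
import json


def summarize_lifts(payload_text):
    """Parse output in the documented shape; return (rows sorted by lift, skipped)."""
    payload = json.loads(payload_text)
    # Lowest lift first, so the scenarios most likely to need attention lead.
    rows = sorted(payload.get("lifts", []), key=lambda row: row["lift"])
    skipped = payload.get("skipped", [])
    return rows, skipped


# Invented sample data in the documented shape (IDs and scores are made up).
sample = json.dumps({
    "lifts": [
        {"scenario_id": "scenario-a", "lift": 35.0,
         "with_context_total": 90.0, "baseline_total": 55.0},
        {"scenario_id": "scenario-b", "lift": 2.0,
         "with_context_total": 80.0, "baseline_total": 78.0},
    ],
    "skipped": [{"scenario_id": "scenario-c", "reason": "missing baseline variant"}],
})

rows, skipped = summarize_lifts(sample)
for row in rows:
    print(f'{row["scenario_id"]}: lift={row["lift"]}')
for item in skipped:
    print(f'skipped {item["scenario_id"]}: {item["reason"]}')
```

Surfacing the skipped list alongside the lifts keeps unpaired scenarios visible during curation instead of letting them drop out of the analysis.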
Proceed immediately to Step 3.
Bucket each scenario into the lift bands defined in:
skills/eval-authoring/LIFT_ANALYSIS.md
Healthy positive-case bands and negative cases whose near-zero lift is acceptable per that file's Negative Cases section stay as-is.
If no scenarios sit in the actionable bands (no weak / no-lift positive cases, no tile-specific negative case that fails the lift expectation), the suite is clean. Produce a one-line curation-summary.md stating "no curation needed" and finish here — do not fabricate diagnoses for scenarios that don't need them.
Otherwise proceed immediately to Step 4 with two inputs to the three-cause diagnosis: the weak / no-lift positive cases, AND any tile-specific negative case whose lift fell below the acceptable band.
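The bucketing and routing in this step can be sketched as follows. The band names come from the text above, but the two numeric thresholds are hypothetical placeholders, not the values in LIFT_ANALYSIS.md, and the function names are illustrative. The sketch covers positive cases only; negative cases need the separate treatment that file's Negative Cases section describes.

```python
def bucket_by_lift(lifts, no_lift_below=3.0, weak_below=10.0):
    """Sort scenario rows into bands by lift.

    Both thresholds are HYPOTHETICAL placeholders; take the real band
    edges from skills/eval-authoring/LIFT_ANALYSIS.md.
    """
    buckets = {"healthy": [], "weak": [], "no-lift": []}
    for row in lifts:
        if row["lift"] < no_lift_below:
            buckets["no-lift"].append(row["scenario_id"])
        elif row["lift"] < weak_below:
            buckets["weak"].append(row["scenario_id"])
        else:
            buckets["healthy"].append(row["scenario_id"])
    return buckets


def actionable(buckets):
    """Scenario IDs that should be routed into the three-cause diagnosis."""
    return buckets["weak"] + buckets["no-lift"]


# Invented sample rows (IDs and lifts are made up).
sample_lifts = [
    {"scenario_id": "scenario-a", "lift": 35.0},
    {"scenario_id": "scenario-b", "lift": 6.0},
    {"scenario_id": "scenario-c", "lift": 0.5},
]

buckets = bucket_by_lift(sample_lifts)
todo = actionable(buckets)
print("no curation needed" if not todo else f"route to Step 4: {todo}")
```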
Apply the three-cause diagnosis from rules/plugin-evals.md "Lift, Not Attainment" to each scenario routed in from Step 3:
skills/eval-authoring/REVIEW_CHECKLIST.md's No Bleeding rules; keep the criterion. Do NOT drop the criterion.
Record the decision per scenario: retire, fix-task, or rewrite-criteria. Proceed immediately to Step 5.
For each retire: git rm -r evals/<scenario-dir> and note the removal in the tile's CHANGELOG.md under Unreleased.
For each fix-task: edit task.md per the No Bleeding rules — strip the technique / format / literal that leaked; keep the situation the user actually needs done.
For each rewrite-criteria: edit criteria.json so the checklist grades the specific manner the tile prescribes (flag choices, format literals, sequences, conventions), not universal competence. Re-weight so max_score values still sum to exactly 100; if removing a criterion leaves nothing tile-specific, retire the scenario instead.
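The re-weighting arithmetic can be sketched as below. It assumes the criteria can be read as a list of objects that each carry a max_score field and that the weights should be whole numbers; the real layout of criteria.json may differ, and the `name` field and the function name are hypothetical. The rounding uses the largest-remainder method so the total lands on exactly 100.

```python
def reweight_to_100(criteria):
    """Scale max_score values proportionally so they sum to exactly 100.

    ASSUMPTION: criteria is a list of dicts with a numeric "max_score".
    """
    total = sum(c["max_score"] for c in criteria)
    if total <= 0:
        raise ValueError("criteria must carry a positive total weight")
    exact = [c["max_score"] * 100 / total for c in criteria]
    scores = [int(x) for x in exact]  # round every share down first
    leftover = 100 - sum(scores)
    # Hand the leftover points to the largest fractional parts.
    order = sorted(range(len(criteria)),
                   key=lambda i: exact[i] - scores[i], reverse=True)
    for i in order[:leftover]:
        scores[i] += 1
    return [{**c, "max_score": s} for c, s in zip(criteria, scores)]


# Invented example: weights left over after one criterion was removed.
remaining = [
    {"name": "uses-prescribed-flag", "max_score": 40},
    {"name": "follows-sequence", "max_score": 30},
    {"name": "format-literal", "max_score": 10},
]
reweighted = reweight_to_100(remaining)
print([c["max_score"] for c in reweighted])
```

Proportional scaling preserves the relative importance of the surviving criteria; adjust the weights by hand instead if the rewrite changes what matters most.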
Proceed immediately to Step 6.
Re-run the suite (tessl eval run .) and re-fetch per-scenario lift via Step 2's mechanism. Verify three things: retired scenarios are gone from the run, fixed scenarios now show meaningful lift, and the lift distribution is denser than before the curation pass.
If any fixed scenario still produces near-zero lift, return to Step 4 with that scenario alone — the diagnosis was wrong or the fix didn't take. Otherwise finish here when the distribution is stable and every kept scenario contributes signal.
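The first two checks can be sketched against the post-curation lift rows. The function name and the `min_lift` threshold are hypothetical; what counts as meaningful lift is defined by the lift bands, not by this sketch. The density comparison is left out because it depends on how the bands define a healthy distribution.

```python
def verify_curation(after_lifts, retired_ids, fixed_ids, min_lift=10.0):
    """Return a list of problems found in the post-curation run.

    min_lift is a HYPOTHETICAL stand-in for the "meaningful lift" band edge.
    """
    after = {row["scenario_id"]: row["lift"] for row in after_lifts}
    problems = []
    for sid in retired_ids:
        if sid in after:
            problems.append(f"{sid}: retired but still present in the run")
    for sid in fixed_ids:
        if sid not in after:
            problems.append(f"{sid}: fixed scenario missing from the run")
        elif after[sid] < min_lift:
            problems.append(
                f"{sid}: lift {after[sid]} still below {min_lift}; return to Step 4")
    return problems


# Invented post-curation rows (IDs and lifts are made up).
after_lifts = [
    {"scenario_id": "scenario-a", "lift": 35.0},
    {"scenario_id": "scenario-b", "lift": 12.0},
    {"scenario_id": "scenario-d", "lift": 1.0},
]
problems = verify_curation(after_lifts,
                           retired_ids=["scenario-c"],
                           fixed_ids=["scenario-b", "scenario-d"])
for problem in problems:
    print(problem)
```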