General-purpose coding policy for Baruch's AI agents
Prune an existing eval suite down to the scenarios that actually pull weight. Process steps in order. Do not skip ahead.
The diagnostic vocabulary (lift bands, three-cause diagnosis) lives in:
skills/eval-authoring/LIFT_ANALYSIS.md
Read it before Step 3. The obligation to prune is set by rules/plugin-evals.md "Lift, Not Attainment".
tessl eval run .
Wait until the run completes. Record the run ID; Step 2 needs it.
Proceed immediately to Step 2.
tessl eval view --json <run-id> | python3 .tessl/tiles/jbaruch/coding-policy/skills/eval-curation/compute-lift.py
The script reads a tessl eval view --json payload from stdin (or a path argument), pairs each scenario's with-context/usage-spec variant against its baseline/without-context variant, sums the assessmentResults scores per side, and emits the per-scenario lift trio as JSON:
{
  "lifts": [{ "scenario_id": "<uuid>", "lift": <float>, "with_context_total": <float>, "baseline_total": <float> }],
  "skipped": [{ "scenario_id": "<uuid>", "reason": "<diagnostic>" }]
}
Scenarios missing a paired variant land in skipped with a diagnostic; the script does not silently drop them. The deterministic JSON walk + arithmetic live in the script per rules/script-delegation.md so this step is reproducible and testable independently of any agent.
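The pairing and summing described above can be sketched roughly as follows. This is an illustration, not the contents of compute-lift.py. The input field names `results`, `scenario_id`, `variant`, and `score` are assumptions about the payload shape, and defining lift as the difference between the two totals is also an assumption. Only `assessmentResults` and the output keys come from the text above.

```python
from collections import defaultdict

# ASSUMPTION: variant labels and payload field names are hypothetical;
# the real `tessl eval view --json` schema may differ.
WITH_CONTEXT = {"with-context", "usage-spec"}
BASELINE = {"baseline", "without-context"}


def compute_lifts(payload):
    """Pair each scenario's two variants and emit lift plus both totals."""
    totals = defaultdict(dict)  # scenario_id -> {"with": total, "base": total}
    for result in payload.get("results", []):
        variant = result.get("variant")
        if variant in WITH_CONTEXT:
            side = "with"
        elif variant in BASELINE:
            side = "base"
        else:
            continue  # unknown variant label: not part of either side
        # Sum the assessmentResults scores for this side of the scenario.
        score = sum(a.get("score", 0.0) for a in result.get("assessmentResults", []))
        totals[result["scenario_id"]][side] = score

    lifts, skipped = [], []
    for sid, sides in sorted(totals.items()):
        if "with" in sides and "base" in sides:
            lifts.append({
                "scenario_id": sid,
                # ASSUMPTION: lift is the difference of the two totals.
                "lift": sides["with"] - sides["base"],
                "with_context_total": sides["with"],
                "baseline_total": sides["base"],
            })
        else:
            missing = "baseline" if "with" in sides else "with-context"
            skipped.append({"scenario_id": sid, "reason": f"missing {missing} variant"})
    return {"lifts": lifts, "skipped": skipped}
```

An unpaired scenario ends up in `skipped` with a reason rather than disappearing, which mirrors the behavior the step requires.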
Proceed immediately to Step 3.
Bucket each scenario into the lift bands defined in:
skills/eval-authoring/LIFT_ANALYSIS.md
Healthy positive-case bands and negative cases whose near-zero lift is acceptable per that file's Negative Cases section stay as-is.
If no scenarios sit in the actionable bands (no weak / no-lift positive cases, no tile-specific negative case that fails the lift expectation), the suite is clean. Produce a one-line curation-summary.md stating "no curation needed" and finish here — do not fabricate diagnoses for scenarios that don't need them.
Otherwise proceed immediately to Step 4 with: the weak / no-lift positive cases (each routed to the three-cause diagnosis), AND any tile-specific negative case whose lift fell below the acceptable band (also routed to the three-cause diagnosis — the rule's framework applies because the underspecification lives in the same three places).
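A minimal sketch of the bucketing for positive cases, assuming the Step 2 output shape. The thresholds and the band names `healthy`, `weak`, and `no-lift` used as labels here are placeholders: the real band boundaries live in LIFT_ANALYSIS.md and are not reproduced in this document. Negative cases are excluded through a hypothetical `negative_ids` set, because judging them depends on that file's Negative Cases section.

```python
# ASSUMPTION: placeholder thresholds, NOT the values from LIFT_ANALYSIS.md.
# Each pair is (inclusive lower bound, band name), in descending order.
HYPOTHETICAL_BANDS = [(20.0, "healthy"), (5.0, "weak")]


def bucket(lift, bands=HYPOTHETICAL_BANDS):
    """Return the first band whose lower bound the lift reaches."""
    for lower_bound, name in bands:
        if lift >= lower_bound:
            return name
    return "no-lift"


def route_to_diagnosis(lifts, negative_ids=frozenset(), bands=HYPOTHETICAL_BANDS):
    """List (scenario_id, band) for positive cases that need Step 4."""
    routed = []
    for entry in lifts:
        if entry["scenario_id"] in negative_ids:
            continue  # negative cases are judged separately, by hand
        band = bucket(entry["lift"], bands)
        if band != "healthy":
            routed.append((entry["scenario_id"], band))
    return routed
```

An empty result from `route_to_diagnosis` corresponds to the "no curation needed" outcome, provided no tile-specific negative case has been flagged separately.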
Apply the three-cause diagnosis from rules/plugin-evals.md "Lift, Not Attainment" to each scenario routed in from Step 3:
skills/eval-authoring/REVIEW_CHECKLIST.md's No Bleeding rules; keep the criterion. Do NOT drop the criterion.
Record the decision per scenario: retire, fix-task, or rewrite-criteria. Proceed immediately to Step 5.
For each retire: git rm -r evals/<scenario-dir> (stages the deletion in the same step so the curation pass can't ship with the disk delete unstaged) and note the removal in the tile's CHANGELOG.md under Unreleased.
For each fix-task: edit task.md per the No Bleeding rules — strip the technique / format / literal that leaked; keep the situation the user actually needs done.
For each rewrite-criteria: edit criteria.json so the checklist grades the specific manner the tile prescribes (flag choices, format literals, sequences, conventions), not universal competence. Re-weight so max_score values still sum to exactly 100; if removing a criterion leaves nothing tile-specific, retire the scenario instead.
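The re-weighting constraint is easy to get wrong by hand, so here is one possible way to rescale the remaining weights to exactly 100. It assumes criteria.json holds a list of objects with an integer `max_score` field; only the `max_score` name comes from the text above, and the list structure and the `name` key in the usage example are hypothetical. The largest-remainder rounding is a choice made for this sketch, not something the policy prescribes.

```python
def reweight(criteria, total=100):
    """Scale max_score values proportionally so they sum to exactly `total`."""
    weights = [c["max_score"] for c in criteria]
    current = sum(weights)
    if current <= 0:
        raise ValueError("no positive weights left; retire the scenario instead")

    exact = [w * total / current for w in weights]
    scaled = [int(x) for x in exact]  # round down first
    leftover = total - sum(scaled)    # points lost to rounding

    # Hand the leftover points to the criteria with the largest fractional parts.
    by_fraction = sorted(range(len(exact)), key=lambda i: exact[i] - scaled[i], reverse=True)
    for i in by_fraction[:leftover]:
        scaled[i] += 1

    return [{**c, "max_score": s} for c, s in zip(criteria, scaled)]
```

For example, after dropping a 20-point criterion from a checklist weighted 40/30/20/10, the remaining 40/30/10 rescale to 50/38/12.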
Proceed immediately to Step 6.
Re-run the suite (tessl eval run .) and re-fetch per-scenario lift via Step 2's mechanism. Verify three things: retired scenarios are gone from the run, fixed scenarios now show meaningful lift, and the lift distribution is denser than before the curation pass.
If any fixed scenario still produces near-zero lift, return to Step 4 with that scenario alone — the diagnosis was wrong or the fix didn't take. Otherwise finish here when the distribution is stable and every kept scenario contributes signal.
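The three checks can be sketched against two Step 2 outputs, one captured before the curation pass and one after. `min_lift` is a hypothetical stand-in for whatever "meaningful lift" means under LIFT_ANALYSIS.md, and reading "denser" as "a larger share of scenarios at or above `min_lift`" is an interpretation made for this sketch.

```python
def share_at_or_above(lifts, min_lift):
    """Fraction of scenarios whose lift reaches min_lift (0.0 for an empty list)."""
    if not lifts:
        return 0.0
    return sum(1 for e in lifts if e["lift"] >= min_lift) / len(lifts)


def verify_rerun(before, after, retired, fixed, min_lift):
    """Return a list of problems; an empty list means all three checks pass."""
    after_lift = {e["scenario_id"]: e["lift"] for e in after["lifts"]}
    after_seen = set(after_lift) | {e["scenario_id"] for e in after.get("skipped", [])}

    problems = []
    for sid in sorted(retired):
        if sid in after_seen:
            problems.append(f"{sid}: retired but still present in the run")
    for sid in sorted(fixed):
        if sid not in after_lift:
            problems.append(f"{sid}: fixed but has no lift in the re-run")
        elif after_lift[sid] < min_lift:
            problems.append(f"{sid}: lift still below {min_lift}")
    # ASSUMPTION: "denser" is read as a strictly larger share of meaningful lifts.
    if share_at_or_above(after["lifts"], min_lift) <= share_at_or_above(before["lifts"], min_lift):
        problems.append("distribution: share of scenarios with meaningful lift did not increase")
    return problems
```

Any fixed scenario named in the returned problems is a candidate for going back to Step 4 on its own.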