# Eval authoring workflow
Generate, review, and iterate on eval scenarios for a Tessl skill. The 10-step workflow: generate (1) and download (2–3) scenarios, audit each (4) for bleeding/leaking, fix (5) or delete (6) unsalvageable ones, fill coverage gaps (7), run evals (8), interpret results via lift analysis (9), iterate until stable (10). Steps are sequential — complete each before moving to the next.
## Step 1: Generate scenarios

```
tessl scenario generate .
```

## Step 2: Poll for completion

```
tessl scenario view <id>
```

Poll until the status is completed. If generation fails, report the error and finish here. When the status is completed, proceed immediately to Step 3.
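A minimal polling sketch, assuming `tessl scenario view <id>` prints a line containing the status. The exact output format is an assumption; adjust the string checks to the real CLI output:

```python
import subprocess
import time

def wait_for_completion(scenario_id: str, interval_s: int = 10) -> None:
    """Poll `tessl scenario view` until the scenario reports completion."""
    while True:
        out = subprocess.run(
            ["tessl", "scenario", "view", scenario_id],
            capture_output=True, text=True, check=True,
        ).stdout
        if "completed" in out:  # assumed status wording; check the real output
            return
        if "failed" in out:     # assumed status wording
            raise RuntimeError(f"generation failed:\n{out}")
        time.sleep(interval_s)
```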
## Step 3: Download scenarios

```
tessl scenario download --output evals <id>
```

## Step 4: Audit each scenario

For every scenario in evals/, read task.md and criteria.json. Check each against skills/eval-authoring/REVIEW_CHECKLIST.md:

- Does the task describe a situation without prescribing the technique?
- Do the criteria grade the specific manner the tile prescribes (good) rather than restating literals from the task (bleeding)?
- Do any criteria leak tile-internal details?
- Are criteria values public surfaces or tile-prescribed conventions (allowed, since they measure tile value), or tile internals (leaking)?
- Any other quality or consistency issues?
If no issues are found in a scenario, proceed silently to the next one. Once every scenario has been audited, proceed immediately to Step 5.
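To make the bleeding distinction concrete, here is a hypothetical pair of criteria for a task like "add retry handling to the HTTP client". The names and wording are illustrative, not taken from any real scenario:

```jsonc
// Bleeding: restates a literal from the task, so it passes with or without the tile.
{ "name": "adds-retries", "max_score": 50,
  "description": "The HTTP client retries failed requests." }

// Tile value: grades the specific manner the tile prescribes.
{ "name": "backoff-with-jitter", "max_score": 50,
  "description": "Retries use exponential backoff with jitter, per the tile's convention. Fails if retries are immediate or fixed-interval." }
```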
## Step 5: Fix salvageable scenarios

Edit criteria.json and task.md to remove bleeding, remove leaking, improve failure messages, and align criteria with the task. See skills/eval-authoring/REVIEW_CHECKLIST.md for definitions.
When a criterion is misaligned, leaking, or otherwise unsalvageable, remove it and reweight the remaining criteria so the checklist still sums to 100; do NOT keep a bad criterion just to preserve the existing weights. Bumping a misaligned criterion's max_score to keep the math tidy makes the scenario worse, not better. If a fix removes the only criterion that gave the scenario tile-specific signal, the scenario itself is unsalvageable: delete it per Step 6. One mechanical way to redistribute the weights is sketched below.
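A minimal reweighting sketch using proportional rescaling with largest-remainder rounding. The proportional policy is an assumption; the document only requires that the weights end up summing to exactly 100:

```python
def reweight(scores: list[int], total: int = 100) -> list[int]:
    """Rescale the remaining max_score values proportionally to sum to `total`.

    Largest-remainder rounding keeps every weight an integer.
    """
    current = sum(scores)
    raw = [s * total / current for s in scores]
    floored = [int(r) for r in raw]
    shortfall = total - sum(floored)
    # Hand the leftover points to the entries with the largest remainders.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - floored[i], reverse=True)
    for i in order[:shortfall]:
        floored[i] += 1
    return floored

# e.g. after deleting a 20-point criterion from [40, 25, 20, 15]:
print(reweight([40, 25, 15]))  # -> [50, 31, 19]
```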
## Step 6: Delete unsalvageable scenarios

Remove scenario directories that can't be fixed: the task tests an internal detail, the task is too vague, or fixing the bleeding would mean rewriting the entire task.
## Step 7: Fill coverage gaps with new scenarios

Write new scenarios directly rather than re-generating; you have full plugin context, and the cloud generator doesn't. Each scenario is a directory under evals/<name>/ with two files, task.md and criteria.json, laid out as shown below.
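A hypothetical layout; the scenario name `handles-rate-limit` is illustrative:

```
evals/
└── handles-rate-limit/
    ├── task.md        # the situation, with no technique prescribed
    └── criteria.json  # weighted checklist grading tile-prescribed behaviour
```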
criteria.json MUST use the weighted-checklist wrapper; the scorer rejects bare arrays:

```json
{
  "context": "<one-paragraph rationale: what this scenario tests and why it measures tile value (not baseline reasoning)>",
  "type": "weighted_checklist",
  "checklist": [
    { "name": "<short-criterion-name>", "max_score": <int>, "description": "<what passes and what failure looks like>" }
  ]
}
```

Weights in checklist[].max_score MUST sum to exactly 100. Do not distribute them evenly; weight most heavily the criteria that most specifically grade tile-prescribed behaviour. Look at a sibling evals/*/criteria.json in this tile to anchor on the exact shape, and ignore any pre-existing plain-array files in the test repo under evaluation; those are seed data, not the format to emit.
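For orientation, a filled-in example under the same schema, continuing the hypothetical retry scenario from Step 4. The context, criterion names, and weights are illustrative, not from any real tile:

```json
{
  "context": "Tests whether the tile's retry conventions show up in generated HTTP client code; a baseline model adds retries, but not the tile's backoff policy.",
  "type": "weighted_checklist",
  "checklist": [
    { "name": "backoff-with-jitter", "max_score": 50, "description": "Retries use exponential backoff with jitter. Fails if retries are immediate or fixed-interval." },
    { "name": "retry-budget", "max_score": 30, "description": "A bounded retry budget is enforced. Fails if retries are unbounded." },
    { "name": "idempotent-only", "max_score": 20, "description": "Only idempotent requests are retried. Fails if POSTs are retried blindly." }
  ]
}
```

Note the weights sum to 100 and are skewed toward the criterion carrying the most tile-specific signal.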
After writing each new scenario, run the Step 4 review against it and apply Step 5 fixes before moving on. New scenarios need the same no-bleeding / no-leaking audit as generated ones; the failure mode on this step is skipping review of content you authored yourself.
## Step 8: Run the evals

```
tessl eval run .
```

If any scenario fails to run, diagnose and fix it before proceeding.
## Step 9: Interpret results via lift analysis

For each scenario, compute lift = with_context_score - baseline_score. Lift is the number that matters; aggregate attainment alone is a vanity metric.
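A minimal sketch of the lift computation. The record shape and field names are assumptions; read the actual scores from wherever `tessl eval run` writes its results:

```python
# Hypothetical result shape: one record per scenario with both run scores.
results = [
    {"scenario": "backoff-with-jitter", "baseline_score": 42, "with_context_score": 88},
    {"scenario": "retry-budget",        "baseline_score": 70, "with_context_score": 74},
]

for r in results:
    lift = r["with_context_score"] - r["baseline_score"]
    print(f"{r['scenario']}: lift={lift:+d}")
# backoff-with-jitter: lift=+46  <- strong tile-specific signal
# retry-budget: lift=+4          <- weak lift: run the three-cause diagnosis
```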
Classify each scenario by its lift band, run the three-cause diagnosis on weak and no-lift positive cases, decide whether negative-case lift is healthy or null, and triage each failing criterion into a skill, task, or criteria fix, per the reference at skills/eval-authoring/LIFT_ANALYSIS.md. Bring the results back here, then proceed immediately to Step 10.
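For illustration only, a toy band classifier. The real band boundaries live in LIFT_ANALYSIS.md; the thresholds below are placeholders, not the reference values:

```python
def lift_band(lift: int, positive_case: bool) -> str:
    """Toy classifier. Thresholds are hypothetical; use LIFT_ANALYSIS.md's."""
    if not positive_case:
        # Negative cases need a judgment call, not a threshold.
        return "negative case: decide healthy vs null"
    if lift >= 20:
        return "meaningful lift"
    if lift > 0:
        return "weak lift: run the three-cause diagnosis"
    return "no lift: run the three-cause diagnosis"
```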
## Step 10: Iterate until stable

Fix the identified issues, including retiring null-test scenarios, then re-run from Step 8. Repeat until every positive-case scenario shows meaningful lift and every criterion grades behaviour the tile actually contributes. Finish here when the lift distribution is stable and no null tests remain.