name:: eval-authoring
description:: Generate, review, and curate eval scenarios for Tessl skills. Handles scenario generation, bleeding/leaking detection, criteria quality checks, lift-gated scenario admission, and score-driven iteration. Use when creating test cases or test scenarios for a skill, evaluating or assessing skill quality, running evals or evaluations, reviewing existing evals, expanding eval coverage, or skill testing.

Eval Authoring Skill

Name: jbaruch/coding-policy
Rating: 90.13 (1 reviews)
Author: jbaruch

Process steps in order. Do not skip ahead.

Generate, review, and iterate on eval scenarios for a Tessl skill. The 10-step workflow: generate scenarios (1), wait for and download them (2–3), audit each one for bleeding and leaking (4), fix what can be fixed (5), delete what cannot (6), add lift-bearing scenarios (7), run evals (8), interpret results via lift analysis (9), and iterate until stable (10).

Step 1 — Generate Scenarios

tessl scenario generate .

Step 2 — Wait for Generation

tessl scenario view <id>

Poll until completed. If it fails, report the error and finish here. When status is completed, proceed immediately to Step 3.
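The polling loop in Step 2 could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not part of the Tessl tooling: get_status is a hypothetical callable you would write yourself to return the status shown by tessl scenario view <id> (this document does not specify that command's output format), and "failed" as the failure status is an assumption. Only "completed" comes from the text above.

```python
import time

# Assumption: the name of the failure status is not given in this document.
FAILURE_STATUSES = {"failed"}


def wait_for_generation(get_status, interval=5.0, timeout=600.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns "completed".

    get_status is a hypothetical zero-argument callable returning the
    current status string. Raises RuntimeError on a failure status and
    TimeoutError if the deadline passes first.
    """
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status == "completed":
            return status  # caller proceeds to Step 3
        if status in FAILURE_STATUSES:
            # Step 2: report the error and finish here.
            raise RuntimeError(f"scenario generation ended with status {status!r}")
        if clock() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout} seconds")
        sleep(interval)
```

Passing sleep and clock as parameters keeps the loop easy to exercise without real waiting.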

Step 3 — Download Scenarios

tessl scenario download --output evals <id>

Step 4 — Review Each Scenario

For every scenario in evals/, read task.md and criteria.json. Check against skills/eval-authoring/REVIEW_CHECKLIST.md: does the task describe a situation without prescribing the technique? Do the criteria grade the specific manner the plugin prescribes (good) rather than restating literals from the task (bleeding)? Any plugin-internal leaks in the criteria? Are criteria values public surfaces, plugin-prescribed conventions (allowed — they measure plugin value), or plugin internals (leaking)? Any quality or consistency issues?

If a scenario has no issues, move on to the next one without comment. Once every scenario has been reviewed, proceed immediately to Step 5.

Step 5 — Fix Issues

Edit criteria.json and task.md to remove bleeding, remove leaking, improve failure messages, and align criteria with task. See skills/eval-authoring/REVIEW_CHECKLIST.md for definitions.

When a criterion is misaligned, leaking, or unsalvageable, remove it and reweight the remaining criteria so the checklist sums to 100 — do not keep a bad criterion to preserve existing weights. If removing the bad criterion leaves no plugin-specific signal, the scenario itself is unsalvageable — delete it per Step 6.
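One way to do the reweighting arithmetic is sketched below, assuming a simple proportional rescale. The function name reweight is hypothetical, and proportional scaling is only one option: it preserves the relative emphasis of the surviving criteria, so treat the result as a starting point and adjust by hand if the remaining criteria deserve a different balance.

```python
def reweight(checklist, drop_names, total=100):
    """Drop the named criteria and rescale the rest to sum to `total`.

    Uses largest-remainder rounding so every max_score stays an integer
    and the sum is exact. Returns new dicts; the input is not mutated.
    """
    kept = [dict(c) for c in checklist if c["name"] not in drop_names]
    if not kept:
        raise ValueError("no criteria left: delete the scenario per Step 6")
    old_total = sum(c["max_score"] for c in kept)
    if old_total <= 0:
        raise ValueError("remaining criteria carry no weight")

    exact = [c["max_score"] * total / old_total for c in kept]
    weights = [int(x) for x in exact]  # floor, since all values are >= 0
    leftover = total - sum(weights)

    # Give the leftover points to the largest fractional parts first.
    order = sorted(range(len(kept)),
                   key=lambda i: exact[i] - weights[i], reverse=True)
    for i in order[:leftover]:
        weights[i] += 1

    for criterion, weight in zip(kept, weights):
        criterion["max_score"] = weight
    return kept
```

For example, removing a 20-point criterion from a 60/20/20 checklist leaves 60/20, which rescales to 75/25.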

Step 6 — Delete Unsalvageable Scenarios

Remove scenario directories that can't be fixed: task tests an internal detail, task is too vague, or fixing bleeding would rewrite the entire task.

Step 7 — Add Lift-Bearing Scenarios

Read the Coverage section of rules/plugin-evals.md for the admission criteria before adding anything. Procedurally: the Step 1 batch is a starting point, not a coverage target. Do NOT enumerate a skill's behaviors and write a scenario for each. Add a scenario only where the plugin prescribes a decision that a baseline agent would plausibly handle differently; Steps 8–9 confirm its lift, and a scenario with flat lift is dropped (Step 6 / Step 10), not "improved."

Write new scenarios directly rather than re-generating — you have full plugin context, the cloud generator doesn't. Each scenario is a directory in evals/<name>/ with two files: task.md and criteria.json.

criteria.json MUST use the weighted-checklist wrapper — the scorer rejects bare arrays:

{
  "context": "<one-paragraph rationale: what this scenario tests and why it measures plugin value (not baseline reasoning)>",
  "type": "weighted_checklist",
  "checklist": [
    { "name": "<short-criterion-name>", "max_score": <int>, "description": "<what passes and what failure looks like>" }
  ]
}

Weights in checklist[].max_score MUST sum to exactly 100. Do not distribute evenly — weight the criteria that most specifically grade plugin-prescribed behaviour. Look at a sibling evals/*/criteria.json in this plugin to anchor on the exact shape; ignore any pre-existing plain-array files in the test repo under evaluation — those are seed data, not the format to emit.
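A quick local check before running evals might look like the sketch below. It mirrors only the rules stated in this step (object wrapper, type field, integer weights summing to 100); the function name validate_criteria is hypothetical, and the real scorer may enforce more or different rules than these.

```python
def validate_criteria(doc):
    """Return a list of problems found in a parsed criteria.json.

    An empty list means the document matches the weighted-checklist
    shape described above. This is a local sanity check, not the scorer.
    """
    if not isinstance(doc, dict):
        return ["top level must be an object, not a bare array"]

    problems = []
    if doc.get("type") != "weighted_checklist":
        problems.append('"type" must be "weighted_checklist"')
    if not isinstance(doc.get("context"), str) or not doc["context"].strip():
        problems.append('"context" must be a non-empty string')

    checklist = doc.get("checklist")
    if not isinstance(checklist, list) or not checklist:
        problems.append('"checklist" must be a non-empty array')
        return problems

    total = 0
    for i, item in enumerate(checklist):
        if not isinstance(item, dict):
            problems.append(f"checklist[{i}] must be an object")
            continue
        for key in ("name", "description"):
            if not isinstance(item.get(key), str) or not item[key].strip():
                problems.append(f'checklist[{i}]."{key}" must be a non-empty string')
        score = item.get("max_score")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(score, bool) or not isinstance(score, int):
            problems.append(f'checklist[{i}]."max_score" must be an integer')
        else:
            total += score

    if total != 100:
        problems.append(f"max_score values sum to {total}, expected 100")
    return problems
```

To use it on a file, load the file with json.load and pass the result in; an empty return value means the wrapper and weights look right.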

After writing each new scenario, run the Step 4 review against it and apply Step 5 fixes before moving on — new scenarios need the same no-bleeding / no-leaking audit as generated ones, and the failure mode on this step is to skip review on content you authored yourself.

Step 8 — Run Evals

tessl eval run .

If any scenario fails to run, diagnose and fix before proceeding.

Step 9 — Analyze Results (Lift, Not Attainment)

For each scenario, compute lift = with_context_score - baseline_score. Lift is the number that matters — aggregate attainment alone is a vanity metric.
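The lift formula above can be sketched as a small helper. The function name compute_lift and the input shape (a mapping from scenario name to a baseline and with-context score pair) are assumptions made for illustration; how you extract those two scores from the eval output is not covered here.

```python
def compute_lift(results):
    """Compute lift = with_context_score - baseline_score per scenario.

    `results` maps scenario name -> (baseline_score, with_context_score).
    Returns (scenario, lift) pairs sorted from lowest lift to highest,
    so the weakest scenarios surface first.
    """
    lifts = {
        name: with_context - baseline
        for name, (baseline, with_context) in results.items()
    }
    return sorted(lifts.items(), key=lambda pair: pair[1])
```

With invented scores, a scenario scoring 90 at baseline and 92 with context has a lift of 2, while one scoring 30 and 85 has a lift of 55. The first has the higher attainment but shows almost no plugin contribution, which is why attainment alone misleads.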

Classify each scenario by its lift band, run the three-cause diagnosis on weak / no-lift positive cases, decide whether negative-case lift is healthy or null, and triage failing criteria into a skill / task / criteria fix per the reference at:

skills/eval-authoring/LIFT_ANALYSIS.md

Bring the results back here, then proceed immediately to Step 10.

Step 10 — Iterate

Fix the identified issues — including retiring null-test scenarios — then re-run from Step 8. Repeat until every positive-case scenario shows meaningful lift and every criterion grades behaviour the plugin actually contributes. Finish here when the lift distribution is stable and no null tests remain.
