alwaysApply:: No
applyTo:: evals/**, skills/**/SKILL.md — when authoring or maintaining eval scenarios
description:: Eval coverage, lift-not-attainment scoring, no bleeding, no leaking, fixture hygiene, closed-loop carve-out

Plugin Evals

Name: jbaruch/coding-policy
Rating: 90.14 (1 reviews)
Author: jbaruch

Coverage

Every skill with decisional logic ships eval cases, subject only to the closed-loop carve-out below
Include positive cases that show correct behavior, and negative cases in which the skill refuses bad input or stays silent when nothing is actionable
Write negative cases by hand using existing scenarios as a structural template (tessl scenario generate skews toward happy-path)

Closed-Loop Carve-Out

Narrow exception for closed-loop automated systems whose eval results are consumed by no one: no human review and no automated gating
A qualifying tile is exempt from the Coverage clause above and the entire Persistence section
Preconditions (all required):
1. No human review — no human reads eval output in any form: scores, lift deltas, scenario diffs, regression alerts, failure traces, dashboards, periodic reports
2. No gating use — eval results do not gate any downstream automated action: release blocks, deploy blocks, publish-tile gates, rollback triggers, alert routing, dashboard surfaces, paging, summary stats consumed by another workflow
3. Affirmative owner declaration — tile's CHANGELOG records the exception under a ### Rules entry naming this rule + date
"We plan to look at results" does NOT qualify
"We have a gate that fails on regressions but nobody checks the failures" does NOT qualify (precondition 2)
Re-introducing any consumption later (human review or automated gating) requires re-introducing evals first under the standard requirement
Every other tile follows the rule in full
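
The three preconditions combine with a strict AND. A minimal sketch of that check, assuming a hypothetical `TileEvalUsage` record that the tile owner fills in by hand (neither the class nor its field names come from tessl):

```python
from dataclasses import dataclass


@dataclass
class TileEvalUsage:
    """Hypothetical self-assessment record; not a tessl data structure."""
    human_reads_output: bool            # precondition 1 is violated when True
    results_gate_automation: bool       # precondition 2 is violated when True
    changelog_declares_exception: bool  # precondition 3 is met when True


def qualifies_for_carve_out(usage: TileEvalUsage) -> bool:
    # All three preconditions are required; failing any one voids the exception.
    return (
        not usage.human_reads_output
        and not usage.results_gate_automation
        and usage.changelog_declares_exception
    )


# "A gate fails on regressions but nobody checks the failures" still gates.
unwatched_gate = TileEvalUsage(
    human_reads_output=False,
    results_gate_automation=True,
    changelog_declares_exception=True,
)
print(qualifies_for_carve_out(unwatched_gate))
```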

Task and Criteria: the load-bearing shape

Task describes the SITUATION — what the user needs done. It does NOT prescribe the technique, format, sequence, or specific manner of solving it. "Ship a hotfix" is a task; "Ship a hotfix using a feature branch named fix/*" is a task with the answer smuggled in
Criteria grade whether the output matches the specific manner this tile prescribes. Checking for tile-prescribed specifics (flag choices, format literals, sequences, conventions) is measuring tile value, not testing reading

No Bleeding

Primary form: a criterion value appears verbatim in the task description. Grep each criterion's expected literal against the task text — if found there, the criterion tests reading, not application
Fix bleeding at the task, not the criterion. Strip the leaked technique/format/literal from the task; keep the criterion. If stripping makes the task unsolvable for a baseline, the scenario is too narrow — reframe it
Second form: fixtures matching examples inside the skill prompt. Keep fixtures in a namespace separate from skill examples
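
The primary-form check can be automated as a plain substring search. The sketch below is illustrative only: `find_bleeding` is a hypothetical helper, and it assumes each criterion's expected literals have already been extracted into a list of strings by hand.

```python
def find_bleeding(task_text: str, criterion_literals: list[str]) -> list[str]:
    """Return the criterion literals that appear verbatim in the task text."""
    # Empty literals are skipped because "" is a substring of every string.
    return [lit for lit in criterion_literals if lit and lit in task_text]


# The task example from this rule, with the answer smuggled in.
task = "Ship a hotfix using a feature branch named fix/*"
literals = ["fix/*", "--ff-only"]

leaked = find_bleeding(task, literals)
print(leaked)
```

Any literal this returns points at a task to fix, not a criterion to drop.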

No Leaking

Use sanitized or synthetic fixtures — never live user data. Real emails, calendar events, production PRs, or internal logs must never appear in an eval fixture; use stable synthetic IDs and scrubbed examples
Criteria must not reference tile-internal implementation details that mean nothing outside the tile — internal skill action names, .tessl/plugins/... paths, tile-only identifiers
Criteria may reference public tool/API surfaces that exist independent of the tile — gh pr create, REST endpoints, conventional-commits format, semver
Criteria may reference tile-prescribed conventions and specific values such as reply templates like Fixed in <sha>, chosen flags like --ff-only, invented format literals. Checking for them measures application, not leaking
Test: would someone outside the tile recognize the term? gh pr merge is public; an internal action such as createJwtToken is tile-internal

Lift, Not Attainment

Lift = with-context score minus baseline score, where with-context means the tile is loaded and baseline means it is not. Near-zero lift on a positive case has one of three causes:
1. Coincidence with universal competence — tile's prescribed manner matches what baseline agents already produce. Retire the scenario
2. Task leaked the technique — fix the task per No Bleeding; keep the criterion
3. Criteria grade universal competence — criteria test things baseline always does (basic git safety, obvious judgement) instead of tile-specific choices. Rewrite the criteria or retire the scenario
Always report per-scenario lift alongside the average
High-lift scenarios test tile-prescribed choices where baseline picks something different (bot-ID discovery, reply format, CLI sequence). Keep them — do not rewrite toward "testing reasoning" if baseline already reasons to the same outcome
Pruning is mandatory upkeep, not optional cleanup
Run the curation pass (see skills/eval-curation/SKILL.md) every few publishes
Retire any scenario showing near-zero lift after the three-cause diagnosis and fix attempt
Measure by per-scenario lift contribution, not raw scenario count — a 10-scenario suite where every scenario pulls weight beats a 35-scenario suite where half score baseline-equivalent
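
A minimal sketch of the per-scenario report, under stated assumptions: scores are integers on a 0-100 scale, the scenario names are made up, and the 5-point `NEAR_ZERO` threshold is a hypothetical choice, since this rule does not define what counts as near-zero.

```python
from statistics import mean

NEAR_ZERO = 5  # hypothetical threshold in score points; tune per tile


def lift_report(scores: dict[str, tuple[int, int]]) -> dict[str, int]:
    """Map each scenario to its lift: with-context score minus baseline score."""
    return {name: with_ctx - base for name, (with_ctx, base) in scores.items()}


# Hypothetical scores as (with_context, baseline).
scores = {
    "reply-format-template": (90, 30),
    "basic-git-safety": (95, 95),
}

lifts = lift_report(scores)
average = mean(lifts.values())
# Candidates for the three-cause diagnosis, not for automatic retirement.
needs_diagnosis = [name for name, lift in lifts.items() if abs(lift) < NEAR_ZERO]

for name, lift in lifts.items():
    print(f"{name}: lift {lift}")
print(f"average lift: {average}")
print(f"needs diagnosis: {needs_diagnosis}")
```

The average alone (30 here) hides that one scenario contributes nothing, which is why the per-scenario numbers are reported alongside it.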

Quality

Failure messages must explain what went wrong, not just "mismatch"
Criteria must be specific and weighted sensibly — vague criteria produce vague results
Criteria must align with what the task actually asks for

Persistence

Tessl's publish pipeline runs the eval suite automatically when tessl tile publish (or tesslio/patch-version-publish) executes — that is the persistence point
Do not add a tessl eval run step to tile-repo CI; do not add a scheduled or recurring workflow that re-runs the eval suite
Out of scope: local invocations during authoring/debugging, and ad-hoc invocations by separate measurement rigs (e.g., jbaruch/coding-policy-evals)
Regressions block the publish
Fix the regression, or fix the scenario when the cause is fixture drift per Fixture Hygiene
Do not add a parallel CI step that could mask the publish-layer failure

Naming

Scenario directory names use kebab-case — lowercase + hyphens only (my-scenario, not MyScenario, my_scenario, or my scenario)
Skill-specific scenarios: prefix with the skill name — <skill>-<descriptor> (e.g., install-reviewer-refuses-overwrite, eval-curation-task-leak-fix)
Cross-cutting scenarios: name the behavior directly without a skill prefix (e.g., pr-merge-and-post-merge-cleanup)
Descriptors name the behavior under test, not the implementation: refuses-overwrite ✓, checks-existing-file-via-stat ✗
Directory names are capped at 40 characters by default, because tessl's interactive tooling (tessl eval view, tessl scenario generate) silently truncates longer names
Cap applies prospectively; scenarios already run through tessl eval run that exceed it are grandfathered by the rename-stability clause (which wins over the cap)
Once committed and run through tessl eval run, the name is stable — do not rename. tessl eval view identifies scenarios by directory name; renaming resets the lift history the Persistence section relies on
Rig conventions (programmatic shapes like <rule>-<fixture_type>-<cell>-run-<n>) may diverge from kebab-case-with-descriptor when documented in the tile's evals/instructions.json
Narrow exception for rig-shaped scenarios that exceed the 40-char default cap
Preconditions (all required):
1. Rig's programmatic shape provably cannot fit 40 chars without losing reconstructability of its components (e.g., 4-tuple rule/fixture_type/cell/run needed by the rig's scorer for lift bucketing)
2. evals/instructions.json declares the rig's actual safe length AND names every tessl-eval tool the rig touches end-to-end
3. Rig bypasses the interactive tools that drive the default cap — scenario names built by custom script, runs invoked via tessl eval run <path>, scoring via custom scorer
4. Rig owns naming-collision and truncation-driven scoring drift in its own scorer, not in tessl's tooling
5. Rig's CHANGELOG records the carve-out under a ### Rules (or equivalent) entry citing this clause
Rename-stability applies regardless of cap — tessl eval view identifies scenarios by directory name irrespective of how the name was generated
Every scenario outside an active rig carve-out still respects the 40-char default
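
The kebab-case and length rules can be checked mechanically before a scenario is first run. This is a sketch, not part of tessl: `check_scenario_name` is a hypothetical helper, and allowing digits in a name is an assumption (the rule says "lowercase + hyphens" but the rig shape above ends in run-<n>). It does not check the skill prefix or whether a descriptor names behavior rather than implementation.

```python
import re

# Lowercase alphanumeric words joined by single hyphens (digits are an assumption).
KEBAB = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")
DEFAULT_CAP = 40


def check_scenario_name(name: str, cap: int = DEFAULT_CAP) -> list[str]:
    """Return a list of problems with a scenario directory name; empty means OK."""
    problems = []
    if not KEBAB.fullmatch(name):
        problems.append("not kebab-case")
    if len(name) > cap:
        problems.append(f"longer than {cap} characters")
    return problems


for candidate in ("install-reviewer-refuses-overwrite", "MyScenario", "my_scenario"):
    print(candidate, check_scenario_name(candidate))
```

A rig operating under the carve-out would pass its own declared safe length as `cap`.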

Fixture Hygiene

Version fixtures with dates in filenames (e.g., fixture-2025-04-17.json)
Update fixtures when the skill's contract changes — stale fixtures produce false passes
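
A dated filename makes fixture age machine-readable. The sketch below assumes the date sits immediately before the extension as -YYYY-MM-DD, as in the example above; `fixture_date` is a hypothetical helper.

```python
import re
from datetime import date
from typing import Optional

# Matches a trailing -YYYY-MM-DD followed by a file extension.
DATED = re.compile(r"-(\d{4})-(\d{2})-(\d{2})\.[A-Za-z0-9]+$")


def fixture_date(filename: str) -> Optional[date]:
    """Return the date encoded in a fixture filename, or None if absent or invalid."""
    match = DATED.search(filename)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # e.g. month 13
        return None


print(fixture_date("fixture-2025-04-17.json"))
print(fixture_date("fixture.json"))
```

Comparing the result against the date of the skill's last contract change is one way to spot fixtures that may be stale.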
