
jbaruch/coding-policy

General-purpose coding policy for Baruch's AI agents

rules/plugin-evals.md

alwaysApply: Yes

Plugin Evals

Coverage

  • Every skill with decisional logic ships eval cases, subject only to the closed-loop carve-out below
  • Include both positive cases (correct behavior) and negative cases (refuse bad input, produce silence when nothing is actionable)
  • tessl scenario generate skews toward happy-path scenarios — write negative cases by hand using existing scenarios as a structural template
  • Narrow exception for closed-loop automated systems with no human eval-result consumption: a tile is exempt from BOTH the "every skill with decisional logic ships eval cases" coverage clause above AND the entire Persistence section (the automatic suite run at tessl tile publish / tesslio/patch-version-publish, plus the "Regressions block the publish" gate), only when ALL three of these preconditions hold (a checklist sketch follows this list):
    1. No human review — no human ever reads eval output for this tile, in any form: attainment scores, lift deltas, scenario-by-scenario diffs, regression alerts, failure traces, dashboards, periodic reports
    2. No gating use — eval results do NOT gate any downstream automated action, including but not limited to release blocks, deploy blocks, publish-tile gates, rollback triggers, alert routing, dashboard surfaces, paging, or summary stats consumed by another workflow. The no-eyeballs assumption is meaningless if a gate consumes the signal — a publish-blocking eval gate is still producing signal, just not via human eyes
    3. Affirmative owner declaration — the tile's CHANGELOG records the exception in writing under a ### Rules (or equivalent) entry naming this rule and the date, AND the owner accepts that re-introducing any consumption of eval results later (whether human review OR automated gating) requires re-introducing evals first under the standard requirement
  • The reasoning for the carve-out is structural: evals are an instrument, not a deliverable. They produce measurements that only become signal when something — a human, a gate, a downstream system — reads them and acts. A tile satisfying all three preconditions is generating measurements that never become signal anywhere: every eval run is pure cost (tessl eval run budget, scenario-authoring effort, fixture maintenance) producing zero decisions, and the suite has no theory of how it would catch a regression (real regression manifests → eval flags it → output goes nowhere → regression ships anyway)
  • Reference example: the jbaruch/nanoclaw-* plugin fleet (nanoclaw-admin, nanoclaw-core, nanoclaw-trusted, nanoclaw-untrusted, nanoclaw-host, nanoclaw-telegram) is a fully-automated agent loop satisfying all three preconditions: the prior evals.yml workflow ran with continue-on-error: true (no gating use), no human reviewed the daily-cadence runs (no human review), and the owner declaration was recorded in the nanoclaw-admin CHANGELOG plus a follow-up coding-policy PR (this carve-out itself, post-merge). A multi-month observation period confirmed the predicted failure mode: the 40-scenario suite was not catching real regressions, several scenarios had been retired for near-zero lift, and recurring runs were silent on the silent-success regressions they nominally watched for
  • The exception is scoped narrowly and affirmatively. "We don't currently look at the results, but we plan to" does NOT qualify — intent without follow-through is bypass-cope dressed as future work, the exact framing boy-scout.md and context-artifacts.md's "Disagreeing With the Reviewer" were authored to close. "We have a publish-tile gate that fails on eval regressions but nobody actually checks the failures" does NOT qualify — precondition 2 is violated by the gate itself, regardless of whether a human reads the failure. Tiles that fail any of the three preconditions — including coding-policy itself, where the maintainer reads scenario lift on every publish, so precondition 1 fails — are NOT exempt; the rule applies in full
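
A minimal sketch of the carve-out as a checklist, for illustration only. The function and its arguments are hypothetical; nothing in Tessl's tooling provides this check, and the authoritative record of an exemption is the CHANGELOG entry, not code.

```python
# Hypothetical helper encoding the three carve-out preconditions as a checklist.
# Illustrative only: the real record of an exemption is the tile's CHANGELOG entry.

def eval_exemption_gaps(no_human_review: bool,
                        no_gating_use: bool,
                        owner_declared_in_changelog: bool) -> list[str]:
    """Return the preconditions that fail; an empty list means the exemption may be claimed."""
    gaps = []
    if not no_human_review:
        gaps.append("precondition 1 fails: a human still reads eval output")
    if not no_gating_use:
        gaps.append("precondition 2 fails: eval results still gate an automated action")
    if not owner_declared_in_changelog:
        gaps.append("precondition 3 fails: no CHANGELOG '### Rules' declaration")
    return gaps

# coding-policy itself: the maintainer reads scenario lift on every publish,
# so precondition 1 fails and the tile is NOT exempt.
assert eval_exemption_gaps(no_human_review=False,
                           no_gating_use=True,
                           owner_declared_in_changelog=True) != []
```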

Task and Criteria: the load-bearing shape

  • Task describes the SITUATION — what the user needs done. It does NOT prescribe the technique, format, sequence, or specific manner of solving it. "Ship a hotfix" is a task; "Ship a hotfix using a feature branch named fix/*" is a task with the answer smuggled in
  • Criteria grade whether the output matches the specific manner this tile prescribes. That conformance IS the tile's contribution — without the tile, agents pick some manner; with the tile, they pick the manner the tile teaches. Checking for tile-prescribed specifics (flag choices, format literals, sequences, conventions) is measuring tile value, not testing reading
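
A minimal sketch of the load-bearing shape, reusing the hotfix example above. The plain-string task and list-of-strings criteria below are an assumed layout for illustration, not the actual Tessl scenario format; the fix/* branch convention and the --ff-only flag stand in for whatever manner the tile prescribes.

```python
# Assumed shapes, not the real Tessl scenario format: the task states the
# situation; the criteria check the manner the tile prescribes.

# Task: situation only, no technique smuggled in.
task = "A critical bug is live in production. Ship a hotfix for it."

# Criteria: the tile-prescribed manner. A baseline agent would pick *some*
# branch name and *some* merge style; the tile prescribes these specific ones.
criteria = [
    "Work happens on a branch named fix/<descriptor>",
    "The branch is merged with --ff-only",
]

# Smuggled version (wrong): the task already names the convention, so the
# first criterion would only test reading of the task, not application of the tile.
leaky_task = "Ship a hotfix using a feature branch named fix/*"
```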

No Bleeding

  • The primary form of bleeding is a criterion value appearing verbatim in the task description. Grep each criterion's expected literal against the task text — if you find it there, the criterion is testing reading of the task, not application of the tile (a grep sketch follows this list)
  • Fix bleeding at the task, not at the criterion. Strip the technique/format/literal from the task; keep the criterion checking for the tile-prescribed answer. Baseline agents should be able to attempt the SITUATION described in the task (they'll just pick some other manner); if stripping the leak makes the task unsolvable even for a baseline, the scenario is too narrow to evaluate the tile and should be reframed
  • A second form of bleeding: fixtures reachable as examples inside the skill prompt. If the skill teaches by showing an example, and the eval scenario uses that same example as a fixture, the agent "passes" by recognizing the example rather than applying the lesson. Keep fixtures in a separate namespace from skill examples
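
A sketch of the grep check from the first bullet above. The per-scenario task.md and criteria.md layout, and the evals/ directory, are assumptions made for illustration; adapt the paths and the literal-extraction heuristic to however this suite actually stores tasks and criteria.

```python
# Sketch: flag criteria whose expected literal already appears in the task text.
# The evals/<scenario>/task.md and criteria.md layout is an assumption, not the
# actual Tessl format; quoted or backticked fragments stand in for expected literals.
from pathlib import Path
import re

def bleeding_criteria(scenario_dir: Path) -> list[str]:
    task_text = (scenario_dir / "task.md").read_text().lower()
    leaks = []
    for line in (scenario_dir / "criteria.md").read_text().splitlines():
        for groups in re.findall(r"`([^`]+)`|\"([^\"]+)\"", line):
            literal = (groups[0] or groups[1]).lower()
            if literal and literal in task_text:
                leaks.append(f"{literal!r} appears verbatim in the task; fix the task, not the criterion")
    return leaks

for scenario in sorted(p for p in Path("evals").iterdir() if p.is_dir()):
    for leak in bleeding_criteria(scenario):
        print(f"{scenario.name}: {leak}")
```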

No Leaking

  • Use sanitized or synthetic fixtures — never live user data. Real emails, calendar events, production PRs, or internal logs must never appear in an eval fixture; use stable synthetic IDs and scrubbed examples
  • Criteria must not reference tile-internal implementation details that mean nothing outside the tile — internal skill action names, .tessl/tiles/... paths, tile-only identifiers
  • Criteria may reference public tool/API surfaces that exist independent of the tile — gh pr create, REST endpoints, conventional-commits format, semver
  • Criteria may reference tile-prescribed conventions and specific values — reply templates (Fixed in <sha>), chosen flags (--ff-only), specific sequences, invented format literals. A competent engineer without the tile would not produce those specific choices; that is precisely why they measure tile value. Checking for them is measuring application, not leaking
  • The distinction between a public surface and a tile-internal is whether someone outside the tile would recognize the term at all. "Uses gh pr merge" is public. "Uses createJwtToken internal action" is tile-internal
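
A rough heuristic for that distinction: scan criteria text for markers that only exist inside the tile. The marker patterns below are illustrative, not exhaustive; the real test remains whether someone outside the tile would recognize the term.

```python
# Sketch: heuristic scan for tile-internal references in criteria text.
# The patterns are illustrative; recognisability outside the tile is the real test.
import re

TILE_INTERNAL_PATTERNS = [
    r"\.tessl/tiles/",        # tile-install paths mean nothing outside the tile
    r"\binternal action\b",   # wording that names a tile-only action
]

def tile_internal_hits(criterion: str) -> list[str]:
    return [p for p in TILE_INTERNAL_PATTERNS if re.search(p, criterion)]

# Public surfaces (gh pr merge, conventional commits, semver) pass untouched.
assert tile_internal_hits("Uses gh pr merge --ff-only") == []
# Tile-internal paths get flagged.
assert tile_internal_hits("Reads .tessl/tiles/coding-policy/rules/plugin-evals.md") != []
```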

Lift, Not Attainment

  • Every scenario's value is measured as lift — the delta between the with-context score (tile loaded) and the baseline score (tile not loaded). A scenario with near-zero lift on a positive case is telling you one of three things:
    1. Coincidence with universal competence: the tile's prescribed manner matches what baseline agents already produce by default (e.g. a rule saying "use imperative mood in commits" when agents already do that). The rule codifies common practice; lift won't show because output is the same. Retire — the rule prose itself documents the prescription, so a perpetually-passing eval scenario adds no documentation value beyond the rule, only pays Tessl run-cost
    2. Task leaked the technique: baseline pattern-matched its way to the criterion because the task mentioned the technique. Fix the task per No Bleeding above — do NOT drop the criterion
    3. Criteria grade universal competence: the criteria test things baseline always does (basic git safety, obvious engineering judgement) rather than tile-specific choices. Rewrite the criteria to test the specific manner the tile prescribes, or retire the scenario
  • Aggregate attainment on its own is a vanity metric. A tile averaging 95% attainment with 82% baseline is contributing 13 points of real value, not 95. Always report per-scenario lift alongside the average (see the lift sketch after this list)
  • High-lift scenarios typically test specific tile-prescribed choices where baseline would pick something different (a specific bot-ID discovery approach, a specific reply format, a specific CLI sequence). These are legitimate and should be kept — do not rewrite them toward "testing reasoning" if baseline already reasons to the same outcome
  • Pruning is mandatory upkeep, not optional cleanup. A suite where a non-trivial fraction of scenarios consistently produce near-zero lift across runs is bloated, and every bloated run pays Tessl credits, runner time, and reviewer cycles for scenarios that produce no signal. Run the curation pass (see skills/eval-curation/SKILL.md) at least every few publishes; for any scenario that still shows near-zero lift after the three-cause diagnosis has been applied and the fix attempted, retire it. The bar is per-scenario lift contribution, not raw scenario count — a 10-scenario suite where every scenario pulls weight is healthier than a 35-scenario suite where half score baseline-equivalent. Concrete failure mode this clause is authored against: an eval suite grown by uncritical tessl scenario generate runs accumulating happy-path scenarios that match baseline competence, never pruned, paying Tessl run-cost on every publish for zero added signal
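
A sketch of the lift arithmetic, with made-up scores for two scenario names taken from the Naming section below. The dict-of-scores shape and the 5-point near-zero threshold are assumptions; real numbers come from the Tessl eval reporting, not from this snippet.

```python
# Sketch: per-scenario lift = with-context score minus baseline score.
# Scores and the near-zero threshold are illustrative assumptions.

with_tile = {"pr-merge-and-post-merge-cleanup": 0.95, "install-reviewer-refuses-overwrite": 0.90}
baseline  = {"pr-merge-and-post-merge-cleanup": 0.93, "install-reviewer-refuses-overwrite": 0.55}

NEAR_ZERO = 0.05  # illustrative threshold for the curation pass

for name, score in with_tile.items():
    lift = score - baseline[name]
    flag = "  <- near-zero lift: run the three-cause diagnosis, then fix or retire" if lift < NEAR_ZERO else ""
    print(f"{name}: attainment {score:.0%}, baseline {baseline[name]:.0%}, lift {lift:+.0%}{flag}")

# Aggregate attainment alone is a vanity metric: 95% attainment over an 82%
# baseline is 13 points of real value, not 95.
print(f"average attainment {sum(with_tile.values()) / len(with_tile):.0%} (meaningless without per-scenario lift)")
```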

Quality

  • Failure messages must explain what went wrong, not just "mismatch" (see the example after this list)
  • Criteria must be specific and weighted sensibly — vague criteria produce vague results
  • Criteria must align with what the task actually asks for
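
A small illustration of the failure-message bullet. The criterion check is hypothetical; the point is the contrast between a bare "mismatch" and a message that names what was expected, what was seen, and why it matters.

```python
# Hypothetical criterion check; the failure message is the point, not the check.
def check_merge_style(merge_command: str) -> tuple[bool, str]:
    if "--ff-only" in merge_command:
        return True, ""
    # Bad:  "mismatch"
    # Good: names the expectation, the observation, and the consequence.
    return False, (
        f"expected a fast-forward merge (gh pr merge --ff-only), got {merge_command!r}; "
        "the tile prescribes --ff-only, so a plain merge means the tile was not applied"
    )
```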

Persistence

  • Tessl's publish pipeline runs the eval suite automatically when tessl tile publish (or tesslio/patch-version-publish) executes — that is the persistence point. Do not add a tessl eval run step to tile-repo CI and do not add a scheduled/recurring workflow that re-runs the suite as a persistence mechanism; the Tessl-publish layer owns persistence execution and any cadence on top is duplicate cost producing the same numbers a maintainer would already see at publish time. Out of scope of this clause: local invocations by authors during scenario authoring or debugging (tessl eval run . per skills/eval-authoring/SKILL.md's authoring loop) and ad-hoc invocations by separate measurement rigs (the 4-way reviewer matrix in jbaruch/coding-policy-evals is the canonical example) — those aren't "persistence" in the sense this rule governs and remain permitted
  • Regressions block the publish — a passing eval that starts failing is a bug, not noise. The gate lives at the Tessl-publish layer; if Tessl-publish fails on eval regression, fix the regression (or the scenario, if it's fixture drift per Fixture Hygiene). Don't add a parallel CI step that could mask the publish-layer failure

Naming

  • Scenario directory names use kebab-case — lowercase + hyphens only (my-scenario, not MyScenario, my_scenario, or my scenario)
  • When a scenario exercises a specific skill, prefix with that skill's name: <skill>-<descriptor> (e.g., install-reviewer-refuses-overwrite, eval-curation-task-leak-fix). When the scenario tests a cross-cutting workflow or a rule that isn't owned by a single skill, name the behavior directly without a skill prefix (e.g., pr-merge-and-post-merge-cleanup)
  • Descriptors name the behavior under test, not the implementation: refuses-overwrite ✓, checks-existing-file-via-stat ✗. The implementation is allowed to change without the scenario being renamed; the behavior is what the eval grades
  • Hard cap at 40 characters total for the directory name. Some tooling surface in the tessl-eval pipeline silently truncates longer names — concrete observed failure: a scenario authored as version-bump-reasoning-and-manifest-update (42 chars) was committed as version-bump-reasoning-and-manifest-upda (40 chars), losing the trailing "te". The 40-char cap is set AT the observed safe threshold, not above it — anything in the 41+ range is in the truncation zone the cap exists to prevent (a name-check sketch follows this list). The cap applies prospectively to newly authored or not-yet-run scenarios; scenarios that have already been run through tessl eval run and exceed the cap are grandfathered by the next bullet's rename-stability clause (which wins over the cap)
  • Once a scenario has been committed and run through tessl eval run, its name is stable — do not rename. tessl eval view and the per-scenario lift history identify scenarios by directory name, so a rename makes the renamed scenario appear as a new scenario with no historical lift data, resetting the lift trend across publishes. That trend IS the regression-detector the Persistence section relies on; renaming silently destroys it
  • Rig conventions (programmatic generation from a fixed shape like <rule>-<fixture_type>-<cell>-run-<n> in jbaruch/coding-policy-evals) may diverge from the above when the divergence is documented in the tile's local evals/instructions.json or equivalent — the shape itself becomes the convention for that tile. The 40-char cap and the rename-stability clause still apply
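
A sketch of a name check covering the kebab-case rule and the 40-character cap. It is illustrative only and deliberately ignorant of rig-convention divergence and of grandfathered, already-run scenarios, both of which override it per the bullets above.

```python
# Sketch: validate newly authored scenario directory names against kebab-case
# and the 40-char cap. Rig conventions and grandfathered names are out of scope.
import re

KEBAB = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")
MAX_LEN = 40  # the observed safe threshold; 41+ is in the truncation zone

def name_problems(name: str) -> list[str]:
    problems = []
    if not KEBAB.match(name):
        problems.append("not kebab-case (lowercase + hyphens only)")
    if len(name) > MAX_LEN:
        problems.append(f"{len(name)} chars exceeds the {MAX_LEN}-char cap (tooling may truncate)")
    return problems

assert name_problems("install-reviewer-refuses-overwrite") == []
assert name_problems("version-bump-reasoning-and-manifest-update")  # 42 chars: truncation zone
```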

Fixture Hygiene

  • Version fixtures with dates in filenames (e.g., fixture-2025-04-17.json); see the check after this list
  • Update fixtures when the skill's contract changes — stale fixtures produce false passes
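
A small check for the dated-filename convention. The evals/**/fixtures location and the JSON extension are assumptions about this suite's layout; adjust the glob to wherever fixtures actually live.

```python
# Sketch: flag fixture files that do not carry a YYYY-MM-DD date in the name.
# The evals/**/fixtures/*.json location is an assumed layout, not a Tessl convention.
from pathlib import Path
import re

DATED = re.compile(r"\d{4}-\d{2}-\d{2}")

for fixture in Path("evals").rglob("fixtures/*.json"):
    if not DATED.search(fixture.name):
        print(f"{fixture}: no date in filename; rename like fixture-2025-04-17.json")
```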
