# General-purpose coding policy for Baruch's AI agents
- `tessl scenario generate` skews toward happy-path scenarios — write negative cases by hand, using existing scenarios as a structural template.

## When a tile may skip evals

Evals normally run when `tessl tile publish` (or `tesslio/patch-version-publish`) executes, and regressions block the publish. A tile is exempt from that standard requirement only when ALL three of these preconditions hold:

1. **No human review** — no human ever reads eval output for this tile, in any form: attainment scores, lift deltas, scenario-by-scenario diffs, regression alerts, failure traces, dashboards, periodic reports.
2. **No gating use** — eval results do NOT gate any downstream automated action, including but not limited to: release blocks, deploy blocks, publish-tile gates, rollback triggers, alert routing, dashboard surfaces, paging, or summary stats consumed by another workflow. The no-eyeballs assumption is meaningless if a gate consumes the signal — a publish-blocking eval gate is still producing signal, just not via human eyes.
3. **Affirmative owner declaration** — the tile's CHANGELOG records the exception in writing under a `### Rules` (or equivalent) entry naming this rule and the date, AND the owner accepts that re-introducing any consumption of eval results later (whether human review OR automated gating) requires re-introducing evals first under the standard requirement.

The reasoning is structural: evals are an instrument, not a deliverable. They produce measurements that only become signal when something — a human, a gate, a downstream system — reads them and acts. A tile satisfying all three preconditions is generating measurements that never become signal anywhere. Every eval run is pure cost (Tessl `tessl eval run` budget, scenario-authoring effort, fixture maintenance) producing zero decisions, and the suite has no theory of how it would catch a regression (real regression manifests → eval flags it → output goes nowhere → regression ships anyway).
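The three-precondition test is conjunctive and can be sketched as a plain predicate. Everything below (the type name, the field names) is an illustrative assumption, not part of any Tessl API:

```python
from dataclasses import dataclass

@dataclass
class TileEvalConfig:
    # Hypothetical fields, one per precondition.
    humans_read_eval_output: bool       # precondition 1: any human consumption at all
    eval_results_gate_anything: bool    # precondition 2: any automated gate on results
    changelog_declares_exception: bool  # precondition 3: written owner declaration

def may_skip_evals(cfg: TileEvalConfig) -> bool:
    """Exempt only when ALL three preconditions hold; failing any one
    means the standard eval requirement applies in full."""
    return (not cfg.humans_read_eval_output
            and not cfg.eval_results_gate_anything
            and cfg.changelog_declares_exception)

# coding-policy itself: the maintainer reads scenario lift on every
# publish, so precondition 1 fails and the tile is NOT exempt.
print(may_skip_evals(TileEvalConfig(True, False, True)))   # False
print(may_skip_evals(TileEvalConfig(False, False, True)))  # True
```

Note that precondition 3 is the only affirmative one: silence in the CHANGELOG fails the check even when nothing consumes the results.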
Reference example: the jbaruch/nanoclaw-* plugin fleet (nanoclaw-admin, nanoclaw-core, nanoclaw-trusted, nanoclaw-untrusted, nanoclaw-host, nanoclaw-telegram) — a fully-automated agent loop satisfying all three preconditions. The prior evals.yml workflow ran with `continue-on-error: true` (no gating use), no human reviewed the daily-cadence runs (no human review), and the owner declaration was recorded in the nanoclaw-admin CHANGELOG plus a follow-up coding-policy PR (this carve-out itself, post-merge). A multi-month observation period confirmed the predicted failure mode: the 40-scenario suite was not catching real regressions, several scenarios had been retired for near-zero lift, and recurring runs were silent on the silent-success regressions they nominally watched for.

The exception is scoped narrowly and affirmatively. "We don't currently look at the results, but we plan to" does NOT qualify (intent without follow-through is bypass-cope dressed as future work, the exact framing boy-scout.md and context-artifacts.md's "Disagreeing With the Reviewer" were authored to close). "We have a publish-tile gate that fails on eval regressions but nobody actually checks the failures" does NOT qualify (precondition 2 is violated by the gate itself, regardless of whether a human reads the failure). Tiles that fail any of the three preconditions — including coding-policy itself, where the maintainer reads scenario lift on every publish (precondition 1 fails) — are NOT exempt; the rule applies in full.

## Task leak vs. measuring application

A scenario prompt that hands the agent a branch named `fix/*` is a task with the answer smuggled in. Distinguish tile-internal knowledge (`tessl/tiles/...` paths, tile-only identifiers) from public knowledge (`gh pr create`, REST endpoints, conventional-commits format, semver). Grading for tile-specific choices is different: commit-message literals (`Fixed in <sha>`), chosen flags (`--ff-only`), specific sequences, invented format literals. A competent engineer without the tile would not produce those specific choices; that is precisely why they measure tile value. Checking for them is measuring application, not leaking. "`gh pr merge`" is public.
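Grading for tile-specific choices can be sketched as a plain substring check. The marker list below reuses the literals named above; the function itself is illustrative, not a Tessl API:

```python
# Tile-specific choices a competent engineer without the tile would not
# produce; their presence measures application of the tile, not leakage.
TILE_SPECIFIC_MARKERS = [
    "Fixed in ",   # commit-message format literal
    "--ff-only",   # chosen flag
]

def measures_tile_application(agent_output: str) -> bool:
    """True when the output contains any tile-specific choice."""
    return any(marker in agent_output for marker in TILE_SPECIFIC_MARKERS)

print(measures_tile_application("git merge --ff-only feature"))  # True
print(measures_tile_application("gh pr merge"))                  # False: public knowledge only
```

The asymmetry is the point: `gh pr merge` in the output proves nothing (any competent engineer writes it), while `--ff-only` is a choice the tile installed.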
"Uses `createJwtToken` internal action" is tile-internal.

## Lift

A scenario's lift is the delta between its with-context score (tile loaded) and its baseline score (tile not loaded). A scenario with near-zero lift on a positive case is telling you one of three things; run the three-cause diagnosis before deciding whether to keep it.
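Lift itself is simple arithmetic. A minimal sketch, with an assumed near-zero threshold of 0.05 (the threshold is illustrative, not a policy number):

```python
def lift(with_context: float, baseline: float) -> float:
    """Lift = with-context score (tile loaded) minus baseline score (tile not loaded)."""
    return with_context - baseline

def near_zero_lift(with_context: float, baseline: float, threshold: float = 0.05) -> bool:
    # Threshold is illustrative; the policy's bar is per-scenario lift contribution.
    return abs(lift(with_context, baseline)) < threshold

print(round(lift(0.97, 0.90), 2))    # 0.07
print(near_zero_lift(0.91, 0.90))    # True: candidate for the three-cause diagnosis
print(near_zero_lift(0.97, 0.80))    # False: the scenario is pulling weight
```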
## Curation

Review the suite (per skills/eval-curation/SKILL.md) at least every few publishes; for any scenario that still shows near-zero lift after the three-cause diagnosis has been applied and the fix attempted, retire it. The bar is per-scenario lift contribution, not raw scenario count — a 10-scenario suite where every scenario pulls weight is healthier than a 35-scenario suite where half score baseline-equivalent. The concrete failure mode this clause is authored against: an eval suite grown by uncritical `tessl scenario generate` runs, accumulating happy-path scenarios that match baseline competence, never pruned, paying Tessl run-cost on every publish for zero added signal.

## Persistence

Evals run when `tessl tile publish` (or `tesslio/patch-version-publish`) executes — that is the persistence point. Do not add a `tessl eval run` step to tile-repo CI, and do not add a scheduled/recurring workflow that re-runs the suite as a persistence mechanism; the Tessl-publish layer owns persistence execution, and any cadence on top is duplicate cost producing the same numbers a maintainer would already see at publish time. Out of scope of this clause: local invocations by authors during scenario authoring or debugging (`tessl eval run .` per skills/eval-authoring/SKILL.md's authoring loop) and ad-hoc invocations by separate measurement rigs (the 4-way reviewer matrix in jbaruch/coding-policy-evals is the canonical example) — those aren't "persistence" in the sense this rule governs and remain permitted (see Fixture Hygiene). Don't add a parallel CI step that could mask the publish-layer failure.

## Naming

Scenario directory names are kebab-case (`my-scenario`, not `MyScenario`, `my_scenario`, or `my scenario`) and follow `<skill>-<descriptor>` (e.g., `install-reviewer-refuses-overwrite`, `eval-curation-task-leak-fix`). When the scenario tests a cross-cutting workflow or a rule that isn't owned by a single skill, name the behavior directly without a skill prefix (e.g., `pr-merge-and-post-merge-cleanup`). Name the behavior, not the implementation: `refuses-overwrite` ✓, `checks-existing-file-via-stat` ✗.
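The kebab-case rule is mechanically checkable. A minimal sketch (the regex and function name are illustrative):

```python
import re

# Lowercase words of letters/digits joined by single hyphens.
KEBAB_CASE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

def valid_scenario_name(name: str) -> bool:
    """Kebab-case only; rejects CamelCase, snake_case, and spaces."""
    return bool(KEBAB_CASE.fullmatch(name))

print(valid_scenario_name("install-reviewer-refuses-overwrite"))  # True
print(valid_scenario_name("MyScenario"))                          # False
print(valid_scenario_name("my_scenario"))                         # False
print(valid_scenario_name("my scenario"))                         # False
```

A check like this cannot grade the harder half of the rule (behavior-named, not implementation-named); that part stays a review judgment.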
The implementation is allowed to change without the scenario being renamed; the behavior is what the eval grades.

Scenario directory names are capped at 40 characters: `version-bump-reasoning-and-manifest-update` (42 chars) was committed as `version-bump-reasoning-and-manifest-upda` (40 chars), losing the trailing "te". The 40-char cap is set AT the observed safe threshold, not above it — anything in the 41+ range is in the truncation zone the cap exists to prevent. The cap applies prospectively to newly authored or not-yet-run scenarios; scenarios that have already been run through `tessl eval run` and exceed the cap are grandfathered by the rename-stability clause below (which wins over the cap).

Once a scenario has been run through `tessl eval run`, its name is stable — do not rename it. `tessl eval view` and the per-scenario lift history identify scenarios by directory name, so a rename makes the renamed scenario appear as a new scenario with no historical lift data, resetting the lift trend across publishes. That trend IS the regression-detector the Persistence section relies on; renaming silently destroys it.

A tile's local naming shape (e.g., `<rule>-<fixture_type>-<cell>-run-<n>` in jbaruch/coding-policy-evals) may diverge from the above when the divergence is documented in the tile's local evals/instructions.json or equivalent — the shape itself becomes the convention for that tile. The 40-char cap and the rename-stability clause still apply.

## Fixture Hygiene

Date-stamp fixture files (e.g., `fixture-2025-04-17.json`).
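The cap is worth checking before a name is ever committed. A minimal sketch using the observed failure case (the function name is illustrative):

```python
MAX_SCENARIO_NAME = 40  # set AT the observed safe threshold, not above it

def fits_cap(name: str) -> bool:
    """Prospective check for newly authored / not-yet-run scenario names.
    Already-run over-cap names are grandfathered by rename stability."""
    return len(name) <= MAX_SCENARIO_NAME

# The observed failure: a 42-char name silently truncated to 40 chars
# at commit time, losing the trailing "te".
name = "version-bump-reasoning-and-manifest-update"
print(len(name), fits_cap(name))   # 42 False
print(name[:MAX_SCENARIO_NAME])    # version-bump-reasoning-and-manifest-upda
```

Because the truncation is silent, failing loudly at authoring time is the only place this can be caught; once the truncated name has been run, rename stability forbids fixing it.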
Suite layout:

- evals/
  - scenario-1
  - scenario-2
  - scenario-3
  - scenario-4
  - scenario-5
  - scenario-6
  - scenario-7
  - scenario-8
  - scenario-9
  - scenario-10
  - scenario-11
  - scenario-12
  - scenario-13
  - scenario-14
- rules/
- skills/
  - eval-curation/
  - install-reviewer/