General-purpose coding policy for Baruch's AI agents
Registry summary:

- Does it follow best practices? 91%
- Impact: 86% average score across 10 eval scenarios, 1.17x lift
- Advisory: suggest reviewing before use
`tessl scenario generate` skews toward happy-path scenarios; write negative cases by hand, using existing scenarios as a structural template.

A tile may skip the standard eval requirement (the suite runs when `tessl tile publish` or `tesslio/patch-version-publish` executes, and regressions block the publish) only when ALL three of these preconditions hold:

1. **No human review.** No human ever reads eval output for this tile, in any form: attainment scores, lift deltas, scenario-by-scenario diffs, regression alerts, failure traces, dashboards, periodic reports.
2. **No gating use.** Eval results do NOT gate any downstream automated action, including but not limited to release blocks, deploy blocks, publish-tile gates, rollback triggers, alert routing, dashboard surfaces, paging, and summary stats consumed by another workflow. The no-eyeballs assumption is meaningless if a gate consumes the signal: a publish-blocking eval gate is still producing signal, just not via human eyes.
3. **Affirmative owner declaration.** The tile's CHANGELOG records the exception in writing under a `### Rules` (or equivalent) entry naming this rule and the date, AND the owner accepts that re-introducing any consumption of eval results later (whether human review or automated gating) requires re-introducing evals first, under the standard requirement.

The reasoning is structural: evals are an instrument, not a deliverable. They produce measurements that only become signal when something (a human, a gate, a downstream system) reads them and acts. A tile satisfying all three preconditions is generating measurements that never become signal anywhere. Every eval run is pure cost (`tessl eval run` budget, scenario-authoring effort, fixture maintenance) producing zero decisions, and the suite has no theory of how it would catch a regression: a real regression manifests → the eval flags it → the output goes nowhere → the regression ships anyway.
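The three-precondition test is a pure conjunction; a minimal sketch, assuming a hypothetical record of how a tile's eval output is consumed (none of these names are Tessl APIs):

```python
from dataclasses import dataclass

# Illustrative sketch only: the field names below are hypothetical, not a Tessl API.
@dataclass
class TileEvalUsage:
    humans_read_output: bool           # any human consumption: scores, diffs, alerts, dashboards
    gates_automated_action: bool       # any downstream gate: publish blocks, rollbacks, paging
    owner_declared_in_changelog: bool  # written exception under a "### Rules" CHANGELOG entry

def may_skip_evals(u: TileEvalUsage) -> bool:
    """The exception applies only when ALL three preconditions hold."""
    return (
        not u.humans_read_output           # (1) no human review
        and not u.gates_automated_action   # (2) no gating use
        and u.owner_declared_in_changelog  # (3) affirmative owner declaration
    )

# A publish-blocking gate alone disqualifies the tile, even with no human readers:
gated = TileEvalUsage(humans_read_output=False, gates_automated_action=True,
                      owner_declared_in_changelog=True)
print(may_skip_evals(gated))  # False
```

Note that the declaration alone never suffices: failing either of the first two preconditions means eval output is still being consumed somewhere, so the standard requirement applies in full.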
Reference example: the `jbaruch/nanoclaw-*` plugin fleet (nanoclaw-admin, nanoclaw-core, nanoclaw-trusted, nanoclaw-untrusted, nanoclaw-host, nanoclaw-telegram), a fully automated agent loop satisfying all three preconditions. The prior `evals.yml` workflow ran with `continue-on-error: true` (no gating use), no human reviewed the daily-cadence runs (no human review), and the owner declaration was recorded in the nanoclaw-admin CHANGELOG plus a follow-up coding-policy PR (this carve-out itself, post-merge). A multi-month observation period confirmed the predicted failure mode: the 40-scenario suite was not catching real regressions, several scenarios had been retired for roughly zero lift, and the recurring runs were silent on the silent-success regressions they nominally watched for.

The exception is scoped narrowly and affirmatively. "We don't currently look at the results, but we plan to" does NOT qualify: intent without follow-through is bypass-cope dressed as future work, the exact framing that boy-scout.md and context-artifacts.md's "Disagreeing With the Reviewer" were authored to close. "We have a publish-tile gate that fails on eval regressions but nobody actually checks the failures" does NOT qualify: precondition 2 is violated by the gate itself, regardless of whether a human reads the failure. Tiles that fail any of the three preconditions, including coding-policy itself (where the maintainer reads scenario lift on every publish, so precondition 1 fails), are NOT exempt; the rule applies in full.

On scenario design: a prompt that spells out "fix/*" is a task with the answer smuggled in. Distinguish tile-internal surface (tessl/tiles/... paths, tile-only identifiers) from public knowledge (`gh pr create`, REST endpoints, conventional-commits format, semver). Asserting on tile-prescribed output (message literals such as `Fixed in <sha>`, chosen flags such as `--ff-only`, specific sequences, invented format literals) is legitimate: a competent engineer without the tile would not produce those specific choices, and that is precisely why they measure tile value. Checking for them is measuring application, not leaking. "Uses `gh pr merge`" is public.
"Uses `createJwtToken` internal action" is tile-internal.

Lift is the difference between the with-context score (tile loaded) and the baseline score (tile not loaded). A scenario with near-zero lift on a positive case is telling you one of three things:
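The lift arithmetic is per-scenario subtraction; a sketch, where the near-zero threshold is an illustrative choice, not a Tessl-defined constant:

```python
def lift(with_context: float, baseline: float) -> float:
    """Lift: with-context score (tile loaded) minus baseline score (tile not loaded)."""
    return with_context - baseline

def near_zero_lift(with_context: float, baseline: float, threshold: float = 0.02) -> bool:
    """Flag positive-case scenarios whose lift is indistinguishable from noise.
    The 0.02 threshold is an assumption for illustration."""
    return abs(lift(with_context, baseline)) < threshold

# A scenario the tile clearly helps vs. one it does not:
print(round(lift(0.91, 0.78), 2))  # 0.13
print(near_zero_lift(0.80, 0.79))  # True: candidate for rework or retirement
```

Whatever threshold you pick, the point of the check is triage, not automation: a flagged scenario still needs a human to decide which of the failure modes it is exhibiting.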
The suite runs when `tessl tile publish` (or `tesslio/patch-version-publish`) executes; that is the persistence point. Do not add a `tessl eval run` step to tile-repo CI, and do not add a scheduled/recurring workflow that re-runs the suite as a persistence mechanism: the Tessl publish layer owns persistence execution, and any cadence on top is duplicate cost producing the same numbers a maintainer would already see at publish time. Out of scope of this clause: local invocations by authors during scenario authoring or debugging (`tessl eval run .` per skills/eval-authoring/SKILL.md's authoring loop) and ad-hoc invocations by separate measurement rigs (the 4-way reviewer matrix in jbaruch/coding-policy-evals is the canonical example); those aren't "persistence" in the sense this rule governs and remain permitted (see Fixture Hygiene; fixtures stay date-stamped, e.g. `fixture-2025-04-17.json`). Don't add a parallel CI step that could mask the publish-layer failure.
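The no-duplicate-CI rule can be linted mechanically; a sketch that scans workflow text for `tessl eval run` steps (the lint itself is hypothetical, not part of the policy, and it only ever sees workflow files, so the permitted local invocations are untouched):

```python
import re

# Matches a `tessl eval run` invocation anywhere on a workflow line.
FORBIDDEN = re.compile(r"\btessl\s+eval\s+run\b")

def violations(workflow_text: str) -> list[int]:
    """Return 1-based line numbers of workflow lines invoking `tessl eval run`."""
    return [i for i, line in enumerate(workflow_text.splitlines(), start=1)
            if FORBIDDEN.search(line)]

# A workflow that duplicates the publish-layer run (exactly what the rule forbids):
ci = """\
name: tile-ci
on: [push]
jobs:
  evals:
    steps:
      - run: tessl eval run .
"""
print(violations(ci))  # [6]
```

A pre-merge hook could fail the PR when `violations` is non-empty, keeping the publish layer as the single persistence point.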
Tile contents:

```
evals/
  scenario-1
  scenario-2
  scenario-3
  scenario-4
  scenario-5
  scenario-6
  scenario-7
  scenario-8
  scenario-9
  scenario-10
rules/
skills/
install-reviewer
```