General-purpose coding policy for Baruch's AI agents
Eval summary:

- Score: 95
- Does it follow best practices? 91%
- Impact: 96%
- Average score across 10 eval scenarios: 1.31x
- Advisory: suggest reviewing before use
`tessl scenario generate` skews toward happy-path scenarios; write negative cases by hand, using the existing scenarios as a structural template.

### When a tile may drop its evals

A tile may stop running evals only when all three preconditions hold:

1. No human reviews the eval results.
2. No automated gating consumes the results.
3. The owner records the decision in a Rules (or equivalent) entry naming this rule plus the date, AND accepts that re-introducing any consumption of eval results later (whether human review or automated gating) requires re-introducing evals first under the standard requirement.

The reasoning is structural: evals are an instrument, not a deliverable. They produce measurements that only become signal when something (a human, a gate, a downstream system) reads them and acts. A tile satisfying all three preconditions is generating measurements that never become signal anywhere. Every eval run is pure cost (`tessl eval run` budget, scenario-authoring effort, fixture maintenance) producing zero decisions, and the suite has no theory of how it would catch a regression: a real regression manifests → the eval flags it → the output goes nowhere → the regression ships anyway. (A small code sketch of this all-three-preconditions logic appears below, after the lift discussion.)

Reference example: the jbaruch/nanoclaw-* plugin fleet (nanoclaw-admin, nanoclaw-core, nanoclaw-trusted, nanoclaw-untrusted, nanoclaw-host, nanoclaw-telegram), a fully automated agent loop satisfying all three preconditions. The prior evals.yml workflow ran with `continue-on-error: true` (no gating use), no human reviewed the daily-cadence runs (no human review), and the owner declaration was recorded in the nanoclaw-admin CHANGELOG plus a follow-up coding-policy PR (this carve-out itself, post-merge). A multi-month observation period confirmed the predicted failure mode: the 40-scenario suite was not catching real regressions, several scenarios had been retired for roughly zero lift, and the recurring runs were silent on the silent-success regressions they nominally watched for.

The exception is scoped narrowly and affirmatively. "We don't currently look at the results, but we plan to" does NOT qualify: intent without follow-through is bypass-cope dressed up as future work, the exact framing that boy-scout.md and the "Disagreeing With the Reviewer" section of context-artifacts.md were authored to close. "We have a publish-tile gate that fails on eval regressions but nobody actually checks the failures" does NOT qualify: precondition 2 is violated by the gate itself, regardless of whether a human reads the failure. Tiles that fail any of the three preconditions, including coding-policy itself (the maintainer reads scenario lift on every publish, so precondition 1 fails), are NOT exempt; the rule applies in full.

### Scenario prompts must not leak the answer

A prompt that tells the agent to use a `fix/*` branch is a task with the answer smuggled in. Anything the tile itself introduces counts as leaking when it appears in a prompt: `tessl/tiles/...` paths, tile-only identifiers. Public knowledge does not: `gh pr create`, REST endpoints, the conventional-commits format, semver.

Checks are different. Checks should assert the tile's specific decisions: specific formats (`Fixed in <sha>`), chosen flags (`--ff-only`), specific sequences, invented format literals. A competent engineer without the tile would not produce those specific choices; that is precisely why they measure tile value. Checking for them is measuring application, not leaking. The dividing line is whose knowledge it is: "uses `gh pr merge`" is public, "uses the `createJwtToken` internal action" is tile-internal. (A rough lint sketch for spotting these tile-internal markers appears below.)

### Reading scenario lift

Every scenario run reports a with-context score (tile loaded) and a baseline score (tile not loaded); lift is the ratio between them. A scenario with near-zero lift on a positive case is telling you one of three things: the checks assert public knowledge a competent engineer produces without the tile, the prompt leaks the answer so the baseline run scores as well as the with-context run, or the tile genuinely does not change behaviour on that task. Fixture files are date-stamped (fixture-2025-04-17.json).
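To make the lift arithmetic concrete, here is a minimal sketch in Python. The scenario names and scores are hypothetical placeholders; the real numbers come out of `tessl eval run`, not a hand-maintained dictionary.

```python
# Hypothetical per-scenario scores; real values come from `tessl eval run` output.
# with_context = tile loaded, baseline = tile not loaded.
scores = {
    "scenario-1": {"with_context": 0.92, "baseline": 0.70},
    "scenario-2": {"with_context": 0.84, "baseline": 0.82},  # near-zero lift: investigate
    "scenario-3": {"with_context": 0.78, "baseline": 0.52},
}

# Lift is the ratio of the with-context score to the baseline score.
lifts = {name: s["with_context"] / s["baseline"] for name, s in scores.items()}

for name, lift in sorted(lifts.items()):
    note = "  <- near-zero lift: leaked prompt, public-knowledge check, or no real tile effect"
    print(f"{name}: {lift:.2f}x{note if lift < 1.05 else ''}")

average_lift = sum(lifts.values()) / len(lifts)
print(f"average lift across {len(lifts)} scenarios: {average_lift:.2f}x")
```

The 1.31x figure in the summary at the top of this page is presumably this kind of average, taken across the tile's ten scenarios.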
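As a compact restatement of the carve-out logic above, here is a small sketch; the type and function names are mine, not part of the policy, and the point is only that all three preconditions must hold at once and that an unread gate still disqualifies.

```python
from dataclasses import dataclass

@dataclass
class EvalConsumption:
    human_reviews_results: bool       # precondition 1 fails if anyone reads the results
    automated_gate_on_results: bool   # precondition 2 fails if any gate consumes them
    recorded_in_rules_entry: bool     # precondition 3: Rules entry naming the rule + date

def carve_out_applies(c: EvalConsumption) -> bool:
    # All three preconditions must hold simultaneously.
    return (not c.human_reviews_results
            and not c.automated_gate_on_results
            and c.recorded_in_rules_entry)

# nanoclaw-style fleet: nothing consumes the results, decision recorded -> qualifies.
print(carve_out_applies(EvalConsumption(False, False, True)))   # True
# A publish gate nobody reads still violates precondition 2 -> does not qualify.
print(carve_out_applies(EvalConsumption(False, True, True)))    # False
# coding-policy itself: the maintainer reads scenario lift -> precondition 1 fails.
print(carve_out_applies(EvalConsumption(True, False, True)))    # False
```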
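And as a rough illustration of the public versus tile-internal boundary, here is a hypothetical reviewer aid, not part of the tile: it flags tile-internal markers wherever they appear under evals/ and leaves the prompt-versus-check classification (fine in a check, a leak in a prompt) to the human reading the report. The marker list and the directory layout are assumptions.

```python
from pathlib import Path

# Tile-internal markers (assumed list): things only this tile introduces.
TILE_INTERNAL_MARKERS = [
    "tessl/tiles/",       # tile paths
    "createJwtToken",     # internal action name
    "fix/",               # tile branch-naming convention
    "Fixed in <sha>",     # invented format literal
]
# Public knowledge (gh pr create, REST endpoints, conventional commits, semver)
# is deliberately absent from the list: it is fine anywhere.

def scan(evals_dir: str = "evals") -> None:
    for path in sorted(Path(evals_dir).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable fixtures
        for marker in TILE_INTERNAL_MARKERS:
            if marker in text:
                print(f"{path}: tile-internal marker {marker!r} found; "
                      "verify it appears only in checks, never in the prompt")

if __name__ == "__main__":
    scan()
```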
### Contents

- evals
  - scenario-1
  - scenario-2
  - scenario-3
  - scenario-4
  - scenario-5
  - scenario-6
  - scenario-7
  - scenario-8
  - scenario-9
  - scenario-10
- rules
- skills
- install-reviewer