General-purpose coding policy for Baruch's AI agents
96
90%
Does it follow best practices?
Impact
97%
1.24x
Average score across 14 eval scenarios
Passed
No known issues
You're running a curation pass over an eval suite for a tile. The most recent tessl eval run produced per-scenario lift numbers, and one positive-case scenario is sitting at near-zero lift after multiple runs. The tile's owner needs a diagnosis report and a recommended action before the next publish.
The scenario under review is below. The lift signal across three runs: with-context 92, baseline 90.8 (lift +1.2).
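The lift figure above is the with-context score minus the baseline score. A minimal sketch of that arithmetic follows; the function names and the 2-point "near-zero" cutoff are hypothetical choices for illustration, not values defined by tessl or by this tile.

```python
def lift(with_context: float, baseline: float) -> float:
    """Return the lift in points (with-context minus baseline), rounded to one decimal."""
    return round(with_context - baseline, 1)


# Hypothetical cutoff: the source only says "near-zero", it does not give a threshold.
NEAR_ZERO_THRESHOLD = 2.0


def is_near_zero(with_context: float, baseline: float,
                 threshold: float = NEAR_ZERO_THRESHOLD) -> bool:
    """Flag a scenario whose absolute lift falls below the assumed threshold."""
    return abs(lift(with_context, baseline)) < threshold


# Figures from the scenario under review.
scenario_lift = lift(92, 90.8)
print(f"lift: {scenario_lift:+.1f}")           # lift: +1.2
print("near zero:", is_near_zero(92, 90.8))    # near zero: True
```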
Write a file named diagnosis.md in the working directory containing the diagnosis and a recommended action, one of: retire, fix-task, rewrite-criteria.
Do not edit the scenario itself in this task; your output is the diagnosis report only.
=============== FILE: evals/scenario-imperative-mood/task.md ===============
You've just added a function that validates email addresses against RFC 5322. Author a single git commit that captures this change. Produce the commit message body only (no git commit invocation).
=============== FILE: evals/scenario-imperative-mood/criteria.json ===============
{
  "context": "Tests whether the agent produces a commit message in imperative mood, per the commit-conventions tile.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "imperative mood",
      "max_score": 60,
      "description": "The first line of the commit message uses imperative mood (e.g., 'Add', 'Validate', 'Implement') rather than past tense ('Added', 'Validated') or present participle ('Adding', 'Validating')"
    },
    {
      "name": "subject line under 72 chars",
      "max_score": 40,
      "description": "The subject line is 72 characters or fewer"
    }
  ]
}
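To make the weighted checklist concrete, here is a rough sketch of how the two criteria could be applied to a commit message. It is not how tessl grades scenarios: the function name is hypothetical, and the imperative-mood check is a crude suffix heuristic (it would wrongly reject verbs such as "Embed"), whereas the criterion as written calls for a judgment about mood.

```python
def score_commit_message(message: str) -> int:
    """Score a commit message against the two checklist items (max 100)."""
    lines = message.strip().splitlines()
    subject = lines[0] if lines else ""
    words = subject.split()
    first_word = words[0].lower() if words else ""

    # Crude stand-in for "imperative mood": the first word must not look like
    # past tense ("Added") or a present participle ("Adding").
    looks_imperative = bool(first_word) and not first_word.endswith(("ed", "ing"))

    score = 0
    if looks_imperative:
        score += 60  # "imperative mood", max_score 60
    if subject and len(subject) <= 72:
        score += 40  # "subject line under 72 chars", max_score 40
    return score


print(score_commit_message("Add RFC 5322 email address validation"))    # 100
print(score_commit_message("Added RFC 5322 email address validation"))  # 40
```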