General-purpose coding policy for Baruch's AI agents
You're running a curation pass over an eval suite for a tile. The most recent tessl eval run produced per-scenario lift numbers, and one positive-case scenario is sitting at near-zero lift after multiple runs. The tile's owner needs a diagnosis report and a recommended action before the next publish.
The scenario under review is below. The lift signal across three runs: with-context 94, baseline 91.6 (lift +2.4).
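The lift figure above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes lift is the plain difference between the with-context and baseline scores (which matches the +2.4 quoted) and that scores are on a 0 to 100 scale (which matches the checklist weights summing to 100). The function names `lift` and `headroom` are illustrative, not part of any tessl API.

```python
def lift(with_context: float, baseline: float) -> float:
    """Lift as a simple difference, rounded to one decimal place."""
    return round(with_context - baseline, 1)


def headroom(baseline: float, max_score: float = 100.0) -> float:
    """Largest lift the scenario could show given its baseline (assumes a 0-100 scale)."""
    return round(max_score - baseline, 1)


# Numbers quoted in the task: with-context 94, baseline 91.6.
observed = lift(94.0, 91.6)
ceiling = headroom(91.6)

print(f"observed lift: +{observed}")
print(f"maximum possible lift at this baseline: +{ceiling}")
```

Under these assumptions, a baseline of 91.6 leaves little room for any context to show lift, so the baseline is as relevant to the diagnosis as the lift number.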
Write a file named diagnosis.md in the working directory containing a diagnosis of the near-zero lift and a recommended action, one of retire, fix-task, or rewrite-criteria. If you choose fix-task, include the rewritten task body inline.
Do not edit the scenario files directly; produce the diagnosis report (with the proposed task rewrite inline if applicable).
=============== FILE: evals/scenario-merge-flag/task.md ===============
You've been asked to merge PR #42 into main. The team's convention is to use --ff-only for clean merges so the branch history stays linear. Produce the exact gh pr merge invocation you would run.
=============== FILE: evals/scenario-merge-flag/criteria.json ===============
{
  "context": "Tests whether the agent uses the team's prescribed merge flag, per the merge-workflow tile.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "uses ff-only",
      "max_score": 70,
      "description": "The gh pr merge invocation includes --ff-only (the team's prescribed merge mode)"
    },
    {
      "name": "references PR 42",
      "max_score": 30,
      "description": "The invocation targets PR 42 (gh pr merge 42 --ff-only or equivalent)"
    }
  ]
}
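To make the weighting in criteria.json concrete, here is a rough sketch of how a weighted checklist might roll up into a single score. How the eval runner actually grades is not stated in the scenario, so the clamping rule and the helper names `total_score` and `max_total` are assumptions for illustration only; only the item names and `max_score` values come from the file.

```python
import json

# Item names and weights copied from criteria.json; descriptions omitted for brevity.
CRITERIA = json.loads("""
{
  "type": "weighted_checklist",
  "checklist": [
    {"name": "uses ff-only", "max_score": 70},
    {"name": "references PR 42", "max_score": 30}
  ]
}
""")


def max_total(criteria: dict) -> int:
    """Highest score the checklist can award."""
    return sum(item["max_score"] for item in criteria["checklist"])


def total_score(criteria: dict, awarded: dict) -> int:
    """Sum the points awarded per item, clamped to [0, max_score] (assumed rule)."""
    total = 0
    for item in criteria["checklist"]:
        points = awarded.get(item["name"], 0)
        total += max(0, min(points, item["max_score"]))
    return total


# Hypothetical grading outcome: the PR number is right, the flag is missing.
partial = total_score(CRITERIA, {"references PR 42": 30})
print(partial, "out of", max_total(CRITERIA))
```

With this split, 70 of the 100 points ride on a single flag, so whatever determines whether an agent emits that flag dominates both the with-context and baseline scores.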