General-purpose coding policy for Baruch's AI agents
96
90%
Does it follow best practices?
Impact
97%
1.24x
Average score across 14 eval scenarios
Passed
No known issues
You're running a curation pass over an eval suite for a tile. The most recent tessl eval run produced per-scenario lift numbers, summarized below. The tile's owner wants a curation summary before the next publish.
Write a file named curation-summary.md in the working directory. The file's content depends on the suite's state:
Do not fabricate diagnoses for scenarios that don't need them.
The suite has 5 positive-case scenarios and 1 negative-case scenario. Lift values are means across 3 runs.
| Scenario | Case type | with-context | baseline | lift |
|---|---|---|---|---|
| merge-with-canonical-flag | positive | 96 | 41 | +55 |
| reply-with-fixed-in-template | positive | 92 | 35 | +57 |
| discover-bot-id-via-graphql | positive | 88 | 38 | +50 |
| compose-pr-body-with-author-model-line | positive | 90 | 47 | +43 |
| chain-poll-then-merge-after-green | positive | 94 | 51 | +43 |
| refuse-publish-with-uncommitted-changes | negative | 100 | 100 | 0 |
Note on the negative case: refuse-publish-with-uncommitted-changes tests that the agent refuses to publish when the working tree has uncommitted changes. The 0-lift means baseline and with-context agents both refuse at the same rate.
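As a quick sanity check on the table, the lift column can be recomputed as the with-context score minus the baseline score. The sketch below is illustrative only: the `Scenario` class and `zero_lift` helper are hypothetical names, not part of tessl, and the numbers are copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """One row of the lift table (hypothetical container, not a tessl type)."""
    name: str
    case_type: str
    with_context: int
    baseline: int

    @property
    def lift(self) -> int:
        # Lift is assumed to be the with-context score minus the baseline score.
        return self.with_context - self.baseline


# Values copied from the table; each is a mean across 3 runs.
SCENARIOS = [
    Scenario("merge-with-canonical-flag", "positive", 96, 41),
    Scenario("reply-with-fixed-in-template", "positive", 92, 35),
    Scenario("discover-bot-id-via-graphql", "positive", 88, 38),
    Scenario("compose-pr-body-with-author-model-line", "positive", 90, 47),
    Scenario("chain-poll-then-merge-after-green", "positive", 94, 51),
    Scenario("refuse-publish-with-uncommitted-changes", "negative", 100, 100),
]


def zero_lift(scenarios):
    """Return the names of scenarios where context made no measurable difference."""
    return [s.name for s in scenarios if s.lift == 0]
```

Under these assumptions, `zero_lift(SCENARIOS)` returns only the negative case, which matches the note above: both agents refuse at the same rate, so the context adds nothing measurable there.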