jbaruch/coding-policy

General-purpose coding policy for Baruch's AI agents

Eval Curation — Curate the Suite

Name: jbaruch/coding-policy
Rating: 96.4 (1 review)
Author: jbaruch

Problem Description

You're running a curation pass over an eval suite for a tile. The most recent tessl eval run produced per-scenario lift numbers, summarized below. The tile's owner wants a curation summary before the next publish.

Output Specification

Write a file named `curation-summary.md` in the working directory. The file's content depends on the suite's state:

- If any scenarios need curation, list them with the cause identification (from the tile's three-cause framework), the recommended action, and reasoning.
- If no scenarios need curation, write a one-line summary stating that no curation is needed.

Do not fabricate diagnoses for scenarios that don't need them.
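The two output branches can be sketched in Python as follows. This is a minimal illustration of the file's shape, not the tile's own tooling: the function names, the dictionary keys, the list layout, and the exact wording of the one-line summary are all assumptions, and the cause labels must come from the tile's three-cause framework, which is not reproduced here.

```python
from pathlib import Path

# Hypothetical wording; the task only requires a one-line statement
# that no curation is needed.
NO_CURATION_LINE = "No curation needed: no scenario in the suite requires action."


def render_summary(findings):
    """Return the text of curation-summary.md.

    `findings` is a list of dicts with the (assumed) keys "scenario",
    "cause", "action", and "reasoning". An empty list produces the
    one-line summary instead of a list of diagnoses.
    """
    if not findings:
        return NO_CURATION_LINE + "\n"

    lines = ["Scenarios needing curation", ""]
    for finding in findings:
        lines.append(f"- `{finding['scenario']}`")
        lines.append(f"  - Cause: {finding['cause']}")
        lines.append(f"  - Recommended action: {finding['action']}")
        lines.append(f"  - Reasoning: {finding['reasoning']}")
    return "\n".join(lines) + "\n"


def write_summary(findings, directory="."):
    """Write the summary into `directory` and return the file path."""
    path = Path(directory) / "curation-summary.md"
    path.write_text(render_summary(findings), encoding="utf-8")
    return path


if __name__ == "__main__":
    # With no findings, only the one-line summary is produced.
    print(render_summary([]), end="")
```

Keeping the rendering separate from the file write makes the empty case easy to check: no findings in, no diagnoses out.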

Per-Scenario Lift Summary

The suite has 5 positive-case scenarios and 1 negative-case scenario. Lift values are means across 3 runs.

| Scenario | Case type | with-context | baseline | lift |
| --- | --- | --- | --- | --- |
| `merge-with-canonical-flag` | positive | 96 | 41 | +55 |
| `reply-with-fixed-in-template` | positive | 92 | 35 | +57 |
| `discover-bot-id-via-graphql` | positive | 88 | 38 | +50 |
| `compose-pr-body-with-author-model-line` | positive | 90 | 47 | +43 |
| `chain-poll-then-merge-after-green` | positive | 94 | 51 | +43 |
| `refuse-publish-with-uncommitted-changes` | negative | 100 | 100 | 0 |

Note on the negative case: `refuse-publish-with-uncommitted-changes` tests that the agent refuses to publish when the working tree has uncommitted changes. A lift of 0 means the baseline and with-context agents refuse at the same rate.
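The lift column can be reproduced from the two score columns. The sketch below assumes lift is simply the with-context score minus the baseline score, which matches every row in the summary; the `ROWS` structure and the function names are illustrative only.

```python
# (scenario, case type, with-context score, baseline score),
# copied from the per-scenario lift summary.
ROWS = [
    ("merge-with-canonical-flag", "positive", 96, 41),
    ("reply-with-fixed-in-template", "positive", 92, 35),
    ("discover-bot-id-via-graphql", "positive", 88, 38),
    ("compose-pr-body-with-author-model-line", "positive", 90, 47),
    ("chain-poll-then-merge-after-green", "positive", 94, 51),
    ("refuse-publish-with-uncommitted-changes", "negative", 100, 100),
]


def lift(with_context, baseline):
    """Assumed definition: with-context score minus baseline score."""
    return with_context - baseline


def lift_table(rows):
    """Map each scenario name to its computed lift."""
    return {name: lift(w, b) for name, _case_type, w, b in rows}


LIFTS = lift_table(ROWS)

if __name__ == "__main__":
    for name, value in LIFTS.items():
        print(f"{name}: {value}")
```

Recomputing the column this way is a quick consistency check on the reported numbers before any curation decision is made.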
