General-purpose coding policy for Baruch's AI agents
Does the task hand the agent the answer?
If the task says "use library X with algorithm Y" and the criteria check "uses library X" and "uses algorithm Y", that's bleeding — the eval tests reading comprehension, not problem-solving. The task should describe the problem; the criteria should check the solution.
Check: for each criterion, search the task text for the criterion's expected value. If found verbatim, it's bleeding.
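The check above can be sketched as a small script. This is a minimal sketch, not the actual tooling: the criterion schema (`name` and `expected` keys) is an assumption, and a real check might also normalize punctuation or match near-verbatim phrasings.

```python
def find_bleeding(task_text: str, criteria: list[dict]) -> list[str]:
    """Flag criteria whose expected value appears verbatim in the task text.

    Assumes each criterion is a dict with "name" and "expected" keys;
    the real criterion schema may differ.
    """
    task = task_text.lower()
    flagged = []
    for criterion in criteria:
        expected = str(criterion["expected"]).strip().lower()
        # A verbatim hit means the task hands the agent the answer.
        if expected and expected in task:
            flagged.append(criterion["name"])
    return flagged
```

A flagged criterion is a candidate for rewriting, not an automatic failure: sometimes the overlap is a harmless common phrase, so a human should confirm each hit.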
Does the task or criteria reference tile internals?
Check: for each criterion, ask "would someone outside this tile's team understand this term?" If not, it's leaking.
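The "would an outsider understand this term?" question ultimately needs human judgment, but a first-pass filter can flag known internal jargon automatically. The glossary below is hypothetical; a real version would be a team-maintained list of tile-internal names.

```python
import re

# Hypothetical examples of tile-internal jargon; replace with the
# team's actual glossary of internal names.
INTERNAL_TERMS = {"tile_runner", "shard_v2", "blue_path"}

def find_leaks(text: str, internal_terms: set[str] = INTERNAL_TERMS) -> set[str]:
    """Return internal terms that appear in a task or criterion text."""
    words = set(re.findall(r"[a-z0-9_]+", text.lower()))
    return {term for term in internal_terms if term in words}
```

Any term this flags should be replaced with a description an outside reader would understand, or the criterion should be dropped.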
On failure, the criterion's description must explain what went wrong, not just say "mismatch".
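As an illustration of that rule, here is a hedged sketch of a criterion check that builds a useful failure description. The result shape (`name`/`passed`/`description` keys) and the helper name `check_equal` are assumptions for the example, not the actual eval framework.

```python
def check_equal(name: str, expected, actual) -> dict:
    """Build a criterion result whose description says what went wrong."""
    if expected == actual:
        return {"name": name, "passed": True, "description": "matched"}
    return {
        "name": name,
        "passed": False,
        # Bad: "mismatch". Good: name the field and show both values.
        "description": f"{name}: expected {expected!r}, got {actual!r}",
    }
```

A reader of the failed result should be able to tell which value was wrong and how, without re-running the eval.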