General-purpose coding policy for Baruch's AI agents
Does the task hand the agent the answer?
If the task says "use library X with algorithm Y" and the criteria check "uses library X" and "uses algorithm Y", that's bleeding — the eval tests reading comprehension, not problem-solving. The task should describe the problem; the criteria should check the solution.
Check: for each criterion, search the task text for the criterion's expected value. If found verbatim, it's bleeding.
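The check above can be sketched as a small script. This is a minimal sketch, not the actual tooling: the criterion schema (`name` and `expected` keys) is an assumption, and a real check might also normalize punctuation or match near-verbatim phrasings.

```python
def find_bleeding(task_text: str, criteria: list[dict]) -> list[str]:
    """Flag criteria whose expected value appears verbatim in the task text.

    Assumes each criterion is a dict with "name" and "expected" keys;
    the real criterion schema may differ.
    """
    task = task_text.lower()
    flagged = []
    for criterion in criteria:
        expected = str(criterion["expected"]).strip().lower()
        # A verbatim hit means the task hands the agent the answer.
        if expected and expected in task:
            flagged.append(criterion["name"])
    return flagged
```

A flagged criterion is a candidate for rewriting, not an automatic failure: sometimes the overlap is a harmless common phrase, so a human should confirm each hit.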
Does the task or criteria reference tile internals?
Check: for each criterion, ask "would someone outside this tile's team understand this term?" If not, it's leaking.
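The "would an outsider understand this term?" question ultimately needs human judgment, but a first-pass filter can flag known internal jargon automatically. The glossary below is hypothetical; a real version would be a team-maintained list of tile-internal names.

```python
import re

# Hypothetical examples of tile-internal jargon; replace with the
# team's actual glossary of internal names.
INTERNAL_TERMS = {"tile_runner", "shard_v2", "blue_path"}

def find_leaks(text: str, internal_terms: set[str] = INTERNAL_TERMS) -> set[str]:
    """Return internal terms that appear in a task or criterion text."""
    words = set(re.findall(r"[a-z0-9_]+", text.lower()))
    return {term for term in internal_terms if term in words}
```

Any term this flags should be replaced with a description an outside reader would understand, or the criterion should be dropped.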
On failure, the criterion's description must explain what went wrong, not just say "mismatch".
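As an illustration of that rule, here is a hedged sketch of a criterion check that builds a useful failure description. The result shape (`name`/`passed`/`description` keys) and the helper name `check_equal` are assumptions for the example, not the actual eval framework.

```python
def check_equal(name: str, expected, actual) -> dict:
    """Build a criterion result whose description says what went wrong."""
    if expected == actual:
        return {"name": name, "passed": True, "description": "matched"}
    return {
        "name": name,
        "passed": False,
        # Bad: "mismatch". Good: name the field and show both values.
        "description": f"{name}: expected {expected!r}, got {actual!r}",
    }
```

A reader of the failed result should be able to tell which value was wrong and how, without re-running the eval.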