General-purpose coding policy for Baruch's AI agents
Overall score: 95
Does it follow best practices? 91%
Impact: 96%
Average score across 10 eval scenarios: 1.31x
Advisory: Suggest reviewing before use
{
"context": "Tests whether the agent can audit an existing eval suite for coverage gaps and author new scenarios that follow this tile's eval-authoring conventions. Weight concentrates on tile-specific quality properties (correct criteria format, no bleeding, no leaking, negative-case coverage, meaningful failure descriptions) rather than on whether the agent spotted the gaps — basic QA reasoning is enough to identify gaps from the skill's functional description, so that axis contributes little lift on its own.",
"type": "weighted_checklist",
"checklist": [
{
"name": "Identifies at least two uncovered decision branches",
"description": "A written coverage analysis exists (as a file in the working directory; the task lets the agent pick the filename) and it names at least two concrete branches the existing two scenarios don't exercise (any combination of: failure paths, refusals, alternative inputs, edge conditions). Weighted low because identifying gaps from a functional skill description is basic QA reasoning and is not tile-specific",
"max_score": 10
},
{
"name": "Writes new scenario directories",
"description": "At least two new directories appear under `evals/`, each containing both a `task.md` and a `criteria.json`. Scenario names must not recreate the two existing happy-path scenarios",
"max_score": 10
},
{
"name": "Criteria files follow the weighted_checklist format prescribed by the tile",
"description": "Each new `criteria.json` is a valid JSON object with a `context` string, `type: \"weighted_checklist\"`, and a `checklist` array where every entry has `name`, `description`, and `max_score`. Missing any of these fields drops the score proportionally. Tile-specific — the format is prescribed by plugin-evals guidance, not guessable from general practice",
"max_score": 15
},
{
"name": "Criteria weights sum to 100 and are not equally distributed",
"description": "Each new scenario's weights sum to exactly 100 AND are not all identical. This tile's convention is that weights sum to 100 and reflect importance, not a flat split — uniform weights or sums other than 100 score zero",
"max_score": 5
},
{
"name": "New task.md files pass the no-bleeding check",
"description": "No criterion's expected literal value appears verbatim in its own scenario's task description. For each criterion with a concrete expected value (specific strings, flags, API names), grep the task text — a match is bleeding, per this tile's plugin-evals rule",
"max_score": 15
},
{
"name": "New criteria don't leak tile internals",
"description": "Criteria describe observable behaviour, not tile-internal file paths, action names, or invented string conventions. Referencing public tool/API surfaces is fine; referencing `.tessl/tiles/...` paths, internal routing, or prescribed reply templates is leaking",
"max_score": 10
},
{
"name": "Failure descriptions are specific",
"description": "Every new criterion's `description` explains what went wrong on failure in a way a grader can act on — not just `mismatch`, `fails`, or a restatement of the criterion name. Generic descriptions score zero on this axis",
"max_score": 10
},
{
"name": "At least one new scenario exercises a negative case",
"description": "At least one new scenario tests refusal-to-proceed, produce-silence, or a failure-path recovery — not another happy-path variation. This tile's plugin-evals rule requires both positive and negative cases, and the grading is on whether the author knows it",
"max_score": 10
},
{
"name": "Coverage analysis justifies each gap",
"description": "The written coverage analysis doesn't only list gaps — for each one it explains why the missing case matters (e.g., `without this, a production deploy without approval would pass evals despite being a policy violation`). Each gap without justification docks this criterion proportionally",
"max_score": 15
}
]
}
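
Several of the checklist properties above are mechanically checkable: the weighted_checklist shape (a `context` string, the exact `type` value, and `name`/`description`/`max_score` on every entry) and the weight convention (sum to exactly 100, not a flat split). Below is a minimal validator sketch in Python; it is not part of the tile. The field names come from the format described above, while the script name and CLI shape are assumptions.

#!/usr/bin/env python3
"""check_criteria.py -- hypothetical validator for this tile's criteria files."""
import json
import sys
from pathlib import Path

REQUIRED_ENTRY_FIELDS = ("name", "description", "max_score")

def validate_criteria(path: Path) -> list[str]:
    """Return human-readable problems; an empty list means the file passes."""
    problems: list[str] = []
    data = json.loads(path.read_text())

    # Top-level shape: context string, weighted_checklist type, checklist array.
    if not isinstance(data.get("context"), str):
        problems.append("missing or non-string 'context'")
    if data.get("type") != "weighted_checklist":
        problems.append("'type' must be exactly 'weighted_checklist'")
    checklist = data.get("checklist")
    if not isinstance(checklist, list) or not checklist:
        return problems + ["'checklist' must be a non-empty array"]

    # Every entry needs name, description, and max_score.
    for i, entry in enumerate(checklist):
        if not isinstance(entry, dict):
            problems.append(f"checklist[{i}] is not an object")
            continue
        for field in REQUIRED_ENTRY_FIELDS:
            if field not in entry:
                problems.append(f"checklist[{i}] is missing '{field}'")

    # Weights must sum to exactly 100 and must not be a flat split.
    weights = [e.get("max_score", 0) for e in checklist if isinstance(e, dict)]
    if sum(weights) != 100:
        problems.append(f"weights sum to {sum(weights)}, not 100")
    if len(set(weights)) == 1:
        problems.append("weights are uniform; the convention is that they reflect importance")
    return problems

if __name__ == "__main__":
    for arg in sys.argv[1:]:
        for problem in validate_criteria(Path(arg)):
            print(f"{arg}: {problem}")

Run as `python check_criteria.py evals/*/criteria.json`. The criteria object on this page passes: its weights are 10, 10, 15, 5, 15, 10, 10, 10, 15, which sum to 100 and are not uniform.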
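The no-bleeding rule is also grep-able, with the caveat that "concrete expected value" is a judgment call. A heuristic sketch follows, assuming that backtick-quoted tokens in a criterion's description are its expected literals; the helper name and the scenario-directory layout (`task.md` plus `criteria.json`, as required above) are otherwise invented for illustration.

#!/usr/bin/env python3
"""check_bleeding.py -- hypothetical no-bleeding scan for one scenario directory."""
import json
import re
import sys
from pathlib import Path

# Heuristic: treat backtick-quoted tokens as a criterion's concrete expected values.
BACKTICKED = re.compile(r"`([^`]+)`")

def bleeding_hits(scenario_dir: Path) -> list[tuple[str, str]]:
    """Return (criterion name, leaked literal) pairs found verbatim in task.md."""
    task_text = (scenario_dir / "task.md").read_text()
    criteria = json.loads((scenario_dir / "criteria.json").read_text())
    hits = []
    for entry in criteria["checklist"]:
        for literal in BACKTICKED.findall(entry["description"]):
            if literal in task_text:
                hits.append((entry["name"], literal))
    return hits

if __name__ == "__main__":
    for arg in sys.argv[1:]:
        for name, literal in bleeding_hits(Path(arg)):
            print(f"{arg}: criterion '{name}' bleeds `{literal}` into task.md")

Expect false positives: benign file-name mentions such as `task.md` itself will match, so treat hits as candidates for review rather than automatic failures.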
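The leaking rule is only partly mechanical. The sketch below catches the one unambiguous symptom named above (tile-internal `.tessl/tiles/...` paths appearing in criterion text); internal action names and prescribed reply templates still need a human reviewer.

#!/usr/bin/env python3
"""check_leaks.py -- hypothetical scan for tile-internal paths in criteria text."""
import json
import sys
from pathlib import Path

if __name__ == "__main__":
    for arg in sys.argv[1:]:
        criteria = json.loads(Path(arg).read_text())
        for entry in criteria["checklist"]:
            if ".tessl/tiles/" in entry["description"]:
                print(f"{arg}: criterion '{entry['name']}' references a tile-internal path")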