General-purpose coding policy for Baruch's AI agents
Score: 90
Best practices: 91%
Impact: 90% (1.30x), average score across 18 eval scenarios
Advisory: suggest reviewing before use
{
"context": "Tests whether the agent applies this tile's prescribed merge-and-cleanup sequence — not whether it can reason its way to some working equivalent. The tile teaches a specific combination of flags and commands for consistency across a team (merge-commit strategy, fast-forward-only pull, safe branch delete, remote pruning), plus a multi-gate post-merge contract: capture a registry baseline before merge, resolve THIS publish run by merge-SHA + push-event filter, watch the run to terminal state, then confirm the conjunction of run-success AND registry-version-advance. Baseline agents typically pick different merge defaults (squash, plain pull, force delete) and implement a weaker fire-and-forget post-merge check (`gh run list`, 'workflow was triggered'). Each tile-specific criterion measures whether the tile was applied.",
"type": "weighted_checklist",
"checklist": [
{
"name": "Merge strategy uses `gh pr merge --merge`",
"description": "The script calls `gh pr merge ... --merge` (the tile prescribes merge-commit strategy for this workflow). `--squash` or `--rebase` score zero — those are different strategies the tile does not prescribe here",
"max_score": 8
},
{
"name": "Merge includes `--delete-branch`",
"description": "The `gh pr merge` invocation passes `--delete-branch` so the remote feature branch is deleted on merge. A separate `git push origin --delete` AFTER the merge is acceptable but less clean than the tile's prescribed one-flag approach",
"max_score": 4
},
{
"name": "Fast-forward-only pull after merge",
"description": "Uses `git pull --ff-only` (not plain `git pull`, which can create a spurious merge commit on local main). The `--ff-only` prescription is tile-specific — baseline agents default to plain `git pull`",
"max_score": 8
},
{
"name": "Safe local-branch delete with `git branch -d`",
"description": "Deletes the local feature branch with `git branch -d` (safe, refuses to delete un-merged work). `git branch -D` (force delete) scores zero — the tile explicitly prefers the safe form",
"max_score": 5
},
{
"name": "Stale remote-tracking refs pruned",
"description": "Runs `git remote prune origin` (or `git fetch --prune`) so `origin/*` refs pointing at deleted remote branches are cleaned up. Skipping prune scores zero — the tile prescribes this as part of cleanup",
"max_score": 4
},
{
"name": "Pre-merge CI gate",
"description": "The script refuses to merge when CI is not green — exits non-zero with a diagnostic; does NOT proceed to `gh pr merge` on pending or failing CI",
"max_score": 10
},
{
"name": "Pre-merge review gate",
"description": "The script refuses to merge when a blocking review is outstanding or any review thread is unresolved. At minimum, checks that no review has state `CHANGES_REQUESTED`",
"max_score": 6
},
{
"name": "Verifies merge landed on main",
"description": "Explicit post-merge check that the PR's commits are present on main (`gh pr view`, `git log origin/main`, or equivalent). A silent assumption that `gh pr merge` succeeded scores zero",
"max_score": 5
},
{
"name": "Pre-merge registry baseline captured",
"description": "Captures the registry's current Latest Version BEFORE invoking `gh pr merge` (e.g., `PRE=$(tessl tile info <workspace>/<tile> | grep 'Latest Version' | awk '{print $NF}')`) so the post-merge check has a baseline to compare against. Scripts that skip the baseline and only inspect post-merge registry state score zero — without the baseline, the agent can't distinguish 'this release published' from 'a prior release already shipped this version'",
"max_score": 8
},
{
"name": "SHA-bound publish-run resolution",
"description": "Resolves THE publish run for this merge by merge-commit SHA AND `push` event filter: derives the SHA from `gh pr view <N> --json mergeCommit --jq '.mergeCommit.oid'`, then selects the run whose `.headSha == \"<merge-sha>\"` and `.event == \"push\"` (both are jq string literals; a bare `push` would be parsed as a call to an undefined function). Scripts using 'latest on main', `gh run list --limit 1`, branch-name-only filters, or selecting by workflow name alone score zero: those are race-prone selectors that may pick a parallel-merge run or a manual `workflow_dispatch`, and are exactly the heuristics the tile forbids",
"max_score": 10
},
{
"name": "Watch publish run to terminal state",
"description": "Watches the resolved publish run to terminal state via `gh run watch <id>` against the specific run id. A `gh run list` 'workflow exists' check, a single `gh run view` snapshot, or any check that confirms only that the workflow was triggered scores zero — the tile requires watching the run through to completion, not a fire-and-forget trigger confirmation",
"max_score": 8
},
{
"name": "Conjunction check: run-success AND registry-advanced",
"description": "Gates final success on BOTH conjuncts: the resolved run's `conclusion` is `success` AND the registry's Latest Version is strictly greater than the captured pre-merge baseline. Any of the following scores zero: (a) accepting run-success alone as proof the release landed, (b) accepting registry-advance alone without checking the run's conclusion, (c) deriving an expected version from the merge SHA's manifest and comparing against it, or (d) encoding 'moderation hold' / 'moderation queue' / 'moderation rejection' as expected intermediate states or retry conditions. The registry never rejects a published version, so a version that is missing after run success means CI failed, full stop",
"max_score": 15
},
{
"name": "Final summary includes merged PR URL",
"description": "The script prints a final summary naming the merged PR URL and the publish-landed outcome (run conclusion + registry version), so the developer doesn't have to re-check manually",
"max_score": 4
},
{
"name": "Graceful failure on unmet preconditions",
"description": "When CI is red, reviews are outstanding, the publish run fails, or the registry doesn't advance, the script exits non-zero with a stderr diagnostic naming the specific unmet condition — does NOT silently skip or mask the failure",
"max_score": 3
},
{
"name": "No hardcoded PR, owner, or repo in the script body",
"description": "OBSERVABLE: the script body contains no literal for the target PR number, owner, or repo — all three come from runtime inputs (args or env vars). Running against a new target requires changing only the invocation, not the script source. Pinning any of them in the script source scores zero",
"max_score": 2
}
]
}
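
The two strictest criteria above (SHA-bound publish-run resolution and the run-success AND registry-advanced conjunction) come down to a small amount of decision logic. The following is a minimal Python sketch of that logic only, not the tile's own implementation. It assumes the run records were already fetched with something like `gh run list --json databaseId,headSha,event,conclusion` and already limited to the publish workflow, and that registry versions are dotted numeric strings. The function names `parse_version`, `select_publish_run`, and `release_landed` are hypothetical.

```python
from typing import Optional


def parse_version(text: str) -> tuple[int, ...]:
    """Parse a dotted numeric version such as '1.4.2' into a comparable tuple.

    Assumption: versions are purely numeric (no pre-release suffixes).
    """
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


def select_publish_run(runs: list[dict], merge_sha: str) -> Optional[dict]:
    """Return the run bound to this merge, or None if there is no such run.

    A run qualifies only if its headSha equals the merge-commit SHA AND its
    event is the string "push". Position in the list is never used, so a
    parallel merge or a manual workflow_dispatch cannot be picked by accident.
    """
    matches = [
        run
        for run in runs
        if run.get("headSha") == merge_sha and run.get("event") == "push"
    ]
    return matches[0] if matches else None


def release_landed(run: Optional[dict], baseline: str, current: str) -> bool:
    """Conjunction gate: the run succeeded AND the registry strictly advanced."""
    if run is None or run.get("conclusion") != "success":
        return False
    return parse_version(current) > parse_version(baseline)


if __name__ == "__main__":
    # Hypothetical sample data shaped like `gh run list --json ...` output.
    sample_runs = [
        {"databaseId": 101, "headSha": "aaa111", "event": "workflow_dispatch", "conclusion": "success"},
        {"databaseId": 102, "headSha": "bbb222", "event": "push", "conclusion": "success"},
        {"databaseId": 103, "headSha": "aaa111", "event": "push", "conclusion": "success"},
    ]
    run = select_publish_run(sample_runs, "aaa111")
    print(run["databaseId"] if run else None)
    print(release_landed(run, baseline="1.4.2", current="1.4.3"))
    print(release_landed(run, baseline="1.4.2", current="1.4.2"))
```

Comparing parsed tuples rather than raw strings keeps `1.10.0` greater than `1.9.0`. A real script would still need to capture the baseline before the merge and watch the selected run with `gh run watch <id>` before calling the gate.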