jbaruch/coding-policy

General-purpose coding policy for Baruch's AI agents

95

1.31x
Quality: 91% (Does it follow best practices?)

Impact: 96%, 1.31x (average score across 10 eval scenarios)

Security by Snyk: Advisory (suggest reviewing before use)

Evaluation results

100%

8%

Wire Up Automated Policy Review in a Consumer Repo

| Criteria | Without context | With context |
| --- | --- | --- |
| Creates a feature branch | 100% | 100% |
| Populates .github/workflows with both source + lock pairs | 73% | 100% |
| Commits both sources and both locks | 100% | 100% |
| Pushes and opens a PR | 100% | 100% |
| PR body lists OPENAI_API_KEY | 100% | 100% |
| PR body lists ANTHROPIC_API_KEY | 100% | 100% |
| PR body lists TESSL_TOKEN | 100% | 100% |
| Does not merge | 100% | 100% |
| Does not bypass pre-commit hooks | 100% | 100% |
| Explains the cross-family reviewer rationale | 100% | 100% |

100%

35%

Install Policy Reviewer When Tooling Is Missing

| Criteria | Without context | With context |
| --- | --- | --- |
| Identifies the missing dependency | 32% | 100% |
| Stops before making changes | 100% | 100% |
| Provides the install command | 40% | 100% |
| Explains why gh-aw is needed | 80% | 100% |
| Invites re-invocation | 100% | 100% |

100%

Re-installing Policy Review Over an Existing Workflow

| Criteria | Without context | With context |
| --- | --- | --- |
| Detects existing workflow | 100% | 100% |
| Refuses to overwrite | 100% | 100% |
| No downstream actions after refusal | 100% | 100% |
| Explains why the guard matters | 100% | 100% |
| Offers an actionable next step | 100% | 100% |
| Preserves existing file | 100% | 100% |

100%

5%

Eval Coverage Gap Analysis

| Criteria | Without context | With context |
| --- | --- | --- |
| Identifies at least two uncovered decision branches | 100% | 100% |
| Writes new scenario directories | 100% | 100% |
| Criteria files follow the weighted_checklist format prescribed by the tile | 100% | 100% |
| Criteria weights sum to 100 and are not equally distributed | 100% | 100% |
| New task.md files pass the no-bleeding check | 66% | 100% |
| New criteria don't leak tile internals | 100% | 100% |
| Failure descriptions are specific | 100% | 100% |
| At least one new scenario exercises a negative case | 100% | 100% |
| Coverage analysis justifies each gap | 100% | 100% |

100%

20%

Eval Scenario Quality Audit

| Criteria | Without context | With context |
| --- | --- | --- |
| scenario-a bleeding detected | 0% | 100% |
| scenario-a bleeding fixed | 0% | 100% |
| scenario-a leaking detected | 100% | 100% |
| scenario-a leaking fixed | 100% | 100% |
| scenario-b vague messages detected | 100% | 100% |
| scenario-b vague messages fixed | 100% | 100% |
| scenario-b misaligned criteria detected | 100% | 100% |
| scenario-b misaligned criteria fixed | 100% | 100% |
| scenario-c deleted | 100% | 100% |
| Audit report produced | 100% | 100% |

100%

15%

PR Status Monitor Script

| Criteria | Without context | With context |
| --- | --- | --- |
| Uses `gh pr checks` with structured output | 0% | 100% |
| Uses `gh api .../pulls/<N>/reviews` for review state | 100% | 100% |
| Uses `gh api .../pulls/<N>/comments` for inline comments | 100% | 100% |
| Does NOT use `/issues/<N>/comments` | 100% | 100% |
| Retrieves per-reviewer state distinctly | 100% | 100% |
| No hardcoded PR, owner, or repo in the script body | 100% | 100% |
| Waits for CI to finish before surfacing state | 100% | 100% |
| Surfaces CI state in the summary | 100% | 100% |
| Surfaces review states in the summary | 100% | 100% |
| Surfaces inline comment content or count | 100% | 100% |

98%

82%

Automate PR Creation and Code Review Request

| Criteria | Without context | With context |
| --- | --- | --- |
| Uses GraphQL `requestReviews` mutation | 0% | 100% |
| Inline comment explains why REST doesn't work | 0% | 100% |
| Pinned bot ID with fallback to dynamic discovery | 0% | 100% |
| Resolves the PR's GraphQL node ID | 0% | 100% |
| Verifies the review request was registered | 100% | 100% |
| Feature-branch guard | 0% | 100% |
| PR title follows conventional-commits format | 0% | 100% |
| PR body structure | 0% | 100% |
| Pre-push readiness checks | 0% | 33% |
| No hardcoded inputs in the script body | 100% | 100% |

95%

40%

PR Merge and Branch Cleanup Automation

| Criteria | Without context | With context |
| --- | --- | --- |
| Merge strategy uses `gh pr merge --merge` | 0% | 100% |
| Merge includes `--delete-branch` | 100% | 100% |
| Fast-forward-only pull after merge | 100% | 100% |
| Safe local-branch delete with `git branch -d` | 25% | 100% |
| Stale remote-tracking refs pruned | 100% | 100% |
| Pre-merge CI gate | 20% | 100% |
| Pre-merge review gate | 0% | 100% |
| Verifies merge landed on main | 75% | 62% |
| Publish CI verification | 80% | 100% |
| Final summary includes merged PR URL | 100% | 71% |
| Graceful failure on unmet preconditions | 40% | 100% |
| No hardcoded PR, owner, or repo in the script body | 100% | 100% |

73%

3%

Code Review Response Guide

| Criteria | Without context | With context |
| --- | --- | --- |
| CI failure: fix required | 100% | 100% |
| Applies the reasonable suggestion | 100% | 100% |
| Declines the over-engineered suggestion | 100% | 100% |
| All three threads get replies | 100% | 100% |
| Accept reply uses the `Fixed in <sha>` format | 0% | 0% |
| Decline reply uses the `Declining — <reason>` format | 0% | 46% |
| Decline reply cites a verifiable reference | 100% | 100% |
| Fixes pushed to the same branch | 100% | 100% |
| No dangling threads before merge | 100% | 60% |

100%

27%

Release Runbook for a Multi-Change Sprint

| Criteria | Without context | With context |
| --- | --- | --- |
| Patch: no manual manifest update | 0% | 100% |
| Patch: explains CI auto-bump | 0% | 100% |
| Minor: manifest bumped to `1.5.0` | 100% | 100% |
| Major: manifest bumped to `2.0.0` | 100% | 100% |
| Major: flags breaking-change impact for downstream | 100% | 100% |
| Release sequencing: patch first, major last | 100% | 100% |
| Readiness gate: tests + linter | 50% | 100% |
| Runbook covers all three changes separately | 100% | 100% |

Evaluated
Agent: Claude
Model: Claude Sonnet 4.6
