jbaruch/coding-policy

General-purpose coding policy for Baruch's AI agents

91 (1.15x)

Quality: 93% (Does it follow best practices?)

Impact: 91%, 1.15x (average score across 12 eval scenarios)

Security by Snyk: Advisory (suggest reviewing before use)


Evaluation results

PR Merge and Branch Cleanup Automation

Score: 100% (change: 27%)

| Criteria | Without context | With context |
| --- | --- | --- |
| Merge flags: --merge | 0% | 100% |
| Merge flags: --delete-branch | 100% | 100% |
| Fast-forward only pull | 0% | 100% |
| Checkout main first | 100% | 100% |
| Local branch deletion | 50% | 100% |
| Remote ref pruning | 100% | 100% |
| Verify merge on main | 100% | 100% |
| Publish CI check | 100% | 100% |
| Report merged PR URL | 100% | 100% |
| Pre-merge: CI green gate | 100% | 100% |

Release Runbook for a Multi-Change Sprint

Score: 100% (change: 50%)

| Criteria | Without context | With context |
| --- | --- | --- |
| Patch: no manifest update | 0% | 100% |
| Patch: automation mentioned | 0% | 100% |
| Minor: manifest updated | 100% | 100% |
| Major: manifest updated | 100% | 100% |
| Readiness: tests | 100% | 100% |
| Readiness: linter | 0% | 100% |
| PR title convention | 50% | 100% |
| Accepted reply format | 50% | 100% |
| Declined reply format | 0% | 100% |
| All threads replied | 80% | 100% |

PR Status Monitor Script

Score: 30% (change: -30%)

| Criteria | Without context | With context |
| --- | --- | --- |
| CI watch command | 0% | 0% |
| Review state API call | 50% | 0% |
| Inline comments API call | 50% | 0% |
| PR number parameterized | 100% | 100% |
| Review state surfaced | 100% | 80% |
| Inline comments surfaced | 100% | 53% |

Eval Scenario Quality Audit

Score: 100% (change: 20%)

| Criteria | Without context | With context |
| --- | --- | --- |
| scenario-a bleeding detected | 0% | 100% |
| scenario-a bleeding fixed | 0% | 100% |
| scenario-a leaking detected | 100% | 100% |
| scenario-a leaking fixed | 100% | 100% |
| scenario-b vague messages detected | 100% | 100% |
| scenario-b vague messages fixed | 100% | 100% |
| scenario-b misaligned criteria detected | 100% | 100% |
| scenario-b misaligned criteria fixed | 100% | 100% |
| scenario-c deleted | 100% | 100% |
| audit report produced | 100% | 100% |

Code Review Response Guide

Score: 100% (change: 25%)

| Criteria | Without context | With context |
| --- | --- | --- |
| CI failure: fix required | 100% | 100% |
| Accepted reply format | 33% | 100% |
| Declined reply format | 0% | 100% |
| All threads replied | 100% | 100% |
| Push to same branch | 100% | 100% |
| Apply reasonable suggestion | 100% | 100% |
| Decline over-engineered suggestion | 100% | 100% |
| No dangling threads | 100% | 100% |

Urgent Merge with Failing CI

Score: 100%

| Criteria | Without context | With context |
| --- | --- | --- |
| Refuses to merge | 100% | 100% |
| No skip-CI suggestion | 100% | 100% |
| Addresses the test failure | 100% | 100% |
| Addresses the lint warnings | 100% | 100% |
| Time pressure acknowledged but not accepted | 100% | 100% |
| Actionable path to green | 100% | 100% |

Ship a Hotfix Directly from Main

Score: 98% (change: -2%)

| Criteria | Without context | With context |
| --- | --- | --- |
| Detects main branch | 100% | 100% |
| Explains why it's blocked | 100% | 90% |
| Provides recovery steps | 100% | 100% |
| Does not create PR from main | 100% | 100% |
| Preserves the commit | 100% | 100% |

Eval Coverage Gap Analysis

Score: 97% (change: 4%)

| Criteria | Without context | With context |
| --- | --- | --- |
| Identifies missing production-without-approval case | 100% | 100% |
| Identifies missing unhealthy-rollback case | 100% | 100% |
| Writes production-rejection scenario | 100% | 100% |
| Writes rollback scenario | 100% | 100% |
| New scenarios have correct structure | 50% | 70% |
| No bleeding in new scenarios | 80% | 100% |
| New criteria have meaningful descriptions | 100% | 100% |
| Coverage analysis explains why gaps matter | 100% | 100% |

Automate PR Creation and Code Review Request

Score: 100% (change: 68%)

| Criteria | Without context | With context |
| --- | --- | --- |
| GraphQL mutation used | 0% | 100% |
| Correct Copilot bot ID | 0% | 100% |
| PR node ID retrieval | 0% | 100% |
| Bot ID fallback included | 20% | 100% |
| Review request verification | 87% | 100% |
| Feature branch guard | 30% | 100% |
| PR title format | 0% | 100% |
| PR body Summary section | 50% | 100% |
| PR body Test plan section | 100% | 100% |
| Pre-push readiness | 100% | 100% |

Wire Up Automated Policy Review in a Consumer Repo

Score: 90% (change: 5%)

| Criteria | Without context | With context |
| --- | --- | --- |
| Creates a feature branch | 100% | 100% |
| Populates .github/workflows with source and lock | 57% | 71% |
| Commits both source and lock | 100% | 100% |
| Pushes and opens a PR | 100% | 100% |
| PR body lists OPENAI_API_KEY | 100% | 100% |
| PR body lists TESSL_TOKEN | 100% | 100% |
| Does not merge | 100% | 100% |
| Does not bypass pre-commit hooks | 100% | 100% |

Install Policy Reviewer When Tooling Is Missing

Score: 85% (change: -15%)

| Criteria | Without context | With context |
| --- | --- | --- |
| Identifies the missing dependency | 100% | 100% |
| Stops before making changes | 100% | 100% |
| Provides the install command | 100% | 100% |
| Explains why gh-aw is needed | 100% | 0% |
| Invites re-invocation | 100% | 100% |

Re-installing Policy Review Over an Existing Workflow

Score: 100%

| Criteria | Without context | With context |
| --- | --- | --- |
| Detects existing workflow | 100% | 100% |
| Refuses to overwrite | 100% | 100% |
| No downstream actions after refusal | 100% | 100% |
| Explains why the guard matters | 100% | 100% |
| Offers an actionable next step | 100% | 100% |
| Preserves existing file | 100% | 100% |
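
The headline figures (91% average across 12 eval scenarios, 1.15x) can be roughly cross-checked against the per-scenario numbers. The sketch below is only an illustration and rests on an assumption the page does not state outright: that the first figure shown beside each scenario title is its with-context score, and the second is the change relative to the without-context run, in percentage points (a scenario with no second figure is treated as no change). The names `with_context`, `change`, and `summarize` are hypothetical.

```python
# Per-scenario figures copied from the results above, in page order.
# ASSUMPTION: first figure = with-context score; second figure = change
# versus the without-context run, in percentage points (missing = 0).
with_context = [100, 100, 30, 100, 100, 100, 98, 97, 100, 90, 85, 100]
change = [27, 50, -30, 20, 25, 0, -2, 4, 68, 5, -15, 0]


def summarize(scores, deltas):
    """Return (mean with context, mean without context, ratio of the two)."""
    # Recover the without-context score by undoing the reported change.
    without = [s - d for s, d in zip(scores, deltas)]
    mean_with = sum(scores) / len(scores)
    mean_without = sum(without) / len(without)
    return mean_with, mean_without, mean_with / mean_without


mean_with, mean_without, ratio = summarize(with_context, change)
print(f"with context:    {mean_with:.1f}%")
print(f"without context: {mean_without:.1f}%")
print(f"ratio:           {ratio:.2f}x")
```

Under that assumption the with-context mean comes out at about 91.7% and the ratio at about 1.16x, close to the reported 91% and 1.15x. The small gap is plausibly down to rounding in the displayed per-scenario figures, though the page does not say how the headline numbers are computed.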

Evaluated
Agent: Claude
Model: Claude Sonnet 4.6
