
uinaf/agent-readiness

Audit and build the infrastructure a repo needs so agents can work autonomously — boot scripts, smoke tests, CI/CD gates, dev environment setup, observability, and isolation. Use when a repo can't boot, tests are broken or missing, there's no dev environment, agents can't verify their work, or agents need human help to get anything done. Do not use for reviewing an existing diff or for documentation-only cleanup.


Industry Examples

Real-world patterns from teams running agents at scale.


OpenAI — Codex Frontend

3 engineers, ~1,500 PRs, ~1M LOC, 3.5 PRs/engineer/day. Zero lines of manually-written code over 5 months.

Infrastructure:

  • Per-worktree bootable app — each change gets its own running instance
  • CDP wired into agent runtime — DOM snapshots, screenshots, navigation
  • Ephemeral observability per worktree — logs, metrics, traces torn down after task
  • Custom linters with error messages that inject remediation into agent context

AGENTS.md: ~100 lines, table of contents pointing to docs/. "We tried the big AGENTS.md. It failed."

Architecture: rigid layers mechanically enforced (Types → Config → Repo → Service → Runtime → UI). "Enforce boundaries centrally, allow autonomy locally."
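A minimal sketch of how a layer check might emit remediation-style messages for an agent to read. It assumes the arrows mean "later layers may depend on earlier ones"; the `check_layering` function and the import-map format are hypothetical and not taken from the source.

```python
# Layer order comes from the text above; everything else is illustrative.
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]
RANK = {name: index for index, name in enumerate(LAYERS)}


def check_layering(imports: dict[str, list[str]]) -> list[str]:
    """Return one remediation-style message per forbidden import.

    `imports` maps a layer to the layers it imports from (assumed format).
    A layer may depend only on itself or on layers earlier in LAYERS.
    """
    errors = []
    for layer, deps in imports.items():
        allowed = ", ".join(LAYERS[: RANK[layer] + 1])
        for dep in deps:
            if RANK[dep] > RANK[layer]:
                errors.append(
                    f"{layer} imports {dep}: {layer} may only depend on {allowed}. "
                    f"Fix: move the shared code into {layer} or an earlier layer, "
                    f"or invert the dependency behind an interface."
                )
    return errors


violations = check_layering({"repo": ["types", "service"], "ui": ["service"]})
for message in violations:
    print(message)
```

The message carries both the rule and a suggested fix, so the agent does not need a separate lookup to act on the failure.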

Review: agent-to-agent (Ralph Wiggum Loop). Humans may review but aren't required.

Slop management: "golden principles" + background Codex tasks scan for deviations, open refactoring PRs. Technical debt as high-interest loan — pay down continuously.

Source: https://openai.com/index/harness-engineering/

Anthropic — Evaluator Pattern

GAN-inspired three-agent pattern: Planner → Generator → Evaluator.

How the evaluator works:

  1. Boots the app
  2. Navigates key flows with Playwright MCP
  3. Takes screenshots
  4. Grades against predefined criteria (design quality, originality, craft, functionality)
  5. Returns pass/fail with evidence and specific feedback
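The grading and reporting steps above could be sketched roughly as follows. The criteria names come from the text; the 0-10 scale, the threshold of 7, and the `grade` and `Verdict` names are assumptions for illustration. Booting the app and capturing screenshots are left out.

```python
from dataclasses import dataclass

# Criteria named in the text; the scoring scale is an assumption.
CRITERIA = ("design quality", "originality", "craft", "functionality")


@dataclass
class Verdict:
    passed: bool
    feedback: list[str]


def grade(scores: dict[str, int], evidence: dict[str, str], threshold: int = 7) -> Verdict:
    """Fail the build if any criterion scores below the threshold."""
    feedback = []
    for criterion in CRITERIA:
        score = scores.get(criterion, 0)  # a missing score counts as a failure
        if score < threshold:
            proof = evidence.get(criterion, "no evidence captured")
            feedback.append(
                f"{criterion}: {score}/10 (below {threshold}); evidence: {proof}"
            )
    return Verdict(passed=not feedback, feedback=feedback)


verdict = grade(
    scores={"design quality": 8, "originality": 5, "craft": 7, "functionality": 9},
    evidence={"originality": "screenshots/home.png"},
)
print(verdict.passed)
for line in verdict.feedback:
    print(line)
```

Keeping the pass/fail rule in plain code rather than in the prompt is one way to stop an evaluator from talking itself into approving.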

Sprint contracts: before each sprint, generator and evaluator negotiate testable "done" criteria.

Key findings:

  • LLMs are terrible self-evaluators — confidently praise mediocre work
  • Out of the box, Claude identifies issues then talks itself into approving
  • Multiple tuning rounds needed: read evaluator logs → find judgment divergences → update QA prompt
  • Opus 4.6 allowed removing sprint construct entirely (model handles coherence natively)

Communication: agents coordinate via files, not message passing.

Cost: solo $9/20min → full 3-agent setup $200/6hr. Simplified (Opus 4.6): $125/4hr.

Source: https://www.anthropic.com/engineering/harness-design-long-running-apps

Stripe — Minions

1,300+ PRs/week. All human-reviewed, zero human-written code. Manages >$1 trillion annual payment volume.

Devboxes: cloud machines pre-loaded with codebase, 10-second spin-up via warm pool. QA-isolated. Built for humans years before agents — agents just slotted in.

Blueprints: hybrid orchestration mixing deterministic nodes (lint, push, PR template) with agentic nodes (implement, fix CI). "Putting LLMs into contained boxes compounds into reliability."

Scoped rules: global rules "very judiciously." Almost all rules scoped to subdirectories/file patterns, auto-attached as agent navigates. Same rules for Minions, Cursor, Claude Code.
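A rough sketch of attaching rules by file pattern as an agent opens files. The rule table, its patterns, and the `rules_for` helper are invented for illustration; the source does not describe Stripe's actual rule format.

```python
from fnmatch import fnmatchcase

# Hypothetical rule table: glob pattern -> rule text.
# One global rule, the rest scoped to directories or file types.
SCOPED_RULES = {
    "*": "Keep diffs small and focused.",
    "payments/*": "Never log card numbers.",
    "*.sql": "Every migration needs a rollback.",
}


def rules_for(path: str) -> list[str]:
    """Return the rules whose pattern matches the file being touched."""
    return [rule for pattern, rule in SCOPED_RULES.items() if fnmatchcase(path, pattern)]


print(rules_for("payments/api/charge.py"))
print(rules_for("docs/readme.md"))
```

Because the lookup depends only on the path, the same table can serve any agent or editor that reports which file it is working on.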

Feedback: pre-push lint < 5 seconds (background daemon precomputes), selective CI from 3M+ tests with autofixes, max 2 CI rounds.
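A sketch of a blueprint that interleaves deterministic and agentic nodes and caps CI at two rounds. The node callables, the status strings, and the decision to escalate to a human after the cap are assumptions; the source states only the cap itself.

```python
from typing import Callable

MAX_CI_ROUNDS = 2  # from the text: at most two CI rounds


def run_blueprint(
    implement: Callable[[], None],  # agentic node
    lint: Callable[[], bool],       # deterministic node
    run_ci: Callable[[], bool],     # deterministic node
    fix_ci: Callable[[], None],     # agentic node
) -> str:
    """Run the fixed sequence; the agent acts only inside its own nodes."""
    implement()
    if not lint():
        return "lint-failed"
    for round_number in range(1, MAX_CI_ROUNDS + 1):
        if run_ci():
            return "ready-for-review"
        if round_number < MAX_CI_ROUNDS:
            fix_ci()  # one repair attempt between the two CI rounds
    return "escalate-to-human"  # assumed behaviour once the cap is hit


ci_results = iter([False, True])  # first round fails, second passes
status = run_blueprint(
    implement=lambda: None,
    lint=lambda: True,
    run_ci=lambda: next(ci_results),
    fix_ci=lambda: None,
)
print(status)
```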

Toolshed: centralized MCP server with ~500 tools. Agents get curated subsets, not kitchen sink.

Key insight: "Investments in human developer productivity over time have returned to pay dividends in the world of agents."

Source: https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-coding-agents

Datadog — Observability-Driven Verification

The most rigorous verification approach among the examples here. Infrastructure-first: invest in automated checks, not code review.

Verification pyramid:

| Layer | Tool | Time |
| --- | --- | --- |
| Symbolic | TLA+ specs | 2 min read |
| Primary | Deterministic Simulation Testing (DST) | ~5 seconds |
| Exhaustive | Model checking (Stateright) | 30-60 seconds |
| Bounded | Bounded verification (Kani) | ~60 seconds |
| Empirical | Telemetry + benchmarks | seconds-minutes |

DST as workhorse: each run ~5 seconds, exercises production code through randomized scenarios. Target: 500 seeds per component, 10 million across all components. Caught bugs impossible to find in code review.
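To illustrate the idea only, here is a toy deterministic simulation test. The `Ledger` class stands in for production code and is not from the source. Each seed fixes a schedule of duplicated and reordered deliveries, and an invariant is checked at the end, so any failing seed can be replayed exactly.

```python
import random


class Ledger:
    """Toy system under test: applies each transfer id at most once."""

    def __init__(self) -> None:
        self.balance = 0
        self.seen: set[int] = set()

    def apply(self, transfer_id: int, amount: int) -> None:
        if transfer_id in self.seen:  # idempotency guard
            return
        self.seen.add(transfer_id)
        self.balance += amount


def simulate(seed: int, transfers: int = 20) -> int:
    """Run one randomized scenario; the same seed gives the same schedule."""
    rng = random.Random(seed)
    messages = [(i, rng.randint(1, 100)) for i in range(transfers)]
    expected = sum(amount for _, amount in messages)
    # Fault injection: duplicate some deliveries, then shuffle the order.
    schedule = messages + [m for m in messages if rng.random() < 0.3]
    rng.shuffle(schedule)
    ledger = Ledger()
    for transfer_id, amount in schedule:
        ledger.apply(transfer_id, amount)
    # Invariant: duplicates and reordering must not change the final balance.
    assert ledger.balance == expected, f"seed {seed} broke the invariant"
    return ledger.balance


for seed in range(500):  # 500 seeds per component, as in the text
    simulate(seed)
```

Removing the idempotency guard makes some seeds fail, and the failing seed number is enough to reproduce the exact scenario.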

Contracts before code: core invariants defined upfront. Agent is NOT allowed to invent system meaning. "Semi-formal methods" — specs explicit enough to be checked, cheap enough to run continuously.

Performance as hill-climbing: with correctness locked in, agent proposes optimization → full DST → if tests pass, measure throughput → keep or revert.
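The keep-or-revert loop can be sketched as below. The `hill_climb` function, the candidate values, and the stand-in `passes_dst` and `throughput` callables are hypothetical; a real setup would run the full DST suite and a benchmark in their place.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def hill_climb(
    baseline: T,
    candidates: Iterable[T],
    passes_dst: Callable[[T], bool],
    throughput: Callable[[T], float],
) -> T:
    """Keep a candidate only if it stays correct and is measurably faster."""
    best, best_score = baseline, throughput(baseline)
    for candidate in candidates:
        if not passes_dst(candidate):
            continue  # correctness gate failed: revert
        score = throughput(candidate)
        if score > best_score:
            best, best_score = candidate, score  # keep
        # otherwise revert: no speedup over the current best
    return best


# Toy stand-ins: candidates are batch sizes, sizes above 64 break an invariant,
# and throughput grows with batch size.
chosen = hill_climb(
    baseline=8,
    candidates=[16, 128, 32, 4],
    passes_dst=lambda batch: batch <= 64,
    throughput=lambda batch: batch * 10.0,
)
print(chosen)
```

Correctness is checked before throughput is measured, so a fast but wrong candidate such as the batch size of 128 is never scored.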

Human role: "define the system idea and invariants, review and strengthen the DST harness, set measurable targets, and approve architectural changes. Everything else was the agent running against the harness."

Reviews become bloom filters — a fast gate, not the source of correctness.

Source: https://www.datadoghq.com/blog/ai/harness-first-agents/

Cursor — Self-Driving Codebases

Built a web browser with ~1,000 commits/hour across 10M tool calls. Almost zero human intervention.

Architecture evolution: single agent → shared state (locking hell) → rigid planner/executor (bottleneck) → final: root planner + recursive subplanners + isolated workers with handoffs.

Key findings:

  • 100% commit correctness caused serialization — accepting some error rate was key
  • No cross-talk between workers — convergence through ownership chain
  • Duplicate work cheaper than coordination overhead
  • Scratchpads should be rewritten, not appended to
  • Disk I/O was the bottleneck at scale, not CPU/RAM

Source: https://cursor.com/blog/self-driving-codebases

Uber — Minion + Internal AI Stack

92% of devs use agents monthly. 65-72% of code AI-generated. AI costs up 6x since 2024.

Minion (background agent platform):

  • Monorepos pre-checked-out and ready. Internal infra access via MCP + AIFX CLI
  • Optimized defaults (compiling done, tools installed, helpful AGENTS.md)
  • Prompt improvement — analyzes and rewrites prompts for higher success rate
  • Web, Slack, GitHub PR, and code review interfaces
  • 70% of workloads are toil (migrations, upgrades) — higher accuracy = virtuous cycle

uReview (AI code review):

  • Multiple specialized bots per PR. Comments graded, low-confidence filtered, merged, categorized
  • Quality > quantity — worst thing is noisy low-quality comments
  • Devs rate usefulness → feedback loop. Bot comments trending down (focus on quality)
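A sketch of the filter-and-merge step described above. The `Comment` fields, the 0.8 confidence cutoff, and the rule of keeping the highest-confidence comment per line and category are assumptions, not details from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comment:
    bot: str
    line: int
    category: str
    confidence: float  # assumed to come from a grading step, in [0, 1]
    text: str


def curate(comments: list[Comment], min_confidence: float = 0.8) -> list[Comment]:
    """Drop low-confidence comments, then merge duplicates per (line, category)."""
    kept: dict[tuple[int, str], Comment] = {}
    for comment in comments:
        if comment.confidence < min_confidence:
            continue  # filtered: noisy comments cost more than missing ones
        key = (comment.line, comment.category)
        if key not in kept or comment.confidence > kept[key].confidence:
            kept[key] = comment
    return sorted(kept.values(), key=lambda c: (c.category, c.line))


raw = [
    Comment("style-bot", 10, "style", 0.55, "Consider renaming this variable."),
    Comment("bug-bot", 42, "bug", 0.90, "Possible None dereference."),
    Comment("security-bot", 42, "bug", 0.95, "user may be None here."),
    Comment("perf-bot", 7, "performance", 0.85, "Query runs inside a loop."),
]
curated = curate(raw)
for comment in curated:
    print(comment.category, comment.line, comment.bot)
```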

Code Inbox (smart PR routing):

  • Routes based on ownership, compliance, history, timezone, calendar availability
  • Risk profiles highlight high-impact changes for extra scrutiny
  • SLOs for response times, auto-reassign, escalation

Key finding: Claude Code usage nearly doubled in 3 months (32% → 63%), while IDE tools plateaued. Engineers naturally gravitate to multi-agent workflows. Sharing wins between engineers is the most effective adoption tactic — top-down mandates had limited impact.

Source: https://newsletter.pragmaticengineer.com/p/how-uber-uses-ai-for-development

Convergent Architecture

Every team converges on: expensive models for planning, cheap for workers. Workers isolated from each other. Filesystem as coordination primitive (not message passing). Accept and correct > prevent all errors. PR as the human oversight gate.
