Audit and build the infrastructure a repo needs so agents can work autonomously — boot scripts, smoke tests, CI/CD gates, dev environment setup, observability, and isolation. Use when a repo can't boot, tests are broken or missing, there's no dev environment, agents can't verify their work, or agents need human help to get anything done. Do not use for reviewing an existing diff or for documentation-only cleanup.
Real-world patterns from teams running agents at scale.
3 engineers, ~1,500 PRs, ~1M LOC, 3.5 PRs/engineer/day. Zero lines of manually-written code over 5 months.
Infrastructure:
- AGENTS.md: ~100 lines, a table of contents pointing to docs/. "We tried the big AGENTS.md. It failed."
- Architecture: rigid layers mechanically enforced (Types → Config → Repo → Service → Runtime → UI). "Enforce boundaries centrally, allow autonomy locally."
- Review: agent-to-agent (Ralph Wiggum Loop). Humans may review but aren't required.
- Slop management: "golden principles" + background Codex tasks scan for deviations, open refactoring PRs. Technical debt as high-interest loan — pay down continuously.
Source: https://openai.com/index/harness-engineering/
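The "enforce boundaries centrally" idea can be made concrete with an import lint: each layer may only depend on layers below it. A minimal sketch, assuming a toy checker with the layer names from the text (this is not OpenAI's actual tooling):

```python
# Minimal sketch of mechanical layer enforcement: a module in one layer
# may only import from layers at or below its own level. Layer ordering
# follows the text; the checker itself is hypothetical.
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]
RANK = {name: i for i, name in enumerate(LAYERS)}

def violations(imports: dict[str, list[str]]) -> list[str]:
    """imports maps a module's layer to the layers it imports from."""
    bad = []
    for layer, deps in imports.items():
        for dep in deps:
            if RANK[dep] > RANK[layer]:  # importing "upward" is forbidden
                bad.append(f"{layer} -> {dep}")
    return bad

# A service module may depend on repo/config/types, but never on UI.
print(violations({"service": ["repo", "types", "ui"]}))  # ['service -> ui']
```

Run as a CI gate, a check like this turns the layer diagram into something an agent cannot quietly violate.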
GAN-inspired three-agent pattern: Planner → Generator → Evaluator.
How the evaluator works:
- Sprint contracts: before each sprint, generator and evaluator negotiate testable "done" criteria.
Key findings:
- Communication: agents coordinate via files, not message passing.
- Cost: solo $9/20min → full 3-agent setup $200/6hr. Simplified (Opus 4.6): $125/4hr.
Source: https://www.anthropic.com/engineering/harness-design-long-running-apps
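The file-based coordination can be sketched as three roles that share nothing but a working directory: each reads its input file and writes an output file. A toy sketch under that assumption (file names and role stubs are illustrative, not Anthropic's actual harness):

```python
# Toy sketch of file-based coordination between planner, generator, and
# evaluator roles: the filesystem is the only shared channel. All file
# names and payloads are illustrative.
import json, tempfile
from pathlib import Path

def planner(workdir: Path) -> None:
    # Planner writes the sprint contract: testable "done" criteria.
    (workdir / "plan.json").write_text(json.dumps(
        {"done_criteria": ["all tests pass", "no lint errors"]}))

def generator(workdir: Path) -> None:
    # Generator reads the plan and writes a candidate claiming each criterion.
    plan = json.loads((workdir / "plan.json").read_text())
    (workdir / "candidate.json").write_text(json.dumps(
        {"claims": plan["done_criteria"]}))

def evaluator(workdir: Path) -> dict:
    # Evaluator checks the candidate against the contract, records a verdict.
    plan = json.loads((workdir / "plan.json").read_text())
    cand = json.loads((workdir / "candidate.json").read_text())
    verdict = {"accepted": set(cand["claims"]) >= set(plan["done_criteria"])}
    (workdir / "verdict.json").write_text(json.dumps(verdict))
    return verdict

with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    planner(root)        # each role runs in turn; files are the only channel
    generator(root)
    result = evaluator(root)
print(result)  # {'accepted': True}
```

Because every handoff is a file, any step can be inspected, replayed, or swapped for a different agent without touching the others.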
1,300+ PRs/week. All human-reviewed, zero human-written code. Manages >$1 trillion annual payment volume.
- Devboxes: cloud machines pre-loaded with the codebase, 10-second spin-up via warm pool. QA-isolated. Built for humans years before agents — agents just slotted in.
- Blueprints: hybrid orchestration mixing deterministic nodes (lint, push, PR template) with agentic nodes (implement, fix CI). "Putting LLMs into contained boxes compounds into reliability."
- Scoped rules: global rules used "very judiciously." Almost all rules are scoped to subdirectories/file patterns and auto-attached as the agent navigates. Same rules for Minions, Cursor, Claude Code.
- Feedback: pre-push lint < 5 seconds (a background daemon precomputes), selective CI from 3M+ tests with autofixes, max 2 CI rounds.
- Toolshed: centralized MCP server with ~500 tools. Agents get curated subsets, not the kitchen sink.
- Key insight: "Investments in human developer productivity over time have returned to pay dividends in the world of agents."
Source: https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-coding-agents
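The blueprint shape is a fixed pipeline where some nodes are plain functions and others wrap a model call. A minimal sketch, with node names and a stubbed agentic node invented for illustration (not Stripe's actual API):

```python
# Sketch of a "blueprint": an ordered pipeline where some nodes are plain
# deterministic functions (lint, open PR) and one wraps an LLM call. The
# agentic node is stubbed here; a real system would invoke a model.
from typing import Callable

def lint(state: dict) -> dict:             # deterministic node
    state["lint_ok"] = "TODO" not in state["diff"]
    return state

def agent_fix(state: dict) -> dict:        # agentic node (stubbed)
    if not state["lint_ok"]:
        state["diff"] = state["diff"].replace("TODO", "done")
        state["lint_ok"] = True
    return state

def open_pr(state: dict) -> dict:          # deterministic node
    state["pr"] = f"PR opened with diff: {state['diff']}"
    return state

def run_blueprint(nodes: list[Callable[[dict], dict]], state: dict) -> dict:
    for node in nodes:                     # contained boxes, fixed order
        state = node(state)
    return state

result = run_blueprint([lint, agent_fix, open_pr], {"diff": "TODO: impl"})
print(result["pr"])  # PR opened with diff: done: impl
```

The point of the structure: the model only acts inside one box, while the surrounding deterministic nodes guarantee the pipeline's overall shape.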
The most rigorous verification approach. Infrastructure-first: invest in automated checks, not code review.
Verification pyramid:
| Layer | Tool | Time |
|---|---|---|
| Symbolic | TLA+ specs | 2 min read |
| Primary | Deterministic Simulation Testing (DST) | ~5 seconds |
| Exhaustive | Model checking (Stateright) | 30-60 seconds |
| Bounded | Bounded verification (Kani) | ~60 seconds |
| Empirical | Telemetry + benchmarks | seconds-minutes |
DST as workhorse: each run ~5 seconds, exercises production code through randomized scenarios. Target: 500 seeds per component, 10 million across all components. Caught bugs impossible to find in code review.
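A DST run drives production logic through randomized operations that all flow from one seed, checking invariants at every step, so any failure replays exactly. A minimal sketch of the idea (the bounded queue is a stand-in component, not Datadog's system):

```python
# Minimal DST sketch: exercise a component with seeded randomness, assert
# an invariant after every step, and report the seed on failure so the
# exact run can be replayed. The component here is a toy bounded queue.
import random

class BoundedQueue:
    def __init__(self, cap: int):
        self.cap, self.items = cap, []
    def push(self, x):
        if len(self.items) < self.cap:
            self.items.append(x)
    def pop(self):
        return self.items.pop(0) if self.items else None

def simulate(seed: int, steps: int = 200) -> None:
    rng = random.Random(seed)               # all randomness flows from seed
    q = BoundedQueue(cap=8)
    for _ in range(steps):
        if rng.random() < 0.6:
            q.push(rng.randint(0, 99))
        else:
            q.pop()
        # Invariant: the queue never exceeds its capacity.
        assert len(q.items) <= q.cap, f"invariant broken, replay seed={seed}"

for seed in range(500):                     # "500 seeds per component"
    simulate(seed)
print("500 seeds passed")
```

Each run is a few milliseconds here; at the scale the text describes, the same loop structure runs heavier scenarios in ~5 seconds per seed.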
Contracts before code: core invariants defined upfront. Agent is NOT allowed to invent system meaning. "Semi-formal methods" — specs explicit enough to be checked, cheap enough to run continuously.
Performance as hill-climbing: with correctness locked in, agent proposes optimization → full DST → if tests pass, measure throughput → keep or revert.
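The hill-climbing loop above can be sketched as a keep-or-revert harness, where the `propose`, `dst_passes`, and `throughput` hooks are placeholders for the real agent, test suite, and benchmark:

```python
# Sketch of correctness-gated optimization: accept a proposed change only
# if the full DST suite still passes AND measured throughput improves;
# otherwise revert. All three hooks are illustrative placeholders.
def hill_climb(state, propose, dst_passes, throughput, rounds=10):
    best = throughput(state)
    for _ in range(rounds):
        candidate = propose(state)
        if not dst_passes(candidate):       # correctness is non-negotiable
            continue                        # revert: keep current state
        score = throughput(candidate)
        if score > best:                    # keep only measurable wins
            state, best = candidate, score
    return state, best

# Toy example: "state" is a batch size; throughput peaks at 64.
final, score = hill_climb(
    state=8,
    propose=lambda s: s * 2,
    dst_passes=lambda s: s <= 128,
    throughput=lambda s: -abs(s - 64),
    rounds=5,
)
print(final, score)  # 64 0
```

The design choice worth noting: the correctness gate runs before the benchmark, so a fast-but-wrong candidate is never even measured.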
Human role: "define the system idea and invariants, review and strengthen the DST harness, set measurable targets, and approve architectural changes. Everything else was the agent running against the harness."
Reviews become Bloom filters — a fast gate, not the source of correctness.
Source: https://www.datadoghq.com/blog/ai/harness-first-agents/
Built a web browser with ~1,000 commits/hour across 10M tool calls. Almost zero human intervention.
Architecture evolution: single agent → shared state (locking hell) → rigid planner/executor (bottleneck) → final: root planner + recursive subplanners + isolated workers with handoffs.
Key findings:
Source: https://cursor.com/blog/self-driving-codebases
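The final architecture's recursive decomposition can be sketched as a planner that splits work until a task is small enough for an isolated worker, then collects the handoffs back up (the size threshold and worker stub are illustrative):

```python
# Sketch of recursive planning: a planner splits a task until pieces are
# small enough to hand off to isolated workers, then results flow back up.
# The size metric and worker stub are illustrative.
def plan(task: str, size: int) -> list[str]:
    if size <= 1:                 # small enough: hand off to a worker
        return [worker(task)]
    half = size // 2              # otherwise split; each half gets its
    return plan(task + ".a", half) + plan(task + ".b", size - half)

def worker(task: str) -> str:
    return f"done:{task}"         # isolated worker; no shared state

results = plan("browser", 4)
print(results)
```

Because subplanners only hand off descriptions and collect results, no two workers ever touch shared state — which is what eliminated the "locking hell" of the earlier shared-state design.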
92% of devs use agents monthly. 65-72% of code AI-generated. AI costs up 6x since 2024.
Minion (background agent platform):
uReview (AI code review):
Code Inbox (smart PR routing):
Key finding: Claude Code usage nearly doubled in 3 months (32% → 63%), while IDE tools plateaued. Engineers naturally gravitate to multi-agent workflows. Sharing wins between engineers is the most effective adoption tactic — top-down mandates had limited impact.
Source: https://newsletter.pragmaticengineer.com/p/how-uber-uses-ai-for-development
Every team converges on: expensive models for planning, cheap for workers. Workers isolated from each other. Filesystem as coordination primitive (not message passing). Accept and correct > prevent all errors. PR as the human oversight gate.
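The convergent setup reduces to a small routing table: the planner gets the expensive model, workers get cheap ones, and each role gets its own directory as the handoff channel. A sketch with placeholder model names and paths (not recommendations):

```python
# Sketch of the convergent pattern: role -> model tier + working directory.
# Workers never share state directly; handoffs go through the filesystem.
# Model identifiers and paths are placeholders.
ROLES = {
    "planner":  {"model": "expensive-frontier-model", "dir": "work/plan"},
    "worker-1": {"model": "cheap-fast-model",         "dir": "work/w1"},
    "worker-2": {"model": "cheap-fast-model",         "dir": "work/w2"},
}

def model_for(role: str) -> str:
    return ROLES[role]["model"]

# Planning gets the big model; workers stay cheap and isolated.
assert model_for("planner") != model_for("worker-1")
assert ROLES["worker-1"]["dir"] != ROLES["worker-2"]["dir"]
```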