Estimates implementation time for web development tasks (frontend and/or backend) by analyzing the existing codebase and calibrating for an AI coding agent as executor — not a human developer. Use when the user asks about effort, sizing, or feasibility: 'how long', 'how much work', 'estimate this', 'what is the effort', 'breakdown this task', 'can we do this in X days', 'is this a big task', 'how complex is', 'what's involved in', 'fits in the sprint', 'rough sizing', 't-shirt size', 'story points'. Also use when the user describes a feature and implicitly wants to know scope — e.g. 'we need to add X to the app', 'thinking about building Y', 'is this feasible by Friday'. Supports batch estimation from any structured source (BMAD output, spec folders, PRDs, backlogs, task lists) — use when the user mentions 'estimate the stories', 'estimate the epic', 'scan the backlog', 'estimate all tasks', 'estimate the specs', or points to a folder of task/story/spec files.
Eval summary: overall score 95; best practices 94%; impact 98%; 1.40× average score across 5 eval scenarios; passed with no known issues.
A human developer's estimate reflects their own speed and cognitive load. A coding agent (Claude Code) has a radically different performance profile; a sketch of applying these multipliers follows the table:
| Task type | Human baseline | Agent multiplier | Rationale |
|---|---|---|---|
| Boilerplate / scaffolding | 1× | 0.2–0.3× | Agent generates in seconds; no cognitive load |
| CRUD endpoints / forms | 1× | 0.3–0.4× | Well-trodden pattern, agent copies confidently |
| UI component (existing design system) | 1× | 0.4–0.5× | Follows existing patterns well |
| UI component (no design system) | 1× | 0.6–0.8× | Must invent conventions; higher correction rate |
| Business logic (clear spec) | 1× | 0.5–0.6× | Reasons well with clear requirements |
| Business logic (partial spec) | 1× | 0.8–1.0× | Clarification loops are expensive |
| Integration (documented 3rd-party API) | 1× | 0.6–0.8× | Docs must be ingested, edge cases explored |
| Integration (poor/missing docs) | 1× | 1.2–2.0× | High hallucination risk on API contracts |
| Debugging (clear stack trace) | 1× | 0.7–1.0× | Often faster than human with good visibility |
| Debugging (intermittent / no trace) | 1× | 1.0–1.8× | Expensive retry loops; agent fabricates root causes |
| Auth / permissions / roles | 1× | 0.8–1.2× | Security-critical; validation loops required |
| Database migration | 1× | 0.7–1.0× | Mechanical, but schema knowledge required |
| Architecture / design decisions | 1× | 1.0–1.5× | Requires human validation; agent oversimplifies |
| Refactor: extract hook/util | 1× | 0.3–0.5× | Pattern-matching, agent is good at this |
| Refactor: cross-cutting concern | 1× | 0.8–1.2× | Multi-file, risk of missing call sites |
| Ambiguous / underspecified task | 1× | 1.5–3.0× | The single most expensive multiplier |
| Writing / updating tests | 1× | 0.4–0.6× | Agent generates test boilerplate well |
| E2E test (Playwright/Cypress) | 1× | 0.5–0.8× | Selector strategy matters; fragility risk |
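These multipliers lend themselves to a simple lookup. Below is a minimal TypeScript sketch of applying them, assuming one flat key per table row; every name here is hypothetical, not part of the skill:

```typescript
// Minimal sketch (hypothetical names): scale a human-baseline estimate
// by the agent multiplier range for its task type.

type MultiplierRange = { low: number; high: number };

// A subset of the rows above; extend with the remaining task types.
const AGENT_MULTIPLIERS: Record<string, MultiplierRange> = {
  "boilerplate": { low: 0.2, high: 0.3 },
  "crud": { low: 0.3, high: 0.4 },
  "ui-existing-design-system": { low: 0.4, high: 0.5 },
  "business-logic-clear-spec": { low: 0.5, high: 0.6 },
  "integration-poor-docs": { low: 1.2, high: 2.0 },
  "ambiguous": { low: 1.5, high: 3.0 },
};

// Convert a human-baseline estimate (hours) into an agent estimate range.
function agentEstimate(humanHours: number, taskType: string): MultiplierRange {
  const m = AGENT_MULTIPLIERS[taskType];
  if (!m) throw new Error(`Unknown task type: ${taskType}`);
  return { low: humanHours * m.low, high: humanHours * m.high };
}

// A CRUD endpoint a human would estimate at 4 hours comes out to
// roughly 1.2–1.6 agent hours before any adjustments.
console.log(agentEstimate(4, "crud")); // { low: 1.2, high: 1.6 }
```

Quoting the full range, rather than a single number, keeps the uncertainty visible to whoever is planning the sprint.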
Apply these adjustments on top of the base multiplier; the sketch after the table shows one way to combine them:

| Factor | Adjustment | When to apply |
|---|---|---|
| Clear spec / precise ticket | −10% | Written requirements with acceptance criteria |
| Vague spec | +30% | "Make it work like X" without specifics |
| No spec at all | +80% | Verbal description only |
| High prior art in codebase | −20% | Nearly identical pattern exists |
| Greenfield (no prior art) | +40% | No similar pattern to copy |
| External API (good docs) | +20% | Stripe, Twilio, SendGrid, etc. |
| External API (poor docs) | +40% | Internal microservice, legacy vendor |
| Auth / shared state dependency | +25% | Task touches session, permissions, tokens |
| DB migration required | +15% | Schema change needed |
| Monorepo / complex build | +20% | Multiple packages, cross-boundary imports |
| No TypeScript / no types | +30% | Agent loses inference; more hallucination risk |
| Deprecated or legacy stack | +50% | Agent training data coverage drops sharply |
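One reasonable reading of the table is that the adjustments sum additively before scaling the base estimate. A minimal sketch under that assumption (names hypothetical):

```typescript
// Minimal sketch (hypothetical names): adjustments expressed as fractions
// (+30% => 0.30), summed, then applied on top of the base multiplier:
//   estimate = humanHours × baseMultiplier × (1 + Σ adjustments)

// A subset of the rows above; extend as needed.
const ADJUSTMENTS: Record<string, number> = {
  "clear-spec": -0.10,
  "vague-spec": 0.30,
  "no-spec": 0.80,
  "high-prior-art": -0.20,
  "greenfield": 0.40,
  "db-migration": 0.15,
  "no-typescript": 0.30,
};

function adjustedEstimate(
  humanHours: number,
  baseMultiplier: number, // midpoint from the task-type table
  factors: string[],      // keys of the adjustments that apply
): number {
  const total = factors.reduce((sum, f) => sum + (ADJUSTMENTS[f] ?? 0), 0);
  return humanHours * baseMultiplier * (1 + total);
}

// 8 human-hours of business logic (0.55× midpoint), greenfield (+40%),
// plus a schema change (+15%): 8 × 0.55 × 1.55 ≈ 6.8 agent hours.
console.log(adjustedEstimate(8, 0.55, ["greenfield", "db-migration"])); // ~6.82
```

If the additive reading over- or under-shoots in practice, the same structure works with multiplicative stacking; calibrate against observed outcomes.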
With a human developer: vague task → conversation → clarification → work resumes.
With a coding agent: vague task → confident (often wrong) interpretation → fast implementation → correction loop and rework.
The cost of a wrong direction is higher because the agent moves into it faster.
Rule of thumb: Every hour spent on spec before the agent starts saves 2–4 hours of correction after.
Last updated: March 2026. Calibrate against your own observed agent performance and update this file accordingly.