Estimates implementation time for web development tasks (frontend and/or backend) by analyzing the existing codebase and calibrating for an AI coding agent as executor — not a human developer. Use when the user asks about effort, sizing, or feasibility: 'how long', 'how much work', 'estimate this', 'what is the effort', 'breakdown this task', 'can we do this in X days', 'is this a big task', 'how complex is', 'what's involved in', 'fits in the sprint', 'rough sizing', 't-shirt size', 'story points'. Also use when the user describes a feature and implicitly wants to know scope — e.g. 'we need to add X to the app', 'thinking about building Y', 'is this feasible by Friday'. Supports batch estimation from any structured source (BMAD output, spec folders, PRDs, backlogs, task lists) — use when the user mentions 'estimate the stories', 'estimate the epic', 'scan the backlog', 'estimate all tasks', 'estimate the specs', or points to a folder of task/story/spec files.
Eval summary: 1.40x average score across 5 eval scenarios; passed, no known issues.
These rules are mandatory. Read this file before writing any estimate number.
Never write "3 hours". Always write "2–4 hours". The range communicates real uncertainty. A single number is a false promise.
Minimum range spread:
Every estimate must include:
Confidence levels:
If no filesystem access or the user hasn't shared the project:
⚠️ Codebase not read. This estimate is based on task description only. Actual time may differ by 2–3× once the codebase is analyzed. Treat this as a rough order of magnitude.
Never silently skip the codebase read step.
If the real estimate is 3 days but the user is hoping for "a few hours", say 3 days. An honest high estimate protects the user; a low estimate that turns out wrong damages trust.
If the user pushes back ("that seems like a lot"), explain the specific sub-tasks driving the estimate. Do not compress the numbers without a clear reason.
Every estimate must include:
Top risk: [The single thing that could most increase this estimate — e.g., "If the 3rd-party API lacks typed SDKs, integration time could double."]
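For illustration only, a minimal sketch of what those required fields might look like as data; the class, field, and level names are assumptions, not something this skill prescribes:

```python
from dataclasses import dataclass, field
from enum import Enum

class Confidence(Enum):
    LOW = "Low"
    MEDIUM = "Medium"   # assumed middle step between Low and High
    HIGH = "High"

@dataclass
class Estimate:
    """Illustrative shape of a single estimate; names are assumptions."""
    low_hours: float                 # lower bound of the range, e.g. 2
    high_hours: float                # upper bound of the range, e.g. 4
    confidence: Confidence           # Low / Medium / High
    top_risk: str                    # the single factor most likely to inflate the estimate
    assumptions: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # A single number is a false promise: the estimate must be a real range.
        if self.high_hours <= self.low_hours:
            raise ValueError("estimate must be a range, not a single number")
```

A call like `Estimate(2, 4, Confidence.MEDIUM, "Poorly documented 3rd-party API")` then carries the range, confidence, and top risk together.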
Apply these to the confidence level:
| Modifier | Effect |
|---|---|
| Codebase read AND prior art found | +1 confidence level |
| Codebase NOT read | −1 confidence level |
| Vague spec | −1 confidence level |
| Task has external API with poor docs | −1 confidence level |
| Similar task was done recently by agent (user confirms) | +1 confidence level |
| Task touches auth, billing, or data migration | −1 confidence level (always) |
Confidence cannot exceed High or go below Low. If multiple −1 modifiers apply and confidence would go below Low, note:
⚠️ Estimate reliability: Very Low — multiple uncertainty factors compound. Recommend spec clarification before committing to a timeline.
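To make the arithmetic concrete, a minimal sketch of applying these modifiers under the clamping rule above; the three-step scale, function name, and example values are assumptions:

```python
CONFIDENCE_SCALE = ["Low", "Medium", "High"]  # assumed three-step scale

def apply_modifiers(base: str, modifiers: list[int]) -> tuple[str, bool]:
    """Apply +1/-1 modifiers to a base confidence level.

    Returns the clamped level and a flag for the 'Very Low' warning,
    raised when the raw score would fall below Low.
    """
    idx = CONFIDENCE_SCALE.index(base)
    raw = idx + sum(modifiers)
    very_low = raw < 0                          # multiple -1 modifiers compounded
    clamped = min(max(raw, 0), len(CONFIDENCE_SCALE) - 1)
    return CONFIDENCE_SCALE[clamped], very_low

# Example: codebase not read (-1), vague spec (-1), touches auth (-1)
level, warn = apply_modifiers("Medium", [-1, -1, -1])
# level == "Low", warn == True -> emit the 'Estimate reliability: Very Low' note
```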
Escalate to "need clarification first" if ANY of these apply:
Escalation message format:
Before I can give you a reliable estimate, I need clarity on:
1. [Specific question]
2. [Specific question]
Without this, the estimate range could be X–Y hours (2×+ variance).
Want me to give you a rough order-of-magnitude anyway?

If the task decomposes to >40 agent-hours:
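For illustration, a sketch of how decomposed sub-task ranges might roll up against that threshold (the sub-tasks, numbers, and helper name are hypothetical):

```python
XXL_THRESHOLD_HOURS = 40  # agent-hours, per the rule above

def total_hours(subtasks: dict[str, tuple[float, float]]) -> tuple[float, float]:
    """Sum the low/high bounds of each sub-task's range."""
    low = sum(lo for lo, _ in subtasks.values())
    high = sum(hi for _, hi in subtasks.values())
    return low, high

subtasks = {
    "schema + migration": (6, 10),   # hypothetical sub-tasks and ranges
    "API endpoints": (8, 12),
    "frontend views": (10, 16),
    "auth integration": (6, 10),
}
low, high = total_hours(subtasks)
if high > XXL_THRESHOLD_HOURS:       # 30-48h here, so the XXL message applies
    print(f"XXL: {low}-{high}h exceeds {XXL_THRESHOLD_HOURS}h; propose a breakdown")
```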
XXL message format:
This task estimates to XXL (>40h agent time). Estimates at this scale are unreliable.
The main drivers are:
- [Sub-task]: ~Xh (because [reason])
- [Sub-task]: ~Yh (because [reason])
Recommendation: Break this into [N] separate stories and estimate each independently.
Want me to propose a breakdown?

When the user changes scope mid-conversation:
Format:
Scope change: [what changed]
Delta:
- Removed: [sub-task] → −Xh
- Added: [sub-task] → +Yh
Revised total: A–B hours (was C–D hours)
Confidence: [updated if applicable]

A reliable estimate: