Estimates implementation time for web development tasks (frontend and/or backend) by analyzing the existing codebase and calibrating for an AI coding agent as executor — not a human developer. Use when the user asks about effort, sizing, or feasibility: 'how long', 'how much work', 'estimate this', 'what is the effort', 'breakdown this task', 'can we do this in X days', 'is this a big task', 'how complex is', 'what's involved in', 'fits in the sprint', 'rough sizing', 't-shirt size', 'story points'. Also use when the user describes a feature and implicitly wants to know scope — e.g. 'we need to add X to the app', 'thinking about building Y', 'is this feasible by Friday'. Supports batch estimation from any structured source (BMAD output, spec folders, PRDs, backlogs, task lists) — use when the user mentions 'estimate the stories', 'estimate the epic', 'scan the backlog', 'estimate all tasks', 'estimate the specs', or points to a folder of task/story/spec files.
95
94%
Does it follow best practices?
Impact
98%
1.40xAverage score across 5 eval scenarios
Passed
No known issues
{
"context": "Tests whether the agent applies honesty rules when facing poor documentation, identifies escalation triggers, does not compress estimates to match unrealistic stakeholder expectations, and flags high-risk factors appropriately.",
"type": "weighted_checklist",
"checklist": [
{
"name": "Does not match CEO timeline",
"description": "Estimate does NOT compress to 'a couple hours' — the total exceeds the CEO's stated expectation, with honest reasoning for why",
"max_score": 12
},
{
"name": "Poor docs risk flagged",
"description": "Estimate identifies the incomplete PDF documentation (missing appendix, no SDK, no sandbox) as a significant risk or uncertainty driver",
"max_score": 12
},
{
"name": "Escalation or clarification",
"description": "Output includes specific questions or clarifications needed before the estimate can be considered reliable (e.g. missing appendix, no sandbox, authentication details)",
"max_score": 10
},
{
"name": "Low or medium confidence",
"description": "Confidence level is Low or Medium, not High, reflecting the documentation gaps and external dependency",
"max_score": 10
},
{
"name": "External API buffer applied",
"description": "Time estimates are visibly inflated compared to a well-documented integration, reflecting the poor documentation quality",
"max_score": 8
},
{
"name": "Time ranges used",
"description": "All time estimates use ranges rather than single point values",
"max_score": 8
},
{
"name": "Wide range spread",
"description": "Range spread is at least ±40% on major sub-tasks, reflecting the high uncertainty",
"max_score": 8
},
{
"name": "Sub-task decomposition",
"description": "Work is broken into at least 4 distinct sub-tasks (e.g. API client, polling, reconciliation, push updates, tests)",
"max_score": 8
},
{
"name": "Stack detected as Go",
"description": "Output identifies the stack as Go (not TypeScript, Python, etc.)",
"max_score": 6
},
{
"name": "T-shirt size M or larger",
"description": "T-shirt size reflects substantial effort (M, L, or XL), not XS or S",
"max_score": 6
},
{
"name": "Top risk named",
"description": "A specific top risk is identified relating to the undocumented API or missing specification details",
"max_score": 6
},
{
"name": "Assumptions listed",
"description": "Output explicitly lists assumptions that the estimate depends on",
"max_score": 6
}
]
}_refs
bin
evals
scenario-1
scenario-2
scenario-3
scenario-4
scenario-5