Estimates implementation time for web development tasks (frontend and/or backend) by analyzing the existing codebase and calibrating for an AI coding agent as executor — not a human developer. Use when the user asks about effort, sizing, or feasibility: 'how long', 'how much work', 'estimate this', 'what is the effort', 'breakdown this task', 'can we do this in X days', 'is this a big task', 'how complex is', 'what's involved in', 'fits in the sprint', 'rough sizing', 't-shirt size', 'story points'. Also use when the user describes a feature and implicitly wants to know scope — e.g. 'we need to add X to the app', 'thinking about building Y', 'is this feasible by Friday'. Supports batch estimation from any structured source (BMAD output, spec folders, PRDs, backlogs, task lists) — use when the user mentions 'estimate the stories', 'estimate the epic', 'scan the backlog', 'estimate all tasks', 'estimate the specs', or points to a folder of task/story/spec files.
95
94%
Does it follow best practices?
Impact
98%
1.40x
Average score across 5 eval scenarios
Passed
No known issues
{
  "context": "Tests whether the agent applies greenfield correction factors when no prior art exists for the requested feature, uses appropriate calibration multipliers for a Python stack, and produces wider ranges reflecting higher uncertainty.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Greenfield acknowledged",
      "description": "Estimate explicitly notes the absence of an existing chart/visualization library or similar pattern in the codebase, impacting the estimate upward",
      "max_score": 12
    },
    {
      "name": "Wider time ranges",
      "description": "Time ranges are noticeably wider than a typical well-defined task (reflecting greenfield uncertainty), with the upper bound at least 50% above the lower bound for major sub-tasks",
      "max_score": 10
    },
    {
      "name": "Medium or low confidence",
      "description": "Confidence is set to Medium or Low (not High), reflecting the greenfield nature and absence of prior patterns",
      "max_score": 10
    },
    {
      "name": "Five or more sub-tasks",
      "description": "Work is decomposed into at least 5 distinct sub-tasks covering library setup, data endpoints, chart rendering, filters, and export",
      "max_score": 10
    },
    {
      "name": "Library selection risk noted",
      "description": "Estimate mentions that selecting and integrating a new chart library (with no existing pattern to follow) is a risk or driver of uncertainty",
      "max_score": 8
    },
    {
      "name": "Stack detected as Python",
      "description": "Output identifies the stack as Python/FastAPI (not TypeScript or another framework)",
      "max_score": 8
    },
    {
      "name": "Agent-calibrated times",
      "description": "Sub-task times are in the agent scale (minutes to hours), not in human developer scale (days to weeks)",
      "max_score": 8
    },
    {
      "name": "T-shirt size L or larger",
      "description": "T-shirt size is L, XL, or XXL, consistent with a greenfield multi-component feature",
      "max_score": 8
    },
    {
      "name": "All required sections present",
      "description": "Output includes Summary, Sub-tasks table, Assumptions, Risks, and T-shirt size sections",
      "max_score": 8
    },
    {
      "name": "Time ranges not points",
      "description": "All time estimates use ranges rather than single point values",
      "max_score": 8
    },
    {
      "name": "Top risk identified",
      "description": "A specific top risk is named explaining what could most increase the estimate",
      "max_score": 5
    },
    {
      "name": "Prior art assessed",
      "description": "Estimate explicitly assesses prior art level (noting it as Low or none for the visualization component)",
      "max_score": 5
    }
  ]
}
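The twelve `max_score` values in the checklist add up to 100. How a grader turns per-item judgments into a final score is not stated in the source, so the following is only a minimal sketch under assumptions: each item receives an integer award clamped to `[0, max_score]`, and the final score is the plain sum. The helper names (`total_score`, `max_total`, `range_is_wide`) and the two-item excerpt are hypothetical and exist only for illustration. `range_is_wide` mirrors the "upper bound at least 50% above the lower bound" wording of the "Wider time ranges" item.

```python
import json

# Hypothetical two-item excerpt of the checklist; the real file has twelve items.
CHECKLIST_JSON = """
{
  "type": "weighted_checklist",
  "checklist": [
    {"name": "Greenfield acknowledged", "max_score": 12},
    {"name": "Wider time ranges", "max_score": 10}
  ]
}
"""


def total_score(checklist, awarded):
    """Sum awarded points, clamping each item to [0, max_score].

    Assumption: items missing from `awarded` count as 0.
    """
    total = 0
    for item in checklist:
        points = awarded.get(item["name"], 0)
        total += max(0, min(points, item["max_score"]))
    return total


def max_total(checklist):
    """Highest score reachable for this checklist."""
    return sum(item["max_score"] for item in checklist)


def range_is_wide(low, high, min_spread=0.5):
    """True when `high` is at least `min_spread` (50%) above `low`."""
    return high >= low * (1 + min_spread)


spec = json.loads(CHECKLIST_JSON)
awarded = {"Greenfield acknowledged": 12, "Wider time ranges": 7}
score = total_score(spec["checklist"], awarded)
print(f"{score}/{max_total(spec['checklist'])}")
```

Under these assumptions, a 2 to 3 hour sub-task range counts as wide enough, while 2 to 2.5 hours does not.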