Estimates implementation time for web development tasks (frontend and/or backend) by analyzing the existing codebase and calibrating for an AI coding agent as executor — not a human developer. Use when the user asks about effort, sizing, or feasibility: 'how long', 'how much work', 'estimate this', 'what is the effort', 'breakdown this task', 'can we do this in X days', 'is this a big task', 'how complex is', 'what's involved in', 'fits in the sprint', 'rough sizing', 't-shirt size', 'story points'. Also use when the user describes a feature and implicitly wants to know scope — e.g. 'we need to add X to the app', 'thinking about building Y', 'is this feasible by Friday'. Supports batch estimation from any structured source (BMAD output, spec folders, PRDs, backlogs, task lists) — use when the user mentions 'estimate the stories', 'estimate the epic', 'scan the backlog', 'estimate all tasks', 'estimate the specs', or points to a folder of task/story/spec files.
Score: 95
Does it follow best practices? 94%
Impact: 98%
Average score across 5 eval scenarios: 1.40x
Status: Passed, no known issues
{
"context": "Tests whether the agent uses the batch estimation workflow with a consolidated summary table, estimates each task individually with per-task sizing, identifies cross-task dependencies, suggests an implementation order, and avoids redundant codebase analysis across tasks.",
"type": "weighted_checklist",
"checklist": [
{
"name": "Consolidated summary table",
"description": "Output includes a single table or matrix comparing all 5 features side-by-side with at least task name, time estimate, and size for each",
"max_score": 12
},
{
"name": "All five features estimated",
"description": "Every one of the 5 features has an individual time estimate (not just a lump total)",
"max_score": 10
},
{
"name": "Per-task T-shirt size",
"description": "Each feature is assigned its own T-shirt size (XS/S/M/L/XL)",
"max_score": 8
},
{
"name": "Grand total provided",
"description": "Output includes a total time range summing across all 5 features",
"max_score": 8
},
{
"name": "Implementation order suggested",
"description": "Output recommends a sequencing or priority order for the 5 features with reasoning",
"max_score": 10
},
{
"name": "Cross-task dependencies noted",
"description": "Output identifies at least one dependency between features (e.g. roles needed before invitations, or activity feed depending on event model)",
"max_score": 10
},
{
"name": "Dark mode sized smallest",
"description": "The dark mode feature receives the smallest estimate (XS or S) given strong existing support from Tailwind and shadcn/ui",
"max_score": 8
},
{
"name": "Permissions sized largest",
"description": "The role-based permissions feature receives one of the larger estimates (M, L, or XL) reflecting its cross-cutting, security-critical nature",
"max_score": 8
},
{
"name": "Time ranges not points",
"description": "All time values are expressed as ranges, not single point estimates",
"max_score": 8
},
{
"name": "Stack detected as TypeScript",
"description": "Output identifies the stack as TypeScript/Next.js",
"max_score": 5
},
{
"name": "Per-task risk or confidence",
"description": "Each feature has either a risk note or confidence level indicating relative uncertainty",
"max_score": 5
},
{
"name": "Shared assumptions listed",
"description": "Output includes assumptions that apply across multiple or all features",
"max_score": 8
}
]
}
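Assuming a straightforward scoring harness (the actual eval runner is not shown here), a `weighted_checklist` like the one above can be scored by summing the points awarded per item, capping each at its `max_score`, and normalizing by the checklist's total. A minimal sketch, where the item names and awarded values are illustrative:

```python
def score_weighted_checklist(checklist, awarded):
    """Sum awarded points per item (capped at max_score), normalize to a percentage."""
    total_max = sum(item["max_score"] for item in checklist)
    total = sum(
        min(awarded.get(item["name"], 0), item["max_score"])
        for item in checklist
    )
    return round(100 * total / total_max, 1)

# Hypothetical grading of two of the twelve items above
checklist = [
    {"name": "Consolidated summary table", "max_score": 12},
    {"name": "All five features estimated", "max_score": 10},
]
awarded = {"Consolidated summary table": 12, "All five features estimated": 5}
print(score_weighted_checklist(checklist, awarded))  # 77.3
```

The twelve `max_score` values in the config sum to 100, so with the full checklist the raw point total and the percentage coincide.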