Execute a strict Red-Green-Refactor TDD cycle — one requirement at a time — in any language or framework.
97
Quality
100%
Does it follow best practices?
Impact
94%
1.11xAverage score across 5 eval scenarios
{
"context": "The agent is asked to write a RED phase test for a .NET InvoiceService. A developer incorrectly suggests mocking a private method. The agent must reject that approach and write a correct behavioral test using injected dependency mocking only.",
"type": "weighted_checklist",
"checklist": [
{
"name": "Private mock rejected",
"description": "The agent explicitly rejects the suggestion to mock the private _calculateLineItemSubtotals method",
"max_score": 2
},
{
"name": "Violation explained",
"description": "The agent explains why mocking a private method violates behavioral TDD principles (couples test to implementation)",
"max_score": 2
},
{
"name": "Injected dependency mocked correctly",
"description": "The correct RED phase test mocks only an injected dependency (e.g. IOrderRepository or ILineItemCalculator) via NSubstitute",
"max_score": 2
},
{
"name": "Observable output asserted",
"description": "The test asserts on the observable output of GenerateInvoice — the returned Invoice.Total value — not on internal calls",
"max_score": 2
},
{
"name": "Phase gating respected",
"description": "The agent stops after Phase 1 and waits for user confirmation",
"max_score": 2
}
]
}