Execute a strict Red-Green-Refactor TDD cycle — one requirement at a time — in any language or framework.
97
Quality
100%
Does it follow best practices?
Impact
94%
1.11xAverage score across 5 eval scenarios
{
"context": "The agent is executing Phase 1 (RED) of a behavioral TDD cycle for a C#/xUnit PasswordValidator requirement. It should produce a failing test against the public interface only and stop before writing any implementation.",
"type": "weighted_checklist",
"checklist": [
{
"name": "xUnit test targets public interface",
"description": "A test method is written using xUnit ([Fact] or [Theory]) that instantiates or calls the public PasswordValidator interface",
"max_score": 2
},
{
"name": "Assertion checks false for short password",
"description": "The test asserts a return value of false (or equivalent) when given a password shorter than 8 characters",
"max_score": 2
},
{
"name": "No implementation code written",
"description": "No implementation code is written — only the test",
"max_score": 2
},
{
"name": "Public interface only",
"description": "The test uses only the public interface; no private methods are called or referenced",
"max_score": 2
},
{
"name": "Expected failure reason stated",
"description": "The agent states the expected failure reason (e.g. PasswordValidator does not exist)",
"max_score": 1
},
{
"name": "Phase gating respected",
"description": "The agent stops after Phase 1 and waits for user confirmation before proceeding to Phase 2",
"max_score": 2
}
]
}