Use when the user wants a test suite audit, test quality or reliability review, regression-protection review, unit/integration/e2e test review, coverage or CI signal assessment, flaky CI investigation, fixture-realism review, spec-drift review, or generated-test validation for AI/LLM/agent-written code. Produces severity-ranked findings for weak assertions, oracle gaps, brittle fixtures, over-mocking, CI trust, and generated-code test risks.
100
100%
Does it follow best practices?
Impact
100%
1.31x
Average score across 3 eval scenarios
Passed
No known issues
{
"context": "Tests whether the agent correctly applies LLM-Generated Test Validity and Agent-Built Codebase Risks audit domains: identifying hallucinated APIs, implementation-copying oracles, spec drift, and applying correct severity for financial-path weak oracles. Also checks adherence to guardrails around not modifying code and not trusting generated tests without evidence.",
"type": "weighted_checklist",
"checklist": [
{
"name": "Hallucinated API flagged",
"description": "Report identifies that test_gateway_error_propagates accesses self.mock_gateway.GatewayError, which is a hallucinated attribute on the MagicMock (not a real exception class on the gateway), meaning the test cannot actually verify real error propagation behavior",
"max_score": 10
},
{
"name": "Hallucinated attribute flagged",
"description": "Report identifies that test_validate_payment_cache accesses self.validator._cache, an attribute that does not exist on PaymentValidator, making that test fail at runtime rather than testing intended behavior",
"max_score": 10
},
{
"name": "Implementation-copying oracle flagged",
"description": "Report flags test_calculate_fee_basic for computing the expected value using the same formula as the implementation (round(100.0 * 0.029 + 0.30, 2)), making the oracle unable to detect a wrong formula in the code under test",
"max_score": 10
},
{
"name": "Weak assertion flagged",
"description": "Report identifies at least one test with a meaningless or trivially passing assertion: test_calculate_fee_small_amount (assertIsNotNone), test_refund_full_truthy (assertTrue on a returned dict, which is truthy whenever it is non-empty, regardless of its contents), or test_fee_return_type (only checks type, not value)",
"max_score": 8
},
{
"name": "Financial severity correct",
"description": "The report rates weak oracle findings on fee calculation or charge amount handling at High severity (not Medium or Low), because these affect financial totals and ledger-impacting amounts",
"max_score": 10
},
{
"name": "Assertionless tests NOT rated Critical",
"description": "Assertionless tests and self-referential oracle findings are NOT placed under Critical Findings and are NOT given C- identifiers; they appear under High or Medium instead",
"max_score": 8
},
{
"name": "Spec drift addressed",
"description": "Report addresses whether the test suite validates the intended payment behavior (correct fee amounts, refund semantics, error handling) versus merely mirroring what the AI built, citing specific tests as evidence for or against",
"max_score": 10
},
{
"name": "No trust without evidence",
"description": "Report explicitly states that the generated tests require build/run/repeat verification before they can be treated as reliable regression protection, and does NOT assert that the tests are valid simply because they exist",
"max_score": 8
},
{
"name": "Finding contract complete",
"description": "At least two findings include all six required fields: severity, evidence checked, impact, affected test(s) or behavior, recommended fix, and verification step",
"max_score": 10
},
{
"name": "No code modification",
"description": "The workspace output contains only audit_report.md: no rewritten test files, no new test code, and no edits to payment/processor.py or payment/validator.py",
"max_score": 8
},
{
"name": "Remediation not broad rewrite",
"description": "The remediation plan lists specific, risk-ordered actions (e.g. fix the hallucinated-API test, replace the copying oracle with an independent expected value) rather than recommending that all tests be rewritten as the first or primary action",
"max_score": 8
}
]
}
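
Three of the defects named in the checklist (the implementation-copying oracle, the hallucinated attribute on a MagicMock, and assertTrue on a dict) can be reproduced in a few lines. The sketch below is illustrative only: the scenario's payment code is not shown here, so `buggy_fee`, `RealGateway`, and the failed-refund dict are hypothetical stand-ins. The only detail taken from the checklist is the fee formula, 2.9% of the amount plus 0.30.

```python
from unittest.mock import MagicMock


def buggy_fee(amount: float) -> float:
    # Hypothetical implementation with a misplaced decimal: 29% instead of 2.9%.
    return round(amount * 0.29 + 0.30, 2)


def copied_oracle_passes() -> bool:
    # Expected value lifted from the implementation's own formula, bug included.
    expected = round(100.0 * 0.29 + 0.30, 2)
    return buggy_fee(100.0) == expected


def independent_oracle_passes() -> bool:
    # Expected value worked out by hand from the fee rule:
    # 2.9% of 100.00 is 2.90, plus the 0.30 fixed fee, gives 3.20.
    return buggy_fee(100.0) == 3.20


class RealGateway:
    """Hypothetical gateway exposing charge() and no GatewayError attribute."""

    def charge(self, amount: float) -> dict:
        raise NotImplementedError


def unspecced_mock_invents_attribute() -> bool:
    # A bare MagicMock creates any attribute on access, so the hallucinated
    # name resolves to another mock rather than to an exception class.
    gateway = MagicMock()
    return isinstance(gateway.GatewayError, MagicMock)


def specced_mock_rejects_attribute() -> bool:
    # With spec=, attributes missing from RealGateway raise AttributeError.
    gateway = MagicMock(spec=RealGateway)
    try:
        gateway.GatewayError
    except AttributeError:
        return True
    return False


def failed_refund_is_truthy() -> bool:
    # assertTrue(result) would pass on this dict even though the refund failed.
    return bool({"status": "failed", "refunded": 0.0})
```

The copied oracle accepts the buggy fee while the hand-computed value rejects it, which is why the checklist treats a self-referential expected value as a weak oracle on a financial path. Constraining the mock with `spec=` is one way to surface a hallucinated attribute at access time. An audit report would recommend it as a fix and would not apply it.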