Use when the user wants a test suite audit, test quality or reliability review, regression-protection review, unit/integration/e2e test review, coverage or CI signal assessment, flaky CI investigation, fixture-realism review, spec-drift review, or generated-test validation for AI/LLM/agent-written code. Produces severity-ranked findings for weak assertions, oracle gaps, brittle fixtures, over-mocking, CI trust, and generated-code test risks.
100
Best practices: 100%
Impact: 100% (1.31x average score across 3 eval scenarios)
Passed: no known issues
{
  "context": "Tests whether the agent applies the test-suite audit skill to flakiness, determinism, fixture realism, CI signal quality, and risk-ordered remediation without modifying code.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Repo brief and evidence inventory",
      "description": "Report includes a Repo Brief and an evidence inventory with Reviewed/Partial/Not Provided/Not Applicable statuses before findings, including unavailable CI flake history and production incident evidence",
      "max_score": 10
    },
    {
      "name": "Sleep based flakiness flagged",
      "description": "Report flags tests/test_scheduler.py::test_dispatch_waits_for_worker for using time.sleep or timing assumptions as a flaky-test risk",
      "max_score": 10
    },
    {
      "name": "Shared mutable fixture risk flagged",
      "description": "Report identifies that tests relying on GLOBAL_QUEUE or other shared mutable state can be order-dependent or non-isolated, citing the relevant tests in tests/test_scheduler.py",
      "max_score": 10
    },
    {
      "name": "Random fixture determinism flagged",
      "description": "Report flags unseeded random fixture data in test_batch_reference_is_unique or equivalent tests as reducing reproducibility and failure diagnosis quality",
      "max_score": 8
    },
    {
      "name": "Weak carrier oracle flagged",
      "description": "Report identifies that the carrier dispatch tests mostly verify mocked return shape or truthiness rather than independently checking the carrier payload semantics",
      "max_score": 10
    },
    {
      "name": "CI signal gaps listed",
      "description": "Report explicitly lists missing CI history, rerun/quarantine data, and flaky-test ownership evidence as evidence gaps rather than claiming they were reviewed",
      "max_score": 8
    },
    {
      "name": "Finding contract complete",
      "description": "At least three findings include all six required fields: severity, evidence checked, impact, affected test(s) or behavior, recommended fix, and verification step",
      "max_score": 12
    },
    {
      "name": "Severity proportional",
      "description": "Flakiness and weak carrier oracle findings are High or Medium based on release-risk impact, but are not marked Critical without separate safety/security/data-loss gate evidence",
      "max_score": 8
    },
    {
      "name": "Coverage not treated as proof",
      "description": "Report does not treat a passing local test suite, green runs, or coverage presence as proof that the tests detect meaningful faults",
      "max_score": 6
    },
    {
      "name": "No code modification",
      "description": "The workspace output contains only audit_report.md; no source files, test files, fixtures, or CI configuration are edited or created",
      "max_score": 8
    },
    {
      "name": "Remediation sequenced by risk",
      "description": "Prioritized remediation starts with stabilizing flaky/order-dependent tests and strengthening dispatch oracles before lower-risk cleanup; it does not recommend a broad rewrite as the first action",
      "max_score": 10
    }
  ]
}
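
The rubric above declares its type as weighted_checklist and gives each item a max_score; the eleven weights sum to 100, so earned points read directly as a percentage. How the eval harness actually combines them is not stated here. The sketch below shows one plausible aggregation, assuming each item earns between 0 and its max_score and that the final score is earned points divided by available points. The function name score_checklist, the abbreviated three-item rubric, and the awarded values are hypothetical and used only for illustration.

```python
import json

# Abbreviated copy of the rubric: item names and max_score values are taken
# from the checklist above; everything else here is illustrative.
RUBRIC_JSON = """
{
  "type": "weighted_checklist",
  "checklist": [
    {"name": "Repo brief and evidence inventory", "max_score": 10},
    {"name": "Finding contract complete", "max_score": 12},
    {"name": "Coverage not treated as proof", "max_score": 6}
  ]
}
"""


def score_checklist(rubric: dict, awarded: dict) -> float:
    """Return the fraction of available points earned, in [0, 1].

    Assumption (not confirmed by the source): the score is the sum of
    earned points divided by the sum of max_score values.
    """
    items = rubric["checklist"]
    total_max = sum(item["max_score"] for item in items)
    if total_max == 0:
        raise ValueError("rubric has no scorable items")
    earned = 0.0
    for item in items:
        points = awarded.get(item["name"], 0.0)  # ungraded items earn nothing
        # Clamp so a grade can neither exceed the item's weight nor go negative.
        earned += min(max(points, 0.0), item["max_score"])
    return earned / total_max


rubric = json.loads(RUBRIC_JSON)
# Hypothetical grades: full marks on the first item, half on the second,
# and no grade recorded for the third.
example_awards = {
    "Repo brief and evidence inventory": 10,
    "Finding contract complete": 6,
}
example_score = score_checklist(rubric, example_awards)
```

Clamping each item to its own max_score keeps one generous grade from masking a missed item elsewhere, which matters for a rubric like this one where "No code modification" and "Coverage not treated as proof" guard against distinct failure modes.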