Use when the user wants a test suite audit, test quality or reliability review, regression-protection review, unit/integration/e2e test review, coverage or CI signal assessment, flaky CI investigation, fixture-realism review, spec-drift review, or generated-test validation for AI/LLM/agent-written code. Produces severity-ranked findings for weak assertions, oracle gaps, brittle fixtures, over-mocking, CI trust, and generated-code test risks.
Use repo evidence first. Run:
rg --files --hidden | rg '(^|/)(README|package.json|pyproject.toml|go.mod|Cargo.toml|pom.xml|build.gradle|vitest.config|jest.config|pytest.ini|tox.ini|.github/workflows)'
rg --files | rg '(^|/)(test|tests|spec|specs|__tests__|fixtures|mocks)(/|$)'
rg -n "coverage|mutation|stryker|mutmut|flake|retry|quarantine|snapshot|mock|sleep|TODO|generated|AI|LLM" .

Before findings, write a repo brief covering:
- evidence statuses: Reviewed, Partial, Not Provided, or Not Applicable.
- `path:line` evidence.

Every finding at every severity, including Low, must use this block:
- Severity:
- Evidence checked: include `path:line` for local file evidence when available
- Impact:
- Affected tests or behavior:
- Recommended fix:
- Verification step:

Classification shortcuts:
| Signal | Required handling |
|---|---|
| Assertionless or copied oracle only | Not Critical; usually Medium |
| Weak fee, refund, charge, or total oracle on ledger path | High, including copied formulas |
| Type-only, non-null, truthy, or mocked-shape check on ledger path | High |
| Generated tests without build/run/repeat evidence | Do not trust |
| AI-built behavior not checked against intended semantics | Explicit spec-drift finding |
| Fixed sleep or timing assumption | Standalone flakiness finding |
| Shared mutable fixture or global state | Standalone order-dependence finding |
| Unseeded random fixture data | Standalone reproducibility finding |
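The last three rows of the table can be illustrated with a minimal Python sketch. The helper names (`wait_until`, `make_cart`, `sample_amounts`) are hypothetical and do not come from any reviewed repo; they only show the shape of a fix an audit might recommend: poll against a deadline instead of sleeping for a fixed time, build a fresh fixture per test, and seed random fixture data.

```python
import random
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 2.0, interval: float = 0.01) -> bool:
    """Poll a condition until it holds or the deadline passes (replaces a fixed sleep)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    # One last check so a condition that became true at the deadline still counts.
    return condition()


def make_cart() -> dict:
    """Return a fresh fixture on every call so tests cannot leak state into each other."""
    return {"items": [], "total_cents": 0}


def sample_amounts(seed: int, count: int = 5) -> list[int]:
    """Generate fixture amounts from a seeded generator so a failure can be replayed."""
    rng = random.Random(seed)
    return [rng.randint(1, 10_000) for _ in range(count)]
```

Each helper targets one signal separately, which matches the table's requirement that each of these be reported as a standalone finding rather than folded into a general "flaky tests" note.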
Coverage percentage, green CI, TODOs, and aspirational docs are not proof of fault detection. Do not claim mutation, coverage, CI, flake, requirement, or production-defect evidence was reviewed when unavailable.
All linked files are bundled under references/; load only the named file needed for the current step.
Use `report-template.md`. If a severity section has no findings, keep the heading and write "None found from available evidence." Sequence remediation by risk, dependency, and verification value; do not recommend broad rewrites before the highest-risk weak signal is isolated.
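The empty-section rule can be sketched in Python as follows. This is not the layout of the actual report template, which is not reproduced here: the `##` heading style, the `Finding` class, and the `render_report` function are assumptions for illustration, while the field labels follow the required finding block.

```python
from dataclasses import dataclass

SEVERITIES = ["Critical", "High", "Medium", "Low"]
EMPTY_SECTION = "None found from available evidence."


@dataclass
class Finding:
    severity: str
    evidence_checked: str
    impact: str
    affected: str
    recommended_fix: str
    verification_step: str

    def render(self) -> str:
        """Render one finding using the required block labels."""
        return "\n".join([
            f"- Severity: {self.severity}",
            f"- Evidence checked: {self.evidence_checked}",
            f"- Impact: {self.impact}",
            f"- Affected tests or behavior: {self.affected}",
            f"- Recommended fix: {self.recommended_fix}",
            f"- Verification step: {self.verification_step}",
        ])


def render_report(findings: list[Finding]) -> str:
    """Emit every severity heading, even when a section has no findings."""
    sections = []
    for severity in SEVERITIES:
        matching = [f for f in findings if f.severity == severity]
        body = "\n\n".join(f.render() for f in matching) if matching else EMPTY_SECTION
        sections.append(f"## {severity}\n{body}")  # heading style is an assumption
    return "\n\n".join(sections)
```

Iterating over the fixed severity list, rather than over the findings, is what keeps the headings for empty sections in the output.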
For AI-assisted codebases, make the LLM and Generated-Test Notes section compare intended behavior against what generated tests actually validate.
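One way to picture that comparison is the Python sketch below. The `refund_cents` function, the rule that a refund is capped at the original charge, and the case table are all hypothetical. The example contrasts a generated-style test that copies the implementation's formula with a small table of intended outcomes written independently of the code.

```python
def refund_cents(charged_cents: int, requested_cents: int) -> int:
    """Hypothetical generated implementation that drifted from the intended rule.

    Intended (assumed) rule: a refund never exceeds the original charge.
    """
    return requested_cents  # drift: the cap at charged_cents is missing


def copied_oracle_test() -> bool:
    """Mirrors the implementation, so it validates the code against itself."""
    charged, requested = 5_000, 7_500
    expected = requested  # same formula as the code under test
    return refund_cents(charged, requested) == expected


# Intended behavior, written independently of the code: (charged, requested, expected).
INTENDED_CASES = [
    (5_000, 2_000, 2_000),  # partial refund
    (5_000, 5_000, 5_000),  # full refund
    (5_000, 7_500, 5_000),  # over-request is capped at the charge
]


def spec_oracle_failures() -> list[tuple[int, int, int, int]]:
    """Return (charged, requested, expected, actual) for every disagreeing case."""
    return [
        (charged, requested, expected, refund_cents(charged, requested))
        for charged, requested, expected in INTENDED_CASES
        if refund_cents(charged, requested) != expected
    ]
```

Here `copied_oracle_test()` passes even though the cap is missing, while `spec_oracle_failures()` reports the over-request case. That gap between what the test validates and what was intended is what a spec-drift finding should record.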
Optional deep dives: evidence-inventory.md for evidence statuses and sampling prompts, audit-domains.md for domain checks, guardrails-and-success.md for severity guardrails.