
sharaf/codebase-test-suite-audit

Use when the user wants a test suite audit, test quality or reliability review, regression-protection review, unit/integration/e2e test review, coverage or CI signal assessment, flaky CI investigation, fixture-realism review, spec-drift review, or generated-test validation for AI/LLM/agent-written code. Produces severity-ranked findings for weak assertions, oracle gaps, brittle fixtures, over-mocking, CI trust, and generated-code test risks.

Quality: 100% (Does it follow best practices?)

Impact: 100%, 1.31x (average score across 3 eval scenarios)

Security (by Snyk): Passed, no known issues


evals/scenario-3/criteria.json

{
  "context": "Tests whether the agent applies the test-suite audit skill to flakiness, determinism, fixture realism, CI signal quality, and risk-ordered remediation without modifying code.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Repo brief and evidence inventory",
      "description": "Report includes a Repo Brief and an evidence inventory with Reviewed/Partial/Not Provided/Not Applicable statuses before findings, including unavailable CI flake history and production incident evidence",
      "max_score": 10
    },
    {
      "name": "Sleep based flakiness flagged",
      "description": "Report flags tests/test_scheduler.py::test_dispatch_waits_for_worker for using time.sleep or timing assumptions as a flaky-test risk",
      "max_score": 10
    },
    {
      "name": "Shared mutable fixture risk flagged",
      "description": "Report identifies that tests relying on GLOBAL_QUEUE or other shared mutable state can be order-dependent or non-isolated, citing the relevant tests in tests/test_scheduler.py",
      "max_score": 10
    },
    {
      "name": "Random fixture determinism flagged",
      "description": "Report flags unseeded random fixture data in test_batch_reference_is_unique or equivalent tests as reducing reproducibility and failure diagnosis quality",
      "max_score": 8
    },
    {
      "name": "Weak carrier oracle flagged",
      "description": "Report identifies that the carrier dispatch tests mostly verify mocked return shape or truthiness rather than independently checking the carrier payload semantics",
      "max_score": 10
    },
    {
      "name": "CI signal gaps listed",
      "description": "Report explicitly lists missing CI history, rerun/quarantine data, and flaky-test ownership evidence as evidence gaps rather than claiming they were reviewed",
      "max_score": 8
    },
    {
      "name": "Finding contract complete",
      "description": "At least three findings include all six required fields: severity, evidence checked, impact, affected test(s) or behavior, recommended fix, and verification step",
      "max_score": 12
    },
    {
      "name": "Severity proportional",
      "description": "Flakiness and weak carrier oracle findings are High or Medium based on release-risk impact, but are not marked Critical without separate safety/security/data-loss gate evidence",
      "max_score": 8
    },
    {
      "name": "Coverage not treated as proof",
      "description": "Report does not treat a passing local test suite, green runs, or coverage presence as proof that the tests detect meaningful faults",
      "max_score": 6
    },
    {
      "name": "No code modification",
      "description": "The workspace output contains only audit_report.md; no source files, test files, fixtures, or CI configuration are edited or created",
      "max_score": 8
    },
    {
      "name": "Remediation sequenced by risk",
      "description": "Prioritized remediation starts with stabilizing flaky/order-dependent tests and strengthening dispatch oracles before lower-risk cleanup; it does not recommend a broad rewrite as the first action",
      "max_score": 10
    }
  ]
}
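
The "weighted_checklist" type and the per-item "max_score" values suggest that a report's score is some weighted sum over the checklist. The sketch below shows one plausible aggregation, assuming the grader awards each item between 0 and its max_score and the final score is the awarded total divided by the maximum total. The function name weighted_score, the awarded mapping, and the clamping behavior are hypothetical; the source does not specify how the evaluation harness actually computes the score.

```python
import json

# Excerpt of the checklist: names and max_score values are copied from
# three of the criteria in the file; the rest are omitted for brevity.
CRITERIA = json.loads("""
{
  "type": "weighted_checklist",
  "checklist": [
    {"name": "Sleep based flakiness flagged", "max_score": 10},
    {"name": "Finding contract complete", "max_score": 12},
    {"name": "Coverage not treated as proof", "max_score": 6}
  ]
}
""")


def weighted_score(criteria, awarded):
    """Return awarded points as a fraction of the total max_score.

    Assumed aggregation (not confirmed by the source): items missing
    from `awarded` count as 0, and each award is clamped to the range
    [0, max_score] for its item.
    """
    checklist = criteria["checklist"]
    total = sum(item["max_score"] for item in checklist)
    earned = 0
    for item in checklist:
        points = awarded.get(item["name"], 0)
        earned += max(0, min(points, item["max_score"]))
    return earned / total


# Hypothetical grader output: full marks on one item, partial on another,
# nothing on the third.
awarded = {
    "Sleep based flakiness flagged": 10,
    "Finding contract complete": 8,
}
score = weighted_score(CRITERIA, awarded)
print(f"{score:.2%}")  # 18 of 28 points -> 64.29%
```

The eleven max_score values in the full file sum to 100, so under this assumed aggregation the awarded total can be read directly as a percentage.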
