
sharaf/codebase-test-suite-audit

Use when the user wants a test suite audit, test quality or reliability review, regression-protection review, unit/integration/e2e test review, coverage or CI signal assessment, flaky CI investigation, fixture-realism review, spec-drift review, or generated-test validation for AI/LLM/agent-written code. Produces severity-ranked findings for weak assertions, oracle gaps, brittle fixtures, over-mocking, CI trust, and generated-code test risks.

Quality: 100% (Does it follow best practices?)

Impact: 100%, 1.31x (average score across 3 eval scenarios)

Security (by Snyk): Passed, no known issues


evals/scenario-1/criteria.json

{
  "context": "Tests whether the agent follows the skill's audit workflow: producing a repo brief first, building an evidence inventory table, using the required report structure, writing findings that include all required fields, applying the severity guide correctly, and avoiding prohibited behaviours such as modifying code or over-claiming reviewed evidence.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Repo brief present",
      "description": "The report contains a repo brief section before any findings, covering at minimum: product/domain description, language and test runner, test directories, and a note on unavailable evidence (e.g. CI history, mutation reports)",
      "max_score": 6
    },
    {
      "name": "Evidence inventory table",
      "description": "The report contains a table (or structured list) that assigns a status label — Reviewed, Partial, Not Provided, or Not Applicable — to each evidence area before findings are listed",
      "max_score": 8
    },
    {
      "name": "Required report sections",
      "description": "The report contains all required sections: Executive Summary, Evidence Reviewed, Test System Map, Critical/High/Medium/Low Findings, Domain-by-Domain Assessment, LLM and Generated-Test Notes, CI Signal and Flakiness Notes, Coverage/Mutation/Oracle Notes, Prioritized Remediation Plan, Open Evidence Gaps",
      "max_score": 10
    },
    {
      "name": "Finding contract fields",
      "description": "Each finding in the report includes all of: severity, evidence checked, impact, affected test(s) or behavior, recommended fix, and verification step",
      "max_score": 10
    },
    {
      "name": "Correct severity classification",
      "description": "Findings are classified using the four-level severity guide (Critical/High/Medium/Low) with appropriate placement — assertionless tests or implementation-copying oracles are not rated Critical; weak oracle issues on a financial total are rated at least High",
      "max_score": 8
    },
    {
      "name": "Concrete file references",
      "description": "Findings cite specific file names and line numbers (e.g. tests/test_parser.py:line) rather than only making broad claims about 'the test suite'",
      "max_score": 8
    },
    {
      "name": "Identifies assertionless tests",
      "description": "The report flags at least one test that runs without any meaningful assertion (such as test_empty_invoice or test_full_parse_no_assertion) as a quality issue",
      "max_score": 8
    },
    {
      "name": "Identifies self-referential oracle",
      "description": "The report identifies at least one test whose expected value is derived from the implementation under test rather than an independent source (e.g. test_invoice_number or test_total_is_numeric)",
      "max_score": 8
    },
    {
      "name": "Does not modify code",
      "description": "The output file(s) contain only analysis and recommendations; the workspace contains no rewritten test files, no new test code, and no source code edits",
      "max_score": 8
    },
    {
      "name": "Coverage not treated as proof",
      "description": "The report explicitly states that coverage percentage or a passing suite is not treated as evidence of fault detection quality",
      "max_score": 6
    },
    {
      "name": "TODOs not credited",
      "description": "The report does not count TODO comments in the test file (e.g. 'TODO: add tests for malformed invoices') as implemented or planned test evidence",
      "max_score": 6
    },
    {
      "name": "Open evidence gaps listed",
      "description": "The Open Evidence Gaps section names at least two categories of missing evidence (e.g. mutation reports, CI history, production defect records) without claiming they were reviewed",
      "max_score": 8
    },
    {
      "name": "Remediation sequenced by risk",
      "description": "The Prioritized Remediation Plan lists actions ordered by risk or impact rather than alphabetically or by file order, and does not recommend a full test rewrite as the first or only action",
      "max_score": 6
    }
  ]
}
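
The thirteen `max_score` values above sum to 100, so each item's weight can be read directly as a percentage of the total. The page does not say how the evaluator combines per-item scores, so the following is only a minimal sketch of one plausible aggregation for a `weighted_checklist`. The helper names (`total_max_score`, `weighted_percentage`), the three-item excerpt, and the awarded points are all hypothetical.

```python
import json

# Hypothetical three-item excerpt with the same shape as criteria.json above.
CRITERIA_JSON = """
{
  "type": "weighted_checklist",
  "checklist": [
    {"name": "Repo brief present", "max_score": 6},
    {"name": "Evidence inventory table", "max_score": 8},
    {"name": "Required report sections", "max_score": 10}
  ]
}
"""


def total_max_score(criteria):
    """Sum of max_score over every checklist item."""
    return sum(item["max_score"] for item in criteria["checklist"])


def weighted_percentage(criteria, awarded):
    """Percentage of available points earned.

    `awarded` maps an item name to the points given for it (an assumed
    input format). Missing items count as 0, and each value is clamped
    to the range [0, max_score] so one item cannot dominate the result.
    """
    total = total_max_score(criteria)
    if total == 0:
        return 0.0
    earned = 0
    for item in criteria["checklist"]:
        points = awarded.get(item["name"], 0)
        earned += max(0, min(points, item["max_score"]))
    return 100.0 * earned / total


criteria = json.loads(CRITERIA_JSON)
# Hypothetical grading: full marks on the first item, half on the second,
# nothing recorded for the third.
awarded = {"Repo brief present": 6, "Evidence inventory table": 4}

print(total_max_score(criteria))
print(round(weighted_percentage(criteria, awarded), 2))
```

Under this assumed scheme, a report that misses a 10-point item such as "Required report sections" loses more than one that misses a 6-point item such as "Repo brief present", which matches the relative emphasis the weights suggest.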

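Two checklist items expect the audit to flag assertionless tests and self-referential oracles. The scenario's actual test file is not shown on this page, so the sketch below illustrates the two patterns with an entirely hypothetical parser (`parse_total`) that has a deliberate bug. It shows why both weak patterns keep passing while a test with an independently known expected value fails.

```python
# Hypothetical parser with a deliberate bug: it truncates the cents.
def parse_total(text):
    raw = text.split("TOTAL:")[1].strip()
    return float(int(float(raw)))  # bug: 19.99 becomes 19.0


SAMPLE = "INVOICE 42\nTOTAL: 19.99"


def assertionless_test():
    # Executes the code (and raises coverage) but checks nothing.
    parse_total(SAMPLE)


def self_referential_test():
    # The expected value comes from the implementation under test,
    # so the comparison holds even when the implementation is wrong.
    expected = parse_total(SAMPLE)
    assert parse_total(SAMPLE) == expected


def independent_oracle_test():
    # The expected value is read off the fixture by hand.
    assert parse_total(SAMPLE) == 19.99


def run(test):
    """Return 'pass' or 'fail' for a single test function."""
    try:
        test()
        return "pass"
    except AssertionError:
        return "fail"


results = {
    t.__name__: run(t)
    for t in (assertionless_test, self_referential_test, independent_oracle_test)
}
print(results)
```

Only `independent_oracle_test` detects the truncation bug. This is also the reasoning behind the "Coverage not treated as proof" item: all three functions execute the same line of `parse_total`, yet only one of them can fail.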