
coding-agent-helpers/regression-scout

Use when the user wants regression hunting after a change. Identify nearby flows, shared code paths, error states, and configuration edges that may have broken even if the main fix works. Good triggers include "check for regressions", "what else might this have broken", and "test the surrounding area".

96

Quality: 94% (Does it follow best practices?)

Impact: 98%, 2.72x (average score across 8 eval scenarios)

Security (by Snyk): Passed, no known issues


evals/scenario-8/criteria.json

{
  "context": "The agent was asked to produce a regression scout report (report.md) for a Python batch job that was changed from reading all records at once to chunked OFFSET/LIMIT reads. The criteria evaluate whether the report specifically checks persistence correctness, performance/timing edges, and data integrity rather than just confirming chunked reading works.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Has Change Surface section",
      "description": "The report.md file contains a '### Change Surface' section heading",
      "max_score": 7
    },
    {
      "name": "Has Regression Checks section",
      "description": "The report.md file contains a '### Regression Checks' section heading",
      "max_score": 7
    },
    {
      "name": "Has Findings section",
      "description": "The report.md file contains a '### Findings' section heading",
      "max_score": 7
    },
    {
      "name": "Has Risk Left Open section",
      "description": "The report.md file contains a '### Risk Left Open' section heading",
      "max_score": 7
    },
    {
      "name": "Change Surface identifies process_transactions.py or chunked reading",
      "description": "The Change Surface section identifies process_transactions.py, the chunked reading change, or the OFFSET/LIMIT approach as the change surface",
      "max_score": 8
    },
    {
      "name": "Regression Checks includes data correctness check",
      "description": "The Regression Checks section includes a check on data correctness: duplicate transaction_id handling via upsert across chunk boundaries, NULL merchant_id handling across chunks, or per-merchant summary aggregation correctness when the same merchant appears in multiple chunks",
      "max_score": 10
    },
    {
      "name": "Regression Checks includes persistence path check",
      "description": "The Regression Checks section includes a check on a persistence path: database upsert behavior for processed_transactions, the report file (daily_summary.json) output correctness when written from chunked data, or partial writes if the job is interrupted mid-run",
      "max_score": 10
    },
    {
      "name": "Regression Checks includes performance or timing edge check",
      "description": "The Regression Checks section includes a check on a performance or timing edge: OFFSET query performance degradation at large offsets (e.g. offset 499000), whether the job can complete within the 2-hour maintenance window, or the READ COMMITTED isolation level causing rows to be missed or double-counted across chunks if concurrent writes occur",
      "max_score": 10
    },
    {
      "name": "Regression Checks lists at least 3 checks with results",
      "description": "The Regression Checks section lists at least 3 separate checks, each with an outcome or result stated",
      "max_score": 8
    },
    {
      "name": "Risk Left Open has concrete specific risk",
      "description": "The Risk Left Open section contains a concrete specific risk such as: OFFSET degradation causing the job to exceed the 2-hour window, READ COMMITTED isolation allowing concurrent inserts to cause rows to be skipped between chunks, duplicate upsert behavior changing if the same transaction_id appears in two chunks, or the report file containing partial results if the job fails mid-run",
      "max_score": 8
    },
    {
      "name": "Findings includes explicit verdict",
      "description": "The Findings section includes an explicit verdict — either stating no regressions were found or naming specific regressions identified",
      "max_score": 8
    },
    {
      "name": "Report does not primarily re-verify chunked reading works",
      "description": "The report does NOT dedicate most of its content to confirming that chunked reading produces the same results as the full read — the primary focus is on edge cases, data integrity risks, and performance/timing concerns introduced by the chunking approach",
      "max_score": 10
    }
  ]
}
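The first four checklist items are purely structural, so they can be checked mechanically before any judgment-based grading. The sketch below is one way to do that. It is not the grader that this eval uses, which the page does not describe. The function names `missing_headings` and `total_weight` are hypothetical, and the `MAX_SCORES` list is copied from the checklist above.

```python
import re

# Headings required by the first four checklist items.
REQUIRED_HEADINGS = [
    "Change Surface",
    "Regression Checks",
    "Findings",
    "Risk Left Open",
]

# max_score values in checklist order, copied from criteria.json.
MAX_SCORES = [7, 7, 7, 7, 8, 10, 10, 10, 8, 8, 8, 10]


def missing_headings(report_text: str) -> list[str]:
    """Return the required '### ' headings that the report lacks (hypothetical helper)."""
    found = {
        match.group(1).strip()
        for match in re.finditer(
            r"^###[ \t]+(.+?)[ \t]*$", report_text, flags=re.MULTILINE
        )
    }
    return [heading for heading in REQUIRED_HEADINGS if heading not in found]


def total_weight(scores: list[int]) -> int:
    """Sum the per-item maximum scores of a weighted checklist."""
    return sum(scores)
```

With the weights above, `total_weight(MAX_SCORES)` is 100, so each `max_score` can be read directly as a percentage of the total. A report that passes `missing_headings` earns at most 28 of those points; the remaining 72 depend on what the sections actually say.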

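The timing-edge criterion mentions rows being missed across chunks when concurrent writes occur. The sketch below illustrates that failure mode under simplifying assumptions: it uses an in-memory SQLite database and a single connection, so it simulates a write landing between two chunk reads rather than modelling READ COMMITTED isolation. The `transactions` table, the `read_in_chunks` function, and the `between_chunks` hook are hypothetical and are not taken from `process_transactions.py`.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a hypothetical transactions table holding ids 1 through 6."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transactions (transaction_id INTEGER PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO transactions VALUES (?)", [(i,) for i in range(1, 7)]
    )
    return conn


def read_in_chunks(conn, chunk_size, between_chunks=None):
    """Read all transaction_ids with OFFSET/LIMIT.

    between_chunks, if given, is called with the next offset after each
    chunk. It stands in for a concurrent writer.
    """
    seen = []
    offset = 0
    while True:
        rows = conn.execute(
            "SELECT transaction_id FROM transactions "
            "ORDER BY transaction_id LIMIT ? OFFSET ?",
            (chunk_size, offset),
        ).fetchall()
        if not rows:
            break
        seen.extend(row[0] for row in rows)
        offset += chunk_size
        if between_chunks is not None:
            between_chunks(offset)
    return seen


# Baseline: no writes between chunks, so every row is read once.
baseline = read_in_chunks(make_db(), chunk_size=2)

# Simulated concurrent delete after the first chunk has been read.
conn = make_db()


def delete_first_row_once(next_offset):
    if next_offset == 2:
        conn.execute("DELETE FROM transactions WHERE transaction_id = 1")


with_delete = read_in_chunks(conn, chunk_size=2, between_chunks=delete_first_row_once)
```

In the second run, deleting row 1 shifts every later row up by one position, so the chunk at offset 2 starts at row 4 and row 3 is never read. A regression check in this spirit compares the set of ids seen by the chunked reader against the ids that were present for the whole run.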