Use when the user wants regression hunting after a change. Identify nearby flows, shared code paths, error states, and configuration edges that may have broken even if the main fix works. Good triggers include "check for regressions", "what else might this have broken", and "test the surrounding area".
96
94%
Does it follow best practices?
Impact
98%
2.72x
Average score across 8 eval scenarios
Passed
No known issues
{
  "context": "The agent was asked to produce a regression scout report (report.md) for a TypeScript CLI tool that replaced a custom argument parser with yargs. All existing tests pass. The criteria evaluate whether the report explicitly documents risk areas even when nothing is found to be broken, and whether Risk Left Open is substantive.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Has Change Surface section",
      "description": "The report.md file contains a '### Change Surface' section heading",
      "max_score": 7
    },
    {
      "name": "Has Regression Checks section",
      "description": "The report.md file contains a '### Regression Checks' section heading",
      "max_score": 7
    },
    {
      "name": "Has Findings section",
      "description": "The report.md file contains a '### Findings' section heading",
      "max_score": 7
    },
    {
      "name": "Has Risk Left Open section",
      "description": "The report.md file contains a '### Risk Left Open' section heading",
      "max_score": 7
    },
    {
      "name": "Findings explicitly states none found or equivalent",
      "description": "The Findings section explicitly states that no regressions were found (e.g. 'none found', 'no regressions identified', 'all checks passed'); it does NOT omit a verdict or leave the section blank",
      "max_score": 12
    },
    {
      "name": "Risk Left Open lists at least one concrete specific risk",
      "description": "The Risk Left Open section names at least one concrete specific risk such as: unknown/invalid flags now causing errors instead of being silently ignored, boolean flag defaults being undefined instead of false in the deploy command, short flag aliases not working for non-aliased flags, or shell completion scripts being incompatible with yargs",
      "max_score": 12
    },
    {
      "name": "Risk Left Open does not just say none or no risk",
      "description": "The Risk Left Open section does NOT simply say 'none', 'no risk', 'nothing to report', or equivalent; it must name at least one concrete area even though no regressions were confirmed",
      "max_score": 10
    },
    {
      "name": "Regression Checks lists at least 3 checks with results",
      "description": "The Regression Checks section lists at least 3 separate checks, each with an outcome or result stated",
      "max_score": 10
    },
    {
      "name": "Change Surface identifies parser change or parser.ts",
      "description": "The Change Surface section identifies parser.ts, the argument parser replacement, or the yargs migration as the change surface",
      "max_score": 8
    },
    {
      "name": "Report covers at least one documented parser behavior difference",
      "description": "The report addresses at least one of the three documented parser behavior differences: unknown flag handling (silent ignore vs error), short flag alias support, or boolean flag defaults (false vs undefined)",
      "max_score": 10
    },
    {
      "name": "Report does not primarily re-verify refactor works",
      "description": "The report does NOT dedicate most of its content to confirming that the yargs refactor produces the same outputs as the old parser; the primary focus is on identifying gaps and risks introduced by behavioral differences",
      "max_score": 10
    }
  ]
}
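
The rubric above could be applied mechanically in part. The following is a minimal sketch, not the evaluator's actual implementation: it checks a report for the four section headings the checklist names and totals awarded points against each item's `max_score`. The field names (`name`, `description`, `max_score`) come from the JSON above; the function names, the `awarded` map, and the scoring rule (earned points divided by possible points) are assumptions made for illustration.

```typescript
// Shape of one rubric entry, taken from the JSON above.
interface ChecklistItem {
  name: string;
  description: string;
  max_score: number;
}

// The four section headings the rubric requires in report.md.
const REQUIRED_HEADINGS = [
  "### Change Surface",
  "### Regression Checks",
  "### Findings",
  "### Risk Left Open",
];

// Hypothetical helper: returns the required headings that do not
// appear as a full (trimmed) line in the report text.
function missingHeadings(report: string): string[] {
  const lines = new Set(report.split(/\r?\n/).map((line) => line.trim()));
  return REQUIRED_HEADINGS.filter((heading) => !lines.has(heading));
}

// Hypothetical helper: sums awarded points per item, clamped to
// [0, max_score], and reports the total as a percentage.
// ASSUMPTION: the real evaluator may weight or aggregate differently.
function weightedScore(
  checklist: ChecklistItem[],
  awarded: Record<string, number>,
): { earned: number; possible: number; percent: number } {
  let earned = 0;
  let possible = 0;
  for (const item of checklist) {
    const raw = awarded[item.name] ?? 0; // unscored items count as zero
    earned += Math.min(Math.max(raw, 0), item.max_score);
    possible += item.max_score;
  }
  const percent = possible === 0 ? 0 : (earned / possible) * 100;
  return { earned, possible, percent };
}
```

The heading checks are the only criteria here that reduce to string matching. Items such as "Risk Left Open lists at least one concrete specific risk" need a judgment about content, so under this sketch their points would be supplied through `awarded` by whoever or whatever grades the report.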