Optimizes AI skills for activation, clarity, and cross-model reliability. Use when creating or editing skill packs, diagnosing weak skill uptake, reducing regressions, tuning instruction salience, improving examples, shrinking context cost, or setting benchmark and release gates for skills. Trigger terms: skill optimization, activation gap, benchmark skill, with/without skill delta, regression, context budget, prompt salience.
87%
Does it follow best practices?
Impact
87%
1.14x average score across 5 eval scenarios
Passed
No known issues
{
  "context": "Tests whether the agent produces a benchmark report in the prescribed format, including the correct table structure, universal failure and regression callouts, and correct interpretation guidance per the benchmark-loop rules.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Table column order",
      "description": "The summary table in benchmark-report.md has columns in this order: Model, Without, With, Delta (or equivalent per-scenario breakdown with those four values)",
      "max_score": 10
    },
    {
      "name": "Delta computed",
      "description": "The Delta column values are computed as (with_skill - without_skill), and at least one is shown as a negative number for regressions",
      "max_score": 10
    },
    {
      "name": "Universal failures section",
      "description": "benchmark-report.md contains a clearly labeled 'Universal failures' section or line (e.g. 'Universal failures (0% with skill):')",
      "max_score": 10
    },
    {
      "name": "Regressions section",
      "description": "benchmark-report.md contains a clearly labeled 'Regressions' section or line (e.g. 'Regressions (negative deltas):')",
      "max_score": 10
    },
    {
      "name": "Identifies ModelC 0% criterion",
      "description": "The report flags ModelC's 'noisy-context-large-diff' score of 0 with skill as a universal or zero-score failure",
      "max_score": 10
    },
    {
      "name": "Identifies ModelB regressions",
      "description": "The report flags ModelB's negative deltas (basic-feature-commit: -1, bug-fix-commit: -1, noisy-context-large-diff: -2, multi-file-refactor: -2) as regressions",
      "max_score": 10
    },
    {
      "name": "High baseline + tiny delta interpretation",
      "description": "Interpretation recommends reducing verbosity or specializing edge-cases for ModelB's high-baseline scenarios (does NOT recommend preserving unchanged)",
      "max_score": 10
    },
    {
      "name": "Low baseline + high delta interpretation",
      "description": "Interpretation recommends preserving and refining for ModelA or similar low-baseline + high-delta patterns found in the data",
      "max_score": 10
    },
    {
      "name": "Low baseline + low skill-on interpretation",
      "description": "Interpretation labels ModelC's pattern as 'skill content is weak or unclear' or equivalent language (NOT just 'model is weak')",
      "max_score": 10
    },
    {
      "name": "Negative delta patch-immediately",
      "description": "Interpretation states that regressions (negative deltas) should be patched immediately or flagged as urgent (not just 'investigate later')",
      "max_score": 10
    }
  ]
}
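
The report shape this checklist looks for can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the skill's actual tooling: the `Result` class, the `render_report` function, and every score in `SAMPLE` are hypothetical placeholders, and the sketch treats any zero with-skill score as a "universal failure" entry, which is how the ModelC checklist item reads. The only parts taken from the rubric are the column order, the delta formula `with_skill - without_skill`, and the two section labels.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One (model, scenario) measurement. Hypothetical structure."""
    model: str
    scenario: str
    without_skill: int  # score with the skill disabled
    with_skill: int     # score with the skill enabled

    @property
    def delta(self) -> int:
        # The rubric defines Delta as (with_skill - without_skill).
        return self.with_skill - self.without_skill


def render_report(results: list[Result]) -> str:
    # Sum scores per model so the summary table has exactly four columns.
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for r in results:
        totals[r.model][0] += r.without_skill
        totals[r.model][1] += r.with_skill

    lines = ["| Model | Without | With | Delta |", "| --- | --- | --- | --- |"]
    for model, (without, with_) in totals.items():
        lines.append(f"| {model} | {without} | {with_} | {with_ - without:+d} |")

    # Assumption: a zero with-skill score counts as a universal failure entry.
    zero_scores = [r for r in results if r.with_skill == 0]
    regressions = [r for r in results if r.delta < 0]

    lines += ["", "Universal failures (0% with skill):"]
    lines += [f"- {r.model} / {r.scenario}" for r in zero_scores] or ["- none"]
    lines += ["", "Regressions (negative deltas):"]
    lines += [f"- {r.model} / {r.scenario}: {r.delta:+d}" for r in regressions] or ["- none"]
    return "\n".join(lines)


# Placeholder scores chosen only to exercise each branch; not real benchmark data.
SAMPLE = [
    Result("ModelA", "bug-fix-commit", 2, 8),
    Result("ModelB", "bug-fix-commit", 9, 8),
    Result("ModelB", "noisy-context-large-diff", 9, 7),
    Result("ModelC", "noisy-context-large-diff", 0, 0),
]

report = render_report(SAMPLE)
print(report)
```

Keeping the callouts separate from the summary table means a regression in a single scenario stays visible even when a model's summed delta is positive.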