Audit and improve skill collections with a 9-dimension scoring framework (Knowledge Delta, Mindset, Anti-Patterns, Specification Compliance, Progressive Disclosure, Freedom Calibration, Pattern Recognition, Practical Usability, Eval Validation), duplication detection, remediation planning, baseline comparison, and CI quality gates; use when evaluating skill quality, generating remediation plans, detecting duplicates, validating artifact conventions, or enforcing publication thresholds.
All skills must use `evals/scenario-NN.md` — one Markdown file per scenario, numbered from 01.
```markdown
# Scenario NN: Title

## User Prompt

"Exact trigger phrase the user would type."

## Expected Behavior

1. Step the agent takes
2. Next step
3. ...

## Success Criteria

- Measurable outcome 1
- Measurable outcome 2

## Failure Conditions

- What a bad agent response looks like
- Another failure mode
```

All four sections are required. Success criteria must be measurable (files created, commands run, specific output verified), never vague ("agent does well").
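The required-sections rule above is easy to lint. Below is a minimal sketch; `missing_sections` is a hypothetical helper name, and it assumes a section counts as present if its exact `##` heading appears anywhere in the file text:

```python
from pathlib import Path

# The four headings every scenario file must contain, per the template above.
REQUIRED_SECTIONS = [
    "## User Prompt",
    "## Expected Behavior",
    "## Success Criteria",
    "## Failure Conditions",
]

def missing_sections(text: str) -> list[str]:
    """Return the required headings absent from a scenario file's text."""
    return [h for h in REQUIRED_SECTIONS if h not in text]

if __name__ == "__main__":
    # Report any scenario file that is missing a required section.
    for path in sorted(Path("evals").glob("scenario-*.md")):
        missing = missing_sections(path.read_text(encoding="utf-8"))
        if missing:
            print(f"{path.name}: missing {', '.join(missing)}")
```

A real linter would likely also check that Success Criteria bullets are non-empty, but substring matching on headings is enough to catch a dropped section.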
Minimum 5 scenarios per skill. Target 7–9 for skills with broad trigger surfaces.
Cover the range `evals/scenario-01.md`, `evals/scenario-02.md`, … `evals/scenario-09.md`: zero-padded two digits, no gaps in numbering.
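The naming rules (zero-padded `scenario-NN.md`, gap-free, minimum of 5) can be checked mechanically. A minimal sketch, with `check_numbering` as an illustrative helper name:

```python
import re

def check_numbering(names: list[str]) -> list[str]:
    """Validate scenario file names: scenario-NN.md, zero-padded,
    numbered 01..N with no gaps, at least 5 scenarios."""
    problems: list[str] = []
    nums: list[int] = []
    for name in sorted(names):
        m = re.fullmatch(r"scenario-(\d{2})\.md", name)
        if not m:
            problems.append(f"{name}: does not match scenario-NN.md")
            continue
        nums.append(int(m.group(1)))
    # Zero-padded names sort lexicographically, so nums is ascending;
    # a gap-free sequence starting at 01 must equal 1..len(nums).
    if nums and nums != list(range(1, len(nums) + 1)):
        problems.append(f"numbering has gaps or does not start at 01: {nums}")
    if len(nums) < 5:
        problems.append(f"only {len(nums)} valid scenarios; minimum is 5")
    return problems
```

Run it over `sorted(p.name for p in Path("evals").glob("*.md"))` and fail the CI gate if the list is non-empty.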
List each scenario file in the `files` array:

```json
{
  "files": [
    "evals/scenario-01.md",
    "evals/scenario-02.md"
  ]
}
```

| Format | Problem |
|---|---|
| `evals/*.yaml` | Not linkable from the `tile.json` `files` array; diverges from the markdown-first convention |
| `evals.md` (single file) | Cannot reference individual scenarios; does not scale beyond 3–4 scenarios |
| `evals/instructions.json` | Meta-artifact from a retired eval framework; remove if present |
| `evals/summary.json` | Retired; remove if present |
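A CI gate can cross-check the `files` array against what is actually on disk. A minimal sketch, assuming the manifest is a `tile.json` sitting at the skill root with a top-level `"files"` list (`unlisted_scenarios` is an illustrative helper name):

```python
import json
from pathlib import Path

def unlisted_scenarios(manifest_path: Path) -> list[str]:
    """Scenario files on disk that the manifest's files array does not list."""
    listed = set(json.loads(manifest_path.read_text(encoding="utf-8"))["files"])
    skill_dir = manifest_path.parent
    # Paths in the files array are skill-relative POSIX paths, per the example above.
    on_disk = sorted(
        p.relative_to(skill_dir).as_posix()
        for p in (skill_dir / "evals").glob("scenario-*.md")
    )
    return [f for f in on_disk if f not in listed]
```

The inverse check (listed files missing from disk) is the same comparison with the sets swapped.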
Example skill directory layout:

```
assets/
evals/
  scenario-01.md
  scenario-02.md
  scenario-03.md
  scenario-04.md
  scenario-05.md
references/
scripts/
```
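An artifact-convention check can flag directories outside this layout. A minimal sketch, assuming the four directories above are the complete top-level convention (`unexpected_dirs` is an illustrative helper name; root-level files such as the skill's own Markdown are left alone):

```python
from pathlib import Path

# Assumed complete set of conventional top-level directories.
EXPECTED_DIRS = {"assets", "evals", "references", "scripts"}

def unexpected_dirs(skill_dir: Path) -> list[str]:
    """Top-level directories that fall outside the expected skill layout."""
    return sorted(
        p.name for p in skill_dir.iterdir()
        if p.is_dir() and p.name not in EXPECTED_DIRS
    )
```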