Interactive skill creation and eval-driven optimization. Triggers: create a skill, make a skill, new skill, scaffold skill, optimize skill, run evals, improve skill. Uses AskUserQuestion for interview; WebSearch for research; Bash for eval execution. Outputs: complete skill directory with SKILL.md, tile.json, evals, and repo integration.
93
94%
Does it follow best practices?
Impact
91%
1.26x Average score across 3 eval scenarios
Passed
No known issues
The AI tooling team at a software consultancy has been running a skill called git-commit-helper for several weeks. This skill helps engineers write clear, consistent git commit messages. Recently the team ran a round of evaluations to measure how much the skill actually improves agent behaviour compared to a baseline (no skill).
The results are mixed. Some scenarios show strong improvement, but one scenario scores worse with the skill than the baseline, and several criteria show zero improvement even with the skill present. The team lead wants to understand what's going wrong and get a concrete action plan before the next sprint.
Your job is to analyse the eval results, record the outcome in the project's benchmark log, and produce a prioritised list of specific proposed edits to address the failures. The analysis should be ready to hand to the engineer who will actually make the edits.
- benchmark-log.md: update this file with the new eval run results (the existing file is provided; preserve its history)
- optimization-proposals.md: a prioritised list of specific proposed edits to SKILL.md, ready for an engineer to action

Do NOT modify SKILL.md or tile.json directly; capture all proposed changes in optimization-proposals.md.
The following files are provided as inputs. Extract them before beginning.
=============== FILE: inputs/benchmark-log.md ===============
| Scenario | Baseline | With Skill | Delta |
|---|---|---|---|
| scenario-0: basic commit | 52 | 81 | +29 |
| scenario-1: breaking change | 48 | 74 | +26 |
| scenario-2: merge commit | 61 | 79 | +18 |
Overall: Baseline avg 53.7 → With-skill avg 78.0 | Δ +24.3
| Scenario | Baseline | With Skill | Delta |
|---|---|---|---|
| scenario-0: basic commit | 55 | 83 | +28 |
| scenario-1: breaking change | 50 | 77 | +27 |
| scenario-2: merge commit | 58 | 76 | +18 |
Overall: Baseline avg 54.3 → With-skill avg 78.7 | Δ +24.4
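The two Overall lines in the log can be reproduced from the per-scenario scores, which is a useful check before appending a new run. The following is a minimal sketch of that arithmetic in Python. The function name `overall_line` and the variable names are hypothetical, and the sketch assumes the delta is taken between the averages after each is rounded to one decimal, since that reading matches both logged lines.

```python
def overall_line(rows):
    """Build an 'Overall' summary from (baseline, with_skill) score pairs."""
    base_avg = round(sum(b for b, _ in rows) / len(rows), 1)
    skill_avg = round(sum(s for _, s in rows) / len(rows), 1)
    # Assumption: the delta is the difference of the rounded averages.
    delta = skill_avg - base_avg
    return (
        f"Overall: Baseline avg {base_avg:.1f} → "
        f"With-skill avg {skill_avg:.1f} | Δ {delta:+.1f}"
    )


# Scores copied from the two tables in benchmark-log.md, in file order.
table_1 = [(52, 81), (48, 74), (61, 79)]
table_2 = [(55, 83), (50, 77), (58, 76)]

print(overall_line(table_1))
print(overall_line(table_2))
```

The same function can be applied to the scenario-level scores of a new run to produce a summary line in the log's existing format.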
=============== FILE: inputs/eval-results.json ===============
{
  "date": "2026-04-07",
  "method": "llm-as-judge",
  "model": "claude-opus-4-6",
  "scenarios": [
    {
      "scenario": "scenario-0: basic commit",
      "baseline": 56,
      "withSkill": 82,
      "delta": 26,
      "criteria": [
        { "name": "Conventional commit prefix", "baseline": 60, "withSkill": 90, "delta": 30 },
        { "name": "Subject line length", "baseline": 70, "withSkill": 95, "delta": 25 },
        { "name": "Imperative mood", "baseline": 45, "withSkill": 80, "delta": 35 },
        { "name": "No period at end", "baseline": 55, "withSkill": 75, "delta": 20 },
        { "name": "Security patterns", "baseline": 40, "withSkill": 40, "delta": 0 }
      ]
    },
    {
      "scenario": "scenario-1: breaking change",
      "baseline": 48,
      "withSkill": 71,
      "delta": 23,
      "criteria": [
        { "name": "BREAKING CHANGE footer", "baseline": 30, "withSkill": 75, "delta": 45 },
        { "name": "Scope in prefix", "baseline": 55, "withSkill": 80, "delta": 25 },
        { "name": "Body explains why", "baseline": 40, "withSkill": 65, "delta": 25 },
        { "name": "Security patterns", "baseline": 38, "withSkill": 38, "delta": 0 },
        { "name": "Blank line after subject", "baseline": 75, "withSkill": 90, "delta": 15 }
      ]
    },
    {
      "scenario": "scenario-2: merge commit",
      "baseline": 65,
      "withSkill": 60,
      "delta": -5,
      "criteria": [
        { "name": "Merge commit format", "baseline": 72, "withSkill": 68, "delta": -4 },
        { "name": "Changelog check", "baseline": 60, "withSkill": 48, "delta": -12 },
        { "name": "No squash marker", "baseline": 80, "withSkill": 77, "delta": -3 },
        { "name": "Co-author attribution", "baseline": 50, "withSkill": 62, "delta": 12 },
        { "name": "Blank line after subject", "baseline": 70, "withSkill": 85, "delta": 15 }
      ]
    }
  ]
}
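One way to turn these results into a prioritised list is to compute each criterion's delta and surface everything the skill failed to improve, worst first. The sketch below does that in Python with the per-criterion scores copied from eval-results.json. The names `RESULTS` and `flag_criteria` are hypothetical, and the `delta <= 0` cutoff is an assumption about what counts as a failure, not something the eval output defines.

```python
# Per-criterion (baseline, withSkill) scores copied from eval-results.json.
RESULTS = {
    "scenario-0: basic commit": {
        "Conventional commit prefix": (60, 90),
        "Subject line length": (70, 95),
        "Imperative mood": (45, 80),
        "No period at end": (55, 75),
        "Security patterns": (40, 40),
    },
    "scenario-1: breaking change": {
        "BREAKING CHANGE footer": (30, 75),
        "Scope in prefix": (55, 80),
        "Body explains why": (40, 65),
        "Security patterns": (38, 38),
        "Blank line after subject": (75, 90),
    },
    "scenario-2: merge commit": {
        "Merge commit format": (72, 68),
        "Changelog check": (60, 48),
        "No squash marker": (80, 77),
        "Co-author attribution": (50, 62),
        "Blank line after subject": (70, 85),
    },
}


def flag_criteria(results):
    """Return (delta, scenario, criterion) for criteria the skill did not improve."""
    flagged = []
    for scenario, criteria in results.items():
        for name, (baseline, with_skill) in criteria.items():
            delta = with_skill - baseline
            if delta <= 0:  # assumed cutoff: regressions and zero-delta criteria
                flagged.append((delta, scenario, name))
    # Tuples sort by delta first, so the largest regression comes first.
    return sorted(flagged)


for delta, scenario, name in flag_criteria(RESULTS):
    label = "regression" if delta < 0 else "no effect"
    print(f"{delta:+4d}  {label:<10}  {scenario} / {name}")
```

Sorting by delta puts the three merge-commit regressions ahead of the two zero-delta "Security patterns" entries, which gives a starting order for the proposals rather than a final ranking.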
Helps engineers write clear, consistent git commit messages following Conventional Commits.
Read the diff and identify the type of change: feat, fix, docs, refactor, chore, etc.
Write a subject under 72 chars using imperative mood with the correct prefix.
For non-trivial changes, explain why the change was made, not what files changed.
For breaking changes, add a BREAKING CHANGE footer. For co-authored commits, add Co-authored-by lines.
Input: staged changes adding a new API endpoint for user authentication
Output:
feat(auth): add user authentication endpoint

Adds POST /api/auth/login that validates credentials against the user
store and returns a JWT. Replaces the legacy session-based flow.
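The mechanical parts of these rules (prefix, subject length, no trailing period, blank line after the subject) can be checked automatically. The following Python sketch illustrates one way to do so. The names `TYPES`, `SUBJECT_RE`, and `check_message` are hypothetical, the type list covers only the types named in the skill text, and imperative mood is not checked because it needs language understanding rather than pattern matching.

```python
import re

# Assumption: only the types named in the skill text; projects often allow more.
TYPES = ("feat", "fix", "docs", "refactor", "chore")
# type, optional (scope), optional "!" for breaking changes, then ": description"
SUBJECT_RE = re.compile(r"^(" + "|".join(TYPES) + r")(\([\w-]+\))?!?: \S.*$")


def check_message(message):
    """Return a list of rule violations; an empty list means the message passes."""
    lines = message.split("\n")
    subject = lines[0]
    problems = []
    if not SUBJECT_RE.match(subject):
        problems.append("subject lacks a '<type>(<scope>): ' prefix")
    if len(subject) >= 72:
        problems.append("subject is not under 72 characters")
    if subject.endswith("."):
        problems.append("subject ends with a period")
    if len(lines) > 1 and lines[1].strip():
        problems.append("no blank line between subject and body")
    return problems


example = (
    "feat(auth): add user authentication endpoint\n"
    "\n"
    "Adds POST /api/auth/login that validates credentials against the user\n"
    "store and returns a JWT. Replaces the legacy session-based flow."
)
print(check_message(example))  # → []
print(check_message("Added login endpoint."))
```

The second call reports a missing prefix and a trailing period, which shows how a non-conforming message would be reported.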