Interactive skill creation and eval-driven optimization. Triggers: create a skill, make a skill, new skill, scaffold skill, optimize skill, run evals, improve skill. Uses AskUserQuestion for interview; WebSearch for research; Bash for eval execution. Outputs: complete skill directory with SKILL.md, tile.json, evals, and repo integration.
Use these patterns when generating eval scenarios and analyzing eval results.
For each scenario, capture:
Every skill should have:
**CLI-generated scenarios:** when scenarios come from `tessl scenario generate` or `tessl scenario download`, treat them as a starting set. Re-read the checklist above after download; if anything is missing (e.g. noisy-context or omission stress), add or extend scenarios by hand under `evals/<slug>/` so the matrix still matches this section before relying on eval results for optimization.
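A minimal sketch of that post-download check, assuming each scenario file under `evals/<slug>/` encodes the stress it exercises in its file name (e.g. `noisy-context-01.md`); the category names here are illustrative, so adjust them to your actual checklist:

```python
import pathlib

# Illustrative stress categories; replace with the checklist this section uses.
REQUIRED_STRESSES = {"noisy-context", "omission"}

def audit_scenarios(evals_dir: str) -> dict:
    """Return {slug: [missing stress categories]} for every skill under evals_dir.

    Assumes a stress category is covered when some scenario file name in
    evals/<slug>/ contains that category string.
    """
    missing = {}
    root = pathlib.Path(evals_dir)
    for slug_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        covered = {
            stress
            for stress in REQUIRED_STRESSES
            for f in slug_dir.iterdir()
            if stress in f.name
        }
        gap = REQUIRED_STRESSES - covered
        if gap:
            missing[slug_dir.name] = sorted(gap)
    return missing
```

An empty result means every slug covers every required stress; anything else is a scenario you should add by hand before trusting eval results.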
Scores in `criteria.json` must sum to 100. Target distribution:
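The sum-to-100 invariant is easy to break during edits, so it is worth checking mechanically. A sketch, assuming a flat `{"criterion-name": score, ...}` layout in `criteria.json` (adapt the extraction if your file nests scores differently):

```python
import json

def check_criteria(path: str) -> None:
    """Raise if the scores in criteria.json do not sum to exactly 100.

    Assumes a flat mapping of criterion name -> numeric score.
    """
    with open(path) as f:
        criteria = json.load(f)
    total = sum(criteria.values())
    if total != 100:
        raise ValueError(f"criteria scores sum to {total}, expected 100")
```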
| Pattern | Signal | Action |
|---|---|---|
| High baseline + tiny delta | Model already knows this | Reduce verbosity or specialize edge-cases |
| Low baseline + high delta | Skill adds strong value | Preserve and refine |
| Low baseline + low with-skill | Skill content is weak or unclear | Rewrite instructions |
| Negative delta | Skill introduces confusion | Patch immediately |
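The table above can be applied mechanically to scored runs. A sketch, assuming scores on a 0-1 scale; the 0.8 "high baseline" and 0.05 "tiny delta" thresholds are illustrative defaults, not values the source prescribes:

```python
def classify(baseline: float, with_skill: float,
             high: float = 0.8, small_delta: float = 0.05) -> str:
    """Map one scenario's baseline / with-skill scores to the action table."""
    delta = with_skill - baseline
    if delta < 0:
        # Skill introduces confusion.
        return "Patch immediately"
    if baseline >= high:
        # Model already knows this when the delta is tiny.
        if delta <= small_delta:
            return "Reduce verbosity or specialize edge-cases"
        return "Preserve and refine"
    if delta > small_delta:
        # Low baseline, high delta: the skill adds strong value.
        return "Preserve and refine"
    # Low baseline and low with-skill score.
    return "Rewrite instructions"
```

Running it over every row of an eval report gives you one recommended action per scenario, which maps directly onto the optimization loop.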
Use this table structure in `benchmark-log.md`:

```markdown
## Run: <ISO-8601 timestamp>

**Method:** tessl-cli | llm-as-judge | **Model:** <model-name>

| Scenario | Baseline | With Skill | Delta |
|----------|----------|------------|-------|
| ... | ... | ... | ... |

**Changes applied:** <summary of edits made before this run>
```
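Entries in this structure can be rendered programmatically so every run is logged consistently. A sketch, assuming scores on a 0-1 scale and `(scenario, baseline, with_skill)` triples as input; the function name is illustrative:

```python
from datetime import datetime, timezone

def format_run(method: str, model: str,
               rows: list, changes: str) -> str:
    """Render one benchmark-log.md entry matching the template above.

    rows: (scenario, baseline, with_skill) triples; the delta column
    is derived rather than supplied, so it can never disagree.
    """
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    lines = [
        f"## Run: {ts}",
        "",
        f"**Method:** {method} | **Model:** {model}",
        "",
        "| Scenario | Baseline | With Skill | Delta |",
        "|----------|----------|------------|-------|",
    ]
    for name, base, skill in rows:
        lines.append(f"| {name} | {base:.2f} | {skill:.2f} | {skill - base:+.2f} |")
    lines.append("")
    lines.append(f"**Changes applied:** {changes}")
    return "\n".join(lines)
```

Appending the returned string to `benchmark-log.md` after each eval run keeps the log append-only and diff-friendly.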