Interactive skill creation and eval-driven optimization. Triggers: create a skill, make a skill, new skill, scaffold skill, optimize skill, run evals, improve skill. Uses AskUserQuestion for interview; WebSearch for research; Bash for eval execution. Outputs: complete skill directory with SKILL.md, tile.json, evals, and repo integration.
Eval scenarios live under skills/<skill-name>/evals/<scenario-slug>/ as task.md + criteria.json. Scenario shape and weighting follow benchmark-loop.
Constraint: Evaluation is read-only for skill sources. Do not edit SKILL.md, tile.json, or rules during eval execution. Changes happen only in the optimizer apply step after user approval, or in separate steps such as tessl skill review (never interleaved with tessl eval run).
Run these from the repository root so paths like ./skills/<skill-name> resolve consistently. If any step fails, capture stderr, report to the user, and fall back to manual scenarios (Phase 3 Path M) and/or Path B below as appropriate.
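The failure-handling convention above can be sketched as a small wrapper. This is a sketch only: the demo command is a portable stand-in, and the commented `tessl` invocation is the real target.

```python
# Minimal sketch of the failure-handling convention: run one CLI step from
# the repo root, capture stderr, and report it instead of failing silently.
import subprocess
import sys

def run_step(cmd):
    """Run one CLI step; on failure, surface stderr for the user report."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"Step failed: {' '.join(cmd)}\nstderr:\n{result.stderr}")
    return result

# Demo uses the current interpreter as a portable stand-in; in real use e.g.:
#   run_step(["tessl", "eval", "run", "./skills/<skill-name>", "--json"])
ok = run_step([sys.executable, "-c", "print('fine')"])
bad = run_step([sys.executable, "-c", "import sys; sys.exit(2)"])
```

On failure the agent reports the captured stderr and then decides between manual scenarios (Path M) and Path B.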
Skill review (mutates skill — run before eval, not during):

```shell
tessl skill review --optimize --yes ./skills/<skill-name>
```

Tile lint (when tile.json exists):

```shell
cd skills/<skill-name> && tessl tile lint
```

Scenario generation — parse the generation id from stdout; do not guess:

```shell
tessl scenario generate ./skills/<skill-name>
```

Download scenarios:

```shell
tessl scenario download <generation>
```

Place evals under the skill. Often the CLI writes ./evals/ at the cwd:

```shell
mv ./evals/ ./skills/<skill-name>/
```

If files land elsewhere, move that directory into skills/<skill-name>/evals/. If evals/ already exists, use AskUserQuestion: replace, merge, or use a temp path — never silently overwrite.
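The relocation rule can be sketched as follows. Paths and the skill name are illustrative, and the replace/merge decision is deferred to AskUserQuestion; here the new scenarios are simply parked at a temp path.

```python
# Sketch: place generated evals under the skill without a silent overwrite.
import shutil
import tempfile
from pathlib import Path

def place_evals(generated: Path, skill_dir: Path) -> Path:
    """Move generated evals into skill_dir/evals, parking them if it exists."""
    dest = skill_dir / "evals"
    if dest.exists():
        # Existing evals: park the new scenarios at a temp path and ask the
        # user (replace / merge / keep temp) rather than overwriting.
        parked = Path(tempfile.mkdtemp()) / "evals"
        shutil.move(str(generated), str(parked))
        return parked
    shutil.move(str(generated), str(dest))
    return dest
```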
Eval run — prefer --json when the agent must parse scores for Phase 5–6:

```shell
tessl eval run ./skills/<skill-name> --json
```

Optional: pin the judge/agent, e.g. `tessl eval run ./skills/<skill-name> --json --agent=claude:claude-opus-4-6`, when your workflow requires a fixed model. Add only the flags you need; `--json` stays the default for machine-readable output.
Detect the CLI: `which tessl` (or equivalent).
From the repo root:

```shell
tessl eval run ./skills/<skill-name> --json
```

Parse the JSON into a normalized result (see Unified schema below). Surface per-scenario totals and per-criterion scores if present in the CLI output.
If the command fails, capture stderr, report to the user, and consider Path B if appropriate.
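As an illustration, a normalizer for the parsed payload might look like the sketch below. The input field names are an assumption about the CLI's output shape, not its documented schema; adapt the mapping to whatever `tessl eval run` actually emits.

```python
# Sketch: map a parsed eval-run payload onto the unified schema below.
# Input keys (scenarios/slug/baseline/withSkill/criteria) are assumed,
# not the documented tessl output -- adapt to the real payload.
from datetime import datetime, timezone

def normalize(payload: dict, model: str) -> dict:
    scenarios = []
    for s in payload.get("scenarios", []):
        baseline = s.get("baseline", 0)
        with_skill = s.get("withSkill", 0)
        scenarios.append({
            "slug": s.get("slug"),
            "baseline": baseline,
            "withSkill": with_skill,
            "delta": with_skill - baseline,
            # Per-criterion detail may be partial; preserve what the CLI gives.
            "criteria": s.get("criteria", []),
        })
    return {
        "date": datetime.now(timezone.utc).isoformat(),
        "method": "tessl-cli",
        "model": model,
        "scenarios": scenarios,
    }
```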
Use when Tessl is missing, eval run fails, or the user explicitly wants judge-only scoring.
For each scenario:
- Use evals/<slug>/task.md as the user task prompt.
- Score against evals/<slug>/criteria.json (a weighted checklist; scores sum to 100).
- Prepend the SKILL.md content, or clearly label it as system/skill context. Record the full output.
- For each criterion (name, max_score, description), have the judge return {"score": <number 0..max_score>, "reasoning": "<brief>"}.
- Aggregate the scenario score as the sum of criterion scores (should align with the checklist weights totaling 100).
Repeat for baseline and with-skill to compute delta (with-skill total minus baseline total) per scenario.
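The aggregation above reduces to summing criterion scores and differencing totals. A minimal sketch — the judge call itself is elided, and the sample scores in the usage are made up:

```python
# Sketch of Path B aggregation: a scenario total is the sum of its criterion
# scores (weights total 100); the delta is with-skill minus baseline.
def scenario_total(criterion_scores):
    """criterion_scores: judge outputs like {"score": n, "reasoning": "..."}."""
    return sum(c["score"] for c in criterion_scores)

def scenario_delta(baseline, with_skill):
    return scenario_total(with_skill) - scenario_total(baseline)
```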
Normalize both paths to:
| Field | Meaning |
|---|---|
| `date` | ISO-8601 timestamp when the run completed |
| `method` | `tessl-cli` or `llm-as-judge` |
| `model` | Model id used for the agent/judge where applicable |
| `scenarios` | Array of `{ "slug", "baseline", "withSkill", "delta", "criteria": [{ "name", "score", "max", ... }] }` |
Per-criterion breakdown may be partial for Path A depending on CLI JSON shape; preserve whatever the tool returns.
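For concreteness, one record in that shape might look like this (all values are illustrative):

```json
{
  "date": "2025-01-15T12:00:00Z",
  "method": "tessl-cli",
  "model": "claude-opus-4-6",
  "scenarios": [
    {
      "slug": "handles-edge-cases",
      "baseline": 61,
      "withSkill": 84,
      "delta": 23,
      "criteria": [
        { "name": "covers empty input", "score": 18, "max": 20 }
      ]
    }
  ]
}
```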
If Path A and Path B both succeed on the same scenarios, log both results to skills/<skill-name>/benchmark-log.md (Phase 6); see the main SKILL.md for the table format and gate checks.