Optimize your skills and tiles: review SKILL.md quality, generate eval scenarios, run evals, compare across models, diagnose gaps, and re-run until scores improve.
Download the generated scenarios using the run ID from Phase 2:
    tessl scenario download --last -o <tile-dir>/evals/
Use --strategy merge to add new scenarios alongside existing ones. It is safe to use even on a first download when evals/ is empty (merge is the default):
    tessl scenario download --last -o <tile-dir>/evals/ --strategy merge
Use --strategy replace only if the user explicitly asked to replace existing scenarios.
    ls <tile-dir>/evals/*/task.md
Show the user the downloaded scenario structure:
Downloaded scenarios:
  evals/
    checkout-flow/
      task.md
      criteria.json
      scenario.json
    webhook-setup/
      task.md
      criteria.json
      scenario.json
Before asking the user, read each criteria.json and task.md yourself and flag these common problems:
Rubric anti-patterns to catch:
- Does task.md contain specific values (version numbers, URLs, class names) that are also rubric criteria? If a criterion just checks whether the agent copied a value from the task prompt, it's a free point. Remove the value from the task or remove the criterion.
- Is no_unrelated_changes included as a criterion? This scores 1 on nearly every solution and doesn't discriminate. Remove it unless the scenario specifically tests scope discipline.
Present your findings and offer review options:
"You can also:
- Review task.md — see what the agent will be asked to do
- Review criteria.json — see what the rubric checks for
- Edit criteria weights — adjust which criteria matter most
- Proceed to eval run — use the scenarios as-is"
If the user wants to review, read and display the relevant files. Apply any edits they request.
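The two rubric anti-patterns described above can be sketched as a small lint pass. This is only an illustration: the real criteria.json schema is not shown here, so the `name` and `description` keys, the `lint_rubric` helper, and the regular expression for "specific values" are all assumptions to adapt.

```python
import re

# Assumed shape: each criterion is a dict with "name" and "description".
# The actual criteria.json layout may differ; adjust the keys accordingly.

# Heuristic for "specific values": URLs, version numbers, CamelCase class names.
SPECIFIC_VALUE = re.compile(
    r"https?://[^\s)]+"                          # URLs
    r"|\bv?\d+\.\d+(?:\.\d+)*\b"                 # versions such as 2.3 or v1.4.0
    r"|\b[A-Z][a-z0-9]+(?:[A-Z][a-z0-9]+)+\b"    # CamelCase identifiers
)


def lint_rubric(task_text, criteria):
    """Return human-readable findings for the two anti-patterns."""
    findings = []
    # Strip trailing sentence punctuation that the URL pattern may swallow.
    task_values = {m.rstrip(".,;:") for m in SPECIFIC_VALUE.findall(task_text)}
    for criterion in criteria:
        name = criterion.get("name", "")
        description = criterion.get("description", "")
        # Anti-pattern 1: the criterion repeats a value given in task.md.
        leaked = sorted(v for v in task_values if v and v in description)
        if leaked:
            findings.append(
                f"{name}: checks values already given in task.md: {', '.join(leaked)}"
            )
        # Anti-pattern 2: a criterion that passes on nearly every solution.
        if name == "no_unrelated_changes":
            findings.append(
                f"{name}: rarely discriminates; keep only if the scenario tests scope discipline"
            )
    return findings
```

A pass like this only surfaces candidates. Whether a flagged criterion is really a free point still needs a read of the task and rubric side by side.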
Before kicking off the eval run, check each scenario for missing environment preparation it clearly needs. This step infers fixtures and setup scripts silently when the tile content makes the source obvious, and only prompts the user when sourcing genuinely can't be determined.
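As a rough sketch of the setup script generation this step involves (a shebang line plus the equivalent of chmod +x), the helper below writes a setup.sh into a scenario directory. The `write_setup_script` name, placing the script at the scenario root, the `set -euo pipefail` line, and leaving an existing setup.sh untouched are assumptions for illustration, not behavior documented here.

```python
import stat
from pathlib import Path


def write_setup_script(scenario_dir, commands):
    """Write an executable setup.sh containing the given shell commands."""
    script = Path(scenario_dir) / "setup.sh"
    if script.exists():
        # Assumption: a setup.sh that already exists is kept as-is.
        return script
    body = "#!/usr/bin/env bash\nset -euo pipefail\n\n" + "\n".join(commands) + "\n"
    script.write_text(body)
    # Equivalent of `chmod +x`: add execute bits to the current mode.
    mode = script.stat().st_mode
    script.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return script
```

For example, `write_setup_script("evals/checkout-flow", ["mkdir -p src"])` would produce an executable script with a single command.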
Read references/phase3-fixtures-and-setup.md for:
- fixtures / setup / existing setup.sh)
- setup.sh generation (with shebang + chmod +x)
Before moving to Phase 4, show the pre-run summary so the user can catch wrong inferences:
Generated X fixtures and Y setup scripts across Z scenarios.
- <scenario>: <what was generated, or "no preparation needed">
...
If the user skipped fixture generation on any scenario (declined to provide a source), list it explicitly so they know that scenario will run in an empty workdir.
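The pre-run summary above could be assembled along these lines. The input shape (a dict per scenario with `fixtures`, `setup_script`, and `skipped` keys) and the `format_prerun_summary` name are hypothetical; only the output template follows the text.

```python
def format_prerun_summary(scenarios):
    """Build the pre-run summary from per-scenario preparation results.

    `scenarios` maps a scenario name to a dict with (assumed) keys:
      fixtures      list of generated fixture file names
      setup_script  True if a setup.sh was generated
      skipped       True if the user declined to provide a fixture source
    """
    total_fixtures = sum(len(s["fixtures"]) for s in scenarios.values())
    total_scripts = sum(1 for s in scenarios.values() if s["setup_script"])
    lines = [
        f"Generated {total_fixtures} fixtures and {total_scripts} setup scripts "
        f"across {len(scenarios)} scenarios."
    ]
    for name, s in scenarios.items():
        if s["skipped"]:
            # Called out explicitly so the user knows the workdir will be empty.
            detail = "fixture generation skipped, will run in an empty workdir"
        else:
            parts = list(s["fixtures"])
            if s["setup_script"]:
                parts.append("setup.sh")
            detail = ", ".join(parts) if parts else "no preparation needed"
        lines.append(f"- {name}: {detail}")
    return "\n".join(lines)
```

Listing skipped scenarios in the same block as the generated ones keeps every scenario visible in one place before the eval run starts.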