Optimize your skills and tiles: review SKILL.md quality, generate eval scenarios, run evals, compare across models, diagnose gaps, and re-run until scores improve.
88
93%
Does it follow best practices?
Impact
88%
1.07x
Average score across 24 eval scenarios
Passed
No known issues
This plugin was archived by the owner on May 19, 2026
Reason: Tile archived: Superseded by tessl/skill-optimizer. Go to https://tessl.io/registry/tessl/skill-optimizer
94%
Run the full optimization cycle for a tile — review best practices, generate eval scenarios, run evals, diagnose gaps, fix, and re-run until scores improve. Use when someone says "optimize my skill", "improve my tile", "run evals", "benchmark my tile", or wants to measure and improve how well a tile helps agents solve tasks.
90%
Generate eval scenarios from a tile, run baseline evals, and present results. Use when setting up evaluation pipelines, running benchmarks, generating test scenarios for a tile, or measuring how well a skill helps agents solve tasks.
90%
Run task evals, analyze results, diagnose failures, apply targeted fixes, and re-run to verify improvements. Use when debugging evaluation scores, fixing failing or regressed criteria, improving tile content after an eval run, or iterating on agent performance test results.
85%
Run task evals across multiple Claude models, compare results side by side, and optimize. Use when you want to understand how a skill performs across different models, identify model-specific gaps versus universal tile issues, or validate a skill before publishing it to the registry.
100%
Review and improve your SKILL.md with actionable recommendations. Reads the skill bundle (SKILL.md + related docs), validates syntax, explains the rubric, and shows before/after scores. Use when reviewing skill quality, improving a skill file, checking skill scoring, making your skill better, or learning the skill rubric. This is the standalone review skill; for the full optimization cycle (review + evals + improve), use the `optimize-skill-performance-and-instructions` skill instead.
Quality
Discovery
89%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is a solid description that clearly communicates both what the skill does and when to use it, with an explicit 'Use when...' clause containing relevant trigger terms. The main weakness is the use of domain-specific jargon ('tile') without clarification, which slightly reduces specificity for those unfamiliar with the terminology. Overall, it performs well across all dimensions and would be distinguishable in a large skill library.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Names several actions ('generate eval scenarios', 'run baseline evals', 'present results') but uses domain-specific jargon like 'tile' without explanation, and the actions are somewhat high-level rather than deeply concrete. | 2 / 3 |
| Completeness | Clearly answers both 'what' (generate eval scenarios from a tile, run baseline evals, present results) and 'when' (explicit 'Use when...' clause covering evaluation pipelines, benchmarks, test scenarios, and measuring skill performance). | 3 / 3 |
| Trigger Term Quality | Includes strong natural trigger terms: 'evaluation pipelines', 'running benchmarks', 'test scenarios', 'tile', 'measuring', 'eval scenarios'. These cover multiple ways a user might phrase requests related to evaluation workflows. | 3 / 3 |
| Distinctiveness Conflict Risk | The combination of 'tile', 'eval scenarios', 'baseline evals', and 'skill helps agents solve tasks' creates a very specific niche that is unlikely to conflict with generic testing or benchmarking skills. | 3 / 3 |
| Total | Passed | 11 / 12 |
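The total row is the sum of the per-dimension scores, each out of 3. A minimal sketch of that tally follows; the function and variable names are hypothetical and not part of any Tessl tooling. The page does not state how the 89% headline is derived, so the sketch does not attempt to reproduce it.

```python
from typing import Dict

# Assumption: every rubric dimension is scored out of 3, as in the table above.
MAX_PER_DIMENSION = 3


def rubric_total(scores: Dict[str, int], max_per_dimension: int = MAX_PER_DIMENSION) -> str:
    """Sum per-dimension scores and format them as 'earned / possible'."""
    for name, score in scores.items():
        if not 0 <= score <= max_per_dimension:
            raise ValueError(f"{name}: score {score} is outside 0..{max_per_dimension}")
    earned = sum(scores.values())
    possible = max_per_dimension * len(scores)
    return f"{earned} / {possible}"


# Scores copied from the Discovery table.
discovery_scores = {
    "Specificity": 2,
    "Completeness": 3,
    "Trigger Term Quality": 3,
    "Distinctiveness Conflict Risk": 3,
}

print(rubric_total(discovery_scores))
```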
Implementation
85%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a well-structured orchestration skill that excels at workflow clarity and progressive disclosure. The scope selection mechanism and phase mapping are particularly strong. The main weakness is that actionability depends almost entirely on the referenced phase files — the SKILL.md itself provides no executable examples or complete commands, making it hard to evaluate whether the full pipeline is truly copy-paste ready without seeing those references.
Suggestions
Include at least one complete CLI command example inline (e.g., a full `tessl scenario generate` invocation with typical flags) so the skill is partially actionable even without loading reference files.
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The content is lean and well-structured. It doesn't explain concepts Claude already knows, avoids unnecessary padding, and every section serves a clear purpose. Time expectations and scope mapping are genuinely useful additions, not filler. | 3 / 3 |
| Actionability | The skill provides a clear decision framework (scope table, phase mapping) and references specific CLI commands like `tessl scenario generate` and `tessl eval run`, but all actual procedures are delegated to reference files. The SKILL.md itself contains no executable code or complete command examples; it's an orchestration overview that depends entirely on external files for concrete guidance. | 2 / 3 |
| Workflow Clarity | The multi-step workflow is clearly sequenced across 6 numbered phases with explicit scope-to-phase mapping. Phase 3 includes quality-check validation (rubric anti-patterns), and the overall flow has clear entry/exit conditions and a 'when to stop' section. The scope selection table elegantly handles partial runs. | 3 / 3 |
| Progressive Disclosure | Excellent progressive disclosure: the SKILL.md serves as a concise overview with one-level-deep references to 6 phase-specific files. Each reference is clearly signaled with a brief description of what the phase covers, and the instruction to skip loading unused reference files for partial runs is a thoughtful touch. | 3 / 3 |
| Total | Passed | 11 / 12 |
Validation
100%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 11 / 11 Passed
Validation for skill structure
No warnings or errors.