Optimize your skills and tiles: review SKILL.md quality, generate eval scenarios, run evals, compare across models, diagnose gaps, and re-run until scores improve.
88
93%
Does it follow best practices?
Impact
88%
1.07x
Average score across 24 eval scenarios
Passed
No known issues
This plugin was archived by the owner on May 19, 2026
Reason: Tile archived: Superseded by tessl/skill-optimizer - go to https://tessl.io/registry/tessl/skill-optimizer
94%
Run the full optimization cycle for a tile — review best practices, generate eval scenarios, run evals, diagnose gaps, fix, and re-run until scores improve. Use when someone says "optimize my skill", "improve my tile", "run evals", "benchmark my tile", or wants to measure and improve how well a tile helps agents solve tasks.
90%
Generate eval scenarios from a tile, run baseline evals, and present results. Use when setting up evaluation pipelines, running benchmarks, generating test scenarios for a tile, or measuring how well a skill helps agents solve tasks.
90%
Run task evals, analyze results, diagnose failures, apply targeted fixes, and re-run to verify improvements. Use when debugging evaluation scores, fixing failing or regressed criteria, improving tile content after an eval run, or iterating on agent performance test results.
85%
Run task evals across multiple Claude models, compare results side-by-side, and optimise. Use when you want to understand how a skill performs across different models, identify model-specific gaps versus universal tile issues, or validate a skill before publishing it to the registry.
100%
Review and improve your SKILL.md with actionable recommendations. Reads skill bundle (SKILL.md + related docs), validates syntax, explains rubric, shows before/after scores. Use when reviewing skill quality, improving a skill file, checking skill scoring, making your skill better, or learning the skill rubric. This is the standalone review skill — for the full optimization cycle (review + evals + improve), use the `optimize-skill-performance-and-instructions` skill instead.
Quality
Discovery
85%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is a strong description that clearly articulates what the skill does and when to use it, with an explicit 'Use when' clause covering multiple trigger scenarios. The main weakness is in trigger term coverage—it uses some domain-specific jargon ('tile issues') and misses common synonyms users might naturally use like 'benchmark', 'test', or 'evaluate'. Overall it would perform well in skill selection among a large set of skills.
Suggestions
Replace the jargon 'tile issues' with clearer language, and add natural trigger terms users might say such as 'benchmark', 'test performance', 'evaluate skill', or 'model comparison'.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific concrete actions: 'Run task evals across multiple Claude models', 'compare results side-by-side', and 'optimise'. These are clear, actionable capabilities. | 3 / 3 |
| Completeness | Clearly answers both 'what' (run task evals, compare results, optimise) and 'when' with an explicit 'Use when' clause covering three specific trigger scenarios: understanding cross-model performance, identifying model-specific gaps, and validating before publishing. | 3 / 3 |
| Trigger Term Quality | Includes some relevant terms like 'task evals', 'models', 'compare', 'skill', 'registry', and 'publishing', but misses common variations users might say such as 'benchmark', 'evaluation', 'testing', 'performance comparison', or 'model selection'. The term 'tile issues' is jargon that users wouldn't naturally use. | 2 / 3 |
| Distinctiveness Conflict Risk | Occupies a clear niche around multi-model eval comparison and skill validation before registry publishing. The combination of cross-model evaluation, side-by-side comparison, and registry publishing creates a distinct identity unlikely to conflict with other skills. | 3 / 3 |
| Total | | 11 / 12 Passed |
Implementation
77%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a strong, highly actionable skill with excellent workflow clarity and concrete commands throughout. Its main weakness is length — the content is thorough but could benefit from splitting diagnostic/interpretation guidance into a referenced file to reduce the token footprint of the main skill. The phased structure with validation checkpoints is exemplary for a complex multi-step process.
Suggestions
Consider extracting Phase 6 (Diagnose and interpret) into a separate DIAGNOSIS.md file referenced from the main skill, reducing the main file's token cost while preserving the detailed diagnostic framework.
Tighten Phase 6.1 baseline interpretation — Claude can infer what high/low baselines mean; reduce to a brief decision table rather than prose explanations.
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is generally well-structured and avoids explaining basic concepts, but it's quite lengthy (~200 lines) with some verbose sections. The diagnostic framework in Phase 6 could be tightened, and some guidance (e.g., explaining what baselines mean) is somewhat obvious for Claude. However, most content is domain-specific and earns its place. | 2 / 3 |
| Actionability | Every phase includes concrete, copy-paste-ready commands (find, tessl eval run, tessl eval view --json, tessl eval retry). Output formats are fully specified with table templates and emoji indicators. The skill leaves no ambiguity about what to execute at each step. | 3 / 3 |
| Workflow Clarity | The 7-phase workflow is clearly sequenced with explicit validation checkpoints: verify tile exists (1.1), verify evals exist (1.2), verify login (1.3), confirm config before running (Phase 2), poll for completion with retry on failure (Phase 4), and a clear 'when to stop' section. Feedback loops are present for failed runs and for re-running after fixes. | 3 / 3 |
| Progressive Disclosure | The content is well-organized with clear phases and sub-sections, but it's entirely monolithic — all ~200 lines live in a single file with no references to external documents. The diagnostic framework (Phase 6) and the detailed table formats (Phase 5) could be split into referenced files. It does mention other skills (optimize-skill-performance, setup-skill-performance) which is good cross-referencing. | 2 / 3 |
| Total | | 10 / 12 Passed |
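As a quick check on the Total rows, the per-dimension scores in the two rubric tables above can be tallied as follows. This is only an illustrative sketch: the dictionary names and the `tally` helper are hypothetical, and it does not attempt to reproduce the headline percentages (85% and 77%), since the page does not state how those are derived.

```python
# Per-dimension scores copied from the Discovery and Implementation tables.
# Each dimension is scored out of 3 (as shown in the "Score" column).
MAX_PER_DIMENSION = 3

discovery = {
    "Specificity": 3,
    "Completeness": 3,
    "Trigger Term Quality": 2,
    "Distinctiveness Conflict Risk": 3,
}

implementation = {
    "Conciseness": 2,
    "Actionability": 3,
    "Workflow Clarity": 3,
    "Progressive Disclosure": 2,
}


def tally(scores):
    """Return (points earned, points available) for one rubric table."""
    return sum(scores.values()), MAX_PER_DIMENSION * len(scores)


discovery_total = tally(discovery)
implementation_total = tally(implementation)

print("Discovery: %d / %d" % discovery_total)
print("Implementation: %d / %d" % implementation_total)
```

The lowest-scoring dimensions (Trigger Term Quality, Conciseness, Progressive Disclosure) are the same ones the Suggestions sections target.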
Validation
100%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 11 / 11 Passed
Validation for skill structure
No warnings or errors.
Reviewed