Optimize your skills and tiles: review SKILL.md quality, generate eval scenarios, run evals, compare across models, diagnose gaps, and re-run until scores improve.
88
93%
Does it follow best practices?
Impact
88%
1.07x
Average score across 24 eval scenarios
Passed
No known issues
This plugin was archived by the owner on May 19, 2026
Reason: Tile archived: Superseded by tessl/skill-optimizer - go to https://tessl.io/registry/tessl/skill-optimizer
94%
Run the full optimization cycle for a tile — review best practices, generate eval scenarios, run evals, diagnose gaps, fix, and re-run until scores improve. Use when someone says "optimize my skill", "improve my tile", "run evals", "benchmark my tile", or wants to measure and improve how well a tile helps agents solve tasks.
90%
Generate eval scenarios from a tile, run baseline evals, and present results. Use when setting up evaluation pipelines, running benchmarks, generating test scenarios for a tile, or measuring how well a skill helps agents solve tasks.
90%
Run task evals, analyze results, diagnose failures, apply targeted fixes, and re-run to verify improvements. Use when debugging evaluation scores, fixing failing or regressed criteria, improving tile content after an eval run, or iterating on agent performance test results.
85%
Run task evals across multiple Claude models, compare results side-by-side, and optimise. Use when you want to understand how a skill performs across different models, identify model-specific gaps versus universal tile issues, or validate a skill before publishing it to the registry.
100%
Review and improve your SKILL.md with actionable recommendations. Reads skill bundle (SKILL.md + related docs), validates syntax, explains rubric, shows before/after scores. Use when reviewing skill quality, improving a skill file, checking skill scoring, making your skill better, or learning the skill rubric. This is the standalone review skill — for the full optimization cycle (review + evals + improve), use the `optimize-skill-performance-and-instructions` skill instead.
Quality
Discovery
100%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is a well-crafted skill description that excels across all dimensions. It provides specific concrete actions, includes natural trigger terms users would actually say, explicitly addresses both what and when, and carves out a distinct niche around evaluation debugging and iteration workflows.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific concrete actions: 'Run task evals, analyze results, diagnose failures, apply targeted fixes, and re-run to verify improvements.' These are clear, actionable capabilities. | 3 / 3 |
| Completeness | Clearly answers both what (run evals, analyze, diagnose, fix, re-run) AND when with explicit 'Use when...' clause covering debugging scores, fixing failures, improving content, and iterating on test results. | 3 / 3 |
| Trigger Term Quality | Includes natural keywords users would say: 'evaluation scores', 'failing', 'regressed criteria', 'eval run', 'agent performance test results', 'debugging'. Good coverage of domain-specific terms. | 3 / 3 |
| Distinctiveness Conflict Risk | Clear niche focused on task evaluation debugging and iteration. Terms like 'eval run', 'regressed criteria', 'agent performance test results' are distinct and unlikely to conflict with general debugging or testing skills. | 3 / 3 |
| Total | | 12 / 12 Passed |
Implementation
77%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a well-structured, highly actionable skill with excellent workflow clarity and validation checkpoints throughout the improvement cycle. The main weakness is length—at ~300 lines, some content could be more concise or split into reference files. The explicit bucket classification system and before/after reporting patterns are particularly strong.
Suggestions
Tighten the bucket definitions in Phase 1.2—the explanations after each bullet are somewhat redundant with the classification logic itself
Consider moving Phase 5 (Scenario Quality Review) to a separate reference file since it's marked as 'Bonus' and adds significant length
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is comprehensive but includes some redundant explanations (e.g., explaining what each bucket means multiple times, verbose example outputs). Some sections could be tightened without losing clarity, though it avoids explaining concepts Claude already knows. | 2 / 3 |
| Actionability | Provides fully executable bash commands throughout, specific classification criteria with exact thresholds (>=80%), concrete example outputs, and copy-paste ready code blocks. The guidance is precise and immediately usable. | 3 / 3 |
| Workflow Clarity | Excellent multi-phase workflow with clear sequencing (Phases 0-5), explicit validation checkpoints (lint after each fix, poll for completion, before/after comparison), and feedback loops (re-run and verify cycle, 'Want me to take another pass?'). | 3 / 3 |
| Progressive Disclosure | Content is well-organized with clear phase headers, but the entire workflow is in one monolithic file. Advanced content like scenario quality review could be split to a separate reference file. The skill does reference a companion skill appropriately. | 2 / 3 |
| Total | | 10 / 12 Passed |
Validation
100%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 11 / 11 Passed
Validation for skill structure
No warnings or errors.
Reviewed