Eval-driven process for improving best-practice skills: analyse eval results, research what agents get wrong, rewrite for maximum uplift, and measure improvement with scenarios.
84%
Does it follow best practices?
Impact: Pending (no eval scenarios have been run)
Advisory: Suggest reviewing before use
Quality
Discovery
85%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is a well-structured description that clearly articulates both capabilities and usage triggers. The main weakness is reliance on domain-specific terminology ('tessl', 'verifiers', 'eval-driven') which may not match natural user language. The explicit 'Use when...' clause and comprehensive action list are strong points.
Suggestions
Add more natural language trigger terms that users might say, such as 'make better', 'refine', 'enhance', or 'fix' alongside the existing technical terms
Consider adding common user phrasings like 'skill isn't working well' or 'improve skill performance' to capture troubleshooting scenarios
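The two suggestions above could be folded directly into the skill's frontmatter. A minimal sketch, assuming a SKILL.md layout with a `name` and `description` field (the skill name shown is hypothetical, and the base description text is quoted from the scoring table below):

```markdown
---
name: skill-improver  # hypothetical name, for illustration only
description: >
  Eval-driven process for improving tessl skills. Covers analysing eval
  results, identifying high-uplift practices, rewriting skills, updating
  verifiers, and measuring improvement with scenarios. Use when asked to
  improve, optimize, iterate on, refine, enhance, make better, or fix a
  tessl tile or skill, when a skill isn't working well or needs better
  performance, or when creating a new best-practice skill from scratch.
---
```

Keeping the existing technical terms while appending the colloquial variants preserves the 3 / 3 Specificity and Completeness scores and only widens the trigger surface.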
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific concrete actions: 'analysing eval results', 'identifying high-uplift practices', 'rewriting skills', 'updating verifiers', and 'measuring improvement with scenarios'. | 3 / 3 |
| Completeness | Clearly answers both what ('Eval-driven process for improving tessl skills...Covers analysing eval results, identifying high-uplift practices, rewriting skills, updating verifiers, and measuring improvement') AND when ('Use when asked to improve, optimize, or iterate on a tessl tile or skill, or when creating a new best-practice skill from scratch'). | 3 / 3 |
| Trigger Term Quality | Includes some natural keywords like 'improve', 'optimize', 'iterate', 'skill', 'tile', but uses domain-specific jargon ('tessl', 'verifiers', 'eval-driven') that users may not naturally say. Missing common variations like 'make better', 'refine', 'enhance'. | 2 / 3 |
| Distinctiveness / Conflict Risk | Clear niche focused specifically on 'tessl skills' and 'tessl tile' with distinct domain terminology. The combination of eval-driven improvement and tessl-specific context makes it unlikely to conflict with general skill improvement or other evaluation skills. | 3 / 3 |
| Total | | 11 / 12 Passed |
Implementation
77%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a high-quality, actionable skill for an eval-driven improvement workflow. Its greatest strengths are the clear phase-based structure with explicit exit criteria and the concrete CLI commands throughout. The main weakness is length: the document tries to be comprehensive, which makes it token-heavy, and some content (such as the extensive anti-patterns section and the detailed leakage examples) could be moved to reference files.
Suggestions
Extract Phase 8 (Audit Scenario Quality) into a separate SCENARIO_QUALITY.md file and reference it from the main skill; at ~100 lines, this section stands on its own as a reference
Move the anti-patterns section to a separate ANTI_PATTERNS.md file, keeping only a brief summary with a link in the main skill
Trim the 'good task formula' and 'proactive application' explanations; they are valuable, but could be condensed to examples with one-line explanations rather than multi-paragraph prose
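Taken together, the three suggestions above amount to a progressive-disclosure restructuring. A sketch of what the slimmed-down sections of the main SKILL.md might look like (the file names come from the suggestions; the heading wording and link style are assumptions):

```markdown
## Anti-patterns

One-line summaries only; full examples and explanations live in
[ANTI_PATTERNS.md](ANTI_PATTERNS.md).

## Phase 8: Audit scenario quality

Run the checklist in [SCENARIO_QUALITY.md](SCENARIO_QUALITY.md), then
return here and continue with Phase 9.
```

This keeps the always-loaded document lean while the detailed guidance stays one hop away for the agent that needs it.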
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is comprehensive but could be tightened. Some sections like the anti-patterns list and the detailed explanations of leakage patterns are valuable but verbose. The skill assumes Claude's competence in most areas but occasionally over-explains concepts like what uplift means. | 2 / 3 |
| Actionability | Excellent actionability with specific CLI commands (`tessl install`, `eval run`, `scenario generate`), concrete code examples for verifier JSON structure, and copy-paste ready templates. Every phase has explicit commands and expected outputs. | 3 / 3 |
| Workflow Clarity | Outstanding workflow structure with 9 clearly sequenced phases, each with explicit goals and exit criteria. Validation checkpoints are built into the process (Phase 6 checks for regressions, Phase 8 audits scenario quality). The feedback loop of eval → diagnose → fix → re-eval is explicit throughout. | 3 / 3 |
| Progressive Disclosure | The skill is a monolithic document (~400 lines) that could benefit from splitting into separate files for each phase or topic area. While internally well-organized with clear headers, there are no references to external files for detailed content like the verifier JSON schema or scenario writing guidelines. | 2 / 3 |
| Total | | 10 / 12 Passed |
Validation
90%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation: 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to `metadata` | Warning |
| Total | | 10 / 11 Passed |
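The single warning asks for unrecognised frontmatter keys to be removed or nested under `metadata`. A hedged sketch of the fix (the report does not name the offending keys, so the key names below are invented purely for illustration):

```markdown
---
name: skill-improver        # hypothetical name
description: Eval-driven process for improving tessl skills...
metadata:
  author: example-author    # formerly a top-level unknown key (assumed)
  version: 0.1.0            # formerly a top-level unknown key (assumed)
---
```

Nesting the extra keys rather than deleting them keeps the information available without tripping the spec's frontmatter validation.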
Reviewed