Analyze eval results, diagnose low-scoring criteria, fix tile content, and re-run evals: the full improvement loop, automated.
94
Does it follow best practices?
Validation for skill structure
Discovery
100%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is an excellent skill description that excels across all dimensions. It provides specific concrete actions, comprehensive trigger terms including natural user phrasings, explicit 'Use when...' guidance, and domain-specific terminology that makes it highly distinctive. The description effectively balances technical precision with user-friendly language.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific concrete actions: 'Analyzes Tessl tile eval results', 'classifies criteria into buckets (working/gap/redundant/regression)', 'diagnoses root causes', 'applies targeted fixes to tile content (SKILL.md, rules, docs)', and 're-runs evals to verify improvements'. | 3 / 3 |
| Completeness | Clearly answers both what (analyzes results, classifies, diagnoses, fixes, re-runs) and when, with an explicit 'Use when...' clause covering multiple trigger scenarios plus example user phrases. | 3 / 3 |
| Trigger Term Quality | Excellent coverage of natural terms users would say: 'debug eval failures', 'improve low evaluation scores', 'fix failing tile tests', 'why is my eval failing', 'how do I fix my tile', 'my test results are bad'. Includes both formal terms and casual phrasings. | 3 / 3 |
| Distinctiveness / Conflict Risk | Highly distinctive, with domain-specific terminology ('Tessl tile eval', 'SKILL.md', 'working/gap/redundant/regression buckets') that creates a clear niche unlikely to conflict with general debugging or testing skills. | 3 / 3 |
| Total | | 12 / 12 (Passed) |
Implementation
77%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a strong, actionable skill with excellent workflow clarity and concrete executable guidance. The multi-phase structure with explicit checkpoints and decision trees makes it easy to follow. Main weaknesses are moderate verbosity (some explanations could be tighter) and all content being inline rather than using progressive disclosure for the longer reference sections.
Suggestions
- Tighten the bucket classification section: the table already defines the conditions clearly, so the repeated explanations in the Phase 1.3 example output are partially redundant.
- Consider moving Phase 5 (Scenario Quality Review) to a separate reference file, since it is marked 'on request' and adds ~30 lines to the main skill.
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is reasonably efficient but includes some redundant explanations (e.g., explaining what each bucket means multiple times, verbose example outputs). The command reference table is lean, but phases could be tightened. | 2 / 3 |
| Actionability | Provides fully executable bash commands throughout, specific classification criteria with exact thresholds (>=80%, <80%), concrete example outputs, and copy-paste ready command sequences for each phase. | 3 / 3 |
| Workflow Clarity | Excellent multi-phase workflow with clear sequencing (Phase 0-5), explicit validation checkpoints (lint after fixes, poll for completion), decision trees for different starting states, and feedback loops (re-run and verify cycle). | 3 / 3 |
| Progressive Disclosure | Content is well-structured with clear phases and tables, but everything is inline in one file. For a skill this long (~200 lines), some content (like the detailed bucket classification rules or scenario quality review) could be split into reference files. | 2 / 3 |
| Total | | 10 / 12 (Passed) |
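The totals in the two rubric tables above are plain sums of the per-dimension scores, each out of 3. A minimal Python sketch of that tally follows. The variable and function names are hypothetical, and the sketch does not attempt to reproduce how the page derives its percentage figures, which is not stated here.

```python
# Per-dimension scores copied from the two rubric tables; each is out of 3.
MAX_PER_DIMENSION = 3

discovery = {
    "Specificity": 3,
    "Completeness": 3,
    "Trigger Term Quality": 3,
    "Distinctiveness Conflict Risk": 3,
}
implementation = {
    "Conciseness": 2,
    "Actionability": 3,
    "Workflow Clarity": 3,
    "Progressive Disclosure": 2,
}


def tally(scores):
    """Return (points earned, points available) for one rubric table."""
    return sum(scores.values()), MAX_PER_DIMENSION * len(scores)


for name, scores in [("Discovery", discovery), ("Implementation", implementation)]:
    earned, available = tally(scores)
    print(f"{name}: {earned} / {available}")
```

Running it prints `Discovery: 12 / 12` and `Implementation: 10 / 12`, matching the Total rows.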
Validation
100%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 11 / 11 Passed
No warnings or errors.
Install with Tessl CLI
npx tessl i experiments/eval-improve@0.5.0