Analyze eval results, diagnose low-scoring criteria, fix tile content, and re-run evals — the full improvement loop automated
Does it follow best practices?
Evaluation — 100%
↑ 1.02x agent success when using this tile
Discovery
43% — Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
The description excels at listing specific, concrete actions with domain-specific terminology such as the bucket classifications. However, it lacks any 'Use when...' guidance, making it difficult for Claude to know when to select this skill from a large pool, and the technical jargon may miss natural user language patterns.
Suggestions
Add a 'Use when...' clause with explicit triggers, e.g., 'Use when the user asks to debug eval failures, analyze test results, or fix tile content issues'
Include natural language variations users might say: 'evaluation results', 'test failures', 'debugging evals', 'why is my eval failing'
Clarify what 'tile content' refers to, as this domain-specific term may not match user vocabulary
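A revised description incorporating these suggestions might look like the following frontmatter sketch (the wording is illustrative, not the skill's actual metadata):

```markdown
---
name: eval-improve
description: >
  Analyze eval results, classify criteria into buckets
  (working/gap/redundant/regression), diagnose root causes, apply
  targeted fixes to tile content, and re-run evals to verify
  improvements. Use when the user asks to debug eval failures,
  analyze test or evaluation results, investigate why an eval is
  failing, or fix tile (skill) content issues.
---
```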
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific, concrete actions: 'analyze eval results', 'classify criteria into buckets (working/gap/redundant/regression)', 'diagnose root causes', 'apply targeted fixes to tile content', and 're-run evals to verify improvements'. | 3 / 3 |
| Completeness | Describes what the skill does but lacks a 'Use when...' clause or any explicit trigger guidance for when Claude should select this skill. Per the rubric, missing trigger guidance caps completeness at 2, and this description has no 'when' component at all. | 1 / 3 |
| Trigger Term Quality | Contains some relevant keywords ('eval results', 'root causes', 'tile content') but leans on technical jargon. Missing common variations users might say, such as 'evaluation', 'test results', 'debugging', or 'fix failing tests'. | 2 / 3 |
| Distinctiveness / Conflict Risk | The specific bucket classifications (working/gap/redundant/regression) and 'tile content' terminology provide some distinctiveness, but 'analyze eval results' and 'diagnose root causes' could overlap with general debugging or testing skills. | 2 / 3 |
| Total | | 8 / 12 — Passed |
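The bucket classification the table refers to can be sketched as a small decision function. This is one plausible reading of the four buckets (the names and rules below are assumptions, not the skill's actual logic): a criterion is *redundant* if it restates another, *working* if it passes, *regression* if it passed before but fails now, and a *gap* if it has never passed.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CriterionResult:
    name: str
    passed_now: bool
    passed_before: bool                  # result from the previous eval run
    duplicate_of: Optional[str] = None   # set when it restates another criterion

def classify(c: CriterionResult) -> str:
    """Assign one of the four buckets; the skill's real rules may differ."""
    if c.duplicate_of:
        return "redundant"
    if c.passed_now:
        return "working"
    if c.passed_before:
        return "regression"  # used to pass, broke after a recent change
    return "gap"             # never passed: content is genuinely missing

# Hypothetical criterion names, for illustration only:
results = [
    CriterionResult("mentions triggers", passed_now=True, passed_before=True),
    CriterionResult("lists CLI commands", passed_now=False, passed_before=True),
    CriterionResult("defines 'tile content'", passed_now=False, passed_before=False),
]
buckets = {c.name: classify(c) for c in results}
```

Each failing bucket then maps to a different kind of fix: regressions point at a recent edit to revert or repair, while gaps call for new tile content.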
Implementation
77% — Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a strong, highly actionable skill with excellent workflow clarity and concrete executable guidance. The multi-phase structure with explicit validation checkpoints is well-designed for a complex iterative process. Main weaknesses are moderate verbosity and the monolithic structure that could benefit from splitting detailed reference content into separate files.
Suggestions
Consider extracting the bucket classification rules and 'Rules for good fixes' into a separate REFERENCE.md file to reduce the main skill's length
Tighten the Phase 1.2 bucket definitions: the criteria are clear but could be expressed more concisely (e.g., as a table rather than prose)
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | Reasonably efficient, but includes some redundancy (e.g., repeated command examples across phases, verbose explanations of the bucket classifications). Some sections could be tightened without losing clarity. | 2 / 3 |
| Actionability | Excellent: specific, executable bash commands throughout (`tessl eval view`, `tessl eval run`, git commands), clear examples of expected output formats, and concrete decision criteria for each bucket classification. | 3 / 3 |
| Workflow Clarity | Outstanding multi-phase workflow with clear sequencing (Phase 0-5), explicit validation checkpoints (lint after fixes, poll for completion, before/after comparison), and feedback loops (the re-run-and-verify cycle and the 'take another pass' option). | 3 / 3 |
| Progressive Disclosure | Well-structured with clear phase headers, but the entire workflow lives in one monolithic file. Complex sections such as the bucket classification rules and fix guidelines could be split into reference files for cleaner navigation. | 2 / 3 |
| Total | | 10 / 12 — Passed |
Validation
100% — Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 11 / 11 Passed
No warnings or errors.
Install with Tessl CLI
`npx tessl i experiments/eval-improve`