Analyze eval results, diagnose low-scoring criteria, fix tile content, and re-run evals — the full improvement loop automated
94
Quality
89%
Does it follow best practices?
Impact
98%
1.30x
Average score across 7 eval scenarios
Passed
No known issues
Quality
Discovery
100%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is a well-crafted skill description that excels across all dimensions. It provides specific concrete actions, uses natural trigger terms that users would actually say, includes an explicit 'Use when...' clause with multiple scenarios, and carves out a distinct niche around evaluation debugging workflows.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific concrete actions: 'Analyze eval results', 'diagnose failures', 'apply targeted fixes', and 're-run to verify improvements'. These are clear, actionable capabilities. | 3 / 3 |
| Completeness | Clearly answers both what (analyze, diagnose, fix, re-run) AND when with explicit 'Use when...' clause covering multiple trigger scenarios: debugging scores, fixing failures, improving content, iterating on test results. | 3 / 3 |
| Trigger Term Quality | Includes natural keywords users would say: 'eval results', 'debugging evaluation scores', 'failing or regressed criteria', 'eval run', 'agent performance test results'. Good coverage of domain-specific terms users would naturally use. | 3 / 3 |
| Distinctiveness Conflict Risk | Clear niche focused specifically on evaluation/testing workflows with distinct triggers like 'eval results', 'regressed criteria', 'agent performance test'. Unlikely to conflict with general debugging or code skills. | 3 / 3 |
| Total | | 12 / 12 Passed |
Implementation
77%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a strong, actionable skill with excellent workflow clarity and concrete executable guidance. The multi-phase structure with validation checkpoints is well-designed for the complex eval improvement cycle. Main weakness is the monolithic structure - some detailed reference content (bucket definitions, fix rules) could be extracted to separate files for better progressive disclosure.
Suggestions
Extract the bucket classification rules and 'Rules for good fixes' into a separate reference file (e.g., docs/eval-criteria.md) to reduce SKILL.md length
Tighten the bucket explanations - Claude can infer meanings from the classification logic without the detailed prose descriptions
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is reasonably efficient but includes some unnecessary explanations (e.g., explaining what each bucket means in detail, time expectations repeated). Some sections could be tightened, particularly the verbose bucket classification explanations that Claude could infer from examples. | 2 / 3 |
| Actionability | Provides fully executable commands throughout (tessl eval view, tessl eval run, git commands), specific output formats to show users, and concrete decision trees. Every phase has copy-paste ready commands with proper flags. | 3 / 3 |
| Workflow Clarity | Excellent multi-phase workflow with clear sequencing (Phase 0-5), explicit validation checkpoints (lint after fixes, poll for completion, before/after comparison), and feedback loops (re-run and verify cycle, 'take another pass' option). | 3 / 3 |
| Progressive Disclosure | Content is well-structured with clear phases and sections, but this is a monolithic 300+ line file that could benefit from splitting detailed content (like bucket classification rules, fix rules) into separate reference files. The companion skill reference is good but internal organization could improve. | 2 / 3 |
| Total | | 10 / 12 Passed |
Validation
100%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 11 / 11 Passed
Validation for skill structure
No warnings or errors.
Reviewed