Analyze eval results, diagnose low-scoring criteria, fix tile content, and re-run evals — the full improvement loop automated
Does it follow best practices?
Evaluation — 100%
↑ 1.02x agent success when using this tile
Discovery
43% — Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
The description excels at listing specific, concrete actions with domain-specific terminology such as the bucket classifications. However, it lacks any 'Use when...' guidance, making it difficult for Claude to know when to select this skill from a large pool, and the technical jargon may miss natural user language patterns.
Suggestions
Add a 'Use when...' clause with explicit triggers, e.g., 'Use when the user asks to debug eval failures, analyze test results, or fix tile content issues'
Include natural language variations users might say: 'evaluation results', 'test failures', 'debugging evals', 'why is my eval failing'
Clarify what 'tile content' refers to, as this domain-specific term may not match user vocabulary
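A revised description incorporating these suggestions might look like the following frontmatter sketch (the wording is illustrative, not the skill's actual metadata):

```markdown
---
name: eval-improve
description: >
  Analyze eval results, classify criteria into buckets
  (working/gap/redundant/regression), diagnose root causes, apply
  targeted fixes to tile content, and re-run evals to verify
  improvements. Use when the user asks to debug eval failures,
  analyze test or evaluation results, investigate why an eval is
  failing, or fix tile (skill) content issues.
---
```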
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific, concrete actions: 'analyze eval results', 'classify criteria into buckets (working/gap/redundant/regression)', 'diagnose root causes', 'apply targeted fixes to tile content', and 're-run evals to verify improvements'. | 3 / 3 |
| Completeness | Describes what the skill does but lacks a 'Use when...' clause or any explicit trigger guidance for when Claude should select this skill. Per the rubric, missing trigger guidance caps completeness at 2, and this description has no 'when' component at all. | 1 / 3 |
| Trigger Term Quality | Contains some relevant keywords ('eval results', 'root causes', 'tile content') but leans on technical jargon. Missing common variations users might say, such as 'evaluation', 'test results', 'debugging', or 'fix failing tests'. | 2 / 3 |
| Distinctiveness / Conflict Risk | The specific bucket classifications (working/gap/redundant/regression) and 'tile content' terminology provide some distinctiveness, but 'analyze eval results' and 'diagnose root causes' could overlap with general debugging or testing skills. | 2 / 3 |
| Total | | 8 / 12 — Passed |
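The bucket classification the table refers to can be sketched as a small decision function. This is one plausible reading of the four buckets (the names and rules below are assumptions, not the skill's actual logic): a criterion is *redundant* if it restates another, *working* if it passes, *regression* if it passed before but fails now, and a *gap* if it has never passed.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CriterionResult:
    name: str
    passed_now: bool
    passed_before: bool                  # result from the previous eval run
    duplicate_of: Optional[str] = None   # set when it restates another criterion

def classify(c: CriterionResult) -> str:
    """Assign one of the four buckets; the skill's real rules may differ."""
    if c.duplicate_of:
        return "redundant"
    if c.passed_now:
        return "working"
    if c.passed_before:
        return "regression"  # used to pass, broke after a recent change
    return "gap"             # never passed: content is genuinely missing

# Hypothetical criterion names, for illustration only:
results = [
    CriterionResult("mentions triggers", passed_now=True, passed_before=True),
    CriterionResult("lists CLI commands", passed_now=False, passed_before=True),
    CriterionResult("defines 'tile content'", passed_now=False, passed_before=False),
]
buckets = {c.name: classify(c) for c in results}
```

Each failing bucket then maps to a different kind of fix: regressions point at a recent edit to revert or repair, while gaps call for new tile content.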
Implementation
77% — Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a strong, highly actionable skill with excellent workflow clarity and concrete executable guidance. The multi-phase structure with explicit validation checkpoints is well-designed for a complex iterative process. Main weaknesses are moderate verbosity and the monolithic structure that could benefit from splitting detailed reference content into separate files.
Suggestions
Consider extracting the bucket classification rules and 'Rules for good fixes' into a separate REFERENCE.md file to reduce the main skill's length
Tighten the Phase 1.2 bucket definitions: the criteria are clear but could be expressed more concisely (e.g., as a table rather than prose)
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | Reasonably efficient, but includes some redundancy (e.g., repeated command examples across phases, verbose explanations of the bucket classifications). Some sections could be tightened without losing clarity. | 2 / 3 |
| Actionability | Excellent: specific, executable bash commands throughout (`tessl eval view`, `tessl eval run`, git commands), clear examples of expected output formats, and concrete decision criteria for each bucket classification. | 3 / 3 |
| Workflow Clarity | Outstanding multi-phase workflow with clear sequencing (Phase 0-5), explicit validation checkpoints (lint after fixes, poll for completion, before/after comparison), and feedback loops (the re-run-and-verify cycle and the 'take another pass' option). | 3 / 3 |
| Progressive Disclosure | Well-structured with clear phase headers, but the entire workflow lives in one monolithic file. Complex sections such as the bucket classification rules and fix guidelines could be split into reference files for cleaner navigation. | 2 / 3 |
| Total | | 10 / 12 — Passed |
Validation
100% — Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 11 / 11 Passed
No warnings or errors.
Install with Tessl CLI
`npx tessl i experiments/eval-improve`