
experiments/eval-improve

Analyze eval results, diagnose low-scoring criteria, fix tile content, and re-run evals — the full improvement loop automated

Does it follow best practices?

Evaluation: 100%

1.02x agent success when using this tile

Validation for skill structure


Discovery: 43%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description excels at listing specific concrete actions with domain-specific terminology like bucket classifications. However, it critically lacks any 'Use when...' guidance, making it difficult for Claude to know when to select this skill from a large pool. The technical jargon may also miss natural user language patterns.

Suggestions

Add a 'Use when...' clause with explicit triggers, e.g., 'Use when the user asks to debug eval failures, analyze test results, or fix tile content issues'

Include natural language variations users might say: 'evaluation results', 'test failures', 'debugging evals', 'why is my eval failing'

Clarify what 'tile content' refers to, as this domain-specific term may not match user vocabulary
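Folded together, these suggestions could become a revised skill description. A hedged sketch of what that frontmatter might look like (the field names follow the common SKILL.md convention, and the wording is illustrative, not this skill's published metadata):

```yaml
# Illustrative SKILL.md frontmatter; the name/description fields assume
# the usual skill convention, and the wording is a sketch only.
name: eval-improve
description: >
  Analyze evaluation (eval) results, classify criteria into buckets
  (working/gap/redundant/regression), diagnose root causes, apply
  targeted fixes to tile (skill) content, and re-run evals to verify
  improvements. Use when the user asks to debug eval failures, analyze
  evaluation or test results, investigate why an eval is failing, or
  fix failing skill content.
```

Note the explicit 'Use when...' clause and the parenthetical glosses ('eval', 'tile') that bridge the skill's jargon to natural user vocabulary.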

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Lists multiple specific concrete actions: 'analyze eval results', 'classify criteria into buckets (working/gap/redundant/regression)', 'diagnose root causes', 'apply targeted fixes to tile content', and 're-run evals to verify improvements'. | 3 / 3 |
| Completeness | Describes what the skill does but completely lacks a 'Use when...' clause or any explicit trigger guidance for when Claude should select this skill. Per rubric guidelines, missing explicit trigger guidance caps completeness at 2, and this has no 'when' component at all. | 1 / 3 |
| Trigger Term Quality | Contains some relevant keywords like 'eval results', 'root causes', 'tile content', but uses somewhat technical jargon. Missing common variations users might say, like 'evaluation', 'test results', 'debugging', or 'fix failing tests'. | 2 / 3 |
| Distinctiveness / Conflict Risk | The specific bucket classifications (working/gap/redundant/regression) and 'tile content' terminology provide some distinctiveness, but 'analyze eval results' and 'diagnose root causes' could overlap with general debugging or testing skills. | 2 / 3 |
| Total | | 8 / 12 |

Passed

Implementation: 77%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a strong, highly actionable skill with excellent workflow clarity and concrete executable guidance. The multi-phase structure with explicit validation checkpoints is well-designed for a complex iterative process. Main weaknesses are moderate verbosity and the monolithic structure that could benefit from splitting detailed reference content into separate files.

Suggestions

Consider extracting the bucket classification rules and 'Rules for good fixes' into a separate REFERENCE.md file to reduce the main skill's length

Tighten the Phase 1.2 bucket definitions: the criteria are clear but could be expressed more concisely (e.g., as a table rather than prose)
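The split proposed in the first suggestion might look like the following layout sketch. Only SKILL.md and the 'Rules for good fixes' section are named in the report; the REFERENCE.md file name and the directory path are illustrative:

```
skills/eval-improve/
├── SKILL.md        # lean main workflow: phases, checkpoints, pointers out
└── REFERENCE.md    # bucket classification rules, 'Rules for good fixes'
```

Keeping the main file short improves progressive disclosure: the agent loads the detailed reference content only when a phase actually needs it.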

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The skill is reasonably efficient but includes some redundancy (e.g., repeated command examples across phases, verbose explanations of bucket classifications). Some sections could be tightened without losing clarity. | 2 / 3 |
| Actionability | Excellent actionability with specific, executable bash commands throughout (tessl eval view, tessl eval run, git commands). Clear examples of expected output formats and concrete decision criteria for each bucket classification. | 3 / 3 |
| Workflow Clarity | Outstanding multi-phase workflow with clear sequencing (Phase 0-5), explicit validation checkpoints (lint after fixes, poll for completion, before/after comparison), and feedback loops (re-run and verify cycle, 'take another pass' option). | 3 / 3 |
| Progressive Disclosure | Content is well-structured with clear phase headers, but the entire workflow is in one monolithic file. Complex sections like bucket classification rules and fix guidelines could be split into reference files for cleaner navigation. | 2 / 3 |
| Total | | 10 / 12 |

Passed
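The feedback loop credited in the Workflow Clarity row reduces to a run/inspect/fix/re-run cycle. Only `tessl eval run`, `tessl eval view`, and the use of git come from the report itself; the sequencing and comments below are a sketch of how the loop might be driven, not the skill's actual commands:

```shell
# Sketch of the improvement loop; command names are taken from the
# report, everything else is illustrative.
tessl eval run     # execute the evals
tessl eval view    # inspect per-criterion scores and reasoning
# ...classify low-scoring criteria into buckets, edit the tile content...
git diff           # review the fixes before committing
tessl eval run     # re-run to compare before/after scores
```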

Validation: 100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 11 / 11 Passed

Validation for skill structure

No warnings or errors.

Install with the Tessl CLI:

npx tessl i experiments/eval-improve@0.4.0

Table of Contents