
experiments/eval-improve

Analyze eval results, diagnose low-scoring criteria, fix tile content, and re-run evals — the full improvement loop automated

Does it follow best practices?

Evaluation: 100%

1.02x agent success when using this tile

Validation for skill structure


Discovery: 43%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description excels at listing specific concrete actions with domain-specific terminology like bucket classifications. However, it critically lacks any 'Use when...' guidance, making it difficult for Claude to know when to select this skill from a large pool. The technical jargon may also miss natural user language patterns.

Suggestions

Add a 'Use when...' clause with explicit triggers, e.g., 'Use when the user asks to debug eval failures, analyze test results, or fix tile content issues'

Include natural language variations users might say: 'evaluation results', 'test failures', 'debugging evals', 'why is my eval failing'

Clarify what 'tile content' refers to, as this domain-specific term may not match user vocabulary
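Folded together, these suggestions could become a revised skill description. A hedged sketch of what that frontmatter might look like (the field names follow the common SKILL.md convention, and the wording is illustrative, not this skill's published metadata):

```yaml
# Illustrative SKILL.md frontmatter; the name/description fields assume
# the usual skill convention, and the wording is a sketch only.
name: eval-improve
description: >
  Analyze evaluation (eval) results, classify criteria into buckets
  (working/gap/redundant/regression), diagnose root causes, apply
  targeted fixes to tile (skill) content, and re-run evals to verify
  improvements. Use when the user asks to debug eval failures, analyze
  evaluation or test results, investigate why an eval is failing, or
  fix failing skill content.
```

Note the explicit 'Use when...' clause and the parenthetical glosses ('eval', 'tile') that bridge the skill's jargon to natural user vocabulary.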

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Lists multiple specific concrete actions: 'analyze eval results', 'classify criteria into buckets (working/gap/redundant/regression)', 'diagnose root causes', 'apply targeted fixes to tile content', and 're-run evals to verify improvements'. | 3 / 3 |
| Completeness | Describes what the skill does but completely lacks a 'Use when...' clause or any explicit trigger guidance for when Claude should select this skill. Per rubric guidelines, missing explicit trigger guidance caps completeness at 2, and this has no 'when' component at all. | 1 / 3 |
| Trigger Term Quality | Contains some relevant keywords like 'eval results', 'root causes', 'tile content', but uses somewhat technical jargon. Missing common variations users might say, like 'evaluation', 'test results', 'debugging', or 'fix failing tests'. | 2 / 3 |
| Distinctiveness / Conflict Risk | The specific bucket classifications (working/gap/redundant/regression) and 'tile content' terminology provide some distinctiveness, but 'analyze eval results' and 'diagnose root causes' could overlap with general debugging or testing skills. | 2 / 3 |
| Total | | 8 / 12 |

Passed

Implementation: 77%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a strong, highly actionable skill with excellent workflow clarity and concrete executable guidance. The multi-phase structure with explicit validation checkpoints is well-designed for a complex iterative process. Main weaknesses are moderate verbosity and the monolithic structure that could benefit from splitting detailed reference content into separate files.

Suggestions

Consider extracting the bucket classification rules and 'Rules for good fixes' into a separate REFERENCE.md file to reduce the main skill's length

Tighten the Phase 1.2 bucket definitions: the criteria are clear but could be expressed more concisely (e.g., as a table rather than prose)
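The split proposed in the first suggestion might look like the following layout sketch. Only SKILL.md and the 'Rules for good fixes' section are named in the report; the REFERENCE.md file name and the directory path are illustrative:

```
skills/eval-improve/
├── SKILL.md        # lean main workflow: phases, checkpoints, pointers out
└── REFERENCE.md    # bucket classification rules, 'Rules for good fixes'
```

Keeping the main file short improves progressive disclosure: the agent loads the detailed reference content only when a phase actually needs it.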

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The skill is reasonably efficient but includes some redundancy (e.g., repeated command examples across phases, verbose explanations of bucket classifications). Some sections could be tightened without losing clarity. | 2 / 3 |
| Actionability | Excellent actionability with specific, executable bash commands throughout (tessl eval view, tessl eval run, git commands). Clear examples of expected output formats and concrete decision criteria for each bucket classification. | 3 / 3 |
| Workflow Clarity | Outstanding multi-phase workflow with clear sequencing (Phase 0-5), explicit validation checkpoints (lint after fixes, poll for completion, before/after comparison), and feedback loops (re-run and verify cycle, 'take another pass' option). | 3 / 3 |
| Progressive Disclosure | Content is well-structured with clear phase headers, but the entire workflow is in one monolithic file. Complex sections like bucket classification rules and fix guidelines could be split into reference files for cleaner navigation. | 2 / 3 |
| Total | | 10 / 12 |

Passed
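The feedback loop credited in the Workflow Clarity row reduces to a run/inspect/fix/re-run cycle. Only `tessl eval run`, `tessl eval view`, and the use of git come from the report itself; the sequencing and comments below are a sketch of how the loop might be driven, not the skill's actual commands:

```shell
# Sketch of the improvement loop; command names are taken from the
# report, everything else is illustrative.
tessl eval run     # execute the evals
tessl eval view    # inspect per-criterion scores and reasoning
# ...classify low-scoring criteria into buckets, edit the tile content...
git diff           # review the fixes before committing
tessl eval run     # re-run to compare before/after scores
```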

Validation: 100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 11 / 11 Passed

Validation for skill structure

No warnings or errors.

Install with the Tessl CLI:

npx tessl i experiments/eval-improve@0.4.0

Table of Contents