tessl-labs/eval-improve

Analyze eval results, diagnose low-scoring criteria, fix tile content, and re-run evals — the full improvement loop automated

94

1.30x

Quality: 89%
Does it follow best practices?

Impact: 98% (1.30x)
Average score across 7 eval scenarios

Security by Snyk: Passed
No known issues

Quality

Discovery

100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is a well-crafted skill description that excels across all dimensions. It provides specific concrete actions, uses natural trigger terms that users would actually say, includes an explicit 'Use when...' clause with multiple scenarios, and carves out a distinct niche around evaluation debugging workflows.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Lists multiple specific concrete actions: 'Analyze eval results', 'diagnose failures', 'apply targeted fixes', and 're-run to verify improvements'. These are clear, actionable capabilities. | 3 / 3 |
| Completeness | Clearly answers both what (analyze, diagnose, fix, re-run) AND when with explicit 'Use when...' clause covering multiple trigger scenarios: debugging scores, fixing failures, improving content, iterating on test results. | 3 / 3 |
| Trigger Term Quality | Includes natural keywords users would say: 'eval results', 'debugging evaluation scores', 'failing or regressed criteria', 'eval run', 'agent performance test results'. Good coverage of domain-specific terms users would naturally use. | 3 / 3 |
| Distinctiveness Conflict Risk | Clear niche focused specifically on evaluation/testing workflows with distinct triggers like 'eval results', 'regressed criteria', 'agent performance test'. Unlikely to conflict with general debugging or code skills. | 3 / 3 |
| Total | | 12 / 12 |

Passed

Implementation

77%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a strong, actionable skill with excellent workflow clarity and concrete executable guidance. The multi-phase structure with validation checkpoints is well-designed for the complex eval improvement cycle. The main weakness is the monolithic structure: some detailed reference content (bucket definitions, fix rules) could be extracted to separate files for better progressive disclosure.

Suggestions

Extract the bucket classification rules and 'Rules for good fixes' into a separate reference file (e.g., docs/eval-criteria.md) to reduce SKILL.md length

Tighten the bucket explanations; Claude can infer their meanings from the classification logic without the detailed prose descriptions

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The skill is reasonably efficient but includes some unnecessary explanations (e.g., explaining what each bucket means in detail, time expectations repeated). Some sections could be tightened, particularly the verbose bucket classification explanations that Claude could infer from examples. | 2 / 3 |
| Actionability | Provides fully executable commands throughout (tessl eval view, tessl eval run, git commands), specific output formats to show users, and concrete decision trees. Every phase has copy-paste ready commands with proper flags. | 3 / 3 |
| Workflow Clarity | Excellent multi-phase workflow with clear sequencing (Phase 0-5), explicit validation checkpoints (lint after fixes, poll for completion, before/after comparison), and feedback loops (re-run and verify cycle, 'take another pass' option). | 3 / 3 |
| Progressive Disclosure | Content is well-structured with clear phases and sections, but this is a monolithic 300+ line file that could benefit from splitting detailed content (like bucket classification rules, fix rules) into separate reference files. The companion skill reference is good but internal organization could improve. | 2 / 3 |
| Total | | 10 / 12 |

Passed

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 11 / 11 Passed

Validation for skill structure

No warnings or errors.

Reviewed
