
tessl-labs/skill-optimizer

Optimize your skills and tiles: review SKILL.md quality, generate eval scenarios, run evals, compare across models, diagnose gaps, and re-run until scores improve.

Quality: 93% (Does it follow best practices?)

Impact: 88%, 1.07x (average score across 24 eval scenarios)

Security by Snyk: Passed, no known issues

This plugin was archived by the owner on May 19, 2026

Reason: Tile archived: Superseded by tessl/skill-optimizer - go to https://tessl.io/registry/tessl/skill-optimizer


Quality

Discovery

85%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is a strong description that clearly articulates what the skill does and when to use it, with an explicit 'Use when' clause covering multiple trigger scenarios. The main weakness is in trigger term coverage—it uses some domain-specific jargon ('tile issues') and misses common synonyms users might naturally use like 'benchmark', 'test', or 'evaluate'. Overall it would perform well in skill selection among a large set of skills.

Suggestions

Replace the jargon 'tile issues' with clearer language, and add natural trigger terms users might say such as 'benchmark', 'test performance', 'evaluate skill', or 'model comparison'.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Lists multiple specific concrete actions: 'Run task evals across multiple Claude models', 'compare results side-by-side', and 'optimise'. These are clear, actionable capabilities. | 3 / 3 |
| Completeness | Clearly answers both 'what' (run task evals, compare results, optimise) and 'when' with an explicit 'Use when' clause covering three specific trigger scenarios: understanding cross-model performance, identifying model-specific gaps, and validating before publishing. | 3 / 3 |
| Trigger Term Quality | Includes some relevant terms like 'task evals', 'models', 'compare', 'skill', 'registry', and 'publishing', but misses common variations users might say such as 'benchmark', 'evaluation', 'testing', 'performance comparison', or 'model selection'. The term 'tile issues' is jargon that users wouldn't naturally use. | 2 / 3 |
| Distinctiveness Conflict Risk | Occupies a clear niche around multi-model eval comparison and skill validation before registry publishing. The combination of cross-model evaluation, side-by-side comparison, and registry publishing creates a distinct identity unlikely to conflict with other skills. | 3 / 3 |
| Total | | 11 / 12 |

Passed
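The trigger-term gap described above can be checked mechanically before publishing. The following is a minimal sketch, not part of the reviewed skill: the function name `missing_trigger_terms`, the sample description (assembled from phrases quoted in this review plus a placeholder 'Use when' clause), and the candidate term list are all assumptions made for illustration.

```python
import re


def missing_trigger_terms(description: str, terms: list[str]) -> list[str]:
    """Return the candidate terms that do not occur in the description.

    Matching is case-insensitive and on whole words or phrases, so
    'evals' does not count as a match for 'evaluate'.
    """
    text = description.lower()
    missing = []
    for term in terms:
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        if not re.search(pattern, text):
            missing.append(term)
    return missing


# Hypothetical description text; the real SKILL.md description is not shown here.
description = (
    "Run task evals across multiple Claude models, compare results "
    "side-by-side, and optimise. Use when validating a skill before publishing."
)

# Hypothetical list of terms a user might naturally type.
candidate_terms = ["benchmark", "evaluate", "test", "model comparison", "compare"]

print(missing_trigger_terms(description, candidate_terms))
```

Any term the function returns is a candidate for inclusion in the description, provided it reads naturally and reflects something the skill actually does.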

Implementation

77%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a strong, highly actionable skill with excellent workflow clarity and concrete commands throughout. Its main weakness is length: the content is thorough, but splitting the diagnostic/interpretation guidance into a referenced file would reduce the token footprint of the main skill. The phased structure with validation checkpoints is exemplary for a complex multi-step process.

Suggestions

Consider extracting Phase 6 (Diagnose and interpret) into a separate DIAGNOSIS.md file referenced from the main skill, reducing the main file's token cost while preserving the detailed diagnostic framework.

Tighten Phase 6.1 baseline interpretation — Claude can infer what high/low baselines mean; reduce to a brief decision table rather than prose explanations.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The skill is generally well-structured and avoids explaining basic concepts, but it's quite lengthy (~200 lines) with some verbose sections. The diagnostic framework in Phase 6 could be tightened, and some guidance (e.g., explaining what baselines mean) is somewhat obvious for Claude. However, most content is domain-specific and earns its place. | 2 / 3 |
| Actionability | Every phase includes concrete, copy-paste-ready commands (find, tessl eval run, tessl eval view --json, tessl eval retry). Output formats are fully specified with table templates and emoji indicators. The skill leaves no ambiguity about what to execute at each step. | 3 / 3 |
| Workflow Clarity | The 7-phase workflow is clearly sequenced with explicit validation checkpoints: verify tile exists (1.1), verify evals exist (1.2), verify login (1.3), confirm config before running (Phase 2), poll for completion with retry on failure (Phase 4), and a clear 'when to stop' section. Feedback loops are present for failed runs and for re-running after fixes. | 3 / 3 |
| Progressive Disclosure | The content is well-organized with clear phases and sub-sections, but it's entirely monolithic: all ~200 lines live in a single file with no references to external documents. The diagnostic framework (Phase 6) and the detailed table formats (Phase 5) could be split into referenced files. It does mention other skills (optimize-skill-performance, setup-skill-performance) which is good cross-referencing. | 2 / 3 |
| Total | | 10 / 12 |

Passed
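The "poll for completion with retry on failure" checkpoint credited under Workflow Clarity can be sketched as a small loop. This is an illustrative assumption, not the skill's actual implementation: the status strings (`"running"`, `"failed"`, `"completed"`), the function name, and its parameters are hypothetical. In practice `check_status` and `retry` would presumably wrap `tessl eval view --json` and `tessl eval retry`, but their arguments and output schema are not shown in this review, so they are left as injected callables.

```python
import time
from typing import Callable


def poll_until_done(
    check_status: Callable[[], str],
    retry: Callable[[], None],
    max_retries: int = 2,
    interval_s: float = 5.0,
    max_polls: int = 100,
) -> str:
    """Poll a run until it completes, retrying failed runs a limited number of times.

    Returns "completed", "failed" (retries exhausted), or "timeout".
    """
    retries = 0
    for _ in range(max_polls):
        status = check_status()
        if status == "completed":
            return status
        if status == "failed":
            if retries >= max_retries:
                return status  # give up: retry budget exhausted
            retry()
            retries += 1
        time.sleep(interval_s)
    return "timeout"


# Simulated run: fails once, is retried, then completes.
simulated = iter(["running", "failed", "running", "completed"])
retry_log: list[str] = []

result = poll_until_done(
    check_status=lambda: next(simulated),
    retry=lambda: retry_log.append("retried"),
    interval_s=0.0,  # no waiting in the simulation
)
print(result, retry_log)
```

Capping both the retry count and the number of polls keeps the loop from running indefinitely, which matches the review's note that the workflow includes a clear 'when to stop' section.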

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 11 / 11 Passed

Validation for skill structure

No warnings or errors.
