
# tessl-labs/skill-optimizer

Optimize your skills and tiles: review SKILL.md quality, generate eval scenarios, run evals, compare across models, diagnose gaps, and re-run until scores improve.

- Quality: 91% (does it follow best practices?)
- Impact: 84%, 0.97x (average score across 24 eval scenarios)
- Security (by Snyk): Passed, no known issues


## Evaluation results

### Approval-Gated Skill Change Proposal

With context: 93% (-2% vs. without context)

| Criteria | Without context | With context |
| --- | --- | --- |
| SKILL.md not modified | 100% | 100% |
| Explicit approval request | 100% | 100% |
| Trade-off discussion | 100% | 100% |
| Risk assessment per recommendation | 100% | 100% |
| Grouped presentation | 100% | 100% |
| All key issues addressed | 100% | 100% |
| Priority summary present | 80% | 50% |
| Current score per recommendation | 70% | 80% |

### Commit Selection for Eval Scenario Generation

Score: 97%

| Criteria | Without context | With context |
| --- | --- | --- |
| skips_trivial_commits | 100% | 100% |
| skips_docs_and_config_only | 100% | 100% |
| skips_mechanical_generated_commit | 100% | 100% |
| scores_payment_commit_high | 100% | 100% |
| scores_auth_refactor_highest | 70% | 70% |
| references_complexity_signals | 100% | 100% |
| recommends_two_or_three_commits | 100% | 100% |
| explains_selection_rationale | 100% | 100% |

### Context File Detection for Scenario Generation

Score: 100%

| Criteria | Without context | With context |
| --- | --- | --- |
| identifies_mdc_files | 100% | 100% |
| identifies_claude_md | 100% | 100% |
| identifies_agents_md | 100% | 100% |
| identifies_tessl_json | 100% | 100% |
| excludes_tessl_cache | 100% | 100% |
| excludes_generic_docs | 100% | 100% |
| excludes_source_and_build_config | 100% | 100% |
| constructs_valid_context_flag | 100% | 100% |
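The detection rules this scenario grades can be sketched as a small scanner. This is a hypothetical illustration of what the criteria test, not the skill's actual implementation; only the file names (`*.mdc`, `CLAUDE.md`, `AGENTS.md`, `tessl.json`) and the `.tessl` cache exclusion come from the rubric.

```python
from pathlib import Path

# Context files named by the criteria above; anything else (generic docs,
# source, build config) is implicitly excluded by not matching.
CONTEXT_NAMES = {"CLAUDE.md", "AGENTS.md", "tessl.json"}

def find_context_files(root):
    """Return agent context files under `root`, skipping the .tessl cache."""
    root = Path(root)
    hits = []
    for p in root.rglob("*"):
        if ".tessl" in p.parts:  # excludes_tessl_cache
            continue
        if p.name in CONTEXT_NAMES or p.suffix == ".mdc":
            hits.append(p.relative_to(root))
    return sorted(hits)
```

The detected paths could then be joined into whatever context flag the eval runner expects (the `constructs_valid_context_flag` criterion), e.g. a comma-separated list.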

### Data Pipeline Tile: Consistency Audit

Score: 100%

| Criteria | Without context | With context |
| --- | --- | --- |
| Retry count contradiction found | 100% | 100% |
| Auth failure contradiction found | 100% | 100% |
| All three files referenced | 100% | 100% |
| File attribution per contradiction | 100% | 100% |
| Auth contradiction despite scope | 100% | 100% |
| Verbatim quotes included | 100% | 100% |

### Payments Tile Eval Analysis

Score: 100%

| Criteria | Without context | With context |
| --- | --- | --- |
| Bucket A: idempotency key | 100% | 100% |
| Bucket B: webhook signature | 100% | 100% |
| Bucket C: HTTP status codes | 100% | 100% |
| Bucket B: currency precision | 100% | 100% |
| Bucket D: API version pinning | 100% | 100% |
| Bucket D highest priority | 100% | 100% |
| Bucket B diagnosis present | 100% | 100% |
| Bucket C action suggested | 100% | 100% |
| Bucket A no-action | 100% | 100% |
| 80% threshold applied | 100% | 100% |

### Scenario 6

With context: 14% (-70% vs. without context)

| Criteria | Without context | With context |
| --- | --- | --- |
| checks_prerequisites | 50% | 100% |
| browses_commits | 100% | 0% |
| auto_detects_context_files | 29% | 0% |
| uses_context_flag | 100% | 0% |
| workspace_in_eval_run | 100% | 0% |
| explains_baseline_vs_context | 100% | 37% |

### Scenario 7

With context: 0% (-75% vs. without context)

| Criteria | Without context | With context |
| --- | --- | --- |
| does_not_use_last_only | 100% | 0% |
| finds_generation_ids | 100% | 0% |
| downloads_each_separately | 66% | 0% |
| explains_why | 25% | 0% |

### Model Benchmark Comparison Report

With context: 92% (+9% vs. without context)

| Criteria | Without context | With context |
| --- | --- | --- |
| Overall summary table | 100% | 100% |
| Per-scenario breakdown | 100% | 100% |
| Per-criterion table | 100% | 100% |
| Correct symbol thresholds | 0% | 100% |
| Baseline interpretation | 100% | 100% |
| Universal Failure identified | 100% | 100% |
| Capability Gradient identified | 60% | 100% |
| Regression identified | 70% | 100% |
| Fix before publish recommendation | 100% | 100% |
| eval-improve mentioned | 100% | 0% |
| Re-run offer | 100% | 100% |

### Skill Quality Improvement

Score: 100%

| Criteria | Without context | With context |
| --- | --- | --- |
| REFERENCE.md not recreated | 100% | 100% |
| No REFERENCE.md changes proposed | 100% | 100% |
| SKILL.md produced | 100% | 100% |
| Use when clause added | 100% | 100% |
| Inline duplication removed | 100% | 100% |
| REFERENCE.md linked | 100% | 100% |
| Core examples retained | 100% | 100% |
| SKILL.md shorter | 100% | 100% |
| Change log documents SKILL.md changes | 100% | 100% |
| Change log explains why | 100% | 100% |

### Bundle File Audit

With context: 97% (+10% vs. without context)

| Criteria | Without context | With context |
| --- | --- | --- |
| Lists all bundle files | 100% | 100% |
| Identifies referenced files | 100% | 100% |
| Identifies orphaned files | 100% | 100% |
| TRANSACTIONS.md recommendation | 70% | 100% |
| PERFORMANCE.md recommendation | 70% | 100% |
| SECURITY.md recommendation | 70% | 100% |
| LEGACY_EXAMPLES.md recommendation | 100% | 100% |
| DRAFT_REPLICATION.md recommendation | 100% | 100% |
| Bloat reduction framing | 80% | 40% |
| Clear routing signals emphasis | 40% | 100% |
| Link vs remove justification | 100% | 100% |

### Skill Bundle Validation

With context: 63% (-19% vs. without context)

| Criteria | Without context | With context |
| --- | --- | --- |
| Python via ast.parse | 66% | 0% |
| Python error identified | 100% | 100% |
| JavaScript via node --check | 66% | 0% |
| Command flag validation | 20% | 30% |
| File reference check | 100% | 100% |
| Broken reference identified | 100% | 100% |
| Validation before application | 100% | 100% |
| Per-check pass/fail | 100% | 100% |
| Fix suggestions | 100% | 100% |
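The first two checks in this rubric (`ast.parse` for Python, `node --check` for JavaScript) can be sketched as follows. This is a minimal illustration of the validation technique the criteria name, not the skill's own code; the function names are made up for the example.

```python
import ast
import os
import subprocess
import tempfile

def check_python(src):
    """Validate Python source without executing it, via ast.parse."""
    try:
        ast.parse(src)
        return True, ""
    except SyntaxError as exc:
        return False, str(exc)

def check_javascript(src):
    """Validate JavaScript with `node --check` (requires Node.js on PATH)."""
    with tempfile.NamedTemporaryFile("w", suffix=".js", delete=False) as f:
        f.write(src)
        path = f.name
    try:
        result = subprocess.run(["node", "--check", path],
                                capture_output=True, text=True)
        return result.returncode == 0, result.stderr
    finally:
        os.unlink(path)

print(check_python("x = 1")[0])         # True
print(check_python("def broken(:")[0])  # False
```

Running checks like these before applying a proposed edit is exactly the "Validation before application" behaviour the table scores.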

### Skill Optimization Results Report

Score: 100%

| Criteria | Without context | With context |
| --- | --- | --- |
| Overall before/after format | 100% | 100% |
| Percentage delta shown | 100% | 100% |
| Per-dimension breakdown | 100% | 100% |
| Arrow notation or equivalent | 100% | 100% |
| Dimension change labelled | 100% | 100% |
| Dimensions impact explained | 100% | 100% |
| Correct overall scores | 100% | 100% |
| Completeness improvement noted | 100% | 100% |
| Actionability improvement noted | 100% | 100% |
| Conciseness unchanged noted | 100% | 100% |
| Robustness improvement noted | 100% | 100% |

### Skill Post-Edit Quality Audit

With context: 70% (-1% vs. without context)

| Criteria | Without context | With context |
| --- | --- | --- |
| Code syntax check included | 100% | 100% |
| Python syntax error found | 100% | 100% |
| Command flags check included | 37% | 37% |
| File references check included | 100% | 100% |
| File reference passes | 100% | 100% |
| Use when clause check included | 0% | 0% |
| Use when clause fails | 0% | 0% |
| Known concepts check included | 100% | 90% |
| Known concepts issue found | 100% | 100% |
| Readiness summary | 100% | 100% |

### Tile Eval Readiness Checker

With context: 72% (+17% vs. without context)

| Criteria | Without context | With context |
| --- | --- | --- |
| Excludes .tessl cache | 0% | 50% |
| .tessl/tiles warning | 0% | 100% |
| Scenario existence check | 80% | 100% |
| Scenario generation guidance | 100% | 60% |
| Login verification | 100% | 100% |
| No --workspace flag | 100% | 100% |
| Default model names | 100% | 100% |
| Model subset confirmation | 0% | 12% |
| Time estimate provided | 41% | 100% |
| Run count option | 33% | 0% |

### Skill Improvement Recommendations

With context: 87% (+9% vs. without context)

| Criteria | Without context | With context |
| --- | --- | --- |
| Critical issues first | 100% | 100% |
| High before Medium/Low | 100% | 100% |
| Summary with priorities | 50% | 50% |
| Expected improvement in summary | 87% | 100% |
| Dimension score included | 70% | 80% |
| Before/after examples | 75% | 91% |
| Impact stated per recommendation | 37% | 75% |
| Educational WHY included | 100% | 100% |
| All four issues addressed | 100% | 100% |
| Approval framing | 50% | 70% |

### Progressive Disclosure Evaluation

With context: 79% (-9% vs. without context)

| Criteria | Without context | With context |
| --- | --- | --- |
| Identifies good references | 100% | 100% |
| Explains why good | 100% | 90% |
| Identifies poor references | 100% | 80% |
| Explains why poor | 100% | 90% |
| Token efficiency framing | 90% | 50% |
| Routing gate test | 90% | 90% |
| Improves CONFIGURATION.md | 100% | 100% |
| Improves GUIDE.md | 100% | 100% |
| Improves EXAMPLES.md | 100% | 100% |
| Improves ADVANCED.md or REFERENCE.md | 100% | 100% |
| Questions blind split recommendation | 0% | 0% |

### Skill Length Reduction

With context: 89% (-11% vs. without context)

| Criteria | Without context | With context |
| --- | --- | --- |
| Linking over inlining | 100% | 100% |
| Reference file identified | 100% | 100% |
| Severity mappings removed | 100% | 60% |
| Flag tables removed | 100% | 90% |
| Template list removed | 100% | 100% |
| SKILL.md substantially shorter | 100% | 50% |
| Core examples preserved | 100% | 100% |
| Before/after shown | 100% | 100% |
| WHY explained | 100% | 100% |
| REFERENCE.md not modified | 100% | 100% |

### API Integration Tile: Eval Rubric Review

Score: 100%

| Criteria | Without context | With context |
| --- | --- | --- |
| All redundant criteria identified | 100% | 100% |
| Options presented per criterion | 100% | 100% |
| Useful criteria preserved | 100% | 100% |
| Weight redistribution correct | 100% | 100% |
| 80% threshold applied | 100% | 100% |
| Non-redundant scores unchanged | 100% | 100% |
| Below-threshold excluded | 100% | 100% |
| Removal option named explicitly | 100% | 100% |

### Code Review Tile: Regression Investigation

Score: 100%

| Criteria | Without context | With context |
| --- | --- | --- |
| Contradicting clause identified | 100% | 100% |
| Contradiction mechanism explained | 100% | 100% |
| Remove/clarify approach taken | 100% | 100% |
| Specific text targeted | 100% | 100% |
| No compensating additions | 100% | 100% |
| Other sections preserved | 100% | 100% |
| Pre-review list intact | 100% | 100% |

### Eval Scenario Quality Review

With context: 100% (+14% vs. without context)

| Criteria | Without context | With context |
| --- | --- | --- |
| identifies_scenario_1_acceptable | 50% | 100% |
| detects_answer_leakage | 100% | 100% |
| explains_leakage_impact | 100% | 100% |
| detects_double_counting | 95% | 100% |
| detects_free_point_criterion | 100% | 100% |
| proposes_specific_fixes | 93% | 100% |
| no_false_positives_scenario_1 | 30% | 100% |

### Skill Score Maximization

With context: 100% (+6% vs. without context)

| Criteria | Without context | With context |
| --- | --- | --- |
| Completeness weight correct | 100% | 100% |
| Conciseness weight correct | 100% | 100% |
| Actionability weight correct | 100% | 100% |
| Use when clause highest impact | 100% | 100% |
| Use when quantified | 100% | 100% |
| Revised description includes Use when | 100% | 100% |
| Executable code recommended | 100% | 100% |
| Known concepts flagged | 40% | 100% |
| High-impact first ordering | 100% | 100% |
| Dimension coverage | 100% | 100% |

### Multi-Model Tile Benchmark Automation

With context: 100% (+23% vs. without context)

| Criteria | Without context | With context |
| --- | --- | --- |
| Correct base command | 100% | 100% |
| --agent flag format | 20% | 100% |
| All three default models | 40% | 100% |
| Sequential execution | 100% | 100% |
| Run ID capture | 100% | 100% |
| Model-to-ID mapping | 62% | 100% |
| Monitoring URL output | 25% | 100% |
| Polls with tessl eval view | 100% | 100% |
| Retry on failure | 100% | 100% |
| Waits for all to complete | 100% | 100% |
| No --workspace flag | 100% | 100% |
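The "Sequential execution", "Run ID capture", and "Model-to-ID mapping" criteria can be illustrated with a generic runner. The exact Tessl CLI invocation is only hinted at by the rubric, so this sketch takes the command as a parameter and the demo uses a stand-in command rather than guessing real flags.

```python
import subprocess
import sys

def run_benchmarks(models, build_cmd):
    """Run one eval per model sequentially, capturing each run ID.

    `build_cmd(model)` returns the argv list for that model's eval run;
    the command is assumed to print its run ID as the last output line.
    """
    run_ids = {}
    for model in models:  # sequential, as the rubric requires
        out = subprocess.run(build_cmd(model), capture_output=True,
                             text=True, check=True).stdout
        run_ids[model] = out.strip().splitlines()[-1]  # model-to-ID mapping
    return run_ids

# Demo with a stand-in command instead of the real Tessl CLI:
ids = run_benchmarks(
    ["model-a", "model-b"],
    lambda m: [sys.executable, "-c", f"print('run-{m}')"],
)
print(ids)  # {'model-a': 'run-model-a', 'model-b': 'run-model-b'}
```

A real script would then poll each captured ID for completion (the "Polls with tessl eval view" criterion) and retry failed runs before reporting.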

### Webhook Processor Tile: Retry Reliability Fix

Score: 100%

| Criteria | Without context | With context |
| --- | --- | --- |
| Explicit retry intervals | 100% | 100% |
| Rubric language used | 100% | 100% |
| HMAC section unchanged | 100% | 100% |
| TLS section unchanged | 100% | 100% |
| Observability section unchanged | 100% | 100% |
| Processing section unchanged | 100% | 100% |
| Retry section only changed | 100% | 100% |
| Concise addition | 100% | 100% |
| Max retry count preserved | 100% | 100% |
| Fast acknowledgement preserved | 100% | 100% |

### Skill Optimization Automation

With context: 84% (+53% vs. without context)

| Criteria | Without context | With context |
| --- | --- | --- |
| tessl skill review command | 0% | 100% |
| Review before changes | 0% | 100% |
| Review after changes | 0% | 100% |
| Validation before apply | 70% | 100% |
| Python ast.parse validation | 0% | 0% |
| node --check JS validation | 0% | 100% |
| Command --help flag validation | 0% | 0% |
| File reference validation | 0% | 100% |
| Before/after score output | 100% | 100% |
| Script accepts SKILL.md path | 100% | 100% |
| Phases are ordered | 100% | 100% |

Evaluated with agent Claude, model Claude Sonnet 4.6.