tessl-labs/skill-optimizer

Optimize your skills and tiles: review SKILL.md quality, generate eval scenarios, run evals, compare across models, diagnose gaps, and re-run until scores improve.

Quality: 94% (Does it follow best practices?)
Impact: 88% average score across 24 eval scenarios (1.07x)
Security by Snyk: Passed, no known issues

Evaluation results

Data Pipeline Tile: Consistency Audit
With-context score: 100%

| Criterion | Without context | With context |
|---|---|---|
| Retry count contradiction found | 100% | 100% |
| Auth failure contradiction found | 100% | 100% |
| All three files referenced | 100% | 100% |
| File attribution per contradiction | 100% | 100% |
| Auth contradiction despite scope | 100% | 100% |
| Verbatim quotes included | 100% | 100% |

Payments Tile Eval Analysis
With-context score: 100% (+5% vs. without context)

| Criterion | Without context | With context |
|---|---|---|
| Bucket A: idempotency key | 100% | 100% |
| Bucket B: webhook signature | 100% | 100% |
| Bucket C: HTTP status codes | 100% | 100% |
| Bucket B: currency precision | 100% | 100% |
| Bucket D: API version pinning | 100% | 100% |
| Bucket D highest priority | 100% | 100% |
| Bucket B diagnosis present | 100% | 100% |
| Bucket C action suggested | 70% | 100% |
| Bucket A no-action | 75% | 100% |
| 80% threshold applied | 100% | 100% |

API Integration Tile: Eval Rubric Review
With-context score: 100%

| Criterion | Without context | With context |
|---|---|---|
| All redundant criteria identified | 100% | 100% |
| Options presented per criterion | 100% | 100% |
| Useful criteria preserved | 100% | 100% |
| Weight redistribution correct | 100% | 100% |
| 80% threshold applied | 100% | 100% |
| Non-redundant scores unchanged | 100% | 100% |
| Below-threshold excluded | 100% | 100% |
| Removal option named explicitly | 100% | 100% |

Code Review Tile: Regression Investigation
With-context score: 100%

| Criterion | Without context | With context |
|---|---|---|
| Contradicting clause identified | 100% | 100% |
| Contradiction mechanism explained | 100% | 100% |
| Remove/clarify approach taken | 100% | 100% |
| Specific text targeted | 100% | 100% |
| No compensating additions | 100% | 100% |
| Other sections preserved | 100% | 100% |
| Pre-review list intact | 100% | 100% |

Webhook Processor Tile: Retry Reliability Fix
With-context score: 100% (+10% vs. without context)

| Criterion | Without context | With context |
|---|---|---|
| Explicit retry intervals | 100% | 100% |
| Rubric language used | 100% | 100% |
| HMAC section unchanged | 100% | 100% |
| TLS section unchanged | 100% | 100% |
| Observability section unchanged | 100% | 100% |
| Processing section unchanged | 100% | 100% |
| Retry section only changed | 100% | 100% |
| Concise addition | 0% | 100% |
| Max retry count preserved | 100% | 100% |
| Fast acknowledgement preserved | 100% | 100% |

Skill Bundle Validation
Phase 4 syntax and reference validation
With-context score: 92% (+13% vs. without context)

| Criterion | Without context | With context |
|---|---|---|
| Python via ast.parse | 100% | 100% |
| Python error identified | 100% | 100% |
| JavaScript via node --check | 0% | 100% |
| Command flag validation | 40% | 20% |
| File reference check | 100% | 100% |
| Broken reference identified | 100% | 100% |
| Validation before application | 100% | 100% |
| Per-check pass/fail | 100% | 100% |
| Fix suggestions | 100% | 100% |
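The Skill Bundle Validation criteria name two concrete checks: parsing Python snippets with ast.parse and confirming that file references in SKILL.md resolve. Below is a minimal sketch of those two checks. It illustrates the idea and is not the skill's own implementation. The helper names are hypothetical, and links are assumed to use Markdown [text](path) syntax relative to the bundle directory. The JavaScript check via node --check is left out because it needs a Node.js install.

```python
import ast
import re
from pathlib import Path

FENCE = "`" * 3  # built this way so the pattern never closes a Markdown fence
# A fenced block: optional language tag, newline, body, closing fence.
FENCE_RE = re.compile(FENCE + r"(\w+)?\n(.*?)" + FENCE, re.DOTALL)
# Target of a Markdown link such as [guide](REFERENCE.md#setup), minus the anchor.
LINK_RE = re.compile(r"\]\(([^)\s#]+)")


def check_python_blocks(markdown: str) -> list[str]:
    """Return one error message per Python block that ast.parse rejects."""
    errors = []
    for lang, body in FENCE_RE.findall(markdown):
        if lang.lower() not in ("python", "py"):
            continue  # other languages need their own checker
        try:
            ast.parse(body)
        except SyntaxError as exc:
            errors.append(f"python block, line {exc.lineno}: {exc.msg}")
    return errors


def check_file_references(markdown: str, bundle_dir: Path) -> list[str]:
    """Return relative link targets that do not exist under bundle_dir."""
    missing = []
    for target in LINK_RE.findall(markdown):
        if "://" in target or target.startswith("mailto:"):
            continue  # external link, not a bundle file
        if not (bundle_dir / target).exists():
            missing.append(target)
    return missing
```

Each function returns a list, so an empty list is a pass and anything else is a per-check failure that can be reported with a suggested fix.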

Skill Improvement Recommendations
Prioritized recommendation generation from review output
With-context score: 90% (+12% vs. without context)

| Criterion | Without context | With context |
|---|---|---|
| Critical issues first | 100% | 100% |
| High before Medium/Low | 100% | 100% |
| Summary with priorities | 20% | 100% |
| Expected improvement in summary | 100% | 75% |
| Dimension score included | 80% | 100% |
| Before/after examples | 75% | 75% |
| Impact stated per recommendation | 25% | 62% |
| Educational WHY included | 100% | 100% |
| All four issues addressed | 100% | 100% |
| Approval framing | 70% | 80% |

Skill Length Reduction
Progressive disclosure via reference file linking
With-context score: 100% (+14% vs. without context)

| Criterion | Without context | With context |
|---|---|---|
| Linking over inlining | 100% | 100% |
| Reference file identified | 100% | 100% |
| Severity mappings removed | 100% | 100% |
| Flag tables removed | 60% | 100% |
| Template list removed | 100% | 100% |
| SKILL.md substantially shorter | 25% | 100% |
| Core examples preserved | 100% | 100% |
| Before/after shown | 100% | 100% |
| WHY explained | 90% | 100% |
| REFERENCE.md not modified | 100% | 100% |

Skill Optimization Results Report
Phase 7 before/after score comparison
With-context score: 100%

| Criterion | Without context | With context |
|---|---|---|
| Overall before/after format | 100% | 100% |
| Percentage delta shown | 100% | 100% |
| Per-dimension breakdown | 100% | 100% |
| Arrow notation or equivalent | 100% | 100% |
| Dimension change labelled | 100% | 100% |
| Dimensions impact explained | 100% | 100% |
| Correct overall scores | 100% | 100% |
| Completeness improvement noted | 100% | 100% |
| Actionability improvement noted | 100% | 100% |
| Conciseness unchanged noted | 100% | 100% |
| Robustness improvement noted | 100% | 100% |
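The Skill Optimization Results Report criteria look for an overall before/after line, a percentage delta, arrow notation, and a labelled per-dimension breakdown. Below is a minimal sketch of that output format. The function names are hypothetical. Deltas are assumed to be reported in percentage points, and the dimension names and scores passed in are whatever the review step produced.

```python
def format_change(label: str, before: int, after: int) -> str:
    """One line in 'label: before% → after% (signed delta pts)' form."""
    delta = after - before
    return f"{label}: {before}% → {after}% ({delta:+d} pts)"


def format_report(overall: tuple[int, int], dimensions: dict[str, tuple[int, int]]) -> str:
    """Overall line first, then one labelled line per dimension."""
    lines = [format_change("Overall", *overall)]
    for name, (before, after) in dimensions.items():
        if after > before:
            status = "improved"
        elif after < before:
            status = "regressed"
        else:
            status = "unchanged"
        lines.append(f"  {format_change(name, before, after)} [{status}]")
    return "\n".join(lines)
```

Labelling unchanged dimensions explicitly, rather than omitting them, matches the "Conciseness unchanged noted" criterion above.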

Skill Post-Edit Quality Audit
Phase 8 final accuracy check (5 criteria)
With-context score: 76% (+2% vs. without context)

| Criterion | Without context | With context |
|---|---|---|
| Code syntax check included | 100% | 100% |
| Python syntax error found | 100% | 100% |
| Command flags check included | 100% | 100% |
| File references check included | 100% | 100% |
| File reference passes | 100% | 100% |
| Use when clause check included | 0% | 0% |
| Use when clause fails | 0% | 0% |
| Known concepts check included | 80% | 100% |
| Known concepts issue found | 100% | 100% |
| Readiness summary | 100% | 100% |
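In the Skill Post-Edit Quality Audit, the "Use when" rows score 0% in both columns: neither run checked the skill description for that clause. Skill Quality Improvement and Skill Score Maximization both treat the clause as a key fix. Below is a sketch of such a check. It assumes the description sits on one line in YAML frontmatter at the top of SKILL.md, and the function name is hypothetical.

```python
import re


def description_has_use_when(skill_md: str) -> bool:
    """True if the frontmatter 'description' contains a 'Use when' clause.

    Simplification: only single-line 'description:' values are inspected;
    folded or multi-line YAML would need a real YAML parser.
    """
    match = re.match(r"---\n(.*?)\n---", skill_md, re.DOTALL)
    if match is None:
        return False  # no frontmatter at all
    for line in match.group(1).splitlines():
        if line.startswith("description:"):
            return "use when" in line.lower()
    return False
```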

Skill Optimization Automation
tessl skill review command and workflow scripting
With-context score: 92% (+75% vs. without context)

| Criterion | Without context | With context |
|---|---|---|
| tessl skill review command | 0% | 100% |
| Review before changes | 0% | 100% |
| Review after changes | 0% | 100% |
| Validation before apply | 0% | 100% |
| Python ast.parse validation | 0% | 100% |
| node --check JS validation | 0% | 100% |
| Command --help flag validation | 0% | 0% |
| File reference validation | 0% | 100% |
| Before/after score output | 75% | 100% |
| Script accepts SKILL.md path | 100% | 100% |
| Phases are ordered | 37% | 100% |

Skill Quality Improvement
Only modify SKILL.md, not other bundle files
With-context score: 100%

| Criterion | Without context | With context |
|---|---|---|
| REFERENCE.md not recreated | 100% | 100% |
| No REFERENCE.md changes proposed | 100% | 100% |
| SKILL.md produced | 100% | 100% |
| Use when clause added | 100% | 100% |
| Inline duplication removed | 100% | 100% |
| REFERENCE.md linked | 100% | 100% |
| Core examples retained | 100% | 100% |
| SKILL.md shorter | 100% | 100% |
| Change log documents SKILL.md changes | 100% | 100% |
| Change log explains why | 100% | 100% |

Multi-Model Tile Benchmark Automation
Sequential multi-model eval execution
With-context score: 100% (+15% vs. without context)

| Criterion | Without context | With context |
|---|---|---|
| Correct base command | 100% | 100% |
| --agent flag format | 80% | 100% |
| All three default models | 70% | 100% |
| Sequential execution | 100% | 100% |
| Run ID capture | 100% | 100% |
| Model-to-ID mapping | 50% | 100% |
| Monitoring URL output | 25% | 100% |
| Polls with tessl eval view | 100% | 100% |
| Retry on failure | 100% | 100% |
| Waits for all to complete | 100% | 100% |
| No --workspace flag | 100% | 100% |

Progressive Disclosure Evaluation
With-context score: 85% (+3% vs. without context)

| Criterion | Without context | With context |
|---|---|---|
| Identifies good references | 93% | 100% |
| Explains why good | 100% | 100% |
| Identifies poor references | 80% | 100% |
| Explains why poor | 100% | 100% |
| Token efficiency framing | 70% | 70% |
| Routing gate test | 90% | 80% |
| Improves CONFIGURATION.md | 100% | 100% |
| Improves GUIDE.md | 100% | 100% |
| Improves EXAMPLES.md | 100% | 100% |
| Improves ADVANCED.md or REFERENCE.md | 100% | 100% |
| Questions blind split recommendation | 0% | 0% |

Bundle File Audit
With-context score: 100% (+4% vs. without context)

| Criterion | Without context | With context |
|---|---|---|
| Lists all bundle files | 100% | 100% |
| Identifies referenced files | 100% | 100% |
| Identifies orphaned files | 100% | 100% |
| TRANSACTIONS.md recommendation | 100% | 100% |
| PERFORMANCE.md recommendation | 100% | 100% |
| SECURITY.md recommendation | 100% | 100% |
| LEGACY_EXAMPLES.md recommendation | 100% | 100% |
| DRAFT_REPLICATION.md recommendation | 100% | 100% |
| Bloat reduction framing | 40% | 100% |
| Clear routing signals emphasis | 80% | 100% |
| Link vs remove justification | 100% | 100% |
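The Bundle File Audit criteria begin with an inventory. Every bundle file is listed, then sorted into files that SKILL.md links to and orphaned files it never mentions, before each orphan gets a link-or-remove recommendation. The sketch below covers only the inventory step. It assumes Markdown [text](path) links with paths relative to the bundle root, and the function names are hypothetical. The link-or-remove decision still needs judgment.

```python
import re
from pathlib import Path

# Target of a Markdown link, e.g. "REFERENCE.md" in [ref](./REFERENCE.md#api).
LINK_RE = re.compile(r"\]\(([^)\s#]+)")


def classify_bundle_files(skill_text: str, bundle_files: list[str]) -> dict[str, list[str]]:
    """Split bundle files into those SKILL.md links to and orphans it never mentions."""
    linked = {
        target.removeprefix("./")
        for target in LINK_RE.findall(skill_text)
        if "://" not in target  # skip external URLs
    }
    return {
        "referenced": sorted(f for f in bundle_files if f in linked),
        "orphaned": sorted(f for f in bundle_files if f not in linked),
    }


def audit_bundle_dir(bundle_dir: Path) -> dict[str, list[str]]:
    """Read SKILL.md and classify every other Markdown file in the bundle."""
    skill_text = (bundle_dir / "SKILL.md").read_text(encoding="utf-8")
    others = [
        p.relative_to(bundle_dir).as_posix()
        for p in bundle_dir.rglob("*.md")
        if p.name != "SKILL.md"
    ]
    return classify_bundle_files(skill_text, others)
```

Keeping the classification separate from the filesystem walk makes it easy to test against a hand-written file list.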

Approval-Gated Skill Change Proposal
With-context score: 98% (+5% vs. without context)

| Criterion | Without context | With context |
|---|---|---|
| SKILL.md not modified | 100% | 100% |
| Explicit approval request | 100% | 100% |
| Trade-off discussion | 100% | 100% |
| Risk assessment per recommendation | 100% | 100% |
| Grouped presentation | 100% | 100% |
| All key issues addressed | 100% | 100% |
| Priority summary present | 100% | 80% |
| Current score per recommendation | 30% | 100% |

Context File Detection for Scenario Generation
With-context score: 100%

| Criterion | Without context | With context |
|---|---|---|
| identifies_mdc_files | 100% | 100% |
| identifies_claude_md | 100% | 100% |
| identifies_agents_md | 100% | 100% |
| identifies_tessl_json | 100% | 100% |
| excludes_tessl_cache | 100% | 100% |
| excludes_generic_docs | 100% | 100% |
| excludes_source_and_build_config | 100% | 100% |
| constructs_valid_context_flag | 100% | 100% |
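Context File Detection lists what the skill should pick up before generating scenarios: .mdc files, CLAUDE.md, AGENTS.md and tessl.json. It also lists what to skip: the .tessl cache, generic docs, and source and build config. Below is a sketch of that filter over a list of repository paths. It assumes the rules are exactly those named in the criteria, and the function name is hypothetical. Turning the result into the CLI's context flag is not shown, since its exact syntax is not given on this page.

```python
from pathlib import PurePosixPath

# Assumed rules, read off the criteria names: .mdc rule files, CLAUDE.md,
# AGENTS.md and tessl.json count as context; anything under .tessl/ is cache.
CONTEXT_FILENAMES = {"CLAUDE.md", "AGENTS.md", "tessl.json"}


def detect_context_files(paths: list[str]) -> list[str]:
    """Return the repo-relative paths that look like agent context files."""
    found = []
    for path in paths:
        parts = PurePosixPath(path).parts
        if not parts or ".tessl" in parts:
            continue  # empty entry, or cached tile content
        name = parts[-1]
        if name in CONTEXT_FILENAMES or name.endswith(".mdc"):
            found.append(path)
    # README.md, package.json, source files etc. never match the rules above.
    return sorted(found)
```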

Scenario 18
With-context score: 0% (-28% vs. without context)

| Criterion | Without context | With context |
|---|---|---|
| does_not_use_last_only | 16% | 0% |
| finds_generation_ids | 60% | 0% |
| downloads_each_separately | 40% | 0% |
| explains_why | 0% | 0% |

Model Benchmark Comparison Report
With-context score: 92% (+8% vs. without context)

| Criterion | Without context | With context |
|---|---|---|
| Overall summary table | 100% | 100% |
| Per-scenario breakdown | 100% | 100% |
| Per-criterion table | 100% | 100% |
| Correct symbol thresholds | 0% | 100% |
| Baseline interpretation | 100% | 100% |
| Universal Failure identified | 80% | 100% |
| Capability Gradient identified | 80% | 100% |
| Regression identified | 100% | 100% |
| Fix before publish recommendation | 100% | 100% |
| eval-improve mentioned | 100% | 0% |
| Re-run offer | 80% | 100% |

Tile Eval Readiness Checker
With-context score: 81% (+60% vs. without context)

| Criterion | Without context | With context |
|---|---|---|
| Excludes .tessl cache | 0% | 80% |
| .tessl/tiles warning | 0% | 90% |
| Scenario existence check | 70% | 100% |
| Scenario generation guidance | 0% | 100% |
| Login verification | 10% | 100% |
| No --workspace flag | 100% | 100% |
| Default model names | 0% | 100% |
| Model subset confirmation | 0% | 50% |
| Time estimate provided | 41% | 100% |
| Run count option | 0% | 0% |

Eval Scenario Quality Review
With-context score: 96% (-4% vs. without context)

| Criterion | Without context | With context |
|---|---|---|
| identifies_scenario_1_acceptable | 100% | 100% |
| detects_answer_leakage | 100% | 80% |
| explains_leakage_impact | 100% | 100% |
| detects_double_counting | 100% | 100% |
| detects_free_point_criterion | 100% | 100% |
| proposes_specific_fixes | 100% | 100% |
| no_false_positives_scenario_1 | 100% | 100% |

Skill Score Maximization
With-context score: 100%

| Criterion | Without context | With context |
|---|---|---|
| Completeness weight correct | 100% | 100% |
| Conciseness weight correct | 100% | 100% |
| Actionability weight correct | 100% | 100% |
| Use when clause highest impact | 100% | 100% |
| Use when quantified | 100% | 100% |
| Revised description includes Use when | 100% | 100% |
| Executable code recommended | 100% | 100% |
| Known concepts flagged | 100% | 100% |
| High-impact first ordering | 100% | 100% |
| Dimension coverage | 100% | 100% |

Commit Selection for Eval Scenario Generation
With-context score: 97% (-1% vs. without context)

| Criterion | Without context | With context |
|---|---|---|
| skips_trivial_commits | 100% | 100% |
| skips_docs_and_config_only | 100% | 100% |
| skips_mechanical_generated_commit | 100% | 100% |
| scores_payment_commit_high | 100% | 100% |
| scores_auth_refactor_highest | 80% | 70% |
| references_complexity_signals | 100% | 100% |
| recommends_two_or_three_commits | 100% | 100% |
| explains_selection_rationale | 100% | 100% |
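The Commit Selection criteria describe a filter-then-rank step. Trivial, docs- or config-only, and mechanically generated commits are skipped, complexity signals are weighed, and two or three commits are recommended. Below is a hypothetical sketch of that shape. The thresholds, the file lists and the size-based ranking are all assumptions. Ranking by files and lines touched is only a rough proxy for the complexity signals the criteria mention.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Hypothetical heuristics: these thresholds and file lists are illustrative.
MIN_LINES_CHANGED = 20
NON_CODE_SUFFIXES = (".md", ".rst", ".txt", ".json", ".yaml", ".yml", ".toml", ".ini", ".cfg")
GENERATED_FILENAMES = {"package-lock.json", "yarn.lock", "pnpm-lock.yaml", "poetry.lock", "Cargo.lock"}


@dataclass
class Commit:
    sha: str
    files: list[str]
    lines_changed: int


def code_files(commit: Commit) -> list[str]:
    """Files that are neither docs/config nor generated lockfiles."""
    return [
        f for f in commit.files
        if PurePosixPath(f).name not in GENERATED_FILENAMES
        and not f.endswith(NON_CODE_SUFFIXES)
    ]


def select_commits(commits: list[Commit], k: int = 3) -> list[Commit]:
    """Drop trivial, docs/config-only and generated commits, then rank the rest."""
    candidates = [
        c for c in commits
        if c.lines_changed >= MIN_LINES_CHANGED and code_files(c)
    ]
    # Complexity proxy: more code files touched first, then more lines changed.
    candidates.sort(key=lambda c: (len(code_files(c)), c.lines_changed), reverse=True)
    return candidates[:k]
```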

Scenario 24
With-context score: 24% (-10% vs. without context)

| Criterion | Without context | With context |
|---|---|---|
| checks_prerequisites | 100% | 100% |
| browses_commits | 0% | 0% |
| auto_detects_context_files | 0% | 0% |
| uses_context_flag | 58% | 0% |
| workspace_in_eval_run | 0% | 0% |
| explains_baseline_vs_context | 100% | 100% |

Evaluated with agent Claude, model Claude Sonnet 4.6.