tessl/skill-optimizer

Optimize your skills and tiles: review SKILL.md quality, generate eval scenarios, run evals, compare across models, diagnose gaps, and re-run until scores improve.

91

1.10x
Quality: 91% (Does it follow best practices?)

Impact: 92%, 1.10x (Average score across 25 eval scenarios)

Security by Snyk: Passed (No known issues)

Evaluation results

Routing Health Report for content-tools Tile

Score: 76% (change: -24%)

| Criteria | Without context | With context |
| --- | --- | --- |
| Routing table present | 100% | 100% |
| Skill coverage summary correct | 100% | 100% |
| rewrite-intro out-of-scope determination | 100% | 100% |
| generate-bibliography routing gap determination | 100% | 0% |
| fix-heading-hierarchy routing gap determination | 100% | 100% |
| citation-generator description rewrite | 100% | 0% |
| markdown-formatter description rewrite | 100% | 100% |
| Minimal rewrite principle | 100% | 100% |
| Rewrites presented together | 100% | 100% |
| Scored eval data cited | 100% | 100% |

Approval-Gated Skill Change Proposal

Score: 100% (change: 5%)

| Criteria | Without context | With context |
| --- | --- | --- |
| SKILL.md not modified | 100% | 100% |
| Explicit approval request | 100% | 100% |
| Trade-off discussion | 100% | 100% |
| Risk assessment per recommendation | 100% | 100% |
| Grouped presentation | 100% | 100% |
| All key issues addressed | 100% | 100% |
| Priority summary present | 80% | 100% |
| Current score per recommendation | 70% | 100% |

Data Pipeline Tile: Consistency Audit

Score: 100%

| Criteria | Without context | With context |
| --- | --- | --- |
| Retry count contradiction found | 100% | 100% |
| Auth failure contradiction found | 100% | 100% |
| All three files referenced | 100% | 100% |
| File attribution per contradiction | 100% | 100% |
| Auth contradiction despite scope | 100% | 100% |
| Verbatim quotes included | 100% | 100% |

Payments Tile Eval Analysis

Score: 100% (change: 6%)

| Criteria | Without context | With context |
| --- | --- | --- |
| Bucket A: idempotency key | 100% | 100% |
| Bucket B: webhook signature | 100% | 100% |
| Bucket C: HTTP status codes | 100% | 100% |
| Bucket B: currency precision | 100% | 100% |
| Bucket D: API version pinning | 100% | 100% |
| Bucket D highest priority | 100% | 100% |
| Bucket B diagnosis present | 100% | 100% |
| Bucket C action suggested | 60% | 100% |
| Bucket A no-action | 75% | 100% |
| 80% threshold applied | 100% | 100% |

Model Benchmark Comparison Report

Score: 92% (change: 2%)

| Criteria | Without context | With context |
| --- | --- | --- |
| Overall summary table | 100% | 100% |
| Per-scenario breakdown | 100% | 100% |
| Per-criterion table | 100% | 100% |
| Correct symbol thresholds | 0% | 100% |
| Baseline interpretation | 100% | 100% |
| Universal Failure identified | 100% | 100% |
| Capability Gradient identified | 100% | 100% |
| Regression identified | 100% | 100% |
| Fix before publish recommendation | 100% | 100% |
| eval-improve mentioned | 100% | 0% |
| Re-run offer | 100% | 100% |

Eval Kickoff Plan for invoice-processor Tile

Score: 100% (change: 39%)

| Criteria | Without context | With context |
| --- | --- | --- |
| Skill count detection command | 78% | 100% |
| Activation eval run first | 25% | 100% |
| Scored eval follows activation | 85% | 100% |
| Routing-clean gate explained | 91% | 100% |
| Skip activation condition stated | 100% | 100% |
| Correct eval run command format | 0% | 100% |
| --solver=activation flag used | 0% | 100% |
| Tile path used consistently | 100% | 100% |

Skill Quality Improvement

Score: 100%

| Criteria | Without context | With context |
| --- | --- | --- |
| REFERENCE.md not recreated | 100% | 100% |
| No REFERENCE.md changes proposed | 100% | 100% |
| SKILL.md produced | 100% | 100% |
| Use when clause added | 100% | 100% |
| Inline duplication removed | 100% | 100% |
| REFERENCE.md linked | 100% | 100% |
| Core examples retained | 100% | 100% |
| SKILL.md shorter | 100% | 100% |
| Change log documents SKILL.md changes | 100% | 100% |
| Change log explains why | 100% | 100% |

Optimization Decision Point for pull-request-reviewer Tile

Score: 100% (change: 33%)

| Criteria | Without context | With context |
| --- | --- | --- |
| Regression identified | 100% | 100% |
| Regression is highest priority | 100% | 100% |
| High baseline warning present | 0% | 100% |
| Scenario regeneration suggested | 0% | 100% |
| Tile is actively hurting | 100% | 100% |
| Per-criterion regression analysis | 100% | 100% |
| Correct prioritization order | 100% | 100% |

Bundle File Audit

Score: 99% (change: 1%)

| Criteria | Without context | With context |
| --- | --- | --- |
| Lists all bundle files | 100% | 100% |
| Identifies referenced files | 100% | 100% |
| Identifies orphaned files | 100% | 100% |
| TRANSACTIONS.md recommendation | 100% | 100% |
| PERFORMANCE.md recommendation | 100% | 100% |
| SECURITY.md recommendation | 100% | 100% |
| LEGACY_EXAMPLES.md recommendation | 100% | 100% |
| DRAFT_REPLICATION.md recommendation | 100% | 100% |
| Bloat reduction framing | 60% | 80% |
| Clear routing signals emphasis | 100% | 100% |
| Link vs remove justification | 100% | 100% |

Skill Bundle Validation

Score: 85% (change: 8%)

| Criteria | Without context | With context |
| --- | --- | --- |
| Python via ast.parse | 53% | 86% |
| Python error identified | 100% | 100% |
| JavaScript via node --check | 33% | 40% |
| Command flag validation | 40% | 60% |
| File reference check | 100% | 100% |
| Broken reference identified | 100% | 100% |
| Validation before application | 100% | 100% |
| Per-check pass/fail | 100% | 100% |
| Fix suggestions | 100% | 100% |

Skill Optimization Results Report

Score: 100%

| Criteria | Without context | With context |
| --- | --- | --- |
| Overall before/after format | 100% | 100% |
| Percentage delta shown | 100% | 100% |
| Per-dimension breakdown | 100% | 100% |
| Arrow notation or equivalent | 100% | 100% |
| Dimension change labelled | 100% | 100% |
| Dimensions impact explained | 100% | 100% |
| Correct overall scores | 100% | 100% |
| Completeness improvement noted | 100% | 100% |
| Actionability improvement noted | 100% | 100% |
| Conciseness unchanged noted | 100% | 100% |
| Robustness improvement noted | 100% | 100% |

Skill Post-Edit Quality Audit

Score: 100% (change: 29%)

| Criteria | Without context | With context |
| --- | --- | --- |
| Code syntax check included | 100% | 100% |
| Python syntax error found | 100% | 100% |
| Command flags check included | 100% | 100% |
| File references check included | 100% | 100% |
| File reference passes | 100% | 100% |
| Use when clause check included | 0% | 100% |
| Use when clause fails | 0% | 100% |
| Known concepts check included | 70% | 100% |
| Known concepts issue found | 83% | 100% |
| Readiness summary | 100% | 100% |

Tile Eval Readiness Checker

Score: 83% (change: 65%)

| Criteria | Without context | With context |
| --- | --- | --- |
| Excludes .tessl cache | 0% | 100% |
| .tessl/tiles warning | 0% | 100% |
| Scenario existence check | 40% | 100% |
| Scenario generation guidance | 0% | 100% |
| Login verification | 20% | 100% |
| No --workspace flag | 100% | 100% |
| Default model names | 0% | 100% |
| Model subset confirmation | 0% | 37% |
| Time estimate provided | 33% | 100% |
| Run count option | 0% | 0% |

Skill Improvement Recommendations

Score: 100% (change: 23%)

| Criteria | Without context | With context |
| --- | --- | --- |
| Critical issues first | 100% | 100% |
| High before Medium/Low | 100% | 100% |
| Summary with priorities | 50% | 100% |
| Expected improvement in summary | 100% | 100% |
| Dimension score included | 80% | 100% |
| Before/after examples | 75% | 100% |
| Impact stated per recommendation | 12% | 100% |
| Educational WHY included | 100% | 100% |
| All four issues addressed | 100% | 100% |
| Approval framing | 40% | 100% |

Progressive Disclosure Evaluation

Score: 87%

| Criteria | Without context | With context |
| --- | --- | --- |
| Identifies good references | 100% | 100% |
| Explains why good | 100% | 100% |
| Identifies poor references | 100% | 80% |
| Explains why poor | 100% | 100% |
| Token efficiency framing | 70% | 100% |
| Routing gate test | 100% | 100% |
| Improves CONFIGURATION.md | 100% | 100% |
| Improves GUIDE.md | 100% | 100% |
| Improves EXAMPLES.md | 100% | 100% |
| Improves ADVANCED.md or REFERENCE.md | 100% | 100% |
| Questions blind split recommendation | 0% | 0% |

Skill Length Reduction

Score: 98% (change: -2%)

| Criteria | Without context | With context |
| --- | --- | --- |
| Linking over inlining | 100% | 100% |
| Reference file identified | 100% | 100% |
| Severity mappings removed | 100% | 80% |
| Flag tables removed | 100% | 100% |
| Template list removed | 100% | 100% |
| SKILL.md substantially shorter | 100% | 100% |
| Core examples preserved | 100% | 100% |
| Before/after shown | 100% | 100% |
| WHY explained | 100% | 100% |
| REFERENCE.md not modified | 100% | 100% |

API Integration Tile: Eval Rubric Review

Score: 100%

| Criteria | Without context | With context |
| --- | --- | --- |
| All redundant criteria identified | 100% | 100% |
| Options presented per criterion | 100% | 100% |
| Useful criteria preserved | 100% | 100% |
| Weight redistribution correct | 100% | 100% |
| 80% threshold applied | 100% | 100% |
| Non-redundant scores unchanged | 100% | 100% |
| Below-threshold excluded | 100% | 100% |
| Removal option named explicitly | 100% | 100% |

Code Review Tile: Regression Investigation

Score: 100%

| Criteria | Without context | With context |
| --- | --- | --- |
| Contradicting clause identified | 100% | 100% |
| Contradiction mechanism explained | 100% | 100% |
| Remove/clarify approach taken | 100% | 100% |
| Specific text targeted | 100% | 100% |
| No compensating additions | 100% | 100% |
| Other sections preserved | 100% | 100% |
| Pre-review list intact | 100% | 100% |

Expand Eval Coverage for shopify-connector Tile

Score: 100% (change: 4%)

| Criteria | Without context | With context |
| --- | --- | --- |
| Uses --strategy merge | 100% | 100% |
| Does NOT use --strategy replace | 100% | 100% |
| Correct base command | 100% | 100% |
| Output directory specified | 100% | 100% |
| Verification step present | 100% | 100% |
| Run ID or --last used | 100% | 100% |
| Existing scenarios preserved | 50% | 100% |

Eval Scenario Quality Review

Score: 74% (change: -6%)

| Criteria | Without context | With context |
| --- | --- | --- |
| identifies_scenario_1_acceptable | 0% | 0% |
| detects_answer_leakage | 100% | 95% |
| explains_leakage_impact | 100% | 80% |
| detects_double_counting | 100% | 90% |
| detects_free_point_criterion | 100% | 93% |
| proposes_specific_fixes | 100% | 86% |
| no_false_positives_scenario_1 | 0% | 20% |

Skill Score Maximization

Score: 100% (change: 1%)

| Criteria | Without context | With context |
| --- | --- | --- |
| Completeness weight correct | 100% | 100% |
| Conciseness weight correct | 100% | 100% |
| Actionability weight correct | 100% | 100% |
| Use when clause highest impact | 100% | 100% |
| Use when quantified | 100% | 100% |
| Revised description includes Use when | 100% | 100% |
| Executable code recommended | 100% | 100% |
| Known concepts flagged | 90% | 100% |
| High-impact first ordering | 100% | 100% |
| Dimension coverage | 100% | 100% |

Multi-Model Tile Benchmark Automation

Score: 100% (change: 21%)

| Criteria | Without context | With context |
| --- | --- | --- |
| Correct base command | 100% | 100% |
| --agent flag format | 20% | 100% |
| All three default models | 40% | 100% |
| Sequential execution | 100% | 100% |
| Run ID capture | 100% | 100% |
| Model-to-ID mapping | 100% | 100% |
| Monitoring URL output | 12% | 100% |
| Polls with tessl eval view | 100% | 100% |
| Retry on failure | 100% | 100% |
| Waits for all to complete | 100% | 100% |
| No --workspace flag | 100% | 100% |

Webhook Processor Tile: Retry Reliability Fix

Score: 96% (change: -4%)

| Criteria | Without context | With context |
| --- | --- | --- |
| Explicit retry intervals | 100% | 100% |
| Rubric language used | 100% | 100% |
| HMAC section unchanged | 100% | 100% |
| TLS section unchanged | 100% | 100% |
| Observability section unchanged | 100% | 100% |
| Processing section unchanged | 100% | 100% |
| Retry section only changed | 100% | 100% |
| Concise addition | 100% | 60% |
| Max retry count preserved | 100% | 100% |
| Fast acknowledgement preserved | 100% | 100% |

Skill Optimization Automation

Score: 76% (change: 42%)

| Criteria | Without context | With context |
| --- | --- | --- |
| tessl skill review command | 0% | 100% |
| Review before changes | 0% | 100% |
| Review after changes | 0% | 100% |
| Validation before apply | 100% | 100% |
| Python ast.parse validation | 0% | 0% |
| node --check JS validation | 0% | 0% |
| Command --help flag validation | 0% | 0% |
| File reference validation | 0% | 100% |
| Before/after score output | 100% | 100% |
| Script accepts SKILL.md path | 100% | 100% |
| Phases are ordered | 100% | 100% |

Post-Fix Validation for database-migrator Tile

Score: 57% (change: 2%)

| Criteria | Without context | With context |
| --- | --- | --- |
| tessl tile lint command used | 0% | 0% |
| Tile path argument provided | 0% | 50% |
| Lint run after each change set | 33% | 46% |
| Token cost ballooning flagged | 100% | 92% |
| Move to docs recommended | 100% | 95% |
| Docs vs rules distinction | 78% | 64% |
| Does NOT recommend rules for heavy content | 100% | 100% |

Evaluated

Agent: Claude

Model: Claude Sonnet 4.6