Optimize your skills and plugins: review SKILL.md quality, generate eval scenarios, run evals, compare across models, diagnose gaps, and re-run until scores improve.
88
87% Does it follow best practices?
Impact: 89% (1.14x), average score across 29 eval scenarios
Advisory: Suggest reviewing before use
SKILL.md not modified | 100% | 100%
Explicit approval request | 100% | 100%
Trade-off discussion | 100% | 100%
Risk assessment per recommendation | 100% | 100%
Grouped presentation | 100% | 100%
All key issues addressed | 100% | 100%
Priority summary present | 60% | 100%
Current score per recommendation | 100% | 100%
REFERENCE.md not recreated | 100% | 100%
No REFERENCE.md changes proposed | 100% | 100%
SKILL.md produced | 100% | 100%
Use when clause added | 100% | 100%
Inline duplication removed | 100% | 100%
REFERENCE.md linked | 100% | 100%
Core examples retained | 100% | 100%
SKILL.md shorter | 100% | 100%
Change log documents SKILL.md changes | 100% | 100%
Change log explains why | 100% | 100%
Overall before/after format | 100% | 100%
Percentage delta shown | 100% | 100%
Per-dimension breakdown | 100% | 100%
Arrow notation or equivalent | 100% | 100%
Dimension change labelled | 100% | 100%
Dimensions impact explained | 100% | 100%
Correct overall scores | 100% | 100%
Completeness improvement noted | 100% | 100%
Actionability improvement noted | 100% | 100%
Conciseness unchanged noted | 100% | 100%
Robustness improvement noted | 100% | 100%
Code syntax check included | 100% | 100%
Python syntax error found | 100% | 100%
Command flags check included | 100% | 100%
File references check included | 100% | 100%
File reference passes | 100% | 100%
Use when clause check included | 0% | 0%
Use when clause fails | 0% | 0%
Known concepts check included | 60% | 70%
Known concepts issue found | 100% | 100%
Readiness summary | 100% | 100%
Critical issues first | 100% | 100%
High before Medium/Low | 100% | 100%
Summary with priorities | 50% | 30%
Expected improvement in summary | 75% | 100%
Dimension score included | 70% | 60%
Before/after examples | 58% | 66%
Impact stated per recommendation | 75% | 50%
Educational WHY included | 100% | 100%
All four issues addressed | 100% | 100%
Approval framing | 80% | 80%
Identifies good references | 100% | 100%
Explains why good | 100% | 100%
Identifies poor references | 80% | 100%
Explains why poor | 90% | 100%
Token efficiency framing | 100% | 100%
Routing gate test | 100% | 100%
Improves CONFIGURATION.md | 100% | 100%
Improves GUIDE.md | 100% | 100%
Improves EXAMPLES.md | 100% | 100%
Improves ADVANCED.md or REFERENCE.md | 100% | 100%
Questions blind split recommendation | 0% | 0%
Linking over inlining | 100% | 100%
Reference file identified | 100% | 100%
Severity mappings removed | 100% | 100%
Flag tables removed | 100% | 100%
Template list removed | 100% | 100%
SKILL.md substantially shorter | 100% | 100%
Core examples preserved | 100% | 100%
Before/after shown | 100% | 100%
WHY explained | 100% | 100%
REFERENCE.md not modified | 100% | 100%
identifies_scenario_1_acceptable | 0% | 0%
detects_answer_leakage | 100% | 100%
explains_leakage_impact | 100% | 100%
detects_double_counting | 100% | 100%
detects_free_point_criterion | 100% | 0%
proposes_specific_fixes | 100% | 66%
no_false_positives_scenario_1 | 0% | 0%
Completeness weight correct | 100% | 100%
Conciseness weight correct | 100% | 100%
Actionability weight correct | 100% | 100%
Use when clause highest impact | 100% | 100%
Use when quantified | 100% | 100%
Revised description includes Use when | 100% | 100%
Executable code recommended | 100% | 100%
Known concepts flagged | 70% | 100%
High-impact first ordering | 100% | 100%
Dimension coverage | 100% | 100%
tessl skill review command | 0% | 100%
Review before changes | 0% | 100%
Review after changes | 0% | 100%
Validation before apply | 90% | 100%
Python ast.parse validation | 0% | 100%
node --check JS validation | 0% | 100%
Command --help flag validation | 0% | 0%
File reference validation | 28% | 100%
Before/after score output | 87% | 100%
Script accepts SKILL.md path | 100% | 100%
Phases are ordered | 50% | 100%
Explicit retry intervals | 100% | 100%
Rubric language used | 100% | 100%
HMAC section unchanged | 100% | 100%
TLS section unchanged | 100% | 100%
Observability section unchanged | 100% | 100%
Processing section unchanged | 100% | 100%
Retry section only changed | 100% | 100%
Concise addition | 100% | 100%
Max retry count preserved | 100% | 100%
Fast acknowledgement preserved | 100% | 100%
Bucket A: idempotency key | 100% | 100%
Bucket B: webhook signature | 100% | 100%
Bucket C: HTTP status codes | 100% | 100%
Bucket B: currency precision | 100% | 100%
Bucket D: API version pinning | 100% | 100%
Bucket D highest priority | 100% | 100%
Bucket B diagnosis present | 100% | 100%
Bucket C action suggested | 50% | 40%
Bucket A no-action | 100% | 100%
80% threshold applied | 100% | 100%
Uses activation eval to surface collisions | 10% | 96%
Proposes description disambiguation | 80% | 94%
All redundant criteria identified | 100% | 100%
Options presented per criterion | 100% | 100%
Useful criteria preserved | 100% | 100%
Weight redistribution correct | 100% | 100%
80% threshold applied | 100% | 87%
Non-redundant scores unchanged | 100% | 100%
Below-threshold excluded | 100% | 100%
Removal option named explicitly | 100% | 100%
Recommends activation eval first | 10% | 0%
Defines pass/fail criteria | 60% | 0%
Contradicting clause identified | 100% | 100%
Contradiction mechanism explained | 100% | 100%
Remove/clarify approach taken | 100% | 100%
Specific text targeted | 100% | 100%
No compensating additions | 100% | 100%
Other sections preserved | 100% | 100%
Pre-review list intact | 100% | 100%
Lists all bundle files | 100% | 100%
Identifies referenced files | 100% | 100%
Identifies orphaned files | 100% | 100%
TRANSACTIONS.md recommendation | 100% | 100%
PERFORMANCE.md recommendation | 100% | 100%
SECURITY.md recommendation | 100% | 100%
LEGACY_EXAMPLES.md recommendation | 100% | 100%
DRAFT_REPLICATION.md recommendation | 100% | 100%
Bloat reduction framing | 80% | 100%
Clear routing signals emphasis | 100% | 100%
Link vs remove justification | 100% | 100%
Distinguishes routing gap from out-of-scope | 100% | 96%
Addresses never-fired skills | 100% | 92%
Points to activation eval as the fast check | 10% | 100%
Suggests before/after comparison | 70% | 100%
Retry count contradiction found | 100% | 100%
Auth failure contradiction found | 100% | 100%
All three files referenced | 100% | 100%
File attribution per contradiction | 100% | 100%
Auth contradiction despite scope | 100% | 100%
Verbatim quotes included | 100% | 100%
Python via ast.parse | 0% | 0%
Python error identified | 100% | 100%
JavaScript via node --check | 0% | 0%
Command flag validation | 0% | 20%
File reference check | 100% | 86%
Broken reference identified | 100% | 100%
Validation before application | 100% | 100%
Per-check pass/fail | 100% | 100%
Fix suggestions | 100% | 90%
Routing table present | 100% | 100%
Skill coverage summary correct | 100% | 100%
rewrite-intro out-of-scope determination | 100% | 100%
generate-bibliography routing gap determination | 100% | 100%
fix-heading-hierarchy routing gap determination | 100% | 100%
citation-generator description rewrite | 100% | 100%
markdown-formatter description rewrite | 100% | 100%
Minimal rewrite principle | 100% | 100%
Rewrites presented together | 100% | 100%
Scored eval data cited | 100% | 100%
Skill count detection command | 35% | 100%
Activation eval run first | 0% | 100%
Scored eval follows activation | 57% | 100%
Routing-clean gate explained | 100% | 100%
Skip activation condition stated | 100% | 28%
Correct eval run command format | 0% | 100%
--skip-forced-context-activation --skip-scoring flags used | 0% | 100%
Plugin path used consistently | 100% | 100%
Regression identified | 100% | 100%
Regression is highest priority | 100% | 100%
High baseline warning present | 0% | 77%
Scenario regeneration suggested | 0% | 60%
Plugin is actively hurting | 100% | 100%
Per-criterion regression analysis | 100% | 100%
Correct prioritization order | 100% | 100%
Uses --strategy merge | 100% | 100%
Does NOT use --strategy replace | 100% | 100%
Correct base command | 100% | 100%
Output directory specified | 100% | 100%
Verification step present | 100% | 100%
Run ID or --last used | 100% | 100%
Existing scenarios preserved | 100% | 100%
tessl plugin lint command used | 0% | 100%
Plugin path argument provided | 20% | 100%
Lint run after each change set | 53% | 100%
Token cost ballooning flagged | 92% | 100%
Move to docs recommended | 90% | 100%
Docs vs rules distinction | 78% | 71%
Does NOT recommend rules for heavy content | 100% | 80%
Excludes .tessl cache | 0% | 100%
.tessl/plugins warning | 0% | 100%
Scenario existence check | 40% | 100%
Scenario generation guidance | 0% | 100%
Login verification | 0% | 100%
No --workspace flag | 100% | 100%
Default model names | 20% | 100%
Model subset confirmation | 0% | 50%
Time estimate provided | 25% | 100%
Run count option | 0% | 66%
Correct base command | 100% | 100%
--agent flag format | 90% | 100%
All three default models | 80% | 100%
Sequential execution | 100% | 100%
Run ID capture | 100% | 100%
Model-to-ID mapping | 100% | 50%
Monitoring URL output | 37% | 100%
Polls with tessl eval view | 100% | 100%
Retry on failure | 100% | 100%
Waits for all to complete | 100% | 100%
No --workspace flag | 100% | 100%
Overall summary table | 100% | 100%
Per-scenario breakdown | 100% | 100%
Per-criterion table | 100% | 100%
Correct symbol thresholds | 40% | 100%
Baseline interpretation | 100% | 100%
Universal Failure identified | 90% | 100%
Capability Gradient identified | 90% | 100%
Regression identified | 90% | 100%
Fix before publish recommendation | 100% | 100%
eval-improve mentioned | 100% | 0%
Re-run offer | 100% | 100%