
tessl/skill-optimizer

Optimize your skills and tiles: review SKILL.md quality, generate eval scenarios, run evals, compare across models, diagnose gaps, and re-run until scores improve.

Quality: 91% (Does it follow best practices?)

Impact: 92%, 1.10x (average score across 25 eval scenarios)

Security by Snyk: Passed, no known issues


evals/scenario-8/criteria.json

{
  "context": "Tests whether the agent applies the correct triage logic from the optimization cycle skill: identifying regressions as highest priority, recognizing that high baselines suggest scenarios may be too easy (consider regenerating), and recommending the right corrective actions before proceeding to tile content edits.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Regression identified",
      "description": "Report identifies the security-review scenario as a regression (with_context 79% < baseline 84%, delta -5)",
      "max_score": 15
    },
    {
      "name": "Regression is highest priority",
      "description": "Report explicitly labels the regression as the highest priority issue, using 'highest priority', 'most urgent', or equivalent strong language that makes it the top item to address",
      "max_score": 18
    },
    {
      "name": "High baseline warning present",
      "description": "Report notes that all three scenarios have high baseline scores (84%, 87%, 82% — all above 80%), suggesting the scenarios may be too easy and agents can solve them without the tile",
      "max_score": 18
    },
    {
      "name": "Scenario regeneration suggested",
      "description": "Report recommends considering regenerating harder scenarios (before or alongside tile content fixes) as a response to the high baseline scores",
      "max_score": 15
    },
    {
      "name": "Tile is actively hurting",
      "description": "Report states or implies that the tile is actively hurting agent performance on security-review (not just failing to help) — because with-tile scores are lower than without-tile scores on that scenario",
      "max_score": 12
    },
    {
      "name": "Per-criterion regression analysis",
      "description": "Report identifies at least two specific criteria within security-review that regressed (e.g., flags_injection_risks dropped from 9→7, checks_auth_bypass 8→6, references_cwe 7→5)",
      "max_score": 12
    },
    {
      "name": "Correct prioritization order",
      "description": "Recommendations are ordered with regression fix BEFORE any suggestion to improve the tile for the other two scenarios — NOT presented as equally weighted items",
      "max_score": 10
    }
  ]
}
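
The triage order this checklist rewards (regressions first, then the high-baseline warning) can be sketched in a few lines of Python. This is only an illustration, not the skill's actual implementation: the `ScenarioResult` class, the `triage` function, and the two scenarios other than security-review (names and with-context scores) are hypothetical. The 80% threshold and the security-review numbers come from the criteria above.

```python
from dataclasses import dataclass

# Threshold taken from the "all above 80%" wording in the criteria.
HIGH_BASELINE = 80


@dataclass
class ScenarioResult:
    """Hypothetical container for one scenario's eval scores (percent)."""
    name: str
    baseline: int      # score without the tile
    with_context: int  # score with the tile

    @property
    def delta(self) -> int:
        return self.with_context - self.baseline


def triage(results):
    """Return findings in priority order: regressions first, then the baseline warning."""
    findings = []
    # A negative delta means the tile is actively hurting, not just failing to help.
    # Sort so the largest drop comes first.
    regressions = sorted((r for r in results if r.delta < 0), key=lambda r: r.delta)
    for r in regressions:
        findings.append(
            f"REGRESSION (highest priority): {r.name} "
            f"{r.with_context}% < {r.baseline}%, delta {r.delta}"
        )
    # If every baseline is high, the scenarios may be too easy to measure the tile.
    if results and all(r.baseline > HIGH_BASELINE for r in results):
        findings.append(
            "HIGH BASELINE: all scenarios exceed 80% without the tile; "
            "consider regenerating harder scenarios"
        )
    return findings


if __name__ == "__main__":
    results = [
        ScenarioResult("security-review", baseline=84, with_context=79),
        # Hypothetical names and with_context values, for illustration only.
        ScenarioResult("scenario-a", baseline=87, with_context=90),
        ScenarioResult("scenario-b", baseline=82, with_context=85),
    ]
    for finding in triage(results):
        print(finding)
```

Returning an ordered list rather than an unordered set of flags mirrors the "Correct prioritization order" item: the regression fix appears before anything else.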
