tessl-labs/skill-optimizer

Optimize your skills and tiles: review SKILL.md quality, generate eval scenarios, run evals, compare across models, diagnose gaps, and re-run until scores improve.

Quality: 94% (Does it follow best practices?)

Impact: 88%, 1.07x (average score across 24 eval scenarios)

Security (by Snyk): Passed, no known issues

evals/scenario-16/criteria.json

{
  "context": "Tests whether the agent correctly implements an approval-gated change workflow: producing a proposal document with grouped recommendations, trade-off analysis, risk assessment, and an explicit approval request — without modifying the SKILL.md file. Distinct from prioritized-recommendation-generation which tests recommendation quality and formatting.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "SKILL.md not modified",
      "description": "The output does NOT include a modified SKILL.md file — only the proposal document is produced. The agent resisted the temptation to apply fixes directly.",
      "max_score": 20
    },
    {
      "name": "Explicit approval request",
      "description": "The proposal ends with a clear, explicit approval gate — asking the tech lead to approve, reject, or request changes before any edits proceed. Not just 'framed as a proposal' but actively requesting a decision.",
      "max_score": 15
    },
    {
      "name": "Trade-off discussion",
      "description": "The proposal surfaces at least one genuine tension between recommendations (e.g., adding detail for completeness vs. removing detail for conciseness) and recommends a resolution while acknowledging the cost.",
      "max_score": 15
    },
    {
      "name": "Risk assessment per recommendation",
      "description": "Each recommendation (or each group) includes what could go wrong — e.g., routing regressions, lost context, broken references, or unintended side effects if the change is applied incorrectly.",
      "max_score": 12
    },
    {
      "name": "Grouped presentation",
      "description": "Related changes are batched into logical groups (e.g., 'Routing & Discovery', 'Content Quality') rather than presented as a flat numbered list. Each group can be approved or rejected as a unit.",
      "max_score": 8
    },
    {
      "name": "All key issues addressed",
      "description": "Recommendations cover all key issues from the review: missing Use-when clause, HMAC explanation bloat, and missing idempotency pattern.",
      "max_score": 10
    },
    {
      "name": "Priority summary present",
      "description": "Proposal includes a summary section that conveys priority levels (Critical/High/Medium) so the tech lead can quickly gauge urgency.",
      "max_score": 10
    },
    {
      "name": "Current score per recommendation",
      "description": "Each recommendation or group references the current dimension score it targets (e.g., 'Completeness: 1/3 (33%)') to ground the proposal in data.",
      "max_score": 10
    }
  ]
}
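
The checklist's `max_score` values sum to 100, so a scenario score can be read as a percentage. The page does not say how the grader combines per-item scores, so the following is only a minimal sketch under an assumed rule: each awarded value is clamped to its item's `max_score`, the results are summed, and the total is divided by the maximum possible. The names `CHECKLIST`, `total_possible`, and `score_checklist` are hypothetical and exist only for this illustration.

```python
from typing import Dict, List

# Item names and max_score values copied from the checklist above.
CHECKLIST: List[dict] = [
    {"name": "SKILL.md not modified", "max_score": 20},
    {"name": "Explicit approval request", "max_score": 15},
    {"name": "Trade-off discussion", "max_score": 15},
    {"name": "Risk assessment per recommendation", "max_score": 12},
    {"name": "Grouped presentation", "max_score": 8},
    {"name": "All key issues addressed", "max_score": 10},
    {"name": "Priority summary present", "max_score": 10},
    {"name": "Current score per recommendation", "max_score": 10},
]


def total_possible(checklist: List[dict]) -> int:
    """Sum of max_score across all checklist items."""
    return sum(item["max_score"] for item in checklist)


def score_checklist(checklist: List[dict], awarded: Dict[str, float]) -> float:
    """Assumed aggregation rule (not confirmed by the source):
    clamp each awarded value to [0, max_score], sum, and normalise to 0..1.
    Items missing from `awarded` count as 0."""
    earned = 0.0
    for item in checklist:
        points = awarded.get(item["name"], 0)
        earned += max(0, min(points, item["max_score"]))
    return earned / total_possible(checklist)


if __name__ == "__main__":
    # Example: every item earns full marks except the heaviest one,
    # i.e. the agent edited SKILL.md instead of only proposing changes.
    awarded = {item["name"]: item["max_score"] for item in CHECKLIST}
    awarded["SKILL.md not modified"] = 0
    print(f"{score_checklist(CHECKLIST, awarded):.0%}")
```

Under this assumed rule, failing only the "SKILL.md not modified" item costs 20 of the 100 available points, which reflects how heavily the scenario weights the approval gate relative to formatting items such as "Grouped presentation".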