
oh-my-ai/skill-maker

Interactive skill creation and eval-driven optimization. Triggers: create a skill, make a skill, new skill, scaffold skill, optimize skill, run evals, improve skill. Uses AskUserQuestion for interview; WebSearch for research; Bash for eval execution. Outputs: complete skill directory with SKILL.md, tile.json, evals, and repo integration.

93 · 1.26x

Quality: 94% (Does it follow best practices?)

Impact: 91% · 1.26x (Average score across 3 eval scenarios)

Security by Snyk: Passed, no known issues


rules/benchmark-loop.md

name: benchmark-loop
description: Process for structuring evals, interpreting results, and driving optimization

Benchmark Loop (skill-maker reference)

Use these patterns when generating eval scenarios and analyzing eval results.

Minimum eval matrix

For each scenario, capture:

  • Overall score (sum of weighted criteria)
  • Per-criterion score (0 to max_score)
  • Delta: with-skill score minus baseline score
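As a rough illustration of the three values above, the sketch below sums per-criterion scores into an overall score and derives the delta. The `CriterionResult` class, its field names, and the sample numbers are hypothetical; the actual eval output format may differ.

```python
from dataclasses import dataclass


@dataclass
class CriterionResult:
    """One criterion's scores for a single scenario (hypothetical shape)."""
    name: str
    max_score: int   # weight of the criterion; scores range from 0 to max_score
    baseline: int    # score without the skill
    with_skill: int  # score with the skill enabled


def scenario_row(results):
    """Sum per-criterion scores and compute with-skill minus baseline."""
    baseline = sum(r.baseline for r in results)
    with_skill = sum(r.with_skill for r in results)
    return {"baseline": baseline, "with_skill": with_skill,
            "delta": with_skill - baseline}


# Illustrative numbers only, not real eval results.
results = [
    CriterionResult("core_behavior", 60, 30, 55),
    CriterionResult("anti_pattern", 20, 20, 15),
    CriterionResult("format", 20, 5, 20),
]
row = scenario_row(results)
print(row)
```

Keeping the per-criterion rows around, rather than only the totals, makes it possible to spot a criterion that regressed even when the scenario delta is positive (as `anti_pattern` does in this made-up data).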

Required scenario coverage

Every skill should have:

  • One scenario per core capability the skill claims
  • One scenario that stresses omission-prone outputs (footers, checklists, required sections)
  • One scenario with noisy context to test retrieval under pressure

CLI-generated scenarios: When scenarios come from tessl scenario generate / tessl scenario download, treat them as a starting set. Re-read the checklist above after download; if anything is missing (e.g. noisy context or omission stress), add or extend scenarios by hand under evals/<slug>/ so the matrix still matches this section before you rely on eval results for optimization.

Criteria weighting

The max_score values in criteria.json must sum to 100. Target distribution:

  • Core behaviors: 60–70% of total weight
  • Anti-pattern avoidance: 15–20% of total weight
  • Output format compliance: 10–20% of total weight
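A weighting check along these lines could be run before trusting a criteria file. This is a minimal sketch: the `category` field and its three labels are assumptions made for illustration, and `max_score` is assumed to be the per-criterion weight; the real criteria.json schema may not carry a category at all.

```python
# Target bands from the list above, as inclusive (low, high) percentages.
TARGET_BANDS = {
    "core": (60, 70),
    "anti_pattern": (15, 20),
    "format": (10, 20),
}


def check_weights(criteria):
    """Return a list of problems; an empty list means the weights pass."""
    problems = []
    total = sum(c["max_score"] for c in criteria)
    if total != 100:
        problems.append(f"max_score values sum to {total}, expected 100")
    for category, (low, high) in TARGET_BANDS.items():
        weight = sum(c["max_score"] for c in criteria
                     if c["category"] == category)
        if not low <= weight <= high:
            problems.append(f"{category} weight {weight} outside {low}-{high}")
    return problems


# Hypothetical criteria lists; "category" is not a documented field.
balanced = [
    {"name": "asks_questions", "category": "core", "max_score": 40},
    {"name": "writes_skill_md", "category": "core", "max_score": 25},
    {"name": "no_placeholder_text", "category": "anti_pattern", "max_score": 20},
    {"name": "has_footer", "category": "format", "max_score": 15},
]
skewed = [
    {"name": "core_only", "category": "core", "max_score": 80},
    {"name": "no_placeholder_text", "category": "anti_pattern", "max_score": 10},
    {"name": "has_footer", "category": "format", "max_score": 10},
]
print(check_weights(balanced))
print(check_weights(skewed))
```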

Result interpretation

| Pattern | Signal | Action |
|---------|--------|--------|
| High baseline + tiny delta | Model already knows this | Reduce verbosity or specialize for edge cases |
| Low baseline + high delta | Skill adds strong value | Preserve and refine |
| Low baseline + low with-skill | Skill content is weak or unclear | Rewrite instructions |
| Negative delta | Skill introduces confusion | Patch immediately |
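The interpretation table above does not define numeric cutoffs for "high", "low", or "tiny". The sketch below picks arbitrary thresholds (all four defaults are assumptions to be tuned per skill) just to show how the patterns might be applied mechanically to a scenario's scores.

```python
def classify(baseline, with_skill, high_baseline=80, low_score=50,
             tiny_delta=5, high_delta=20):
    """Map a scenario's scores to (pattern, action) from the table.

    All thresholds are illustrative assumptions, not values from the source.
    """
    delta = with_skill - baseline
    if delta < 0:
        return "Negative delta", "Patch immediately"
    if baseline >= high_baseline and delta <= tiny_delta:
        return ("High baseline + tiny delta",
                "Reduce verbosity or specialize for edge cases")
    if baseline < low_score and delta >= high_delta:
        return "Low baseline + high delta", "Preserve and refine"
    if baseline < low_score and with_skill < low_score:
        return "Low baseline + low with-skill", "Rewrite instructions"
    # Scores that fit none of the four patterns need a human look.
    return "Unclassified", "Review manually"


for scores in [(90, 92), (30, 80), (20, 30), (70, 60), (60, 75)]:
    print(scores, classify(*scores))
```

The negative-delta check runs first so that a regression is never masked by another pattern, which matches its "patch immediately" action.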

Readout format

Use this table structure in benchmark-log.md:

## Run: <ISO-8601 timestamp>

**Method:** tessl-cli | llm-as-judge | **Model:** <model-name>

| Scenario | Baseline | With Skill | Delta |
|----------|----------|------------|-------|
| ...      | ...      | ...        | ...   |

**Changes applied:** <summary of edits made before this run>
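An entry in this format could be generated rather than typed by hand. The following is a sketch only: `render_run` and its parameters are hypothetical names, the sample scenario names and scores are made up, and showing the delta with an explicit sign is a choice made here, not something the template requires.

```python
def render_run(timestamp, method, model, rows, changes):
    """Build one benchmark-log.md entry.

    rows is a list of (scenario, baseline, with_skill) tuples;
    method is expected to be "tessl-cli" or "llm-as-judge".
    """
    lines = [
        f"## Run: {timestamp}",
        "",
        f"**Method:** {method} | **Model:** {model}",
        "",
        "| Scenario | Baseline | With Skill | Delta |",
        "|----------|----------|------------|-------|",
    ]
    for name, baseline, with_skill in rows:
        delta = with_skill - baseline
        lines.append(f"| {name} | {baseline} | {with_skill} | {delta:+d} |")
    lines += ["", f"**Changes applied:** {changes}"]
    return "\n".join(lines)


# Placeholder values for illustration; not real eval results.
entry = render_run(
    timestamp="2025-01-01T12:00:00+00:00",
    method="llm-as-judge",
    model="<model-name>",
    rows=[("noisy-context", 40, 75), ("omission-stress", 60, 55)],
    changes="tightened footer instructions",
)
print(entry)
```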

Optimization priorities

  1. Fix negative deltas first (regressions)
  2. Then fix 0% criteria with skill enabled (universal failures)
  3. Then improve lowest-delta scenarios
  4. Then improve lowest-scoring individual criteria
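One way to turn the priority order above into a work queue is sketched below. The scenario dictionary shape is hypothetical, and "0% criteria" is read here as any criterion scoring 0 with the skill enabled in any scenario, which is an interpretation rather than something the list spells out.

```python
def work_queue(scenarios):
    """Order work items by the four optimization priorities."""
    def delta(s):
        return s["with_skill"] - s["baseline"]

    queue = []
    # 1. Regressions first, worst delta at the front.
    regressions = sorted((s for s in scenarios if delta(s) < 0), key=delta)
    queue += [("regression", s["name"]) for s in regressions]
    # 2. Criteria that score 0 with the skill enabled.
    for s in scenarios:
        for crit, (score, _max_score) in s["criteria"].items():
            if score == 0:
                queue.append(("zero-criterion", f"{s['name']}/{crit}"))
    # 3. Remaining scenarios, lowest delta first.
    rest = sorted((s for s in scenarios if delta(s) >= 0), key=delta)
    queue += [("low-delta", s["name"]) for s in rest]
    # 4. Non-zero criteria, lowest fraction of max_score first.
    fractions = [
        (score / max_score, f"{s['name']}/{crit}")
        for s in scenarios
        for crit, (score, max_score) in s["criteria"].items()
        if score > 0
    ]
    queue += [("low-criterion", label) for _, label in sorted(fractions)]
    return queue


# Made-up data: criteria map name -> (with-skill score, max_score).
scenarios = [
    {"name": "A", "baseline": 60, "with_skill": 50,
     "criteria": {"footer": (0, 10), "core": (50, 90)}},
    {"name": "B", "baseline": 20, "with_skill": 80,
     "criteria": {"core": (60, 70), "format": (20, 30)}},
    {"name": "C", "baseline": 70, "with_skill": 75,
     "criteria": {"core": (75, 100)}},
]
queue = work_queue(scenarios)
for item in queue:
    print(item)
```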
