oh-my-ai/skill-maker

Interactive skill creation and eval-driven optimization. Triggers: create a skill, make a skill, new skill, scaffold skill, optimize skill, run evals, improve skill. Uses AskUserQuestion for interview; WebSearch for research; Bash for eval execution. Outputs: complete skill directory with SKILL.md, tile.json, evals, and repo integration.

Quality: 94% (Does it follow best practices?)

Impact: 91%, 1.26x (average score across 3 eval scenarios)

Security: Passed, no known issues

name: benchmark-loop
description: Process for structuring evals, interpreting results, and driving optimization

Benchmark Loop (skill-maker reference)

Name: oh-my-ai/skill-maker
Rating: 93.4 (1 review)
Author: oh-my-ai

Use these patterns when generating eval scenarios and analyzing eval results.

Minimum eval matrix

For each scenario, capture:

Overall score (sum of weighted criteria)
Per-criterion score (0 to max_score)
Delta: with-skill score minus baseline score
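The three quantities above can be computed per scenario roughly as follows. This is a minimal sketch, assuming each criterion's weight is expressed through its max_score so that the overall score is a plain sum; the dict shape, function names, and criterion names are hypothetical and not the tessl result format.

```python
def overall_score(per_criterion: dict[str, float]) -> float:
    """Overall score: sum of per-criterion scores (each between 0 and its max_score)."""
    return sum(per_criterion.values())


def scenario_row(baseline: dict[str, float], with_skill: dict[str, float]) -> dict[str, float]:
    """Build one eval-matrix row: both overall scores plus their delta."""
    base = overall_score(baseline)
    skilled = overall_score(with_skill)
    return {"baseline": base, "with_skill": skilled, "delta": skilled - base}


# Hypothetical criterion names and scores, for illustration only.
row = scenario_row(
    baseline={"core_behavior": 30, "format": 10},
    with_skill={"core_behavior": 55, "format": 15},
)
print(row)
```

Keeping the per-criterion dicts alongside the row makes it possible to see which criterion moved when a delta changes between runs.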

Required scenario coverage

Every skill should have:

One scenario per core capability the skill claims
One scenario that stresses omission-prone outputs (footers, checklists, required sections)
One scenario with noisy context to test retrieval under pressure

CLI-generated scenarios: Treat scenarios produced by tessl scenario generate or tessl scenario download as a starting set. After download, re-check them against the coverage list above; if anything is missing (e.g. noisy context or omission stress), add or extend scenarios by hand under evals/<slug>/ so that coverage is complete before you rely on eval results for optimization.
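The coverage re-check can be automated if each scenario is annotated by hand. The sketch below assumes a hypothetical annotation (a "capability" field and a "kinds" list per scenario); neither field is part of any tessl format, and the kind labels are invented for illustration.

```python
# Hypothetical labels for the two stress scenarios required above.
REQUIRED_KINDS = {"omission_stress", "noisy_context"}


def coverage_gaps(capabilities: set[str], scenarios: list[dict]) -> list[str]:
    """Return a list of missing coverage items; an empty list means the matrix is complete."""
    covered = {s["capability"] for s in scenarios if s.get("capability")}
    kinds = {k for s in scenarios for k in s.get("kinds", [])}
    gaps = [f"capability: {c}" for c in sorted(capabilities - covered)]
    gaps += [f"kind: {k}" for k in sorted(REQUIRED_KINDS - kinds)]
    return gaps


# Illustrative input: one scenario covering one of two claimed capabilities.
gaps = coverage_gaps(
    capabilities={"scaffold", "optimize"},
    scenarios=[{"capability": "scaffold", "kinds": ["omission_stress"]}],
)
print(gaps)
```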

Criteria weighting

Scores in criteria.json must sum to 100. Target distribution:

Core behaviors: 60–70% of total weight
Anti-pattern avoidance: 15–20% of total weight
Output format compliance: 10–20% of total weight
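A quick check of these weighting rules might look like the sketch below. The layout of criteria.json is not shown in this reference, so the code assumes a hypothetical list of entries with "category" and "max_score" fields; the category labels are also invented for illustration.

```python
# Target ranges from the list above, keyed by hypothetical category labels.
TARGET_RANGES = {
    "core": (60, 70),
    "anti_pattern": (15, 20),
    "format": (10, 20),
}


def check_weights(criteria: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the weights pass."""
    problems = []
    total = sum(c["max_score"] for c in criteria)
    if total != 100:
        problems.append(f"max_score values sum to {total}, expected 100")
    # With a total of 100, a category's summed max_score equals its percentage.
    for category, (low, high) in TARGET_RANGES.items():
        weight = sum(c["max_score"] for c in criteria if c["category"] == category)
        if not low <= weight <= high:
            problems.append(f"{category} weight {weight} is outside {low}-{high}")
    return problems


# Illustrative criteria: core = 65, anti_pattern = 20, format = 15.
example = [
    {"category": "core", "max_score": 40},
    {"category": "core", "max_score": 25},
    {"category": "anti_pattern", "max_score": 20},
    {"category": "format", "max_score": 15},
]
print(check_weights(example))
```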

Result interpretation

Pattern	Signal	Action
High baseline + tiny delta	Model already knows this	Reduce verbosity or specialize on edge cases
Low baseline + high delta	Skill adds strong value	Preserve and refine
Low baseline + low with-skill	Skill content is weak or unclear	Rewrite instructions
Negative delta	Skill introduces confusion	Patch immediately
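The table can be turned into a small classifier, as sketched below. The table gives no numeric cut-offs, so the thresholds for "high", "low", "tiny", and "high delta" are assumptions to tune per skill, and the order of the checks is one possible reading when a result fits two rows.

```python
def interpret(
    baseline: float,
    with_skill: float,
    high: float = 80.0,        # assumed cut-off for a "high" baseline
    low: float = 50.0,         # assumed cut-off for a "low" score
    tiny_delta: float = 5.0,   # assumed cut-off for a "tiny" delta
    high_delta: float = 20.0,  # assumed cut-off for a "high" delta
) -> str:
    """Map a (baseline, with-skill) pair to the action from the table above."""
    delta = with_skill - baseline
    if delta < 0:
        return "Patch immediately"
    if baseline >= high and delta <= tiny_delta:
        return "Reduce verbosity or specialize on edge cases"
    if baseline <= low and delta >= high_delta:
        return "Preserve and refine"
    if baseline <= low and with_skill <= low:
        return "Rewrite instructions"
    return "No pattern matched; review manually"


print(interpret(30, 75))
```

Results that fall between the rows are returned as "review manually" on purpose, since the table does not cover every combination.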

Readout format

Use this table structure in benchmark-log.md:

## Run: <ISO-8601 timestamp>

**Method:** tessl-cli | llm-as-judge | **Model:** <model-name>

| Scenario | Baseline | With Skill | Delta |
|----------|----------|------------|-------|
| ...      | ...      | ...        | ...   |

**Changes applied:** <summary of edits made before this run>
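A run entry in this format can be generated rather than typed by hand. The following is a minimal sketch; the function name, the tuple shape for rows, and the example values are hypothetical, and it only builds the text, leaving the append to benchmark-log.md to the caller.

```python
from datetime import datetime, timezone


def render_run(rows, method, model, changes, timestamp=None) -> str:
    """Render one run entry; rows are (scenario, baseline, with_skill) tuples."""
    ts = timestamp or datetime.now(timezone.utc).isoformat(timespec="seconds")
    lines = [
        f"## Run: {ts}",
        "",
        f"**Method:** {method} | **Model:** {model}",
        "",
        "| Scenario | Baseline | With Skill | Delta |",
        "|----------|----------|------------|-------|",
    ]
    for name, baseline, with_skill in rows:
        # The "+" format flag keeps the sign visible for positive and negative deltas.
        lines.append(f"| {name} | {baseline} | {with_skill} | {with_skill - baseline:+} |")
    lines += ["", f"**Changes applied:** {changes}"]
    return "\n".join(lines)


# Illustrative values only; not real eval results.
entry = render_run(
    rows=[("scaffold-skill", 40, 70)],
    method="llm-as-judge",
    model="example-model",
    changes="tightened footer instructions",
    timestamp="2025-01-01T00:00:00+00:00",
)
print(entry)
```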