
paker-it/aie26-skill-judge

Evaluates SKILL.md submissions for the AI Engineer London 2026 Skills Contest across 11 dimensions (8 official Tessl rubric + 3 bonus). Use when you say 'judge my AIE26 contest skill', 'score this SKILL.md for the contest', 'review my skill submission', or 'how would this score on the leaderboard'. Accepts GitHub repo URLs, file paths, or raw pastes.

Score: 82 (1.80x)

Quality: 94% (does it follow best practices?)
Impact: 65% at 1.80x (average score across 5 eval scenarios)
Security by Snyk: Risky (do not use without reviewing)


docs/superpowers/specs/2026-04-12-aie26-skill-judge-design.md

Design: aie26-skill-judge

Overview

A dual-purpose evaluation skill for the AI Engineer London 2026 Skills Contest (skillleaderboard.alan-626.workers.dev/AIE26). Designed for Tessl judges scoring submissions, but equally useful for contestants self-checking before they submit.

Name: aie26-skill-judge

Input Handling

The skill accepts a SKILL.md in three ways:

  1. GitHub repo URL — fetches SKILL.md from the repo root
  2. Raw paste — contestant drops SKILL.md content into chat
  3. Local file path — judge points at a cloned file

The skill auto-detects which format it's receiving and normalizes before evaluation. Detection logic:

  • Starts with https://github.com or github.com → repo URL, fetch SKILL.md
  • Starts with --- (YAML frontmatter) → raw paste
  • Starts with / or ~ or contains .md file extension → local file path
  • Ambiguous → ask the user
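As a rough sketch, the detection rules above could be expressed like this (the `detect_input_kind` helper name is hypothetical; the skill itself applies these rules conversationally, not in code):

```python
def detect_input_kind(text: str) -> str:
    """Classify a submission as 'repo_url', 'raw_paste', 'file_path',
    or 'ambiguous', mirroring the detection rules above."""
    s = text.strip()
    if s.startswith(("https://github.com", "github.com")):
        return "repo_url"          # fetch SKILL.md from the repo root
    if s.startswith("---"):
        return "raw_paste"         # YAML frontmatter => pasted SKILL.md
    if s.startswith(("/", "~")) or ".md" in s:
        return "file_path"         # judge pointing at a cloned file
    return "ambiguous"             # fall back to asking the user
```

Note the check order matters: a pasted SKILL.md contains `.md`-like text, so the frontmatter check must come before the path check.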

Scoring Model

Core Rubric (Official Tessl Dimensions)

8 dimensions, each scored 1-3. Mirrors the official contest rubric for comparability.

| Category    | Dimension              | What it measures                                                   |
|-------------|------------------------|--------------------------------------------------------------------|
| Description | Specificity            | Concrete, actionable capabilities listed                           |
| Description | Trigger Terms          | Natural phrases users would actually say                           |
| Description | Completeness           | Clear "what" (purpose) and "when" (usage scenarios)                |
| Description | Distinctiveness        | Low conflict with existing skills; clear niche                     |
| Content     | Conciseness            | Token efficiency; no padding or over-explanation                   |
| Content     | Actionability          | Executable instructions, concrete examples, specific constraints   |
| Content     | Workflow Clarity       | Sequenced phases with explicit exit gates and loop-back conditions |
| Content     | Progressive Disclosure | Layered references loaded only when needed                         |

Core score: 8 dimensions x 3 max = 24 raw, normalized to 0-100.
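Taking the normalization as a straight linear scale (the doc doesn't specify rounding, so `round()` here is an assumption):

```python
def normalize_core(scores: list[int]) -> int:
    """Normalize eight 1-3 dimension scores (raw max 24) to a 0-100 scale,
    assuming simple linear scaling."""
    raw = sum(scores)
    return round(raw / 24 * 100)
```

One consequence of this reading: since every dimension scores at least 1, the effective floor is 33, not 0 (eight 1s give 8/24).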

Bonus Dimensions (Judge Depth)

3 additional dimensions, each scored 1-3. Reported separately to preserve comparability with the official rubric.

| Dimension  | What it measures                                                                  |
|------------|-----------------------------------------------------------------------------------|
| Innovation | Novel approach, creative problem framing, not a rehash of existing tools           |
| Style      | Authorial voice, tone consistency, reads like a human wrote it (not AI slop)       |
| Vibes      | "Would I install this?", solves a real itch, compelling hook, confident attitude   |

Bonus score: reported as "+X/9" alongside the core score.

Scoring Criteria Detail

Each dimension uses a 1-3 scale:

  • 1 = Weak: Missing, vague, or actively harmful to the skill's purpose
  • 2 = Adequate: Present and functional, but room for improvement
  • 3 = Strong: Exemplary, nothing meaningful to improve

Output Format

The skill produces a structured scorecard followed by per-dimension detailed feedback.

## Scorecard: <skill-name>

### Core Score: XX/100

| Dimension            | Score | Reasoning                    |
|----------------------|-------|------------------------------|
| Specificity          | X/3   | <one line>                   |
| Trigger Terms        | X/3   | <one line>                   |
| Completeness         | X/3   | <one line>                   |
| Distinctiveness      | X/3   | <one line>                   |
| Conciseness          | X/3   | <one line>                   |
| Actionability        | X/3   | <one line>                   |
| Workflow Clarity     | X/3   | <one line>                   |
| Progressive Disclosure | X/3 | <one line>                   |

### Bonus Score: +X/9

| Dimension  | Score | Reasoning                    |
|------------|-------|------------------------------|
| Innovation | X/3   | <one line>                   |
| Style      | X/3   | <one line>                   |
| Vibes      | X/3   | <one line>                   |

### Detailed Feedback

#### Specificity (X/3)
<paragraph: what's strong, what to fix, specific examples from the SKILL.md>

#### Trigger Terms (X/3)
<paragraph>

#### Completeness (X/3)
<paragraph>

#### Distinctiveness (X/3)
<paragraph>

#### Conciseness (X/3)
<paragraph>

#### Actionability (X/3)
<paragraph>

#### Workflow Clarity (X/3)
<paragraph>

#### Progressive Disclosure (X/3)
<paragraph>

#### Innovation (X/3)
<paragraph>

#### Style (X/3)
<paragraph>

#### Vibes (X/3)
<paragraph>

### Verdict
<2-3 sentence summary: is this competition-ready, what's the single
highest-leverage improvement the author should make>

Workflow

5-phase sequential evaluation:

  1. Ingest — Detect input format (URL / paste / path), fetch or read the SKILL.md content, display what was received
  2. Structural check — Validate: frontmatter exists with name and description fields, line count is under 500, SKILL.md is parseable. If structural issues exist, report them and stop (don't score a broken submission)
  3. Core evaluation — Score each of the 8 official Tessl dimensions with one-line reasoning
  4. Bonus evaluation — Score innovation, style, vibes with one-line reasoning
  5. Synthesize — Produce the full scorecard table, detailed per-dimension feedback paragraphs, and a verdict
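The control flow above reduces to a small skeleton. The scoring phases are stubbed with fixed values here, since in the real skill they are performed by the model, not code; all names are hypothetical:

```python
CORE = ["Specificity", "Trigger Terms", "Completeness", "Distinctiveness",
        "Conciseness", "Actionability", "Workflow Clarity",
        "Progressive Disclosure"]
BONUS = ["Innovation", "Style", "Vibes"]

def evaluate(content: str, structural_issues: list[str]) -> str:
    # Phase 2 gate: report and stop rather than score a broken submission.
    if structural_issues:
        return "Structural issues: " + "; ".join(structural_issues)
    core = {d: 2 for d in CORE}       # phase 3 (stub scores)
    bonus = {d: 2 for d in BONUS}     # phase 4 (stub scores)
    core_score = round(sum(core.values()) / 24 * 100)
    # Phase 5: in the real skill this is the full scorecard, not one line.
    return f"Core: {core_score}/100, Bonus: +{sum(bonus.values())}/9"
```

The important design point it captures is the early exit: a submission that fails the structural check never reaches the scoring phases.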

Voice

Authoritative but constructive. Like a senior judge giving feedback at a pitch competition:

  • Honest and specific — never vague ("this is good")
  • When something is weak, name what's wrong AND how to fix it
  • When something is strong, say so briefly and move on
  • Never cruel or dismissive — the goal is to help the skill improve
  • Use direct quotes from the SKILL.md to ground feedback in evidence

Structural Validation Rules

Before scoring, the skill checks:

  • Frontmatter block exists (starts with ---, ends with ---)
  • name field present and non-empty
  • description field present and non-empty
  • Total line count <= 500
  • Description includes "when to use" language (e.g., "Use when..." phrases or natural trigger terms)

If any check fails, the skill reports the failures with fix instructions and does not proceed to scoring.
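A minimal sketch of those checks, assuming naive line-based frontmatter parsing (a real implementation would use a YAML parser) and a simple substring test for "when to use" language, which the doc leaves open to natural trigger terms as well:

```python
import re

def validate_structure(content: str) -> list[str]:
    """Run the pre-scoring checks; returns a list of failures
    (empty list = proceed to scoring)."""
    failures = []
    m = re.match(r"^---\n(.*?)\n---", content, re.S)
    if not m:
        failures.append("no frontmatter block delimited by ---")
        return failures
    fm = m.group(1)
    for field in ("name", "description"):
        if not re.search(rf"^{field}:\s*(\S.*)$", fm, re.M):
            failures.append(f"{field} field missing or empty")
    if len(content.splitlines()) > 500:
        failures.append("more than 500 lines")
    desc = re.search(r"^description:\s*(.+)$", fm, re.M)
    if desc and "use when" not in desc.group(1).lower():
        failures.append('description lacks "when to use" language')
    return failures
```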

Edge Cases

  • No SKILL.md in repo: "I couldn't find a SKILL.md at the root of this repo. Is it in a subdirectory?"
  • Multiple skills in one repo: Score only the root SKILL.md unless the user specifies otherwise
  • Extremely short SKILL.md (< 20 lines): Flag as likely incomplete, still score what's there
  • Non-English content: Score as-is, note that the contest appears to be English-language

What This Skill Does NOT Do

  • Does not rank skills against each other (no access to the full submission pool)
  • Does not modify the SKILL.md (evaluation only, not optimization)
  • Does not run tessl skill review or any CLI commands — it's a pure conversational evaluation
  • Does not evaluate the skill's runtime behavior, only the SKILL.md document itself

Files: docs, superpowers, README.md, SKILL.md, tessl.json, tile.json