
paker-it/aie26-skill-judge

Evaluates SKILL.md submissions for the AI Engineer London 2026 Skills Contest across 11 dimensions (8 official Tessl rubric + 3 bonus). Use when you say 'judge my AIE26 contest skill', 'score this SKILL.md for the contest', 'review my skill submission', or 'how would this score on the leaderboard'. Accepts GitHub repo URLs, file paths, or raw pastes.

Score: 82 (1.80x)

Quality: 94% (does it follow best practices?)
Impact: 65% at 1.80x (average score across 5 eval scenarios)
Security by Snyk: Risky (do not use without reviewing)


docs/superpowers/specs/2026-04-12-aie26-skill-judge-design.md

Design: aie26-skill-judge

Overview

A dual-purpose evaluation skill for the AI Engineer London 2026 Skills Contest (skillleaderboard.alan-626.workers.dev/AIE26). Designed for Tessl judges scoring submissions, but equally useful for contestants self-checking before they submit.

Name: aie26-skill-judge

Input Handling

The skill accepts a SKILL.md in three ways:

  1. GitHub repo URL — fetches SKILL.md from the repo root
  2. Raw paste — contestant drops SKILL.md content into chat
  3. Local file path — judge points at a cloned file

The skill auto-detects which format it's receiving and normalizes before evaluation. Detection logic:

  • Starts with https://github.com or github.com → repo URL, fetch SKILL.md
  • Starts with --- (YAML frontmatter) → raw paste
  • Starts with / or ~ or contains .md file extension → local file path
  • Ambiguous → ask the user
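As a rough sketch, the detection rules above could be expressed like this (the `detect_input_kind` helper name is hypothetical; the skill itself applies these rules conversationally, not in code):

```python
def detect_input_kind(text: str) -> str:
    """Classify a submission as 'repo_url', 'raw_paste', 'file_path',
    or 'ambiguous', mirroring the detection rules above."""
    s = text.strip()
    if s.startswith(("https://github.com", "github.com")):
        return "repo_url"          # fetch SKILL.md from the repo root
    if s.startswith("---"):
        return "raw_paste"         # YAML frontmatter => pasted SKILL.md
    if s.startswith(("/", "~")) or ".md" in s:
        return "file_path"         # judge pointing at a cloned file
    return "ambiguous"             # fall back to asking the user
```

Note the check order matters: a pasted SKILL.md contains `.md`-like text, so the frontmatter check must come before the path check.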

Scoring Model

Core Rubric (Official Tessl Dimensions)

8 dimensions, each scored 1-3. Mirrors the official contest rubric for comparability.

| Category    | Dimension              | What it measures                                                   |
|-------------|------------------------|--------------------------------------------------------------------|
| Description | Specificity            | Concrete, actionable capabilities listed                           |
| Description | Trigger Terms          | Natural phrases users would actually say                           |
| Description | Completeness           | Clear "what" (purpose) and "when" (usage scenarios)                |
| Description | Distinctiveness        | Low conflict with existing skills; clear niche                     |
| Content     | Conciseness            | Token efficiency; no padding or over-explanation                   |
| Content     | Actionability          | Executable instructions, concrete examples, specific constraints   |
| Content     | Workflow Clarity       | Sequenced phases with explicit exit gates and loop-back conditions |
| Content     | Progressive Disclosure | Layered references loaded only when needed                         |

Core score: 8 dimensions x 3 max = 24 raw, normalized to 0-100.
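Taking the normalization as a straight linear scale (the doc doesn't specify rounding, so `round()` here is an assumption):

```python
def normalize_core(scores: list[int]) -> int:
    """Normalize eight 1-3 dimension scores (raw max 24) to a 0-100 scale,
    assuming simple linear scaling."""
    raw = sum(scores)
    return round(raw / 24 * 100)
```

One consequence of this reading: since every dimension scores at least 1, the effective floor is 33, not 0 (eight 1s give 8/24).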

Bonus Dimensions (Judge Depth)

3 additional dimensions, each scored 1-3. Reported separately to preserve comparability with the official rubric.

| Dimension  | What it measures                                                                  |
|------------|-----------------------------------------------------------------------------------|
| Innovation | Novel approach, creative problem framing, not a rehash of existing tools           |
| Style      | Authorial voice, tone consistency, reads like a human wrote it (not AI slop)       |
| Vibes      | "Would I install this?", solves a real itch, compelling hook, confident attitude   |

Bonus score: reported as "+X/9" alongside the core score.

Scoring Criteria Detail

Each dimension uses a 1-3 scale:

  • 1 = Weak: Missing, vague, or actively harmful to the skill's purpose
  • 2 = Adequate: Present and functional, but room for improvement
  • 3 = Strong: Exemplary, nothing meaningful to improve

Output Format

The skill produces a structured scorecard followed by per-dimension detailed feedback.

## Scorecard: <skill-name>

### Core Score: XX/100

| Dimension            | Score | Reasoning                    |
|----------------------|-------|------------------------------|
| Specificity          | X/3   | <one line>                   |
| Trigger Terms        | X/3   | <one line>                   |
| Completeness         | X/3   | <one line>                   |
| Distinctiveness      | X/3   | <one line>                   |
| Conciseness          | X/3   | <one line>                   |
| Actionability        | X/3   | <one line>                   |
| Workflow Clarity     | X/3   | <one line>                   |
| Progressive Disclosure | X/3 | <one line>                   |

### Bonus Score: +X/9

| Dimension  | Score | Reasoning                    |
|------------|-------|------------------------------|
| Innovation | X/3   | <one line>                   |
| Style      | X/3   | <one line>                   |
| Vibes      | X/3   | <one line>                   |

### Detailed Feedback

#### Specificity (X/3)
<paragraph: what's strong, what to fix, specific examples from the SKILL.md>

#### Trigger Terms (X/3)
<paragraph>

#### Completeness (X/3)
<paragraph>

#### Distinctiveness (X/3)
<paragraph>

#### Conciseness (X/3)
<paragraph>

#### Actionability (X/3)
<paragraph>

#### Workflow Clarity (X/3)
<paragraph>

#### Progressive Disclosure (X/3)
<paragraph>

#### Innovation (X/3)
<paragraph>

#### Style (X/3)
<paragraph>

#### Vibes (X/3)
<paragraph>

### Verdict
<2-3 sentence summary: is this competition-ready, what's the single
highest-leverage improvement the author should make>

Workflow

5-phase sequential evaluation:

  1. Ingest — Detect input format (URL / paste / path), fetch or read the SKILL.md content, display what was received
  2. Structural check — Validate: frontmatter exists with name and description fields, line count is under 500, SKILL.md is parseable. If structural issues exist, report them and stop (don't score a broken submission)
  3. Core evaluation — Score each of the 8 official Tessl dimensions with one-line reasoning
  4. Bonus evaluation — Score innovation, style, vibes with one-line reasoning
  5. Synthesize — Produce the full scorecard table, detailed per-dimension feedback paragraphs, and a verdict
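The control flow above reduces to a small skeleton. The scoring phases are stubbed with fixed values here, since in the real skill they are performed by the model, not code; all names are hypothetical:

```python
CORE = ["Specificity", "Trigger Terms", "Completeness", "Distinctiveness",
        "Conciseness", "Actionability", "Workflow Clarity",
        "Progressive Disclosure"]
BONUS = ["Innovation", "Style", "Vibes"]

def evaluate(content: str, structural_issues: list[str]) -> str:
    # Phase 2 gate: report and stop rather than score a broken submission.
    if structural_issues:
        return "Structural issues: " + "; ".join(structural_issues)
    core = {d: 2 for d in CORE}       # phase 3 (stub scores)
    bonus = {d: 2 for d in BONUS}     # phase 4 (stub scores)
    core_score = round(sum(core.values()) / 24 * 100)
    # Phase 5: in the real skill this is the full scorecard, not one line.
    return f"Core: {core_score}/100, Bonus: +{sum(bonus.values())}/9"
```

The important design point it captures is the early exit: a submission that fails the structural check never reaches the scoring phases.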

Voice

Authoritative but constructive. Like a senior judge giving feedback at a pitch competition:

  • Honest and specific — never vague ("this is good")
  • When something is weak, name what's wrong AND how to fix it
  • When something is strong, say so briefly and move on
  • Never cruel or dismissive — the goal is to help the skill improve
  • Use direct quotes from the SKILL.md to ground feedback in evidence

Structural Validation Rules

Before scoring, the skill checks:

  • Frontmatter block exists (starts with ---, ends with ---)
  • name field present and non-empty
  • description field present and non-empty
  • Total line count <= 500
  • Description includes "when to use" language (e.g., "Use when..." phrases or natural trigger terms)

If any check fails, the skill reports the failures with fix instructions and does not proceed to scoring.
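A minimal sketch of those checks, assuming naive line-based frontmatter parsing (a real implementation would use a YAML parser) and a simple substring test for "when to use" language, which the doc leaves open to natural trigger terms as well:

```python
import re

def validate_structure(content: str) -> list[str]:
    """Run the pre-scoring checks; returns a list of failures
    (empty list = proceed to scoring)."""
    failures = []
    m = re.match(r"^---\n(.*?)\n---", content, re.S)
    if not m:
        failures.append("no frontmatter block delimited by ---")
        return failures
    fm = m.group(1)
    for field in ("name", "description"):
        if not re.search(rf"^{field}:\s*(\S.*)$", fm, re.M):
            failures.append(f"{field} field missing or empty")
    if len(content.splitlines()) > 500:
        failures.append("more than 500 lines")
    desc = re.search(r"^description:\s*(.+)$", fm, re.M)
    if desc and "use when" not in desc.group(1).lower():
        failures.append('description lacks "when to use" language')
    return failures
```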

Edge Cases

  • No SKILL.md in repo: "I couldn't find a SKILL.md at the root of this repo. Is it in a subdirectory?"
  • Multiple skills in one repo: Score only the root SKILL.md unless the user specifies otherwise
  • Extremely short SKILL.md (< 20 lines): Flag as likely incomplete, still score what's there
  • Non-English content: Score as-is, note that the contest appears to be English-language

What This Skill Does NOT Do

  • Does not rank skills against each other (no access to the full submission pool)
  • Does not modify the SKILL.md (evaluation only, not optimization)
  • Does not run tessl skill review or any CLI commands — it's a pure conversational evaluation
  • Does not evaluate the skill's runtime behavior, only the SKILL.md document itself

Files: docs, superpowers, README.md, SKILL.md, tessl.json, tile.json