
# aie26-skill-judge Implementation Plan

**For agentic workers:** REQUIRED SUB-SKILL: use `superpowers:subagent-driven-development` (recommended) or `superpowers:executing-plans` to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Build a Tessl-publishable evaluation skill that lets judges and contestants score SKILL.md submissions for the AI Engineer London 2026 Skills Contest across 11 dimensions.

**Architecture:** Single SKILL.md with two reference files for progressive disclosure (detailed rubric + worked example). Packaged as a Tessl tile with CI lint/review on push.

**Tech Stack:** Tessl skill format (SKILL.md + references/), GitHub Actions, tessl CLI for lint/review/publish.


## File Structure

```
SKILL.md                                    # Main skill — evaluation workflow + output format
references/
  scoring-rubric.md                         # Detailed per-dimension criteria (loaded at Phases 3-4)
  example-evaluation.md                     # Worked example: devcon-hack-coach scored (loaded on request)
tile.json                                   # Tessl tile metadata
tessl.json                                  # Tessl workspace/dependency config
.gitignore                                  # Standard ignores
.github/workflows/tessl-publish.yml         # CI: lint + review on push to main
```

## Task 1: Project Scaffolding

Files:

- Create: `.gitignore`
- Create: `tile.json`
- Create: `tessl.json`

- [ ] Step 1: Create `.gitignore`:

```
.claude/
.DS_Store
node_modules/
```

- [ ] Step 2: Create `tile.json`:
```json
{
  "name": "paker-it/aie26-skill-judge",
  "version": "0.1.0",
  "private": false,
  "summary": "Evaluates SKILL.md submissions for the AI Engineer London 2026 Skills Contest across 11 dimensions (8 official Tessl rubric + 3 bonus). Use when you say 'judge my AIE26 skill', 'score this SKILL.md for the contest', 'review my skill submission', or 'how would this score on the leaderboard'. Accepts GitHub repo URLs, file paths, or raw pastes.",
  "skills": {
    "aie26-skill-judge": {
      "path": "SKILL.md"
    }
  }
}
```
- [ ] Step 3: Create `tessl.json`:
```json
{
  "name": "aie26-skill-judge",
  "dependencies": {
    "paker-it/aie26-skill-judge": {
      "version": "0.1.0"
    }
  }
}
```
- [ ] Step 4: Commit:

```sh
git add .gitignore tile.json tessl.json
git commit -m "feat: add project scaffolding (tile.json, tessl.json, .gitignore)"
```
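Before the commit, it's cheap to confirm the JSON files actually parse and carry the fields the tile needs. A minimal sketch — the `summary` is abbreviated here, and the required-field list is inferred from the example above, not from a published Tessl schema:

```python
import json

# Abbreviated stand-in for the tile.json written in Step 2 (summary trimmed).
tile = json.loads("""
{
  "name": "paker-it/aie26-skill-judge",
  "version": "0.1.0",
  "private": false,
  "summary": "Evaluates SKILL.md submissions for the AIE26 contest.",
  "skills": { "aie26-skill-judge": { "path": "SKILL.md" } }
}
""")

# Sanity checks before `git commit`.
assert tile["name"].count("/") == 1, "expected an <owner>/<tile> name"
assert tile["skills"]["aie26-skill-judge"]["path"] == "SKILL.md"
print("tile.json parses and has the expected fields")
```

The same `json.loads` check applies to `tessl.json` from Step 3.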

## Task 2: Scoring Rubric Reference

Files:

- Create: `references/scoring-rubric.md`

This file is loaded by the SKILL.md at evaluation time (progressive disclosure). It contains the detailed per-dimension criteria that would bloat the main file.

- [ ] Step 1: Create the references directory:

```sh
mkdir -p references
```

- [ ] Step 2: Write `references/scoring-rubric.md`:
# AIE26 Scoring Rubric — Detailed Criteria

## Core Dimensions (Official Tessl Rubric)

Each scored 1 (Weak), 2 (Adequate), 3 (Strong).

---

### Specificity

Does the description name concrete, actionable capabilities?

| Score | Criteria |
|-------|----------|
| 1     | Vague or abstract ("helps with development", "AI-powered assistant"). No specific actions named. |
| 2     | Names some capabilities but mixes concrete with vague. Unclear what the skill actually does vs. what it aspires to. |
| 3     | Every capability is a concrete verb + object ("generates Kubernetes RBAC policies", "scores SKILL.md submissions on 11 dimensions"). Reader knows exactly what they get. |

**Red flags for 1:** "helps", "assists", "enhances", "leverages" without a direct object.

---

### Trigger Terms

Does the description include natural phrases a user would actually say?

| Score | Criteria |
|-------|----------|
| 1     | No trigger phrases, or phrases no human would say ("utilize the skill to evaluate"). |
| 2     | 1-2 trigger phrases present but generic ("help me with X") or forced-sounding. |
| 3     | 3-6 natural trigger phrases that read like something a real person would type. Covers both direct requests and situational triggers. |

**Strong examples:** "judge my AIE26 skill", "score this for the contest", "will this win?"
**Weak examples:** "invoke skill evaluation mode", "perform assessment"

---

### Completeness

Does the description cover both what (purpose) and when (usage scenarios)?

| Score | Criteria |
|-------|----------|
| 1     | Missing either what or when entirely. Reader can't tell what it does OR when to use it. |
| 2     | Has what but weak/missing when, or vice versa. Purpose is clear but activation context is vague. |
| 3     | Crystal clear on both. Reader knows the skill's purpose AND can identify the exact moment they'd reach for it. |

---

### Distinctiveness

Is this skill clearly different from existing skills? Low conflict risk?

| Score | Criteria |
|-------|----------|
| 1     | Overlaps heavily with common built-in skills or well-known existing skills. Would confuse the skill router. |
| 2     | Somewhat distinct but shares surface area with adjacent skills. Trigger terms could collide. |
| 3     | Clear niche. No realistic conflict with existing skills. Name + description + triggers carve out unique territory. |

---

### Conciseness

Is the content token-efficient? No padding or over-explanation?

| Score | Criteria |
|-------|----------|
| 1     | Bloated with filler, redundant sections, or verbose explanations of simple concepts. Could be half the length. |
| 2     | Some unnecessary prose but core content is present. Could trim 20-30% without losing substance. |
| 3     | Every line earns its place. No padding, no redundancy, no over-explaining. Uses tables and lists over paragraphs where appropriate. |

**Red flags for 1:** Repeating the same instruction in different words, explaining what markdown formatting is, long preambles before the actual instructions.

---

### Actionability

Does the content contain executable instructions with concrete examples?

| Score | Criteria |
|-------|----------|
| 1     | Abstract methodology or theory. No examples, no constraints, no concrete steps. |
| 2     | Some concrete instructions but mixed with vague guidance ("handle edge cases appropriately"). |
| 3     | Every instruction is specific enough to execute without interpretation. Includes examples, constraints, expected outputs. |

---

### Workflow Clarity

Are instructions sequenced into clear phases with exit gates?

| Score | Criteria |
|-------|----------|
| 1     | Unstructured wall of instructions. No clear order or phases. |
| 2     | Has phases/sections but missing exit gates, or unclear when to move between phases. |
| 3     | Numbered/named phases with explicit entry conditions, exit gates, and loop-back conditions. Reader always knows where they are in the workflow. |

---

### Progressive Disclosure

Does the skill load information only when needed?

| Score | Criteria |
|-------|----------|
| 1     | Everything in one file. No reference files. Or: references exist but are loaded eagerly. |
| 2     | Some references exist but loading isn't well-timed, or reference structure is unclear. |
| 3     | Reference files loaded only at the phase that needs them. Main SKILL.md is lean. References are clearly named and scoped. |

**Note:** A simple skill that genuinely doesn't need references can still score 3 — progressive disclosure means "don't front-load what isn't needed yet", not "must have reference files."

---

## Bonus Dimensions

### Innovation

| Score | Criteria |
|-------|----------|
| 1     | Commodity wrapper around a well-known tool or API. No novel framing. "ChatGPT but for X." |
| 2     | Applies existing techniques to a specific domain in a useful way. Competent but not surprising. |
| 3     | Genuinely novel approach, creative problem framing, or addresses a gap nobody else has filled. Makes you think "why didn't this exist already?" |

---

### Style

| Score | Criteria |
|-------|----------|
| 1     | Reads like generic AI-generated text. No authorial voice. Corporate-bland or template-obvious. |
| 2     | Has some personality but inconsistent. Mixes voices or defaults to generic in places. |
| 3     | Consistent, confident authorial voice throughout. Reads like a person with opinions wrote it. Tone matches the skill's purpose. |

**Strong signal:** The voice section (if present) has specific examples, not just adjectives.
**Weak signal:** "Be helpful and professional" — says nothing.

---

### Vibes

The gut-check composite. Score based on these three sub-questions:

1. **Would I install this?** — Does it solve a real problem I've had, or is it a solution looking for a problem?
2. **Does the voice feel human?** — Confident and opinionated, or hedging and generic?
3. **Is the hook compelling?** — After reading the name + description + first 10 lines, do I want to keep reading?

| Score | Criteria |
|-------|----------|
| 1     | Fails all three. Academic exercise or toy demo. No pull. |
| 2     | Passes 1-2. Useful but not exciting, or exciting but not practical. |
| 3     | Passes all three. You'd install it, recommend it, and remember the name. |
- [ ] Step 3: Commit:

```sh
git add references/scoring-rubric.md
git commit -m "feat: add detailed scoring rubric reference (11 dimensions)"
```
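The dimensions and scale defined above can also be captured as data, which is handy if a batch-judging harness is ever built on top of this rubric. A sketch under that assumption — the structure and names below mirror the rubric for illustration and are not part of any Tessl format:

```python
# The 11 rubric dimensions as data (assumption: mirrors the rubric above,
# not a published Tessl schema).
CORE = [
    "Specificity", "Trigger Terms", "Completeness", "Distinctiveness",
    "Conciseness", "Actionability", "Workflow Clarity", "Progressive Disclosure",
]
BONUS = ["Innovation", "Style", "Vibes"]
SCALE = {1: "Weak", 2: "Adequate", 3: "Strong"}

def core_score(scores: dict) -> int:
    """Core score formula used in Phase 3 of the skill: round((sum / 24) * 100)."""
    assert set(scores) == set(CORE), "score every core dimension exactly once"
    return round(sum(scores.values()) / (3 * len(CORE)) * 100)

# A perfect run scores 100; an all-adequate run scores 67.
assert core_score({d: 3 for d in CORE}) == 100
assert core_score({d: 2 for d in CORE}) == 67
```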

## Task 3: Example Evaluation Reference

Files:

- Create: `references/example-evaluation.md`

A worked example showing what a completed evaluation looks like. Uses the devcon-hack-coach skill (a known 100/100 scorer) as the subject. This helps judges calibrate and shows contestants what "good" looks like.

- [ ] Step 1: Write `references/example-evaluation.md`:
# Example Evaluation: devcon-hack-coach

This is a worked example of a completed evaluation. Use it to calibrate your expectations for the output format and scoring depth.

---

## Scorecard: devcon-hack-coach

### Core Score: 100/100

| Dimension              | Score | Reasoning                                                                                      |
|------------------------|-------|------------------------------------------------------------------------------------------------|
| Specificity            | 3/3   | Names exact deliverables: one-page spec, 4-checkpoint plan, 3-sentence pitch                   |
| Trigger Terms          | 3/3   | Six natural phrases including "coach me through a DevCon hack", "scope my 24h hack"            |
| Completeness           | 3/3   | Purpose (hackathon coaching) and activation context (DevCon 2026 prep) both crystal clear       |
| Distinctiveness        | 3/3   | Scoped to a single event + format; zero conflict risk with general coaching skills              |
| Conciseness            | 3/3   | Every section earns its place. Anti-patterns list is tight. No filler.                          |
| Actionability          | 3/3   | Exact questions to ask, exact exit gates, example dialogue for each phase                       |
| Workflow Clarity       | 3/3   | 4 named phases with explicit exit gates and loop-back conditions ("loop inside the phase")     |
| Progressive Disclosure | 3/3   | 5 reference files loaded only when their phase starts; main file stays lean                     |

### Bonus Score: +8/9

| Dimension  | Score | Reasoning                                                                        |
|------------|-------|----------------------------------------------------------------------------------|
| Innovation | 3/3   | Opinionated "no code before spec" stance is a novel coaching angle for hackathons |
| Style      | 3/3   | Pushy coach voice with specific example dialogue — distinctive and consistent     |
| Vibes      | 2/3   | Compelling hook and real utility, but narrow event window limits install appeal    |

### Verdict

Competition-ready. The 4-phase workflow with strict exit gates is the strongest element — it turns vague coaching into a repeatable process. The only soft spot is vibes: the skill is tied to a single event (DevCon June 2026), which limits its shelf life. A contestant might generalize the concept to "hackathon coach" for broader appeal, but for this contest the event specificity is actually a strength for distinctiveness.
- [ ] Step 2: Commit:

```sh
git add references/example-evaluation.md
git commit -m "feat: add worked example evaluation (devcon-hack-coach)"
```
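The scorecard above can be cross-checked mechanically. A quick sketch that re-derives both totals from the per-dimension scores, using the formulas from this plan (round(sum / 24 * 100) for core, a straight sum for bonus):

```python
# Per-dimension scores from the devcon-hack-coach example above.
core = {
    "Specificity": 3, "Trigger Terms": 3, "Completeness": 3,
    "Distinctiveness": 3, "Conciseness": 3, "Actionability": 3,
    "Workflow Clarity": 3, "Progressive Disclosure": 3,
}
bonus = {"Innovation": 3, "Style": 3, "Vibes": 2}

core_total = round(sum(core.values()) / 24 * 100)
bonus_total = sum(bonus.values())

assert core_total == 100  # matches "Core Score: 100/100"
assert bonus_total == 8   # matches "Bonus Score: +8/9"
```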

## Task 4: Write the SKILL.md

Files:

- Create: `SKILL.md`

This is the core deliverable. Must stay under 500 lines. Uses progressive disclosure to keep the main file lean — detailed rubric criteria live in `references/scoring-rubric.md`.

- [ ] Step 1: Write SKILL.md:
---
name: aie26-skill-judge
description: Evaluates SKILL.md submissions for the AI Engineer London 2026 Skills Contest across 11 dimensions (8 official Tessl rubric + 3 bonus). Use when you say "judge my AIE26 skill", "score this SKILL.md for the contest", "review my skill submission", "how would this score on the leaderboard", "rate my skill before I submit", or "give me judge feedback on this skill". Accepts GitHub repo URLs, file paths, or raw SKILL.md pastes.
---

# AIE26 Skill Judge

You evaluate SKILL.md files submitted to the **AI Engineer London 2026 Skills Contest** (`skillleaderboard.alan-626.workers.dev/AIE26`). You score on the official Tessl rubric plus three bonus dimensions, and provide actionable feedback.

You serve two audiences:
- **Tessl judges** systematically scoring a batch of submissions
- **Contestants** self-checking before they submit

## Do NOT use this when

- The user wants to *write* or *optimize* a skill — you evaluate, you don't edit
- The user asks about the contest rules, schedule, or prizes — direct them to the leaderboard page
- The SKILL.md is for a different contest or registry — this rubric is AIE26-specific

## Voice

Authoritative but constructive. Like a senior judge at a pitch competition.

- Specific and evidence-based — quote the SKILL.md directly
- When something is weak, name what's wrong AND how to fix it
- When something is strong, acknowledge it briefly and move on
- Never vague ("this is good"), never cruel ("this is terrible")

## Input Detection

Accept the skill in any of these formats:

1. **GitHub URL** (starts with `https://github.com` or `github.com`) → fetch the SKILL.md from the repo root using `gh` CLI or web fetch
2. **Raw paste** (starts with `---` frontmatter) → use directly
3. **Local file path** (starts with `/` or `~`, or ends in `.md`) → read the file
4. **Ambiguous** → ask: "Is that a file path, a repo URL, or the skill content itself?"

Once you have the SKILL.md content, confirm what you received: *"Got it — evaluating `<skill-name>` (<line count> lines). Running the gauntlet."*

## The 5-Phase Evaluation

Run these phases in order. Do not skip any.

---

### Phase 1 — Ingest

Read the SKILL.md. Extract:
- `name` from frontmatter
- `description` from frontmatter
- Total line count
- Whether reference files are mentioned

Display: *"Evaluating `<name>` — <line_count> lines, <N> reference files mentioned."*

---

### Phase 2 — Structural Check

Validate before scoring. Check all of these:

- [ ] Frontmatter block exists (starts with `---`, ends with `---`)
- [ ] `name` field present and non-empty
- [ ] `description` field present and non-empty
- [ ] Total line count <= 500
- [ ] Description includes trigger language ("Use when...", natural phrases, or usage scenarios)

**If any check fails:** Report each failure with a specific fix instruction. Do NOT proceed to scoring.

Example failure output:
> **Structural Issues (2 found):**
> 1. Missing `description` field in frontmatter — add a description with concrete capabilities and 3-6 trigger phrases
> 2. Line count is 523 (max 500) — cut 23 lines; check for redundant sections or over-explained concepts
>
> *Fix these before resubmitting for evaluation.*

**If all pass:** *"Structure checks passed. Scoring..."*

---

### Phase 3 — Core Evaluation

Load `references/scoring-rubric.md` for detailed per-dimension criteria.

Score the 8 official Tessl dimensions. For each dimension:
1. Read the detailed criteria from the rubric reference
2. Find specific evidence in the SKILL.md (quote it)
3. Assign 1 (Weak), 2 (Adequate), or 3 (Strong)
4. Write a one-line reasoning that references the evidence

**Core score formula:** `round((sum of 8 scores / 24) * 100)`

---

### Phase 4 — Bonus Evaluation

Using the bonus criteria from `references/scoring-rubric.md`, score the 3 bonus dimensions:

- **Innovation:** Is this a novel approach or a commodity wrapper?
- **Style:** Does it have a consistent, human authorial voice?
- **Vibes:** Would you install this? Does it solve a real itch? Is the hook compelling?

Report as `+X/9`.

---

### Phase 5 — Synthesize

Produce the full output in this exact format:

## Scorecard: <skill-name>

### Core Score: XX/100

| Dimension              | Score | Reasoning                  |
|------------------------|-------|----------------------------|
| Specificity            | X/3   | <one line with evidence>   |
| Trigger Terms          | X/3   | <one line with evidence>   |
| Completeness           | X/3   | <one line with evidence>   |
| Distinctiveness        | X/3   | <one line with evidence>   |
| Conciseness            | X/3   | <one line with evidence>   |
| Actionability          | X/3   | <one line with evidence>   |
| Workflow Clarity       | X/3   | <one line with evidence>   |
| Progressive Disclosure | X/3   | <one line with evidence>   |

### Bonus Score: +X/9

| Dimension  | Score | Reasoning                |
|------------|-------|--------------------------|
| Innovation | X/3   | <one line with evidence> |
| Style      | X/3   | <one line with evidence> |
| Vibes      | X/3   | <one line with evidence> |

### Detailed Feedback

#### <Dimension Name> (X/3)

<paragraph: what's strong, what's weak, direct quotes from the SKILL.md, specific fix suggestions if score < 3>

Repeat the detailed feedback section for all 11 dimensions. End with:

### Verdict

<2-3 sentences: is this competition-ready? What is the single highest-leverage improvement? Be specific.>

---

## Calibration

For reference, a worked example evaluation is available in `references/example-evaluation.md`. Load it if a user asks "what does a good score look like?" or "show me an example evaluation."

## Edge Cases

- **No SKILL.md found in repo:** *"I couldn't find a SKILL.md at the root of this repo. Is it in a subdirectory?"*
- **Multiple skills in repo:** Score only the root SKILL.md unless the user says otherwise
- **Very short (< 20 lines):** Flag as likely incomplete, still score what exists
- **Non-English:** Score as-is, note the contest appears to be English-language
- **User asks to evaluate multiple skills:** Evaluate them one at a time, each getting a full scorecard

## What This Skill Does NOT Do

- Rank skills against each other — no access to the full submission pool
- Modify or optimize the SKILL.md — evaluation only
- Run `tessl skill review` or any CLI tools — pure conversational evaluation
- Evaluate runtime behavior — only the SKILL.md document itself
- [ ] Step 2: Verify the line count is under 500:

```sh
wc -l SKILL.md
```

Expected: under 500 lines.

- [ ] Step 3: Commit:

```sh
git add SKILL.md
git commit -m "feat: add aie26-skill-judge SKILL.md (evaluation workflow + 11 dimensions)"
```
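Phase 2's structural checklist is mechanical enough to sketch as plain code. The skill performs these checks conversationally; this hypothetical validator is just an executable reading of the checklist (the trigger-language check is judgment-based and omitted here):

```python
# Hypothetical validator mirroring Phase 2's checklist. Nothing here ships
# in the deliverable; it exists only to make the checklist precise.
def structural_issues(skill_md: str) -> list:
    issues = []
    lines = skill_md.splitlines()
    stripped = [l.strip() for l in lines]
    # Frontmatter block: opens with --- on line 1 and closes with a second ---.
    if not (stripped and stripped[0] == "---" and "---" in stripped[1:]):
        issues.append("missing frontmatter block")
    else:
        end = stripped[1:].index("---") + 1
        frontmatter = stripped[1:end]
        for field in ("name", "description"):
            # Field must be present and non-empty, per the checklist.
            if not any(l.startswith(field + ":") and l[len(field) + 1:].strip()
                       for l in frontmatter):
                issues.append(f"missing or empty `{field}` field in frontmatter")
    # Line-count gate from the checklist (same number `wc -l` reports in Step 2).
    if len(lines) > 500:
        issues.append(f"line count is {len(lines)} (max 500)")
    return issues

sample = "---\nname: demo\ndescription: Scores things. Use when you say 'score it'.\n---\n# Demo\n"
assert structural_issues(sample) == []
assert structural_issues("# no frontmatter") == ["missing frontmatter block"]
```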

## Task 5: CI Workflow

Files:

- Create: `.github/workflows/tessl-publish.yml`

Mirrors the pattern from devcon-hack-coach. Runs lint + review on push to main.

- [ ] Step 1: Create the workflow directory:

```sh
mkdir -p .github/workflows
```

- [ ] Step 2: Write `.github/workflows/tessl-publish.yml`:
```yaml
name: Lint

on:
  push:
    branches: [main]
    paths:
      - 'SKILL.md'
      - 'tile.json'
      - 'references/**'
      - 'evals/**'

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4.3.0

      - uses: tesslio/setup-tessl@v2
        with:
          token: ${{ secrets.TESSL_TOKEN }}

      - name: Lint
        run: tessl skill lint

      - name: Review
        run: tessl skill review
```
- [ ] Step 3: Commit:

```sh
git add .github/workflows/tessl-publish.yml
git commit -m "ci: add tessl lint + review workflow on push to main"
```

## Task 6: Local Lint and Review

Files: None (validation only)

Run the Tessl CLI locally to verify the skill passes lint and review before pushing.

- [ ] Step 1: Run `tessl skill lint`:

```sh
tessl skill lint
```

Expected: passes with no errors. May warn about "orphaned files" for references — that's fine.

- [ ] Step 2: Run `tessl skill review`:

```sh
tessl skill review
```

Expected: scores across Description and Content dimensions. Target: 80+ overall. If below, read the feedback and fix the SKILL.md in a follow-up commit.

- [ ] Step 3: Fix any issues found by lint or review.

If lint/review surfaces problems, edit the relevant file, re-run the check, and commit the fix:

```sh
git add <fixed-files>
git commit -m "fix: address tessl lint/review feedback"
```

## Task 7: Push and Submit

Files: None (git operations only)

- [ ] Step 1: Check GitHub auth:

```sh
gh auth status
```

Verify you're authenticated as mertpaker (personal account).

- [ ] Step 2: Create the GitHub repo:

```sh
gh repo create mertpaker/aie26-skill-judge --public --source=. --push
```

- [ ] Step 3: Verify the repo is live:

```sh
gh repo view mertpaker/aie26-skill-judge --web
```

- [ ] Step 4: Submit to the leaderboard.

Go to https://skillleaderboard.alan-626.workers.dev/AIE26 and submit:

- GitHub repo URL: https://github.com/mertpaker/aie26-skill-judge
- Name: Mert Paker
- Email: (Mert's contest email)
