Evaluates SKILL.md submissions for the AI Engineer London 2026 Skills Contest across 11 dimensions (8 official Tessl rubric + 3 bonus). Use when you say 'judge my AIE26 contest skill', 'score this SKILL.md for the contest', 'review my skill submission', or 'how would this score on the leaderboard'. Accepts GitHub repo URLs, file paths, or raw pastes.
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
Goal: Build a Tessl-publishable evaluation skill that lets judges and contestants score SKILL.md submissions for the AI Engineer London 2026 Skills Contest across 11 dimensions.
Architecture: Single SKILL.md with two reference files for progressive disclosure (detailed rubric + worked example). Packaged as a Tessl tile with CI lint/review on push.
Tech Stack: Tessl skill format (SKILL.md + references/), GitHub Actions, tessl CLI for lint/review/publish.
```
SKILL.md                              # Main skill — evaluation workflow + output format
references/
  scoring-rubric.md                   # Detailed per-dimension criteria (loaded at Phase 3-4)
  example-evaluation.md               # Worked example: devcon-hack-coach scored (loaded on request)
tile.json                             # Tessl tile metadata
tessl.json                            # Tessl workspace/dependency config
.gitignore                            # Standard ignores
.github/workflows/tessl-publish.yml   # CI: lint + review on push to main
```

Files:
Create: .gitignore
Create: tile.json
Create: tessl.json
Step 1: Create .gitignore

```
.claude/
.DS_Store
node_modules/
```

Step 2: Create tile.json

```json
{
  "name": "paker-it/aie26-skill-judge",
  "version": "0.1.0",
  "private": false,
  "summary": "Evaluates SKILL.md submissions for the AI Engineer London 2026 Skills Contest across 11 dimensions (8 official Tessl rubric + 3 bonus). Use when you say 'judge my AIE26 skill', 'score this SKILL.md for the contest', 'review my skill submission', or 'how would this score on the leaderboard'. Accepts GitHub repo URLs, file paths, or raw pastes.",
  "skills": {
    "aie26-skill-judge": {
      "path": "SKILL.md"
    }
  }
}
```

Step 3: Create tessl.json

```json
{
  "name": "aie26-skill-judge",
  "dependencies": {
    "paker-it/aie26-skill-judge": {
      "version": "0.1.0"
    }
  }
}
```

```bash
git add .gitignore tile.json tessl.json
git commit -m "feat: add project scaffolding (tile.json, tessl.json, .gitignore)"
```

Files:
references/scoring-rubric.md

This file is loaded by the SKILL.md at evaluation time (progressive disclosure). It contains the detailed per-dimension criteria that would bloat the main file.

```bash
mkdir -p references
```

# AIE26 Scoring Rubric — Detailed Criteria
## Core Dimensions (Official Tessl Rubric)
Each scored 1 (Weak), 2 (Adequate), 3 (Strong).
---
### Specificity
Does the description name concrete, actionable capabilities?
| Score | Criteria |
|-------|----------|
| 1 | Vague or abstract ("helps with development", "AI-powered assistant"). No specific actions named. |
| 2 | Names some capabilities but mixes concrete with vague. Unclear what the skill actually does vs. what it aspires to. |
| 3 | Every capability is a concrete verb + object ("generates Kubernetes RBAC policies", "scores SKILL.md submissions on 11 dimensions"). Reader knows exactly what they get. |
**Red flags for 1:** "helps", "assists", "enhances", "leverages" without a direct object.
---
### Trigger Terms
Does the description include natural phrases a user would actually say?
| Score | Criteria |
|-------|----------|
| 1 | No trigger phrases, or phrases no human would say ("utilize the skill to evaluate"). |
| 2 | 1-2 trigger phrases present but generic ("help me with X") or forced-sounding. |
| 3 | 3-6 natural trigger phrases that read like something a real person would type. Covers both direct requests and situational triggers. |
**Strong examples:** "judge my AIE26 skill", "score this for the contest", "will this win?"
**Weak examples:** "invoke skill evaluation mode", "perform assessment"
---
### Completeness
Does the description cover both what (purpose) and when (usage scenarios)?
| Score | Criteria |
|-------|----------|
| 1 | Missing either what or when entirely. Reader can't tell what it does OR when to use it. |
| 2 | Has what but weak/missing when, or vice versa. Purpose is clear but activation context is vague. |
| 3 | Crystal clear on both. Reader knows the skill's purpose AND can identify the exact moment they'd reach for it. |
---
### Distinctiveness
Is this skill clearly different from existing skills? Low conflict risk?
| Score | Criteria |
|-------|----------|
| 1 | Overlaps heavily with common built-in skills or well-known existing skills. Would confuse the skill router. |
| 2 | Somewhat distinct but shares surface area with adjacent skills. Trigger terms could collide. |
| 3 | Clear niche. No realistic conflict with existing skills. Name + description + triggers carve out unique territory. |
---
### Conciseness
Is the content token-efficient? No padding or over-explanation?
| Score | Criteria |
|-------|----------|
| 1 | Bloated with filler, redundant sections, or verbose explanations of simple concepts. Could be half the length. |
| 2 | Some unnecessary prose but core content is present. Could trim 20-30% without losing substance. |
| 3 | Every line earns its place. No padding, no redundancy, no over-explaining. Uses tables and lists over paragraphs where appropriate. |
**Red flags for 1:** Repeating the same instruction in different words, explaining what markdown formatting is, long preambles before the actual instructions.
---
### Actionability
Does the content contain executable instructions with concrete examples?
| Score | Criteria |
|-------|----------|
| 1 | Abstract methodology or theory. No examples, no constraints, no concrete steps. |
| 2 | Some concrete instructions but mixed with vague guidance ("handle edge cases appropriately"). |
| 3 | Every instruction is specific enough to execute without interpretation. Includes examples, constraints, expected outputs. |
---
### Workflow Clarity
Are instructions sequenced into clear phases with exit gates?
| Score | Criteria |
|-------|----------|
| 1 | Unstructured wall of instructions. No clear order or phases. |
| 2 | Has phases/sections but missing exit gates, or unclear when to move between phases. |
| 3 | Numbered/named phases with explicit entry conditions, exit gates, and loop-back conditions. Reader always knows where they are in the workflow. |
---
### Progressive Disclosure
Does the skill load information only when needed?
| Score | Criteria |
|-------|----------|
| 1 | Everything in one file. No reference files. Or: references exist but are loaded eagerly. |
| 2 | Some references exist but loading isn't well-timed, or reference structure is unclear. |
| 3 | Reference files loaded only at the phase that needs them. Main SKILL.md is lean. References are clearly named and scoped. |
**Note:** A simple skill that genuinely doesn't need references can still score 3 — progressive disclosure means "don't front-load what isn't needed yet", not "must have reference files."
---
## Bonus Dimensions
### Innovation
| Score | Criteria |
|-------|----------|
| 1 | Commodity wrapper around a well-known tool or API. No novel framing. "ChatGPT but for X." |
| 2 | Applies existing techniques to a specific domain in a useful way. Competent but not surprising. |
| 3 | Genuinely novel approach, creative problem framing, or addresses a gap nobody else has filled. Makes you think "why didn't this exist already?" |
---
### Style
| Score | Criteria |
|-------|----------|
| 1 | Reads like generic AI-generated text. No authorial voice. Corporate-bland or template-obvious. |
| 2 | Has some personality but inconsistent. Mixes voices or defaults to generic in places. |
| 3 | Consistent, confident authorial voice throughout. Reads like a person with opinions wrote it. Tone matches the skill's purpose. |
**Strong signal:** The voice section (if present) has specific examples, not just adjectives.
**Weak signal:** "Be helpful and professional" — says nothing.
---
### Vibes
The gut-check composite. Score based on these three sub-questions:
1. **Would I install this?** — Does it solve a real problem I've had, or is it a solution looking for a problem?
2. **Does the voice feel human?** — Confident and opinionated, or hedging and generic?
3. **Is the hook compelling?** — After reading the name + description + first 10 lines, do I want to keep reading?
| Score | Criteria |
|-------|----------|
| 1 | Fails all three. Academic exercise or toy demo. No pull. |
| 2 | Passes 1-2. Useful but not exciting, or exciting but not practical. |
| 3 | Passes all three. You'd install it, recommend it, and remember the name. |

```bash
git add references/scoring-rubric.md
git commit -m "feat: add detailed scoring rubric reference (11 dimensions)"
```

Files:
references/example-evaluation.md

A worked example showing what a completed evaluation looks like. Uses the devcon-hack-coach skill (a known 100/100 scorer) as the subject. This helps judges calibrate and shows contestants what "good" looks like.
# Example Evaluation: devcon-hack-coach
This is a worked example of a completed evaluation. Use it to calibrate your expectations for the output format and scoring depth.
---
## Scorecard: devcon-hack-coach
### Core Score: 100/100
| Dimension | Score | Reasoning |
|------------------------|-------|------------------------------------------------------------------------------------------------|
| Specificity | 3/3 | Names exact deliverables: one-page spec, 4-checkpoint plan, 3-sentence pitch |
| Trigger Terms | 3/3 | Six natural phrases including "coach me through a DevCon hack", "scope my 24h hack" |
| Completeness | 3/3 | Purpose (hackathon coaching) and activation context (DevCon 2026 prep) both crystal clear |
| Distinctiveness | 3/3 | Scoped to a single event + format; zero conflict risk with general coaching skills |
| Conciseness | 3/3 | Every section earns its place. Anti-patterns list is tight. No filler. |
| Actionability | 3/3 | Exact questions to ask, exact exit gates, example dialogue for each phase |
| Workflow Clarity | 3/3 | 4 named phases with explicit exit gates and loop-back conditions ("loop inside the phase") |
| Progressive Disclosure | 3/3 | 5 reference files loaded only when their phase starts; main file stays lean |
### Bonus Score: +8/9
| Dimension | Score | Reasoning |
|------------|-------|----------------------------------------------------------------------------------|
| Innovation | 3/3 | Opinionated "no code before spec" stance is a novel coaching angle for hackathons |
| Style | 3/3 | Pushy coach voice with specific example dialogue — distinctive and consistent |
| Vibes | 2/3 | Compelling hook and real utility, but narrow event window limits install appeal |
### Verdict
Competition-ready. The 4-phase workflow with strict exit gates is the strongest element — it turns vague coaching into a repeatable process. The only soft spot is vibes: the skill is tied to a single event (DevCon June 2026), which limits its shelf life. A contestant might generalize the concept to "hackathon coach" for broader appeal, but for this contest the event specificity is actually a strength for distinctiveness.

```bash
git add references/example-evaluation.md
git commit -m "feat: add worked example evaluation (devcon-hack-coach)"
```

Files: SKILL.md

This is the core deliverable. It must stay under 500 lines. It uses progressive disclosure to keep the main file lean — detailed rubric criteria live in references/scoring-rubric.md.
---
name: aie26-skill-judge
description: Evaluates SKILL.md submissions for the AI Engineer London 2026 Skills Contest across 11 dimensions (8 official Tessl rubric + 3 bonus). Use when you say "judge my AIE26 skill", "score this SKILL.md for the contest", "review my skill submission", "how would this score on the leaderboard", "rate my skill before I submit", or "give me judge feedback on this skill". Accepts GitHub repo URLs, file paths, or raw SKILL.md pastes.
---
# AIE26 Skill Judge
You evaluate SKILL.md files submitted to the **AI Engineer London 2026 Skills Contest** (`skillleaderboard.alan-626.workers.dev/AIE26`). You score on the official Tessl rubric plus three bonus dimensions, and provide actionable feedback.
You serve two audiences:
- **Tessl judges** systematically scoring a batch of submissions
- **Contestants** self-checking before they submit
## Do NOT use this when
- The user wants to *write* or *optimize* a skill — you evaluate, you don't edit
- The user asks about the contest rules, schedule, or prizes — direct them to the leaderboard page
- The SKILL.md is for a different contest or registry — this rubric is AIE26-specific
## Voice
Authoritative but constructive. Like a senior judge at a pitch competition.
- Specific and evidence-based — quote the SKILL.md directly
- When something is weak, name what's wrong AND how to fix it
- When something is strong, acknowledge it briefly and move on
- Never vague ("this is good"), never cruel ("this is terrible")
## Input Detection
Accept the skill in any of these formats:
1. **GitHub URL** (starts with `https://github.com` or `github.com`) → fetch the SKILL.md from the repo root using `gh` CLI or web fetch
2. **Raw paste** (starts with `---` frontmatter) → use directly
3. **Local file path** (starts with `/` or `~`, or ends in `.md`) → read the file
4. **Ambiguous** → ask: "Is that a file path, a repo URL, or the skill content itself?"
Once you have the SKILL.md content, confirm what you received: *"Got it — evaluating `<skill-name>` (<line count> lines). Running the gauntlet."*
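The detection rules above can be sketched as a small classifier. This is an illustration only — the skill itself is purely conversational and runs no code — and the function name and check ordering are my own, not part of the spec:

```python
import re

def detect_input_kind(text: str) -> str:
    """Classify a submission string per the Input Detection rules."""
    t = text.strip()
    # 1. GitHub URL: starts with https://github.com or github.com
    if re.match(r"^(https://)?github\.com/", t):
        return "github-url"
    # 2. Raw paste: begins with YAML frontmatter
    if t.startswith("---"):
        return "raw-paste"
    # 3. Local file path: absolute/home path, or a .md filename
    if t.startswith(("/", "~")) or t.endswith(".md"):
        return "file-path"
    # 4. Anything else: ask the user to clarify
    return "ambiguous"

print(detect_input_kind("https://github.com/mertpaker/aie26-skill-judge"))  # prints: github-url
```

Note the ordering matters: a GitHub URL ending in `SKILL.md` must hit rule 1 before the `.md` suffix check in rule 3.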
## The 5-Phase Evaluation
Run these phases in order. Do not skip any.
---
### Phase 1 — Ingest
Read the SKILL.md. Extract:
- `name` from frontmatter
- `description` from frontmatter
- Total line count
- Whether reference files are mentioned
Display: *"Evaluating `<name>` — <line_count> lines, <N> reference files mentioned."*
---
### Phase 2 — Structural Check
Validate before scoring. Check all of these:
- [ ] Frontmatter block exists (starts with `---`, ends with `---`)
- [ ] `name` field present and non-empty
- [ ] `description` field present and non-empty
- [ ] Total line count <= 500
- [ ] Description includes trigger language ("Use when...", natural phrases, or usage scenarios)
**If any check fails:** Report each failure with a specific fix instruction. Do NOT proceed to scoring.
Example failure output:
> **Structural Issues (2 found):**
> 1. Missing `description` field in frontmatter — add a description with concrete capabilities and 3-6 trigger phrases
> 2. Line count is 523 (max 500) — cut 23 lines; check for redundant sections or over-explained concepts
>
> *Fix these before resubmitting for evaluation.*
**If all pass:** *"Structure checks passed. Scoring..."*
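The mechanical subset of these checks can be expressed as code. A minimal sketch, for illustration only (the skill performs these checks conversationally; the helper name is hypothetical, and the trigger-language check is a judgment call left out):

```python
import re

def structural_check(skill_md: str) -> list[str]:
    """Phase 2 checks; returns failure messages (empty list = proceed to scoring)."""
    lines = skill_md.splitlines()
    stripped = [ln.strip() for ln in lines]
    failures = []
    # Frontmatter block: first line is ---, with a closing --- later
    if not (stripped and stripped[0] == "---" and "---" in stripped[1:]):
        failures.append("missing frontmatter block (must start and end with ---)")
    else:
        end = stripped[1:].index("---") + 1
        frontmatter = "\n".join(lines[1:end])
        if not re.search(r"^name:\s*\S", frontmatter, re.M):
            failures.append("missing or empty `name` field in frontmatter")
        if not re.search(r"^description:\s*\S", frontmatter, re.M):
            failures.append("missing or empty `description` field in frontmatter")
    if len(lines) > 500:
        failures.append(f"line count is {len(lines)} (max 500); cut {len(lines) - 500} lines")
    return failures
```

Each failure message maps to one specific fix instruction, matching the example failure output above.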
---
### Phase 3 — Core Evaluation
Load `references/scoring-rubric.md` for detailed per-dimension criteria.
Score the 8 official Tessl dimensions. For each dimension:
1. Read the detailed criteria from the rubric reference
2. Find specific evidence in the SKILL.md (quote it)
3. Assign 1 (Weak), 2 (Adequate), or 3 (Strong)
4. Write a one-line reasoning that references the evidence
**Core score formula:** `round((sum of 8 scores / 24) * 100)`
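The formula as a runnable sketch (helper name hypothetical; the skill does this arithmetic conversationally). Eight dimensions at 1-3 each give a raw range of 8-24, normalized to 0-100:

```python
def core_score(scores: dict[str, int]) -> int:
    """Normalize the 8 core dimension scores (1-3 each) to a 0-100 scale."""
    assert len(scores) == 8 and all(1 <= s <= 3 for s in scores.values())
    return round(sum(scores.values()) / 24 * 100)

# devcon-hack-coach from the worked example: all eight dimensions at 3 -> 100
```

A single dropped point costs about 4 points on the 100 scale (23/24 rounds to 96).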
---
### Phase 4 — Bonus Evaluation
Using the bonus criteria from `references/scoring-rubric.md`, score the 3 bonus dimensions:
- **Innovation:** Is this a novel approach or a commodity wrapper?
- **Style:** Does it have a consistent, human authorial voice?
- **Vibes:** Would you install this? Does it solve a real itch? Is the hook compelling?
Report as `+X/9`.
---
### Phase 5 — Synthesize
Produce the full output in this exact format:

| Dimension | Score | Reasoning |
|---|---|---|
| Specificity | X/3 | <one line with evidence> |
| Trigger Terms | X/3 | <one line with evidence> |
| Completeness | X/3 | <one line with evidence> |
| Distinctiveness | X/3 | <one line with evidence> |
| Conciseness | X/3 | <one line with evidence> |
| Actionability | X/3 | <one line with evidence> |
| Workflow Clarity | X/3 | <one line with evidence> |
| Progressive Disclosure | X/3 | <one line with evidence> |

| Dimension | Score | Reasoning |
|---|---|---|
| Innovation | X/3 | <one line with evidence> |
| Style | X/3 | <one line with evidence> |
| Vibes | X/3 | <one line with evidence> |
<paragraph: what's strong, what's weak, direct quotes from the SKILL.md, specific fix suggestions if score < 3>
Repeat the detailed feedback section for all 11 dimensions. End with:

<2-3 sentences: is this competition-ready? What is the single highest-leverage improvement? Be specific.>
---
## Calibration
For reference, a worked example evaluation is available in `references/example-evaluation.md`. Load it if a user asks "what does a good score look like?" or "show me an example evaluation."
## Edge Cases
- **No SKILL.md found in repo:** *"I couldn't find a SKILL.md at the root of this repo. Is it in a subdirectory?"*
- **Multiple skills in repo:** Score only the root SKILL.md unless the user says otherwise
- **Very short (< 20 lines):** Flag as likely incomplete, still score what exists
- **Non-English:** Score as-is, note the contest appears to be English-language
- **User asks to evaluate multiple skills:** Evaluate them one at a time, each getting a full scorecard
## What This Skill Does NOT Do
- Rank skills against each other — no access to the full submission pool
- Modify or optimize the SKILL.md — evaluation only
- Run `tessl skill review` or any CLI tools — pure conversational evaluation
- Evaluate runtime behavior — only the SKILL.md document itself

```bash
wc -l SKILL.md
```

Expected: under 500 lines.

```bash
git add SKILL.md
git commit -m "feat: add aie26-skill-judge SKILL.md (evaluation workflow + 11 dimensions)"
```

Files: .github/workflows/tessl-publish.yml

Mirrors the pattern from devcon-hack-coach. Runs lint + review on push to main.
```bash
mkdir -p .github/workflows
```

```yaml
name: Lint
on:
  push:
    branches: [main]
    paths:
      - 'SKILL.md'
      - 'tile.json'
      - 'references/**'
      - 'evals/**'
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4.3.0
      - uses: tesslio/setup-tessl@v2
        with:
          token: ${{ secrets.TESSL_TOKEN }}
      - name: Lint
        run: tessl skill lint
      - name: Review
        run: tessl skill review
```

```bash
git add .github/workflows/tessl-publish.yml
git commit -m "ci: add tessl lint + review workflow on push to main"
```

Files: None (validation only)
Run the Tessl CLI locally to verify the skill passes lint and review before pushing.
```bash
tessl skill lint
```

Expected: passes with no errors. May warn about "orphaned files" for references — that's fine.

```bash
tessl skill review
```

Expected: scores across Description and Content dimensions. Target: 80+ overall. If below, read the feedback and fix the SKILL.md in a follow-up commit.
If lint/review surfaces problems, edit the relevant file, re-run the check, and commit the fix:
```bash
git add <fixed-files>
git commit -m "fix: address tessl lint/review feedback"
```

Files: None (git operations only)

```bash
gh auth status
```

Verify you're authenticated as mertpaker (personal account).

```bash
gh repo create mertpaker/aie26-skill-judge --public --source=. --push
gh repo view mertpaker/aie26-skill-judge --web
```

Go to https://skillleaderboard.alan-626.workers.dev/AIE26 and submit:

https://github.com/mertpaker/aie26-skill-judge