Evaluates SKILL.md submissions for the AI Engineer London 2026 Skills Contest across 11 dimensions (8 official Tessl rubric + 3 bonus). Use when you say 'judge my AIE26 contest skill', 'score this SKILL.md for the contest', 'review my skill submission', or 'how would this score on the leaderboard'. Accepts GitHub repo URLs, file paths, or raw pastes.
A Claude skill that evaluates SKILL.md submissions for the AI Engineer London 2026 Skills Contest across 11 dimensions — 8 from the official Tessl rubric + 3 bonus (innovation, style, vibes).
Built for Tessl judges scoring a batch of submissions; also useful for contestants self-checking before they submit.
🏆 See it in action: We evaluated all 15 AIE26 submissions and ranked the top 10 — jump to the results
Feed it a SKILL.md (GitHub URL, file path, or raw paste) and it runs a 5-phase evaluation:
| Phase | What happens |
|---|---|
| 1. Ingest | Detects input format, extracts name + metadata |
| 2. Structural Check | Validates frontmatter, line count, trigger terms — blocks scoring if broken |
| 3. Core Evaluation | Scores 8 official Tessl dimensions (Specificity, Trigger Terms, Completeness, Distinctiveness, Conciseness, Actionability, Workflow Clarity, Progressive Disclosure) |
| 4. Bonus Evaluation | Scores Innovation, Style, and Vibes |
| 5. Synthesize | Produces a scorecard, per-dimension feedback, and a verdict |
Core score: 0-100, normalized from the 8 core dimensions (each scored 1-3, so 24 raw points max). Bonus: +X/9, reported separately.
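As a rough sketch, the Phase 2 structural gate might look like this in Python (the field name and the line-count threshold are illustrative assumptions, not the skill's actual checks):

```python
import re

def structural_check(skill_md: str) -> list[str]:
    """Return blocking problems; an empty list means scoring may proceed."""
    problems = []

    # YAML frontmatter must open and close with '---' fences.
    match = re.match(r"^---\n(.*?)\n---\n", skill_md, flags=re.DOTALL)
    if match is None:
        problems.append("missing or unclosed YAML frontmatter")
    else:
        frontmatter = match.group(1)
        # The description field carries the trigger terms the router matches on.
        if "description:" not in frontmatter:
            problems.append("frontmatter has no description field")

    # Keep the main file lean (threshold is an assumption, not the rubric's number).
    if len(skill_md.splitlines()) > 500:
        problems.append("SKILL.md exceeds the assumed 500-line budget")

    return problems
```

If this returns a non-empty list, scoring is blocked and the problems are reported instead of a scorecard.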
Say any of:

- "judge my AIE26 contest skill"
- "score this SKILL.md for the contest"
- "review my skill submission"
- "how would this score on the leaderboard"
Core dimensions (official Tessl rubric):

| Dimension | What it measures |
|---|---|
| Specificity | Concrete, actionable capabilities listed |
| Trigger Terms | Natural phrases users would actually say |
| Completeness | Clear "what" (purpose) and "when" (usage) |
| Distinctiveness | Low conflict risk; clear niche |
| Conciseness | Token efficiency; no padding |
| Actionability | Executable instructions, concrete examples |
| Workflow Clarity | Sequenced phases with exit gates |
| Progressive Disclosure | Layered references loaded on demand |
Bonus dimensions:

| Dimension | What it measures |
|---|---|
| Innovation | Novel approach, not a commodity wrapper |
| Style | Human authorial voice, tone consistency |
| Vibes | "Would I install this?" + compelling hook |
Each dimension scored 1 (Weak), 2 (Adequate), 3 (Strong). Detailed criteria in references/scoring-rubric.md.
```
aie26-skill-judge/
├── SKILL.md                  5-phase evaluation workflow
├── references/
│   ├── scoring-rubric.md     Detailed criteria for all 11 dimensions
│   └── example-evaluation.md Worked example (devcon-hack-coach, 100/100)
└── README.md                 This file
```

From the Tessl registry:

```
tessl install paker-it/aie26-skill-judge
```

Or directly — clone this repo and point your Claude Code config at the directory:

```
git clone https://github.com/mertpaker/aie26-skill-judge.git ~/.claude/skills/aie26-skill-judge
```

See references/example-evaluation.md for a full worked evaluation of the devcon-hack-coach skill (Core: 100/100, Bonus: +8/9).
We used aie26-skill-judge to evaluate all 15 submissions on the AIE26 leaderboard and rank the top 10.
This evaluation is a weekend experiment — a way to dogfood the skill against real submissions, not a judgment on anyone's work. Every skill on the leaderboard represents time, creativity, and craft that we respect. Scores are generated by an LLM applying a rubric and will vary between runs; they are not definitive rankings. If your skill appears here and you'd like it removed, open an issue and we'll take it down immediately. We love all developers who showed up and built something.
"Here are all the AIE26 contest submissions. Judge each one using the aie26-skill-judge rubric (8 core dimensions scored 1-3 from references/scoring-rubric.md + 3 bonus dimensions). Produce a scorecard for each, then rank the top 10 and compare their pros and cons."
Each skill was evaluated independently by a separate agent using the scoring rubric, then results were compiled and ranked.
How to read the scores: Core is our judge score — round((sum of 8 dimension scores / 24) * 100) using the aie26-skill-judge rubric. Leaderboard is the official Tessl automated review score from the AIE26 contest page. These are two independent scoring systems and may disagree.
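That arithmetic is easy to reproduce. As a sanity check, plugging wigo's eight dimension scores from the breakdown table into the formula recovers its 96/100 (a minimal sketch, not the skill's actual code):

```python
def core_score(dimension_scores: list[int]) -> int:
    """Normalize eight 1-3 dimension scores to a 0-100 Core score."""
    assert len(dimension_scores) == 8 and all(1 <= s <= 3 for s in dimension_scores)
    return round(sum(dimension_scores) / 24 * 100)

def bonus_score(bonus_scores: list[int]) -> str:
    """Bonus (Innovation, Style, Vibes) is reported separately as +X/9."""
    return f"+{sum(bonus_scores)}/9"

# wigo's eight core dimensions: seven 3s and a 2 for Progressive Disclosure.
print(core_score([3, 3, 3, 3, 3, 3, 3, 2]))  # → 96
print(bonus_score([3, 2, 3]))                # → +8/9
```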
| Rank | Skill | Author | Core | Bonus | Leaderboard | Registry |
|---|---|---|---|---|---|---|
| 1 | wigo | Paulo Matos | 96/100 | +8/9 | 86/100 | 82/100 |
| 2 | devcon-hack-coach | Mert Paker | 96/100 | +8/9 | 100/100 | 100/100 |
| 3 | evidence-verifier | Macey Baker | 96/100 | +7/9 | 93/100 | — |
| 4 | k8s-security-audit | Juan | 92/100 | +7/9 | 100/100 | — |
| 5 | spec-interrogator | Jakub Czarnowski | 88/100 | +8/9 | 100/100 | — |
| 6 | skill-writer | Juan | 88/100 | +7/9 | 100/100 | — |
| 7 | shekel-ui | Omer Bresinski | 83/100 | +7/9 | 85/100 | — |
| 8 | de-llm-ify-writing | Alan Pope | 79/100 | +5/9 | 53/100 | — |
| 9 | agent-school | James Moss | 78/100 | +8/9 | 81/100 | 85/100 |
| 10 | writing-clearly-and-concisely | Martin Wimpress | 77/100 | +7/9 | 73/100 | — |
Registry scores from tessl.io. Paste-submitted skills (—) don't have registry pages.
Dimension-level breakdown for the top 5:

| Dimension | wigo | devcon-hack-coach | evidence-verifier | k8s-security-audit | spec-interrogator |
|---|---|---|---|---|---|
| Specificity | 3 | 3 | 3 | 3 | 3 |
| Trigger Terms | 3 | 3 | 3 | 3 | 3 |
| Completeness | 3 | 3 | 3 | 3 | 3 |
| Distinctiveness | 3 | 3 | 3 | 3 | 3 |
| Conciseness | 3 | 3 | 3 | 2 | 3 |
| Actionability | 3 | 3 | 3 | 3 | 3 |
| Workflow Clarity | 3 | 3 | 2 | 3 | 2 |
| Progressive Disclosure | 2 | 3 | 3 | 2 | 3 |
| Innovation | 3 | 3 | 2 | 2 | 3 |
| Style | 2 | 3 | 3 | 2 | 3 |
| Vibes | 3 | 2 | 2 | 3 | 2 |
1. wigo (Paulo Matos)

| Pros | Cons |
|---|---|
| Most innovative technique: mines Claude's own .jsonl session logs to reconstruct context | All content in one file — Python script and suggestion matrix should be in references |
| Solves a universal problem: "where was I?" after context-switching | Voice is technically precise but personality-neutral |
| Explicit parallel/sequential phasing with conditional branching | Heavier than others — 9400 chars of inline code |
| Memorable name, strong hook, you'd share this with teammates |
2. devcon-hack-coach (Mert Paker)

| Pros | Cons |
|---|---|
| Best workflow clarity — 4 phases with named exit gates, loop-back conditions, and a terminal state | Narrow audience: only useful for DevCon 2026 attendees |
| Strongest voice: "That's three features. Pick one." — pushy coach persona never slips | Event-specific scoping limits shelf life |
| Textbook progressive disclosure: 5 references, each tied to a specific phase | Could be generalized to "any 24h hackathon" for broader reach |
| Spec-before-code hard gate is a genuinely original coaching angle |
3. evidence-verifier (Macey Baker)

| Pros | Cons |
|---|---|
| Leanest skill of all — zero waste, every line earns its place | No exit gate: what happens when a claim is blocked? |
| Evidence table output format is immediately actionable | Only 3 trigger phrases (rubric wants 3-6) |
| Strong epistemic stance: "refuse to certify without evidence" | Concept is "obviously good" rather than "surprisingly brilliant" |
| Worked mini-example grounds the template in reality | No reference files — simple but no room to deepen |
4. k8s-security-audit (Juan)

| Pros | Cons |
|---|---|
| 8+ trigger phrases covering every way someone asks for a k8s audit | Essential Evidence Commands block should be in references/, not inline |
| Real kubectl/jq commands make it production-ready | Voice is professional but impersonal |
| 8 audit categories with severity taxonomy = serious depth | Innovation limited — follows well-known CIS/NSA frameworks |
| Strongest practical "vibes" — you'd install this tomorrow | Partial progressive disclosure |
5. spec-interrogator (Jakub Czarnowski)

| Pros | Cons |
|---|---|
| Highest innovation density: "propose a recommended answer with every question" | Soft stop condition — relies on user to say "we're done" |
| Best craft: "Kill scope creep on sight", "never ask a cold question" | Implicit phases with no named exit gates |
| "Read the codebase instead of asking" — a rule nobody else thought of | No output by default — some users won't know to ask for a deliverable |
| Most concise: entire skill in ~50 lines | Slightly less "production-ready" feel than the top 4 |
6. skill-writer (Juan)

| Pros | Cons |
|---|---|
| Meta-skill: teaches you to write skills — useful for the entire ecosystem | Workflow gates are thin — "return to the relevant step" lacks specifics |
| "Be pushy — Claude undertriggers" is opinionated, practical advice | Progressive disclosure reference isn't gated with load conditions |
| Copy-paste-ready minimal example included | Innovation is predictable: meta-skill for a skill ecosystem |
| Clear folder structure and YAML rules for newcomers |
7. shekel-ui (Omer Bresinski)

| Pros | Cons |
|---|---|
| Textbook progressive disclosure: lean main file, 4 named reference docs | No trigger phrases — user-invocable: false means router can't find it |
| Extremely specific: named fonts, oklch tokens, exact utility classes | Workflow deferred entirely to reference file — main SKILL.md has no phases |
| Strong opinionated voice: "hermetically sealed", "warm editorial" | Niche audience: only useful if you're building with this specific design system |
| Concise and zero-waste throughout |
8. de-llm-ify-writing (Alan Pope)

| Pros | Cons |
|---|---|
| Names specific LLM tells: "staccato sentence patterns", "stock contrast constructions" | No trigger phrases at all — users can't find it |
| Concrete quality checklist for evaluating prose | No workflow phases or sequencing — reference doc, not a skill |
| Strong authorial voice with real opinions about writing | Some redundancy in anti-pattern section |
| Addresses a timely, real problem (AI-sounding prose) |
9. agent-school (James Moss)

| Pros | Cons |
|---|---|
| Novel concept: generate persistent tile artifacts to teach agents | Trigger terms are implicit, not conversational |
| 5 clear phases with user confirmation checkpoints | Somewhat verbose — could trim 15-20% |
| High innovation: bakes knowledge into agent systems, not one-off answers | Niche audience: tessl tile authors only |
| Strong authoritative voice throughout |
10. writing-clearly-and-concisely (Martin Wimpress)

| Pros | Cons |
|---|---|
| Excellent specificity: concrete before/after examples for every rule | No conversational trigger phrases |
| Novel AI-pattern taxonomy: banned words, puffery detection | No workflow — rules listed but not sequenced |
| Progressive disclosure to prose-style-reference for extended tasks | Academic presentation — useful but not exciting |
| Every line earns its place — tight and concise |
Built for the AI Engineer London 2026 (AIE26) skills contest. Submission by Mert Paker.
Tessl review score: 94% (Description 100%, Content 85%).