Evaluates SKILL.md submissions for the AI Engineer London 2026 Skills Contest across 11 dimensions (8 official Tessl rubric + 3 bonus). Use when you say 'judge my AIE26 contest skill', 'score this SKILL.md for the contest', 'review my skill submission', or 'how would this score on the leaderboard'. Accepts GitHub repo URLs, file paths, or raw pastes.
Each dimension is scored 1 (Weak), 2 (Adequate), or 3 (Strong).
Does the description name concrete, actionable capabilities?
| Score | Criteria |
|---|---|
| 1 | Vague or abstract ("helps with development", "AI-powered assistant"). No specific actions named. |
| 2 | Names some capabilities but mixes concrete with vague. Unclear what the skill actually does vs. what it aspires to. |
| 3 | Every capability is a concrete verb + object ("generates Kubernetes RBAC policies", "scores SKILL.md submissions on 11 dimensions"). Reader knows exactly what they get. |
Red flags for 1: "helps", "assists", "enhances", "leverages" without a direct object.
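To make the contrast concrete, here is a sketch of a weak vs. strong frontmatter description (the RBAC skill is invented for illustration; the field name follows standard SKILL.md frontmatter):

```markdown
---
# Scores 1: no capability named, every verb is a red-flag word
description: An AI-powered assistant that helps with Kubernetes and enhances your workflow.
---

---
# Scores 3: every capability is a concrete verb + object
description: Generates Kubernetes RBAC policies from plain-English access requirements,
  audits existing Role and RoleBinding manifests for over-broad permissions, and
  produces least-privilege diffs.
---
```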
Does the description include natural phrases a user would actually say?
| Score | Criteria |
|---|---|
| 1 | No trigger phrases, or phrases no human would say ("utilize the skill to evaluate"). |
| 2 | 1-2 trigger phrases present but generic ("help me with X") or forced-sounding. |
| 3 | 3-6 natural trigger phrases that read like something a real person would type. Covers both direct requests and situational triggers. |
Strong examples: "judge my AIE26 contest skill", "score this for the contest", "will this win?" Weak examples: "invoke skill evaluation mode", "perform assessment".
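A sketch of what a 3 looks like in practice, using the same invented RBAC skill; note the mix of direct requests and one situational trigger:

```markdown
---
description: Generates Kubernetes RBAC policies from plain-English requirements.
  Use when you say "write an RBAC policy for this service account", "audit my
  cluster roles", "why is this pod forbidden", or when a kubectl command fails
  with a permissions error.
---
```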
Does the description cover both what (purpose) and when (usage scenarios)?
| Score | Criteria |
|---|---|
| 1 | Missing either what or when entirely. Reader can't tell what it does OR when to use it. |
| 2 | Has what but weak/missing when, or vice versa. Purpose is clear but activation context is vague. |
| 3 | Crystal clear on both. Reader knows the skill's purpose AND can identify the exact moment they'd reach for it. |
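The description at the top of this rubric is one model of the pattern; a minimal invented sketch:

```markdown
---
# What: the first sentence names the purpose.
# When: the second names the exact moments a user would reach for it.
description: Scores SKILL.md submissions on 11 dimensions. Use when you say
  "judge my contest skill" or paste a SKILL.md and ask "will this win?"
---
```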
Is this skill clearly different from existing skills? Low conflict risk?
| Score | Criteria |
|---|---|
| 1 | Overlaps heavily with common built-in skills or well-known existing skills. Would confuse the skill router. |
| 2 | Somewhat distinct but shares surface area with adjacent skills. Trigger terms could collide. |
| 3 | Clear niche. No realistic conflict with existing skills. Name + description + triggers carve out unique territory. |
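A hypothetical illustration of collision risk, assuming a router that matches on name, description, and trigger terms:

```markdown
---
# Scores 1: collides with every generic review skill the router knows
name: code-review
description: Reviews code and suggests improvements.
---

---
# Scores 3: name, description, and triggers carve out unique territory
name: aie26-skill-judge
description: Scores SKILL.md submissions for the AI Engineer London 2026 contest.
---
```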
Is the content token-efficient? No padding or over-explanation?
| Score | Criteria |
|---|---|
| 1 | Bloated with filler, redundant sections, or verbose explanations of simple concepts. Could be half the length. |
| 2 | Some unnecessary prose but core content is present. Could trim 20-30% without losing substance. |
| 3 | Every line earns its place. No padding, no redundancy, no over-explaining. Uses tables and lists over paragraphs where appropriate. |
Red flags for 1: Repeating the same instruction in different words, explaining what markdown formatting is, long preambles before the actual instructions.
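A before/after sketch of the same instruction, trimmed:

```markdown
<!-- Scores 1: a preamble plus the same rule stated three ways -->
Before we begin, it is important to understand that output formatting
matters a great deal. Always format your output as a table. Tables are
the preferred output format. Avoid paragraphs when a table would work.

<!-- Scores 3: one line, same rule -->
Output results as a markdown table, one row per dimension.
```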
Does the content contain executable instructions with concrete examples?
| Score | Criteria |
|---|---|
| 1 | Abstract methodology or theory. No examples, no constraints, no concrete steps. |
| 2 | Some concrete instructions but mixed with vague guidance ("handle edge cases appropriately"). |
| 3 | Every instruction is specific enough to execute without interpretation. Includes examples, constraints, expected outputs. |
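An invented example of the jump from vague guidance to an executable instruction with a constraint and an expected output:

```markdown
<!-- Scores 2: requires interpretation -->
Handle malformed submissions appropriately.

<!-- Scores 3: specific action, constraint, expected output -->
If the submission has no frontmatter block, score dimensions 1-4 as 1 and
open the report with the line: "No frontmatter found - description
dimensions defaulted to 1."
```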
Are instructions sequenced into clear phases with exit gates?
| Score | Criteria |
|---|---|
| 1 | Unstructured wall of instructions. No clear order or phases. |
| 2 | Has phases/sections but missing exit gates, or unclear when to move between phases. |
| 3 | Numbered/named phases with explicit entry conditions, exit gates, and loop-back conditions. Reader always knows where they are in the workflow. |
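A minimal sketch of one phase written to score 3, with an explicit entry condition, exit gate, and loop-back (the phase names and steps are invented):

```markdown
## Phase 2: Score dimensions
Entry: submission parsed and frontmatter extracted (Phase 1 complete).
1. Score each of the 11 dimensions 1-3 against its rubric table.
2. Record one line of evidence per score.
Exit gate: all 11 dimensions have both a score and an evidence line.
If any evidence line is empty, return to step 2 before starting Phase 3.
```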
Does the skill load information only when needed?
| Score | Criteria |
|---|---|
| 1 | Everything in one file. No reference files. Or: references exist but are loaded eagerly. |
| 2 | Some references exist but loading isn't well-timed, or reference structure is unclear. |
| 3 | Reference files loaded only at the phase that needs them. Main SKILL.md is lean. References are clearly named and scoped. |
Note: A simple skill that genuinely doesn't need references can still score 3 — progressive disclosure means "don't front-load what isn't needed yet", not "must have reference files."
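A sketch of well-timed loading, assuming hypothetical reference files named for the phase that consumes them:

```markdown
Do not read any reference file at startup. Load each one only at the
phase that needs it:
- Phase 2 (scoring): read references/rubric-details.md
- Phase 4 (report): read references/report-template.md
```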
Is the approach genuinely novel, or a creative framing nobody else has tried?
| Score | Criteria |
|---|---|
| 1 | Commodity wrapper around a well-known tool or API. No novel framing. "ChatGPT but for X." |
| 2 | Applies existing techniques to a specific domain in a useful way. Competent but not surprising. |
| 3 | Genuinely novel approach, creative problem framing, or addresses a gap nobody else has filled. Makes you think "why didn't this exist already?" |
Does the writing have a distinct, consistent authorial voice?
| Score | Criteria |
|---|---|
| 1 | Reads like generic AI-generated text. No authorial voice. Corporate-bland or template-obvious. |
| 2 | Has some personality but inconsistent. Mixes voices or defaults to generic in places. |
| 3 | Consistent, confident authorial voice throughout. Reads like a person with opinions wrote it. Tone matches the skill's purpose. |
Strong signal: The voice section (if present) has specific examples, not just adjectives. Weak signal: "Be helpful and professional" — says nothing.
The gut-check composite. Score it on three sub-questions: would you install it? Would you recommend it? Would you remember the name?
| Score | Criteria |
|---|---|
| 1 | Fails all three. Academic exercise or toy demo. No pull. |
| 2 | Passes 1-2. Useful but not exciting, or exciting but not practical. |
| 3 | Passes all three. You'd install it, recommend it, and remember the name. |
Repository contents: docs, superpowers, evals (scenario-1 through scenario-5), references.