Evaluates SKILL.md submissions for the AI Engineer London 2026 Skills Contest across 11 dimensions (8 from the official Tessl rubric plus 3 bonus). Use when the user says 'judge my AIE26 contest skill', 'score this SKILL.md for the contest', 'review my skill submission', or 'how would this score on the leaderboard'. Accepts GitHub repo URLs, file paths, or raw pastes.
82
94%
Does it follow best practices?
Impact
65%
1.80x
Average score across 5 eval scenarios
Risky
Do not use without reviewing
{
  "context": "Tests whether the agent loads and shows the calibration example when asked, and whether it uses detailed rubric criteria when evaluating a skill. The agent should demonstrate evidence of rubric-driven scoring rather than generic judgment, and should produce a complete scorecard.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Calibration example shown",
      "description": "When asked 'what does a good score look like?', output references or reproduces content from the worked example — mentions the example skill name, shows example scores, or quotes from the worked example evaluation",
      "max_score": 12
    },
    {
      "name": "Receipt confirmation for submitted skill",
      "description": "Output contains the receipt confirmation for pr-reviewer: 'Got it — evaluating `pr-reviewer` (N lines). Running the gauntlet.'",
      "max_score": 8
    },
    {
      "name": "Phase 1 display line",
      "description": "Output contains Phase 1 display line: 'Evaluating `pr-reviewer` — N lines, N reference files mentioned.'",
      "max_score": 8
    },
    {
      "name": "Rubric level language used",
      "description": "At least one dimension reasoning uses rubric-anchored vocabulary — mentions Weak/Adequate/Strong, or uses criteria-specific language like 'concrete verb + object', 'exit gates', '3-6 natural trigger phrases', or 'no human would say'",
      "max_score": 12
    },
    {
      "name": "Evidence quoted in scoring",
      "description": "At least 3 dimension scores include a direct quote or specific reference to text from the submitted pr-reviewer SKILL.md",
      "max_score": 10
    },
    {
      "name": "Rubric criteria applied correctly",
      "description": "At least one dimension is justified with threshold-level reasoning that references specific rubric criteria (e.g. 'has 5 trigger phrases, meets the 3-6 threshold', 'phases present but exit gates absent')",
      "max_score": 12
    },
    {
      "name": "All 11 dimensions scored",
      "description": "Both the Core Score table (8 dimensions) and Bonus Score table (3 dimensions) are present in the evaluation",
      "max_score": 10
    },
    {
      "name": "Detailed feedback for all 11 dimensions",
      "description": "A Detailed Feedback section is present with individual subsections for all 11 dimensions",
      "max_score": 10
    },
    {
      "name": "Core score formula applied",
      "description": "The displayed Core Score (XX/100) is consistent with round((sum of 8 scores / 24) * 100)",
      "max_score": 10
    },
    {
      "name": "Verdict present and specific",
      "description": "Verdict section is present and names a specific highest-leverage improvement that references the skill content (not just generic advice)",
      "max_score": 8
    }
  ]
}
}
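The criteria file above combines two pieces of arithmetic: a weighted checklist whose `max_score` values add up to 100, and the Core Score formula `round((sum of 8 scores / 24) * 100)`. The Python sketch below illustrates both under stated assumptions. It assumes each of the 8 core dimensions is scored 0 to 3 (inferred from the divisor 24, not stated in the file), that awarded checklist points are capped at each item's `max_score`, and that Python's built-in `round` is an acceptable reading of "round". The function names and the `awarded` mapping are hypothetical and are not part of the contest tooling.

```python
# Checklist names and max_score values copied from the criteria file above.
CHECKLIST = [
    {"name": "Calibration example shown", "max_score": 12},
    {"name": "Receipt confirmation for submitted skill", "max_score": 8},
    {"name": "Phase 1 display line", "max_score": 8},
    {"name": "Rubric level language used", "max_score": 12},
    {"name": "Evidence quoted in scoring", "max_score": 10},
    {"name": "Rubric criteria applied correctly", "max_score": 12},
    {"name": "All 11 dimensions scored", "max_score": 10},
    {"name": "Detailed feedback for all 11 dimensions", "max_score": 10},
    {"name": "Core score formula applied", "max_score": 10},
    {"name": "Verdict present and specific", "max_score": 8},
]


def max_total(checklist):
    """Return the highest total the checklist can award."""
    return sum(item["max_score"] for item in checklist)


def checklist_total(checklist, awarded):
    """Sum awarded points per item (hypothetical name -> points mapping).

    Assumption: points are clamped to [0, max_score]; missing items count as 0.
    """
    total = 0
    for item in checklist:
        points = awarded.get(item["name"], 0)
        total += max(0, min(points, item["max_score"]))
    return total


def core_score(scores):
    """Apply round((sum of 8 scores / 24) * 100).

    Assumption: each of the 8 core dimensions is scored 0-3, so 24 is the maximum sum.
    Note: Python's round() rounds exact halves to the nearest even integer.
    """
    if len(scores) != 8:
        raise ValueError("expected exactly 8 core dimension scores")
    if any(not 0 <= s <= 3 for s in scores):
        raise ValueError("each score is assumed to lie in the range 0-3")
    return round(sum(scores) / 24 * 100)


if __name__ == "__main__":
    print(max_total(CHECKLIST))        # total weight of the checklist
    print(core_score([2] * 8))         # a skill rated 2 on every core dimension
    print(checklist_total(CHECKLIST, {"Calibration example shown": 12}))
```

A grader checking the "Core score formula applied" item could recompute `core_score` from the eight values in the Core Score table and compare the result with the displayed XX/100 figure.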