Launch a meta-judge then a judge sub-agent to evaluate results produced in the current conversation
Quality: 32% (Does it follow best practices?)
Impact: Pending (No eval scenarios have been run)
Risk: Risky (Do not use without reviewing)
Optimize this skill with Tessl:

npx tessl skill review --optimize ./plugins/sadd/skills/judge/SKILL.md

Quality

Discovery: 17%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
The description uses internal/technical jargon ('meta-judge', 'sub-agent') that would not match natural user queries, and lacks an explicit 'Use when...' clause to guide skill selection. While it hints at a specific evaluation workflow, the vague scope ('results produced in the current conversation') and absence of concrete trigger terms make it difficult for Claude to reliably select this skill at the right time.
Suggestions
- Add an explicit 'Use when...' clause with natural trigger terms like 'evaluate output', 'score results', 'judge quality', 'assess response', or 'grade answers'.
- Specify what types of results are evaluated and what the evaluation produces (e.g., 'Scores outputs against rubrics and produces structured feedback with ratings').
- Replace or supplement jargon like 'meta-judge' and 'sub-agent' with plain-language descriptions of the workflow so users' natural language queries can match. A rewritten description is sketched after this list.
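For illustration only, here is one way the frontmatter description could incorporate these suggestions. The wording, trigger phrases, and the `name` key shown are assumptions based on the review, not content from the skill under evaluation:

```markdown
---
name: judge
description: >
  Evaluate, score, or grade results produced earlier in the current
  conversation by dispatching an isolated judge agent against a rubric.
  Use when the user asks to evaluate output, score results, judge quality,
  assess a response, or grade answers. Produces structured feedback with
  per-dimension ratings and justifications.
---
```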
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Names the domain (evaluation/judging) and some actions ('launch a meta-judge', 'judge sub-agent', 'evaluate results'), but the actions are not comprehensively described: what kind of results? What does evaluation entail? The terms 'meta-judge' and 'sub-agent' are somewhat specific but lack concrete detail about what they produce. | 2 / 3 |
| Completeness | Provides a partial 'what' (launch judges to evaluate results) but has no explicit 'when' clause or trigger guidance. Per the rubric, a missing 'Use when...' clause caps completeness at 2, and the 'what' itself is also weak, so this scores a 1. | 1 / 3 |
| Trigger Term Quality | Uses technical jargon like 'meta-judge', 'sub-agent', and 'current conversation' that users would rarely naturally say. Missing natural trigger terms a user might use such as 'evaluate', 'score', 'grade', 'assess quality', 'review output', or 'judge results'. | 1 / 3 |
| Distinctiveness / Conflict Risk | The concept of a 'meta-judge' and 'judge sub-agent' is somewhat distinctive, but 'evaluate results produced in the current conversation' is broad enough to overlap with any evaluation, review, or quality-checking skill. | 2 / 3 |
| Total | | 6 / 12 Passed |
Implementation: 47%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
The skill defines a clear four-phase evaluation pipeline with good workflow sequencing and validation checkpoints. However, it is significantly overlong, repeating concepts like context isolation and evidence-based scoring multiple times across sections. The prompt templates provide structure but rely on placeholders and external agent instructions that aren't included, reducing actionability.
Suggestions
- Cut the <task> and <context> blocks entirely; they explain the pattern to Claude rather than instructing it. Move any unique information into the workflow steps.
- Remove the 'Important Guidelines' section or reduce it to 3-4 items that aren't already stated in the workflow phases (e.g., items 4, 6, and 8 are already covered in Phase 1 and Phase 4).
- Provide a concrete, filled-in example of at least one prompt template (e.g., a sample meta-judge prompt for evaluating a Python module) so the skill is more actionable; one possible shape is sketched after this list.
- Split the scoring interpretation table and guidelines into a separate reference file to reduce the main skill's token footprint.
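As a purely illustrative sketch of what a filled-in meta-judge prompt might look like, per the suggestion above. The artifact name, task description, and dimension count below are invented for the example and do not come from the skill:

```markdown
You are a meta-judge. Design an evaluation rubric for the artifact below;
a separate judge agent will score against it in an isolated context.

Artifact type: Python module (`timestamp_parser.py`) produced earlier in this conversation.
Intended task: parse ISO 8601 timestamps into timezone-aware datetime objects.

Produce 3-5 scoring dimensions, each worth 0-3 points, with:
- a one-line definition of the dimension
- the concrete evidence in the artifact the judge should look for

Return the rubric as a markdown table.
```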
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | Extremely verbose for what it does. The skill explains concepts Claude already understands (what context isolation is, what evidence-based means, what bias mitigation is). The scoring interpretation table, extensive guidelines list, and notes section add significant token overhead. Much of the content restates the same principles multiple times (e.g., 'context isolation' is explained in the task, context, Phase 1, and guidelines sections). | 1 / 3 |
| Actionability | Provides structured prompt templates and dispatch instructions with Task tool usage, which is somewhat concrete. However, the prompts contain placeholder variables without clear examples of what filled-in values look like, and the meta-judge/judge agent instructions reference external 'agent instructions' that aren't provided. The workflow is more of a framework than copy-paste executable guidance. | 2 / 3 |
| Workflow Clarity | The four-phase workflow is clearly sequenced (Context Extraction → Meta-Judge → Judge Agent → Process Results) with explicit validation steps in Phase 4 (check score ranges, verify justifications, confirm calculations, check contradictions). Phase 2 explicitly states to wait before proceeding. The error recovery path ('If validation fails') is documented. | 3 / 3 |
| Progressive Disclosure | The content is a monolithic wall of text with no references to external files despite being complex enough to warrant splitting (e.g., prompt templates, scoring rubric, and guidelines could be separate files). No bundle files are provided, and the skill doesn't reference any supporting documents. The internal structure uses headers but everything is inline. | 2 / 3 |
| Total | | 8 / 12 Passed |
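To illustrate the progressive-disclosure point in the table above, the main SKILL.md could keep only the workflow inline and point to supporting files. The section heading and file names below are hypothetical, not taken from the skill:

```markdown
## Scoring

Score each dimension 0-3 and total the results.
See [rubric.md](./rubric.md) for the score interpretation table,
and [prompt-templates.md](./prompt-templates.md) for the meta-judge
and judge prompt templates.
```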
Validation: 90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation for skill structure: 10 / 11 passed
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | | 10 / 11 Passed |
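A minimal sketch of the frontmatter fix the warning suggests, moving an unrecognized top-level key under `metadata`. The report does not name the offending key, so `custom-key` is a stand-in:

```markdown
---
name: judge
description: Launch a meta-judge then a judge sub-agent to evaluate results produced in the current conversation
metadata:
  custom-key: value  # previously a top-level key flagged by the validator, now nested under metadata
---
```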