judge

Launch a meta-judge then a judge sub-agent to evaluate results produced in the current conversation

Quality

28%

Does it follow best practices?

Security

High

Do not use without reviewing

Fix and improve this skill with Tessl

tessl review fix ./plugins/sadd/skills/judge/SKILL.md

Quality

Content

39%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The skill's strongest aspect is its well-structured four-phase workflow, with clear sequencing and validation checkpoints. However, it is excessively verbose: it explains concepts such as context isolation, bias mitigation, and evidence-based evaluation that Claude already understands, which consumes significant token budget. The monolithic structure and the reliance on undefined external references (agent instructions, CLAUDE_PLUGIN_ROOT) weaken both progressive disclosure and actionability.

Suggestions

Cut the <context> block, scoring interpretation table, notes section, and most of the 'Important Guidelines' list. These explain concepts Claude already knows and unnecessarily consume ~40% of the token budget.

Extract the scoring interpretation table and guidelines into a separate REFERENCE.md file, keeping SKILL.md focused on the four-phase workflow.

Replace placeholder-heavy prompt templates with a concrete, minimal example showing actual values for one realistic evaluation scenario (e.g., evaluating a Python script).

Define or link to the referenced 'agent instructions' and explain what CLAUDE_PLUGIN_ROOT resolves to, since these are critical dependencies that are currently undefined.

Dimension	Reasoning	Score
Conciseness	Extremely verbose at ~150+ lines. Extensively explains concepts Claude already understands (what context isolation is, what evidence-based means, bias types). The scoring interpretation table, extensive guidelines list, and notes section add significant token overhead. Much of the 'context' section and 'important guidelines' restate obvious evaluation principles.	1 / 3
Actionability	Provides structured prompt templates and dispatch instructions with placeholder variables, which is somewhat concrete. However, the prompts are templates with placeholders rather than fully executable examples, the meta-judge and judge agent instructions reference external 'agent instructions' that aren't provided, and the Task tool dispatch format is pseudocode-like rather than exact API calls.	2 / 3
Workflow Clarity	The four-phase workflow (Context Extraction → Meta-Judge → Judge → Process Results) is clearly sequenced with explicit dependencies ('Wait for the meta-judge to complete before proceeding'). Phase 4 includes validation checkpoints with specific checks (score range validation, contradiction detection) and a feedback loop for re-evaluation if validation fails.	3 / 3
Progressive Disclosure	All content is in a single monolithic file with no references to supporting files, despite the complexity warranting separation. The scoring interpretation table, guidelines, and notes could be in separate reference files. No bundle files are provided, and the skill references 'agent instructions' and CLAUDE_PLUGIN_ROOT without providing or linking to them.	1 / 3
	Total	7 / 12 Passed
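
The Phase 4 checks praised under Workflow Clarity (score range validation and contradiction detection) could look roughly like the sketch below. This is an illustration only, not the skill's actual implementation: the `JudgeResult` shape, the 1 to 5 scale, the `pass`/`fail` labels, and the pass threshold are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class JudgeResult:
    """Hypothetical shape of a judge report; the real skill's output may differ."""
    score: float   # overall score reported by the judge
    verdict: str   # "pass" or "fail" (assumed labels)


def validate(result: JudgeResult, min_score: float = 1.0, max_score: float = 5.0,
             pass_threshold: float = 4.0) -> list[str]:
    """Return a list of validation problems; an empty list means the result is usable."""
    problems = []
    # Score range validation: reject scores outside the assumed scale.
    if not (min_score <= result.score <= max_score):
        problems.append(f"score {result.score} outside [{min_score}, {max_score}]")
    # Contradiction detection: the verdict must agree with the score.
    expected = "pass" if result.score >= pass_threshold else "fail"
    if result.verdict != expected:
        problems.append(f"verdict {result.verdict!r} contradicts score {result.score}")
    return problems
```

A non-empty list would be the signal to re-run the judge, which corresponds to the feedback loop the review describes.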

Description

17%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description is too jargon-heavy and vague to serve as an effective skill selector. It lacks natural trigger terms users would use, provides no explicit 'when to use' guidance, and doesn't clearly describe what concrete actions are performed or what outputs are produced. The internal architecture terms ('meta-judge', 'sub-agent') are implementation details rather than user-facing capability descriptions.

Suggestions

Add a 'Use when...' clause with natural trigger terms like 'evaluate', 'review quality', 'score output', 'assess results', or 'grade response'.

Replace implementation jargon ('meta-judge', 'sub-agent') with user-facing descriptions of what the skill actually does, e.g., 'Evaluates the quality of outputs using structured rubrics and multi-step assessment'.

Specify what types of results are evaluated and what the evaluation output looks like (e.g., scores, feedback, pass/fail judgments).
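
As a rough illustration of the first two suggestions, the sketch below checks a description for a 'Use when' clause and for the trigger terms listed above. The `check_description` helper and the `revised` wording are hypothetical; they are not part of Tessl's rubric or of the skill itself.

```python
import re

# Trigger terms taken from the suggestions above.
TRIGGER_TERMS = ["evaluate", "review quality", "score output",
                 "assess results", "grade response"]


def check_description(description: str) -> dict:
    """Report whether a description has a 'Use when' clause and which trigger terms it contains."""
    text = description.lower()
    return {
        "has_use_when": re.search(r"\buse when\b", text) is not None,
        "trigger_terms": [t for t in TRIGGER_TERMS if t in text],
    }


# The description as currently published.
current = ("Launch a meta-judge then a judge sub-agent to evaluate results "
           "produced in the current conversation")

# One possible rewrite (hypothetical wording, for illustration only).
revised = ("Evaluates the quality of outputs using structured rubrics and "
           "multi-step assessment. Use when asked to evaluate, review quality, "
           "score output, assess results, or grade response quality.")

print(check_description(current))
print(check_description(revised))
```

A substring check like this is crude, but it is enough to show the gap: the current description has no 'Use when' clause and matches only one of the suggested terms.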

Dimension	Reasoning	Score
Specificity	Names a domain (evaluation/judging) and some actions ('launch a meta-judge', 'judge sub-agent', 'evaluate results'), but the actions are not concretely specified — what kind of results? What does evaluation entail? The terms 'meta-judge' and 'sub-agent' are somewhat specific but lack detail about concrete outputs or operations.	2 / 3
Completeness	Provides a weak 'what' (launch judges to evaluate results) but has no explicit 'when' clause or trigger guidance. The absence of a 'Use when...' clause caps this at 2 per the rubric, and the 'what' is also vague enough to warrant a 1.	1 / 3
Trigger Term Quality	Uses technical jargon like 'meta-judge', 'sub-agent', and 'evaluate results' which are not terms a user would naturally say. Missing natural trigger terms like 'review', 'score', 'assess quality', 'grade', or 'check output'.	1 / 3
Distinctiveness / Conflict Risk	The mention of 'meta-judge' and 'judge sub-agent' provides some distinctiveness from generic evaluation skills, but 'evaluate results produced in the current conversation' is broad enough to potentially overlap with any quality-checking or review skill.	2 / 3
	Total	6 / 12 Passed

Validation

90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation — 10 / 11 Passed

Validation for skill structure

Criteria	Description	Result
frontmatter_unknown_keys	Unknown frontmatter key(s) found; consider removing or moving to metadata	Warning

	Total	10 / 11 Passed
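
A check like frontmatter_unknown_keys can be approximated locally with a few lines of Python. The sketch below is an assumption-laden illustration: `KNOWN_KEYS` is a guessed allow-list rather than the validator's real one, and the `argument-hint` key in the example is hypothetical, since the review does not say which key triggered the warning.

```python
# Assumed set of recognized keys; the actual skill spec may allow others.
KNOWN_KEYS = {"name", "description", "license", "metadata"}


def unknown_frontmatter_keys(skill_md: str) -> list[str]:
    """Return top-level frontmatter keys not in KNOWN_KEYS (simple line-based parse)."""
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no frontmatter block at the top of the file
    unknown = []
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter
        # Only consider top-level "key: value" lines (no indentation, not a comment).
        if line and not line[0].isspace() and not line.startswith("#") and ":" in line:
            key = line.split(":", 1)[0].strip()
            if key not in KNOWN_KEYS:
                unknown.append(key)
    return unknown


# Hypothetical SKILL.md content; 'argument-hint' is an invented example key.
example = """---
name: judge
description: Launch a meta-judge then a judge sub-agent
argument-hint: target to evaluate
---
# Body
"""

print(unknown_frontmatter_keys(example))
```

Moving any flagged key under `metadata`, as the warning suggests, would make it pass this kind of check.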

Repository: NeoLabHQ/context-engineering-kit
Path: plugins/sadd/skills/judge/SKILL.md
Commit: 3711edf

Reviewed: about 14 hours ago
