judge-with-debate

Evaluate solutions through multi-round debate between independent judges until consensus

Quality

47%

Does it follow best practices?

Security

High

Do not use without reviewing

Fix and improve this skill with Tessl

tessl review fix ./plugins/sadd/skills/judge-with-debate/SKILL.md

Quality

Content

62%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill provides an extremely thorough and actionable guide for multi-agent debate-based evaluation, with excellent workflow clarity including decision points, consensus criteria, and feedback loops. However, it is severely over-verbose — repeating the same information across the process description, orchestration instructions, best practices, and example sections. The content would benefit greatly from aggressive deduplication and splitting into supporting files.

Suggestions

Reduce content by at least 50% by eliminating redundancy: the orchestration steps (Steps 1-7) largely duplicate the Phase descriptions above them — merge into a single authoritative sequence.

Move prompt templates into separate referenced files (e.g., META_JUDGE_PROMPT.md, JUDGE_PROMPT.md, DEBATE_PROMPT.md) to reduce inline bulk and improve progressive disclosure.

Remove explanatory text Claude already knows (e.g., 'Independence in initial analysis prevents groupthink', 'Garbage in, garbage out', explanations of what debate achieves) — these waste tokens without adding actionable guidance.

Consolidate the 'Best Practices', 'Common Pitfalls', and 'Do This' sections into a single compact checklist — currently they repeat constraints already stated in the process sections.

Dimension	Reasoning	Score
Conciseness	Extremely verbose at 350+ lines, with heavy repetition: the orchestration steps essentially restate the phases, the example walkthrough restates the process again, and the best practices restate constraints already mentioned. Explains concepts Claude already knows (what debate is, what consensus means, why independence prevents groupthink). The meta-judge, judge, and debate prompt templates could be much more compact.	1 / 3
Actionability	Highly actionable with concrete prompt templates, specific Task tool dispatch instructions, exact file naming conventions, precise consensus criteria (0.5 points overall, 1 point per criterion), and a detailed worked example showing the full flow. Claude would know exactly what to do at each step.	3 / 3
Workflow Clarity	Excellent multi-step workflow with clear sequencing (Phase 0 → 0.5 → 1 → 2 → 3), explicit decision points (consensus check with specific numeric thresholds), feedback loops (debate rounds with re-check), and clear termination conditions (max 3 rounds or consensus). The ASCII diagram and numbered orchestration steps make the flow unambiguous.	3 / 3
Progressive Disclosure	No bundle files are provided, so everything is inline in one massive file. The content would benefit from splitting prompt templates, the example walkthrough, and best practices into separate referenced files. The structure within the file is reasonable with clear sections, but the sheer volume of inline content makes it a borderline monolithic wall of text.	2 / 3
	Total	9 / 12 Passed
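
The consensus thresholds and round limit cited in the table above (0.5 points overall, 1 point per criterion, at most 3 debate rounds) can be illustrated with a minimal sketch. It is not taken from the skill itself: the function names, the score layout (one dict per judge), the reading of each threshold as a maximum spread between judges, and the use of a plain mean as the overall score are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Scores = List[Dict[str, float]]  # one dict per judge: criterion -> score (assumed layout)

OVERALL_TOLERANCE = 0.5    # threshold quoted in the review
CRITERION_TOLERANCE = 1.0  # threshold quoted in the review
MAX_ROUNDS = 3             # round limit quoted in the review


def has_consensus(judge_scores: Scores) -> bool:
    """Return True when judges agree within both tolerances.

    Assumption: a judge's overall score is the unweighted mean of its criterion scores.
    """
    overall = [sum(s.values()) / len(s) for s in judge_scores]
    if max(overall) - min(overall) > OVERALL_TOLERANCE:
        return False
    for criterion in judge_scores[0]:
        values = [s[criterion] for s in judge_scores]
        if max(values) - min(values) > CRITERION_TOLERANCE:
            return False
    return True


def run_debate(
    initial: Scores,
    debate_round: Callable[[Scores], Scores],  # hypothetical hook: one round of judge revisions
    max_rounds: int = MAX_ROUNDS,
) -> Tuple[Scores, int, bool]:
    """Repeat debate rounds until consensus or the round limit is reached."""
    scores, rounds = initial, 0
    while not has_consensus(scores) and rounds < max_rounds:
        scores = debate_round(scores)
        rounds += 1
    return scores, rounds, has_consensus(scores)
```

Returning the consensus flag alongside the scores lets the caller tell a genuine agreement apart from a run that merely hit the round limit.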

Description

32%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description conveys a unique evaluation methodology (multi-round debate with judges reaching consensus) but is too terse and lacks explicit trigger guidance. It doesn't specify what types of solutions it evaluates, what the output looks like, or when Claude should choose this skill over other evaluation approaches.

Suggestions

Add a 'Use when...' clause with natural trigger terms, e.g., 'Use when the user wants to rigorously compare multiple solution options, needs adversarial evaluation, or asks for a debate-style assessment of alternatives.'

List specific concrete actions such as 'Assigns independent judge perspectives, runs structured debate rounds, synthesizes scoring rationale, and produces a final consensus recommendation.'

Clarify the domain or scope—what kinds of 'solutions' does this evaluate (code solutions, design proposals, strategic options)? This would reduce conflict risk with other evaluation skills.

Dimension	Reasoning	Score
Specificity	It names a specific approach ('multi-round debate between independent judges until consensus') and a general action ('evaluate solutions'), but doesn't list multiple concrete actions or detail what kinds of solutions, what the output looks like, or what steps are involved.	2 / 3
Completeness	The description answers 'what' at a high level (evaluate solutions through debate) but completely lacks a 'Use when...' clause or any explicit trigger guidance for when Claude should select this skill. Per rubric guidelines, missing 'Use when' caps completeness at 2, and the 'what' is also weak, so this scores a 1.	1 / 3
Trigger Term Quality	Terms like 'evaluate', 'debate', 'judges', and 'consensus' are somewhat relevant but not strongly natural user terms. Users are more likely to say things like 'compare options', 'review approaches', or 'which solution is best' rather than 'multi-round debate between judges'.	2 / 3
Distinctiveness Conflict Risk	The 'multi-round debate between independent judges' mechanism is somewhat distinctive, but 'evaluate solutions' is generic enough to overlap with any evaluation, comparison, or decision-making skill. The lack of domain specificity increases conflict risk.	2 / 3
	Total	7 / 12 Passed

Validation

90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation — 10 / 11 Passed

Validation for skill structure

Criteria	Description	Result
frontmatter_unknown_keys	Unknown frontmatter key(s) found; consider removing or moving to metadata	Warning

	Total	10 / 11 Passed
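
As a rough illustration of what a check like frontmatter_unknown_keys might do, the sketch below scans the top-level keys of a SKILL.md frontmatter block and reports any that fall outside an allow-list. The allow-list, the function name, and the sample key `argument-hint` are assumptions for illustration; the review does not say which keys the real validator accepts or which key triggered the warning.

```python
from typing import List, Set

# Assumed allow-list; the validator's actual accepted keys are not stated in this review.
ALLOWED_KEYS: Set[str] = {
    "name", "description", "license", "metadata", "allowed-tools", "compatibility",
}


def unknown_frontmatter_keys(skill_md: str, allowed: Set[str] = ALLOWED_KEYS) -> List[str]:
    """Return top-level frontmatter keys that are not in the allow-list."""
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no frontmatter block at the top of the file
    unknown: List[str] = []
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of the frontmatter block
        # Only consider non-indented "key: value" lines, i.e. top-level keys.
        if line and not line[0].isspace() and not line.startswith("#") and ":" in line:
            key = line.split(":", 1)[0].strip()
            if key not in allowed:
                unknown.append(key)
    return unknown


# Hypothetical frontmatter used only to exercise the function.
sample = (
    "---\n"
    "name: judge-with-debate\n"
    "description: Evaluate solutions through multi-round debate\n"
    "argument-hint: path to solution\n"
    "metadata:\n"
    "  author: example\n"
    "---\n"
    "# Body\n"
)
print(unknown_frontmatter_keys(sample))
```

Nested keys under `metadata` are skipped on purpose, which matches the warning's advice to move unrecognised keys there.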

Repository: NeoLabHQ/context-engineering-kit
Path: plugins/sadd/skills/judge-with-debate/SKILL.md
Commit: 3711edf

Reviewed: 1 day ago
