Evaluate solutions through multi-round debate between independent judges until consensus
Score: 48

Quality: 36%. Does it follow best practices?
Impact: Pending. No eval scenarios have been run.
Status: Risky. Do not use without reviewing.

Optimize this skill with Tessl:

npx tessl skill review --optimize ./plugins/sadd/skills/judge-with-debate/SKILL.md

Quality
Discovery
17%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
The description captures a unique evaluation methodology (multi-round debate with judges reaching consensus) but fails to provide natural trigger terms users would actually use, lacks a 'Use when...' clause, and doesn't specify what kinds of solutions or outputs are involved. It reads more like an internal process description than a skill selection guide.
Suggestions
Add a 'Use when...' clause with natural trigger terms like 'evaluate my approach', 'compare solutions', 'critique this design', 'get multiple perspectives on', or 'review pros and cons' (see the example sketch after these suggestions).
Specify what types of solutions or artifacts this evaluates (e.g., code architectures, design proposals, technical approaches) and what output it produces (e.g., a consensus recommendation with reasoning).
Include user-facing language variations such as 'weigh tradeoffs', 'devil's advocate', 'stress-test an idea', or 'adversarial review' to improve trigger term coverage.
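Taken together, the revised frontmatter might read something like the sketch below. This is a hypothetical rewrite, not the skill's actual frontmatter; the skill name is taken from the path above, and the wording is illustrative only.

```yaml
# Hypothetical frontmatter sketch; wording is illustrative, not the skill's own
name: judge-with-debate
description: >
  Evaluate technical solutions (code architectures, design proposals,
  implementation approaches) through multi-round debate between independent
  judges until they reach consensus, producing a consensus recommendation
  with reasoning. Use when the user asks to evaluate an approach, compare
  solutions, critique a design, weigh tradeoffs, stress-test an idea, or
  wants an adversarial review with multiple perspectives.
```

A description in this shape names the artifact types, the output, and the phrases users actually say, addressing all three suggestions at once.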
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Names the domain (solution evaluation) and a key mechanism (multi-round debate between independent judges until consensus), but doesn't list multiple concrete actions or specify what kinds of solutions, what the judges do, or what outputs are produced. | 2 / 3 |
| Completeness | It partially addresses 'what' (evaluate solutions through debate) but completely lacks a 'when' clause or any explicit trigger guidance. Per the rubric, a missing 'Use when...' clause caps completeness at 2, and the 'what' is also underspecified, so this scores a 1. | 1 / 3 |
| Trigger Term Quality | The terms 'multi-round debate', 'independent judges', and 'consensus' are technical/methodological jargon that users would rarely naturally say. Users are more likely to say things like 'evaluate my idea', 'compare options', 'critique this solution', or 'get multiple perspectives'. | 1 / 3 |
| Distinctiveness / Conflict Risk | The 'multi-round debate between independent judges' mechanism is somewhat distinctive, but 'evaluate solutions' is broad enough to overlap with general code review, analysis, or comparison skills. The lack of specificity about what types of solutions increases conflict risk. | 2 / 3 |
| Total | | 6 / 12 (Passed) |
Implementation
55%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
The skill provides a thorough, actionable multi-agent debate evaluation framework with excellent workflow clarity and concrete prompt templates. However, it is extremely verbose with significant repetition—the same concepts are restated across the process description, orchestration instructions, best practices, and example sections. The monolithic structure with no progressive disclosure makes it a poor fit for token-efficient skill design.
Suggestions
Reduce content by at least 50%: eliminate the redundant 'Orchestration Instructions' section (Steps 1-7) which restates the Phases, merge 'Best Practices' into inline notes within the relevant phases, and remove the 'Context' section explaining why debate is useful.
Split into multiple files: move prompt templates to a PROMPTS.md file, the worked example to EXAMPLE.md, and keep SKILL.md as a concise overview with the ASCII diagram, phase summaries, and file references (see the layout sketch after these suggestions).
Remove explanatory content Claude already knows: the entire <context> section explaining benefits of multi-agent debate, explanations like 'Independence in initial analysis prevents groupthink', and 'Key principle' callouts are unnecessary padding.
Consolidate the initial judge and debate judge prompt templates into a single parameterized template with notes on what changes between phases, rather than repeating nearly identical templates.
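The suggested split might look like the following. The file names come from the suggestions above; the tree layout itself is only a sketch:

```
judge-with-debate/
├── SKILL.md     # concise overview: ASCII diagram, phase summaries, references to the files below
├── PROMPTS.md   # one parameterized judge prompt template, with notes on what changes per phase
└── EXAMPLE.md   # the worked example, moved out of the main file
```

This keeps the always-loaded SKILL.md short, while the agent pulls in the templates and the worked example only when a phase actually needs them.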
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | Extremely verbose at ~300+ lines. Massive amounts of repetition (the orchestration steps essentially restate the phases, the best practices restate what was already said, the example walks through the entire process again). Claude doesn't need explanations of what debate is, what consensus means, or why multiple perspectives reduce bias. The prompt templates are repeated with minor variations for each phase. | 1 / 3 |
| Actionability | Despite verbosity, the skill provides fully concrete, executable guidance: specific prompt templates with placeholders, exact file naming conventions, precise consensus thresholds (0.5 points overall, 1 point per criterion; sketched after this table), Task tool dispatch patterns, and a detailed worked example showing the full flow with realistic scores and debate arguments. | 3 / 3 |
| Workflow Clarity | The multi-step process is clearly sequenced with explicit phases (0 → 0.5 → 1 → 2 → 3), decision points (consensus check with specific numeric thresholds), feedback loops (debate rounds with re-validation), and a clear ASCII diagram. The orchestration instructions provide an unambiguous step-by-step algorithm with branching logic. | 3 / 3 |
| Progressive Disclosure | Monolithic wall of text with no references to external files. Everything is inline in a single massive document. The prompt templates, best practices, example usage, and orchestration logic could all be split into separate referenced files. No bundle files are provided, and the content doesn't attempt any structural separation. | 1 / 3 |
| Total | | 8 / 12 (Passed) |
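For concreteness, the consensus thresholds the table cites (0.5 points overall, 1 point per criterion) reduce to a simple numeric check. The sketch below is a hypothetical reading of that rule for two judges, not code from the skill itself; the function and parameter names are invented here.

```python
def reached_consensus(
    overall_a: float, overall_b: float,
    criteria_a: dict[str, float], criteria_b: dict[str, float],
) -> bool:
    """Consensus per the cited thresholds: overall scores within 0.5 points
    and every individual criterion within 1 point."""
    if abs(overall_a - overall_b) > 0.5:
        return False
    return all(abs(criteria_a[k] - criteria_b[k]) <= 1.0 for k in criteria_a)

# Example: overall 7.2 vs 7.6 (gap 0.4) with all criterion gaps <= 1 -> consensus;
# overall 6.8 vs 7.5 (gap 0.7) -> another debate round.
```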
Validation
90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | | 10 / 11 Passed |