do-and-judge

Execute a task with sub-agent implementation and LLM-as-a-judge verification with automatic retry loop

Quality

36%

Does it follow best practices?

Security

Critical

Do not install without reviewing

Fix and improve this skill with Tessl

tessl review fix ./plugins/sadd/skills/do-and-judge/SKILL.md

Quality

Content

55%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The skill is highly actionable and its workflow is clear: it provides concrete templates, decision logic, and thorough examples for a complex orchestration pattern. However, extreme verbosity severely undermines it. The content is roughly 5-10x longer than necessary, and the examples repeat prompt templates already defined in the process section. The monolithic structure, with no progressive disclosure, uses the context window inefficiently.

Suggestions

Extract the 5 detailed examples into a separate EXAMPLES.md file and reference it from the main skill, reducing the main document by ~60%

Deduplicate repeated instructions: the parallel dispatch requirement, meta-judge specification passthrough, and CLAUDE_PLUGIN_ROOT inclusion are each stated 3-5 times; state each once in the process section

Move the full prompt templates (CoT prefix, self-critique suffix, judge prompt, retry prompt) into a TEMPLATES.md file, keeping only brief summaries with references in the main skill

Remove explanatory content Claude already knows (e.g., what complexity/risk/scope assessments mean, basic if/else decision logic descriptions) and replace with concise decision tables
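
One way to find candidates for the deduplication suggestion above is to count lines that recur after light normalization. The following is a minimal sketch, not part of the review tooling; the function name `repeated_lines` and the `min_words` and `min_count` defaults are assumptions chosen for illustration.

```python
import re
from collections import Counter
from typing import Dict


def repeated_lines(text: str, min_words: int = 6, min_count: int = 3) -> Dict[str, int]:
    """Count normalized lines that recur at least min_count times.

    Hypothetical helper: thresholds are illustrative, not taken from the review.
    """
    counts: Counter = Counter()
    for line in text.splitlines():
        # Normalize: drop leading list/heading markers, collapse whitespace, lowercase.
        norm = re.sub(r"^[\s#>*\-\d.]+", "", line)
        norm = re.sub(r"\s+", " ", norm).strip().lower()
        # Skip short lines, which repeat naturally (headings, labels).
        if len(norm.split()) >= min_words:
            counts[norm] += 1
    return {line: n for line, n in counts.items() if n >= min_count}
```

This catches only near-verbatim repeats; paraphrased restatements of the same instruction would still need a manual pass.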

Dimension	Reasoning	Score
Conciseness	Extremely verbose at ~700+ lines. Massive amounts of repetition: the examples repeat the full prompt templates already shown in the Process section, the same instructions appear multiple times (e.g., 'dispatch meta-judge and implementation in parallel' is stated 5+ times), and lengthy example prompts are shown in full multiple times. The skill explains concepts Claude already knows (what complexity/risk/scope assessments are, basic decision logic). The examples alone consume hundreds of lines repeating nearly identical patterns.	1 / 3
Actionability	The skill provides highly concrete, specific guidance: exact prompt templates, dispatch configurations, model selection tables, decision logic with thresholds, structured output formats, and detailed examples showing the full execution flow. Every phase has copy-paste ready templates and clear tool usage patterns.	3 / 3
Workflow Clarity	The multi-step process is clearly sequenced across 6 phases with explicit validation checkpoints (judge verification), feedback loops (retry with specific issues), error recovery (escalation after max retries), and clear decision logic (score thresholds, retry counts). The workflow handles edge cases like pre-existing changes and retry semantics.	3 / 3
Progressive Disclosure	The entire skill is a monolithic wall of text with no references to external files despite being extremely long. The 5 detailed examples (each 50-100+ lines) could easily be in a separate EXAMPLES.md. The prompt templates could be in a TEMPLATES.md. The model selection guide, best practices, and error handling sections all inline content that would benefit from separation. No bundle files are provided to support this massive document.	1 / 3
	Total	8 / 12 Passed
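
For readers unfamiliar with the pattern, the control flow that the Workflow Clarity row describes (implement, judge, retry with the judge's specific issues, escalate after the retry limit) can be sketched as below. This is an illustrative sketch only: `implement` and `judge` are hypothetical callables standing in for the sub-agent and the LLM judge, and the threshold and retry limit are assumed values, since the review does not state the skill's actual numbers.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    score: float        # judge's score for the attempt
    issues: List[str]   # specific problems to feed back into a retry


@dataclass
class Outcome:
    result: Optional[str]
    attempts: int
    escalated: bool     # True when retries ran out without a passing score


def do_and_judge(
    task: str,
    implement: Callable[[str, List[str]], str],
    judge: Callable[[str, str], Verdict],
    pass_threshold: float = 4.0,  # assumed value, not taken from the skill
    max_retries: int = 2,         # assumed value, not taken from the skill
) -> Outcome:
    feedback: List[str] = []
    for attempt in range(1, max_retries + 2):  # first try plus retries
        result = implement(task, feedback)
        verdict = judge(task, result)
        if verdict.score >= pass_threshold:
            return Outcome(result, attempt, escalated=False)
        feedback = verdict.issues  # the next attempt sees the judge's issues
    # Retry budget exhausted: hand the decision back to the caller.
    return Outcome(None, max_retries + 1, escalated=True)
```

The sketch omits the parallel meta-judge dispatch that the review mentions and shows only the sequential verify-and-retry loop.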

Description

17%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description relies heavily on technical jargon to describe an architectural pattern (sub-agent + LLM-as-judge + retry) without explaining what concrete tasks it handles or when it should be selected. It lacks a 'Use when...' clause and natural trigger terms, making it difficult for Claude to know when to choose this skill over others. The core action 'Execute a task' is too vague to be useful for skill selection.

Suggestions

Add a 'Use when...' clause specifying concrete scenarios, e.g., 'Use when the user needs high-quality output that requires iterative refinement, quality checking, or when tasks benefit from automated verification and retry.'

Replace 'Execute a task' with specific actions this skill performs, e.g., 'Delegates complex tasks to a sub-agent, evaluates output quality using LLM-based judging criteria, and automatically retries until quality thresholds are met.'

Include natural trigger terms users might say, such as 'verify quality', 'retry until correct', 'check and redo', 'quality assurance', or 'iterative refinement'.

Dimension	Reasoning	Score
Specificity	It names a domain (sub-agent implementation, LLM-as-a-judge verification, retry loop) but the core action 'Execute a task' is extremely vague. It doesn't specify what kind of tasks, what the sub-agent does concretely, or what verification entails beyond naming the pattern.	2 / 3
Completeness	There is no 'Use when...' clause or any explicit trigger guidance. The 'what' is weakly stated ('execute a task' is nearly meaningless), and the 'when' is entirely missing. Per rubric guidelines, missing 'Use when...' caps completeness at 2, but the 'what' is also weak enough to warrant a 1.	1 / 3
Trigger Term Quality	The terms used ('sub-agent implementation', 'LLM-as-a-judge verification', 'automatic retry loop') are technical jargon that users would rarely naturally say. A user needing this skill would more likely say things like 'verify output quality', 'retry until correct', or 'delegate and check work'.	1 / 3
Distinctiveness Conflict Risk	The mention of 'sub-agent' and 'LLM-as-a-judge' provides some distinctiveness as an architectural pattern, but 'execute a task' is so generic it could overlap with virtually any task-execution skill. The specific methodology terms help somewhat but aren't enough for a clear niche.	2 / 3
	Total	6 / 12 Passed

Validation

81%

Only warnings and errors are listed.

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 9 / 11 Passed

Validation for skill structure

Criteria	Description	Result
skill_md_line_count	SKILL.md is long (1124 lines); consider splitting into references/ and linking	Warning
frontmatter_unknown_keys	Unknown frontmatter key(s) found; consider removing or moving to metadata	Warning
	Total	9 / 11 Passed
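
The two warnings in the table can be approximated locally before submitting a skill for review. The sketch below is a rough stand-in, not the validator's implementation: `MAX_LINES` and `KNOWN_KEYS` are assumed values (the review states neither the line limit nor the accepted key list), and the frontmatter scan is a simple line-based check rather than a full YAML parse.

```python
from typing import List

# Assumed values for illustration; the validator's real limit and key list
# are not stated in this review.
MAX_LINES = 500
KNOWN_KEYS = {"name", "description", "license", "metadata"}


def frontmatter_keys(text: str) -> List[str]:
    """Return top-level keys of a leading '---' delimited frontmatter block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []
    keys = []
    for line in lines[1:]:
        if line.strip() == "---":
            break
        # Top-level keys start in column 0; skip comments and list items.
        if line and not line[0].isspace() and line[0] not in "#-" and ":" in line:
            keys.append(line.split(":", 1)[0].strip())
    return keys


def validate(text: str) -> List[str]:
    """Return warning strings modelled on the two criteria in the table."""
    warnings = []
    count = len(text.splitlines())
    if count > MAX_LINES:
        warnings.append(f"skill_md_line_count: SKILL.md is long ({count} lines)")
    unknown = sorted(set(frontmatter_keys(text)) - KNOWN_KEYS)
    if unknown:
        warnings.append(f"frontmatter_unknown_keys: {', '.join(unknown)}")
    return warnings
```

Running `validate` on the contents of a SKILL.md returns an empty list when neither condition is triggered.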

Repository: NeoLabHQ/context-engineering-kit
Path: plugins/sadd/skills/do-and-judge/SKILL.md
Commit: 3711edf

Reviewed: 1 day ago
