
do-and-judge

Execute a task with sub-agent implementation and LLM-as-a-judge verification with automatic retry loop


Quality: 36% (Does it follow best practices?)

Impact: Pending (No eval scenarios have been run)

Security (by Snyk): Critical. Do not install without reviewing.

Optimize this skill with Tessl

npx tessl skill review --optimize ./plugins/sadd/skills/do-and-judge/SKILL.md

Quality

Discovery: 17%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description relies heavily on technical jargon without explaining what concrete problem it solves or when it should be selected. It lacks a 'Use when...' clause and the core action 'Execute a task' is too vague to help Claude distinguish this skill from others. The architectural pattern terms (sub-agent, LLM-as-a-judge) are specific but not user-facing.

Suggestions

Add a 'Use when...' clause specifying concrete trigger scenarios, e.g., 'Use when the user needs a task completed with quality verification, automated retries on failure, or delegated sub-task execution.'

Replace 'Execute a task' with specific examples of what tasks this handles, e.g., 'Delegates complex multi-step tasks to sub-agents, validates outputs using LLM-based quality checks, and automatically retries failed steps.'

Include natural trigger terms users might say, such as 'verify output quality', 'retry on failure', 'delegate subtasks', 'quality check', or 'automated validation'.
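Taken together, the suggestions point toward a description like the following. This is an illustrative sketch only; the exact wording and the `name` field are assumptions, not the skill's actual frontmatter:

```markdown
---
name: do-and-judge
description: >
  Delegates a task to a sub-agent, scores the output with an LLM-as-a-judge
  check, and automatically retries with the judge's specific issues as
  feedback. Use when the user needs a task completed with quality
  verification, automated retries on failure, output validation, or
  delegated subtask execution.
---
```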

Dimension scores

Specificity: 2 / 3

Names a domain (sub-agent implementation, LLM-as-a-judge verification, retry loop), but the core action 'Execute a task' is extremely vague: it doesn't specify what kind of task, what the sub-agent does, or what is being verified.

Completeness: 1 / 3

Only partially addresses 'what' (execute a task with sub-agents and verification) and completely lacks a 'when' clause or any explicit trigger guidance. Per rubric guidelines, missing 'Use when...' caps completeness at 2, and the 'what' is also weak, warranting a 1.

Trigger Term Quality: 1 / 3

Uses technical jargon like 'sub-agent implementation', 'LLM-as-a-judge', and 'automatic retry loop' that users are unlikely to naturally say. Missing common user-facing terms that would trigger selection of this skill.

Distinctiveness / Conflict Risk: 2 / 3

The combination of 'sub-agent', 'LLM-as-a-judge', and 'retry loop' is somewhat distinctive as a pattern, but 'Execute a task' is so generic it could overlap with many other skills. The technical terms provide some differentiation, but the scope is unclear.

Total: 6 / 12 (Passed)

Implementation: 55%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The skill provides excellent actionable guidance with a well-structured multi-phase workflow, clear validation checkpoints, and concrete prompt templates. However, it is severely undermined by extreme verbosity — the 5 exhaustive examples repeat full prompt templates already defined in the process section, and the entire document could be reduced to roughly 1/3 its size without losing any information. The lack of progressive disclosure (no bundle files, everything inline) compounds the token efficiency problem.

Suggestions

Extract the 5 detailed examples into a separate EXAMPLES.md file and reference it from the main skill — the examples alone are longer than the core instructions need to be.

Remove redundant explanations: the parallel dispatch pattern is described in the Process section, then re-explained in the dispatch example subsection, then shown again in every example. Define it once and reference it.

Extract prompt templates into a TEMPLATES.md file so the main skill focuses on the workflow logic and decision-making, with templates available as reference.

Cut explanatory prose that restates what the structured content already shows (e.g., the model selection table is self-explanatory and doesn't need the preceding paragraph explaining what each model is for).
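A restructured bundle reflecting these suggestions might look like the layout below. The file names other than SKILL.md are the review's proposals, not files that exist in the repository:

```
plugins/sadd/skills/do-and-judge/
├── SKILL.md       # workflow logic and decision-making only
├── EXAMPLES.md    # the 5 detailed worked examples
└── TEMPLATES.md   # meta-judge, implementation, and judge prompt templates
```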

Dimension scores

Conciseness: 1 / 3

Extremely verbose at ~600+ lines. Massive amounts of repetition: the examples repeat the full prompt templates already shown in the Process section, the parallel dispatch pattern is explained 3+ times, and every example includes full prompt text that adds no new information. The skill explains concepts Claude already knows (what complexity/risk/scope assessments are, basic decision logic). The examples alone are longer than the entire instructional content needs to be.

Actionability: 3 / 3

Despite the verbosity, the skill provides highly concrete, actionable guidance: exact prompt templates for meta-judge, implementation agent, and judge; specific model selection criteria with a decision table; structured output formats; exact dispatch patterns with tool call specifications; and detailed decision logic with score thresholds. Everything is copy-paste ready.

Workflow Clarity: 3 / 3

The 6-phase workflow is clearly sequenced with explicit validation checkpoints (judge verification), feedback loops (retry with specific issues), error recovery (escalation after max retries), and clear decision logic (score thresholds, retry counts). The parallel dispatch ordering is explicitly specified. Phase transitions are well-defined with clear entry/exit criteria.

Progressive Disclosure: 1 / 3

Everything is crammed into a single monolithic file with no bundle files or external references. The 5 detailed examples (which constitute roughly half the document) should be in a separate EXAMPLES.md. The prompt templates could be in a TEMPLATES.md. The model selection guide and best practices could be separate files. The document is a wall of text that would consume enormous context window space.

Total: 8 / 12 (Passed)
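The decision logic the review credits (score thresholds, retry counts, escalation after max retries) can be sketched as a small loop. This is a minimal illustration, not the skill's actual implementation: `PASS_THRESHOLD`, `MAX_RETRIES`, and the `implement`/`judge` callables are assumed stand-ins for the skill's sub-agent dispatch and LLM judge.

```python
from dataclasses import dataclass

PASS_THRESHOLD = 0.8  # assumed acceptance score, not the skill's real value
MAX_RETRIES = 3       # assumed retry budget before escalation


@dataclass
class Verdict:
    """What the judge returns: a score plus specific issues to fix."""
    score: float
    issues: list


def do_and_judge(task, implement, judge):
    """Run implement -> judge, feeding the judge's issues back into each retry."""
    feedback = []
    output = None
    for attempt in range(1, MAX_RETRIES + 1):
        output = implement(task, feedback)      # sub-agent produces a candidate
        verdict = judge(task, output)           # LLM-as-a-judge scores it
        if verdict.score >= PASS_THRESHOLD:
            return {"status": "accepted", "attempts": attempt, "output": output}
        feedback = verdict.issues               # retry with specific issues
    # retry budget exhausted: escalate instead of looping forever
    return {"status": "escalated", "attempts": MAX_RETRIES, "output": output}
```

The key design point the review highlights is that retries are not blind: each retry carries the judge's concrete issues as feedback, and a bounded retry count forces escalation rather than an infinite loop.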

Validation: 81%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 9 / 11 Passed

Validation for skill structure

Criteria results

skill_md_line_count: Warning. SKILL.md is long (1124 lines); consider splitting into references/ and linking.

frontmatter_unknown_keys: Warning. Unknown frontmatter key(s) found; consider removing or moving to metadata.

Total: 9 / 11 (Passed)

Repository: NeoLabHQ/context-engineering-kit (Reviewed)

