Implement a task with automated LLM-as-Judge verification for critical steps
- Overall score: 31% (does it follow best practices?)
- Impact: Pending (no eval scenarios have been run)
- Passed (no known issues)
Optimize this skill with Tessl:
`npx tessl skill review --optimize ./plugins/sdd/skills/implement-task/SKILL.md`

Quality
Discovery (22%)

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This description is too vague to effectively guide skill selection. It fails to specify what kinds of tasks it implements, what the verification process entails concretely, and critically lacks any 'Use when...' clause. The only distinguishing element is the 'LLM-as-Judge' terminology, which provides some signal but is insufficient for reliable skill matching.
Suggestions

- Add an explicit 'Use when...' clause describing trigger scenarios, e.g., 'Use when the user wants to verify task outputs using LLM evaluation, automated quality checks, or judge-based validation of generated content.'
- Replace 'implement a task' with specific, concrete actions, e.g., 'Executes multi-step workflows with automated LLM-based quality verification at each critical checkpoint, including generating evaluation criteria, running judge prompts, and retrying failed steps.'
- Include natural trigger terms users might say, such as 'quality check', 'automated review', 'verify output', 'judge evaluation', 'validation loop'.
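Applied to this skill, the suggestions above might yield frontmatter like the following; the exact wording is illustrative, not the skill's actual description:

```yaml
description: >
  Executes multi-step implementation tasks with automated LLM-as-Judge
  verification at critical checkpoints: generates evaluation criteria,
  runs judge prompts, and retries failed steps. Use when the user asks
  for a quality check, automated review, judge evaluation, or a
  validation loop over generated output.
```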
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description mentions 'implement a task', which is extremely vague: it doesn't specify what kind of task, what domain, or what concrete actions are performed. 'Automated LLM-as-Judge verification' is somewhat specific but still abstract without explaining what it actually does. | 1 / 3 |
| Completeness | The 'what' is vaguely stated ('implement a task with verification') and there is no 'when' clause at all. There are no explicit triggers or guidance on when Claude should select this skill, which per the rubric should cap completeness at 2, but the weak 'what' brings it down to 1. | 1 / 3 |
| Trigger Term Quality | 'LLM-as-Judge' is a recognizable term for users familiar with the concept, and 'verification' and 'critical steps' are somewhat relevant keywords. However, these are more technical jargon than natural user language, and common variations or related terms (e.g., 'quality check', 'automated review', 'validation') are missing. | 2 / 3 |
| Distinctiveness / Conflict Risk | 'LLM-as-Judge verification' provides some distinctiveness as it's a specific methodology, but 'implement a task' is so generic it could overlap with virtually any implementation-focused skill. The niche is partially defined but not clearly bounded. | 2 / 3 |
| Total | | 6 / 12 (Passed) |
Implementation (39%)

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill is a comprehensive orchestration guide with excellent workflow clarity—clear phases, validation checkpoints, feedback loops, and error handling. However, it is severely undermined by extreme verbosity (easily 3-4x longer than necessary), massive repetition across patterns and examples, and a complete lack of progressive disclosure. The monolithic structure means Claude's context window would be heavily consumed by redundant content, directly contradicting the skill's own advice about protecting context windows.
Suggestions

- Split into multiple files: move argument definitions to ARGS.md, refine mode details to REFINE.md, usage examples to EXAMPLES.md, verification appendix to VERIFICATION.md, and voting algorithm to VOTING.md; reference each with one-line links from the main skill.
- Eliminate repetition: the three execution patterns (A, B, C) share ~80% identical structure; define a single base pattern and only document the differences for each variant.
- Remove explanations of basic concepts Claude already knows (e.g., how to calculate a median, what high variance means, why context windows matter) and trim the anti-rationalization section to a brief bullet list.
- Reduce usage examples from 7 to 2-3 that cover distinct scenarios (basic, continue/refine, human-in-the-loop) and cut the verbose narrative format to concise input→output pairs.
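The "single base pattern plus differences" suggestion can be sketched as a shared template with per-variant overrides. The field names and variant details below are hypothetical, since the actual Pattern A/B/C contents aren't reproduced in this review:

```python
# Hypothetical base pattern shared by all three execution variants.
BASE_PATTERN = {
    "judge_prompt": "Evaluate the output against the acceptance criteria.",
    "max_iterations": 3,
    "on_fail": "retry",
}

# Only the differences are documented per variant; everything else is inherited.
VARIANT_OVERRIDES = {
    "A": {},                                  # baseline: base pattern unchanged
    "B": {"on_fail": "escalate_to_human"},    # human-in-the-loop variant
    "C": {"judge_count": 3},                  # multi-judge voting variant
}

def resolve_pattern(variant: str) -> dict:
    """Merge the base pattern with a variant's overrides."""
    return {**BASE_PATTERN, **VARIANT_OVERRIDES[variant]}
```

This keeps each variant's documentation to a few lines while the shared structure lives in one place.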
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | Extremely verbose at ~800+ lines. Massive amounts of repetition (e.g., the same judge launch prompt template repeated for Patterns A, B, and C; multiple nearly identical flow diagrams; 7 usage examples that largely repeat the same workflow). Explains concepts Claude already knows (what a median is, how to sort scores). The checklist at the end repeats rules already stated in the body. Could easily be 60-70% shorter without losing actionable content. | 1 / 3 |
| Actionability | Provides concrete prompt templates for sub-agents, specific bash commands for git operations, and clear argument parsing logic. However, much of the guidance is procedural description rather than executable code; pseudocode blocks like 'if STAGED is not empty AND UNSTAGED is not empty' aren't directly executable, and the skill relies heavily on external tools (Task tool, sdd:developer agent type) without providing their actual invocation syntax. The agent prompt templates are copy-paste ready, which is a strength. | 2 / 3 |
| Workflow Clarity | The multi-phase workflow is clearly sequenced with explicit dependency ordering, validation checkpoints at every step (judge agents), feedback loops (fix → re-verify → iterate up to MAX_ITERATIONS), and clear pass/fail criteria with thresholds. Error recovery paths are well-defined for implementation failure, judge disagreement, and refine mode edge cases. The flow diagrams reinforce the sequence. | 3 / 3 |
| Progressive Disclosure | Everything is crammed into a single monolithic file with no references to external files for detailed content. The argument definitions, refine mode behavior, human-in-the-loop behavior, verification specifications appendix, all 7 usage examples, and the complete voting algorithm could all be split into separate referenced files. The result is an overwhelming wall of text that defeats the purpose of a skill file as a concise overview. | 1 / 3 |
| Total | | 7 / 12 (Passed) |
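The scoring mechanics described above (multiple judges, a median vote, a disagreement check, and a bounded fix-and-re-verify loop) can be sketched as follows. The thresholds, function names, and the MAX_ITERATIONS default are illustrative assumptions, not the skill's actual values:

```python
import statistics

MAX_ITERATIONS = 3    # assumed retry budget, not the skill's real default
PASS_THRESHOLD = 7    # assumed pass mark on a 12-point rubric
VARIANCE_LIMIT = 2.0  # assumed cutoff for flagging judge disagreement

def aggregate_judge_scores(scores):
    """Combine independent judge scores into a single verdict.

    Uses the median so one outlier judge cannot swing the result,
    and flags high variance as disagreement needing a human look.
    """
    verdict = statistics.median(scores)
    disagreement = statistics.pvariance(scores) > VARIANCE_LIMIT
    return verdict, disagreement

def verify_step(run_step, run_judges):
    """Fix -> re-verify loop, bounded by MAX_ITERATIONS."""
    for attempt in range(1, MAX_ITERATIONS + 1):
        output = run_step()
        verdict, disagreement = aggregate_judge_scores(run_judges(output))
        if disagreement:
            return {"status": "needs_human_review", "attempt": attempt}
        if verdict >= PASS_THRESHOLD:
            return {"status": "passed", "attempt": attempt}
    return {"status": "failed", "attempt": MAX_ITERATIONS}
```

Usage would look like `verify_step(lambda: implement(), lambda out: [judge(out) for judge in judges])`; the median keeps a single harsh or lenient judge from deciding the outcome on its own.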
Validation (81%)

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 9 / 11 Passed

Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| skill_md_line_count | SKILL.md is long (1786 lines); consider splitting into references/ and linking | Warning |
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | | 9 / 11 Passed |
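The two warnings above (line count and unknown frontmatter keys) amount to simple static checks. A rough sketch, assuming a `---`-delimited frontmatter block of `key: value` lines, an assumed allow-list, and an illustrative 500-line warning threshold (the validator's real rules may differ):

```python
KNOWN_KEYS = {"name", "description", "metadata"}  # assumed allow-list
LINE_WARN_THRESHOLD = 500                         # assumed; real limit unknown

def check_skill_md(text: str) -> list[str]:
    """Return validation warnings for a SKILL.md document."""
    warnings = []
    lines = text.splitlines()
    if len(lines) > LINE_WARN_THRESHOLD:
        warnings.append(f"skill_md_line_count: {len(lines)} lines")
    # Scan a minimal ----delimited frontmatter block for unknown keys.
    if lines and lines[0].strip() == "---":
        for line in lines[1:]:
            if line.strip() == "---":
                break
            key = line.split(":", 1)[0].strip()
            if key and key not in KNOWN_KEYS:
                warnings.append(f"frontmatter_unknown_keys: {key}")
    return warnings
```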