implement-task

Implement a task with automated LLM-as-Judge verification for critical steps

Quality

31%

Does it follow best practices?


Security

Passed

No findings from the security scan

Fix and improve this skill with Tessl

tessl review fix ./plugins/sdd/skills/implement-task/SKILL.md

Quality

Content

55%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill is extremely thorough and actionable with excellent workflow clarity, providing concrete agent prompts, validation gates, and error recovery loops. However, it is massively over-engineered for a single SKILL.md file — at 800+ lines it wastes enormous context window space with repetitive patterns, redundant diagrams, excessive examples, and inlined content that should be in separate reference files. The token cost of loading this skill is very high relative to the unique information it conveys.

Suggestions

Extract the 7 usage examples, Appendix A (verification specs reference), the voting algorithm details, and the refine/continue mode specifications into separate referenced files (e.g., EXAMPLES.md, VERIFICATION-SPECS.md, VOTING.md, MODES.md) to reduce the main file to ~200 lines.

Consolidate Patterns A/B/C into a single parameterized pattern with a table showing differences (verification level, number of judges, threshold) rather than repeating nearly identical prompt templates three times.

Remove explanations of basic concepts (e.g., 'For 2 evaluations: Median = (Score1 + Score2) / 2', 'Sort scores, take middle value') — Claude knows how to calculate medians.

Eliminate the redundant checklist at the end which restates rules already covered in the CRITICAL sections and anti-rationalization rules above.
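
As a rough illustration of the second suggestion, the three patterns could share one prompt template driven by a small parameter table. This is only a sketch: the `VerificationPattern` type, the `render_prompt` helper, the template wording, and every value in `PATTERNS` are hypothetical placeholders, not the levels, judge counts, or thresholds the skill actually defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerificationPattern:
    """One row of the consolidated pattern table."""
    name: str
    verification_level: str
    judges: int
    threshold: float


# Hypothetical values for illustration only; the real ones live in SKILL.md.
PATTERNS = {
    "A": VerificationPattern("A", "none", 0, 0.0),
    "B": VerificationPattern("B", "single", 1, 4.0),
    "C": VerificationPattern("C", "panel", 2, 4.0),
}

# A single template replaces three near-identical prompt blocks.
PROMPT_TEMPLATE = (
    "Verify step '{step}' at level '{level}' using {judges} judge(s); "
    "pass threshold is {threshold}."
)


def render_prompt(pattern_key: str, step: str) -> str:
    """Fill the shared template from the selected pattern's parameters."""
    pattern = PATTERNS[pattern_key]
    return PROMPT_TEMPLATE.format(
        step=step,
        level=pattern.verification_level,
        judges=pattern.judges,
        threshold=pattern.threshold,
    )


print(render_prompt("C", "write migration"))
```

With this shape, only the table rows differ between patterns, so the prompt text appears once.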

Dimension	Reasoning	Score
Conciseness	Extremely verbose at ~800+ lines. Massive amounts of repetition (e.g., the same prompt templates repeated for Pattern A/B/C, multiple redundant flow diagrams, 7 usage examples that largely repeat the same concepts). Explains concepts Claude already knows (what median is, how to sort scores). The checklist at the end repeats rules already stated in the body. Much of this could be condensed to 1/4 the length.	1 / 3
Actionability	Highly actionable with concrete prompt templates, specific bash commands, exact file paths, detailed argument parsing logic, and copy-paste ready agent prompts. The workflow patterns (A/B/C) provide executable guidance for each scenario with specific tool configurations.	3 / 3
Workflow Clarity	Excellent workflow clarity with explicit phase sequencing (0-5), clear validation checkpoints at every step (judge PASS/FAIL gates), explicit feedback loops (fix → re-verify → iterate up to MAX_ITERATIONS), and clear error recovery paths. The flow diagrams reinforce the sequence. Destructive operations (moving files) have proper guards.	3 / 3
Progressive Disclosure	Monolithic wall of text with no references to external files for detailed content. Everything is inlined — the refine mode logic, continue mode logic, voting algorithm, all usage examples, the entire appendix on verification specs, and the full checklist. Much of this (e.g., Appendix A, the 7 usage examples, the voting algorithm) should be in separate referenced files. No bundle files are provided to support splitting, but the content desperately needs it.	1 / 3
	Total	8 / 12 Passed
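
The feedback loop credited under Workflow Clarity (fix, re-verify, iterate up to MAX_ITERATIONS) and the median-based scoring mentioned under Conciseness can be sketched in a few lines. The function name, the callable signatures, and the default of 3 iterations are assumptions made for illustration; the skill's real judges are agent prompts, not Python callables.

```python
from statistics import median
from typing import Callable, Sequence, Tuple

MAX_ITERATIONS = 3  # assumed default; the skill defines its own value


def verify_with_retries(
    artifact: str,
    judges: Sequence[Callable[[str], float]],
    fix: Callable[[str, Sequence[float]], str],
    threshold: float,
    max_iterations: int = MAX_ITERATIONS,
) -> Tuple[bool, str, int]:
    """Score the artifact with every judge; on FAIL, fix it and re-verify."""
    for iteration in range(1, max_iterations + 1):
        scores = [judge(artifact) for judge in judges]
        if median(scores) >= threshold:
            return True, artifact, iteration  # PASS gate
        if iteration < max_iterations:
            # Only fix when another verification round will follow.
            artifact = fix(artifact, scores)
    return False, artifact, max_iterations


# Toy judges: score by length, so each "fix" (appending a character) helps.
toy_judges = [lambda a: len(a), lambda a: len(a) + 1]
toy_fix = lambda a, scores: a + "x"
print(verify_with_retries("a", toy_judges, toy_fix, threshold=3))
```

`statistics.median` already averages the two middle values for an even number of scores, which is why the review considers spelling that rule out in the skill redundant.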

Description

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This description is too vague and abstract to be effective for skill selection. It fails to list concrete actions, lacks natural trigger terms users would use, and provides no explicit guidance on when Claude should select this skill. The only slightly redeeming quality is the mention of 'LLM-as-Judge' which provides some domain specificity.

Suggestions

List specific concrete actions the skill performs, e.g., 'Sets up automated evaluation pipelines, creates scoring rubrics, runs LLM-based quality checks on task outputs, and iterates on failing steps.'

Add a 'Use when...' clause with natural trigger terms, e.g., 'Use when the user asks to verify task outputs automatically, evaluate LLM responses, set up quality checks, or needs judge-based validation of critical workflow steps.'

Replace the vague 'implement a task' phrasing with specific task types or domains this skill applies to, making it distinguishable from general task-implementation skills.

Dimension	Reasoning	Score
Specificity	The description is vague — 'implement a task' is extremely generic, and 'automated LLM-as-Judge verification' is a single abstract concept rather than a list of concrete actions. No specific operations like 'create test cases', 'run evaluations', or 'generate scoring rubrics' are mentioned.	1 / 3
Completeness	The description weakly addresses 'what' (implement a task with verification) but provides no 'when' clause or explicit trigger guidance. There is no 'Use when...' or equivalent, and the 'what' itself is too vague to be useful.	1 / 3
Trigger Term Quality	The terms used are technical jargon ('LLM-as-Judge', 'verification for critical steps') that users are unlikely to naturally say. Common trigger terms like 'evaluate', 'check quality', 'auto-verify', 'test output', or 'judge responses' are absent.	1 / 3
Distinctiveness / Conflict Risk	'LLM-as-Judge' is a somewhat distinctive concept that narrows the domain, but 'implement a task' is so generic it could overlap with virtually any skill. The combination provides some distinctiveness but not enough to clearly carve out a niche.	2 / 3
	Total	5 / 12 Passed
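
A quick self-check along the lines of these findings might look like the sketch below. It is not how the reviewer scores descriptions: `lint_description` is a hypothetical helper, and the trigger-term list simply reuses the examples quoted in the Trigger Term Quality row.

```python
import re

# Example trigger terms quoted in the review; a real list would be broader.
TRIGGER_TERMS = (
    "evaluate",
    "check quality",
    "auto-verify",
    "test output",
    "judge responses",
)


def lint_description(description: str) -> list[str]:
    """Return the problems found in a skill description (empty if none)."""
    problems = []
    text = description.lower()
    if not re.search(r"\buse when\b", text):
        problems.append("missing 'Use when...' clause")
    if not any(term in text for term in TRIGGER_TERMS):
        problems.append("no natural trigger terms")
    return problems


current = (
    "Implement a task with automated LLM-as-Judge verification "
    "for critical steps"
)
print(lint_description(current))
```

Run against the current description, the check reports both problems, matching the Completeness and Trigger Term Quality rows.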

Validation

81%

Warnings & errors only

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation — 9 / 11 Passed

Validation for skill structure

Criteria	Description	Result
skill_md_line_count	SKILL.md is long (1786 lines); consider splitting into references/ and linking	Warning
frontmatter_unknown_keys	Unknown frontmatter key(s) found; consider removing or moving to metadata	Warning

	Total	9 / 11 Passed
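
The two warnings can be approximated locally before re-running the review. The sketch below is an assumption-heavy stand-in, not the Tessl validator: the `ALLOWED_KEYS` set and the `MAX_LINES` limit are guesses chosen for illustration, and the frontmatter parsing only handles simple top-level `key: value` lines.

```python
# Assumed values; check the actual skill spec before relying on them.
ALLOWED_KEYS = {"name", "description", "license", "allowed-tools", "metadata"}
MAX_LINES = 500


def check_skill(text: str) -> list[str]:
    """Return warnings for an over-long file and unknown frontmatter keys."""
    warnings = []
    lines = text.splitlines()
    if len(lines) > MAX_LINES:
        warnings.append(f"skill_md_line_count: {len(lines)} lines")
    # Frontmatter is the block between the first two '---' lines.
    if lines and lines[0].strip() == "---":
        for line in lines[1:]:
            if line.strip() == "---":
                break
            # Only look at top-level keys: skip nested, list, and comment lines.
            is_top_level = line and not line[0].isspace() and line[0] not in "-#"
            if is_top_level and ":" in line:
                key = line.split(":", 1)[0].strip()
                if key not in ALLOWED_KEYS:
                    warnings.append(f"frontmatter_unknown_keys: {key}")
    return warnings


sample = "---\nname: implement-task\nargument-hint: task file\n---\n" + "body\n" * 600
print(check_skill(sample))
```

To check the real file, the text could be loaded with `pathlib.Path("plugins/sdd/skills/implement-task/SKILL.md").read_text()` and passed to `check_skill`.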

Repository: NeoLabHQ/context-engineering-kit
Path: plugins/sdd/skills/implement-task/SKILL.md
Commit: 3711edf

Reviewed: 1 day ago

