Implement a task with automated LLM-as-Judge verification for critical steps
Your job is to implement the solution at the best quality, using the task specification and sub-agents. You MUST NOT stop unless it is critically necessary or you are done! Avoid asking questions unless it is critically necessary! Launch the implementation agent and judges, iterate until issues are fixed, and then move to the next step!
Execute task implementation steps with automated quality verification using LLM-as-Judge for critical artifacts.
$ARGUMENTS

Parse the following arguments from $ARGUMENTS:
| Argument | Format | Default | Description |
|---|---|---|---|
| `task-file` | Path or filename | Auto-detect | Task file name or path (e.g., `add-validation.feature.md`) |
| `--continue` | `--continue` | None | Continue implementation from the last completed step. Launches a judge first to verify state, then iterates with the implementation agent. |
| `--refine` | `--refine` | false | Incremental refinement mode: detect changes against git and re-implement only affected steps (from the modified step onwards). |
| `--human-in-the-loop` | `--human-in-the-loop [step1,step2,...]` | None | Steps after which to pause for human verification. If no steps are specified, pauses after every step. |
| `--target-quality` | `--target-quality X.X` or `--target-quality X.X,Y.Y` | 4.0 (standard) / 4.5 (critical) | Target threshold (out of 5.0). A single value sets both; two comma-separated values set standard,critical. |
| `--max-iterations` | `--max-iterations N` | 3 | Maximum fix→verify cycles per step. Set to `unlimited` for no limit. |
| `--skip-judges` | `--skip-judges` | false | Skip all judge validation checks; steps proceed without quality gates. |
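The `--target-quality` convention above can be sketched as a small resolver. This is an illustrative sketch only; the function name and tuple return are ours, not part of the command:

```python
def resolve_thresholds(target_quality=None):
    """Resolve (standard, critical) thresholds from a --target-quality value.

    One value sets both thresholds; two comma-separated values set
    standard,critical; no value falls back to the 4.0 / 4.5 defaults.
    """
    if target_quality is None:
        return 4.0, 4.5
    parts = [float(p) for p in str(target_quality).split(",")]
    if len(parts) == 1:
        return parts[0], parts[0]
    return parts[0], parts[1]
```

For example, `resolve_thresholds("4.5")` sets both thresholds to 4.5, matching the single-value row in the table, while `resolve_thresholds("3.5,4.5")` sets standard to 3.5 and critical to 4.5.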
Parse $ARGUMENTS and resolve configuration as follows:

```
# Extract task file (first positional argument, optional - auto-detect if not provided)
TASK_FILE = first argument that is a file path or filename

# Parse --target-quality (supports single value or two comma-separated values)
if --target-quality has single value X.X:
    THRESHOLD_FOR_STANDARD_COMPONENTS = X.X
    THRESHOLD_FOR_CRITICAL_COMPONENTS = X.X
elif --target-quality has two values X.X,Y.Y:
    THRESHOLD_FOR_STANDARD_COMPONENTS = X.X
    THRESHOLD_FOR_CRITICAL_COMPONENTS = Y.Y
else:
    THRESHOLD_FOR_STANDARD_COMPONENTS = 4.0  # default
    THRESHOLD_FOR_CRITICAL_COMPONENTS = 4.5  # default

# Initialize other defaults
MAX_ITERATIONS = --max-iterations || 3
HUMAN_IN_THE_LOOP_STEPS = --human-in-the-loop || []  (empty = none, "*" = all)
SKIP_JUDGES = --skip-judges || false
REFINE_MODE = --refine || false
CONTINUE_MODE = --continue || false

# Special handling for --human-in-the-loop without step list
if --human-in-the-loop present without step numbers:
    HUMAN_IN_THE_LOOP_STEPS = "*"  (all steps)
```

### Continue Mode (--continue)

When --continue is used:
**Step Resolution:**
- Scan the task file for `[DONE]` markers on step titles

**State Recovery:**
- Check the task's folder location (`in-progress/`, `todo/`, `done/`)
- If the task is still in `todo/`, move it to `in-progress/` before continuing

### Refine Mode (--refine)

When --refine is used, it detects changes to project files (not the task file) and maps them to implementation steps to determine what needs re-verification.
Detect Changed Project Files:
First, determine what to compare against based on git state:
```
# Check for staged changes
STAGED=$(git diff --cached --name-only)

# Check for unstaged changes
UNSTAGED=$(git diff --name-only)
```

Comparison logic:
| Staged | Unstaged | Compare Against | Command |
|---|---|---|---|
| Yes | Yes | Staged (unstaged only) | git diff --name-only |
| Yes | No | Last commit | git diff HEAD --name-only |
| No | Yes | Last commit | git diff HEAD --name-only |
| No | No | No changes | Exit with message |
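The comparison logic in this table can be condensed into a small decision function. This is a sketch; the mode names mirror the `COMPARISON_MODE` values used later in this document:

```python
def pick_comparison_mode(staged, unstaged):
    """Decide what to diff against, mirroring the comparison-logic table.

    `staged` and `unstaged` are the file lists from
    `git diff --cached --name-only` and `git diff --name-only`.
    """
    if staged and unstaged:
        return "unstaged_only"    # compare working dir vs staging
    if staged or unstaged:
        return "vs_last_commit"   # git diff HEAD --name-only
    return "no_changes"           # exit with a message
```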
Map Changes to Implementation Steps:
- Parse each step's `#### Verification` section for artifact paths
- Build the mapping `{changed_file → step_number}`

Determine Affected Steps:
Refine Execution:
Example:
```
# User manually fixed src/validation/validation.service.ts
# (This file was created in Step 2)
/implement my-task.feature.md --refine

# Detects: src/validation/validation.service.ts modified
# Maps to: Step 2 (Create ValidationService)
# Action: Launch judge for Step 2
#   - If PASS: user's fix is good, proceed to Step 3
#   - If FAIL: the implementation agent aligns the rest of the code with the user's changes, without overwriting them
# Continues: Step 3, Step 4... (re-verify all subsequent steps)
```

**Multiple Files Changed:**
```
# User edited files from Step 2 AND Step 4
/implement my-task.feature.md --refine

# Detects: Files from Step 2 and Step 4 modified
# Earliest affected: Step 2
# Re-verifies: Step 2, Step 3, Step 4, Step 5...
# (Step 3 is re-verified even though it has no direct changes, because it depends on Step 2)
```

**Staged vs Unstaged Changes:**
```
# Scenario: User staged some changes, then made more edits
# Staged:   src/validation/validation.service.ts (git add done)
# Unstaged: src/validation/validators/email.validator.ts (still editing)
/implement my-task.feature.md --refine

# Detects: Both staged AND unstaged changes exist
# Mode: Compares unstaged only (working dir vs staging)
# Only email.validator.ts is considered for refine
# Staged changes are preserved, not re-verified

# --
# Scenario: User only has staged changes (ready to commit)
# Staged:   src/validation/validation.service.ts
# Unstaged: none
/implement my-task.feature.md --refine

# Detects: Only staged changes
# Mode: Compares against last commit
# validation.service.ts changes are verified
```

Human verification checkpoints occur:
Trigger Conditions:
- After a step passes, if the step number is in `HUMAN_IN_THE_LOOP_STEPS`
- If `HUMAN_IN_THE_LOOP_STEPS` is `"*"`, triggers after every step

**At Checkpoint:**
Checkpoint Message Format:
---
## 🔍 Human Review Checkpoint - Step X
**Step:** {step title}
**Step Type:** {standard/critical}
**Judge Score:** {score}/{threshold for step type} threshold
**Status:** ✅ PASS / 🔄 ITERATING (attempt {n})
**Artifacts Created/Modified:**
- {artifact_path_1}
- {artifact_path_2}
**Judge Feedback:**
{feedback summary}
**Action Required:** Review the above artifacts and provide feedback or continue.
> Continue? [Y/n/feedback]:
---

Task status is managed by folder location:
- `.specs/tasks/todo/` - Tasks waiting to be implemented
- `.specs/tasks/in-progress/` - Tasks currently being worked on
- `.specs/tasks/done/` - Completed tasks

| When | Action |
|---|---|
| Start implementation | Move task from todo/ to in-progress/ |
| Final verification PASS | Move task from in-progress/ to done/ |
| Implementation failure (user aborts) | Keep in in-progress/ |
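The folder moves in this table amount to a rename between status directories. Here is a minimal sketch; the helper name is ours, and the document prefers `git mv` with plain `mv` as the fallback:

```python
from pathlib import Path

def move_task(task_file, src_status, dst_status, root=".specs/tasks"):
    """Move a task file between status folders (todo/, in-progress/, done/)."""
    src = Path(root) / src_status / task_file
    dst_dir = Path(root) / dst_status
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / task_file
    src.rename(dst)  # plain rename; prefer `git mv` when the repo tracks the file
    return dst
```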
Your role is DISPATCH and AGGREGATE. You do NOT do the work.
Properly build the context for sub-agents!
CRITICAL: For each sub-agent (implementation and evaluation), you need to provide:
- `${CLAUDE_PLUGIN_ROOT}` so agents can resolve paths like `@${CLAUDE_PLUGIN_ROOT}/scripts/create-scratchpad.sh`

| Prohibited Action | Why | What To Do Instead |
|---|---|---|
| Read implementation outputs | Context bloat → command loss | Sub-agent reports what it created |
| Read reference files | Sub-agent's job to understand patterns | Include path in sub-agent prompt |
| Read artifacts to "check" them | Context bloat → forget verifications | Launch judge agent |
| Evaluate code quality yourself | Not your job, causes forgetting | Launch judge agent |
| Skip verification "because simple" | ALL verifications are mandatory | Launch judge agent anyway |
If you think: "I should read this file to understand what was created" → STOP. The sub-agent's report tells you what was created. Use that information.
If you think: "I'll quickly verify this looks correct" → STOP. Launch a judge agent. That's not your job.
If you think: "This is too simple to need verification" → STOP. If the task specifies verification, launch the judge. No exceptions.
If you think: "I need to read the reference file to write a good prompt" → STOP. Put the reference file PATH in the sub-agent prompt. Sub-agent reads it.
Orchestrators who read files themselves = context overflow = command loss = forgotten steps. Every time.
Orchestrators who "quickly verify" = skip judge agents = quality collapse = failed artifacts.
Your context window is precious. Protect it. Delegate everything.
- Use `THRESHOLD_FOR_STANDARD_COMPONENTS` (default 4.0) for standard steps!
- Use `THRESHOLD_FOR_CRITICAL_COMPONENTS` (default 4.5) for steps marked as critical in the task file!
- If `MAX_ITERATIONS` is set to unlimited: iterate until the quality threshold is met (no limit)
- Pause for human review after steps listed in `HUMAN_IN_THE_LOOP_STEPS` (or all steps if `"*"`)!
- If `SKIP_JUDGES` is true: skip ALL judge validation - proceed directly to the next step after each implementation completes!
- If `CONTINUE_MODE` is true: skip to `RESUME_FROM_STEP` - do not re-implement already completed steps!
- If `REFINE_MODE` is true: detect changed project files, map them to steps, re-verify from `REFINE_FROM_STEP` - preserve the user's fixes!
- Relaunch the judge until you get valid results if any of the following happens:
This command orchestrates multi-step task implementation with:
Phase 0: Select Task & Move to In-Progress
│
├─── Use provided task file name or auto-select from todo/ (if only 1 task)
├─── Move task: todo/ → in-progress/
│
▼
Phase 1: Load Task
│
▼
Phase 2: Execute Steps
│
├─── For each step in dependency order:
│ │
│ ▼
│ ┌─────────────────────────────────────────────────┐
│ │ Launch sdd:developer agent │
│ │ (implementation) │
│ └─────────────────┬───────────────────────────────┘
│ │
│ ▼
│ ┌─────────────────────────────────────────────────┐
│ │ Launch judge agent(s) │
│ │ (verification per #### Verification section) │
│ └─────────────────┬───────────────────────────────┘
│ │
│ ▼
│ ┌─────────────────────────────────────────────────┐
│ │ Judge PASS? → Mark step complete in task file │
│ │ Judge FAIL? → Fix and re-verify (max 2 retries) │
│ └─────────────────────────────────────────────────┘
│
▼
Phase 3: Final Verification
│
├─── Verify all Definition of Done items
│ │
│ ▼
│ ┌─────────────────────────────────────────────────┐
│ │ Launch judge agent │
│ │ (verify all DoD items) │
│ └─────────────────┬───────────────────────────────┘
│ │
│ ▼
│ ┌─────────────────────────────────────────────────┐
│ │ All PASS? → Proceed to Phase 4 │
│ │ Any FAIL? → Fix and re-verify (iterate) │
│ └─────────────────────────────────────────────────┘
│
▼
Phase 4: Move Task to Done
│
├─── Move task: in-progress/ → done/
│
▼
Phase 5: Final Report

Parse user input to get the task file path and arguments.
If $ARGUMENTS is empty or only contains flags:
Check in-progress folder first:
```
ls .specs/tasks/in-progress/*.md 2>/dev/null
```

- If exactly one task is found: set `$TASK_FILE` to that file, `$TASK_FOLDER` to `in-progress`

Check todo folder:

```
ls .specs/tasks/todo/*.md 2>/dev/null
```

- If exactly one task is found: set `$TASK_FILE` to that file, `$TASK_FOLDER` to `todo`

If $ARGUMENTS contains a task file name:

- Search folders in order: `in-progress/` → `todo/` → `done/`
- Set `$TASK_FILE` and `$TASK_FOLDER` accordingly

If the task is in the `todo/` folder:

```
git mv .specs/tasks/todo/$TASK_FILE .specs/tasks/in-progress/
# Fallback if git not available: mv .specs/tasks/todo/$TASK_FILE .specs/tasks/in-progress/
```

Update `$TASK_PATH` to `.specs/tasks/in-progress/$TASK_FILE`
If task is already in in-progress/:
Set $TASK_PATH to .specs/tasks/in-progress/$TASK_FILE
Parse all flags from $ARGUMENTS and initialize configuration.
Display resolved configuration:
### Configuration
| Setting | Value |
|---------|-------|
| **Task File** | {TASK_PATH} |
| **Standard Components Threshold** | {THRESHOLD_FOR_STANDARD_COMPONENTS}/5.0 |
| **Critical Components Threshold** | {THRESHOLD_FOR_CRITICAL_COMPONENTS}/5.0 |
| **Max Iterations** | {MAX_ITERATIONS or "3"} |
| **Human Checkpoints** | {HUMAN_IN_THE_LOOP_STEPS as comma-separated or "All steps" or "None"} |
| **Skip Judges** | {SKIP_JUDGES} |
| **Continue Mode** | {CONTINUE_MODE} |
| **Refine Mode** | {REFINE_MODE} |

If CONTINUE_MODE is true:
Identify Last Completed Step:
- Scan the task file for `[DONE]` markers on step titles
- Find the highest step marked `[DONE]`
- Set `LAST_COMPLETED_STEP` to that number (or 0 if none)

Verify Last Completed Step (if any):

If `LAST_COMPLETED_STEP > 0`:

- Launch a judge to verify the state of the last completed step
- If PASS: `RESUME_FROM_STEP = LAST_COMPLETED_STEP + 1`
- If FAIL: `RESUME_FROM_STEP = LAST_COMPLETED_STEP` (re-implement)

Skip to Resume Point:

- Treat steps before `RESUME_FROM_STEP` as already complete
- Begin execution at `RESUME_FROM_STEP`

If REFINE_MODE is true:
Detect Changed Project Files:
```
# Check for staged and unstaged changes
STAGED=$(git diff --cached --name-only)
UNSTAGED=$(git diff --name-only)
```

Determine comparison mode:

```
if STAGED is not empty AND UNSTAGED is not empty:
    # Both staged and unstaged - use unstaged only
    CHANGED_FILES = git diff --name-only   # working dir vs staging
    COMPARISON_MODE = "unstaged_only"
elif STAGED is not empty OR UNSTAGED is not empty:
    # Only one type - compare against last commit
    CHANGED_FILES = git diff HEAD --name-only
    COMPARISON_MODE = "vs_last_commit"
else:
    # No changes
    Report: "No project changes detected. Make edits first, then run --refine."
    Exit
```

Load Task File and Extract Step→File Mapping:
- Parse each step's `#### Verification` section for artifact paths
- Build `STEP_FILE_MAP = {step_number → [file_paths]}`

Map Changed Files to Steps:
```
AFFECTED_STEPS = []
for each changed_file:
    for step_number, file_list in STEP_FILE_MAP:
        if changed_file matches any path in file_list:
            AFFECTED_STEPS.append(step_number)
```

Determine Refine Scope:

- `REFINE_FROM_STEP = min(AFFECTED_STEPS)` (earliest affected step)
- Steps from `REFINE_FROM_STEP` onwards need re-verification
- Steps before `REFINE_FROM_STEP` are preserved as-is

Store Changed Files Context:

- `CHANGED_FILES` = list of changed file paths
- `USER_CHANGES_CONTEXT` = git diff output for affected files

This is the ONLY phase where you read a file.
Read the task file ONCE:
- Read $TASK_PATH

After this read, you MUST NOT read any other files for the rest of execution.
Parse the ## Implementation Process section:
- `Parallel with:` annotations
- `#### Verification` sections:

| Verification Level | When to Use | Judge Configuration |
|---|---|---|
| None | Simple operations (mkdir, delete) | Skip verification |
| Single Judge | Non-critical artifacts | 1 judge, threshold 4.0/5.0 |
| Panel of 2 Judges | Critical artifacts | 2 judges, median voting, threshold 4.5/5.0 |
| Per-Item Judges | Multiple similar items | 1 judge per item, parallel |
Create TodoWrite with all implementation steps, marking verification requirements:

```json
{
  "todos": [
    {"content": "Step 1: [Title] - [Verification Level]", "status": "pending", "activeForm": "Implementing Step 1"},
    {"content": "Step 2: [Title] - [Verification Level]", "status": "pending", "activeForm": "Implementing Step 2"}
  ]
}
```

For each step in dependency order:
1. Launch Developer Agent:
Use Task tool with:
- `subagent_type`: `sdd:developer`
- `model`: `opus` by default

Prompt:

```
Implement Step [N]: [Step Title]

Task File: $TASK_PATH
Step Number: [N]

Your task:
- Execute ONLY Step [N]: [Step Title]
- Do NOT execute any other steps
- Follow the Expected Output and Success Criteria exactly

When complete, report:
1. What files were created/modified (paths)
2. Confirmation that success criteria are met
3. Any issues encountered
```

2. Use Agent's Report (No Verification)
3. Mark Step Complete
- Add `[DONE]` to the step title in the task file (e.g., `### Step 1: Setup [DONE]`)
- Mark the step's subtasks `[X]` complete
- Mark the TodoWrite item `completed`

1. Launch Developer Agent:
Use Task tool with:
- `subagent_type`: `sdd:developer`
- `model`: `opus` by default

Prompt:

```
Implement Step [N]: [Step Title]

Task File: $TASK_PATH
Step Number: [N]

Your task:
- Execute ONLY Step [N]: [Step Title]
- Do NOT execute any other steps
- Follow the Expected Output and Success Criteria exactly

When complete, report:
1. What files were created/modified (paths)
2. Confirmation of completion
3. Self-critique summary
```

2. Wait for Completion
3. Launch 2 Evaluation Agents in Parallel (MANDATORY):
⚠️ MANDATORY: This pattern requires launching evaluation agents. You MUST launch these evaluations. Do NOT skip. Do NOT verify yourself.
Use sdd:developer agent type for evaluations
Evaluation 1 & 2 (launch both in parallel with the same prompt structure):

```
CLAUDE_PLUGIN_ROOT=${CLAUDE_PLUGIN_ROOT}

Read @${CLAUDE_PLUGIN_ROOT}/prompts/judge.md for evaluation methodology.

Evaluate artifact at: [artifact_path from implementation agent report]

**Chain-of-Thought Requirement:** Justification MUST be provided BEFORE score for each criterion.

Rubric:
[paste rubric table from #### Verification section]

Context:
- Read $TASK_PATH
- Verify Step [N] ONLY: [Step Title]
- Threshold: [from #### Verification section]
- Reference pattern: [if specified in #### Verification section]

You can verify the artifact works - run tests, check imports, validate syntax.

Return: scores per criterion with evidence, overall weighted score, PASS/FAIL, improvements if FAIL.
```

4. Aggregate Results:
5. Determine Threshold:
- Check whether the step is marked critical (in the `#### Verification` section or step metadata)
- Critical step → use `THRESHOLD_FOR_CRITICAL_COMPONENTS`
- Standard step → use `THRESHOLD_FOR_STANDARD_COMPONENTS`

6. On FAIL: Iterate Until PASS (max 3 iterations by default)
If `MAX_ITERATIONS` is reached (default 3):
7. On PASS: Mark Step Complete
- Add `[DONE]` to the step title in the task file (e.g., `### Step 2: Create Service [DONE]`)
- Mark the step's subtasks `[X]` complete
- Mark the TodoWrite item `completed`

8. Human-in-the-Loop Checkpoint (if applicable):
Only after step PASSES, if step number is in HUMAN_IN_THE_LOOP_STEPS (or HUMAN_IN_THE_LOOP_STEPS == "*"):
---
## 🔍 Human Review Checkpoint - Step [N]
**Step:** [Step Title]
**Judge Score:** [score]/[threshold for step type] threshold
**Status:** ✅ PASS
**Artifacts Created/Modified:**
- [artifact_path_1]
- [artifact_path_2]
**Judge Feedback:**
[feedback summary from judges]
**Action Required:** Review the above artifacts and provide feedback or continue.
> Continue? [Y/n/feedback]:
---

For steps that create multiple similar items:
1. Launch Developer Agents in Parallel (one per item):
Use Task tool for EACH item (launch all in parallel):

- `subagent_type`: `sdd:developer`
- `model`: `opus` by default

Prompt:

```
Implement Step [N], Item: [Item Name]

Task File: $TASK_PATH
Step Number: [N]
Item: [Item Name]

Your task:
- Create ONLY [item_name] from Step [N]
- Do NOT create other items or steps
- Follow the Expected Output and Success Criteria exactly

When complete, report:
1. File path created
2. Confirmation of completion
3. Self-critique summary
```

2. Wait for All Completions
3. Launch Evaluation Agents in Parallel (one per item)
⚠️ MANDATORY: Launch evaluation agents. Do NOT skip. Do NOT verify yourself.
Use sdd:developer agent type for evaluations
For each item:

```
CLAUDE_PLUGIN_ROOT=${CLAUDE_PLUGIN_ROOT}

Read @${CLAUDE_PLUGIN_ROOT}/prompts/judge.md for evaluation methodology.

Evaluate artifact at: [item_path from implementation agent report]

**Chain-of-Thought Requirement:** Justification MUST be provided BEFORE score for each criterion.

Rubric:
[paste rubric from #### Verification section]

Context:
- Read $TASK_PATH
- Verify Step [N]: [Step Title]
- Verify ONLY this Item: [Item Name]
- Threshold: [from #### Verification section]

You can verify the artifact works - run tests, check syntax, confirm dependencies.

Return: scores with evidence, overall score, PASS/FAIL, improvements if FAIL.
```

4. Collect All Results
5. Report Aggregate:
6. Determine Threshold:
- Check whether the step is marked critical (in the `#### Verification` section or step metadata)
- Critical step → use `THRESHOLD_FOR_CRITICAL_COMPONENTS`
- Standard step → use `THRESHOLD_FOR_STANDARD_COMPONENTS`

7. If Any FAIL: Iterate Until ALL PASS
If `MAX_ITERATIONS` is reached (default 3):
8. On ALL PASS: Mark Step Complete
- Add `[DONE]` to the step title in the task file (e.g., `### Step 3: Create Items [DONE]`)
- Mark the step's subtasks `[X]` complete
- Mark the TodoWrite item `completed`

9. Human-in-the-Loop Checkpoint (if applicable):
Only after ALL items PASS, if step number is in HUMAN_IN_THE_LOOP_STEPS (or HUMAN_IN_THE_LOOP_STEPS == "*"):
---
## 🔍 Human Review Checkpoint - Step [N]
**Step:** [Step Title]
**Items Passed:** X/Y
**Status:** ✅ ALL PASS
**Artifacts Created:**
- [item_1_path]
- [item_2_path]
- ...
**Action Required:** Review the above artifacts and provide feedback or continue.
> Continue? [Y/n/feedback]:
---

Before moving to final verification, verify you followed the rules:
If you read files other than the task file, you are doing it wrong. STOP and restart.
After all implementation steps are complete, verify the task meets all Definition of Done criteria.
Use Task tool with:
- `subagent_type`: `sdd:developer`
- `model`: `opus`

Prompt:

```
CLAUDE_PLUGIN_ROOT=${CLAUDE_PLUGIN_ROOT}

Verify all Definition of Done items in the task file.

Task File: $TASK_PATH

Your task:
1. Read the task file and locate the "## Definition of Done (Task Level)" section
2. Go through each checkbox item one by one
3. For each item, verify if it passes by:
   - Running appropriate tests (unit tests, E2E tests)
   - Checking build/compilation status
   - Verifying file existence and correctness
   - Checking code patterns and linting
4. You MUST mark each item in the task file that passed verification with `[X]`
5. Return a structured report:
   - List ALL Definition of Done items
   - Status for each:
     - ✅ PASS - if the item is complete and verified
     - ❌ FAIL - if the item fails verification, with specific reason why
     - ⚠️ BLOCKED - if the item cannot be verified due to a blocker
   - Evidence for each status
   - Specific issues for any failures
   - Overall pass rate

Be thorough - check everything the task requires.
```

If any Definition of Done items FAIL:
1. Launch Developer Agent for Each Failing Item:
```
Fix Definition of Done item: [Item Description]

Task File: $TASK_PATH

Current Status:
[paste failure details from verification report]

Your task:
1. Fix the specific issue identified
2. Verify the fix resolves the problem
3. Ensure no regressions (all tests still pass)

Return:
- What was fixed
- Confirmation the item now passes
- Any related changes made
```

2. Re-verify After Fixes:
Launch the verification agent again (Step 3.1) to confirm all items now PASS.
3. Iterate if Needed:
Repeat fix → verify cycle until all Definition of Done items PASS.
Once ALL Definition of Done items PASS, move the task to the done folder.
Confirm all Definition of Done items are marked complete in the task file.
```
# Extract just the filename from $TASK_PATH
TASK_FILENAME=$(basename $TASK_PATH)

# Move from in-progress to done
git mv .specs/tasks/in-progress/$TASK_FILENAME .specs/tasks/done/
# Fallback if git not available: mv .specs/tasks/in-progress/$TASK_FILENAME .specs/tasks/done/
```

When using 2+ evaluations, follow these manual computation steps:
Create a table with each criterion and scores from all evaluations:
| Criterion | Eval 1 | Eval 2 | Median | Difference |
|---|---|---|---|---|
| [Name 1] | X.X | X.X | ? | ? |
| [Name 2] | X.X | X.X | ? | ? |
For 2 evaluations: Median = (Score1 + Score2) / 2
For 3+ evaluations: Sort scores, take middle value (or average of two middle values if even count)
High variance = evaluators disagree significantly (difference > 2.0 points)
Formula: |Eval1 - Eval2| > 2.0 → Flag as high variance
Multiply each criterion's median by its weight and sum:
```
Overall = (Criterion1_Median × Weight1) + (Criterion2_Median × Weight2) + ...
```

Compare overall score to threshold:
- Overall ≥ Threshold → PASS ✅
- Overall < Threshold → FAIL ❌

If evaluations significantly disagree (difference > 2.0 on any criterion):

- Flag the criterion as high variance and list it in the final report's "High-Variance Criteria" section
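Under the rules above (the median of two scores is their average, per-criterion weighting, variance flagged at > 2.0), the aggregation can be sketched as follows. The criterion names and weights used in the example are hypothetical:

```python
from statistics import median

def aggregate_panel(scores_by_criterion, weights, threshold):
    """Median per criterion, high-variance flags, weighted overall, verdict.

    For two evaluations, statistics.median returns their average,
    matching the (Score1 + Score2) / 2 rule above.
    """
    medians = {c: median(s) for c, s in scores_by_criterion.items()}
    high_variance = [c for c, s in scores_by_criterion.items()
                     if max(s) - min(s) > 2.0]
    overall = sum(medians[c] * weights[c] for c in medians)
    verdict = "PASS" if overall >= threshold else "FAIL"
    return {"medians": medians, "high_variance": high_variance,
            "overall": round(overall, 2), "verdict": verdict}
```

For example, with two hypothetical criteria scored (4.0, 5.0) and (3.0, 4.0) at weights 0.6/0.4, the overall is 4.5 × 0.6 + 3.5 × 0.4 = 4.1, a PASS at the default 4.0 threshold.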
After all steps complete and DoD verification passes:
## Implementation Summary
### Task Status
- Task Status: `done` ✅
- All Definition of Done items: X/X PASS (100%)
### Configuration Used
| Setting | Value |
|---------|-------|
| **Standard Components Threshold** | {THRESHOLD_FOR_STANDARD_COMPONENTS}/5.0 |
| **Critical Components Threshold** | {THRESHOLD_FOR_CRITICAL_COMPONENTS}/5.0 |
| **Max Iterations** | {MAX_ITERATIONS or "3"} |
| **Human Checkpoints** | {HUMAN_IN_THE_LOOP_STEPS or "None"} |
| **Skip Judges** | {SKIP_JUDGES} |
| **Continue Mode** | {CONTINUE_MODE} |
| **Refine Mode** | {REFINE_MODE} |
### Steps Completed
| Step | Title | Status | Verification | Score | Iterations | Judge Confirmed |
|------|-------|--------|--------------|-------|------------|-----------------|
| 1 | [Title] | ✅ | Skipped | N/A | 1 | - |
| 2 | [Title] | ✅ | Panel (2) | 4.5/5 | 1 | ✅ |
| 3 | [Title] | ✅ | Per-Item | 5/5 passed | 2 | ✅ |
| 4 | [Title] | ✅ | Single | 4.2/5 | 3 | ✅ |
**Legend:**
- ✅ PASS - Score >= threshold for step type
- ⚠️ MAX_ITER - Did not pass but MAX_ITERATIONS reached, proceeded anyway
- ⏭️ SKIPPED - Step skipped (continue/refine mode)
### Verification Summary
- Total steps: X
- Steps with verification: Y
- Passed on first try: Z
- Required iteration: W
- Total iterations across all steps: V
- Final pass rate: 100%
### Definition of Done Verification
| Item | Status | Evidence |
|------|--------|----------|
| [DoD Item 1] | ✅ PASS | [Brief evidence] |
| [DoD Item 2] | ✅ PASS | [Brief evidence] |
| ... | ... | ... |
**Issues Fixed During Verification:**
1. [Issue]: [How it was fixed]
2. [Issue]: [How it was fixed]
### High-Variance Criteria (Evaluators Disagreed)
- [Criterion] in [Step]: Eval 1 scored X, Eval 2 scored Y
### Human Review Summary (if --human-in-the-loop used)
| Step | Checkpoint | User Action | Feedback Incorporated |
|------|------------|-------------|----------------------|
| 2 | After PASS | Continued | - |
| 4 | After iteration 2 | Feedback | "Improve error messages" |
| 6 | After PASS | Continued | - |
### Task File Updated
- Task moved from `in-progress/` to `done/` folder
- All step titles marked `[DONE]`
- All step subtasks marked `[X]`
- All Definition of Done items marked `[X]`
### Recommendations
1. [Any follow-up actions]
2. [Suggested improvements]

┌──────────────────────────────────────────────────────────────┐
│ IMPLEMENT TASK WITH VERIFICATION │
├──────────────────────────────────────────────────────────────┤
│ │
│ Phase 0: Select Task │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Use provided name or auto-select from todo/ (if 1 task) │ │
│ │ → Move task from todo/ to in-progress/ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Phase 1: Load Task │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Read $TASK_PATH → Parse steps │ │
│ │ → Extract #### Verification specs → Create TodoWrite │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Phase 2: Execute Steps (Respecting Dependencies) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ For each step: │ │
│ │ │ │
│ │ ┌──────────────┐ ┌───────────────┐ ┌───────────┐ │ │
│ │ │ developer │───▶│ Judge Agent │───▶│ PASS? │ │ │
│ │ │ Agent │ │ (verify) │ │ │ │ │
│ │ └──────────────┘ └───────────────┘ └───────────┘ │ │
│ │ │ │ │ │
│ │ Yes No │ │
│ │ │ │ │ │
│ │ ▼ ▼ │ │
│ │ ┌────────┐ Fix & │ │ │
│ │ │ Mark │ Retry │ │ │
│ │ │Complete│ ↺ │ │ │
│ │ └────────┘ │ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Phase 3: Final Verification │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ ┌──────────────┐ ┌───────────────┐ ┌───────────┐ │ │
│ │ │ Judge Agent │───▶│ All DoD │───▶│ All PASS? │ │ │
│ │ │ (verify DoD) │ │ items checked │ │ │ │ │
│ │ └──────────────┘ └───────────────┘ └───────────┘ │ │
│ │ │ │ │ │
│ │ Yes No │ │
│ │ │ │ │ │
│ │ ▼ ▼ │ │
│ │ Fix & │ │
│ │ Retry │ │
│ │ ↺ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Phase 4: Move Task to Done │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ mv in-progress/$TASK → done/$TASK │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Phase 5: Aggregate & Report │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Collect all verification results │ │
│ │ → Calculate aggregate metrics │ │
│ │ → Generate final report │ │
│ │ → Present to user │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────┘

```
# Implement a specific task
/implement add-validation.feature.md

# Auto-select task from todo/ or in-progress/ (if only 1 task)
/implement

# Continue from last completed step
/implement add-validation.feature.md --continue

# Refine after user fixes project files (detects changes, re-verifies affected steps)
/implement add-validation.feature.md --refine

# Human review after every step
/implement add-validation.feature.md --human-in-the-loop

# Human review after specific steps only
/implement add-validation.feature.md --human-in-the-loop 2,4,6

# Higher quality threshold (stricter) - sets both standard and critical to 4.5
/implement add-validation.feature.md --target-quality 4.5

# Different thresholds for standard (3.5) and critical (4.5) components
/implement add-validation.feature.md --target-quality 3.5,4.5

# Lower quality threshold for both (faster convergence)
/implement add-validation.feature.md --target-quality 3.5

# Unlimited iterations (default is 3)
/implement add-validation.feature.md --max-iterations unlimited

# Skip all judge verifications (fast but no quality gates)
/implement add-validation.feature.md --skip-judges

# Combined: continue with human review
/implement add-validation.feature.md --continue --human-in-the-loop
```

User: /implement add-validation.feature.md
Phase 0: Task Selection...
Found task in: .specs/tasks/todo/add-validation.feature.md
Moving to in-progress: .specs/tasks/in-progress/add-validation.feature.md
Phase 1: Loading task...
Task: "Add form validation service"
Steps identified: 4 steps
Verification plan (from #### Verification sections):
- Step 1: No verification (directory creation)
- Step 2: Panel of 2 evaluations (ValidationService)
- Step 3: Per-item evaluations (3 validators)
- Step 4: Single evaluation (integration)
Phase 2: Executing...
Step 1: Launching sdd:developer agent...
Agent: "Implement Step 1: Create Directory Structure..."
Result: ✅ Directories created
Verification: Skipped (simple operation)
Status: ✅ COMPLETE
Step 2: Launching sdd:developer agent...
Agent: "Implement Step 2: Create ValidationService..."
Result: Files created, tests passing
Launching 2 judge agents in parallel...
Judge 1: 4.3/5.0 - PASS
Judge 2: 4.5/5.0 - PASS
Panel Result: 4.4/5.0 ✅
Status: ✅ COMPLETE (Judge Confirmed)
[Continue for all steps...]
Phase 3: Final Verification...
Launching DoD verification agent...
Agent: "Verify all Definition of Done items..."
Result: 4/4 items PASS ✅
Phase 4: Moving task to done...
mv .specs/tasks/in-progress/add-validation.feature.md .specs/tasks/done/
Phase 5: Final Report
Implementation complete.
- 4/4 steps completed
- 6 artifacts verified
- All passed first try
- Definition of Done: 4/4 PASS
- Task location: .specs/tasks/done/add-validation.feature.md ✅

[All steps complete...]
Phase 3: Final Verification...
Launching DoD verification agent...
Agent: "Verify all Definition of Done items..."
Result: 3/4 items PASS, 1 FAIL ❌
Failing item:
- "Code follows ESLint rules": 356 errors found
Should I attempt to fix this issue? [Y/n]
User: Y
Launching sdd:developer agent...
Agent: "Fix ESLint errors..."
Result: Fixed 356 errors, 0 warnings ✅
Re-launching DoD verification agent...
Agent: "Re-verify all Definition of Done items..."
Result: 4/4 items PASS ✅
Phase 4: Moving task to done...
All DoD checkboxes marked complete ✅
Phase 5: Final Report
Task verification complete.
- All DoD items now PASS
- 1 issue fixed (ESLint errors)
- Task location: .specs/tasks/done/ ✅

Step 3 Implementation complete.
Launching judge agents...
Judge 1: 3.5/5.0 - FAIL (threshold 4.0)
Judge 2: 3.2/5.0 - FAIL
Issues found:
- Test Coverage: 2.5/5
Evidence: "Missing edge case tests for empty input"
Justification: "Success criteria requires edge case coverage"
- Pattern Adherence: 3.0/5
Evidence: "Uses custom Result type instead of project standard"
Justification: "Should use existing Result<T, E> from src/types"
Should I attempt to fix these issues? [Y/n]
User: Y
Launching sdd:developer agent with feedback...
Agent: "Fix Step 3: Address judge feedback..."
Result: Issues fixed, tests added
Re-launching judge agents...
Judge 1: 4.2/5.0 - PASS
Judge 2: 4.4/5.0 - PASS
Panel Result: 4.3/5.0 ✅
Status: ✅ COMPLETE (Judge Confirmed)

User: /implement add-validation.feature.md --continue
Phase 0: Parsing flags...
Configuration:
- Continue Mode: true
- Target Quality: 4.0/5.0 (default)
Scanning task file for completed steps...
Found: Step 1 [DONE], Step 2 [DONE]
Last completed: Step 2
Verifying Step 2 artifacts...
Launching judge agent for Step 2...
Judge: 4.3/5.0 - PASS ✅
Marking step as complete in task file...
Resuming from Step 3...
Step 3: Launching sdd:developer agent...
[continues normally with Step 4]

# User manually fixed src/validation/validation.service.ts
# (This file was created in Step 2: Create ValidationService)
User: /implement add-validation.feature.md --refine
Phase 0: Parsing flags...
Configuration:
- Refine Mode: true
Detecting changed project files...
Changed files:
- src/validation/validation.service.ts (modified)
Mapping files to implementation steps...
- src/validation/validation.service.ts → Step 2 (Create ValidationService)
Earliest affected step: Step 2
Preserving: Step 1 (unchanged)
Re-verifying from: Step 2 onwards
Step 2: Launching judge to verify the step's logic against the user's changes...
Judge: 4.3/5.0 - PASS ✅
Remaining logic is unaffected, proceeding...
Step 3: Launching judge to verify...
Judge: TypeScript error detected in file
Launching implementation agent to fix the error and align the logic with the user's changes...
Launching judge to verify fixed logic...
Judge: 4.5/5.0 - PASS ✅
[continues verifying remaining steps...]
All steps verified with user's changes incorporated ✅

User: /implement add-validation.feature.md --human-in-the-loop
Configuration:
- Human Checkpoints: All steps
Step 1: Launching sdd:developer agent...
Result: Directories created ✅
---
## 🔍 Human Review Checkpoint - Step 1
**Step:** Create Directory Structure
**Judge Score:** N/A (no verification)
**Status:** ✅ COMPLETE
**Artifacts Created:**
- src/validation/
- src/validation/tests/
**Action Required:** Review the above artifacts and provide feedback or continue.
> Continue? [Y/n/feedback]: Y
---
Step 2: Launching sdd:developer agent...
Result: ValidationService created ✅
Launching judge agents...
Judge 1: 4.5/5.0 - PASS
Judge 2: 4.3/5.0 - PASS
Panel Result: 4.4/5.0 ✅
---
## 🔍 Human Review Checkpoint - Step 2
**Step:** Create ValidationService
**Judge Score:** 4.4/5.0 (threshold: 4.0)
**Status:** ✅ PASS
**Artifacts Created:**
- src/validation/validation.service.ts
- src/validation/tests/validation.service.spec.ts
**Judge Feedback:**
- All criteria met
- Test coverage comprehensive
**Action Required:** Review the above artifacts and provide feedback or continue.
> Continue? [Y/n/feedback]: The error messages could be more descriptive
---
Incorporating feedback: "error messages could be more descriptive"
Re-launching sdd:developer agent with feedback...
[iteration continues]

User: /implement critical-api.feature.md --target-quality 4.5
Configuration:
- Target Quality: 4.5/5.0
Step 2: Implementing critical API endpoint...
Result: Endpoint created
Launching judge agents...
Judge 1: 4.2/5.0 - FAIL (threshold: 4.5)
Judge 2: 4.3/5.0 - FAIL
Iteration 1: Re-implementing with feedback...
[fixes applied]
Launching judge agents...
Judge 1: 4.4/5.0 - FAIL
Judge 2: 4.5/5.0 - PASS
Iteration 2: Re-implementing with feedback...
[more fixes applied]
Launching judge agents...
Judge 1: 4.6/5.0 - PASS
Judge 2: 4.5/5.0 - PASS
Panel Result: 4.55/5.0 ✅
Status: ✅ COMPLETE (passed on iteration 2)

Error handling covers these edge cases:
- If the sdd:developer agent reports failure
- If judges disagree significantly (difference > 2.0)
- If --refine mode finds no git changes in the project
- If --refine mode finds changed files but none map to implementation steps
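The --refine change-mapping described in the transcript above can be sketched as follows. This is an illustrative sketch, not the skill's actual internals: the `Step` shape, the step titles, and the `earliestAffectedStep` helper are all assumptions; in practice the changed-file list would come from `git diff --name-only`.

```typescript
interface Step {
  number: number;
  title: string;
  artifacts: string[]; // files this step creates or modifies
}

// Returns the earliest step that owns one of the changed files, or null
// if no changed file maps to any step (an edge case listed above).
function earliestAffectedStep(changedFiles: string[], steps: Step[]): Step | null {
  const affected = steps.filter((step) =>
    step.artifacts.some((artifact) => changedFiles.includes(artifact)),
  );
  if (affected.length === 0) return null;
  return affected.reduce((min, s) => (s.number < min.number ? s : min));
}

// Mirrors the --refine transcript: only Step 2's artifact was modified,
// so Step 1 is preserved and re-verification starts from Step 2.
const steps: Step[] = [
  { number: 1, title: "Create Directory Structure", artifacts: ["src/validation/"] },
  { number: 2, title: "Create ValidationService", artifacts: ["src/validation/validation.service.ts"] },
];
const changed = ["src/validation/validation.service.ts"];
const from = earliestAffectedStep(changed, steps);
```

Mapping to the *earliest* affected step matters: later steps may build on the modified artifact, so everything from that point onwards must be re-verified.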
Before completing implementation, verify that you have:
- Parsed $ARGUMENTS correctly
- Applied THRESHOLD_FOR_STANDARD_COMPONENTS for standard steps
- Applied THRESHOLD_FOR_CRITICAL_COMPONENTS for critical steps
- Iterated on failing steps (until MAX_ITERATIONS reached, default 3)
- If SKIP_JUDGES is true: Skipped ALL judge validation
- If CONTINUE_MODE is true: Verified the last step and resumed correctly
- If REFINE_MODE is true: Detected changed project files, mapped them to steps, and re-verified from the earliest affected step
- Modified only the task file ($TASK_PATH in .specs/tasks/in-progress/) - no other files
- Launched sdd:developer agents via the Task tool
- Launched judge agents via the Task tool (unless SKIP_JUDGES is true)
- Marked completed steps [DONE] and verified items [X] (unless SKIP_JUDGES)
- Paused at HUMAN_IN_THE_LOOP_STEPS checkpoints
- Moved the task from the in-progress/ to the done/ folder
- Marked all DoD items [X] in the task file

This appendix documents how verification is specified in task files. During Phase 2 (Execute Steps), you will reference these specifications to understand how to verify each artifact.
Task files define verification requirements in `#### Verification` sections within each implementation step. These sections specify:

**Level:** Verification complexity
- **None** - Simple operations (mkdir, delete) - skip verification
- **Single Judge** - Non-critical artifacts - 1 judge, threshold 4.0/5.0
- **Panel of 2 Judges** - Critical artifacts - 2 judges, median voting, threshold 4.0/5.0 or 4.5/5.0
- **Per-Item Judges** - Multiple similar items - 1 judge per item, parallel execution

**Artifact(s):** Path(s) to file(s) being verified, e.g. `src/decision/decision.service.ts`, `src/decision/tests/decision.service.spec.ts`

**Threshold:** Minimum passing score

**Rubric:** Weighted criteria table (see format below)

**Reference Pattern (Optional):** Path to an example of a good implementation, e.g. `src/app.service.ts` for NestJS service patterns

Rubrics in task files use this markdown table format:
| Criterion | Weight | Description |
|-----------|--------|-------------|
| [Name 1] | 0.XX | [What to evaluate] |
| [Name 2] | 0.XX | [What to evaluate] |
| ... | ... | ... |

Requirements:
- Weights must sum to 1.0
Example:
| Criterion | Weight | Description |
|-----------|--------|-------------|
| Type Correctness | 0.35 | Types match specification exactly |
| API Contract Alignment | 0.25 | Aligns with documented API contract |
| Export Structure | 0.20 | Barrel exports correctly expose all types |
| Code Quality | 0.20 | Follows project TypeScript conventions |

When judges evaluate artifacts, they use this 5-point scale for each criterion:
- **1 (Poor):** Does not meet requirements
- **2 (Below Average):** Multiple issues, partially meets requirements
- **3 (Adequate):** Meets basic requirements
- **4 (Good):** Meets all requirements, few minor issues
- **5 (Excellent):** Exceeds requirements
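A judge's overall score combines its per-criterion ratings on the scale above with the rubric weights. The sketch below is one plausible aggregation (a weighted sum, with a guard that the weights sum to 1.0); the skill's exact math is not specified here, so treat the formula and the illustrative weights as assumptions.

```typescript
type Rubric = { criterion: string; weight: number }[];

// Weighted sum of per-criterion scores (each 1-5), producing an
// overall score out of 5.0. Throws if the rubric weights are invalid.
function weightedScore(rubric: Rubric, scores: Record<string, number>): number {
  const totalWeight = rubric.reduce((sum, c) => sum + c.weight, 0);
  if (Math.abs(totalWeight - 1.0) > 1e-9) {
    throw new Error(`Rubric weights must sum to 1.0 (got ${totalWeight})`);
  }
  return rubric.reduce((sum, c) => sum + c.weight * scores[c.criterion], 0);
}

// Illustrative example using the criterion scores from the first
// transcript (the 0.5/0.5 weights are assumed, not from the task file):
const rubric: Rubric = [
  { criterion: "Test Coverage", weight: 0.5 },
  { criterion: "Pattern Adherence", weight: 0.5 },
];
const score = weightedScore(rubric, {
  "Test Coverage": 2.5,
  "Pattern Adherence": 3.0,
});
// score = 2.75, well below a 4.0 threshold, so the step fails and iterates
```

Because each criterion is capped at 5 and the weights sum to 1.0, the overall score stays on the same 5-point scale as the individual ratings.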
During Phase 2 (Execute Steps), read each step's `#### Verification` section in the task file to determine the level, artifacts, threshold, and rubric to apply.

Example Verification Section in Task File:
#### Verification
**Level:** Panel of 2 Judges with Aggregated Voting
**Artifact:** `src/decision/decision.service.ts`, `src/decision/tests/decision.service.spec.ts`
**Rubric:**
| Criterion | Weight | Description |
|-----------|--------|-------------|
| Routing Logic | 0.20 | Correctly routes by customerType |
| Drip Feed Implementation | 0.25 | 2% random approval for rejected New customers only |
| Response Formatting | 0.20 | Correct decision outcome, triggeredRules preserved, ISO 8601 timestamp |
| Testability | 0.15 | Injectable randomGenerator enables deterministic testing |
| Test Coverage | 0.20 | Unit tests cover approval, rejection, drip feed, routing, timestamp |
**Reference Pattern:** NestJS service patterns, ZenEngineService API

This specification tells you to launch a panel of 2 judges, have each judge score both artifacts against the weighted rubric, aggregate the scores by voting, and compare the result against the step's threshold.
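Assuming "Aggregated Voting" means the median voting named in the Level definitions, the panel step can be sketched as below. The names (`aggregatePanel`, `PanelResult`) are illustrative, not the skill's actual API; the disagreement check mirrors the documented "difference > 2.0" edge case.

```typescript
interface PanelResult {
  score: number;         // aggregated panel score
  pass: boolean;         // score >= threshold
  disagreement: boolean; // spread > 2.0 -> handle per error-handling rules
}

function aggregatePanel(scores: number[], threshold: number): PanelResult {
  const sorted = [...scores].sort((a, b) => a - b);
  const mid = sorted.length / 2;
  const median =
    sorted.length % 2 === 1
      ? sorted[Math.floor(mid)]
      : (sorted[mid - 1] + sorted[mid]) / 2;
  const spread = sorted[sorted.length - 1] - sorted[0];
  return { score: median, pass: median >= threshold, disagreement: spread > 2.0 };
}

// Mirrors the first transcript: judges score 4.2 and 4.4 against 4.0.
// With two judges the median equals the mean, giving roughly 4.3.
const result = aggregatePanel([4.2, 4.4], 4.0);
```

Median voting makes the panel robust to a single outlier judge, which is why larger spreads are escalated rather than silently averaged away.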