benpiper-workspace/planning-execution-harness

Break down goals into multiple tasks and coordinate execution with gates and recovery. Based on Claw Code's agentic harness.

SPEC.md

Planning-Execution Harness: Technical Specification

Goal

Enable any LLM to reliably orchestrate multi-step processes by separating planning from execution, enforcing approval gates before action, and recovering intelligently from failures.

Problem solved: LLM agents tend to fail in one of three ways: they skip planning (act immediately and make mistakes), plan without gating (no approval before risky actions), or fail once and stop (no recovery strategy).

This spec provides: a structured workflow of planning → approval → execution → recovery → observability.


Input/Output

Inputs

What the system receives:

  • Goal (string): High-level objective ("optimize resume", "debug login", "deploy system")
  • Context (optional object): Project state, constraints, preferences
  • Approval mechanism (function/human): How approval decisions are made
  • Tool set (callable functions): What actions the LLM can execute
  • Recovery policies (map): How to handle each failure type

Constraints:

  • Goal must be achievable in 3-7 concrete steps (not 1 step, not 50)
  • Approval must be explicit (not implicit or automatic for all cases)
  • Recovery policies must be defined upfront

Outputs

What the system produces:

| Stage | Output | Example |
|---|---|---|
| Planning | Ordered task list with dependencies | "Task 1: Analyze\nTask 2: Design (depends on 1)\nTask 3: Implement (depends on 2)" |
| Gate | Approval decision + any modifications | "APPROVED" or "APPROVED_MODIFIED: reorder Task 2 and 3" |
| Execution | Task completion log with progress | "[Task 1/3] ✓ Completed\n[Task 2/3] ✓ Completed" |
| Recovery | Recovery event + outcome | "RECOVERY_APPLIED: retry_transient → success" |
| Final | Execution summary + outcomes | "3/3 tasks completed. 0 failures." |

System Architecture

Core Constraint: Planning ≠ Execution

The system enforces two separate phases:

  1. Planning Phase: LLM produces a plan. Plan is reviewed. Plan may be modified. Plan is NOT executed yet.
  2. Execution Phase: ONLY approved plan is executed. No deviations.

This prevents the "plan one thing, do another" problem.

The Five-Stage Pipeline

User Goal
    ↓
[1. PLAN] → Decompose into ordered tasks
    ↓
[2. GATE] → BLOCKS execution until approved
    ↓
[3. EXECUTE] → Run tasks in order
    ↓ (if failure)
[4. RECOVER] → Classify + apply recovery recipe
    ↓
[5. LOG] → Record all state changes
    ↓
Outcome

Stage Independence: Each stage has a single responsibility. Stages can be implemented separately in different languages/systems.
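
The stage ordering above can be sketched as a small driver that takes each stage as a plain callable. This is a minimal illustration under assumptions, not the harness's actual implementation: `run_pipeline`, its parameters, and the string return values are hypothetical names, and the recovery stage is reduced to a stop on the first failure.

```python
from typing import Callable

Plan = list[str]  # hypothetical: a plan is just an ordered list of task names


def run_pipeline(
    goal: str,
    plan: Callable[[str], Plan],
    gate: Callable[[Plan], bool],
    execute: Callable[[str], None],
    log: list[dict],
) -> str:
    """Wire PLAN -> GATE -> EXECUTE -> LOG; each stage is an independent callable."""
    tasks = plan(goal)
    log.append({"event": "PLAN_CREATED", "task_count": len(tasks)})

    log.append({"event": "GATE_APPROVAL_REQUESTED"})
    if not gate(tasks):
        # The gate blocks: nothing is executed without approval.
        log.append({"event": "GATE_REJECTED"})
        return "rejected"
    log.append({"event": "GATE_APPROVED"})

    for task in tasks:
        log.append({"event": "TASK_STARTED", "task_name": task})
        try:
            execute(task)
        except Exception as exc:
            log.append({"event": "TASK_FAILED", "task_name": task,
                        "error_message": str(exc)})
            return "failed"  # a full harness would hand over to RECOVER here
        log.append({"event": "TASK_COMPLETED", "task_name": task})

    log.append({"event": "EXECUTION_COMPLETE"})
    return "completed"
```

Because every stage is passed in, each one could be swapped for a call into a separate service without changing the driver.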

Key Design Decisions

Decision 1: Gates Are Mandatory

  • Every execution must be gated (approved before proceeding)
  • No "auto-approve all" mode
  • Human/policy decides per-case

Decision 2: Failures Are Classified

  • Not all retries are equal
  • Transient errors → retry with backoff
  • Permission errors → ask human
  • Logic errors → fix and retry
  • Unrecoverable → escalate

Decision 3: Execution Follows Plan Exactly

  • Once approved, the plan is law
  • Don't skip tasks
  • Don't reorder tasks
  • Don't add new tasks
  • If plan is wrong, STOP and return to planning stage

Decision 4: Everything Is Logged

  • Event log is source of truth
  • If no event was emitted, it didn't happen
  • External systems can observe via log

Workflow: Detailed Five-Stage Process

Stage 1: PLAN — Decomposition

Input: User goal
Output: Ordered task list with dependencies
LLM responsibility: Break goal into concrete, testable, ordered tasks

Task Requirements:

| Requirement | Definition | Pass | Fail |
|---|---|---|---|
| Concrete | Specific action, not vague | "Remove unused imports" | "Clean code" |
| Testable | Pass/fail is objective | "Response < 200ms" | "Make it faster" |
| Ordered | Execution sequence clear | Task 2 (depends on 1) | Task 1, 2, 3 |
| Sized | Not too big/small | 3-7 tasks | 1 task or 50 tasks |

Dependency Notation:

Task 1: Base functionality
Task 2: Feature (depends on Task 1)
Task 3: Parallel work (depends on Task 1)
Task 4: Integration (depends on Task 2, Task 3)

Failure Modes:

  • Vague tasks ("improve performance") → Reject, ask to be more specific
  • Circular dependencies (A→B, B→A) → Reject, fix dependencies
  • Too many/few tasks → Reject, rebalance

Success Criteria:

  • All tasks concrete ✓
  • All dependencies explicit ✓
  • 3-7 tasks total ✓

Emit Event: PLAN_CREATED { task_count, dependencies }
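
The size and dependency checks for this stage could be automated along the following lines. This is a sketch that assumes a plan is represented as a mapping from task ID to the set of task IDs it depends on; `validate_plan` and the constants are hypothetical names. Cycle detection uses `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from graphlib import CycleError, TopologicalSorter

MIN_TASKS, MAX_TASKS = 3, 7  # the 3-7 task constraint from this spec


def validate_plan(dependencies: dict[str, set[str]]) -> list[str]:
    """Return a valid execution order, or raise ValueError if the plan is rejected."""
    if not MIN_TASKS <= len(dependencies) <= MAX_TASKS:
        raise ValueError(f"plan has {len(dependencies)} tasks; expected 3-7")

    # Every dependency must refer to a task that exists in the plan.
    referenced = {dep for deps in dependencies.values() for dep in deps}
    unknown = referenced - set(dependencies)
    if unknown:
        raise ValueError(f"unknown dependencies: {sorted(unknown)}")

    try:
        # static_order() yields each task after all of its dependencies.
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc
```

For the dependency notation shown earlier, `validate_plan({"t1": set(), "t2": {"t1"}, "t3": {"t1"}, "t4": {"t2", "t3"}})` returns an order that starts with `t1` and ends with `t4`.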


Stage 2: GATE — Approval

Input: Proposed plan
Output: Approval decision (approve/modify/reject)
Human/Policy responsibility: Review plan before execution

Gate Actions:

| Action | When | Example |
|---|---|---|
| APPROVE | Plan looks correct | "Yes, proceed" |
| APPROVE_WITH_MODIFICATIONS | Reorder/add/remove tasks | "Approved, but skip Task 2" |
| ASK_FOR_CLARIFICATION | Don't understand | "What does 'sync' mean?" |
| REJECT | Wrong approach | "No, try different strategy" |

Gate Requirements:

  • MUST block execution (plan cannot execute without approval)
  • MUST be explicit (no silent approval)
  • MUST accept modifications (user can change plan)

Approval Signals (what counts as approval):

  • "approved" / "yes" / "proceed" / "✓"
  • Thumbs up emoji
  • Explicit confirmation

Rejection/Modification Signals:

  • "no" / "rejected" / "wait"
  • Specific changes: "reorder Task 2 and 3"
  • Questions: "why does Task 1 need Task 2?"

Failure Modes:

  • Approving without understanding → User's problem, system accepts
  • Modifying to circular dependencies → System detects, asks user to fix
  • Timeout waiting for approval → Configurable, escalate if exceeded

Emit Events:

  • GATE_APPROVAL_REQUESTED { plan_id }
  • GATE_APPROVED { modifications }
  • GATE_REJECTED { reason }
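
One way to map a reviewer's reply onto the four gate actions is shown below. The parsing rules are assumptions made for illustration (the spec lists the signals but not how to parse them), and `GateDecision` and `classify_reply` are hypothetical names. Anything that is not recognised, including an empty reply, is treated as a request for clarification so that silence never counts as approval.

```python
from enum import Enum


class GateDecision(Enum):
    APPROVE = "APPROVE"
    APPROVE_WITH_MODIFICATIONS = "APPROVE_WITH_MODIFICATIONS"
    ASK_FOR_CLARIFICATION = "ASK_FOR_CLARIFICATION"
    REJECT = "REJECT"


APPROVAL_SIGNALS = {"approved", "yes", "proceed", "✓", "👍"}
REJECTION_SIGNALS = {"no", "rejected", "wait"}


def classify_reply(reply: str) -> GateDecision:
    """Classify a free-text reply; unrecognised input never approves."""
    text = reply.strip().lower()
    words = [w.strip(".,!") for w in text.split()]

    if not words or text.endswith("?"):
        return GateDecision.ASK_FOR_CLARIFICATION
    if all(w in APPROVAL_SIGNALS for w in words):
        return GateDecision.APPROVE  # e.g. "Yes, proceed"
    if words[0] in REJECTION_SIGNALS:
        return GateDecision.REJECT  # e.g. "No, try different strategy"
    if words[0] in APPROVAL_SIGNALS:
        # Approval followed by extra text, e.g. "Approved, but skip Task 2"
        return GateDecision.APPROVE_WITH_MODIFICATIONS
    # Bare change requests such as "reorder Task 2 and 3" land here and
    # are sent back to the reviewer for an explicit decision.
    return GateDecision.ASK_FOR_CLARIFICATION
```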

Stage 3: EXECUTE — Ordered Task Execution

Input: Approved plan + working environment
Output: Task completions and failures
LLM responsibility: Execute tasks in order, follow the plan

Execution Rules:

  1. Execute in order: Task 1 → Task 2 → Task 3 (unless dependencies allow parallelization)
  2. Report progress: After each task, emit TASK_COMPLETED or TASK_FAILED
  3. Stop on error: Don't continue past a failed task
  4. Don't deviate: Don't skip, reorder, or add tasks without returning to planning stage
  5. Check dependencies: Before starting a task, verify all dependencies are complete

Progress Format (MANDATORY for every task):

Use clear notation showing N/M progress with completion symbols. Format is flexible; choose what fits your task:

Required elements:

  1. Task sequence: Step N/M, Task N/M, Item N/M, etc.
  2. Completion symbol: ✓/✗, COMPLETED/FAILED, (done)/(error), etc.

Acceptable formats:

[Task 1/5] ✓ Check database: Performance acceptable (2s query)
Step 2/5: Check API limits (✗ failed: 429 Rate limit)
Step 2/5 - Retry: Check API limits (✓ completed: 1000 req/hour)
Item 3/5 [COMPLETED] Update config: Deployment settings validated

All formats above are valid because they show:

  • Progress: N/M (1/5, 2/5, etc.)
  • Completion: ✓, ✗, COMPLETED, FAILED, etc.
  • Details: what was done and result

Critical requirement: Every task MUST show clear N/M progress + completion symbol. Choose any format that includes both elements.
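
As an illustration, the two required elements can be produced and checked with a few lines of Python. `format_progress` emits the first of the acceptable formats above, and `has_required_elements` is a loose check for an N/M marker plus a completion symbol. Both names, and the exact symbol list, are hypothetical choices for this sketch.

```python
import re

PROGRESS_RE = re.compile(r"\b\d+/\d+\b")  # matches "1/5", "2/5", ...
COMPLETION_SYMBOLS = ("✓", "✗", "COMPLETED", "FAILED", "(done)", "(error)")


def format_progress(index: int, total: int, name: str, ok: bool, detail: str = "") -> str:
    """Render one progress line in the "[Task N/M] symbol name: detail" style."""
    symbol = "✓" if ok else "✗"
    line = f"[Task {index}/{total}] {symbol} {name}"
    return f"{line}: {detail}" if detail else line


def has_required_elements(line: str) -> bool:
    """True if the line shows N/M progress and a completion symbol."""
    has_progress = PROGRESS_RE.search(line) is not None
    has_symbol = any(symbol in line for symbol in COMPLETION_SYMBOLS)
    return has_progress and has_symbol
```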

Failure Modes:

  • Missing dependency (Task 2 before Task 1) → Detect, block, emit error
  • Task takes too long → Configurable timeout, emit warning
  • External error (API down) → Emit error, proceed to recovery stage

Emit Events:

  • TASK_STARTED { task_id, task_name }
  • TASK_COMPLETED { task_id, result }
  • TASK_FAILED { task_id, error_message }
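
The execution rules can be sketched as a sequential loop that checks dependencies, emits an event per state change, and stops at the first failure. This is a simplified illustration: `Task`, `execute_plan`, and the `emit` callback are hypothetical, parallel execution is left out, and the `error_message` field name follows the log table in Stage 5.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Task:
    task_id: str
    name: str
    action: Callable[[], str]  # performs the work and returns a result summary
    depends_on: list[str] = field(default_factory=list)


def execute_plan(tasks: list[Task], emit: Callable[[dict], None]) -> bool:
    """Run tasks in plan order; return False as soon as one cannot complete."""
    completed: set[str] = set()
    for task in tasks:
        # Rule 5: verify dependencies before starting the task.
        missing = [dep for dep in task.depends_on if dep not in completed]
        if missing:
            emit({"event": "TASK_FAILED", "task_id": task.task_id,
                  "error_message": f"unmet dependencies: {missing}"})
            return False

        emit({"event": "TASK_STARTED", "task_id": task.task_id, "task_name": task.name})
        try:
            result = task.action()
        except Exception as exc:
            # Rule 3: stop on error; the recovery stage takes over from here.
            emit({"event": "TASK_FAILED", "task_id": task.task_id,
                  "error_message": str(exc)})
            return False

        completed.add(task.task_id)
        emit({"event": "TASK_COMPLETED", "task_id": task.task_id, "result": result})
    return True
```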

Stage 4: RECOVER — Failure Classification & Recovery

Input: A task failure
Output: Recovery action or escalation
Policy responsibility: Define recovery recipes per failure type

Failure Classification:

| Failure Type | Characteristics | Recovery | Max Retries |
|---|---|---|---|
| Transient | Temporary, will likely succeed if retried | Retry with backoff (5s, 30s, 5m) | 2 |
| Permission | Access denied, needs approval/credentials | Ask human for approval, or to configure credentials via a secure channel | 1 (after user provides input) |
| Invalid Input | Data is malformed | Ask human for correction | 1 (after user provides input) |
| Logic Error | Code/approach is wrong | Fix approach and retry | 1 |
| Unrecoverable | Task no longer makes sense | Skip task or abort plan | 0 |

Recovery Recipe Structure:

Failure Type: [name]
Detection: [how to identify this failure]
Recovery Steps:
  1. [action]
  2. [action]
  3. [escalation policy if still failing]
Max Attempts: [number]
Escalation: [what happens if max attempts exceeded]
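
A recipe in this shape maps naturally onto a small record plus a lookup function. The sketch below assumes detection is a case-insensitive substring match on the error message, which is one possible reading of the "Detection" field. `RecoveryRecipe`, `RECIPES`, and `classify` are hypothetical names. A failure with no matching recipe yields `None`, which the caller would escalate to a human.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RecoveryRecipe:
    failure_type: str
    detection: tuple[str, ...]  # lowercase substrings that identify the failure
    max_attempts: int
    escalation: str


# Two recipes, with detection strings and limits taken from the examples in this stage.
RECIPES = (
    RecoveryRecipe("transient", ("connection timeout", "no response after 30s"),
                   2, "ask user: retry, skip, or abort"),
    RecoveryRecipe("permission", ("403 forbidden", "401 unauthorized"),
                   1, "user decides: skip or abort"),
)


def classify(error_message: str,
             recipes: tuple[RecoveryRecipe, ...] = RECIPES) -> Optional[RecoveryRecipe]:
    """Return the first recipe whose detection pattern matches, else None."""
    text = error_message.lower()
    for recipe in recipes:
        if any(pattern in text for pattern in recipe.detection):
            return recipe
    return None  # unknown failure type: escalate to human
```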

Example Recipe: Network Timeout

Failure Type: Transient (Network Timeout)
Detection: "Connection timeout" or "No response after 30s"
Recovery Steps:
  1. Wait 5 seconds
  2. Retry the request
  3. If still fails, wait 30s and retry again
  4. If still fails, emit RECOVERY_ESCALATION
Max Attempts: 2
Escalation: Ask user: "Network unstable. Retry, skip, or abort?"
  - Retry: continue from step 1
  - Skip: skip this task, continue with next
  - Abort: halt plan execution
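
The recovery steps of this recipe (wait, retry, wait longer, retry, escalate) can be sketched as follows. It is an illustration under assumptions: `recover_transient` and `RecoveryEscalation` are hypothetical names, a transient failure is modelled as Python's built-in `TimeoutError`, and the `sleep` parameter exists only so the delays can be replaced in tests.

```python
import time
from typing import Callable, Sequence


class RecoveryEscalation(Exception):
    """Raised when all retries are used up and a human must decide."""


def recover_transient(
    action: Callable[[], str],
    delays: Sequence[float] = (5.0, 30.0),  # one delay per retry: Max Attempts = 2
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry an already-failed action after each delay; escalate if it keeps failing."""
    last_error = None
    for delay in delays:
        sleep(delay)  # steps 1 and 3: wait before each retry
        try:
            return action()  # steps 2 and 3: retry the request
        except TimeoutError as exc:
            last_error = exc
    # Step 4: retries exhausted, hand the decision to the user.
    raise RecoveryEscalation(f"still failing after {len(delays)} retries") from last_error
```

The caller would catch `RecoveryEscalation`, emit `RECOVERY_ESCALATION`, and ask the user whether to retry, skip, or abort.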

Example Recipe: Permission Denied

Failure Type: Permission
Detection: "403 Forbidden" or "401 Unauthorized"
Recovery Steps:
  1. Emit PERMISSION_REQUIRED event
  2. STOP execution. Do NOT ask user to provide credentials during execution.
  3. Instruct user to set credentials via secure channels (environment variables, 
     config files, credential vaults) BEFORE execution resumes
  4. Ask: "Credentials configured? Retry, or skip/abort task?"
Max Attempts: 1 (after user configures credentials via secure means)
Escalation: User decides (skip or abort)
SECURITY NOTE: Never request, log, or handle credentials as text input during 
execution. Credentials must come from secure sources (env vars, vaults, config 
files) that are not exposed in logs or output.

Failure Modes:

  • Unknown failure type → Escalate to human (unknown recovery recipe)
  • Recovery fails after max attempts → Follow escalation policy
  • User rejects recovery → Respect decision (skip or abort)

Emit Events:

  • FAILURE_DETECTED { task_id, error_message }
  • FAILURE_CLASSIFIED { failure_type }
  • RECOVERY_APPLIED { recipe_name, outcome }
  • RECOVERY_ESCALATION { reason, user_decision }

Stage 5: LOG — Observability & Event Log

Input: All system state changes
Output: Complete, immutable event log
System responsibility: Record everything

What Gets Logged:

| Event | When | Format |
|---|---|---|
| PLAN_CREATED | Plan is generated | { task_count, dependencies } |
| GATE_APPROVAL_REQUESTED | Plan ready for review | { plan_id } |
| GATE_APPROVED | Approval given | { modifications_if_any } |
| TASK_STARTED | Task begins execution | { task_id, task_name } |
| TASK_COMPLETED | Task succeeds | { task_id, result } |
| TASK_FAILED | Task fails | { task_id, error_message } |
| FAILURE_CLASSIFIED | Error is categorized | { failure_type } |
| RECOVERY_APPLIED | Recovery strategy executed | { recipe_name, outcome } |
| EXECUTION_COMPLETE | All tasks done | { summary: completed, failed, skipped } |

Log Entry Format:

{
  "timestamp": "2024-01-15T14:23:45Z",
  "event": "TASK_COMPLETED",
  "task_id": "task_1",
  "task_name": "Analyze input",
  "details": { "result": "5 requirements identified" }
}

Log Properties:

  • Immutable: Once written, events cannot change
  • Ordered: Chronological sequence
  • Queryable: Can search by task, event type, timestamp
  • Complete: All state changes recorded
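
A minimal in-memory log with these properties might look like the sketch below. `EventLog` is a hypothetical class. Immutability is approximated by storing each entry behind a read-only `MappingProxyType` view, which blocks top-level edits but does not freeze nested values such as `details`, and a production log would also need durable storage.

```python
import json
from datetime import datetime, timezone
from types import MappingProxyType
from typing import Optional


class EventLog:
    """Append-only, ordered, queryable event log (in-memory sketch)."""

    def __init__(self) -> None:
        self._events: list = []

    def emit(self, event: str, **fields) -> None:
        entry = {
            "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "event": event,
            **fields,
        }
        # Read-only view: entries cannot be reassigned after being written.
        self._events.append(MappingProxyType(entry))

    def query(self, event: Optional[str] = None, task_id: Optional[str] = None) -> list:
        """Filter by event type and/or task, preserving chronological order."""
        return [
            e for e in self._events
            if (event is None or e["event"] == event)
            and (task_id is None or e.get("task_id") == task_id)
        ]

    def to_jsonl(self) -> str:
        """Serialise the log as one JSON object per line."""
        return "\n".join(json.dumps(dict(e)) for e in self._events)
```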

Why Logging Matters:

  • Auditability: Prove what was approved and executed
  • Debugging: See exactly what happened and when
  • Learning: Understand failure patterns
  • Transparency: Show user what system did

Supporting Sections

Error Handling Philosophy

Principle 1: Classify Before Recovering. Never retry blindly. Identify the failure type first, then apply the appropriate recovery.

Principle 2: Stop on Unknown Errors. If the error type is not in the recovery recipes, escalate to a human.

Principle 3: Finite Retries. Each recovery recipe specifies max attempts, which prevents infinite loops.

Principle 4: User Decides on Escalation. If recovery is exhausted, ask the user: retry, skip, or abort? Respect their decision.

Recovery Policies (Configurable)

These are patterns teams can customize:

Transient Error Policy:

  • Retry count: 2 (configurable)
  • Backoff: increasing delays (5s, 30s, 5m); with the default of 2 retries, only the first two delays apply
  • Escalation: ask user after max retries

Permission Error Policy:

  • Stop and ask the user for confirmation, or to configure credentials via a secure channel (never as text input)
  • Retry once after credentials are configured
  • Escalation: user decides skip/abort

Logic Error Policy:

  • Fix approach based on error
  • Retry once
  • Escalation: escalate to human for judgment

Observability: What Agents Can See

The event log is the single source of truth. Agents can:

  • Query event history
  • See what was approved
  • See what was executed
  • See what failed and how it was recovered
  • Compute summaries (3/5 tasks done, 1 failed, 1 skipped)

External systems can:

  • Subscribe to events in real-time
  • Trigger on specific event types
  • Build dashboards from log
  • Audit compliance via immutable record
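
Computing a summary such as "3/5 tasks done, 1 failed, 1 skipped" from the log amounts to taking the latest terminal event per task. The sketch below assumes events are dictionaries in chronological order and assumes a `TASK_SKIPPED` event, which this spec does not define. `summarize` is a hypothetical helper.

```python
from collections import Counter

# TASK_SKIPPED is an assumed event name; the spec lists only the other two.
TERMINAL_STATES = {
    "TASK_COMPLETED": "completed",
    "TASK_FAILED": "failed",
    "TASK_SKIPPED": "skipped",
}


def summarize(events: list[dict], total_tasks: int) -> str:
    """Summarise final task outcomes; a later event for a task overrides an earlier one."""
    latest: dict[str, str] = {}
    for entry in events:
        state = TERMINAL_STATES.get(entry["event"])
        if state is not None:
            latest[entry["task_id"]] = state  # e.g. failed, then completed after recovery
    counts = Counter(latest.values())
    return (f"{counts['completed']}/{total_tasks} tasks done, "
            f"{counts['failed']} failed, {counts['skipped']} skipped")
```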

Implementation Guidance

What Must Be Implemented

  1. Plan decomposition (LLM responsibility)
  2. Gate mechanism (human/policy responsibility)
  3. Task execution (LLM + tool set responsibility)
  4. Failure classification (policy responsibility)
  5. Event logging (system responsibility)

What Can Vary

  • Languages: Python, JavaScript, Go, Rust, etc.
  • Tool sets: Different tools for different domains
  • Approval mechanisms: Human, policy rules, hybrid
  • Storage: File, database, event stream
  • Recovery recipes: Domain-specific recovery strategies

Security Considerations

Credential Handling

This pattern is designed for task orchestration but has implications for credential management:

✓ Safe Use Cases:

  • Automation tasks that don't involve secrets (file operations, data transformation, public API calls with pre-configured auth)
  • Tasks where credentials are already stored in secure vaults/env vars (agent retrieves, doesn't receive as input)
  • Non-sensitive operations that are logged

⚠ Caution Required:

  • Tasks requiring permission recovery (2FA, API keys, passwords)
  • Sensitive environments where execution logs are exposed
  • Scenarios where agents might echo credential values in output

✗ Not Recommended:

  • Interactive credential entry during agent execution (use env vars/vaults instead)
  • Workflows that handle credentials while agent output is logged or exposed (the logs could reveal those credentials)
  • High-security operations where any credential exposure is unacceptable

Best Practice: Credential Security

  • Never ask agents to request credentials as text input during execution
  • Require credentials to be pre-configured in secure sources (environment variables, credential vaults, config files)
  • When a permission error occurs, stop execution and require the user to configure credentials via secure means
  • Never log, echo, or display credential values in event logs or output
  • Design tasks to use pre-authenticated clients (credentials injected before task execution starts)
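
Two of these practices lend themselves to a short sketch: reading a credential from a pre-configured environment variable instead of asking for it, and redacting sensitive fields before event details reach the log. The key list, the `[REDACTED]` marker, and both function names are assumptions made for illustration. Key-name matching is only a first line of defence and will not catch secrets embedded in free text.

```python
import os

# Assumed set of field names treated as sensitive; extend per deployment.
SENSITIVE_KEYS = {"password", "token", "api_key", "secret", "authorization"}


def redact(details: dict) -> dict:
    """Return a copy of details with sensitive values masked, including nested dicts."""
    clean = {}
    for key, value in details.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean


def credential_from_env(name: str) -> str:
    """Read a pre-configured credential; never prompt for it during execution."""
    value = os.environ.get(name)
    if not value:
        # The message names the variable but never contains a credential value.
        raise PermissionError(
            f"{name} is not configured; set it via a secure channel, then retry"
        )
    return value
```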

Testing Checklist

  • Planning stage produces concrete, testable, ordered tasks with correct dependencies
  • Gate stage blocks execution until explicit approval received
  • Approved plan is executed exactly as specified
  • Failed tasks are classified before recovery attempted
  • Recovery recipes are applied correctly per failure type
  • Event log captures all state changes with timestamps
  • Unknown error types escalate to human
  • Recovery exhaustion escalates to human with user choice
  • Circular dependencies detected and rejected at planning stage
  • Task dependencies verified before execution
