Run a safe-to-fail experiment for Complex domain problems where cause-and-effect is only visible in retrospect.
88
88%
Does it follow best practices?
Impact
Pending
No eval scenarios have been run
Passed
No known issues
| Type | What is probed | Typical steps | Containment |
|---|---|---|---|
| architecture | Structural feasibility — can this design hold? | Scaffold minimal implementation, run against real constraints | Worktree isolation, no prod writes |
| library | Fit of an external library or tool | Install, write minimal integration test, observe behavior | Sandboxed env, pinned version |
| prompt | LLM behavior under specific conditions | Run prompt variants, collect outputs, compare against criteria | Stateless, no persistent writes |
| integration | Compatibility between two systems | Send minimal real requests, observe response shape and error surface | Read-only or staging environment |
| design | User or team response to a design decision | Prototype minimal artifact, present to stakeholder, sense reaction | No code commitment |
All probe types: safe-to-fail, minimal footprint, reversible by default. If a step cannot be reversed, it must be explicitly approved at Phase 1 entry gate.
Every probe emits this format at Phase 1 plan presentation and Phase 2 exit gate:
🔬 Probe → [constraints] → [steps] → [patterns] → [result] → GATEExamples:
🔬 Probe → [worktree, no prod writes] → [scaffold adapter, run unit tests] → [error surface, type compatibility] → confirmed → GATE🔬 Probe → [staging env, read-only] → [send 3 API requests, inspect response] → [rate limit hit at step 2] → surprise → GATEFields:
constraints: enabling constraints from Phase 1.2 (comma-separated, abbreviated)steps: probe steps executed (comma-separated, abbreviated)patterns: what emerged (observations, not conclusions)result: confirmed | refuted | partial | surpriseGATE: signals human checkpoint — always present, always lastFive elements required for a valid probe. If any are missing or weak, the agent asks or proposes — never skips.
| Element | Strong input | Weak/missing input | Agent fallback |
|---|---|---|---|
| Hypothesis | Specific, falsifiable statement: "X will behave like Y under condition Z" | Vague goal: "I want to see if this works" | AskUserQuestion: "State one specific hypothesis — what do you believe is true, and under what condition?" |
| Enabling constraints | Named: scope, reversibility boundary, time box | Absent or implied | Agent proposes based on context: "I suggest constraining to [worktree / staging / 30 min]. Confirm or adjust." |
| Confirm/refute criteria | Observable signals defined before execution | "We'll know it works when it works" | Agent proposes: "Confirmed if [X]. Refuted if [Y]. Surprise if [Z]. Revise or proceed?" |
| Prior probe results | Carried from probe-to-probe-llm.md handoff context | First cycle — none expected | Skip gracefully: no prior context needed on first invocation |
| Scope boundary | Explicit system/component boundary stated | Unbounded ("the whole system") | Agent proposes minimum viable scope: "I suggest scoping to [component]. Expand if needed." |
Fallback rule: agent asks or proposes — user validates. Never skip an element silently. Never assume a hypothesis the user did not state.
When result is partial and a refined hypothesis is identifiable:
thinking/probes/{project}/{date}-{slug}-llm.md). On collision, append -2, -3, etc.probe-to-probe-llm.md with:
probe with the filled handoff as $ARGUMENTSAccumulated context cap: 800 tokens total across a chain. At cap, compress prior cycles to 200 tokens (keep: confirmed facts, active constraints, rejected paths). References to thinking files do not count toward cap.
Output path: thinking/probes/{project}/{date}-{slug}-llm.md (under your configured workspace root)
{project} = current project folder name. On collision (file exists), append sequence: -2, -3, etc.
Required sections in knowledge file:
confirmed | refuted | partial | surpriseNaming: {YYYY-MM-DD}-{hypothesis-slug}-llm.md. Slug is lowercase hyphenated, derived from hypothesis subject.