Run a safe-to-fail experiment for Complex domain problems where cause-and-effect is only visible in retrospect.
Safe-to-fail experiment in Complex domain. Cause-effect only visible in retrospect — probe to sense patterns, not to prove.
Probing: $ARGUMENTS
Check for handoff context: if $ARGUMENTS references a probe-to-probe-llm.md file, load it before Phase 1 — carried context accelerates qualification.
CRITICAL: After EVERY AskUserQuestion call, check whether the answers are empty or blank. Known Claude Code bug: outside Plan Mode, AskUserQuestion can silently return empty answers without ever showing the UI.
If answers are empty: DO NOT proceed on assumptions. Instead, re-ask the question as plain conversational text and wait for the user's reply.
ENTRY GATE: Phase 2 does not start until Phase 1 is complete. No bypass path exists.
Extract from $ARGUMENTS or handoff context:
If no hypothesis present: AskUserQuestion — ask user to state the hypothesis. Do not proceed without one.
Enabling constraints: bounds that shape the probe without prescribing its path.
Carry forward from prior cycles unchanged unless explicitly updated.
Before running, define observable signals. For each criterion, state what observation would confirm it and what would refute it.
Criteria must be defined before Phase 2 executes. Gate on this.
Output:
🔬 Probe → [constraints] → [steps] → [expected patterns] → [confirm/refute criteria] → GATE

Probe type (see references/reference.md): architecture | library | prompt | integration | design
AskUserQuestion — one call:
On confirm: Phase 2 executes. On anything else: loop back to 1.1–1.4.
Configuration: isolation: worktree + run_in_background: true
Runs only after Phase 1 entry gate passes.
Run the experiment as defined in Phase 1. Prefer minimal, reversible actions. Gate frequency is SPARSE — enabling constraints bound the agent, not human micromanagement.
Observe results against confirm/refute criteria:
MUST execute before exit gate. DO NOT skip. DO NOT wait for user to ask.
Write probe result to a thinking artifact file in your configured workspace (e.g., thinking/probes/{project}/{date}-{slug}-llm.md).
{project} = current project folder name. Create the directory if missing.
Collision handling: If filename exists, append sequence: {date}-{slug}-2-llm.md, {date}-{slug}-3-llm.md, etc. First write gets clean name.
Guard: If workspace root is not configured, warn user and skip artifact persistence.
Content: hypothesis + enabling constraints + steps taken + observations + sensed patterns + result classification.
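The path and collision rules above can be sketched in shell. The workspace default, the `probes/` layout, and the slug value are illustrative assumptions, not fixed names:

```shell
# Hypothetical sketch of the artifact path + collision rule.
# WORKSPACE, the slug, and the probes/ layout are illustrative assumptions.
WORKSPACE="${WORKSPACE:-$(mktemp -d)}"    # guard: skip persistence entirely if unconfigured
project="$(basename "$PWD")"
slug="redis-vs-memcached"
dir="$WORKSPACE/probes/$project"
mkdir -p "$dir"                           # create the directory if missing
path="$dir/$(date +%Y-%m-%d)-$slug-llm.md"
n=2
while [ -e "$path" ]; do                  # first write gets the clean name
  path="$dir/$(date +%Y-%m-%d)-$slug-$n-llm.md"
  n=$((n+1))
done
printf '# Probe result: %s\n' "$slug" > "$path"
echo "$path"
```

The sequence suffix only appears on collision, so re-running the same probe on the same day never overwrites an earlier artifact.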
Classify result: confirmed | refuted | partial | surprise
Produce B4-compatible handoff:
| Result | When | Transition | Template |
|---|---|---|---|
| confirmed | Hypothesis holds | Complex → Complicated | references/probe-to-investigate-llm.md |
| partial (enough signal) | Some evidence, ready for expert analysis | Complex → Complicated | references/probe-to-investigate-llm.md |
| partial (need another angle) | Some evidence, hypothesis needs sharpening | Complex → Complex (re-probe) | references/probe-to-probe-llm.md |
| refuted / surprise | Hypothesis failed or unexpected result | Complex → Complex (brainstorm) | references/probe-to-brainstorm-llm.md |
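The routing in the table above can be sketched as a case statement. The `signal` variable that splits the two partial rows is an assumed name for illustration:

```shell
# Hypothetical sketch of the result -> handoff-template routing in the table above.
result="partial"      # confirmed | refuted | partial | surprise
signal="enough"       # for partial only: enough | need-another-angle (assumed name)
case "$result" in
  confirmed) template="references/probe-to-investigate-llm.md" ;;
  partial)
    if [ "$signal" = "enough" ]; then
      template="references/probe-to-investigate-llm.md"   # Complex -> Complicated
    else
      template="references/probe-to-probe-llm.md"         # Complex -> Complex (re-probe)
    fi ;;
  refuted|surprise) template="references/probe-to-brainstorm-llm.md" ;;
esac
echo "$template"
```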
Handoff token budget: target 300 tokens inline, flex 200-500, hard cap 600. References to thinking files do not count toward cap.
Self-transition: if result is partial and hypothesis can be sharpened, re-invoke probe via references/probe-to-probe-llm.md with accumulated context. Prior cycles compressed to 200 tokens at 800-token accumulated cap.
Use when the result is partial or surprise and you need to sharpen the hypothesis before escalating to expert analysis.

Probing a new caching strategy:
# Problem: unknown whether Redis or Memcached fits the read pattern
# Probe: run both on 1% of traffic for 24h, measure hit rate + latency
# Reversible: yes (feature flag)

Probing a service extraction:
# Problem: unclear if extracting auth into a microservice reduces coupling
# Probe: shadow the auth calls for 48h without routing real traffic

Probing an LLM prompt strategy:
# Problem: uncertain whether chain-of-thought prompting improves accuracy for this task
# Probe: run 50 test cases through both zero-shot and CoT variants
# Confirm: CoT accuracy >= zero-shot + 10% on held-out eval set
# Refute: no measurable difference, or CoT latency cost exceeds acceptable threshold
# Reversible: yes (eval harness only, no production change)

Probing a database schema migration approach:
# Hypothesis: adding a surrogate UUID primary key reduces join complexity in reporting queries
# Step 1: create a staging branch with the new schema
# Step 2: run the existing reporting query suite against staging
./scripts/run-query-benchmarks.sh --env staging --compare main
# Step 3: compare execution plans and latency — confirm/refute threshold: >=20% reduction
# Reversible: yes (staging only, main schema untouched)
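The >=20% threshold in Step 3 can be checked mechanically once both runs report a latency figure. The variable names and numbers below are illustrative, not output of the real benchmark script:

```shell
# Hypothetical sketch: apply the confirm/refute threshold from Step 3.
# Assumes each benchmark run reports mean latency in ms; numbers are illustrative.
main_ms=250
staging_ms=190
reduction=$(( (main_ms - staging_ms) * 100 / main_ms ))
if [ "$reduction" -ge 20 ]; then
  echo "confirmed: ${reduction}% latency reduction"   # -> confirmed: 24% latency reduction
else
  echo "refuted: only ${reduction}% reduction"
fi
```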