Creates boundary-point validation contracts, defines invariant-based success criteria, and sets up automated verification probes so that reliability workflows trigger on objective evidence rather than intuition. Use it when designing robust handoff, memory-persistence, or tool-call reliability workflows; when you need to verify that handoffs work, memory persists, or tool calls succeeded, converting vague reliability goals into concrete, testable checks at each boundary point with explicit failure-class mapping (operational vs. critical); or when you want to test a workflow end to end and confirm your automation runs correctly using read-back probes and escalation triggers rather than agent confidence. Includes explicit guardrails against untrusted content and prompt injection in third-party inputs.
Overall: 96
Quality: 90% (does it follow best practices?)
Impact: 98% (1.25× the average score across 9 eval scenarios)
A platform reliability team is replacing their legacy alerting system. The current system relies on an AI agent that monitors logs and decides when something "looks wrong enough" to page on-call. This has caused two categories of problems: false positives when the agent is uncertain about ambiguous log patterns, and false negatives when the agent incorrectly assumed an operation succeeded because the logs "seemed fine".
The CTO has mandated a new approach: every alert trigger must be derived from a specific, measurable check against observable system state — not from any model's confidence score, opinion, or intuition about whether the system seems healthy. The team needs a contract that defines what triggers and checks to use for three monitoring scenarios: (a) detecting a failed file write in a logging pipeline, (b) detecting a broken API integration with their metrics collector, and (c) detecting whether a cached configuration snapshot is current.
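Scenario (a) can be made concrete with a read-back probe: after a write, re-read the file and compare digests, so the alert fires on an observable mismatch rather than on any model's judgment. This is a minimal sketch under assumed names; the path and log payload are hypothetical.

```python
import hashlib
from pathlib import Path

def verified_write(path: Path, data: bytes) -> bool:
    """Write data, then read it back and compare SHA-256 digests.

    Returns True only when the on-disk bytes match what was sent --
    an objective invariant, not a confidence estimate.
    """
    expected = hashlib.sha256(data).hexdigest()
    path.write_bytes(data)
    # Read-back probe: trust observable state, not the write call's return.
    actual = hashlib.sha256(path.read_bytes()).hexdigest()
    return actual == expected

# Hypothetical usage: alert only when the read-back check fails.
ok = verified_write(Path("/tmp/pipeline.log"), b"event: job complete\n")
print("write verified" if ok else "ALERT: read-back mismatch (critical)")
```

A digest mismatch maps cleanly to a failure class (here, critical) because the check yields a binary, reproducible answer at the boundary.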
Produce the following files:
- contract.md — A monitoring contract covering the three scenarios above. For each scenario, define the boundary, the objective invariant checks to use as alert triggers, the failure classification, and the action to take. The triggers must be concrete and measurable, with no subjective assessments.
- design_notes.md — A short document (3–5 bullet points) explaining the design principles used to choose the triggers, specifically why objective checks are preferred over confidence-based triggers, and how to handle cases where state cannot be verified.
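For scenario (c), a contract of this kind would pin freshness to measurable file state. The sketch below is one possible check, assuming hypothetical filenames and a hypothetical 300-second freshness budget: the snapshot counts as current only if it exists, is younger than the budget, and is no older than the source it mirrors; an unverifiable state is treated as a failure rather than assumed healthy.

```python
import time
from pathlib import Path

MAX_AGE_SECONDS = 300  # hypothetical freshness budget

def snapshot_is_current(snapshot: Path, source: Path) -> bool:
    """Objective staleness check for a cached configuration snapshot.

    Current means: both files exist, the snapshot is within the age
    budget, and it was written no earlier than the source it mirrors.
    """
    if not snapshot.exists() or not source.exists():
        return False  # cannot verify state: classify as a failure, do not assume success
    snap_mtime = snapshot.stat().st_mtime
    age = time.time() - snap_mtime
    return age <= MAX_AGE_SECONDS and snap_mtime >= source.stat().st_mtime
```

Treating "cannot verify" as a failure (rather than a pass) is the design choice the contract's escalation rules hinge on.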