Creates boundary-point validation contracts, defines invariant-based success criteria, and sets up automated verification probes so reliability workflows trigger on objective evidence rather than intuition. Use when designing robust handoff, memory-persistence, or tool-call reliability workflows; when you need to verify handoffs work, check memory persistence, validate tool calls succeeded, or convert vague reliability goals into concrete, testable checks at each boundary point with explicit failure-class mapping (operational vs. critical); or when you want to test your workflow end-to-end, make sure it works, or verify your automation runs correctly using read-back probes and escalation triggers rather than agent confidence. Includes explicit untrusted-content/prompt-injection guardrails for third-party inputs.
96
Quality
90%
Does it follow best practices?
Impact
98%
1.25xAverage score across 9 eval scenarios
{
"context": "Tests whether the agent specifies precise, differentiated escalation triggers for different boundary types, including retry-then-halt for file handoffs, force re-computation for memory resume, and consecutive-failure thresholds for API calls, using the skill's prescribed escalation patterns.",
"type": "weighted_checklist",
"checklist": [
{
"name": "File handoff escalation",
"description": "The contract specifies 'retry once, then halt and report' (or semantically equivalent) as the escalation for a file/artifact handoff boundary",
"max_score": 12
},
{
"name": "Memory resume escalation",
"description": "The contract specifies 'force re-computation before proceeding' (or semantically equivalent: regenerate, rebuild cache) as the escalation for a memory/state-resume boundary",
"max_score": 12
},
{
"name": "API call escalation",
"description": "The contract specifies escalation after 2 consecutive failures (or equivalent threshold) for an API/tool-call boundary",
"max_score": 12
},
{
"name": "Five-column table",
"description": "contract.md contains a markdown table with columns: Boundary, Required Invariants, Verification Probes, Failure Class, Escalation Trigger",
"max_score": 10
},
{
"name": "Critical vs operational",
"description": "The contract distinguishes between 'critical' and 'operational' failure classes across the different boundaries",
"max_score": 10
},
{
"name": "Critical triggers halt",
"description": "The contract maps at least one 'critical' failure class to a halt or stop-pipeline escalation (not just retry)",
"max_score": 10
},
{
"name": "Operational triggers retry",
"description": "The contract maps at least one 'operational' failure class to a retry or recovery escalation (not halt)",
"max_score": 10
},
{
"name": "Missing evidence = operational",
"description": "The contract treats a missing or unverifiable state (e.g. no evidence of completion, unverifiable result) as at least operational minimum risk",
"max_score": 10
},
{
"name": "Invariants in table",
"description": "Every boundary row includes at least one invariant in the Required Invariants column",
"max_score": 7
},
{
"name": "Probes in table",
"description": "Every boundary row includes at least one verification probe",
"max_score": 7
}
]
}