Creates boundary-point validation contracts, defines invariant-based success criteria, and sets up automated verification probes so reliability workflows trigger on objective evidence rather than intuition. Use when designing robust handoff, memory-persistence, or tool-call reliability workflows; when you need to verify handoffs work, check memory persistence, validate tool calls succeeded, or convert vague reliability goals into concrete, testable checks at each boundary point with explicit failure-class mapping (operational vs. critical); or when you want to test your workflow end-to-end, make sure it works, or verify your automation runs correctly using read-back probes and escalation triggers rather than agent confidence. Includes explicit untrusted-content/prompt-injection guardrails for third-party inputs.
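A read-back probe of the kind described above could look roughly like the following. This is a minimal sketch, not the skill's own implementation: the names `ProbeResult` and `read_back_probe` are hypothetical, and which failures count as critical and which as operational is an assumption made here for illustration.

```python
import hashlib
import os
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    ok: bool
    failure_class: Optional[str]  # "operational", "critical", or None when ok
    reason: str


def read_back_probe(path: str, expected_sha256: str, max_age_s: float) -> ProbeResult:
    """Verify a state write by reading the artifact back, not by trusting the writer."""
    # Invariant 1: the artifact exists.
    if not os.path.isfile(path):
        return ProbeResult(False, "critical", "artifact missing")

    # Invariant 2: the content hash matches what the writer claims it wrote.
    with open(path, "rb") as fh:
        actual = hashlib.sha256(fh.read()).hexdigest()
    if actual != expected_sha256:
        return ProbeResult(False, "critical", "checksum mismatch")

    # Invariant 3: the artifact is fresh enough (modification time within max_age_s).
    # Treating staleness as operational rather than critical is an assumption.
    age_s = time.time() - os.path.getmtime(path)
    if age_s > max_age_s:
        return ProbeResult(False, "operational", "artifact stale")

    return ProbeResult(True, None, "all invariants hold")
```

The caller decides what to do based on `failure_class`, for example retrying on an operational failure and escalating on a critical one, so the decision rests on the probe result rather than on the agent's confidence that the write succeeded.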
96
Quality
90%
Does it follow best practices?
Impact
98%
1.25x
Average score across 9 eval scenarios
{
  "context": "Tests whether the agent produces a complete multi-boundary contract covering all five boundary types from the skill workflow, uses the correct table format with all five required columns, and maps failures to operational/critical classes with escalation triggers for each boundary.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Multiple boundary types",
      "description": "contract.md includes at least 3 distinct boundary types from the set: state write, handoff, resume, external tool call, final report",
      "max_score": 10
    },
    {
      "name": "Five-column table",
      "description": "contract.md contains a markdown table with exactly these columns: Boundary, Required Invariants, Verification Probes, Failure Class, Escalation Trigger",
      "max_score": 10
    },
    {
      "name": "Invariants for each boundary",
      "description": "Every boundary row has at least one invariant listed (not empty cells)",
      "max_score": 8
    },
    {
      "name": "Probes for each boundary",
      "description": "Every boundary row has at least one verification probe listed",
      "max_score": 8
    },
    {
      "name": "Failure class for each boundary",
      "description": "Every boundary row has a failure class column entry mapping at least one failure to either 'operational' or 'critical'",
      "max_score": 8
    },
    {
      "name": "Escalation trigger for each boundary",
      "description": "Every boundary row has a non-empty escalation trigger",
      "max_score": 8
    },
    {
      "name": "Artifact exists invariant used",
      "description": "At least one row includes an artifact-existence invariant (e.g. file exists, image digest present, commit SHA exists)",
      "max_score": 8
    },
    {
      "name": "Timestamp freshness invariant used",
      "description": "At least one row includes a timestamp freshness or max_age invariant",
      "max_score": 8
    },
    {
      "name": "Checksum or hash invariant used",
      "description": "At least one row includes a checksum, hash, or digest-matching invariant",
      "max_score": 8
    },
    {
      "name": "Critical vs operational distinction",
      "description": "The contract uses both 'critical' and 'operational' classifications (or equivalent severity labels) across different boundary rows",
      "max_score": 8
    },
    {
      "name": "Final report boundary included",
      "description": "The contract includes a boundary point corresponding to the final report stage (smoke test result or deployment report)",
      "max_score": 8
    },
    {
      "name": "Resume/readiness boundary included",
      "description": "The contract includes a boundary point corresponding to a resume or readiness check (deployment rollout or cluster state)",
      "max_score": 8
    }
  ]
}
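Several of the structural checklist items above (the five-column header, non-empty cells per row, use of both failure classes) lend themselves to an automated pre-check on `contract.md`. The sketch below is one possible way to do that and is not the grader used by the eval. The function names and the example table are hypothetical, the parser assumes no cell contains a literal `|`, and it matches only the literal labels `operational` and `critical`, which is stricter than the rubric's "or equivalent severity labels".

```python
REQUIRED_COLUMNS = [
    "Boundary",
    "Required Invariants",
    "Verification Probes",
    "Failure Class",
    "Escalation Trigger",
]


def split_row(line: str) -> list[str]:
    """Split one markdown table row into trimmed cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def check_contract(markdown: str) -> list[str]:
    """Return a list of problems; an empty list means these structural checks pass."""
    rows = [split_row(l) for l in markdown.splitlines() if l.strip().startswith("|")]
    if len(rows) < 3:
        return ["no table with a header, a separator, and at least one row"]
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator line

    problems = []
    if header != REQUIRED_COLUMNS:
        problems.append(f"unexpected columns: {header}")

    classes_seen = set()
    for row in body:
        name = row[0] if row and row[0] else "<unnamed>"
        # Every row needs all five cells, and none may be empty.
        if len(row) != len(REQUIRED_COLUMNS) or any(not cell for cell in row):
            problems.append(f"{name}: empty or missing cell")
            continue
        # The Failure Class cell must name at least one of the two classes.
        found = {c for c in ("operational", "critical") if c in row[3].lower()}
        if not found:
            problems.append(f"{name}: failure class is neither operational nor critical")
        classes_seen |= found

    if classes_seen != {"operational", "critical"}:
        problems.append("contract does not use both failure classes")
    return problems


# Hypothetical two-row contract used only to exercise the checker.
EXAMPLE = """
| Boundary | Required Invariants | Verification Probes | Failure Class | Escalation Trigger |
|---|---|---|---|---|
| state write | file exists; sha256 matches | read back and hash | critical | any mismatch |
| handoff | timestamp within max_age | compare mtime to now | operational | two stale reads in a row |
"""

if __name__ == "__main__":
    print(check_contract(EXAMPLE))
```

Content-level items, such as whether an artifact-existence, freshness, or checksum invariant appears anywhere in the table, still call for a reader's judgment or a looser keyword search, so a check like this would complement the weighted checklist rather than replace it.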