Creates boundary-point validation contracts, defines invariant-based success criteria, and sets up automated verification probes so that reliability workflows trigger on objective evidence rather than intuition. Use when designing robust handoff, memory-persistence, or tool-call reliability workflows; when you need to verify that handoffs work, check that memory persists, or validate that tool calls succeeded; when converting vague reliability goals into concrete, testable checks at each boundary point, with every failure explicitly mapped to a class (operational vs. critical); or when you want to test a workflow end to end, make sure it works, or verify that your automation runs correctly, relying on read-back probes and escalation triggers rather than agent confidence. Includes explicit guardrails against untrusted content and prompt injection in third-party inputs.
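A read-back probe for a state-write boundary might look like the minimal sketch below. It is an illustration, not the skill's own implementation: the names `ProbeResult` and `read_back_probe`, the JSON artifact format, and the `task_id` key are all hypothetical. The failure classes follow the skill's stated mapping (missing artifact is critical, parse failure is operational); treating a missing required key in a state file as critical is an assumption carried over from the rule for missing response fields.

```python
import json
import os
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class ProbeResult:
    ok: bool
    failure_class: Optional[str]  # "operational", "critical", or None on success
    reason: str


def read_back_probe(path: str, required_keys: Set[str]) -> ProbeResult:
    """Verify a state write by re-reading the artifact, not by trusting the writer."""
    # Missing artifact: classified as critical.
    if not os.path.exists(path):
        return ProbeResult(False, "critical", "artifact missing")

    # Bad schema / parse failure: classified as operational.
    try:
        with open(path, encoding="utf-8") as fh:
            data = json.load(fh)
    except ValueError:  # covers json.JSONDecodeError and UnicodeDecodeError
        return ProbeResult(False, "operational", "artifact does not parse as JSON")
    if not isinstance(data, dict):
        return ProbeResult(False, "operational", "artifact is not a JSON object")

    # Missing required keys: treated as critical (assumption, see lead-in).
    missing = sorted(required_keys - data.keys())
    if missing:
        return ProbeResult(False, "critical", f"missing required keys: {missing}")

    return ProbeResult(True, None, "invariants hold")
```

A caller would run `read_back_probe("state.json", {"task_id"})` immediately after the write and act on `failure_class` instead of on the writer's own report of success.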
96
Quality: 90% (Does it follow best practices?)
Impact: 98%, 1.25x (Average score across 9 eval scenarios)
{
  "context": "Tests whether the agent correctly classifies different invariant violations as operational vs critical following the skill's failure mapping rules, including the specific classifications for missing artifacts, bad schema, stale timestamps, non-2xx responses, missing fields, and unverifiable state.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Missing artifact as critical",
      "description": "failure_taxonomy.md or contract.md classifies a missing artifact/file as 'critical' failure",
      "max_score": 10
    },
    {
      "name": "Bad schema as operational",
      "description": "failure_taxonomy.md or contract.md classifies a bad schema / parse failure as 'operational' failure",
      "max_score": 10
    },
    {
      "name": "Stale timestamp as operational",
      "description": "failure_taxonomy.md or contract.md classifies a stale timestamp / stale entry as 'operational' failure",
      "max_score": 10
    },
    {
      "name": "Non-2xx as operational",
      "description": "failure_taxonomy.md or contract.md classifies a non-2xx HTTP response as 'operational' failure",
      "max_score": 10
    },
    {
      "name": "Missing fields as critical",
      "description": "failure_taxonomy.md or contract.md classifies missing required response fields as 'critical' failure",
      "max_score": 10
    },
    {
      "name": "Unknown state as operational minimum",
      "description": "failure_taxonomy.md or contract.md classifies unverifiable or unknown state as at least 'operational' risk (not assumed safe/success)",
      "max_score": 10
    },
    {
      "name": "Five-column table",
      "description": "contract.md contains a markdown table with columns: Boundary, Required Invariants, Verification Probes, Failure Class, Escalation Trigger",
      "max_score": 8
    },
    {
      "name": "Four boundary types",
      "description": "contract.md covers at least four distinct boundary types (state write, API call, resume/readiness, final verification)",
      "max_score": 8
    },
    {
      "name": "Critical halt escalation",
      "description": "At least one critical failure class maps to a halt/stop/page-oncall escalation trigger",
      "max_score": 8
    },
    {
      "name": "Operational retry escalation",
      "description": "At least one operational failure class maps to a retry or auto-recovery escalation trigger",
      "max_score": 8
    },
    {
      "name": "Taxonomy completeness",
      "description": "failure_taxonomy.md covers all six invariant violation types listed in the task (missing artifact, bad schema, stale timestamp, non-2xx, missing fields, unverifiable state)",
      "max_score": 8
    }
  ]
}
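The classifications this checklist expects can be summarized as a small lookup, sketched here under stated assumptions. The six violation-to-class pairs come directly from the checklist descriptions; the identifier names (`TAXONOMY`, `classify`, `escalation_for`), the violation keys, and the concrete escalation strings are hypothetical. Defaulting an unrecognized violation type to operational is also an assumption, chosen to match the rule that unknown state is never treated as safe.

```python
from enum import Enum


class FailureClass(Enum):
    OPERATIONAL = "operational"
    CRITICAL = "critical"


# Violation type -> failure class, as stated in the checklist descriptions.
TAXONOMY = {
    "missing_artifact": FailureClass.CRITICAL,
    "bad_schema": FailureClass.OPERATIONAL,
    "stale_timestamp": FailureClass.OPERATIONAL,
    "non_2xx": FailureClass.OPERATIONAL,
    "missing_fields": FailureClass.CRITICAL,
    "unverifiable_state": FailureClass.OPERATIONAL,
}

# Illustrative escalation triggers: the checklist only requires that a critical
# class maps to a halt-style trigger and an operational class to a retry-style one.
ESCALATION = {
    FailureClass.CRITICAL: "halt",
    FailureClass.OPERATIONAL: "retry",
}


def classify(violation: str) -> FailureClass:
    """Return the failure class for a violation type.

    Unrecognized violation types fall back to OPERATIONAL rather than being
    treated as success (an assumption, see the lead-in).
    """
    return TAXONOMY.get(violation, FailureClass.OPERATIONAL)


def escalation_for(violation: str) -> str:
    """Return the escalation trigger associated with a violation type."""
    return ESCALATION[classify(violation)]
```

Keeping the mapping in one table makes it straightforward to render the same data into the Failure Class and Escalation Trigger columns of a contract table.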