Creates boundary-point validation contracts, defines invariant-based success criteria, and sets up automated verification probes so that reliability workflows trigger on objective evidence rather than intuition. Use when designing robust handoff, memory-persistence, or tool-call reliability workflows; when you need to verify that handoffs work, check memory persistence, or validate that tool calls succeeded; when converting vague reliability goals into concrete, testable checks at each boundary point with explicit failure-class mapping (operational vs. critical); or when you want to test a workflow end-to-end or verify that your automation runs correctly using read-back probes and escalation triggers rather than agent confidence. Includes explicit untrusted-content/prompt-injection guardrails for third-party inputs.
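As a rough illustration of the read-back probe idea mentioned above, the sketch below writes a record and then re-reads it from storage before declaring success. The function name `write_with_readback` and the JSON-file storage are assumptions made for this example; they are not part of the skill itself.

```python
import json
from pathlib import Path


def write_with_readback(path: Path, record: dict) -> bool:
    """Write a record, then re-read it from disk and compare.

    Hypothetical helper: returns True only when the value read back
    equals what was written. Any I/O, serialization, or parse error
    counts as a failed probe, never as success.
    """
    try:
        path.write_text(json.dumps(record, sort_keys=True))
        return json.loads(path.read_text()) == record
    except (OSError, TypeError, ValueError):
        return False
```

The point of the design is that the success signal comes from the stored state, not from the absence of an exception during the write or from the agent's own report.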
96
Quality
90%
Does it follow best practices?
Impact
98%
1.25x
Average score across 9 eval scenarios
{
"context": "Tests whether the agent applies the skill's guardrails: using objective, measurable checks rather than agent confidence as triggers, preferring observable evidence over narrative assessment, and treating unverifiable state as at least operational risk.",
"type": "weighted_checklist",
"checklist": [
{
"name": "Objective triggers only",
"description": "Every alert trigger in contract.md is a concrete, measurable check (e.g. file exists, HTTP status, timestamp age) — none are described as 'agent feels uncertain', 'model confidence below X', or qualitative assessments",
"max_score": 15
},
{
"name": "No confidence-based trigger",
"description": "contract.md does NOT use any of these phrases or equivalents: 'uncertain', 'feels like', 'seems', 'confidence score', 'agent decides' as a trigger condition",
"max_score": 12
},
{
"name": "Unverifiable state classified",
"description": "The contract or design_notes explicitly states that unverifiable or unknown state is treated as at least an operational risk (not assumed safe)",
"max_score": 12
},
{
"name": "File write boundary trigger",
"description": "The contract includes an objective check for the failed file write scenario (e.g. file exists, size > 0, checksum, timestamp)",
"max_score": 10
},
{
"name": "API integration boundary trigger",
"description": "The contract includes an objective check for the broken API integration scenario (e.g. HTTP status, required fields, response time)",
"max_score": 10
},
{
"name": "Cache freshness boundary trigger",
"description": "The contract includes an objective check for the cached configuration currency scenario (e.g. timestamp age, max_age threshold)",
"max_score": 10
},
{
"name": "Failure classification present",
"description": "The contract classifies each scenario's failure as either operational or critical",
"max_score": 8
},
{
"name": "Design principle documented",
"description": "design_notes.md explicitly states the preference for objective checks over confidence or narrative assessment as a design principle",
"max_score": 10
},
{
"name": "Unknown state principle documented",
"description": "design_notes.md explicitly addresses how to handle cases where state cannot be verified (treating it as risk, not success)",
"max_score": 8
},
{
"name": "Five-column table",
"description": "contract.md contains a table with at minimum columns for: Boundary, Invariants, Trigger, Failure Class, and Action",
"max_score": 5
}
]
}
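The three boundary scenarios in the checklist (file write, API integration, cache freshness) could be checked with objective triggers along the lines of the sketch below. All function names, the choice of `200` as the only passing status, and the specific failure-class assignments are assumptions for illustration; the rubric only requires that each trigger be measurable and that each failure be classified as operational or critical.

```python
import os
import time
from typing import Optional

OPERATIONAL = "operational"
CRITICAL = "critical"


def check_file_write(path: str) -> Optional[str]:
    """Return a failure class if the file is missing or empty, else None."""
    try:
        size = os.path.getsize(path)
    except OSError:
        return CRITICAL  # assumed classification: the write did not land
    return None if size > 0 else CRITICAL


def check_api_response(status: Optional[int], body: Optional[dict],
                       required: tuple) -> Optional[str]:
    """Check HTTP status and required fields of an API response."""
    if status is None or body is None:
        # Unverifiable state is treated as at least an operational risk.
        return OPERATIONAL
    if status != 200:
        return CRITICAL
    missing = [key for key in required if key not in body]
    return CRITICAL if missing else None


def check_cache_freshness(written_at: Optional[float], max_age_s: float,
                          now: Optional[float] = None) -> Optional[str]:
    """Compare the cache timestamp age against a max_age threshold."""
    if written_at is None:
        return OPERATIONAL  # unknown timestamp: not assumed fresh
    now = time.time() if now is None else now
    return OPERATIONAL if now - written_at > max_age_s else None
```

Each check returns `None` on success and a failure class otherwise, so none of the triggers depends on a confidence score or a narrative judgment.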