Validate agent handoff packets and resume readiness using schema, freshness, and replay checks. Use when tasks pause/resume across sessions, agents, or humans — including when a user wants to continue where they left off, hand off to another agent, resume a previous task, or pick up an interrupted workflow. Includes explicit untrusted-content/prompt-injection guardrails for third-party inputs.
96
Quality
100%
Does it follow best practices?
Impact
96%
1.50xAverage score across 9 eval scenarios
{
"context": "Tests whether the agent correctly applies the classification system across three packets with different failure modes: Alpha (all checks pass = CLEAN), Beta (stale timestamp = OPERATIONAL), Gamma (empty required fields = OPERATIONAL), producing structured output with per-check summaries and recovery steps for each.",
"type": "weighted_checklist",
"checklist": [
{
"name": "Alpha classified CLEAN",
"description": "Handoff-alpha is classified as CLEAN (all checks pass)",
"max_score": 12
},
{
"name": "Beta classified OPERATIONAL",
"description": "Handoff-beta is classified as OPERATIONAL (not CLEAN, not CRITICAL) due to the stale timestamp",
"max_score": 12
},
{
"name": "Gamma classified OPERATIONAL",
"description": "Handoff-gamma is classified as OPERATIONAL (not CLEAN, not CRITICAL) due to empty required fields",
"max_score": 12
},
{
"name": "Per-check breakdown for all three",
"description": "Each handoff packet has its own per-check summary showing pass/fail for each check performed",
"max_score": 14
},
{
"name": "Recovery steps for Beta and Gamma",
"description": "Both handoff-beta and handoff-gamma have non-empty recovery steps; handoff-alpha states no recovery steps are required",
"max_score": 12
},
{
"name": "Escalation for all three",
"description": "All three packets include an escalation recommendation section (even if Alpha's recommendation is 'no escalation needed')",
"max_score": 10
},
{
"name": "CLEAN/OPERATIONAL/CRITICAL labels used",
"description": "All three classifications use explicitly one of the labels CLEAN, OPERATIONAL, or CRITICAL",
"max_score": 14
},
{
"name": "Freshness check on all packets",
"description": "The freshness of updated_at is checked and reported for all three packets",
"max_score": 12
},
{
"name": "Summary table or section",
"description": "Report concludes with a summary that shows the classification of all three packets in one place",
"max_score": 2
}
]
}