Validate agent handoff packets and resume readiness using schema, freshness, and replay checks. Use when tasks pause/resume across sessions, agents, or humans — including when a user wants to continue where they left off, hand off to another agent, resume a previous task, or pick up an interrupted workflow. Includes explicit untrusted-content/prompt-injection guardrails for third-party inputs.
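As a rough illustration of the replay check, the sketch below tries to answer the three replay questions (objective, blockers, next action) from a packet and flags a packet whose completed work reappears as the next action. The `Packet` class, the `objective` and `blockers` field names, the substring-matching heuristic, and the two-label `classify` helper are all assumptions made for this example, not the skill's actual implementation; only `completed`, `next_action`, CLEAN, and OPERATIONAL are names taken from the eval scenario.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    # Hypothetical packet shape; only `completed` and `next_action`
    # are field names taken from the eval scenario.
    objective: str
    completed: list[str]
    blockers: list[str]
    next_action: str


def replay_check(packet: Packet) -> tuple[bool, list[str]]:
    """Return (passed, problems) for a simple replay test."""
    problems: list[str] = []
    if not packet.objective.strip():
        problems.append("objective is missing")
    if not packet.next_action.strip():
        problems.append("next_action is missing")
    # Naive contradiction heuristic (an assumption): a completed item
    # whose text reappears inside next_action is treated as a conflict.
    for task in packet.completed:
        needle = task.strip().lower()
        if needle and needle in packet.next_action.lower():
            problems.append(
                f"contradiction: '{task}' is completed but next_action restarts it"
            )
    return (not problems, problems)


def classify(checks: dict[str, bool]) -> str:
    # Assumes only the two labels named in the scenario; a real
    # classifier may define more.
    return "CLEAN" if all(checks.values()) else "OPERATIONAL"
```

With a packet whose `completed` list contains "run schema migration" and whose `next_action` is "Begin: run schema migration on staging", `replay_check` fails and `classify` returns OPERATIONAL whenever the replay entry is false.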
96
Quality
100%
Does it follow best practices?
Impact
96%
1.50x
Average score across 9 eval scenarios
{
"context": "Tests whether the agent performs a replay test on the handoff packet, detects that the documented state is contradictory (completed says work is done; next_action says to start it), refuses to mark the handoff as successful, and classifies as OPERATIONAL.",
"type": "weighted_checklist",
"checklist": [
{
"name": "Replay test attempted",
"description": "Output shows that a replay test was performed — i.e., the agent attempted to answer questions about the current objective, any unresolved blockers, and the next immediate action from the packet",
"max_score": 12
},
{
"name": "Contradiction identified",
"description": "Output explicitly identifies the contradiction: completed states work is done, but next_action says to begin the same work",
"max_score": 18
},
{
"name": "Replay test marked failed",
"description": "The replay test result is marked as failed or inconclusive — not passed",
"max_score": 12
},
{
"name": "Not classified as CLEAN",
"description": "The overall classification is NOT CLEAN — the handoff is not marked as safe to resume without action",
"max_score": 12
},
{
"name": "OPERATIONAL classification",
"description": "The classification is specifically OPERATIONAL (reflecting that the packet exists but has integrity issues)",
"max_score": 10
},
{
"name": "Does not mark handoff successful",
"description": "Output does NOT state the handoff is successful, valid, or safe to proceed — it requires resolution first",
"max_score": 12
},
{
"name": "Per-check summary",
"description": "Output includes a per-check breakdown covering at least schema, freshness, token, and replay checks with pass/fail for each",
"max_score": 8
},
{
"name": "Replay failure in summary",
"description": "The check summary explicitly shows the replay test as failed (not just passing it by omission)",
"max_score": 8
},
{
"name": "Escalation present",
"description": "Output includes an escalation recommendation",
"max_score": 8
}
]
}
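The checklist weights above sum to 100, so a grader's result can be read directly as a percentage. The sketch below shows one way such a weighted checklist might be scored. The `MAX_SCORES` table copies the names and weights from the rubric, while the `score` function and its clamping rule are assumptions for illustration; the evaluation harness's actual scoring logic is not shown in the source.

```python
# Weights copied from the rubric above.
MAX_SCORES = {
    "Replay test attempted": 12,
    "Contradiction identified": 18,
    "Replay test marked failed": 12,
    "Not classified as CLEAN": 12,
    "OPERATIONAL classification": 10,
    "Does not mark handoff successful": 12,
    "Per-check summary": 8,
    "Replay failure in summary": 8,
    "Escalation present": 8,
}


def score(awarded: dict[str, float]) -> float:
    """Return a percentage from per-item awarded points.

    Assumed rule: each item is clamped to [0, max_score], and items
    missing from `awarded` count as zero.
    """
    total_max = sum(MAX_SCORES.values())
    earned = 0.0
    for name, max_score in MAX_SCORES.items():
        earned += min(max(awarded.get(name, 0), 0), max_score)
    return 100 * earned / total_max
```

Under these assumptions, an output that satisfies every item except "OPERATIONAL classification" would score 90.0.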