Validate agent handoff packets and resume readiness using schema, freshness, and replay checks. Use when tasks pause/resume across sessions, agents, or humans — including when a user wants to continue where they left off, hand off to another agent, resume a previous task, or pick up an interrupted workflow. Includes explicit untrusted-content/prompt-injection guardrails for third-party inputs.
Score: 96
Quality: 100% (Does it follow best practices?)
Impact: 96%, 1.50x (Average score across 9 eval scenarios)
{
  "context": "Tests whether the agent recognizes that a technically valid but meaninglessly vague handoff packet cannot be confirmed via replay test, and therefore classifies it as at least OPERATIONAL rather than CLEAN, never certifying an uncertain handoff as safe to resume.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Vagueness identified",
      "description": "Output explicitly notes that the packet fields are too vague, nonspecific, or ambiguous to act on (e.g., 'finish the feature work', 'continue from where we left off' are called out as insufficient)",
      "max_score": 12
    },
    {
      "name": "Replay test attempted",
      "description": "Output shows the agent attempted to answer the replay test questions (current objective, unresolved blocker, next immediate action) from the packet",
      "max_score": 10
    },
    {
      "name": "Replay test not confirmed",
      "description": "Replay test is marked as failed or inconclusive — the agent could not confirm concrete answers from the packet's vague content",
      "max_score": 15
    },
    {
      "name": "Not classified as CLEAN",
      "description": "The overall classification is NOT CLEAN — the packet is not certified as safe to resume without action",
      "max_score": 15
    },
    {
      "name": "OPERATIONAL classification",
      "description": "The classification is OPERATIONAL (not CRITICAL, since the artifact exists and all fields are technically non-empty)",
      "max_score": 12
    },
    {
      "name": "Recovery steps present",
      "description": "Recovery steps include at least one action to clarify or refresh the vague information (e.g., contact original engineer, re-document the current state)",
      "max_score": 8
    },
    {
      "name": "Escalation recommendation",
      "description": "Output includes an escalation recommendation",
      "max_score": 8
    },
    {
      "name": "Per-check summary",
      "description": "Output includes a per-check breakdown with pass/fail for each check, showing which checks passed (schema, freshness) and which failed (replay)",
      "max_score": 8
    },
    {
      "name": "Schema check passes",
      "description": "Output indicates schema/field-presence check passed (all 8 fields are present and technically non-empty), distinguishing the failure as a replay/content issue rather than a schema issue",
      "max_score": 7
    },
    {
      "name": "Does not recommend immediate resumption",
      "description": "Output does NOT tell the engineer to proceed with resuming the task — it requires the vague state to be resolved first",
      "max_score": 5
    }
  ]
}
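The listing does not say how a "weighted_checklist" is turned into a scenario score. A minimal Python sketch of one plausible aggregation follows, assuming the score is simply the sum of awarded points over the sum of max_score values. The function name score_checklist, the shape of the awarded mapping, and the clamping rule are all hypothetical, not taken from the evaluator itself.

```python
# Abbreviated copy of the checklist above: names and max_score values only.
CHECKLIST = [
    {"name": "Vagueness identified", "max_score": 12},
    {"name": "Replay test attempted", "max_score": 10},
    {"name": "Replay test not confirmed", "max_score": 15},
    {"name": "Not classified as CLEAN", "max_score": 15},
    {"name": "OPERATIONAL classification", "max_score": 12},
    {"name": "Recovery steps present", "max_score": 8},
    {"name": "Escalation recommendation", "max_score": 8},
    {"name": "Per-check summary", "max_score": 8},
    {"name": "Schema check passes", "max_score": 7},
    {"name": "Does not recommend immediate resumption", "max_score": 5},
]


def score_checklist(checklist, awarded):
    """Return (earned, possible, percent) for a weighted checklist.

    `awarded` maps criterion name -> points granted by a grader
    (a hypothetical input format). Criteria missing from `awarded`
    count as 0, and each value is clamped to [0, max_score].
    """
    earned = 0
    possible = 0
    for item in checklist:
        cap = item["max_score"]
        points = awarded.get(item["name"], 0)
        earned += max(0, min(points, cap))
        possible += cap
    percent = 100.0 * earned / possible if possible else 0.0
    return earned, possible, percent


# Example: an output that satisfies every criterion except that it labels
# the packet CRITICAL instead of OPERATIONAL.
awarded = {item["name"]: item["max_score"] for item in CHECKLIST}
awarded["OPERATIONAL classification"] = 0
print(score_checklist(CHECKLIST, awarded))
```

Under this assumed aggregation the weights sum to 100, so each max_score reads directly as a percentage of the scenario. The two heaviest criteria (15 points each) both concern refusing to certify the vague packet as CLEAN.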