Diagnose failed checks, flaky runs, API/tool errors, missing evidence, false-success risk, and unresolved incidents by classifying failure severity, applying bounded retries or suppression budgets, and deciding when to escalate. Use when troubleshooting jobs, pipelines, automations, integrations, agent/tool failures, timeouts, rate limits, stale read-backs, or any situation where you need a clear retry-vs-escalate decision instead of ad hoc recovery.
Turn a failure signal into a repeatable retry / suppress / escalate decision.
Severity tiers:
- cosmetic: no meaningful user-impact risk; output still usable
- operational: reliability, completeness, latency, or confidence problem; outcome may still be recoverable
- critical: likely user impact, data loss, unsafe side effect, or false-success risk

Actions per tier:
- cosmetic → retry up to 2 times, then log
- operational → retry up to 3 times inside a suppression window; escalate when the budget is exhausted
- critical → stop the autonomous path for this failure and escalate immediately

If the state is unknown, contradictory, or not verifiable, classify it as at least operational.
If a success claim cannot be proved, treat that as a failure signal.
If round-trip verification suggests data loss, stale state, or unsafe side effects, classify as critical.
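To make the ladder concrete, here is a minimal Python sketch; Signal and its fields, retry, log, and escalate are hypothetical names chosen for illustration, not a prescribed interface:

from dataclasses import dataclass
from enum import Enum
from typing import Callable

class Tier(Enum):
    COSMETIC = "cosmetic"
    OPERATIONAL = "operational"
    CRITICAL = "critical"

# Retry budgets from the ladder above; critical never retries autonomously.
RETRY_BUDGET = {Tier.COSMETIC: 2, Tier.OPERATIONAL: 3}

@dataclass
class Signal:
    # Hypothetical evidence flags; real signals carry richer context.
    user_impact_likely: bool = False
    data_loss_suspected: bool = False
    unsafe_side_effect: bool = False
    false_success_risk: bool = False
    state_verifiable: bool = True

def classify(sig: Signal) -> Tier:
    if (sig.user_impact_likely or sig.data_loss_suspected
            or sig.unsafe_side_effect or sig.false_success_risk):
        return Tier.CRITICAL
    if not sig.state_verifiable:
        # Unknown, contradictory, or unverifiable state: at least operational.
        return Tier.OPERATIONAL
    return Tier.COSMETIC

def handle(sig: Signal, retry: Callable[[], bool],
           log: Callable[[str], None], escalate: Callable[[str], None]) -> None:
    tier = classify(sig)
    if tier is Tier.CRITICAL:
        escalate("critical failure: stopping the autonomous path")
        return
    for _ in range(RETRY_BUDGET[tier]):
        if retry():  # retry() reports success; stop as soon as it recovers
            return
    if tier is Tier.OPERATIONAL:
        escalate("operational retry budget exhausted")  # see the suppression ladder below
    else:
        log("cosmetic failure persisted after retries")

A stale read-back, for instance, would set state_verifiable=False and land in the operational tier.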
Use the suppression ladder below when an operational failure keeps recurring.
Track recurring operational failures with a record keyed by failure signature, holding a recurrence count and a first_seen timestamp. Escalate when either the time window or the recurrence ceiling is exceeded.
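A minimal sketch of the budget check follows, assuming load, save, and clear are persistence hooks for the record, suppress and escalate are the action hooks, and MAX_WINDOW and MAX_RECURRENCE hold the time window and recurrence ceiling; none of these names is prescribed: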
record = load(failure_key) or {"count": 0, "first_seen": now}
record["count"] += 1
elapsed = now - record["first_seen"]
if record["count"] > MAX_RECURRENCE or elapsed > MAX_WINDOW:
    escalate(record)  # budget exhausted: hand off and stop suppressing
    clear(failure_key)
else:
    suppress(record)  # within budget: hold the alert and keep counting
    save(failure_key, record)
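Note that clear(failure_key) resets the record once a failure escalates, so a later recurrence of the same signature starts a fresh budget rather than re-escalating on its first occurrence.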
Use this shape when writing the final result:
Failure signal:
Evidence observed:
Tier assigned:
Action taken:
Escalation status:
Next safe step:
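For example, a hypothetical rate-limited export job might report:

Failure signal: HTTP 429 from the export API
Evidence observed: three consecutive 429 responses; read-back shows no partial writes
Tier assigned: operational
Action taken: retried 3 times inside the suppression window; budget exhausted
Escalation status: escalated with the suppression record attached
Next safe step: hold further export attempts until the rate limit is reviewed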