Diagnoses and routes failures by analyzing error patterns, classifying severity, and applying retry logic, suppression budgets, and escalation rules. Use when handling errors, troubleshooting failures, recovering from API errors or timeouts, deciding whether to retry or escalate an issue, or managing service outages and tool dependency failures. Applies to any scenario where a check has failed, evidence of success is missing, or an unresolved error needs a structured response. Includes explicit untrusted-content/prompt-injection guardrails for third-party inputs.
Score: 98 · Quality: 94% (Does it follow best practices?) · Impact: 100% · 1.16x average score across 9 eval scenarios
A media streaming company runs a transcoding pipeline that processes user-uploaded videos. One particular transcoding worker occasionally fails with a "codec mismatch" error on certain file types. This error is not immediately catastrophic — the job can be retried — but the same failure sometimes persists for hours or even an entire day without being noticed because each individual retry looks like a new event to the monitoring system.
The on-call engineering team has been burned by this pattern twice: a "codec mismatch" failure quietly retried for 18 hours, consuming compute budget and delaying user content, until a human noticed. They want a failure tracking module that remembers the history of a recurring failure type and automatically escalates when the problem has persisted too long or recurred too many times, rather than treating each attempt as fresh.
Write a Python module recurrence_tracker.py that tracks the history of each recurring failure type and, on every check, decides whether to suppress (allow a retry) or escalate, based on how many times the failure has recurred and how long it has persisted since it was first seen.
Also write a demo_tracker.py script that simulates a recurring failure being checked multiple times (at least 5 checks) and prints the suppress/escalate decision at each step. Use mocked timestamps to demonstrate the escalation trigger (do not rely on actual sleep delays).
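One possible shape for this, sketched below with assumed names and thresholds (`RecurrenceTracker`, `max_count`, `max_duration_s` are illustrative, not part of the spec): the tracker keys state by failure type, remembers the first-seen timestamp and occurrence count, and escalates once either budget is exceeded. Timestamps are passed in explicitly so the demo can mock time instead of sleeping.

```python
from dataclasses import dataclass


@dataclass
class FailureRecord:
    """History for one recurring failure type."""
    first_seen: float  # mocked epoch seconds of first occurrence
    last_seen: float
    count: int = 1


class RecurrenceTracker:
    """Decides suppress vs. escalate for recurring failures.

    Escalates when a failure type has recurred more than `max_count`
    times, or has persisted for at least `max_duration_s` seconds
    since it was first seen.
    """

    def __init__(self, max_count: int = 3, max_duration_s: float = 3600.0):
        self.max_count = max_count
        self.max_duration_s = max_duration_s
        self._records: dict[str, FailureRecord] = {}

    def check(self, failure_type: str, now: float) -> str:
        """Record one occurrence at timestamp `now`; return 'suppress' or 'escalate'."""
        rec = self._records.get(failure_type)
        if rec is None:
            # First sighting: start tracking and allow a retry.
            self._records[failure_type] = FailureRecord(first_seen=now, last_seen=now)
            return "suppress"
        rec.count += 1
        rec.last_seen = now
        too_many = rec.count > self.max_count
        too_long = (now - rec.first_seen) >= self.max_duration_s
        return "escalate" if too_many or too_long else "suppress"


if __name__ == "__main__":
    # demo_tracker.py equivalent: 5 checks with mocked timestamps,
    # no real sleeps. The 4th check trips the recurrence budget.
    tracker = RecurrenceTracker(max_count=3, max_duration_s=4 * 3600)
    t0 = 1_700_000_000.0
    for i, offset in enumerate([0, 1800, 3600, 5400, 14400]):
        decision = tracker.check("codec mismatch", now=t0 + offset)
        print(f"check {i + 1} at +{offset}s: {decision}")
```

Keeping `now` as an argument (rather than calling `time.time()` inside `check`) is what makes the escalation trigger testable with mocked timestamps, as the demo requirement asks.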