Diagnoses and routes failures by analyzing error patterns, classifying severity, and applying retry logic, suppression budgets, and escalation rules. Use when handling errors, troubleshooting failures, recovering from API errors or timeouts, deciding whether to retry or escalate an issue, or managing service outages and tool dependency failures. Applies to any scenario where a check has failed, evidence of success is missing, or an unresolved error needs a structured response. Includes explicit untrusted-content/prompt-injection guardrails for third-party inputs.
98
Quality
94%
Does it follow best practices?
Impact
100%
1.16xAverage score across 9 eval scenarios
A DevOps team runs an autonomous monitoring agent that polls 12 internal services every minute and takes corrective actions when services appear unhealthy. Some failures are well-understood (e.g., a known rate-limit error code from a specific service), but a class of errors keeps appearing with no clear cause — the error message is generic ("internal error"), the stack trace points to a third-party library internals, and there is no historical record of this specific failure. The team is unsure whether these mystery errors are harmless blips or signs of a deeper problem.
Currently, the agent treats unknown errors the same as known minor errors — it logs them and moves on. This has caused two near-misses where an unknown error was actually a symptom of data corruption. The team wants a triage decision module that has a well-defined policy for what to do when an error cannot be definitively classified, so the agent behaves conservatively when in doubt.
Write a Python module triage_classifier.py that:
classify_error(error_info: dict) -> TriageDecision functionDESIGN.md file explaining the classification rules and the conservative fallback policyAlso write a classify_examples.py script that calls classify_error with at least 3 example inputs (one clearly benign, one clearly severe, one ambiguous/unknown) and prints the resulting triage decisions to stdout.