markusdowne/error-triage-ladder

Diagnoses and routes failures by analyzing error patterns, classifying severity, and applying retry logic, suppression budgets, and escalation rules. Use when handling errors, troubleshooting failures, recovering from API errors or timeouts, deciding whether to retry or escalate an issue, or managing service outages and tool dependency failures. Applies to any scenario where a check has failed, evidence of success is missing, or an unresolved error needs a structured response. Includes explicit untrusted-content/prompt-injection guardrails for third-party inputs.

evals/scenario-6/task.md

Recurring Failure Escalation Service

Problem Description

A media streaming company runs a transcoding pipeline that processes user-uploaded videos. One particular transcoding worker occasionally fails with a "codec mismatch" error on certain file types. This error is not immediately catastrophic — the job can be retried — but the same failure sometimes persists for hours or even an entire day without being noticed because each individual retry looks like a new event to the monitoring system.

The on-call engineering team has been burned by this pattern twice: a "codec mismatch" failure quietly retried for 18 hours, consuming compute budget and delaying user content, until a human noticed. They want a failure tracking module that remembers the history of a recurring failure type and automatically escalates when the problem has persisted too long or recurred too many times, rather than treating each attempt as fresh.

Output Specification

Write a Python module recurrence_tracker.py that:

  • Tracks failure recurrences by a failure key (string identifier)
  • Decides whether to suppress or escalate based on recurrence count and elapsed time
  • Clears the tracking record after escalation
  • Includes configurable thresholds (MAX_RECURRENCE and MAX_WINDOW)
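The requirements above can be sketched roughly as follows. This is a minimal illustration, not a reference solution: the class name RecurrenceTracker, the check method, the threshold values, and the "escalate when either threshold is reached" rule are all assumptions, since the task only names MAX_RECURRENCE and MAX_WINDOW without giving values or exact semantics.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Assumed defaults; the task does not specify concrete values.
MAX_RECURRENCE = 3            # escalate on the 3rd occurrence of the same key
MAX_WINDOW = 6 * 60 * 60      # escalate once the failure has persisted 6 hours (seconds)


@dataclass
class RecurrenceTracker:
    """Hypothetical tracker: remembers first-seen time and count per failure key."""

    max_recurrence: int = MAX_RECURRENCE
    max_window: float = MAX_WINDOW
    # failure key -> (timestamp of first occurrence, occurrences so far)
    records: Dict[str, Tuple[float, int]] = field(default_factory=dict)

    def check(self, key: str, now: float) -> str:
        """Record one occurrence of `key` at time `now` and return a decision."""
        first_seen, count = self.records.get(key, (now, 0))
        count += 1
        too_many = count >= self.max_recurrence
        too_long = (now - first_seen) >= self.max_window
        if too_many or too_long:
            # Clear the record after escalation so tracking starts fresh.
            self.records.pop(key, None)
            return "escalate"
        self.records[key] = (first_seen, count)
        return "suppress"
```

Passing the timestamp in as an argument, rather than reading the system clock inside check, keeps the decision logic deterministic and easy to drive with mocked times.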

Also write a demo_tracker.py script that simulates a recurring failure being checked multiple times (at least 5 checks) and prints the suppress/escalate decision at each step. Use mocked timestamps to demonstrate the escalation trigger (do not rely on actual sleep delays).
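One way to mock timestamps without sleeping is to inject a fake clock that replays a fixed list of times. The sketch below is self-contained and uses an inline stand-in for the tracking logic; the function names make_fake_clock and run_demo, the threshold values, and the timestamp list are hypothetical choices for illustration only.

```python
from typing import Callable, Dict, List, Tuple


def make_fake_clock(timestamps: List[float]) -> Callable[[], float]:
    """Return a clock that yields the given timestamps one per call."""
    it = iter(timestamps)
    return lambda: next(it)


def run_demo(
    timestamps: List[float],
    key: str = "codec mismatch",
    max_recurrence: int = 10,      # assumed threshold
    max_window: float = 3600.0,    # assumed threshold, in seconds
) -> List[str]:
    """Check the same failure once per mocked timestamp and collect decisions."""
    clock = make_fake_clock(timestamps)
    records: Dict[str, Tuple[float, int]] = {}  # key -> (first_seen, count)
    decisions: List[str] = []
    for step in range(1, len(timestamps) + 1):
        now = clock()  # mocked time; no real delay happens here
        first_seen, count = records.get(key, (now, 0))
        count += 1
        if count >= max_recurrence or (now - first_seen) >= max_window:
            records.pop(key, None)  # clear the record after escalation
            decision = "escalate"
        else:
            records[key] = (first_seen, count)
            decision = "suppress"
        print(f"check {step} at t={now:.0f}s: {decision}")
        decisions.append(decision)
    return decisions


if __name__ == "__main__":
    # Five checks; the last one falls outside the one-hour window.
    run_demo([0, 600, 1200, 1800, 4000])
```

With these assumed values the first four checks are suppressed and the fifth escalates because more than 3600 seconds have elapsed since the first occurrence.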

Install with Tessl CLI

npx tessl i markusdowne/error-triage-ladder
