
coding-agent-helpers/compact-debug-ledger

Use when a debugging thread needs to be compressed into a reusable investigation ledger. Capture the target, evidence, attempted fixes, ruled-out hypotheses, viable hypotheses, and next experiments. Good triggers include "compact this debugging session", "summarize what we've tried", and "turn this into a debugging ledger".

Quality: 100% (Does it follow best practices?). Impact: 99% (3.66x), average score across 8 eval scenarios. Security by Snyk: Passed, no known issues.


task.md
evals/scenario-5/

Compressing a Multi-System Debugging Session

Problem Description

A platform team has been investigating a complex outage that involves multiple interacting systems. After a long war-room session, you need to write up the investigation in a form that can be shared with the incident commander — someone who needs to quickly understand what's broken without reading 3 hours of call notes.

Create a compact incident investigation record from the session notes below and save it as incident_record.md.

Input Files

The following file is provided as input. Extract it before beginning.

=============== FILE: inputs/warroom_notes.md ===============

P0 Incident War Room Notes — 2024-03-15

Initial Report (14:00)

Users reporting: checkout fails, orders not confirmed, some users seeing duplicate charge emails, search returns stale results, recommendation tiles showing wrong items, admin dashboard throwing 502s.

Investigation Timeline

14:05 - Initial triage

Split into sub-teams: payments, search/recommendations, infrastructure.

14:20 - Payments team findings

  • Order service DB writes are succeeding
  • Stripe webhooks are processing normally
  • Duplicate charge emails: traced to notification service calling Stripe API and email service in a non-atomic way; if email fails, retry logic re-calls Stripe; this is a pre-existing bug that was triggered by email service latency spike
  • Email service latency: 8x normal (300ms → 2.4s)
  • Root cause of email latency: email provider (SendGrid) having a partial outage (confirmed via status page)
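The duplicate-charge mechanism above can be sketched in a few lines of Python. This is a toy model, not the notification service's actual code: the stand-in charge and email calls are hypothetical, and the hotfix is modeled as the "check for duplicate before retry" described in the 15:30 mitigation notes.

```python
# Toy model of the non-atomic notify path: charge, then email, with
# one retry. The buggy version re-runs the charge when only the
# email failed; the fixed version checks for an existing charge first.

class FakeStripe:
    """Hypothetical stand-in for the payment API; records charges."""
    def __init__(self):
        self.charges = []

    def charge(self, order_id):
        self.charges.append(order_id)

def notify_buggy(stripe, order_id, email_ok):
    # Pre-existing bug: retry loop restarts from the charge step.
    for attempt in range(2):          # one retry
        stripe.charge(order_id)       # re-charged if email failed
        if email_ok[attempt]:
            return True
    return False

def notify_fixed(stripe, order_id, email_ok):
    # Hotfix: only charge if this order has not been charged yet,
    # so a retry re-sends the email without re-charging.
    for attempt in range(2):
        if order_id not in stripe.charges:
            stripe.charge(order_id)
        if email_ok[attempt]:
            return True
    return False

# Email fails once (latency spike), then succeeds on retry.
buggy = FakeStripe()
notify_buggy(buggy, "order-1", email_ok=[False, True])
print(len(buggy.charges))   # 2: duplicate charge

fixed = FakeStripe()
notify_fixed(fixed, "order-1", email_ok=[False, True])
print(len(fixed.charges))   # 1: single charge
```

Under normal email latency both versions behave identically; the divergence only appears once the email call starts failing, which is why the bug lay dormant until the SendGrid outage.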

14:45 - Search/Recommendations team findings

  • Elasticsearch cluster: 2 of 5 data nodes are not responding
  • ES cluster health: red (two primary shards unassigned)
  • Cause: rolling restart of ES nodes was started at 13:50 as routine maintenance, did not complete before traffic spike
  • Recommendation service: dependent on ES for collaborative filtering queries; falling back to cold-start recommendations (explains wrong tiles)
  • Search stale results: cache TTL issue, cache was not invalidated when ES went into red state
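The stale-cache problem and its eventual mitigation (an ES health check before serving cached results, from the 15:30 notes) can be sketched as follows. The cache layout and the health-check signature are assumptions for illustration, not the real search service or Elasticsearch API.

```python
# Sketch: gate cached search results on cluster health, so a red
# cluster bypasses the cache instead of serving stale entries.

CACHE = {"laptops": ["old-result-1", "old-result-2"]}  # pre-outage cache

def search(query, es_health, live_search):
    # Before the fix, a red cluster still served whatever was cached.
    if es_health == "green" and query in CACHE:
        return CACHE[query]
    results = live_search(query)      # may degrade while ES is red
    if es_health == "green":
        CACHE[query] = results        # only cache trustworthy results
    return results

fresh = search("laptops", "red", lambda q: ["fresh-result"])
print(fresh)   # ['fresh-result']: cache bypassed while the cluster is red
```

A plain TTL alone could not catch this, because the cached entries were still within their TTL when the cluster went red; the health check ties cache validity to the state that actually invalidated it.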

15:10 - Infrastructure team findings

  • Admin dashboard 502s: load balancer health checks failing for admin service
  • Admin service: running fine on 2/3 instances; 1 instance OOM-killed at 13:58 due to memory spike
  • Memory spike: caused by a cron job that runs at 14:00 to pre-compute analytics reports; runs fine normally but takes 2x memory when ES queries are slow (due to retries with large response buffers)
  • So: ES partial outage → admin analytics cron uses more memory → OOM on one instance → 502s for ~33% of requests
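The "2x memory" step in that chain follows from simple arithmetic: each retry against a slow cluster keeps an extra large response buffer alive per in-flight query. A back-of-envelope sketch, with illustrative numbers that are not from the incident:

```python
# Peak memory roughly scales with buffers held at once: queries in
# flight times buffered attempts per query times buffer size.

BUFFER_MB = 256   # illustrative size of one buffered ES response

def peak_memory_mb(in_flight_queries, attempts_buffered):
    # attempts_buffered is 1 on a healthy cluster; retries against a
    # slow cluster kept 2 buffers alive per query, doubling memory.
    return in_flight_queries * attempts_buffered * BUFFER_MB

print(peak_memory_mb(4, 1))   # 1024: normal cron run
print(peak_memory_mb(4, 2))   # 2048: ES slow, retries double it
```

This is why the cron "runs fine normally": the doubling only materializes when ES latency pushes queries into the retry path, which is exactly the state the rolling restart left the cluster in at 14:00.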

15:30 - Mitigation actions

  • SendGrid outage: no action possible, monitoring
  • ES cluster: rolling restart completed at 15:25, cluster health now green, search/recommendations recovering
  • Email notification retry loop: deployed hotfix to check for duplicate before retry (resolves duplicate emails going forward)
  • Admin service: restarted OOM'd instance, now 3/3 healthy
  • Cache invalidation: added ES health check before serving cached search results

15:45 - Status

All symptoms resolved. Root causes: (1) SendGrid partial outage causing email latency → duplicate charge emails, (2) routine ES maintenance not completed before traffic spike → search/recs degraded → admin memory spike → 502s.
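The notes above could be compressed into a record shaped like the following. The headings map to the ledger fields from the tile description (target, evidence, attempted fixes, ruled-out hypotheses, viable hypotheses, next experiments); the exact headings and wording are an assumption, not a mandated format.

```markdown
# Incident Record: P0, 2024-03-15

## Target
Checkout confirmations failing, duplicate charge emails, stale search,
wrong recommendation tiles, admin dashboard 502s.

## Evidence
- Email latency 8x normal (300ms -> 2.4s); SendGrid partial outage confirmed.
- ES cluster red: 2 of 5 data nodes down mid rolling restart (started 13:50).
- One admin instance OOM-killed at 13:58 during the 14:00 analytics cron.

## Ruled Out
- Order service DB writes (succeeding).
- Stripe webhook processing (normal).

## Fixes Applied
- Completed ES rolling restart (15:25, cluster green).
- Hotfix: duplicate check before notification retry.
- Restarted OOM'd admin instance; ES health check added before cached search reads.

## Root Causes
1. SendGrid outage -> email latency -> non-atomic retry re-charged orders.
2. Incomplete ES maintenance -> search/recs degraded -> cron memory spike -> 502s.

## Next Experiments
- None open; all symptoms resolved at 15:45. Monitor SendGrid recovery.
```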
