
coding-agent-helpers/compact-debug-ledger

Use when a debugging thread needs to be compressed into a reusable investigation ledger. Capture the target, evidence, attempted fixes, ruled-out hypotheses, viable hypotheses, and next experiments. Good triggers include "compact this debugging session", "summarize what we've tried", and "turn this into a debugging ledger".

Quality: 100% (Does it follow best practices?). Impact: 99% (3.66x), average score across 8 eval scenarios. Security by Snyk: Passed, no known issues.


task.md
evals/scenario-5/

Compressing a Multi-System Debugging Session

Problem Description

A platform team has been investigating a complex outage that involves multiple interacting systems. After a long war-room session, you need to write up the investigation in a form that can be shared with the incident commander — someone who needs to quickly understand what's broken without reading 3 hours of call notes.

Create a compact incident investigation record from the session notes below and save it as incident_record.md.

Input Files

The following file is provided as input. Extract it before beginning.

=============== FILE: inputs/warroom_notes.md ===============

P0 Incident War Room Notes — 2024-03-15

Initial Report (14:00)

Users reporting: checkout fails, orders not confirmed, some users seeing duplicate charge emails, search returns stale results, recommendation tiles showing wrong items, admin dashboard throwing 502s.

Investigation Timeline

14:05 - Initial triage

Split into sub-teams: payments, search/recommendations, infrastructure.

14:20 - Payments team findings

  • Order service DB writes are succeeding
  • Stripe webhooks are processing normally
  • Duplicate charge emails: traced to notification service calling Stripe API and email service in a non-atomic way; if email fails, retry logic re-calls Stripe; this is a pre-existing bug that was triggered by email service latency spike
  • Email service latency: 8x normal (300ms → 2.4s)
  • Root cause of email latency: email provider (SendGrid) having a partial outage (confirmed via status page)
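The duplicate-charge mechanism above can be sketched in a few lines of Python. This is a toy model, not the notification service's actual code: the stand-in charge and email calls are hypothetical, and the hotfix is modeled as the "check for duplicate before retry" described in the 15:30 mitigation notes.

```python
# Toy model of the non-atomic notify path: charge, then email, with
# one retry. The buggy version re-runs the charge when only the
# email failed; the fixed version checks for an existing charge first.

class FakeStripe:
    """Hypothetical stand-in for the payment API; records charges."""
    def __init__(self):
        self.charges = []

    def charge(self, order_id):
        self.charges.append(order_id)

def notify_buggy(stripe, order_id, email_ok):
    # Pre-existing bug: retry loop restarts from the charge step.
    for attempt in range(2):          # one retry
        stripe.charge(order_id)       # re-charged if email failed
        if email_ok[attempt]:
            return True
    return False

def notify_fixed(stripe, order_id, email_ok):
    # Hotfix: only charge if this order has not been charged yet,
    # so a retry re-sends the email without re-charging.
    for attempt in range(2):
        if order_id not in stripe.charges:
            stripe.charge(order_id)
        if email_ok[attempt]:
            return True
    return False

# Email fails once (latency spike), then succeeds on retry.
buggy = FakeStripe()
notify_buggy(buggy, "order-1", email_ok=[False, True])
print(len(buggy.charges))   # 2: duplicate charge

fixed = FakeStripe()
notify_fixed(fixed, "order-1", email_ok=[False, True])
print(len(fixed.charges))   # 1: single charge
```

Under normal email latency both versions behave identically; the divergence only appears once the email call starts failing, which is why the bug lay dormant until the SendGrid outage.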

14:45 - Search/Recommendations team findings

  • Elasticsearch cluster: 2 of 5 data nodes are not responding
  • ES cluster health: red (two primary shards unassigned)
  • Cause: rolling restart of ES nodes was started at 13:50 as routine maintenance, did not complete before traffic spike
  • Recommendation service: dependent on ES for collaborative filtering queries; falling back to cold-start recommendations (explains wrong tiles)
  • Search stale results: cache TTL issue, cache was not invalidated when ES went into red state
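The stale-cache problem and its eventual mitigation (an ES health check before serving cached results, from the 15:30 notes) can be sketched as follows. The cache layout and the health-check signature are assumptions for illustration, not the real search service or Elasticsearch API.

```python
# Sketch: gate cached search results on cluster health, so a red
# cluster bypasses the cache instead of serving stale entries.

CACHE = {"laptops": ["old-result-1", "old-result-2"]}  # pre-outage cache

def search(query, es_health, live_search):
    # Before the fix, a red cluster still served whatever was cached.
    if es_health == "green" and query in CACHE:
        return CACHE[query]
    results = live_search(query)      # may degrade while ES is red
    if es_health == "green":
        CACHE[query] = results        # only cache trustworthy results
    return results

fresh = search("laptops", "red", lambda q: ["fresh-result"])
print(fresh)   # ['fresh-result']: cache bypassed while the cluster is red
```

A plain TTL alone could not catch this, because the cached entries were still within their TTL when the cluster went red; the health check ties cache validity to the state that actually invalidated it.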

15:10 - Infrastructure team findings

  • Admin dashboard 502s: load balancer health checks failing for admin service
  • Admin service: running fine on 2/3 instances; 1 instance OOM-killed at 13:58 due to memory spike
  • Memory spike: caused by a cron job that runs at 14:00 to pre-compute analytics reports; runs fine normally but takes 2x memory when ES queries are slow (due to retries with large response buffers)
  • So: ES partial outage → admin analytics cron uses more memory → OOM on one instance → 502s for ~33% of requests
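The "2x memory" step in that chain follows from simple arithmetic: each retry against a slow cluster keeps an extra large response buffer alive per in-flight query. A back-of-envelope sketch, with illustrative numbers that are not from the incident:

```python
# Peak memory roughly scales with buffers held at once: queries in
# flight times buffered attempts per query times buffer size.

BUFFER_MB = 256   # illustrative size of one buffered ES response

def peak_memory_mb(in_flight_queries, attempts_buffered):
    # attempts_buffered is 1 on a healthy cluster; retries against a
    # slow cluster kept 2 buffers alive per query, doubling memory.
    return in_flight_queries * attempts_buffered * BUFFER_MB

print(peak_memory_mb(4, 1))   # 1024: normal cron run
print(peak_memory_mb(4, 2))   # 2048: ES slow, retries double it
```

This is why the cron "runs fine normally": the doubling only materializes when ES latency pushes queries into the retry path, which is exactly the state the rolling restart left the cluster in at 14:00.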

15:30 - Mitigation actions

  • SendGrid outage: no action possible, monitoring
  • ES cluster: rolling restart completed at 15:25, cluster health now green, search/recommendations recovering
  • Email notification retry loop: deployed hotfix to check for duplicate before retry (resolves duplicate emails going forward)
  • Admin service: restarted OOM'd instance, now 3/3 healthy
  • Cache invalidation: added ES health check before serving cached search results

15:45 - Status

All symptoms resolved. Root causes: (1) SendGrid partial outage causing email latency → duplicate charge emails, (2) routine ES maintenance not completed before traffic spike → search/recs degraded → admin memory spike → 502s.
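The notes above could be compressed into a record shaped like the following. The headings map to the ledger fields from the tile description (target, evidence, attempted fixes, ruled-out hypotheses, viable hypotheses, next experiments); the exact headings and wording are an assumption, not a mandated format.

```markdown
# Incident Record: P0, 2024-03-15

## Target
Checkout confirmations failing, duplicate charge emails, stale search,
wrong recommendation tiles, admin dashboard 502s.

## Evidence
- Email latency 8x normal (300ms -> 2.4s); SendGrid partial outage confirmed.
- ES cluster red: 2 of 5 data nodes down mid rolling restart (started 13:50).
- One admin instance OOM-killed at 13:58 during the 14:00 analytics cron.

## Ruled Out
- Order service DB writes (succeeding).
- Stripe webhook processing (normal).

## Fixes Applied
- Completed ES rolling restart (15:25, cluster green).
- Hotfix: duplicate check before notification retry.
- Restarted OOM'd admin instance; ES health check added before cached search reads.

## Root Causes
1. SendGrid outage -> email latency -> non-atomic retry re-charged orders.
2. Incomplete ES maintenance -> search/recs degraded -> cron memory spike -> 502s.

## Next Experiments
- None open; all symptoms resolved at 15:45. Monitor SendGrid recovery.
```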
