
coding-agent-helpers/compact-debug-ledger

Use when a debugging thread needs to be compressed into a reusable investigation ledger. Capture the target, evidence, attempted fixes, ruled-out hypotheses, viable hypotheses, and next experiments. Good triggers include "compact this debugging session", "summarize what we've tried", and "turn this into a debugging ledger".

Quality: 100% (Does it follow best practices?)

Impact: 99% (3.66x), average score across 8 eval scenarios

Security by Snyk: Passed, no known issues


evals/scenario-3/task.md

Preserving a Complex Intermittent Bug Investigation

Problem Description

You are a senior engineer on a platform team. Your team has spent the past week investigating an intermittent data corruption issue in a distributed transaction system. The investigation is ongoing, with several hypotheses explored and varying results. You're about to take a long weekend and need to hand off the investigation context to a colleague in a form they can immediately act on.

Produce a handoff document that captures the current investigation state from the investigation notes below. Save it as handoff.md.

Input Files

The following file is provided as input. Extract it before beginning.

=============== FILE: inputs/investigation_notes.md ===============

Distributed Transaction Corruption - Investigation Notes

Background

About 0.03% of financial transactions are being double-applied — a debit is processed twice instead of once. Issue started approximately 3 weeks ago. Cannot reproduce on demand; only observed in production.

Hypotheses Explored

H1: Clock skew between nodes causing duplicate message processing

Status: Investigated. All nodes synced to NTP within 2ms. Maximum observed skew is 8ms. The idempotency window is 500ms. Clock skew cannot explain duplicates in this case. Verdict: Ruled out.

H2: Kafka consumer group rebalancing causing message redelivery

Status: Investigated. Added consumer lag monitoring. Observed 4 rebalancing events in the past 72 hours, each coinciding with a duplicate transaction window. Rebalancing causes the consumer to replay uncommitted offsets. This is a credible cause. Verdict: Still under investigation — strong correlation but not confirmed causation.
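
The replay mechanism in H2 can be illustrated with a toy model. This is a minimal sketch in plain Python, not Kafka client code: the function name `consume` and its parameters are hypothetical, and it assumes offsets are committed every few messages and that the new partition owner resumes from the last committed offset.

```python
def consume(messages, commit_every, rebalance_after):
    """Toy at-least-once consumer; returns transaction IDs in processing order.

    commit_every: commit the offset after this many messages (assumed policy).
    rebalance_after: a single rebalance occurs once this many messages
    have been processed.
    """
    processed = []
    committed = 0      # offset a new owner would resume from
    offset = 0
    rebalanced = False
    while offset < len(messages):
        processed.append(messages[offset])
        offset += 1
        if offset % commit_every == 0:
            committed = offset
        if not rebalanced and len(processed) == rebalance_after:
            rebalanced = True
            offset = committed  # uncommitted messages are delivered again
    return processed


print(consume(["t1", "t2", "t3", "t4", "t5"], commit_every=2, rebalance_after=3))
# ['t1', 't2', 't3', 't3', 't4', 't5']
```

In this model the message processed after the last commit and before the rebalance (`t3`) is handled exactly twice, which matches the pairing pattern described under Key Evidence. It shows that the mechanism is sufficient to produce duplicates, not that it is the cause in production.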

H3: Redis idempotency key TTL too short

Status: Investigated. TTL is 30 minutes. Transaction processing time is <200ms. TTL is not the issue. Verdict: Ruled out.

H4: Race condition in the transaction coordinator lock acquisition

Status: Investigated. Added distributed tracing to lock acquisition. Found 2 cases where two workers both saw the lock as unacquired within a 3ms window. Could be a bug in the lock library (redlock). Not yet confirmed whether this explains all cases. Verdict: Still under investigation — potential lock library bug.

H5: Network partition causing the coordinator to process a retry it had already committed

Status: Not yet investigated. Plausible given recent network switch firmware update.

H6: Database connection pool misconfiguration causing transactions to interleave

Status: Investigated. Connection pool is isolated per worker. No interleaving possible by design. Verdict: Ruled out.

Key Evidence

  • Duplicate rate is 0.03%, approximately 12-15 cases per day
  • All duplicates involve the debit operation, never credit
  • Duplicates always occur in pairs — same transaction ID processed exactly twice
  • Event timestamps on duplicate pairs differ by 50-800ms
  • Kafka consumer rebalances correlate temporally with duplicate windows (4/4 observed cases)
  • redlock library version in use (v3.1.2) has an open GitHub issue about lock acquisition race under high load
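
The first bullet gives both a rate and a daily count, which together imply a daily transaction volume. A quick consistency check, assuming the 0.03% rate and the 12-15 cases per day describe the same population of transactions (the helper name is hypothetical):

```python
def implied_daily_volume(duplicates_per_day, rate_per_10k=3):
    """Daily transactions implied by a duplicate count at 3 per 10,000 (0.03%)."""
    return duplicates_per_day * 10_000 // rate_per_10k


print(implied_daily_volume(12), implied_daily_volume(15))
# 40000 50000
```

Under that assumption the system handles roughly 40,000 to 50,000 transactions per day. The notes do not state the actual volume, so this is only a cross-check of the two figures.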

Attempted Fixes

  • Added extra idempotency check at DB layer before commit → deployed 2 days ago → no duplicates observed in last 48h but sample size too small to be conclusive
  • Upgraded redlock to v3.2.1 → not yet deployed, pending QA
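
The notes do not show how the DB-layer idempotency check was implemented. One common shape is a uniqueness constraint on the transaction ID, written in the same database transaction as the debit. The sketch below is an assumption for illustration only: the table names `accounts` and `applied_debits` and the function `apply_debit_once` are hypothetical, and SQLite stands in for the production database.

```python
import sqlite3


def apply_debit_once(conn, txn_id, account, amount):
    """Apply a debit only if txn_id is new; return True if it was applied."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            # The primary key on txn_id rejects a second apply.
            conn.execute("INSERT INTO applied_debits (txn_id) VALUES (?)", (txn_id,))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, account),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: the debit is skipped


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
    conn.execute("CREATE TABLE applied_debits (txn_id TEXT PRIMARY KEY)")
    conn.execute("INSERT INTO accounts VALUES ('acct-1', 100)")
    conn.commit()
    first = apply_debit_once(conn, "txn-42", "acct-1", 30)
    second = apply_debit_once(conn, "txn-42", "acct-1", 30)  # simulated redelivery
    balance = conn.execute("SELECT balance FROM accounts WHERE id = 'acct-1'").fetchone()[0]
    return first, second, balance


print(demo())
# (True, False, 70)
```

A check of this kind prevents the double debit regardless of which upstream path delivered the duplicate, which is why it can mask H2 or H4 rather than resolve them.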

Remaining Unknowns

  • Whether the DB-layer idempotency check fully resolves the issue or just masks H2
  • Whether H5 (network partition) is contributing
  • Whether the redlock upgrade addresses H4
