
coding-agent-helpers/compact-debug-ledger

Use when a debugging thread needs to be compressed into a reusable investigation ledger. Capture the target, evidence, attempted fixes, ruled-out hypotheses, viable hypotheses, and next experiments. Good triggers include "compact this debugging session", "summarize what we've tried", and "turn this into a debugging ledger".

Quality: 100%
Impact: 99% (3.66x)
Average score across 8 eval scenarios

Security (by Snyk): Passed, no known issues


Files: task.md, evals/scenario-4/

Active Bug Hunt: Flaky CI Pipeline

Problem Description

Your team's CI pipeline has been producing flaky test failures for two weeks: the same tests sometimes pass and sometimes fail on identical code in CI, and the failures cannot be reproduced locally. Multiple engineers have investigated, but progress has stalled because their notes are scattered. You need to consolidate the current investigation state into a single compact reference document that focuses on what to try next.

Produce a compact investigation record from the notes below and save it as ci_debug.md.

Input Files

The following file is provided as input. Extract it before beginning.

=============== FILE: inputs/ci_notes.md ===============

CI Flakiness Investigation Notes

The Problem

Integration tests in the payments module fail in approximately 15-20% of CI runs but always pass locally. Failures are non-deterministic: same commit, same code, yet pass/fail alternates with no apparent pattern.

What We Know (Evidence)

  • Failures first appeared after upgrading GitHub Actions runner from ubuntu-20.04 to ubuntu-22.04
  • Affected tests all involve the PaymentProcessor class and external HTTP calls
  • Test logs show "connection refused" on port 8080 — a mock server that should be started by test setup
  • The mock server sometimes starts before the test and sometimes starts after (race condition in test setup)
  • Test setup uses setTimeout(startServer, 100) — this was added as a workaround for a different issue last year
  • Server startup time on ubuntu-22.04 is faster than ubuntu-20.04 due to system optimization

Attempted Fixes and Results

  1. Pinned runner back to ubuntu-20.04 → tests pass consistently → confirms runner version is the trigger, but not a permanent fix since ubuntu-20.04 is deprecated
  2. Increased setTimeout delay from 100ms to 500ms → reduced failure rate from 20% to 5% but not eliminated → didn't fully solve it
  3. Added retry logic in test client for connection refused → tests pass but mask the underlying race → not a real fix
  4. Replaced setTimeout with a health-check poll (poll port 8080 every 50ms until ready, timeout 5s) → deployed to feature branch CI → zero failures in 20 runs → looks very promising, not yet merged
  5. Attempted to use jest's globalSetup instead of per-test setup → failed — server lifecycle doesn't integrate well with jest's module isolation

Possible Next Actions (team brainstorm, not prioritized)

  • Merge the health-check poll fix to main
  • Add a regression test that deliberately starts server late to prevent future flakiness
  • Investigate whether other test modules have the same setTimeout anti-pattern
  • Check if the ubuntu-22.04 runner change affected any other test suites
  • Write a post-mortem documenting the root cause
  • Update the test infrastructure docs
  • Review all CI configs for similar patterns
  • Set up flakiness alerting dashboard
  • Consider migrating to testcontainers for more reliable service lifecycle management
  • Add a lint rule to catch setTimeout in test setup files
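The last bullet could be prototyped with ESLint's built-in no-restricted-syntax rule. This is a sketch only; the file globs are assumptions about the repo's layout, not known paths.

```javascript
// Sketch of an ESLint config (.eslintrc.js) flagging setTimeout in test
// setup files. The "files" globs are hypothetical; adjust to the repo.
module.exports = {
  overrides: [
    {
      files: ["**/*.setup.js", "**/test-setup/**"],
      rules: {
        "no-restricted-syntax": [
          "error",
          {
            selector: "CallExpression[callee.name='setTimeout']",
            message:
              "Don't use setTimeout in test setup; poll for readiness instead.",
          },
        ],
      },
    },
  ],
};
```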
