Detect flaky tests from CI history and propose LLM-validated fixes via quarantine pull requests. Use when Claude needs to find flaky tests, analyze CI test stability, identify tests that flip pass/fail without code changes, or set up automated quarantine workflows. Supports any test framework that emits JUnit XML (pytest, unittest, JUnit, TestNG, Vitest, Jest with junit reporter). Trigger when users mention "flaky tests", "intermittent failures", "tests that randomly fail", "quarantine flaky tests", "CI flakiness", or ask to "find unreliable tests", "analyze CI history", "mark tests as flaky".
Identify tests that flip pass/fail in CI without code changes, then quarantine them automatically with @pytest.mark.flaky markers and optional LLM-suggested fixes.
Install from PyPI in any Python 3.10 or newer environment:

pip install flaky-detector-agent

Or install from source for the latest changes:

git clone https://github.com/golikovichev/flaky-detector-agent
cd flaky-detector-agent
pip install -e .

Authenticate the gh CLI once per machine so the agent can open pull requests on your behalf:

gh auth login

Optional: export OPENAI_API_KEY to enable LLM-generated fix suggestions:

export OPENAI_API_KEY=sk-...

Verify the install:

flaky-detector --help
flaky-detector data/sample_history # bundled sample with 3 known flakies

When a user reports flaky tests, intermittent CI failures, or asks to analyze test stability, follow this workflow with explicit verification at each step:
Ask for a directory of JUnit XML files (one per CI run) or a single XML file. Most CI systems can publish these as artifacts. Typical artifact paths:

- artifacts/junit/*.xml
- artifacts/test-results.xml
- target/surefire-reports/*.xml

Verification: the path passed to the detector must exist and contain at least one valid JUnit XML file.
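The verification above can be approximated with a few lines of standard-library Python before the detector is run. This is a minimal sketch, not the detector's own code: `find_junit_files` is a hypothetical helper, and it assumes a report counts as valid when it is well-formed XML whose root element is `<testsuites>` or `<testsuite>`.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def find_junit_files(path: str) -> list[Path]:
    """Return the files at or under `path` that look like JUnit XML reports.

    Hypothetical helper for illustration; not part of flaky-detector.
    """
    root = Path(path)
    if not root.exists():
        raise FileNotFoundError(f"input path does not exist: {root}")

    # Accept either a single report or a directory of reports.
    candidates = [root] if root.is_file() else sorted(root.rglob("*.xml"))

    valid = []
    for candidate in candidates:
        try:
            tag = ET.parse(candidate).getroot().tag
        except ET.ParseError:
            continue  # not well-formed XML, so not a usable report
        # Assumption: JUnit reports use <testsuites> or a bare <testsuite> as root.
        if tag in ("testsuites", "testsuite"):
            valid.append(candidate)
    return valid
```

An empty return value means the path exists but holds no usable report, which is the case the verification step is meant to catch.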
flaky-detector data/junit-history --min-flips 3 --window-days 14

This prints the list of flagged tests with their flip patterns. Verification: confirm the flagged tests match the team's intuition before opening a PR. False positives waste reviewer time.
Use --dry-run-pr to inspect markers without git changes:

flaky-detector data/junit-history --open-pr --dry-run-pr

This shows the markers the agent would apply and the PR body it would create, without touching git or calling gh. Verification: the PR body should list each flagged test with its pattern, the marker location, and (if OPENAI_API_KEY is set) the AST-validated fix snippet.
Once the dry-run output looks correct:

flaky-detector data/junit-history --open-pr

This creates branch flaky-quarantine-<timestamp>, applies markers, commits all changes, and opens a draft PR via gh. Verification: open the PR in the browser. Confirm the markers are on the right tests and the description is readable. Take it out of draft when ready to merge.
With OPENAI_API_KEY set before running --open-pr, the agent asks an LLM (Codex by default) for a candidate fix snippet for each flagged test and runs the snippet through Python's ast.parse() before attaching it. Verification: every attached snippet parses (the AST pass guarantees this), but parsing only proves the syntax is valid. Treat the snippet as a hint, not as final code; the reviewer should still understand and adapt it.
A flaky test passes on one run and fails on the next without source changes in between. Flaky tests slowly destroy trust in CI: developers start ignoring red builds, real regressions sneak through, and engineering time goes to manual triage rather than features.
Common root causes detected indirectly through flip patterns:
The detector does not classify root causes itself. It surfaces candidates for human triage by counting outcome flips inside a sliding window.
A test is flagged flaky when it shows 3 or more outcome flips inside any 14-day sliding window. Both thresholds are configurable via --min-flips and --window-days.
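The rule can be sketched as follows. This is an illustrative reading of the heuristic, not the detector's source: `is_flaky` is a hypothetical function, and it assumes a flip counts toward a window only when both runs on either side of it fall inside that window.

```python
from datetime import datetime, timedelta


def is_flaky(runs, min_flips=3, window_days=14):
    """Return True if any window of `window_days` holds at least `min_flips` flips.

    `runs` is a list of (timestamp, passed) pairs for a single test.
    Hypothetical helper for illustration; not part of flaky-detector.
    """
    ordered = sorted(runs, key=lambda run: run[0])
    window = timedelta(days=window_days)

    # Start one window at every run and count flips among the runs it covers.
    for i, (start, _) in enumerate(ordered):
        in_window = [passed for ts, passed in ordered[i:] if ts - start <= window]
        flips = sum(a != b for a, b in zip(in_window, in_window[1:]))
        if flips >= min_flips:
            return True
    return False
```

Under these assumptions, five daily runs alternating pass and fail are flagged, while the same five outcomes spaced ten days apart are not, because no single 14-day window contains three flips.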
A flip is a transition between outcomes (pass to fail, or fail to pass). The heuristic deliberately ignores:
Failure rate alone confuses flaky tests with consistently broken ones. A test that fails 50% of the time with pattern F F F P P P was broken and then fixed, not flaky: it flips only once. A test with pattern P F P F P has a similar failure rate (40%) but flips four times, which is the classic flaky signature.
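The difference between the two metrics is easy to check by hand or in a few lines of Python. The helper names below are hypothetical and only illustrate the arithmetic on the two patterns discussed above.

```python
def failure_rate(pattern: str) -> float:
    """Fraction of runs that failed, for a pattern such as 'P F P F P'."""
    outcomes = pattern.split()
    return outcomes.count("F") / len(outcomes)


def count_flips(pattern: str) -> int:
    """Number of adjacent runs whose outcomes differ."""
    outcomes = pattern.split()
    return sum(a != b for a, b in zip(outcomes, outcomes[1:]))


for pattern in ("F F F P P P", "P F P F P"):
    print(f"{pattern}: failure rate {failure_rate(pattern):.0%}, flips {count_flips(pattern)}")
# F F F P P P: failure rate 50%, flips 1
# P F P F P: failure rate 40%, flips 4
```

With the default threshold of 3 flips, only the second pattern would be flagged, even though the first one fails more often.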
Two-week windows match typical sprint cadence and avoid both extremes:
| Argument | Meaning | Default |
|---|---|---|
| input | Path to a JUnit XML file or directory of them | required |
| --min-flips N | Minimum outcome flips inside the window to flag a test | 3 |
| --window-days N | Sliding window size in days | 14 |
| --open-pr | Apply @pytest.mark.flaky markers and open a quarantine PR via gh | off |
| --dry-run-pr | With --open-pr: preview only, no git or gh calls | off |
| --tests-root PATH | Root directory where pytest test files live | tests |
| --repo-root PATH | Repo root passed to git and gh | current dir |
Console listing for every flagged test:
Scanned 45 test executions across 5 CI runs.
Detected 3 flaky test(s):
- tests.test_login::test_login_concurrent_session
4 outcome flips across 5 runs within a 14-day window
pattern: P F P F P
- tests.test_checkout::test_checkout_timeout
4 outcome flips across 5 runs within a 14-day window
pattern: F P F P F
- tests.test_search::test_search_index_warmup
3 outcome flips across 5 runs within a 14-day window
pattern: P P F P F

Quarantine pull request body (when --open-pr is set) with one entry per flaky test, including the flip pattern, the marker the agent applied, and an optional LLM fix snippet.
Branch and commit named flaky-quarantine-<timestamp> so the team can review the diff before merging.
When OPENAI_API_KEY is set, the agent asks an LLM (Codex by default) for a candidate fix snippet for each flagged test. The agent then runs the snippet through Python ast.parse() before attaching it to the PR body. If the snippet does not parse, it is dropped silently so only well-formed code reaches the PR.
This pattern keeps generated code honest. The reviewer always sees parseable Python, never half-broken strings.
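The parse-or-drop gate described above amounts to very little code. The following is a minimal sketch of the idea rather than the agent's actual implementation; `validated_snippet` is a hypothetical name.

```python
import ast
from typing import Optional


def validated_snippet(snippet: str) -> Optional[str]:
    """Return the snippet if it parses as Python, otherwise None.

    Hypothetical helper for illustration; not part of flaky-detector.
    """
    try:
        ast.parse(snippet)
    except (SyntaxError, ValueError):
        # Malformed code is dropped instead of being attached to the PR body.
        return None
    return snippet
```

Note that `ast.parse()` checks syntax only. A snippet that references an undefined name or changes the test's meaning still passes, which is why the reviewer has to read it.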
A typical CI integration looks like this:
flaky-detector <dir> --open-pr against the collection.

For deeper guidance see the bundled reference files:
- references/flaky-patterns.md: common root causes that produce the flip patterns this skill detects
- references/quarantine-workflow.md: end-to-end CI integration with explicit checkpoints

LLM fix suggestions require the OPENAI_API_KEY env var. Without it the skill still detects and quarantines, just without snippets.

- pytest with pytest-rerunfailures for the quarantine markers
- gh CLI authenticated against the target repo for PR creation
- OpenAI Python SDK if LLM fix suggestions are wanted

https://github.com/golikovichev/flaky-detector-agent
MIT licensed. Issues and pull requests welcome.