Detect flaky tests from CI history and propose LLM-validated fixes via quarantine pull requests. Use when Claude needs to find flaky tests, analyze CI test stability, identify tests that flip pass/fail without code changes, or set up automated quarantine workflows. Supports any test framework that emits JUnit XML (pytest, unittest, JUnit, TestNG, Vitest, Jest with junit reporter). Trigger when users mention "flaky tests", "intermittent failures", "tests that randomly fail", "quarantine flaky tests", "CI flakiness", or ask to "find unreliable tests", "analyze CI history", "mark tests as flaky".
68
81%
Does it follow best practices?
Impact
—
No eval scenarios have been run
Advisory
Suggest reviewing before use
Quality
Discovery
100%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is an excellent skill description that clearly defines a specific niche (flaky test detection and quarantine), provides comprehensive trigger terms covering natural user language, and explicitly addresses both what the skill does and when to use it. The inclusion of supported frameworks and input formats adds valuable specificity without being overly verbose.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific concrete actions: detect flaky tests from CI history, propose LLM-validated fixes, create quarantine pull requests. Also specifies supported frameworks (pytest, unittest, JUnit, TestNG, Vitest, Jest) and the input format (JUnit XML). | 3 / 3 |
| Completeness | Clearly answers both 'what' (detect flaky tests, analyze CI test stability, propose fixes via quarantine PRs) and 'when' (explicit 'Use when' clause plus a 'Trigger when' clause with specific user phrases). Both dimensions are thoroughly covered. | 3 / 3 |
| Trigger Term Quality | Excellent coverage of natural trigger terms users would say: 'flaky tests', 'intermittent failures', 'tests that randomly fail', 'quarantine flaky tests', 'CI flakiness', 'find unreliable tests', 'analyze CI history', 'mark tests as flaky'. These are highly natural phrases a user would actually use. | 3 / 3 |
| Distinctiveness / Conflict Risk | Highly distinctive niche focused specifically on flaky test detection and quarantine workflows from CI history. The combination of CI history analysis, flaky test detection, JUnit XML parsing, and quarantine PRs is unlikely to conflict with other skills like general testing or CI/CD skills. | 3 / 3 |
| Total | | 12 / 12 Passed |
Implementation
62%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
The skill excels at actionability and workflow clarity with concrete commands, real output examples, and explicit verification checkpoints at each step. However, it is significantly undermined by verbosity—large sections explain concepts Claude already knows (what flaky tests are, why flip counts beat failure rates, why 14-day windows) and could be removed or moved to reference files. The referenced bundle files don't exist, weakening progressive disclosure.
Suggestions
- Remove or drastically shorten the 'What flaky tests are and why detection matters' and 'Detection methodology' sections: Claude already understands these concepts, and they consume significant tokens without adding actionable guidance.
- Move the detailed methodology rationale (flip count vs failure rate, window size justification) and root cause taxonomy into the referenced `references/flaky-patterns.md` file rather than inlining them.
- Trim the 'Limitations', 'Dependencies', and 'When NOT to use' sections to a compact bullet list; most of these are self-evident from the rest of the content.
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The content is excessively verbose. The 'What flaky tests are and why detection matters' section explains concepts Claude already knows. The 'Detection methodology' section with its rationale for flip counts vs failure rates and why 14 days is chosen is unnecessary context padding. The limitations, dependencies, and 'when NOT to use' sections add bulk without actionable value. This could easily be cut by 50%+. | 1 / 3 |
| Actionability | The skill provides fully executable commands at every step, with concrete CLI flags, real example output, and specific paths. The installation, detection, dry-run, and PR creation steps are all copy-paste ready with clear flag explanations. | 3 / 3 |
| Workflow Clarity | The 5-step workflow has explicit verification checkpoints after each step (marked with bold 'Verification:' labels), includes a dry-run step before the destructive PR creation, and provides a clear feedback loop (preview → dry-run → real PR). The sequence is logical and well-guarded. | 3 / 3 |
| Progressive Disclosure | References to `references/flaky-patterns.md` and `references/quarantine-workflow.md` are mentioned but no bundle files exist to support them. The main file contains extensive inline content (methodology rationale, root cause explanations, example workflows) that should have been split into reference files, making the SKILL.md itself bloated. | 2 / 3 |
| Total | | 9 / 12 Passed |
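For readers unfamiliar with the "flip count vs failure rate" rationale that the review suggests moving into a reference file, the idea can be sketched in a few lines of Python. This is a minimal illustration, not the skill's implementation, which this page does not show. The function names (`outcomes_from_junit`, `flip_count`, `failure_rate`) and the sample histories are hypothetical, and the sketch assumes every run in a history was taken against unchanged code.

```python
import xml.etree.ElementTree as ET


def outcomes_from_junit(xml_text: str) -> dict[str, bool]:
    """Map 'classname::name' to True (passed) or False (failed or errored).

    Skipped test cases are left out, since they carry no pass/fail signal.
    """
    results: dict[str, bool] = {}
    root = ET.fromstring(xml_text)
    # iter() finds <testcase> whether the root is <testsuites> or <testsuite>.
    for case in root.iter("testcase"):
        if case.find("skipped") is not None:
            continue
        test_id = f"{case.get('classname', '')}::{case.get('name', '')}"
        failed = case.find("failure") is not None or case.find("error") is not None
        results[test_id] = not failed
    return results


def flip_count(history: list[bool]) -> int:
    """Count adjacent runs whose outcome differs (pass->fail or fail->pass)."""
    return sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)


def failure_rate(history: list[bool]) -> float:
    """Fraction of runs that failed; 0.0 for an empty history."""
    if not history:
        return 0.0
    return history.count(False) / len(history)


# Hypothetical histories, oldest run first (True = pass, False = fail).
broken_by_a_commit = [True, True, True, False, False, False]
flaky = [True, False, True, True, False, True]

print(flip_count(broken_by_a_commit), failure_rate(broken_by_a_commit))  # 1 0.5
print(flip_count(flaky), round(failure_rate(flaky), 2))  # 4 0.33
```

In this made-up data the consistently broken test has the higher failure rate (0.5) but only one flip, while the intermittent test fails less often yet flips four times. That is the distinction a flip-count signal is meant to capture.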
Validation
100%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 11 / 11 Passed
Validation for skill structure
No warnings or errors.
2392045