flaky-detector

Detect flaky tests from CI history and propose LLM-validated fixes via quarantine pull requests. Use to find flaky tests, analyze CI test stability, identify tests that flip pass/fail without code changes, or set up automated quarantine workflows. Supports any test framework that emits JUnit XML (pytest, unittest, JUnit, TestNG, Vitest, Jest with junit reporter). Trigger when users mention "flaky tests", "intermittent failures", "tests that randomly fail", "quarantine flaky tests", "CI flakiness", or ask to "find unreliable tests", "analyze CI history", "mark tests as flaky".
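As an illustration of the JUnit XML input the description refers to, here is a minimal sketch of reading per-test outcomes from one report. It is not the skill's own parser: the function name `read_outcomes`, the `classname::name` identifier format, and the sample report are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def read_outcomes(junit_xml: str) -> dict[str, str]:
    """Map 'classname::name' to 'pass', 'fail', or 'skip' for one CI run.

    Hypothetical helper: the ID format is an assumption, not the skill's.
    """
    outcomes = {}
    root = ET.fromstring(junit_xml)
    # iter() finds <testcase> whether the root is <testsuite> or <testsuites>.
    for case in root.iter("testcase"):
        test_id = f"{case.get('classname', '')}::{case.get('name', '')}"
        if case.find("failure") is not None or case.find("error") is not None:
            outcomes[test_id] = "fail"
        elif case.find("skipped") is not None:
            outcomes[test_id] = "skip"
        else:
            outcomes[test_id] = "pass"
    return outcomes


# Made-up sample report, for illustration only.
SAMPLE = """<testsuite name="demo" tests="2" failures="1">
  <testcase classname="tests.test_api" name="test_ok"/>
  <testcase classname="tests.test_api" name="test_timeout">
    <failure message="timed out"/>
  </testcase>
</testsuite>"""

print(read_outcomes(SAMPLE))
```

Collecting one such mapping per CI run gives the pass/fail history that a flakiness analysis would work from.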

72

Quality

88%

Does it follow best practices?

Impact

No eval scenarios have been run

Security by Snyk

Advisory

Suggest reviewing before use

Quality

Content

77%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a well-structured, actionable skill with clear step-by-step workflows and explicit validation checkpoints (dry-run before real PR). Its main weakness is moderate verbosity: some sections explain concepts Claude already knows (CI artifact locations, why flip count beats failure rate) and could be trimmed. The progressive disclosure references look appropriate but cannot be verified without the bundle files.
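The flip-count versus failure-rate distinction mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the skill's actual algorithm and thresholds are not shown on this page, the function names are hypothetical, and each history is assumed to be an ordered list of outcomes for one test across runs with no code changes in between.

```python
def failure_rate(history: list[str]) -> float:
    """Fraction of runs that failed."""
    return history.count("fail") / len(history)


def flip_count(history: list[str]) -> int:
    """Number of adjacent run pairs whose outcome differs."""
    return sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)


# Made-up histories, oldest run first.
broken = ["pass", "pass", "fail", "fail", "fail", "fail"]  # broke once, stayed broken
flaky = ["pass", "fail", "pass", "fail", "pass", "pass"]   # keeps changing outcome

for name, history in (("broken", broken), ("flaky", flaky)):
    print(name, round(failure_rate(history), 2), flip_count(history))
```

In this made-up data the consistently broken test has the higher failure rate but only one flip, while the flaky test has the lower failure rate and four flips, which is why a flip count can separate the two cases where a failure rate alone would not.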

Suggestions

Trim the 'How detection works' section to just the algorithm parameters and remove the failure-rate-vs-flip-count justification, which is background knowledge rather than actionable guidance.

Remove the CI-system-specific artifact path examples (GitHub Actions, GitLab, Jenkins) since Claude already knows these; just say 'point at JUnit XML artifacts from CI'.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | Generally efficient but includes some unnecessary explanation (e.g., explaining what a flip is vs failure rate, listing CI artifact paths for GitHub/GitLab/Jenkins that Claude would know). The 'How detection works' section over-explains the rationale behind flip counting vs failure rate. | 2 / 3 |
| Actionability | Provides fully executable commands at every step, concrete CLI flags with defaults in a clear table, specific example output, and a clear installation sequence. Commands are copy-paste ready with real flags and paths. | 3 / 3 |
| Workflow Clarity | The Quick Start section provides a clear 5-step sequence with explicit validation checkpoints: preview detection before acting, dry-run the PR before opening it, confirm markers before taking out of draft. This is a well-structured workflow with feedback loops for a potentially destructive operation (opening PRs). | 3 / 3 |
| Progressive Disclosure | References `references/flaky-patterns.md` and `references/quarantine-workflow.md` which are good signals of progressive disclosure, but no bundle files are provided to verify these exist. The main file includes some content (like the detection algorithm explanation and limits section) that could be split out, and the inline content is moderately long. | 2 / 3 |
| Total | | 10 / 12 |

Passed

Description

100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is an excellent skill description that clearly articulates specific capabilities (flaky test detection, LLM-validated fixes, quarantine PRs), supported technologies (JUnit XML from multiple frameworks), and explicit trigger conditions with natural user language. It uses proper third-person voice throughout and provides comprehensive coverage of both what the skill does and when it should be selected.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Lists multiple specific concrete actions: detect flaky tests from CI history, propose LLM-validated fixes, create quarantine pull requests. Also specifies supported frameworks (pytest, unittest, JUnit, TestNG, Vitest, Jest with junit reporter) and the input format (JUnit XML). | 3 / 3 |
| Completeness | Clearly answers both 'what' (detect flaky tests, propose fixes, create quarantine PRs, analyze CI test stability) and 'when' (explicit 'Use to...' clause and 'Trigger when...' clause with specific user phrases). Both dimensions are thoroughly covered. | 3 / 3 |
| Trigger Term Quality | Excellent coverage of natural trigger terms users would say: 'flaky tests', 'intermittent failures', 'tests that randomly fail', 'quarantine flaky tests', 'CI flakiness', 'find unreliable tests', 'analyze CI history', 'mark tests as flaky'. These are highly natural phrases a user would actually use. | 3 / 3 |
| Distinctiveness Conflict Risk | Highly distinctive niche focused specifically on flaky test detection and quarantine workflows from CI history. The combination of CI history analysis, flaky test detection, JUnit XML parsing, and quarantine PR creation is unlikely to conflict with other skills like general testing or CI/CD skills. | 3 / 3 |
| Total | | 12 / 12 |

Passed

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 11 / 11 Passed

Validation for skill structure

No warnings or errors.

Repository
golikovichev/flaky-detector-agent
Reviewed
