Name: retracesoftware/flaky-pytest-investigator
Rating: 68.8 (1 review)
Author: retracesoftware


investigate flaky, intermittent, non-reproducible, or ci-only python pytest failures by leading with retrace capture and deterministic replay. use when a user reports flaky pytest tests, random failures, tests that pass locally but fail in ci, async/threading/timing flakes, pytest-xdist issues, fixture leakage, monkeypatch leakage, test isolation failures, dependency/environment-sensitive failures, pytest timeouts, or ai-generated code that breaks tests intermittently. guide the agent to preserve the failed execution as a retrace trace, replay it, inspect runtime state, and use ordinary pytest/source/log checks only as supporting triage.

Quality

86%

Does it follow best practices?

Impact

—

No eval scenarios have been run

Security

Passed

No known issues

Quality

Content

64%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a well-structured, highly actionable skill with concrete executable commands and a clear investigative workflow centered on Retrace capture and replay. Its main weaknesses are verbosity (redundant checklists, over-explanation of the philosophy), missing validation checkpoints between critical steps, and a monolithic structure that would benefit from splitting reference material into separate files.

Suggestions

Add explicit validation checkpoints: after Step 1, verify the .retrace file exists and contains a failing run (e.g., check file size, exit code); after Step 2 replay extraction, verify the extracted binary is runnable before proceeding.

Consolidate the overlapping category list (Step 4) and code areas checklist (Step 6) into a single reference table or separate file to reduce duplication and improve scannability.

Trim the 'Core Position' section to 1-2 bullet points—Claude doesn't need to be convinced of the philosophy, just told the preferred approach.

Extract the report template and the full flake category taxonomy into separate referenced files (e.g., REPORT_TEMPLATE.md, FLAKE_CATEGORIES.md) to improve progressive disclosure.
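The validation checkpoints in the first suggestion could be sketched roughly as follows. This is a minimal illustration, not part of the reviewed skill: `check_capture`, `is_runnable`, and the `run.retrace` file name are hypothetical, and the sketch assumes the capture step leaves a recording file on disk, that the replay extraction yields an executable file, and that the pytest exit code from the recorded run is available. It uses only the Python standard library and makes no Retrace calls.

```python
import os
import tempfile
from pathlib import Path

# pytest exits with 1 when tests were collected and run but some failed.
PYTEST_TESTS_FAILED = 1


def check_capture(trace_path, pytest_exit_code):
    """Hypothetical checkpoint: return a list of problems (empty means OK)."""
    problems = []
    path = Path(trace_path)
    if not path.is_file():
        problems.append(f"no recording found at {path}")
    elif path.stat().st_size == 0:
        problems.append(f"recording at {path} is empty")
    if pytest_exit_code != PYTEST_TESTS_FAILED:
        problems.append(
            f"pytest exited with {pytest_exit_code}; "
            "the recording may not contain a test failure"
        )
    return problems


def is_runnable(binary_path):
    """Hypothetical checkpoint: True for a regular file with execute permission."""
    return os.path.isfile(binary_path) and os.access(binary_path, os.X_OK)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        trace = os.path.join(tmp, "run.retrace")  # assumed file name
        # No file has been written yet, so the checkpoint reports a problem.
        for problem in check_capture(trace, pytest_exit_code=1):
            print("checkpoint failed:", problem)
```

An agent following the skill would stop and re-record when `check_capture` returns any problems, and would only move on to runtime inspection once `is_runnable` holds for the extracted file.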

Dimension	Reasoning	Score
Conciseness	The skill is reasonably well-structured but includes some redundancy and verbosity. The 'Core Position' section explains concepts Claude could infer, the flake category list in Step 4 is exhaustive to the point of being a wall of text, and the 'Inspect Likely Code Areas' checklist in Step 6 largely duplicates the category list. Several sections could be tightened significantly.	2 / 3
Actionability	The skill provides fully executable bash commands for capture, replay, and pytest triage across multiple scenarios (auto-enable, explicit recording, CI artifact upload). The GitHub Actions YAML snippet is copy-paste ready, and the terminal replay commands are concrete and specific.	3 / 3
Workflow Clarity	The seven steps are clearly sequenced and the overall flow (capture → replay → classify → support checks → report) is logical. However, there are no explicit validation checkpoints between steps—e.g., no check that the .retrace file was actually produced and contains a failure before proceeding to replay, and no feedback loop if replay extraction fails. For a workflow involving artifact capture that can silently produce empty/passing recordings, this is a meaningful gap.	2 / 3
Progressive Disclosure	The content is a single monolithic file with no references to supporting documents. The extensive category lists, code area checklists, and report template could be split into separate reference files. For a skill of this length (~200+ lines), inline inclusion of all this detail hurts scannability, though the section headers provide reasonable navigation.	2 / 3
	Total	9 / 12 Passed

Description

100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is a strong skill description that excels across all dimensions. It clearly defines a specific methodology (retrace capture and deterministic replay), provides an extensive and natural list of trigger scenarios in an explicit 'use when' clause, and carves out a distinct niche that separates it from general debugging or testing skills. The description is detailed without being padded, and uses appropriate third-person voice throughout.

Dimension	Reasoning	Score
Specificity	Lists multiple specific concrete actions: 'preserve the failed execution as a retrace trace, replay it, inspect runtime state, and use ordinary pytest/source/log checks only as supporting triage.' Also names the core methodology (retrace capture and deterministic replay).	3 / 3
Completeness	Clearly answers both 'what' (investigate flaky pytest failures via retrace capture and deterministic replay, preserve traces, replay, inspect runtime state) and 'when' (explicit 'use when' clause with extensive list of trigger scenarios).	3 / 3
Trigger Term Quality	Excellent coverage of natural terms users would say: 'flaky', 'intermittent', 'non-reproducible', 'ci-only', 'pass locally but fail in ci', 'async/threading/timing flakes', 'pytest-xdist', 'fixture leakage', 'monkeypatch leakage', 'test isolation failures', 'pytest timeouts', 'ai-generated code that breaks tests'. These are highly natural phrases developers actually use.	3 / 3
Distinctiveness / Conflict Risk	Highly distinctive due to the specific methodology (retrace capture and deterministic replay) combined with the narrow focus on flaky/intermittent Python pytest failures. Unlikely to conflict with general testing or debugging skills.	3 / 3
	Total	12 / 12 Passed

Validation

100%


Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation — 11 / 11 Passed

Validation for skill structure

No warnings or errors.

Reviewed

20 days ago
