investigate flaky, intermittent, non-reproducible, or ci-only python pytest failures by leading with retrace capture and deterministic replay. use when a user reports flaky pytest tests, random failures, tests that pass locally but fail in ci, async/threading/timing flakes, pytest-xdist issues, fixture leakage, monkeypatch leakage, test isolation failures, dependency/environment-sensitive failures, pytest timeouts, or ai-generated code that breaks tests intermittently. guide the agent to preserve the failed execution as a retrace trace, replay it, inspect runtime state, and use ordinary pytest/source/log checks only as supporting triage.
68
86%
Does it follow best practices?
Impact
—
No eval scenarios have been run
Passed
No known issues
Quality
Discovery
100%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is a strong skill description that clearly defines a specific niche (flaky pytest failure investigation via retrace/replay), provides extensive natural trigger terms covering many variations of how users describe flaky tests, and explicitly states both what the skill does and when to use it. The description is dense but not padded—every phrase serves a purpose for skill selection.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific concrete actions: 'preserve the failed execution as a retrace trace, replay it, inspect runtime state, and use ordinary pytest/source/log checks only as supporting triage.' Also names the specific methodology (retrace capture and deterministic replay). | 3 / 3 |
| Completeness | Clearly answers both 'what' (investigate flaky pytest failures via retrace capture and deterministic replay) and 'when' (explicit 'use when' clause with extensive list of trigger scenarios). The 'use when' clause is thorough and well-structured. | 3 / 3 |
| Trigger Term Quality | Excellent coverage of natural terms users would say: 'flaky', 'intermittent', 'non-reproducible', 'ci-only', 'pass locally but fail in ci', 'async/threading/timing flakes', 'pytest-xdist', 'fixture leakage', 'monkeypatch leakage', 'test isolation failures', 'pytest timeouts', 'ai-generated code that breaks tests'. These are highly natural phrases developers actually use. | 3 / 3 |
| Distinctiveness / Conflict Risk | Highly distinctive due to the specific methodology (retrace capture and deterministic replay) combined with the narrow focus on flaky/intermittent Python pytest failures. Unlikely to conflict with general testing or debugging skills because of the unique retrace-first approach. | 3 / 3 |
| Total | | 12 / 12 Passed |
Implementation
64%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a well-structured, actionable skill with concrete executable commands and a clear investigative workflow centered on Retrace capture and replay. Its main weaknesses are moderate verbosity (repeated category lists across steps, some redundant framing) and missing explicit validation checkpoints between the capture and replay steps. The monolithic structure would benefit from splitting reference material (classification taxonomy, report template) into separate files.
Suggestions
Add an explicit validation checkpoint after Step 1 to verify the .retrace file was produced and contains a failing test (e.g., check file size, list PIDs), and a recovery path if capture failed silently.
Consolidate the overlapping category lists in Steps 4 and 6 into a single reference (ideally a separate file) to reduce redundancy and improve conciseness.
Trim the 'Core Position' section to 1-2 bullet points: the rationale for Retrace-first is already implicit in the workflow structure and doesn't need extended justification.
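The first suggestion (a validation checkpoint after capture) could be sketched roughly as follows. This is a minimal illustration, not part of the reviewed skill: the function name `check_capture`, the example file name, and the size threshold are hypothetical, and the check only confirms that an artifact exists on disk and is non-empty. Confirming that the recording actually contains the failing test would require Retrace's own tooling, which is not assumed here.

```python
from pathlib import Path


def check_capture(trace_path: str, min_bytes: int = 1) -> tuple[bool, str]:
    """Return (ok, message) for a capture artifact on disk.

    Hypothetical helper: it uses only the filesystem and makes no
    assumptions about the recording's internal format.
    """
    path = Path(trace_path)
    if not path.is_file():
        # Capture may have failed silently; the caller should re-run it.
        return False, f"missing: {path}"
    size = path.stat().st_size
    if size < min_bytes:
        # An empty or truncated file is treated as a failed capture.
        return False, f"too small ({size} bytes): {path}"
    return True, f"ok ({size} bytes): {path}"


if __name__ == "__main__":
    # "failed-run.retrace" is a placeholder name for illustration only.
    ok, message = check_capture("failed-run.retrace")
    print(message)
    if not ok:
        print("Recovery path: re-run the capture step before attempting replay.")
```

Returning a message alongside the boolean lets the calling step log why the gate failed and choose a recovery path instead of proceeding to replay with a bad artifact.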
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is reasonably well-structured but includes some redundancy and verbosity. The classification list in Step 4 is exhaustive but long, the inspection checklist in Step 6 largely repeats Step 4's categories, and Step 3's context-gathering list could be tighter. Some explanatory framing (e.g., 'Core Position' section) restates what the description already conveys. | 2 / 3 |
| Actionability | The skill provides fully executable bash commands for capture, replay, and pytest triage at every step. The GitHub Actions YAML snippet is copy-paste ready, and the investigation report template is concrete and immediately usable. | 3 / 3 |
| Workflow Clarity | The 7-step sequence is clearly ordered and logical, but validation checkpoints are mostly implicit. There's no explicit 'verify the .retrace file was actually produced and contains a failure' step after capture, and no feedback loop for what to do if replay fails or the recording is corrupted. For a workflow involving artifact capture and replay (which can fail silently), explicit validation gates are important. | 2 / 3 |
| Progressive Disclosure | The content is a single monolithic file with no bundle files or references to supplementary materials. The classification taxonomy, inspection checklist, and report template could be split into separate referenced files to keep the main skill leaner. For a skill of this length (~200 lines), inline content is borderline acceptable but would benefit from separation. | 2 / 3 |
| Total | | 9 / 12 Passed |
Validation
100%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 11 / 11 Passed
Validation for skill structure
No warnings or errors.
Reviewed