
retracesoftware/flaky-pytest-investigator

investigate flaky, intermittent, non-reproducible, or ci-only python pytest failures by leading with retrace capture and deterministic replay. use when a user reports flaky pytest tests, random failures, tests that pass locally but fail in ci, async/threading/timing flakes, pytest-xdist issues, fixture leakage, monkeypatch leakage, test isolation failures, dependency/environment-sensitive failures, pytest timeouts, or ai-generated code that breaks tests intermittently. guide the agent to preserve the failed execution as a retrace trace, replay it, inspect runtime state, and use ordinary pytest/source/log checks only as supporting triage.

68

Quality

86%

Does it follow best practices?

Impact

No eval scenarios have been run

Security by Snyk

Passed

No known issues


Quality

Discovery

100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is a strong skill description that clearly defines a specific niche (flaky pytest failure investigation via retrace/replay), provides extensive natural trigger terms covering many variations of how users describe flaky tests, and explicitly states both what the skill does and when to use it. The description is dense but not padded—every phrase serves a purpose for skill selection.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Lists multiple specific concrete actions: 'preserve the failed execution as a retrace trace, replay it, inspect runtime state, and use ordinary pytest/source/log checks only as supporting triage.' Also names the specific methodology (retrace capture and deterministic replay). | 3 / 3 |
| Completeness | Clearly answers both 'what' (investigate flaky pytest failures via retrace capture and deterministic replay) and 'when' (explicit 'use when' clause with extensive list of trigger scenarios). The 'use when' clause is thorough and well-structured. | 3 / 3 |
| Trigger Term Quality | Excellent coverage of natural terms users would say: 'flaky', 'intermittent', 'non-reproducible', 'ci-only', 'pass locally but fail in ci', 'async/threading/timing flakes', 'pytest-xdist', 'fixture leakage', 'monkeypatch leakage', 'test isolation failures', 'pytest timeouts', 'ai-generated code that breaks tests'. These are highly natural phrases developers actually use. | 3 / 3 |
| Distinctiveness Conflict Risk | Highly distinctive due to the specific methodology (retrace capture and deterministic replay) combined with the narrow focus on flaky/intermittent Python pytest failures. Unlikely to conflict with general testing or debugging skills because of the unique retrace-first approach. | 3 / 3 |
| Total | | 12 / 12 |

Passed

Implementation

64%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a well-structured, actionable skill with concrete executable commands and a clear investigative workflow centered on Retrace capture and replay. Its main weaknesses are moderate verbosity (repeated category lists across steps, some redundant framing) and the lack of explicit validation checkpoints between the capture and replay steps. The monolithic structure would benefit from splitting reference material (classification taxonomy, report template) into separate files.

Suggestions

Add an explicit validation checkpoint after Step 1 to verify the .retrace file was produced and contains a failing test (e.g., check file size, list PIDs), and a recovery path if capture failed silently.

Consolidate the overlapping category lists in Steps 4 and 6 into a single reference (ideally a separate file) to reduce redundancy and improve conciseness.

Trim the 'Core Position' section to 1-2 bullet points — the rationale for Retrace-first is already implicit in the workflow structure and doesn't need extended justification.
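
The first suggestion above (a validation checkpoint after capture) could look roughly like the following. This is a minimal sketch, not part of the reviewed skill: the function name `check_capture` and the `min_bytes` threshold are hypothetical, and the code only checks filesystem facts about the `.retrace` file. It does not call the Retrace CLI or inspect trace contents, so it cannot confirm that a failing test was actually recorded.

```python
from pathlib import Path


def check_capture(trace_path, min_bytes=1):
    """Return (ok, message) for a recording file expected after the capture step.

    Hypothetical helper: it checks only that the file exists and is not
    smaller than min_bytes. It does not parse the recording.
    """
    path = Path(trace_path)
    if not path.is_file():
        # Recovery path: treat a missing file as a silently failed capture.
        return False, f"missing: {path} (capture may have failed silently; re-run capture)"
    size = path.stat().st_size
    if size < min_bytes:
        # An empty or truncated file is not worth attempting to replay.
        return False, f"too small: {path} is {size} bytes (treat as a failed capture)"
    return True, f"ok: {path} is {size} bytes"
```

An agent could run a gate like this between capture and replay and stop with the returned message when `ok` is false, rather than attempting a replay that may fail for unrelated reasons.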

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The skill is reasonably well-structured but includes some redundancy and verbosity. The classification list in Step 4 is exhaustive but long, the inspection checklist in Step 6 largely repeats Step 4's categories, and Step 3's context-gathering list could be tighter. Some explanatory framing (e.g., 'Core Position' section) restates what the description already conveys. | 2 / 3 |
| Actionability | The skill provides fully executable bash commands for capture, replay, and pytest triage at every step. The GitHub Actions YAML snippet is copy-paste ready, and the investigation report template is concrete and immediately usable. | 3 / 3 |
| Workflow Clarity | The 7-step sequence is clearly ordered and logical, but validation checkpoints are mostly implicit. There's no explicit 'verify the .retrace file was actually produced and contains a failure' step after capture, and no feedback loop for what to do if replay fails or the recording is corrupted. For a workflow involving artifact capture and replay (which can fail silently), explicit validation gates are important. | 2 / 3 |
| Progressive Disclosure | The content is a single monolithic file with no bundle files or references to supplementary materials. The classification taxonomy, inspection checklist, and report template could be split into separate referenced files to keep the main skill leaner. For a skill of this length (~200 lines), inline content is borderline acceptable but would benefit from separation. | 2 / 3 |
| Total | | 9 / 12 |

Passed

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 11 / 11 Passed

Validation for skill structure

No warnings or errors.

Reviewed
