fix-local-tests

Fix failing tests by prioritising shell implementation fixes to match bash behaviour

Quality

—

Does it follow best practices?

Impact

—

No eval scenarios have been run

Securityby

Passed

No known issues

Quality

Content

77%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The body is highly actionable with a clear, well-validated workflow and concrete executable commands throughout, but it is somewhat verbose in the security preamble and keeps all content inline without progressive disclosure to separate reference files.

Suggestions

Tighten the security preamble to the essential directive and trim redundancy to improve token efficiency.

Fix the duplicate "### 7." heading so the workflow step sequence is unambiguous (e.g. renumber the bash-comparison step to 8).

Move the fuzz-handling detail and/or the failure-classification guidance into a reference file referenced one level deep from SKILL.md to improve progressive disclosure.

Dimension	Reasoning	Score
Conciseness	The body is mostly lean commands, tables, and imperative instructions with no explanation of concepts Claude already knows, but the multi-paragraph security preamble is somewhat verbose and the duplicate "### 7." heading indicates slack, fitting the score-2 'mostly efficient but could be tightened' anchor rather than the score-3 'every token earns its place' anchor.	2 / 3
Actionability	It provides copy-paste-ready executable commands ("go test -race ./interp/... ./tests/...", "docker run --rm debian:bookworm-slim bash -c '...'", "bash -c '<script>'"), a concrete classification table mapping categories to actions, and a literal fuzz corpus file format example, matching the score-3 'fully executable code/commands; copy-paste ready' anchor.	3 / 3
Workflow Clarity	The 7-step workflow is clearly sequenced with explicit validation checkpoints (step 4 "Run the failing test to verify the fix", step 7 "run the full test suite") and an explicit feedback loop ("If new failures appear, repeat from step 1"), matching the score-3 anchor; validation is present so the destructive-operation cap does not apply, though the duplicate "### 7." numbering is a minor cosmetic blemish.	3 / 3
Progressive Disclosure	The skill is well-organized into labeled sections with no nested references, but at roughly 120 lines all content lives inline in SKILL.md with no bundle files or one-level-deep references to split out detail (e.g. the fuzz or classification material), fitting the score-2 'some structure but content that should be separate is inline' anchor rather than the score-3 'well-signaled one-level-deep references' anchor.	2 / 3
	Total	10 / 12 Passed

Description

50%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description is reasonably specific and uses third-person voice, but it lacks an explicit 'Use when' trigger clause and offers only limited trigger-term variation, leaving every dimension at the mid-level anchor.

Suggestions

Add an explicit trigger clause, e.g. 'Use when tests in the interp/tests suite are failing or when rshell output diverges from bash'.

Broaden natural trigger terms to include common user phrasings like 'test failures', 'broken tests', or 'go test failures'.

Lead with the distinctive bash-comparison context so the skill is less likely to be confused with generic test-fixing skills.

Dimension	Reasoning	Score
Specificity	Quotes "Fix failing tests" and "shell implementation fixes to match bash behaviour" name a concrete action and domain, but it is a single action family rather than multiple listed concrete actions, matching the score-2 anchor rather than the score-3 'lists multiple specific concrete actions' anchor.	2 / 3
Completeness	It clearly states what the skill does ("Fix failing tests by prioritising shell implementation fixes to match bash behaviour") but contains no "Use when..." clause or equivalent explicit trigger guidance, so completeness is capped at 2 per the rubric guideline.	2 / 3
Trigger Term Quality	"failing tests", "fix", and "bash" are natural terms a user might say, but coverage is limited and misses common variations such as "test failures" or "broken tests", fitting the score-2 'some relevant keywords but missing common variations' anchor.	2 / 3
Distinctiveness Conflict Risk	The "shell implementation fixes to match bash behaviour" qualifier carves a niche, but the leading "Fix failing tests" phrase is generic and could overlap with other test-fixing skills, so it sits at the score-2 'somewhat specific but could still overlap' anchor rather than a clearly distinct score-3.	2 / 3
	Total	8 / 12 Passed

Validation

93%

Warnings & errors only

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation — 15 / 16 Passed

Validation for skill structure

Criteria	Description	Result
frontmatter_unknown_keys	Unknown frontmatter key(s) found; consider removing or moving to metadata	Warning

	Total	15 / 16 Passed

Repository: DataDog/rshell
Commit: a0a1140

Reviewed: about 16 hours ago

Table of Contents

Discovery Implementation Validation

Is this your skill?

If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.