The body is a well-structured, highly actionable two-phase workflow with explicit gates and feedback loops. Its main weakness is progressive disclosure: it references several files and a script that are not bundled. It also repeats emphasis phrases and attaches a rationale to every anti-pattern, both of which could be trimmed.

Suggestions

Ship the referenced bundle files (references/reference.md and the probe-to-investigate/brainstorm/probe-llm.md handoffs) and scripts/run-query-benchmarks.sh, or inline the essential content and drop the references.

Tighten conciseness by collapsing repeated 'MANDATORY / DO NOT skip / DO NOT wait' emphasis into a single guard statement and trimming the per-anti-pattern 'Why:' rationales.

Consider moving the full 'Usage Examples' block and the AskUserQuestion bug workaround into a reference file so the main SKILL.md stays a lean overview.
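To gauge how much repeated emphasis there is before collapsing it into a single guard statement, a quick count can help. The following is a minimal sketch, not part of the reviewed skill: the function name `count_emphasis` and the sample text are hypothetical, and the phrase list is simply the three phrases quoted in this review.

```python
import re
from collections import Counter

# Phrases quoted in the review; this tuple is an assumption and can be extended.
EMPHASIS_PHRASES = ("MANDATORY", "DO NOT skip", "DO NOT wait")


def count_emphasis(text, phrases=EMPHASIS_PHRASES):
    """Return a Counter of case-sensitive occurrences of each phrase in text."""
    counts = Counter()
    for phrase in phrases:
        counts[phrase] = len(re.findall(re.escape(phrase), text))
    return counts


if __name__ == "__main__":
    # Hypothetical sample text, not taken from the reviewed SKILL.md.
    sample = "MANDATORY: run the gate. DO NOT skip the gate. MANDATORY: log it."
    for phrase, n in count_emphasis(sample).items():
        print(f"{phrase}: {n}")
```

A phrase that appears more than once or twice is a candidate for consolidation into one guard statement near the top of the workflow.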

Dimension	Reasoning	Score
Conciseness	The body is mostly lean and operational, but repeated emphasis ('MANDATORY', 'DO NOT skip', 'DO NOT wait'), the AskUserQuestion bug guard, and verbose 'Why:' lines on every anti-pattern add tokens that could be tightened.	2 / 3
Actionability	Concrete executable guidance throughout: explicit entry/exit gates, real command invocation ('./scripts/run-query-benchmarks.sh --env staging --compare main'), file paths, token budgets, and a result-classification table.	3 / 3
Workflow Clarity	A clearly sequenced two-phase workflow with explicit validation checkpoints (entry gate before Phase 2, confirm/refute criteria before execution) and feedback loops (loop back to 1.1–1.4, partial→re-probe self-transition).	3 / 3
Progressive Disclosure	A References section signals one-level-deep handoff/reference files, but no references/, scripts/, or assets/ bundle directories exist, so the referenced paths (reference.md, run-query-benchmarks.sh, the probe-to-* handoff files) point to files that are not present.	2 / 3
	Total	10 / 12 Passed
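The total in the table above is the sum of the four dimension scores, each out of 3. A small sketch of that arithmetic follows; the helper name `tally` is hypothetical, and the sketch assumes every dimension shares the same 3-point maximum, as the table shows.

```python
def tally(scores, max_per_dimension=3):
    """Sum per-dimension scores and compute the maximum attainable total."""
    total = sum(scores.values())
    maximum = max_per_dimension * len(scores)
    return total, maximum


# Scores copied from the table above.
implementation = {
    "Conciseness": 2,
    "Actionability": 3,
    "Workflow Clarity": 3,
    "Progressive Disclosure": 2,
}

total, maximum = tally(implementation)
print(f"{total} / {maximum}")  # prints "10 / 12"
```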

Description

90%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description is strong: it answers both what and when with explicit trigger terms and a clear disambiguation against neighboring skills. The only soft spot is specificity, where the concrete actions read more as process phases than a comprehensive capability list.

Dimension	Reasoning	Score
Specificity	Names the Complex domain and concrete two-phase actions ('foreground qualify → background probe → sense result'), but these are process steps rather than a comprehensive list of discrete capabilities, matching the score-2 anchor.	2 / 3
Completeness	It explicitly answers both what ('Safe-to-fail experiment for Complex domain problems... Two-phase...') and when (an explicit 'Use when:' clause), matching the score-3 anchor.	3 / 3
Trigger Term Quality	The 'Use when:' clause lists natural terms a user would say ('probe, safe-to-fail, test hypothesis, experiment with hypothesis, Complex domain with hypothesis'), giving good coverage.	3 / 3
Distinctiveness Conflict Risk	A clear Cynefin-Complex niche with an explicit 'NOT for brainstorming (use brainstorm) or known cause-effect (use investigate)' disambiguation, making wrong-skill conflict unlikely.	3 / 3
	Total	11 / 12 Passed

Validation

75%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation — 12 / 16 Passed

Validation for skill structure

Criteria	Description	Result
allowed_tools_field	'allowed-tools' contains unusual tool name(s)	Warning
frontmatter_unknown_keys	Unknown frontmatter key(s) found; consider removing or moving to metadata	Warning
relative_links	Relative link issues: 4 missing	Warning
referenced_paths_exist	Referenced path issues: 10 missing	Warning

	Total	12 / 16 Passed
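The referenced_paths_exist warning can be reproduced locally before publishing. The sketch below is one possible approach and is not the validator's actual implementation: the regular expression, the function name `missing_paths`, and the assumption that bundled files live under `references/`, `scripts/`, or `assets/` are all illustrative.

```python
from __future__ import annotations

import re
from pathlib import Path

# Assumed convention: bundled files live under these three directories.
PATH_PATTERN = re.compile(r"((?:references|scripts|assets)/[\w./-]+)")


def missing_paths(skill_md_text: str, skill_root: Path) -> list[str]:
    """Return referenced bundle paths that do not exist under skill_root."""
    # Strip a trailing period so a path ending a sentence is not misread.
    found = {match.rstrip(".") for match in PATH_PATTERN.findall(skill_md_text)}
    return [p for p in sorted(found) if not (skill_root / p).exists()]


if __name__ == "__main__":
    # Sample text built from the two paths named in this review.
    text = (
        "See references/reference.md. Run "
        "./scripts/run-query-benchmarks.sh --env staging --compare main"
    )
    for path in missing_paths(text, Path(".")):
        print(f"missing: {path}")
```

Running this against the skill's root directory lists each referenced file that still needs to be shipped or inlined.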

Reviewed

about 1 month ago
