# Skill Review: skill-comply

> Visualize whether skills, rules, and agent definitions are actually followed — auto-generates scenarios at 3 prompt strictness levels, runs agents, classifies behavioral sequences, and reports compliance rates with full tool call timelines
**Overall score: 61%.** Does it follow best practices?

| Check | Status | Note |
|---|---|---|
| Impact | Pending | No eval scenarios have been run |
| Quality | Passed | No known issues |
## Discovery (67%)

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
The description excels at specificity and distinctiveness, clearly articulating a unique and detailed set of capabilities around agent compliance testing and visualization. However, it lacks an explicit 'Use when...' clause, which caps completeness, and the trigger terms lean technical rather than matching natural user language patterns. Adding explicit trigger guidance and more natural keywords would significantly improve skill selection accuracy.
### Suggestions

- Add an explicit 'Use when...' clause, e.g., 'Use when the user wants to test whether their skills, rules, or agent prompts are being followed, or asks about compliance checking or behavioral evaluation.'
- Include more natural trigger terms users might say, such as 'test my skill', 'check rule compliance', 'evaluate agent behavior', 'prompt adherence', or 'are my instructions being followed' (a frontmatter sketch follows this list).
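As a concrete illustration, here is a minimal frontmatter sketch that folds both suggestions into the reviewed description. The wording is one possibility, not a prescribed fix:

```yaml
# Illustrative revision only; this is a sketch, not the reviewed skill's
# actual frontmatter.
name: skill-comply
description: >-
  Visualize whether skills, rules, and agent definitions are actually
  followed — auto-generates scenarios at 3 prompt strictness levels,
  runs agents, classifies behavioral sequences, and reports compliance
  rates with full tool call timelines. Use when the user wants to test
  whether their skills, rules, or agent prompts are being followed, or
  asks to 'test my skill', 'check rule compliance', or 'evaluate agent
  behavior'.
```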
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific, concrete actions: auto-generates scenarios at 3 prompt strictness levels, runs agents, classifies behavioral sequences, and reports compliance rates with full tool call timelines. These are detailed, concrete capabilities. | 3 / 3 |
| Completeness | The 'what' is well covered (auto-generates scenarios, runs agents, classifies behavioral sequences, reports compliance rates), but there is no explicit 'Use when...' clause or equivalent trigger guidance. The 'when' is only implied by the description of capabilities. | 2 / 3 |
| Trigger Term Quality | Contains some relevant terms like 'skills', 'rules', 'agent definitions', 'compliance rates', and 'tool call timelines', but misses common natural-language triggers users might type, such as 'test my skill', 'check if rules are followed', 'evaluate agent behavior', or 'prompt testing'. The language is more technical and descriptive than what users would naturally say. | 2 / 3 |
| Distinctiveness / Conflict Risk | This is a highly specific niche — compliance visualization for skills, rules, and agent definitions, with scenario generation at multiple strictness levels. It is very unlikely to conflict with other skills, given its unique combination of behavioral compliance testing and reporting. | 3 / 3 |
| Total | | 10 / 12 Passed |
## Implementation (37%)

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The skill provides a clear high-level overview of what skill-comply does and how to invoke it via the CLI, but falls short on actionability and workflow clarity. The multi-step pipeline is described conceptually, without validation checkpoints or error-recovery guidance, and there are no concrete examples of specs, reports, or expected outputs that would help Claude understand what success looks like.
### Suggestions

- Add a concrete example of what a generated spec and report look like (even abbreviated), so Claude knows what to expect and can verify correctness (a hypothetical sketch follows this list).
- Add validation checkpoints and error-handling guidance for the multi-step pipeline — e.g., what to do if spec generation produces incorrect steps, or if `claude -p` fails mid-run (a retry sketch follows the dimension table below).
- Include guidance on interpreting and acting on compliance results — e.g., what compliance-rate thresholds mean and what remediation steps to take.
- Link to external reference files for the detailed report format, spec schema, and configuration options rather than leaving those details unaddressed.
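For instance, the skill could embed an abbreviated example along these lines. This is a purely hypothetical sketch: the real skill-comply spec and report schemas are not shown in the reviewed SKILL.md, and every field name below is invented for illustration.

```yaml
# Hypothetical abbreviated spec: the behavioral steps a compliant agent
# should exhibit, in order. All field names are invented.
spec:
  target: my-skill                  # skill/rule/agent definition under test
  steps:
    - expect_tool: Read             # e.g., agent reads the target file first
    - expect_tool: Bash             # then runs the documented command

# Hypothetical abbreviated report: what "success" could look like.
report:
  compliance_rate: 0.83             # share of runs following all steps in order
  strictness_levels: [strict, moderate, loose]
  failures:
    - scenario: 7
      missing_step: "expect_tool: Read"
```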
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | Mostly efficient, but includes some unnecessary explanation. The 'Key Concept: Prompt Independence' section is a one-liner that could be folded into the intro. The 'Supported Targets' section explains things Claude could infer from examples. The 'When to Activate' section is useful but slightly verbose. | 2 / 3 |
| Actionability | Provides concrete CLI commands for running the tool, which is good. However, it lacks executable examples of what a spec looks like, what report output looks like, or how to interpret results. The numbered pipeline steps (1–6) describe rather than instruct — there is no guidance on what to do when compliance is low or how to act on the report. | 2 / 3 |
| Workflow Clarity | The skill describes a multi-step pipeline (generate specs → generate scenarios → run agents → classify → check ordering → report) but provides no validation checkpoints, no error-handling guidance, and no feedback loops. There is no guidance on what to do if spec generation is wrong, if `claude -p` fails, or if classification seems off. For a complex multi-step automated process, this is insufficient. | 1 / 3 |
| Progressive Disclosure | The content is reasonably structured with clear sections, and the 'Advanced' subsection is a nice touch for optional content. However, there are no references to external files for detailed information (e.g., report format details, spec format, configuration options), and the report-contents section would benefit from linking to an example report file. | 2 / 3 |
| Total | | 7 / 12 Passed |
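One way to address the Workflow Clarity gap is a retry checkpoint around the agent run. This is a minimal sketch assuming the pipeline shells out to `claude -p`; the file paths, retry count, and errored-vs-non-compliant distinction are illustrative choices, not skill-comply's documented behavior.

```bash
#!/usr/bin/env bash
# Minimal illustrative checkpoint: retry a failed `claude -p` run before
# recording the scenario as errored (distinct from non-compliant).
# File paths and the retry count are hypothetical choices.
set -u

scenario="scenarios/scenario-01.md"   # hypothetical generated scenario
out="runs/scenario-01.json"           # hypothetical transcript output

for attempt in 1 2 3; do
  if claude -p "$(cat "$scenario")" --output-format json > "$out"; then
    echo "scenario completed on attempt $attempt"
    exit 0
  fi
  echo "claude -p failed (attempt $attempt), retrying..." >&2
done

echo "giving up after 3 attempts; record scenario as errored" >&2
exit 1
```

Separating transient run failures from genuine non-compliance keeps the reported compliance rate from being dragged down by infrastructure errors.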
## Validation (90%)

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

### Validation for skill structure (10 / 11 Passed)
| Criterion | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing them or moving them under `metadata` | Warning |
| Total | | 10 / 11 Passed |
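To clear the remaining warning, unknown top-level keys can be moved under `metadata`, as the message suggests. In this sketch the offending key name (`version`) is hypothetical, since the report does not name the actual key:

```yaml
# Before (hypothetical): an unrecognized top-level key triggers the warning.
name: skill-comply
version: 0.3.0        # unknown key (illustrative)
---
# After: custom keys live under `metadata`.
name: skill-comply
metadata:
  version: 0.3.0
```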