Use this skill to monitor a deployed URL for regressions after deploys, merges, or dependency upgrades.
Overall score: 44%
Does it follow best practices?
Impact: Pending (no eval scenarios have been run)
Advisory: Suggest reviewing before use
Quality
Discovery: 40%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
The description provides reasonable 'when' context (after deploys, merges, dependency upgrades) but is weak on specificity — it doesn't explain what kind of monitoring or regression checks are performed. The second-person 'Use this skill to' phrasing is acceptable per the rubric examples, but the lack of concrete actions and missing common trigger term variations limit its effectiveness for skill selection.
Suggestions
Add specific concrete actions the skill performs, e.g., 'Runs HTTP health checks, captures screenshots for visual diff comparison, and validates response codes against a deployed URL'.
Expand trigger terms to include natural variations like 'smoke test', 'post-deploy check', 'regression testing', 'site health check', or 'production verification'.
Clarify the mechanism of monitoring to distinguish this from general testing or CI/CD skills — e.g., is it visual regression, functional testing, uptime monitoring, or performance benchmarking?
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description says 'monitor a deployed URL for regressions' but doesn't specify what concrete actions are performed — e.g., does it run visual diff checks, HTTP status checks, performance benchmarks, or something else? 'Monitor for regressions' is vague. | 1 / 3 |
| Completeness | It addresses 'when' fairly well ('after deploys, merges, or dependency upgrades') but the 'what' is weak — 'monitor a deployed URL for regressions' doesn't explain what specific actions or checks are performed. The 'Use this skill to...' phrasing serves as a trigger clause, but the what-it-does portion lacks detail. | 2 / 3 |
| Trigger Term Quality | Includes some relevant terms like 'deployed URL', 'regressions', 'deploys', 'merges', 'dependency upgrades' that users might naturally mention. However, it misses common variations like 'smoke test', 'post-deploy check', 'site monitoring', 'health check', or 'regression testing'. | 2 / 3 |
| Distinctiveness / Conflict Risk | The combination of 'deployed URL' and 'regressions after deploys' provides some distinctiveness, but 'monitor for regressions' could overlap with testing skills, CI/CD skills, or general monitoring skills. Without specifying the mechanism (visual regression, functional checks, etc.), it's not clearly distinct. | 2 / 3 |
| Total | 7 / 12 Passed | |
Implementation: 22%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill reads more like a feature specification or product requirements document than an actionable skill for Claude. It describes what a canary watch system should do (check HTTP status, measure performance, alert on thresholds) but provides zero implementation — no actual code to make HTTP requests, no browser automation to measure Web Vitals, no scripts to parse console errors. The '/canary-watch' commands reference a non-existent tool with no explanation of what backs them.
Suggestions
Replace the declarative descriptions with actual executable code showing how to perform each check (e.g., using fetch/curl for HTTP status, Puppeteer/Playwright for Web Vitals, etc.); see the sketch after these suggestions.
Add a concrete step-by-step workflow: 1) capture baseline, 2) run checks, 3) compare against thresholds, 4) if critical alert → take specific action, 5) if sustained watch → loop with sleep
Clarify what tools/MCP servers Claude should use to implement this — the '/canary-watch' command syntax implies a tool that doesn't exist; either define it or show how to build the monitoring from available primitives
Remove the 'When to Use' section entirely — Claude can infer when post-deploy monitoring is appropriate — and use that space for implementation details
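For example, here is a minimal sketch of the suggested check-and-watch loop, written in TypeScript and assuming Node 18+ for the global `fetch`. The thresholds, interval, example URL, and the decision to simply stop on a failed check are illustrative assumptions, not behavior defined by the skill:

```ts
// canary-check.ts: a minimal post-deploy check loop (sketch only).
// Assumes Node 18+ (global fetch); thresholds and URL are illustrative.

interface Thresholds {
  maxStatus: number;    // highest acceptable HTTP status code
  maxLatencyMs: number; // slowest acceptable response time
}

const thresholds: Thresholds = { maxStatus: 399, maxLatencyMs: 2000 };

// Steps 2 and 3 of the suggested workflow: run one check and compare it
// against the thresholds.
async function checkOnce(url: string): Promise<{ ok: boolean; detail: string }> {
  const start = Date.now();
  try {
    const res = await fetch(url, { redirect: 'follow' });
    const latencyMs = Date.now() - start;
    if (res.status > thresholds.maxStatus) {
      return { ok: false, detail: `HTTP ${res.status}` };
    }
    if (latencyMs > thresholds.maxLatencyMs) {
      return { ok: false, detail: `slow response: ${latencyMs}ms` };
    }
    return { ok: true, detail: `HTTP ${res.status} in ${latencyMs}ms` };
  } catch (err) {
    return { ok: false, detail: `request failed: ${(err as Error).message}` };
  }
}

// Steps 1, 4, and 5: capture a baseline, then loop with a sleep between
// rounds, stopping on the first failed check.
async function watch(url: string, intervalMs: number, rounds: number): Promise<void> {
  const baseline = await checkOnce(url);
  console.log(`baseline: ${baseline.detail}`);
  for (let round = 1; round <= rounds; round++) {
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
    const result = await checkOnce(url);
    console.log(`round ${round}: ${result.detail}`);
    if (!result.ok) {
      // The "critical alert" action is left to the skill; here we just stop.
      throw new Error(`canary check failed: ${result.detail}`);
    }
  }
}

watch('https://example.com', 30_000, 10).catch((err) => {
  console.error((err as Error).message);
  process.exit(1);
});
```

The same sequence could equally be driven by curl from a shell script; the point is that each check becomes an explicit, executable step rather than a described behavior.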
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The content is reasonably structured but includes some unnecessary sections like 'When to Use', which lists obvious scenarios Claude can infer. The 'What It Watches' section is somewhat explanatory rather than instructional. However, it's not egregiously verbose. | 2 / 3 |
| Actionability | The skill describes what a canary watch system would do but provides no executable code, no actual implementation, and no concrete commands that work. The '/canary-watch' commands appear to reference a tool that doesn't exist — there's no code showing how to actually perform HTTP checks, measure LCP/CLS, or detect console errors. It reads like a product spec, not an actionable skill (a browser-check sketch follows this table). | 1 / 3 |
| Workflow Clarity | There is no clear step-by-step workflow for how to actually perform monitoring. The skill describes watch modes and thresholds declaratively but never sequences the actual operations Claude should perform. There are no validation checkpoints or error recovery steps for when checks fail. | 1 / 3 |
| Progressive Disclosure | The content is organized into logical sections with clear headers, and the Integration section references related skills. However, the alert thresholds YAML block and output example are inline when they could be referenced separately, and there are no links to detailed implementation docs. | 2 / 3 |
| Total | 6 / 12 Passed | |
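To make the missing browser-level checks concrete, here is a rough Playwright sketch that captures console errors and takes an approximate LCP reading. It assumes the `playwright` package is installed; the example URL and the 3-second LCP fallback are assumptions for illustration, not behavior the skill defines:

```ts
// browser-checks.ts: sketch of console-error capture and a rough LCP reading.
// Assumes the `playwright` package is installed; the URL is illustrative.
import { chromium } from 'playwright';

async function browserCheck(url: string): Promise<void> {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  // Collect console errors and uncaught page errors while the page loads.
  const consoleErrors: string[] = [];
  page.on('console', (msg) => {
    if (msg.type() === 'error') consoleErrors.push(msg.text());
  });
  page.on('pageerror', (err) => consoleErrors.push(err.message));

  await page.goto(url, { waitUntil: 'load' });

  // Approximate LCP via PerformanceObserver; resolves to 0 if no
  // largest-contentful-paint entry is reported within 3 seconds.
  const lcpMs = await page.evaluate(
    () =>
      new Promise<number>((resolve) => {
        new PerformanceObserver((list) => {
          const entries = list.getEntries();
          resolve(entries[entries.length - 1].startTime);
        }).observe({ type: 'largest-contentful-paint', buffered: true });
        setTimeout(() => resolve(0), 3000);
      })
  );

  await browser.close();

  console.log(`console errors: ${consoleErrors.length}`);
  console.log(`approximate LCP: ${lcpMs.toFixed(0)}ms`);
}

browserCheck('https://example.com').catch((err) => {
  console.error(err);
  process.exit(1);
});
```

CLS could be read the same way from 'layout-shift' entries; either sketch would need to be adapted to whatever tools the skill actually has available.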
Validation: 90%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | 10 / 11 Passed | |