Monitor running experiments, check progress, collect results. Use when user says "check results", "is it done", "monitor", or wants experiment output.
68
84%
Does it follow best practices?
Impact
—
No eval scenarios have been run
Advisory
Suggest reviewing before use
Quality
Discovery
92%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is a well-constructed skill description that clearly states concrete capabilities and provides explicit trigger terms. The 'Use when...' clause with quoted user phrases is effective for skill selection. The main weakness is that 'monitor' and 'check results' could potentially overlap with other monitoring-related skills, and the description could benefit from slightly more specificity about what kind of experiments are being monitored.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple concrete actions: 'Monitor running experiments', 'check progress', 'collect results'. These are specific, actionable capabilities. | 3 / 3 |
| Completeness | Clearly answers both what ('Monitor running experiments, check progress, collect results') and when ('Use when user says "check results", "is it done", "monitor", or wants experiment output') with explicit triggers. | 3 / 3 |
| Trigger Term Quality | Includes natural trigger terms users would actually say: 'check results', 'is it done', 'monitor', and 'experiment output'. These cover common phrasings well. | 3 / 3 |
| Distinctiveness / Conflict Risk | The term 'monitor' is somewhat generic and could overlap with system monitoring or other monitoring skills. However, the experiment-specific context ('running experiments', 'experiment output') helps narrow it. Could be more distinctive by specifying the type of experiments or platform. | 2 / 3 |
| Total | 11 / 12 Passed | |
Implementation
77%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a solid, actionable skill with clear workflow sequencing and executable commands covering multiple deployment targets (SSH, Vast.ai, Modal). Its main weakness is verbosity in the W&B section and some unnecessary explanatory text that doesn't add value for Claude. The skill would benefit from tightening commentary and potentially splitting the W&B monitoring into a referenced sub-file.
Suggestions
- Remove explanatory commentary like 'This gives the auto-review-loop richer signal...' and the 'What to extract' bullet descriptions; Claude can infer what metrics matter from the code.
- Consider extracting the W&B monitoring section into a separate WANDB_MONITORING.md file referenced from the main skill, reducing the main file's token footprint.
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is mostly efficient but includes some unnecessary commentary (e.g., 'This gives the auto-review-loop richer signal than just screen output' is explanatory rather than instructional). The W&B section is quite lengthy with inline Python scripts that could be more condensed. The 'What to extract' bullet list explains concepts Claude already understands. | 2 / 3 |
| Actionability | The skill provides fully executable bash commands and Python code snippets for every step. Commands are copy-paste ready with clear placeholders (<server>, <PORT>, <HOST>, etc.), and concrete examples are given for SSH, screen capture, JSON parsing, W&B API calls, and vastai CLI usage. | 3 / 3 |
| Workflow Clarity | The workflow is clearly sequenced with numbered steps from checking running processes through collecting output, parsing results, summarizing, interpreting, and notifying. It includes validation guidance (Step 5 flags unexpected results, checking logs for errors) and conditional logic (skip W&B if not configured, skip Feishu if absent). The 'if hardcopy fails' fallback and 'if results look wrong, check training logs' provide error recovery paths. | 3 / 3 |
| Progressive Disclosure | The content is reasonably well-structured with clear section headers, but it's a fairly long monolithic document with no references to external files. The W&B section in particular is quite detailed and could be split into a separate reference file. However, for a skill of this complexity, the inline approach is borderline acceptable. | 2 / 3 |
| Total | 10 / 12 Passed | |
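The result-parsing and "flag unexpected results" steps praised under Workflow Clarity can be pictured with a minimal Python sketch. This is not the skill's own code: the function name `summarize_results`, the metric names, and the expected ranges are all hypothetical placeholders chosen for illustration.

```python
import json


def summarize_results(raw_json, expected_ranges):
    """Parse a results JSON string and flag metrics outside expected ranges.

    Metric names and ranges are hypothetical; a real skill would take them
    from the experiment's own configuration.
    """
    results = json.loads(raw_json)
    flagged = {}
    for metric, (low, high) in expected_ranges.items():
        value = results.get(metric)
        if value is None:
            # The metric was expected but the experiment did not report it.
            flagged[metric] = "missing"
        elif not (low <= value <= high):
            flagged[metric] = f"out of range: {value}"
    return results, flagged


# Hypothetical experiment output and expectations.
sample = '{"accuracy": 0.12, "loss": 2.3}'
ranges = {"accuracy": (0.5, 1.0), "loss": (0.0, 5.0), "f1": (0.0, 1.0)}
results, flagged = summarize_results(sample, ranges)
print(flagged)
```

Anything that lands in `flagged` would be the cue to follow the recovery path the review mentions, namely checking the training logs for errors.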
Validation
81%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 9 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| allowed_tools_field | 'allowed-tools' contains unusual tool name(s) | Warning |
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | 9 / 11 Passed | |
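To make the `frontmatter_unknown_keys` warning concrete, here is a small Python sketch of how such a check might work. The allow-list `KNOWN_KEYS` is an assumption, not the validator's actual list, and the example frontmatter (including the key `argument-hint`) is invented for illustration.

```python
# Hypothetical allow-list; the real spec's set of recognized keys may differ.
KNOWN_KEYS = {"name", "description", "license", "allowed-tools", "metadata"}


def unknown_frontmatter_keys(frontmatter):
    """Return the frontmatter keys not in the allow-list, sorted by name."""
    return sorted(key for key in frontmatter if key not in KNOWN_KEYS)


# Invented frontmatter with one unrecognized key.
example = {
    "name": "monitor-experiment",
    "description": "Monitor running experiments, check progress, collect results.",
    "argument-hint": "[server]",
}
unknown = unknown_frontmatter_keys(example)
print(unknown)
```

A non-empty result corresponds to the warning in the table; the suggested remedy there is to remove such keys or move them under `metadata`.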
a425a71