# Skill Review: monitor-experiment

Skill description:

> Monitor running experiments, check progress, collect results. Use when user says "check results", "is it done", "monitor", or wants experiment output.
## Quality: 78%

Does it follow best practices?

Impact: Pending (no eval scenarios have been run). Advisory: suggest reviewing before use.

Optimize this skill with Tessl:

`npx tessl skill review --optimize ./skills/monitor-experiment/SKILL.md`
## Discovery: 92%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is a well-structured description that clearly states concrete capabilities and provides explicit trigger terms in a 'Use when' clause. The trigger terms are natural and cover common user phrasings. The main weakness is that 'experiments' is somewhat broad and could overlap with other monitoring or results-related skills; specifying the type of experiments (e.g., ML training runs, A/B tests) would improve distinctiveness.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple concrete actions: 'Monitor running experiments', 'check progress', 'collect results'. These are specific, actionable capabilities. | 3 / 3 |
| Completeness | Clearly answers both what ('Monitor running experiments, check progress, collect results') and when ('Use when user says "check results", "is it done", "monitor", or wants experiment output') with explicit triggers. | 3 / 3 |
| Trigger Term Quality | Includes natural trigger terms users would actually say: 'check results', 'is it done', 'monitor', and 'experiment output'. These cover common phrasings well. | 3 / 3 |
| Distinctiveness / Conflict Risk | The terms 'monitor' and 'check results' could overlap with other monitoring or results-checking skills (e.g., CI/CD monitoring, test results). The 'experiment' domain helps narrow it but could still conflict with other experiment-related skills. | 2 / 3 |
| **Total** | | **11 / 12 (Passed)** |
## Implementation: 64%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a solid, actionable monitoring skill with concrete commands for multiple platforms. Its main weaknesses are the lack of error-handling/validation checkpoints in the workflow and the lengthy inline W&B section that could benefit from being extracted to a separate reference. The multi-platform coverage is comprehensive but contributes to length that could be better organized.
### Suggestions

- Add explicit error-handling guidance for common failure modes (SSH connection failures, dead screen sessions, missing result files) to improve workflow robustness.
- Extract the W&B metrics section (Step 3.5) into a separate WANDB_MONITORING.md file and reference it with a one-line link, keeping the main skill leaner.
- Remove the explanatory blockquote ('This gives the auto-review-loop richer signal...'); it explains rationale Claude doesn't need.
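The first suggestion could be implemented as small pre-flight checks in the skill. A minimal sketch follows; the host (`gpu-box`), screen session (`train`), and results file (`results.json`) names are hypothetical placeholders, not from the skill itself:

```shell
# Hedged sketch of error-handling checkpoints for a monitoring workflow.
# All host/session/file names in the commented wiring are hypothetical.

# Return 0 if the remote host is reachable over SSH within 5 seconds.
ssh_ok() {
  ssh -o ConnectTimeout=5 -o BatchMode=yes "$1" true 2>/dev/null
}

# Return 0 if a named screen session is still listed on the remote host.
screen_alive() {
  ssh "$1" "screen -ls 2>/dev/null | grep -q '$2'"
}

# Return 0 if a (local) results file exists and parses as JSON.
json_valid() {
  [ -f "$1" ] && python3 -c 'import json,sys; json.load(open(sys.argv[1]))' "$1" 2>/dev/null
}

# Example wiring (hypothetical names):
# ssh_ok gpu-box             || { echo "ERROR: cannot reach gpu-box over SSH" >&2; exit 1; }
# screen_alive gpu-box train || echo "WARN: screen session 'train' is dead" >&2
# json_valid results.json    || echo "WARN: results.json missing or malformed" >&2
```

Checks like these give the workflow explicit failure branches (retry, warn, or abort) instead of letting a dead session or half-written JSON file surface as a confusing downstream error.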
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is mostly efficient but includes some unnecessary commentary (e.g., 'This gives the auto-review-loop richer signal than just screen output' is explanatory rather than instructional). The W&B section is quite lengthy and could be tightened. However, most content is actionable commands rather than explanation. | 2 / 3 |
| Actionability | Provides concrete, executable bash commands and Python snippets for every step. SSH commands, screen capture, JSON parsing, W&B API calls, and vastai CLI commands are all copy-paste ready with clear parameterization. | 3 / 3 |
| Workflow Clarity | Steps are clearly sequenced and numbered, covering discovery through interpretation. However, there are no explicit validation checkpoints or feedback loops, e.g., no guidance on what to do if SSH fails, if screen sessions are dead, or if JSON results are malformed. For a monitoring workflow that could encounter many failure modes, this is a gap. | 2 / 3 |
| Progressive Disclosure | The content is well-structured with clear headers and steps, but the W&B section (Step 3.5) is quite long and could be split into a separate reference file. The skill is over 100 lines and handles multiple platforms (SSH, Vast.ai, Modal, W&B, Feishu) all inline, making it a somewhat dense monolithic document. | 2 / 3 |
| **Total** | | **9 / 12 (Passed)** |
## Validation: 81%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

### Validation for skill structure: 9 / 11 Passed
| Criteria | Description | Result |
|---|---|---|
| allowed_tools_field | 'allowed-tools' contains unusual tool name(s) | Warning |
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| **Total** | | **9 / 11 Passed** |
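Both warnings point at the SKILL.md frontmatter. One way to clear them, sketched below under assumptions: `name`, `description`, and `allowed-tools` are the spec-recognized keys, the `allowed-tools` values are real tool names in your agent (verify against its tool list), and unrecognized custom keys move under `metadata` as the validator message suggests. The `platforms` entry is a hypothetical example of such a custom key:

```yaml
---
name: monitor-experiment
description: Monitor running experiments, check progress, collect results. Use when user says "check results", "is it done", "monitor", or wants experiment output.
# Keep 'allowed-tools' to tool names the agent actually exposes;
# verify each entry against your agent's tool list.
allowed-tools: Bash, Read
# Custom keys the validator does not recognize belong under 'metadata'
# ('platforms' here is a hypothetical example):
metadata:
  platforms: [ssh, vastai, modal, wandb]
---
```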