Monitor running experiments, check progress, collect results. Use when user says "check results", "is it done", "monitor", or wants experiment output.
73%

Does it follow best practices?

Evals: Pending. No eval scenarios have been run. Advisory: suggest reviewing before use.

Optimize this skill with Tessl:

    npx tessl skill review --optimize ./skills/monitor-experiment/SKILL.md

Quality
Discovery
82%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
The description is functional with a clear 'Use when' clause and natural trigger terms, making it easy for Claude to know when to select it. However, the capabilities listed are somewhat generic ('monitor', 'check progress', 'collect results') and could benefit from more concrete specifics about what kind of experiments and what actions are actually performed. The distinctiveness could be improved by specifying the experiment framework or domain.
Suggestions
Add specifics about what kind of experiments (e.g., ML training runs, A/B tests, scientific simulations) and concrete actions (e.g., 'parse training logs', 'extract metrics from output files').
Improve distinctiveness by specifying the experiment framework or environment to reduce overlap with other monitoring skills.
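Following these suggestions, a sharper frontmatter description might look like the sketch below. The ML-training framing and exact wording are illustrative assumptions; the backend list (SSH, Vast.ai, Modal, W&B) is taken from the Implementation notes.

```yaml
---
name: monitor-experiment
description: >
  Monitor running ML training experiments over SSH, Vast.ai, or Modal:
  attach to screen sessions, parse training logs, extract metrics from
  JSON result files, and pull loss curves from W&B. Use when the user
  says "check results", "is it done", "monitor the run", or asks for
  training output or metrics.
---
```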
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Names the domain (experiments) and lists some actions (monitor, check progress, collect results), but the actions are somewhat generic and don't describe concrete mechanisms or specific operations like 'parse log files' or 'query experiment database'. | 2 / 3 |
| Completeness | Clearly answers both 'what' (monitor running experiments, check progress, collect results) and 'when' (explicit 'Use when' clause with specific trigger phrases like 'check results', 'is it done', 'monitor'). | 3 / 3 |
| Trigger Term Quality | Includes natural trigger terms users would actually say: 'check results', 'is it done', 'monitor', and 'experiment output'. These are realistic phrases covering common variations of how users would ask about experiment status. | 3 / 3 |
| Distinctiveness / Conflict Risk | The term 'experiments' provides some specificity, but 'monitor', 'check results', and 'is it done' could overlap with other monitoring-related skills (e.g., CI/CD monitoring, job queue monitoring). The description doesn't clarify what type of experiments. | 2 / 3 |
| Total | | 10 / 12 (Passed) |
Implementation
64%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a solid, actionable monitoring skill with concrete commands for multiple infrastructure backends (SSH, Vast.ai, Modal, W&B). Its main weaknesses are the lack of validation/error-recovery checkpoints in the workflow and some verbosity in the W&B section that could be trimmed or split into a separate file. The conditional sections (W&B, Feishu, Modal) make the skill comprehensive but add length that could benefit from progressive disclosure.
Suggestions
Add explicit error handling/validation checkpoints — e.g., what to do when SSH connection fails, when screen sessions don't exist, or when JSON results are malformed/empty.
Extract the W&B monitoring section (Step 3.5) into a separate WANDB_MONITORING.md file and reference it with a one-line link, keeping the main skill leaner.
Remove the explanatory note 'This gives the auto-review-loop richer signal than just screen output — training dynamics, loss curves, and metric trends over time' as it explains rationale Claude doesn't need.
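As a minimal sketch of the error-handling checkpoints suggested above, in the same Python style the skill already uses: `check_ssh` and `parse_results` are hypothetical helper names, and the host/file names are placeholders, not the skill's actual values.

```python
import json
import subprocess


def check_ssh(host, timeout=10):
    """Return True if a no-op SSH command to `host` succeeds.

    A cheap preflight check: fail fast with a clear signal instead of
    letting later screen/log commands error out confusingly.
    """
    result = subprocess.run(
        ["ssh", "-o", f"ConnectTimeout={timeout}", "-o", "BatchMode=yes",
         host, "true"],
        capture_output=True,
    )
    return result.returncode == 0


def parse_results(text):
    """Parse experiment results, returning (data, error) instead of raising.

    Covers the malformed/empty JSON cases the review flags as unhandled.
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return None, f"malformed JSON: {exc}"
    if not data:
        return None, "results file is empty"
    return data, None
```

An agent following the skill could call `check_ssh("user@gpu-box")` before attempting any screen capture, and route the `error` string from `parse_results` back to the user rather than presenting empty output as a finished run.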
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is mostly efficient but includes some unnecessary commentary (e.g., 'This gives the auto-review-loop richer signal than just screen output' is explanatory padding). The W&B section is quite lengthy and could be tightened. However, most content is actionable commands rather than explanation. | 2 / 3 |
| Actionability | The skill provides fully executable bash commands and Python snippets for every step — SSH commands, screen capture, JSON parsing, W&B API calls, and vastai CLI commands. Commands are copy-paste ready with clear placeholder conventions. | 3 / 3 |
| Workflow Clarity | Steps are clearly sequenced and numbered, covering multiple infrastructure types (SSH, Vast.ai, Modal). However, there are no explicit validation checkpoints or error recovery feedback loops — e.g., no guidance on what to do if SSH fails, if screen sessions are dead, or if JSON results are malformed. The 'If hardcopy fails' note is minimal. | 2 / 3 |
| Progressive Disclosure | The content is a single monolithic file with no references to external documentation. The W&B section (Step 3.5) is quite long and could be split into a separate reference file. The conditional sections (W&B, Feishu, Modal) add bulk that not all users need inline. | 2 / 3 |
| Total | | 9 / 12 (Passed) |
Validation
81%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 9 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| allowed_tools_field | 'allowed-tools' contains unusual tool name(s) | Warning |
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | | 9 / 11 Passed |
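A sketch of how the two warnings above might be resolved. The report does not show the skill's actual frontmatter, so the tool names and the custom key below are hypothetical; only the pattern (canonical tool names, custom keys moved under `metadata`) follows from the warnings.

```yaml
---
name: monitor-experiment
description: Monitor running experiments, check progress, collect results.
# allowed_tools_field: keep only canonical tool names
allowed-tools: Bash, Read, Grep
# frontmatter_unknown_keys: move custom keys under `metadata`
metadata:
  maintainer: example-team   # hypothetical custom key
---
```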