Monitor running experiments, check progress, collect results. Use when user says "check results", "is it done", "monitor", or wants experiment output.
73%

Does it follow best practices?

Evals: Pending. No eval scenarios have been run. Advisory: suggest reviewing before use.

Optimize this skill with Tessl:

    npx tessl skill review --optimize ./skills/monitor-experiment/SKILL.md

Quality
Discovery
82%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
The description is functional with a clear 'Use when' clause and natural trigger terms, making it easy for Claude to know when to select it. However, the capabilities listed are somewhat generic ('monitor', 'check progress', 'collect results') and could benefit from more concrete specifics about what kind of experiments and what actions are actually performed. The distinctiveness could be improved by specifying the experiment framework or domain.
Suggestions
Add specifics about what kind of experiments (e.g., ML training runs, A/B tests, scientific simulations) and concrete actions (e.g., 'parse training logs', 'extract metrics from output files').
Improve distinctiveness by specifying the experiment framework or environment to reduce overlap with other monitoring skills.
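Following these suggestions, a sharper frontmatter description might look like the sketch below. The ML-training framing and exact wording are illustrative assumptions; the backend list (SSH, Vast.ai, Modal, W&B) is taken from the Implementation notes.

```yaml
---
name: monitor-experiment
description: >
  Monitor running ML training experiments over SSH, Vast.ai, or Modal:
  attach to screen sessions, parse training logs, extract metrics from
  JSON result files, and pull loss curves from W&B. Use when the user
  says "check results", "is it done", "monitor the run", or asks for
  training output or metrics.
---
```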
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Names the domain (experiments) and lists some actions (monitor, check progress, collect results), but the actions are somewhat generic and don't describe concrete mechanisms or specific operations like 'parse log files' or 'query experiment database'. | 2 / 3 |
| Completeness | Clearly answers both 'what' (monitor running experiments, check progress, collect results) and 'when' (explicit 'Use when' clause with specific trigger phrases like 'check results', 'is it done', 'monitor'). | 3 / 3 |
| Trigger Term Quality | Includes natural trigger terms users would actually say: 'check results', 'is it done', 'monitor', and 'experiment output'. These are realistic phrases covering common variations of how users would ask about experiment status. | 3 / 3 |
| Distinctiveness / Conflict Risk | The term 'experiments' provides some specificity, but 'monitor', 'check results', and 'is it done' could overlap with other monitoring-related skills (e.g., CI/CD monitoring, job queue monitoring). The description doesn't clarify what type of experiments. | 2 / 3 |
| Total | | 10 / 12 (Passed) |
Implementation
64%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a solid, actionable monitoring skill with concrete commands for multiple infrastructure backends (SSH, Vast.ai, Modal, W&B). Its main weaknesses are the lack of validation/error-recovery checkpoints in the workflow and some verbosity in the W&B section that could be trimmed or split into a separate file. The conditional sections (W&B, Feishu, Modal) make the skill comprehensive but add length that could benefit from progressive disclosure.
Suggestions
Add explicit error handling/validation checkpoints — e.g., what to do when SSH connection fails, when screen sessions don't exist, or when JSON results are malformed/empty.
Extract the W&B monitoring section (Step 3.5) into a separate WANDB_MONITORING.md file and reference it with a one-line link, keeping the main skill leaner.
Remove the explanatory note 'This gives the auto-review-loop richer signal than just screen output — training dynamics, loss curves, and metric trends over time' as it explains rationale Claude doesn't need.
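As a minimal sketch of the error-handling checkpoints suggested above, in the same Python style the skill already uses: `check_ssh` and `parse_results` are hypothetical helper names, and the host/file names are placeholders, not the skill's actual values.

```python
import json
import subprocess


def check_ssh(host, timeout=10):
    """Return True if a no-op SSH command to `host` succeeds.

    A cheap preflight check: fail fast with a clear signal instead of
    letting later screen/log commands error out confusingly.
    """
    result = subprocess.run(
        ["ssh", "-o", f"ConnectTimeout={timeout}", "-o", "BatchMode=yes",
         host, "true"],
        capture_output=True,
    )
    return result.returncode == 0


def parse_results(text):
    """Parse experiment results, returning (data, error) instead of raising.

    Covers the malformed/empty JSON cases the review flags as unhandled.
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return None, f"malformed JSON: {exc}"
    if not data:
        return None, "results file is empty"
    return data, None
```

An agent following the skill could call `check_ssh("user@gpu-box")` before attempting any screen capture, and route the `error` string from `parse_results` back to the user rather than presenting empty output as a finished run.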
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is mostly efficient but includes some unnecessary commentary (e.g., 'This gives the auto-review-loop richer signal than just screen output' is explanatory padding). The W&B section is quite lengthy and could be tightened. However, most content is actionable commands rather than explanation. | 2 / 3 |
| Actionability | The skill provides fully executable bash commands and Python snippets for every step — SSH commands, screen capture, JSON parsing, W&B API calls, and vastai CLI commands. Commands are copy-paste ready with clear placeholder conventions. | 3 / 3 |
| Workflow Clarity | Steps are clearly sequenced and numbered, covering multiple infrastructure types (SSH, Vast.ai, Modal). However, there are no explicit validation checkpoints or error recovery feedback loops — e.g., no guidance on what to do if SSH fails, if screen sessions are dead, or if JSON results are malformed. The 'If hardcopy fails' note is minimal. | 2 / 3 |
| Progressive Disclosure | The content is a single monolithic file with no references to external documentation. The W&B section (Step 3.5) is quite long and could be split into a separate reference file. The conditional sections (W&B, Feishu, Modal) add bulk that not all users need inline. | 2 / 3 |
| Total | | 9 / 12 (Passed) |
Validation
81%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 9 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| allowed_tools_field | 'allowed-tools' contains unusual tool name(s) | Warning |
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | | 9 / 11 Passed |
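A sketch of how the two warnings above might be resolved. The report does not show the skill's actual frontmatter, so the tool names and the custom key below are hypothetical; only the pattern (canonical tool names, custom keys moved under `metadata`) follows from the warnings.

```yaml
---
name: monitor-experiment
description: Monitor running experiments, check progress, collect results.
# allowed_tools_field: keep only canonical tool names
allowed-tools: Bash, Read, Grep
# frontmatter_unknown_keys: move custom keys under `metadata`
metadata:
  maintainer: example-team   # hypothetical custom key
---
```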