
monitor-experiment

Monitor running experiments, check progress, collect results. Use when user says "check results", "is it done", "monitor", or wants experiment output.

Overall score: 81

Quality: 78%
Does it follow best practices?

Impact: Pending
No eval scenarios have been run

Security by Snyk: Advisory
Suggest reviewing before use

Optimize this skill with Tessl

npx tessl skill review --optimize ./skills/monitor-experiment/SKILL.md

Quality

Discovery: 92%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is a well-structured description that clearly states concrete capabilities and provides explicit trigger terms in a 'Use when' clause. The trigger terms are natural and cover common user phrasings. The main weakness is that 'experiments' is somewhat broad and could overlap with other monitoring or results-related skills; specifying the type of experiments (e.g., ML training runs, A/B tests) would improve distinctiveness.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Lists multiple concrete actions: 'Monitor running experiments', 'check progress', 'collect results'. These are specific, actionable capabilities. | 3 / 3 |
| Completeness | Clearly answers both what ('Monitor running experiments, check progress, collect results') and when ('Use when user says "check results", "is it done", "monitor", or wants experiment output') with explicit triggers. | 3 / 3 |
| Trigger Term Quality | Includes natural trigger terms users would actually say: 'check results', 'is it done', 'monitor', and 'experiment output'. These cover common phrasings well. | 3 / 3 |
| Distinctiveness / Conflict Risk | The terms 'monitor' and 'check results' could overlap with other monitoring or results-checking skills (e.g., CI/CD monitoring, test results). The 'experiment' domain helps narrow it but could still conflict with other experiment-related skills. | 2 / 3 |

Total: 11 / 12 (Passed)

Implementation: 64%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a solid, actionable monitoring skill with concrete commands for multiple platforms. Its main weaknesses are the lack of error-handling and validation checkpoints in the workflow, and the lengthy inline W&B section, which would be better extracted to a separate reference. The multi-platform coverage is comprehensive but adds length that could be better organized.

Suggestions

Add explicit error-handling guidance for common failure modes (SSH connection failures, dead screen sessions, missing result files) to improve workflow robustness.

Extract the W&B metrics section (Step 3.5) into a separate WANDB_MONITORING.md file and reference it with a one-line link, keeping the main skill leaner.

Remove the explanatory blockquote ('This gives the auto-review-loop richer signal...') — it explains rationale Claude doesn't need.
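The first suggestion can be made concrete with a small helper. This is a minimal Python sketch, assuming the skill writes results as a single JSON file; the function name and return convention are illustrative, not taken from the skill itself:

```python
import json
from pathlib import Path


def load_results(path):
    """Return (results, error): parsed experiment results, or None plus
    a readable reason. Covers the two failure modes flagged above:
    a missing results file and malformed JSON."""
    p = Path(path)
    if not p.exists():
        return None, f"results file not found: {p}"
    try:
        return json.loads(p.read_text()), None
    except json.JSONDecodeError as e:
        return None, f"malformed JSON in {p}: {e}"
```

A missing file or malformed JSON then surfaces as a readable message instead of a traceback, giving the workflow a natural checkpoint before results interpretation.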

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The skill is mostly efficient but includes some unnecessary commentary (e.g., 'This gives the auto-review-loop richer signal than just screen output' is explanatory rather than instructional). The W&B section is quite lengthy and could be tightened. However, most content is actionable commands rather than explanation. | 2 / 3 |
| Actionability | Provides concrete, executable bash commands and Python snippets for every step. SSH commands, screen capture, JSON parsing, W&B API calls, and vastai CLI commands are all copy-paste ready with clear parameterization. | 3 / 3 |
| Workflow Clarity | Steps are clearly sequenced and numbered, covering discovery through interpretation. However, there are no explicit validation checkpoints or feedback loops: no guidance on what to do if SSH fails, if screen sessions are dead, or if JSON results are malformed. For a monitoring workflow that could encounter many failure modes, this is a gap. | 2 / 3 |
| Progressive Disclosure | The content is well-structured with clear headers and steps, but the W&B section (Step 3.5) is quite long and could be split into a separate reference file. The skill is over 100 lines and handles multiple platforms (SSH, Vast.ai, Modal, W&B, Feishu) all inline, making it a somewhat dense monolithic document. | 2 / 3 |

Total: 9 / 12 (Passed)
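The missing validation checkpoints noted under Workflow Clarity could be closed with a small retry wrapper around each remote command. A hedged sketch, assuming the skill's steps are plain shell invocations; `run_with_retry` and its retry policy are illustrative, not part of the skill:

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, delay=2.0):
    """Run a monitoring command (e.g. an ssh or screen invocation),
    retrying on non-zero exit so a transient SSH failure does not
    silently abort the workflow."""
    last = None
    for i in range(attempts):
        last = subprocess.run(cmd, capture_output=True, text=True)
        if last.returncode == 0:
            return last.stdout
        if i < attempts - 1:
            time.sleep(delay)
    raise RuntimeError(
        f"command failed after {attempts} attempts: {last.stderr.strip()}"
    )
```

Each workflow step then either returns output or fails loudly with the captured stderr, which is exactly the kind of checkpoint the review says is absent.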

Validation: 81%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 9 / 11 Passed

Validation for skill structure

| Criteria | Description | Result |
| --- | --- | --- |
| allowed_tools_field | 'allowed-tools' contains unusual tool name(s) | Warning |
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |

Total: 9 / 11 (Passed)

Repository: wanshuiyin/Auto-claude-code-research-in-sleep (Reviewed)
