run-experiment

Deploy and run ML experiments on local or remote GPU servers. Use when user says "run experiment", "deploy to server", "跑实验", or needs to launch training jobs.

68

Quality

83%

Does it follow best practices?

Impact

No eval scenarios have been run

Security by Snyk

Risky

Do not use without reviewing

Quality

Discovery

89%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is a solid skill description that clearly communicates its purpose and includes explicit trigger guidance with natural user phrases. The main weakness is that the 'what' portion could be more specific about the concrete actions performed (e.g., environment setup, job monitoring, log retrieval). The multilingual trigger term adds useful coverage.

Suggestions

Expand the capability list with more specific actions, e.g., 'configure environments, submit training jobs, monitor GPU utilization, retrieve logs' to improve specificity.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Names the domain (ML experiments, GPU servers) and some actions (deploy, run, launch training jobs), but doesn't list comprehensive specific actions like configuring environments, monitoring jobs, managing checkpoints, etc. | 2 / 3 |
| Completeness | Clearly answers both 'what' (deploy and run ML experiments on local or remote GPU servers) and 'when' (explicit 'Use when' clause with specific trigger phrases including 'run experiment', 'deploy to server', '跑实验', and launching training jobs). | 3 / 3 |
| Trigger Term Quality | Includes strong natural trigger terms users would actually say: 'run experiment', 'deploy to server', '跑实验', 'launch training jobs'. The multilingual trigger term is a nice touch for broader coverage, and these are phrases users would naturally use. | 3 / 3 |
| Distinctiveness Conflict Risk | The combination of ML experiments, GPU servers, and training jobs creates a clear niche. The specific trigger terms like 'run experiment', 'deploy to server', and '跑实验' are distinct enough to avoid conflicts with general coding or deployment skills. | 3 / 3 |
| Total | | 11 / 12 |

Passed

Implementation

77%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a well-crafted, highly actionable skill with clear multi-step workflows, explicit validation checkpoints, and concrete executable commands for multiple deployment targets. Its main weakness is length — the skill tries to cover five deployment targets (local CUDA, local MPS, remote SSH, Vast.ai, Modal) plus W&B integration and Feishu notifications all in one file, which makes it longer than ideal. Some content could be trimmed or split into referenced sub-files for better progressive disclosure.

Suggestions

Split deployment target details (Vast.ai lifecycle, Modal configuration, W&B integration) into separate referenced files to reduce the main SKILL.md to a concise overview with pointers.

Trim the AGENTS.md example block — it's useful but could be shortened to just one or two target examples with a note that other targets follow the same pattern.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The skill is fairly long (~200 lines) and includes some sections that could be tightened (e.g., the AGENTS.md example block is extensive, the W&B integration step explains metrics Claude would know to log). However, most content is actionable commands rather than explanatory prose, so it's not egregiously verbose. | 2 / 3 |
| Actionability | The skill provides fully executable bash and Python commands throughout: SSH commands, rsync patterns, screen session creation, nvidia-smi queries, wandb integration code, and Modal deployment commands are all copy-paste ready with clear placeholder conventions. | 3 / 3 |
| Workflow Clarity | The workflow is clearly sequenced (Steps 1-7) with explicit validation checkpoints: GPU availability check before assignment, launch verification via `screen -ls`, artifact copy verification before Vast.ai destruction, and explicit error-handling guidance (e.g., 'If any artifact copy fails, do not destroy the instance'). Feedback loops are present for risky operations. | 3 / 3 |
| Progressive Disclosure | The content is well-structured with clear headers and conditional sections (e.g., 'Remote Only', 'when wandb: true'), but everything is in a single monolithic file with no references to supporting documents. The W&B integration details, Vast.ai lifecycle management, and Modal configuration could be split into separate reference files to keep the main skill leaner. | 2 / 3 |
| Total | | 10 / 12 |

Passed
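
The checkpoint quoted under Workflow Clarity ('If any artifact copy fails, do not destroy the instance') can be illustrated with a small guard. The following is a minimal sketch, not the skill's actual code: `copy_then_destroy`, `CopyResult`, `copy_fn`, and `destroy_fn` are hypothetical names introduced here, and the two callables stand in for whatever copy and teardown commands the skill really runs.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class CopyResult:
    """Outcome of copying one artifact (hypothetical helper type)."""
    artifact: str
    ok: bool


def copy_then_destroy(
    artifacts: Iterable[str],
    copy_fn: Callable[[str], bool],
    destroy_fn: Callable[[], None],
) -> List[CopyResult]:
    """Try to copy every artifact, then destroy only if all copies succeeded.

    copy_fn and destroy_fn are assumptions for illustration: copy_fn returns
    True on success, destroy_fn tears down the remote instance.
    """
    results: List[CopyResult] = []
    for artifact in artifacts:
        try:
            ok = bool(copy_fn(artifact))
        except Exception:
            # Treat any raised error as a failed copy rather than aborting,
            # so the caller gets a full report of what was and was not saved.
            ok = False
        results.append(CopyResult(artifact, ok))

    # Conservative choice: an empty artifact list also blocks destruction,
    # since nothing has been confirmed as saved.
    if results and all(r.ok for r in results):
        destroy_fn()
    return results
```

In practice `copy_fn` might wrap an rsync or scp call and report whether its exit code was zero, and `destroy_fn` might wrap the provider's teardown command. Returning the per-artifact results lets the agent tell the user exactly which copy failed before asking how to proceed.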

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 11 / 11 Passed

Validation for skill structure

No warnings or errors.

Repository
wanshuiyin/Auto-claude-code-research-in-sleep
Reviewed
