CtrlK
BlogDocsLog inGet started
Tessl Logo

autolab-managed-experiment

Run one Autolab benchmark experiment safely on Hugging Face Jobs. Use when a planner, reviewer, or experiment worker is preparing, auditing, launching, or reviewing a single train.py hypothesis against the current local promoted master.

87

1.94x
Quality

83%

Does it follow best practices?

Impact

99%

1.94x

Average score across 3 eval scenarios

SecuritybySnyk

Passed

No known issues

SKILL.md
Quality
Evals
Security

Quality

Content

92%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a strong, well-crafted skill that provides clear, actionable guidance for running a single Autolab benchmark experiment. The workflow is well-sequenced with validation checkpoints, and the guardrails are specific and concrete rather than vague. The only notable weakness is the lack of references to supporting documentation for the various scripts and concepts mentioned, though the content is compact enough that this is a minor concern.

DimensionReasoningScore

Conciseness

The content is lean and efficient. It assumes Claude understands the tooling context, avoids explaining what scripts do internally, and every line serves a purpose—either a command, a constraint, or a decision rule.

3 / 3

Actionability

Every workflow step includes specific, copy-paste-ready commands with exact script paths and flags. The guardrails are concrete and specific (e.g., naming exact files like `train_orig.py`, `master.json`, `results.tsv`), and fast checks provide executable commands for verification.

3 / 3

Workflow Clarity

The 7-step workflow is clearly sequenced with explicit validation checkpoints: preflight before launch (step 3), log parsing after run (step 5), and a conditional promotion gate (step 7). The guardrails section adds feedback loops—stop if preflight reports multiple hypotheses, stop if workspace is stale—providing clear error recovery guidance.

3 / 3

Progressive Disclosure

The content is well-organized with clear sections (Workflow, Guardrails, Fast Checks), but everything is inline in a single file with no references to supporting documentation. For a skill of this complexity with multiple scripts and concepts (refresh_master, preflight, promotion logic), some references to deeper docs would improve navigation. However, the content length is moderate enough that this is a minor issue.

2 / 3

Total

11

/

12

Passed

Description

75%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description has strong completeness with an explicit 'Use when' clause and excellent distinctiveness due to its narrow domain focus. However, the actions described are somewhat generic (preparing, auditing, launching, reviewing) rather than listing concrete technical operations, and the trigger terms lean heavily on domain jargon that may not match how users naturally phrase requests.

Suggestions

Replace generic verbs with more concrete actions, e.g., 'Submits a train.py training job to Hugging Face Jobs, monitors execution, and compares results against the promoted master baseline'.

Add natural language trigger variations users might say, such as 'run experiment', 'submit training job', 'benchmark run', or 'test hypothesis'.

DimensionReasoningScore

Specificity

The description names a specific domain (Autolab benchmark experiment on Hugging Face Jobs) and mentions actions like preparing, auditing, launching, and reviewing, but these are somewhat generic verbs rather than concrete technical actions like 'submit training job', 'parse logs', or 'compare metrics'.

2 / 3

Completeness

Clearly answers both 'what' (run one Autolab benchmark experiment safely on Hugging Face Jobs) and 'when' (explicit 'Use when' clause covering preparing, auditing, launching, or reviewing a single train.py hypothesis against the current local promoted master).

3 / 3

Trigger Term Quality

Includes some relevant keywords like 'Autolab', 'benchmark', 'Hugging Face Jobs', 'train.py', and 'promoted master', but these are fairly domain-specific jargon. Terms like 'planner', 'reviewer', 'experiment worker' are role-based rather than natural user language. Missing common variations a user might say like 'run experiment', 'training job', 'HF jobs'.

2 / 3

Distinctiveness Conflict Risk

Highly specific niche combining Autolab, Hugging Face Jobs, single benchmark experiment, and train.py hypothesis against promoted master. Very unlikely to conflict with other skills due to the narrow, well-defined scope.

3 / 3

Total

10

/

12

Passed

Validation

72%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation8 / 11 Passed

Validation for skill structure

CriteriaDescriptionResult

metadata_version

'metadata.version' is missing

Warning

metadata_field

'metadata' should map string keys to string values

Warning

frontmatter_unknown_keys

Unknown frontmatter key(s) found; consider removing or moving to metadata

Warning

Total

8

/

11

Passed

Repository
huggingface/context-course
Reviewed

Table of Contents

Is this your skill?

If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.