autolab-managed-experiment

Run one Autolab benchmark experiment safely on Hugging Face Jobs. Use when a planner, reviewer, or experiment worker is preparing, auditing, launching, or reviewing a single train.py hypothesis against the current local promoted master.

1.94x

Quality

83%

Does it follow best practices?

Impact

99%

1.94x

Average score across 3 eval scenarios

Securityby

Passed

No known issues

Quality

Content

92%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a strong, well-crafted skill that provides clear, actionable guidance for running a single Autolab benchmark experiment. The workflow is well-sequenced with validation checkpoints, and the guardrails are specific and concrete rather than vague. The only notable weakness is the lack of references to supporting documentation for the various scripts and concepts mentioned, though the content is compact enough that this is a minor concern.

Dimension	Reasoning	Score
Conciseness	The content is lean and efficient. It assumes Claude understands the tooling context, avoids explaining what scripts do internally, and every line serves a purpose—either a command, a constraint, or a decision rule.	3 / 3
Actionability	Every workflow step includes specific, copy-paste-ready commands with exact script paths and flags. The guardrails are concrete and specific (e.g., naming exact files like `train_orig.py`, `master.json`, `results.tsv`), and fast checks provide executable commands for verification.	3 / 3
Workflow Clarity	The 7-step workflow is clearly sequenced with explicit validation checkpoints: preflight before launch (step 3), log parsing after run (step 5), and a conditional promotion gate (step 7). The guardrails section adds feedback loops—stop if preflight reports multiple hypotheses, stop if workspace is stale—providing clear error recovery guidance.	3 / 3
Progressive Disclosure	The content is well-organized with clear sections (Workflow, Guardrails, Fast Checks), but everything is inline in a single file with no references to supporting documentation. For a skill of this complexity with multiple scripts and concepts (refresh_master, preflight, promotion logic), some references to deeper docs would improve navigation. However, the content length is moderate enough that this is a minor issue.	2 / 3
	Total	11 / 12 Passed

Description

75%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description has strong completeness with an explicit 'Use when' clause and excellent distinctiveness due to its narrow domain focus. However, the actions described are somewhat generic (preparing, auditing, launching, reviewing) rather than listing concrete technical operations, and the trigger terms lean heavily on domain jargon that may not match how users naturally phrase requests.

Suggestions

Replace generic verbs with more concrete actions, e.g., 'Submits a train.py training job to Hugging Face Jobs, monitors execution, and compares results against the promoted master baseline'.

Add natural language trigger variations users might say, such as 'run experiment', 'submit training job', 'benchmark run', or 'test hypothesis'.

Dimension	Reasoning	Score
Specificity	The description names a specific domain (Autolab benchmark experiment on Hugging Face Jobs) and mentions actions like preparing, auditing, launching, and reviewing, but these are somewhat generic verbs rather than concrete technical actions like 'submit training job', 'parse logs', or 'compare metrics'.	2 / 3
Completeness	Clearly answers both 'what' (run one Autolab benchmark experiment safely on Hugging Face Jobs) and 'when' (explicit 'Use when' clause covering preparing, auditing, launching, or reviewing a single train.py hypothesis against the current local promoted master).	3 / 3
Trigger Term Quality	Includes some relevant keywords like 'Autolab', 'benchmark', 'Hugging Face Jobs', 'train.py', and 'promoted master', but these are fairly domain-specific jargon. Terms like 'planner', 'reviewer', 'experiment worker' are role-based rather than natural user language. Missing common variations a user might say like 'run experiment', 'training job', 'HF jobs'.	2 / 3
Distinctiveness Conflict Risk	Highly specific niche combining Autolab, Hugging Face Jobs, single benchmark experiment, and train.py hypothesis against promoted master. Very unlikely to conflict with other skills due to the narrow, well-defined scope.	3 / 3
	Total	10 / 12 Passed

Validation

72%

Warnings & errors only

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation — 8 / 11 Passed

Validation for skill structure

Criteria	Description	Result
metadata_version	'metadata.version' is missing	Warning
metadata_field	'metadata' should map string keys to string values	Warning
frontmatter_unknown_keys	Unknown frontmatter key(s) found; consider removing or moving to metadata	Warning

	Total	8 / 11 Passed

Repository: huggingface/context-course
Commit: 0448a7c

Reviewed: 26 days ago

Table of Contents

Discovery Implementation Validation

Is this your skill?

If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.