autolab-managed-experiment

Run one Autolab benchmark experiment safely on Hugging Face Jobs. Use when a planner, reviewer, or experiment worker is preparing, auditing, launching, or reviewing a single train.py hypothesis against the current local promoted master.

Quality: 83% (Does it follow best practices?)

Impact: 99%, 1.94x (Average score across 3 eval scenarios)

Security: Passed. No known issues.

Evaluation results

98% (-2%)

Benchmark Experiment: Learning Rate Schedule Hypothesis

Full experiment workflow script

Criteria (without context, with context)

Source credentials: 100%
Refresh master with dag: 100%
Only train.py edited: 100%, 75%
Preflight before launch: 100%
Launch mode experiment: 100%
Log output to /tmp: 100%
Parse metric from log: 100%
Record run with submit_patch: 100%
Promotion condition val_bpb: 100%
Correct step ordering: 100%
Exactly one launch call: 100%, 80%

Experiment Readiness Audit

Guardrails and pre-launch safety checks

Criteria (without context, with context)

Correct freshness baseline: 100%
Ignore git main: 10%, 100%
Stop on multiple hypothesis categories: 91%, 100%
Stale workspace, stop and rewrite: 100%
Prepare mode prohibited: 100%
Credentials sourced in script: 100%
Refresh master in script: 100%
Preflight in script: 100%
Script stops before launch: 100%
train_orig.py named as base: 100%, 66%

Experiment Status Check and Launch Decision Script

Duplicate detection and programmatic preflight

Criteria (without context, with context)

trackio_reporter used: 100%
max-jobs flag present: 100%
Preflight json flag used: 100%
JSON output captured: 50%, 100%
Duplicate check purpose: 41%, 100%
Preflight json purpose: 50%, 100%
No duplicate launch rule: 100%
uv run used for scripts: 100%
Non-zero exit on duplicate: 100%
Single-experiment scope: 100%

Repository: huggingface/context-course
Commit: 0448a7c

Evaluated: 26 days ago
Agent: Claude Code
Model: Claude Sonnet 4.6

Table of Contents

Benchmark Experiment: Learning Rate Schedule Hypothesis
Experiment Readiness Audit
Experiment Status Check and Launch Decision Script
