autolab-managed-experiment

Run one Autolab benchmark experiment safely on Hugging Face Jobs. Use when a planner, reviewer, or experiment worker is preparing, auditing, launching, or reviewing a single train.py hypothesis against the current local promoted master.

87

1.94x
Quality: 83% (Does it follow best practices?)

Impact: 99%, 1.94x (Average score across 3 eval scenarios)

Security by Snyk: Passed, no known issues

Evaluation results

Benchmark Experiment: Learning Rate Schedule Hypothesis

Full experiment workflow script

Score: 98% (change: -2%)

| Criteria | Without context | With context |
| --- | --- | --- |
| Source credentials | 100% | 100% |
| Refresh master with dag | 100% | 100% |
| Only train.py edited | 100% | 75% |
| Preflight before launch | 100% | 100% |
| Launch mode experiment | 100% | 100% |
| Log output to /tmp | 100% | 100% |
| Parse metric from log | 100% | 100% |
| Record run with submit_patch | 100% | 100% |
| Promotion condition val_bpb | 100% | 100% |
| Correct step ordering | 100% | 100% |
| Exactly one launch call | 100% | 100% |

Experiment Readiness Audit

Guardrails and pre-launch safety checks

Score: 100% (change: 80%)

| Criteria | Without context | With context |
| --- | --- | --- |
| Correct freshness baseline | 0% | 100% |
| Ignore git main | 10% | 100% |
| Stop on multiple hypothesis categories | 91% | 100% |
| Stale workspace: stop and rewrite | 0% | 100% |
| Prepare mode prohibited | 0% | 100% |
| Credentials sourced in script | 0% | 100% |
| Refresh master in script | 0% | 100% |
| Preflight in script | 0% | 100% |
| Script stops before launch | 100% | 100% |
| train_orig.py named as base | 0% | 100% |
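
As a rough cross-check, the score and change reported for the Experiment Readiness Audit appear consistent with an unweighted mean over its criteria. The sketch below assumes equal weighting per criterion, which the page does not state, and the variable names are illustrative only. The dictionary copies the per-criterion values listed for this scenario.

```python
# Per-criterion scores for the "Experiment Readiness Audit" scenario,
# copied from this page as (without_context, with_context) percentages.
scores = {
    "Correct freshness baseline": (0, 100),
    "Ignore git main": (10, 100),
    "Stop on multiple hypothesis categories": (91, 100),
    "Stale workspace: stop and rewrite": (0, 100),
    "Prepare mode prohibited": (0, 100),
    "Credentials sourced in script": (0, 100),
    "Refresh master in script": (0, 100),
    "Preflight in script": (0, 100),
    "Script stops before launch": (100, 100),
    "train_orig.py named as base": (0, 100),
}


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Assumption: every criterion carries the same weight.
without_avg = mean(w for w, _ in scores.values())
with_avg = mean(c for _, c in scores.values())
delta = with_avg - without_avg

print(f"without context: {without_avg:.1f}%")  # 20.1%
print(f"with context:    {with_avg:.1f}%")     # 100.0%
print(f"change:          {delta:+.1f} points") # +79.9 points
```

Rounded, these values match the 100% score and 80% change shown for this scenario. Equal weighting may not hold for every scenario on the page, so treat this as a sanity check and not as the scoring method.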

Experiment Status Check and Launch Decision Script

Duplicate detection and programmatic preflight

Score: 100% (change: 66%)

| Criteria | Without context | With context |
| --- | --- | --- |
| trackio_reporter used | 0% | 100% |
| max-jobs flag present | 0% | 100% |
| Preflight json flag used | 0% | 100% |
| JSON output captured | 50% | 100% |
| Duplicate check purpose | 41% | 100% |
| Preflight json purpose | 50% | 100% |
| No duplicate launch rule | 100% | 100% |
| uv run used for scripts | 0% | 100% |
| Non-zero exit on duplicate | 100% | 100% |
| Single-experiment scope | 100% | 100% |
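
The headline Impact figures (99% and 1.94x) can be reproduced from the three scenario scores on this page under one plausible reading: the 99% is the mean score with context, and the multiplier is that mean divided by the mean score without context. The sketch below is only that reading, not a documented formula. It assumes each scenario's without-context score equals its reported score minus its reported change.

```python
# Scenario figures from this page: (score, change), both in percentage points.
scenarios = {
    "Benchmark Experiment: Learning Rate Schedule Hypothesis": (98, -2),
    "Experiment Readiness Audit": (100, 80),
    "Experiment Status Check and Launch Decision Script": (100, 66),
}

with_scores = [score for score, _ in scenarios.values()]
# Assumption: without-context score = score minus the reported change.
without_scores = [score - change for score, change in scenarios.values()]

with_avg = sum(with_scores) / len(with_scores)
without_avg = sum(without_scores) / len(without_scores)
multiplier = with_avg / without_avg

print(f"average with context:    {with_avg:.0f}%")    # 99%
print(f"average without context: {without_avg:.0f}%") # 51%
print(f"multiplier:              {multiplier:.2f}x")  # 1.94x
```

Under these assumptions the output agrees with the Impact summary near the top of the page.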

Repository: huggingface/context-course

Evaluated
Agent: Claude Code
Model: Claude Sonnet 4.6
