Workflow 1.5: Bridge between idea discovery and auto review. Reads EXPERIMENT_PLAN.md, implements experiment code, deploys to GPU, collects initial results. Use when user says "实现实验", "implement experiments", "bridge", "从计划到跑实验", "deploy the plan", or has an experiment plan ready to execute.
Implement and deploy experiments from plan: $ARGUMENTS
This skill bridges Workflow 1 (idea discovery + method refinement) and Workflow 2 (auto review loop). It takes the experiment plan and turns it into running experiments with initial results.
Workflow 1 output:                    This skill:                         Workflow 2 input:
refine-logs/EXPERIMENT_PLAN.md     →  implement → deploy → collect    →   initial results ready
refine-logs/EXPERIMENT_TRACKER.md     code + /run-experiment              for /auto-review-loop
refine-logs/FINAL_PROPOSAL.md

Parameters:
- AUTO_DEPLOY — set to false to review code before deploying.
- CODE_REVIEW — set to false to skip the pre-deployment code review.
- COMPACT — set to true to prefer idea-stage/IDEA_CANDIDATES.md over the full idea-stage/IDEA_REPORT.md and to append completed runs to EXPERIMENT_LOG.md.
- BASE_REPO — optional repository to clone before implementing.

Override:
/experiment-bridge "EXPERIMENT_PLAN.md" — compact: true, base repo: https://github.com/org/project
This skill expects one or more of:
- refine-logs/EXPERIMENT_PLAN.md (best) — claim-driven experiment roadmap from /experiment-plan
- refine-logs/EXPERIMENT_TRACKER.md — run-by-run execution table
- refine-logs/FINAL_PROPOSAL.md — method description for implementation context
- idea-stage/IDEA_CANDIDATES.md — compact idea summary (preferred when COMPACT = true; fall back to ./IDEA_CANDIDATES.md if not found)
- idea-stage/IDEA_REPORT.md — fallback if refine-logs don't exist (fall back to ./IDEA_REPORT.md if not found)

If none exist, ask the user what experiments to implement.
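A minimal shell sketch of checking which of these inputs exist, including the fallback paths — the loop is illustrative, not part of the skill's interface:

```bash
# Hypothetical input check — mirrors the preference order listed above
for f in refine-logs/EXPERIMENT_PLAN.md \
         refine-logs/EXPERIMENT_TRACKER.md \
         refine-logs/FINAL_PROPOSAL.md \
         idea-stage/IDEA_CANDIDATES.md ./IDEA_CANDIDATES.md \
         idea-stage/IDEA_REPORT.md ./IDEA_REPORT.md; do
  if [ -f "$f" ]; then
    echo "Found input: $f"   # the skill may draw on several of these at once
  fi
done
```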
Read EXPERIMENT_PLAN.md and extract the milestones, the must-run and nice-to-have experiments, and the estimated GPU-hours. Read FINAL_PROPOSAL.md for what exactly to implement. Present a brief summary:
📋 Experiment plan loaded:
- Milestones: [N] (sanity → baseline → main → ablation)
- Must-run experiments: [N]
- Nice-to-have: [N]
- Estimated GPU-hours: [X]
Proceeding to implementation.

If BASE_REPO is set, clone the repo first:
git clone <BASE_REPO> base_repo/

For each milestone (in order), write the experiment scripts:
Check existing code — scan the project (or cloned base_repo/) for existing experiment scripts, model code, and data loaders. Reuse as much as possible.
Implement missing pieces:
Follow the plan's run order — implement sanity-stage experiments first, then baselines, then main method, then ablations.
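A minimal driver sketch of that run order, assuming per-milestone wrapper scripts — the scripts/m0_sanity.sh naming is hypothetical, not part of the plan format:

```bash
#!/usr/bin/env bash
# Hypothetical milestone driver — enforces sanity → baseline → main → ablation order
set -euo pipefail

for stage in m0_sanity m1_baseline m2_main m3_ablation; do
  echo "=== Running milestone: $stage ==="
  bash "scripts/${stage}.sh"   # each script wraps its own /run-experiment calls
done
```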
Self-review before deploying:
Skip this step if CODE_REVIEW is false.
Before deploying, send the experiment code to a secondary Codex reviewer with xhigh reasoning:
spawn_agent:
reasoning_effort: xhigh
message: |
Review the following experiment implementation for correctness.
## Experiment Plan
[paste key sections from EXPERIMENT_PLAN.md]
## Method Description
[paste from FINAL_PROPOSAL.md]
## Implementation
[paste the experiment scripts or exact file paths plus relevant snippets]
Check for:
1. Does the code correctly implement the method described in the proposal?
2. Are all hyperparameters from the plan reflected in the code?
3. Are there logic bugs: wrong loss, wrong data split, missing eval, leakage, metric mismatch?
4. Is the evaluation metric computed against ground truth, not another model's output?
5. Are seeds, result paths, logging, and failure handling sufficient for reproducible experiments?
Output:
- BLOCKING issues that must be fixed before deployment
- NON-BLOCKING issues that can wait
- Suggested patches or checks

If BLOCKING issues are found, fix them and re-run this review once before Phase 3. Save the reviewer response and any fixes in refine-logs/EXPERIMENT_CODE_REVIEW.md. If reviewer delegation is unavailable, run the same checklist locally and mark the review [local-only].
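A minimal sketch of persisting the review artifact, assuming the reviewer's reply has been captured in a shell variable REVIEW — both the variable and the section headings are illustrative:

```bash
# Hypothetical: write the review record the step above requires
mkdir -p refine-logs
cat << EOF > refine-logs/EXPERIMENT_CODE_REVIEW.md
# Experiment Code Review — $(date -u +"%Y-%m-%d %H:%M UTC")

## Reviewer verdict
${REVIEW}

## Blocking fixes applied
- [describe each fix, or "none"]
EOF
```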
Before deploying the full experiment suite, run the sanity-stage experiment:
/run-experiment [sanity experiment command]

Wait for completion and verify the sanity run before proceeding:
If sanity fails → fix the code, re-run. Do not proceed to full deployment with broken code.
If the same sanity failure repeats, trigger a second opinion: summarize the plan, code diff, command, logs, backend, and failure, then ask a fresh Codex reviewer agent for a rescue diagnosis. Apply only concrete fixes grounded in the logs.
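A minimal gate sketch for this step, assuming the sanity run writes a log file — the logs/sanity.log path, the PASSED marker, and the NaN grep are illustrative checks, not a required interface:

```bash
# Hypothetical sanity gate — block full deployment on failure signals
SANITY_LOG="logs/sanity.log"

if ! grep -q "sanity: PASSED" "$SANITY_LOG"; then
  echo "Sanity run did not pass — fix the code and re-run." >&2
  exit 1
fi
if grep -qi "nan" "$SANITY_LOG"; then
  echo "NaN detected in sanity logs — investigate before deploying." >&2
  exit 1
fi
echo "Sanity gate passed — safe to deploy the full suite."
```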
Deploy experiments following the plan's milestone order. Route by job count and dependencies:
/run-experiment [experiment commands]

For large batches (≥10 jobs), multi-seed sweeps, or teacher→student phase dependencies, use the queue scheduler:
/experiment-queue [grid spec or manifest]

Auto-routing rule: if any milestone in EXPERIMENT_PLAN.md declares ≥10 jobs or declares phase dependencies, route that milestone to /experiment-queue; otherwise use /run-experiment. /experiment-queue adds OOM-aware retry with backoff, stale-screen cleanup, wave-transition race prevention, phase dependency enforcement, and crash-safe state persistence in queue_state.json.
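A minimal manifest sketch for the queue route, written via heredoc. The YAML field names (max_parallel, phases, depends_on) are illustrative — check the /experiment-queue docs for the real schema:

```bash
# Hypothetical queue manifest — a teacher→student phase dependency with a seed sweep
cat << 'EOF' > queue_manifest.yaml
max_parallel: 4
phases:
  - name: teacher
    jobs:
      - python train.py --config configs/teacher.yaml --seed {0,1,2,3,4}
  - name: student
    depends_on: teacher   # the queue enforces this ordering
    jobs:
      - python distill.py --config configs/student.yaml --seed {0,1,2,3,4}
EOF
# Then: /experiment-queue queue_manifest.yaml
```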
For each milestone:
- Cap parallelism (per /run-experiment, or max_parallel from the queue manifest for /experiment-queue)
- Use /monitor-experiment to track progress; if /experiment-queue is active, monitor queue_state.json

Backend lifecycle rules:
- If auto_destroy is configured, write the exact cleanup command before launch.

🚦 Checkpoint (if AUTO_DEPLOY = false):
🔧 Code implementation complete. Ready to deploy:
Milestone 0 (sanity): [status — passed/pending]
Milestone 1 (baseline): [N experiments, ~X GPU-hours]
Milestone 2 (main method): [N experiments, ~X GPU-hours]
Milestone 3 (ablations): [N experiments, ~X GPU-hours]
Total estimated: ~X GPU-hours on [N] GPUs
Deploy now? Or review the code first?

As experiments complete:
- Run /training-check to detect NaN, loss divergence, plateaus, or overfitting. If W&B is not configured, skip silently.
- Update refine-logs/EXPERIMENT_TRACKER.md — fill in the Status and Notes columns.
- Write refine-logs/EXPERIMENT_RESULTS.md:

# Initial Experiment Results
**Date**: [today]
**Plan**: refine-logs/EXPERIMENT_PLAN.md
## Results by Milestone
### M0: Sanity — PASSED
- [result]
### M1: Baselines
| Run | System | Key Metric | Status |
|-----|--------|-----------|--------|
| R001 | baseline_1 | X.XX | DONE |
### M2: Main Method
| Run | System | Key Metric | Status |
|-----|--------|-----------|--------|
| R003 | our_method | X.XX | DONE |
### M3: Ablations
...
## Summary
- [X/Y] must-run experiments completed
- Main result: [positive/negative/inconclusive]
- Ready for /auto-review-loop: [YES/NO]
## Next Step
→ /auto-review-loop "[topic]"

Skip this step entirely if COMPACT is false.
Append each completed experiment to EXPERIMENT_LOG.md:
## [Run ID] — [timestamp]
- **System**: [method name]
- **Config**: [key hyperparameters]
- **Result**: [primary metric = X.XX]
- **Verdict**: [positive / negative / inconclusive]
- **Reproduce**: `python train.py --config configs/run_id.yaml --seed 42`
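A minimal append sketch for the compact log, assuming the run's fields are held in shell variables — RUN_ID, SYSTEM, and the rest are illustrative names:

```bash
# Hypothetical: append one completed run to the compact log
cat << EOF >> EXPERIMENT_LOG.md

## ${RUN_ID} — $(date -u +"%Y-%m-%d %H:%M UTC")
- **System**: ${SYSTEM}
- **Config**: ${CONFIG_SUMMARY}
- **Result**: ${PRIMARY_METRIC}
- **Verdict**: ${VERDICT}
- **Reproduce**: \`${REPRO_CMD}\`
EOF
```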
After the main experiments (M2) complete with positive results, invoke /ablation-planner to design ablation studies:
- Pass it refine-logs/EXPERIMENT_PLAN.md and refine-logs/EXPERIMENT_TRACKER.md

If /ablation-planner is unavailable, skip silently.
Present final status:
🔬 Experiment bridge complete:
- Implemented: [N] experiment scripts
- Deployed: [N] experiments on [M] GPUs
- Completed: [X/Y] must-run, [A/B] nice-to-have
- Main result: [one sentence]
Results: refine-logs/EXPERIMENT_RESULTS.md
Tracker: refine-logs/EXPERIMENT_TRACKER.md
Ready for Workflow 2:
→ /auto-review-loop "[topic]"

Follow these shared protocols for all output files:
- Output Versioning Protocol — write timestamped file first, then copy to fixed name (see the sketch after this list)
- Output Manifest Protocol — log every output to MANIFEST.md
- Output Language Protocol — respect the project's language setting
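A minimal sketch of the versioning protocol, using the results file as the example — the timestamp format and the /tmp draft path are illustrative:

```bash
# Hypothetical: write the timestamped file first, then copy it to the fixed name
STAMP=$(date -u +%Y%m%d_%H%M%S)
VERSIONED="refine-logs/EXPERIMENT_RESULTS_${STAMP}.md"
cp /tmp/results_draft.md "$VERSIONED"          # stand-in for the actual file write
cp "$VERSIONED" refine-logs/EXPERIMENT_RESULTS.md
echo "- ${VERSIONED}" >> MANIFEST.md           # Output Manifest Protocol: log every output
```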
- Use shell heredocs (`cat << 'EOF' > file`) to write in chunks. Do NOT ask the user for permission — just do it silently.
- EXPERIMENT_TRACKER.md should reflect real status after each run completes.

/idea-discovery "direction" ← Workflow 1: find + refine + plan
/experiment-bridge ← you are here (Workflow 1.5: implement + deploy)
/auto-review-loop "topic" ← Workflow 2: review + iterate
/paper-writing "NARRATIVE_REPORT.md" ← Workflow 3: write the paper
Or use /research-pipeline for the full end-to-end flow (includes this bridge).