idea-creator

Generate and rank research ideas given a broad direction. Use when user says "找idea", "brainstorm ideas", "generate research ideas", "what can we work on", or wants to explore a research area for publishable directions.

Quality

—

Does it follow best practices?

Impact

—

No eval scenarios have been run

Securityby

Advisory

Suggest reviewing before use

Research Idea Creator

Generate publishable research ideas for: $ARGUMENTS

Overview

Given a broad research direction from the user, systematically generate, validate, and rank concrete research ideas. Standalone, Phase 1's landscape survey is inline (WebSearch — it does not invoke /research-lit); Phases 4-5 invoke /novelty-check, /run-experiment, and /monitor-experiment for validation and pilots. For the full sub-skill pipeline (/research-lit → idea generation → /novelty-check → /research-review), run /idea-discovery (Workflow 1), which orchestrates this skill.

Constants

PILOT_MAX_HOURS = 2 — Skip any pilot estimated to take > 2 hours per GPU. Flag as "needs manual pilot".
PILOT_TIMEOUT_HOURS = 3 — Hard timeout: kill pilots exceeding 3 hours. Collect partial results if available.
MAX_PILOT_IDEAS = 3 — Pilot at most 3 ideas in parallel. Additional ideas are validated on paper only.
MAX_TOTAL_GPU_HOURS = 8 — Total GPU budget for all pilots combined.
REVIEWER_MODEL = gpt-5.5 — Model used via a secondary Codex agent for brainstorming and review. Must be an OpenAI model (e.g., gpt-5.5, o3, gpt-4o).
REVIEWER_BACKEND = codex — Default: Codex xhigh reviewer through spawn_agent / send_input. Use --reviewer: oracle-pro only when explicitly requested; if Oracle is unavailable, warn and fall back to Codex xhigh.
OUTPUT_DIR = idea-stage/ — All idea-stage outputs go here. Create the directory if it doesn't exist.

💡 Override via argument, e.g., /idea-creator "topic" — pilot budget: 4h per idea, 20h total.

Workflow

Phase 0: Load Research Wiki (if active)

Skip this phase entirely if research-wiki/ does not exist.

Resolve the wiki helper using the Codex-side canonical chain (see ../shared-references/wiki-helper-resolution.md):

ARIS_REPO="${ARIS_REPO:-$(awk -F'\t' '$1=="repo_root"{print $2; exit}' .aris/installed-skills-codex.txt 2>/dev/null)}"
WIKI_SCRIPT=""
[ -n "$ARIS_REPO" ] && [ -f "$ARIS_REPO/tools/research_wiki.py" ] && WIKI_SCRIPT="$ARIS_REPO/tools/research_wiki.py"
[ -z "$WIKI_SCRIPT" ] && [ -f tools/research_wiki.py ] && WIKI_SCRIPT="tools/research_wiki.py"
[ -z "$WIKI_SCRIPT" ] && [ -f ~/.codex/skills/research-wiki/research_wiki.py ] && WIKI_SCRIPT="$HOME/.codex/skills/research-wiki/research_wiki.py"

If research-wiki/query_pack.md exists and is less than 7 days old, read it as initial landscape context:

treat listed gaps as priority search seeds
treat failed ideas as a banlist
treat top papers as known prior work
still run Phase 1 for papers from the last 3-6 months because the wiki may be stale

If research-wiki/ exists but query_pack.md is stale or missing, rebuild it only when WIKI_SCRIPT is available. If the helper is unavailable, continue without rebuilding and report that wiki refresh was skipped.

Phase 1: Landscape Survey (5-10 min)

Map the research area to understand what exists and where the gaps are.

Scan local paper library first: Check papers/ and literature/ in the project directory for existing PDFs. Read first 3 pages of relevant papers to build a baseline understanding before searching online. This avoids re-discovering what the user already knows.
Search recent literature using WebSearch:
- Top venues in the last 2 years (NeurIPS, ICML, ICLR, ACL, EMNLP, etc.)
- Recent arXiv preprints (last 6 months)
- Use 5+ different query formulations
- Read abstracts and introductions of the top 10-15 papers
Build a landscape map:
- Group papers by sub-direction / approach
- Identify what has been tried and what hasn't
- Note recurring limitations mentioned in "Future Work" sections
- Flag any open problems explicitly stated by multiple papers
Identify structural gaps:
- Methods that work in domain A but haven't been tried in domain B
- Contradictory findings between papers (opportunity for resolution)
- Assumptions that everyone makes but nobody has tested
- Scaling regimes that haven't been explored
- Diagnostic questions that nobody has asked

Phase 2: Idea Generation (brainstorm with external LLM)

Use a secondary Codex agent for divergent thinking:

spawn_agent:
  model: REVIEWER_MODEL
  reasoning_effort: xhigh
  message: |
    You are a senior ML researcher brainstorming research ideas.

    Research direction: [user's direction]

    Here is the current landscape:
    [paste landscape map from Phase 1]

    Key gaps identified:
    [paste gaps from Phase 1]

    Generate 8-12 concrete research ideas. For each idea:
    1. One-sentence summary
    2. Core hypothesis (what you expect to find and why)
    3. Minimum viable experiment (what's the cheapest way to test this?)
    4. Expected contribution type: empirical finding / new method / theoretical result / diagnostic
    5. Risk level: LOW (likely works) / MEDIUM (50-50) / HIGH (speculative)
    6. Estimated effort: days / weeks / months

    Prioritize ideas that are:
    - Testable with moderate compute (8x RTX 3090 or less)
    - Likely to produce a clear positive OR negative result (both are publishable)
    - Not "apply X to Y" unless the application reveals genuinely surprising insights
    - Differentiated from the 10-15 papers above

    Be creative but grounded. A great idea is one where the answer matters regardless of which way it goes.

Save the agent id for follow-up.

Save a Review Tracing record for this spawn_agent call following ../shared-references/review-tracing.md, including the landscape summary, prompt summary, raw idea list path, reviewer route, and saved agent id.

Phase 3: Mechanical consolidation + objective feasibility gate

This phase does NOT judge idea quality, novelty, or impact — those are the job of the Phase-4 cross-model reviewer (a different model family). Dropping ideas here on a same-family novelty or impact call would pre-filter the reviewer's input with same-family judgment — the opposite of why ARIS uses a cross-model reviewer at all. Phase 3 only (a) clusters near-duplicate ideas and (b) drops ideas that are OBJECTIVELY out of budget; everything else passes through ANNOTATED, not eliminated.

Objective feasibility gate (safe to gate here): drop an idea ONLY on a mechanical, budget-based fact — estimated compute > 1 week of available GPU time, OR a dataset that is provably unavailable. Do NOT drop on "implementation looks complex" — annotate complexity instead.
Novelty signal — ANNOTATE, do not eliminate: do 2-3 targeted searches and attach a prior_work note (what looks related, with links). This is input for the Phase-4 reviewer, not a filter; full /novelty-check runs in Phase 4. Do NOT drop an idea here because it "might already be done."
Impact signal — ANNOTATE, do not eliminate: attach a one-line so_what note (why the result would matter either way). Do NOT drop on a same-family "a reviewer wouldn't care" call — that is exactly what the Phase-4 cross-model reviewer is for.

Every feasible, non-duplicate idea — with its prior_work and so_what annotations — proceeds to Phase 4, where the cross-model reviewer does the quality/novelty narrowing.

Phase 4: Deep Validation (for top ideas)

For each surviving idea, run a deeper evaluation:

Novelty check: Use the /novelty-check workflow (multi-source search + GPT-5.5 cross-verification) for each idea

Critical review: Use GPT-5.5 via send_input (same agent):

send_input:
  target: [saved reviewer id from the earlier idea review]
  message: |
    Here are our top ideas after filtering:
    [paste surviving ideas with novelty check results]

    For each, play devil's advocate:
    - What's the strongest objection a reviewer would raise?
    - What's the most likely failure mode?
    - How would you rank these for a top venue submission?
    - Which 2-3 would you actually work on?

Combine rankings: Merge your assessment with GPT-5.5's ranking. Select top 2-3 ideas for pilot experiments.

Phase 5: Parallel Pilot Experiments (for top 2-3 ideas)

Before committing to a full research effort, run cheap pilot experiments to get empirical signal. This is the key differentiator from paper-only validation.

Design pilots: For each top idea, define the minimal experiment that would give a positive or negative signal:
- Single seed, small scale (e.g., small dataset subset, fewer epochs)
- Target: 30 min - PILOT_MAX_HOURS per pilot on 1 GPU
- Estimate GPU-hours BEFORE launching. If estimated time > PILOT_MAX_HOURS, reduce scale (fewer epochs, smaller subset) or flag as "needs manual pilot"
- Clear success metric defined upfront (e.g., "if metric improves by > 1%, signal is positive")
Deploy in parallel: Use /run-experiment to launch pilots on different GPUs simultaneously:
```
GPU 0: Pilot for Idea 1
GPU 1: Pilot for Idea 2
GPU 2: Pilot for Idea 3
```
Use run_in_background: true to launch all at once.
Collect results: Use /monitor-experiment to check progress. If any pilot exceeds PILOT_TIMEOUT_HOURS, kill it and collect partial results. Once all pilots complete (or timeout), compare:
- Which ideas showed positive signal?
- Which showed null/negative results? (eliminate or deprioritize)
- Any surprising findings that suggest a pivot?
- Total GPU-hours consumed (track against MAX_TOTAL_GPU_HOURS budget)
Re-rank based on empirical evidence: Update the idea ranking using pilot results. An idea with strong pilot signal jumps ahead of a theoretically appealing but untested idea.

Note: Skip this phase if the ideas are purely theoretical or if no GPU is available. Flag skipped ideas as "needs pilot validation" in the report.

Phase 6: Output — Ranked Idea Report

Write a structured report to idea-stage/IDEA_REPORT.md:

Lead every recommended idea with its method, in plain language. Before any hypothesis, novelty score, or claim, state in 2–4 concrete steps what we actually build / train / run — no jargon, no claim-IDs. The reader must understand what we do before what we claim; claims (hypothesis, validation, expected outcome) come after and read as the method's acceptance criteria.

# Research Idea Report

**Direction**: [user's research direction]
**Generated**: [date]
**Ideas evaluated**: X generated → Y survived filtering → Z piloted → W recommended

## Landscape Summary
[3-5 paragraphs on the current state of the field]

## Recommended Ideas (ranked)

### Idea 1: [title]
- **Method (what we actually do)**: [2–4 concrete steps in plain language — what we build / train / run. No jargon, no claim-IDs, no hypothesis yet. Lead with this so the reader grasps the approach first.]
- **Hypothesis**: [one sentence]
- **Minimum experiment**: [concrete description]
- **Expected outcome**: [what success/failure looks like]
- **Novelty**: X/10 — closest work: [paper]
- **Feasibility**: [compute, data, implementation estimates]
- **Risk**: LOW/MEDIUM/HIGH
- **Contribution type**: empirical / method / theory / diagnostic
- **Pilot result**: [POSITIVE: metric +X% / NEGATIVE: no signal / SKIPPED: needs GPU]
- **Reviewer's likely objection**: [strongest counterargument]
- **Why we should do this**: [1-2 sentences]

### Idea 2: [title]
...

## Eliminated Ideas (for reference)
| Idea | Reason eliminated |
|------|-------------------|
| ... | Already done by [paper] |
| ... | Requires > 1 week GPU time |
| ... | Result wouldn't be interesting either way |

## Pilot Experiment Results
| Idea | GPU | Time | Key Metric | Signal |
|------|-----|------|------------|--------|
| Idea 1 | GPU 0 | 45 min | +2.3% CE | POSITIVE |
| Idea 2 | GPU 1 | 30 min | -0.1% CE | NEGATIVE |
| Idea 3 | GPU 2 | 1.5 hr | +0.8% CE | WEAK POSITIVE |

## Suggested Execution Order
1. Start with Idea 1 (positive pilot signal, lowest risk)
2. Idea 3 as backup (weak signal, may need larger scale to confirm)
3. Idea 2 eliminated by pilot — negative result documented

## Next Steps
- [ ] Scale up Idea 1 to full experiment (multi-seed, full dataset)
- [ ] If confirmed, invoke /auto-review-loop for full iteration

Phase 7: Write Ideas to Research Wiki (if active)

Skip this phase entirely if research-wiki/ does not exist.

This is critical for spiral learning: without it, ideas/ stays empty and re-ideation has no memory.

The idea page is written by the deterministic upsert_idea helper — NOT freehand markdown — so every generation, including a re-run with updated constraints, records reliably (one helper call per idea, not a prose step the model can skip). upsert_idea writes the page, wires the inspired_by/addresses_gap edges, and rebuilds index + query_pack in a single call. Default skip-on-exist: a re-ideation run records NEW ideas without clobbering an existing idea whose outcome /result-to-claim may already have enriched. --outcome stays pending at creation (the experiment verdict is set later by /result-to-claim, never guessed here). If WIKI_SCRIPT is unavailable, the ideas are NOT recorded and a single WARN is reported (fix: install ARIS research_wiki.py).

if research-wiki/ exists AND WIKI_SCRIPT is available:
    for each recommended (stage proposed) and eliminated (stage archived) idea:
        python3 "$WIKI_SCRIPT" upsert_idea research-wiki/ --slug "<stable-idea-id>" \
             --title "<idea title>" --stage "<proposed|archived>" --outcome pending \
             --thesis "<core hypothesis / direction>" \
             --risks "<novelty / feasibility risks; why killed if eliminated>" \
             --based-on "<paper:slug,paper:slug2>" --target-gaps "<G2,G10>"
    log: "idea-creator wrote N ideas (M recommended, K eliminated)"
else if research-wiki/ exists AND WIKI_SCRIPT unavailable:
    report: ideas NOT recorded — ARIS research_wiki.py unreachable

Edge semantics (wired by upsert_idea itself): idea:<id> --inspired_by--> paper:<slug> and idea:<id> --addresses_gap--> gap:<id>.

Output Protocols

Follow these shared protocols for all output files:

Output Versioning Protocol — write timestamped file first, then copy to fixed name

Output Manifest Protocol — log every output to MANIFEST.md

Output Language Protocol — respect the project's language setting

Key Rules

Large file handling: If the Write tool fails due to file size, immediately retry using Bash (cat << 'EOF' > file) to write in chunks. Do NOT ask the user for permission — just do it silently.
The user provides a DIRECTION, not an idea. Your job is to generate the ideas.
Quantity first, quality second: brainstorm broadly, then filter ruthlessly.
A good negative result is just as publishable as a positive one. Prioritize ideas where the answer matters regardless of direction.
Don't fall in love with any idea before validating it. Be willing to kill ideas.
Always estimate compute cost. An idea that needs 1000 GPU-hours is not actionable for most researchers.
"Apply X to Y" is the lowest form of research idea. Push for deeper questions.
Include eliminated ideas in the report — they save future time by documenting dead ends.
If the user's direction is too broad (e.g., "NLP", "computer vision", "reinforcement learning"), STOP and ask them to narrow it. A good direction is 1-2 sentences specifying the problem, domain, and constraint — e.g., "factorized gap in discrete diffusion LMs" or "sample efficiency of offline RL with image observations". Without sufficient specificity, generated ideas will be too vague to run experiments on.

Composing with Other Skills

After this skill produces the ranked report:

/idea-creator "direction"     → ranked ideas
/novelty-check "top idea"     → deep novelty verification (already done in Phase 4, but user can re-run)
/research-review "top idea"   → external critical feedback
implement                     → write code
/run-experiment               → deploy to GPU
/auto-review-loop             → iterate until submission-ready

Review Tracing

After each spawn_agent or send_input reviewer call, save the trace following ../shared-references/review-tracing.md. Include the reviewer route, saved agent id, prompt summary, raw output path, selected ideas, and rejected ideas.

Repository: wanshuiyin/Auto-claude-code-research-in-sleep
Commit: fe5963c

Last updated: about 17 hours ago
Created: about 17 hours ago

Is this your skill?

If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.