
auto-paper-improvement-loop

Autonomously improve a generated paper via GPT-5.4 xhigh review → implement fixes → recompile, for 2 rounds. Use when the user says "改论文" ("revise the paper"), "improve paper", "论文润色循环" ("paper polishing loop"), "auto improve", or wants to iteratively polish a generated paper.

Auto Paper Improvement Loop: Review → Fix → Recompile

Autonomously improve the paper at: $ARGUMENTS

Context

This skill is designed to run after Workflow 3 (/paper-plan → /paper-figure → /paper-write → /paper-compile). It takes a compiled paper and iteratively improves it through external LLM review.

Unlike /auto-review-loop (which iterates on research — running experiments, collecting data, rewriting narrative), this skill iterates on paper writing quality — fixing theoretical inconsistencies, softening overclaims, adding missing content, and improving presentation.

Constants

  • MAX_ROUNDS = 2 — Two rounds of review→fix→recompile. Empirically, Round 1 catches structural issues (4→6/10), Round 2 catches remaining presentation issues (6→7/10). Diminishing returns beyond 2 rounds for writing-only improvements.
  • REVIEWER_MODEL = gpt-5.4 — Model used via a secondary Codex agent for paper review.
  • REVIEW_LOG = PAPER_IMPROVEMENT_LOG.md — Cumulative log of all rounds, stored in paper directory.
  • HUMAN_CHECKPOINT = false — When true, pause after each round's review and present score + weaknesses to the user. The user can approve fixes, provide custom modification instructions, skip specific fixes, or stop early. When false (default), runs fully autonomously.

💡 Override: /auto-paper-improvement-loop "paper/" — human checkpoint: true

Inputs

  1. Compiled paper: paper/main.pdf + LaTeX source files
  2. All section .tex files: concatenated for the review prompt

State Persistence (Compact Recovery)

If the context window fills up mid-loop, Codex auto-compacts. To recover, this skill writes PAPER_IMPROVEMENT_STATE.json after each round:

{
  "current_round": 1,
  "agent_id": "019ce736-...",
  "last_score": 6,
  "status": "in_progress",
  "timestamp": "2026-03-13T21:00:00"
}

On startup: if PAPER_IMPROVEMENT_STATE.json exists with "status": "in_progress" AND timestamp is within 24 hours, read it + PAPER_IMPROVEMENT_LOG.md to recover context, then resume from the next round. Otherwise (file absent, "status": "completed", or older than 24 hours), start fresh.

After each round: overwrite the state file. On completion: set "status": "completed".
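The startup rule above can be sketched as a small shell function. This is a minimal sketch, not the skill's actual implementation: the function name `should_resume` is hypothetical, it assumes the flat JSON layout shown earlier, and it relies on GNU `date -d` to parse the timestamp (BSD/macOS `date` needs different flags).

```bash
# Sketch only: decide whether to resume from PAPER_IMPROVEMENT_STATE.json.
# Assumes the flat JSON layout shown in this section and GNU date (date -d).

# should_resume STATE_FILE [NOW_EPOCH]
# Returns 0 (resume) only if the file exists, status is "in_progress",
# and the timestamp is at most 24 hours older than NOW_EPOCH.
should_resume() {
    local state_file="$1"
    local now="${2:-$(date +%s)}"
    local status ts ts_epoch

    [ -f "$state_file" ] || return 1

    # Pull simple string fields out with sed; a real JSON parser is more robust.
    status=$(sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$state_file")
    ts=$(sed -n 's/.*"timestamp"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$state_file")

    [ "$status" = "in_progress" ] || return 1
    ts_epoch=$(date -d "$ts" +%s 2>/dev/null) || return 1

    # 86400 seconds = 24 hours
    [ $(( now - ts_epoch )) -le 86400 ]
}
```

A caller might run `should_resume paper/PAPER_IMPROVEMENT_STATE.json` and start fresh whenever it returns non-zero. The state file location is an assumption here; the section above does not say which directory it is written to.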

Workflow

Step 0: Preserve Original

cp paper/main.pdf paper/main_round0_original.pdf

Step 1: Collect Paper Text

Concatenate all section files into a single text block for the review prompt:

# Collect all sections in glob (alphabetical) order
for f in paper/sections/*.tex; do
    echo "% === $(basename "$f") ==="
    cat "$f"
done > /tmp/paper_full_text.txt

Step 2: Round 1 Review

Send the full paper text to GPT-5.4 xhigh:

spawn_agent:
  model: gpt-5.4
  reasoning_effort: xhigh
  message: |
    You are reviewing a [VENUE] paper. Please provide a detailed, structured review.

    ## Full Paper Text:
    [paste concatenated sections]

    ## Review Instructions
    Please act as a senior ML reviewer ([VENUE] level). Provide:
    1. **Overall Score** (1-10, where 6 = weak accept, 7 = accept)
    2. **Summary** (2-3 sentences)
    3. **Strengths** (bullet list, ranked)
    4. **Weaknesses** (bullet list, ranked: CRITICAL > MAJOR > MINOR)
    5. **For each CRITICAL/MAJOR weakness**: A specific, actionable fix
    6. **Missing References** (if any)
    7. **Verdict**: Ready for submission? Yes / Almost / No

    Focus on: theoretical rigor, claims vs evidence alignment, writing clarity,
    self-containedness, notation consistency.

Save the agent id for Round 2.

Step 2b: Human Checkpoint (if enabled)

Skip if HUMAN_CHECKPOINT = false.

Present the review results and wait for user input:

📋 Round 1 review complete.

Score: X/10 — [verdict]
Key weaknesses (by severity):
1. [CRITICAL] ...
2. [MAJOR] ...
3. [MINOR] ...

Reply "go" to implement all fixes, give custom instructions, "skip 2" to skip specific fixes, or "stop" to end.

Parse the user's response the same way as /auto-review-loop: approve / custom instructions / skip / stop.
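As an illustration of the four outcomes, the reply could be classified roughly as follows. This is a hedged sketch based only on the prompt text above ("go", "skip 2", "stop", anything else as custom instructions). The function name and the action labels it prints are hypothetical, and /auto-review-loop may parse replies differently.

```bash
# Sketch: classify a checkpoint reply. Action names are illustrative only.
# parse_checkpoint_reply REPLY
# Prints "approve", "stop", "skip <fix numbers>", or "custom <original text>".
parse_checkpoint_reply() {
    local reply="$1" lower
    # Compare case-insensitively, but keep the original text for custom instructions.
    lower=$(printf '%s' "$reply" | tr '[:upper:]' '[:lower:]')
    case "$lower" in
        go)      echo "approve" ;;
        stop)    echo "stop" ;;
        skip\ *) echo "skip ${lower#skip }" ;;
        *)       echo "custom $reply" ;;
    esac
}
```

For example, `parse_checkpoint_reply "skip 2"` prints `skip 2`, and any free-form reply is passed through as custom instructions.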

Step 3: Implement Round 1 Fixes

Parse the review and implement fixes by severity:

Priority order:

  1. CRITICAL fixes (assumption mismatches, internal contradictions)
  2. MAJOR fixes (overclaims, missing content, notation issues)
  3. MINOR fixes (if time permits)

Common fix patterns:

| Issue | Fix Pattern |
|-------|-------------|
| Assumption-model mismatch | Rewrite assumption to match the model, add formal proposition bridging the gap |
| Overclaims | Soften language: "validate" → "demonstrate practical relevance", "comparable" → "qualitatively competitive" |
| Missing metrics | Add quantitative table with honest parameter counts and caveats |
| Theorem not self-contained | Add "Interpretation" paragraph listing all dependencies |
| Notation confusion | Rename conflicting symbols globally, add Notation paragraph |
| Missing references | Add to references.bib, cite in appropriate locations |
| Theory-practice gap | Explicitly frame theory as idealized; add synthetic validation subsection |

Step 4: Recompile Round 1

cd paper && latexmk -C && latexmk -pdf -interaction=nonstopmode -halt-on-error main.tex
cp main.pdf main_round1.pdf

Verify: 0 undefined references, 0 undefined citations.
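One way to automate this check is to count the relevant warnings in the LaTeX log. The sketch below assumes the standard warning wording ("Reference ... undefined", "Citation ... undefined", the latter also emitted by natbib). The function name `count_undefined` is hypothetical, and because TeX wraps long log lines, a warning split across lines could be missed.

```bash
# Sketch: count undefined references and citations in a LaTeX log.
# count_undefined LOG_FILE
# Prints "<undefined refs> <undefined citations>"; returns 0 only if both are 0.
count_undefined() {
    local log="$1" refs cites

    [ -f "$log" ] || { echo "missing log: $log" >&2; return 2; }

    # grep -c prints 0 (with exit status 1) when nothing matches, hence "|| true".
    refs=$(grep -c "Reference .* undefined" "$log" || true)
    cites=$(grep -c "Citation .* undefined" "$log" || true)

    echo "${refs:-0} ${cites:-0}"
    [ "${refs:-0}" -eq 0 ] && [ "${cites:-0}" -eq 0 ]
}
```

After the recompile, `count_undefined paper/main.log` would print `0 0` and succeed when the paper is clean.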

Step 5: Round 2 Review

Use send_input with the saved agent id:

send_input:
  id: [saved from Round 1]
  model: gpt-5.4
  reasoning_effort: xhigh
  message: |
    [Round 2 update]

    Since your last review, we have implemented:
    1. [Fix 1]: [description]
    2. [Fix 2]: [description]
    ...

    Please re-score and re-assess. Same format:
    Score, Summary, Strengths, Weaknesses, Actionable fixes, Verdict.

Step 5b: Human Checkpoint (if enabled)

Skip if HUMAN_CHECKPOINT = false. Same as Step 2b — present Round 2 review, wait for user input.

Step 6: Implement Round 2 Fixes

Same process as Step 3. Typical Round 2 fixes:

  • Add controlled synthetic experiments validating theory
  • Further soften any remaining overclaims
  • Formalize informal arguments (e.g., truncation → formal proposition)
  • Strengthen limitations section

Step 7: Recompile Round 2

cd paper && latexmk -C && latexmk -pdf -interaction=nonstopmode -halt-on-error main.tex
cp main.pdf main_round2.pdf

Step 8: Format Check

After the final recompilation, run a format compliance check:

# 1. Page count vs venue limit (pdfinfo reports total pages, not just main body)
PAGES=$(pdfinfo paper/main.pdf | awk '/^Pages:/ {print $2}')
echo "Pages: $PAGES (limit: 9 main body for ICLR/NeurIPS)"

# 2. Overfull hbox warnings (content exceeding margins)
# grep -c already prints 0 when nothing matches (exit status 1), so "|| echo 0"
# would emit a second 0; default the value only when the log is missing.
OVERFULL=$(grep -c "Overfull" paper/main.log 2>/dev/null || true)
echo "Overfull hbox warnings: ${OVERFULL:-0}"
grep "Overfull" paper/main.log 2>/dev/null | head -10

# 3. Underfull hbox warnings (loose spacing)
UNDERFULL=$(grep -c "Underfull" paper/main.log 2>/dev/null || true)
echo "Underfull hbox warnings: ${UNDERFULL:-0}"

# 4. Bad boxes summary
BADNESS=$(grep -c "badness" paper/main.log 2>/dev/null || true)
echo "Badness warnings: ${BADNESS:-0}"

Auto-fix patterns:

| Issue | Fix |
|-------|-----|
| Overfull hbox in equation | Wrap in \resizebox or break across lines with the split / aligned environments |
| Overfull hbox in table | Reduce font (\small / \footnotesize) or use \resizebox{\linewidth}{!}{...} |
| Overfull hbox in text | Rephrase sentence or add \allowbreak / \- hints |
| Over page limit | Move content to appendix, compress tables, reduce figure sizes |
| Underfull hbox (loose) | Rephrase for better line filling or add \looseness=-1 |

If any overfull hbox > 10pt is found, fix it and recompile before documenting.
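The 10pt threshold can be checked mechanically by reading the width that TeX reports in each warning, e.g. `Overfull \hbox (15.2pt too wide) in paragraph at lines 10--12`. The following is a minimal sketch under that assumption about the log format. The function name `list_large_overfull` is hypothetical, and the threshold argument defaults to the 10pt stated above.

```bash
# Sketch: list overfull hbox warnings wider than a threshold (default 10pt).
# list_large_overfull LOG_FILE [THRESHOLD_PT]
list_large_overfull() {
    local log="$1" limit="${2:-10}"
    awk -v limit="$limit" '
        /^Overfull \\hbox \(/ {
            width = $3             # third field looks like "(15.2pt"
            sub(/^\(/, "", width)  # strip the leading parenthesis
            sub(/pt$/, "", width)  # strip the trailing unit
            if (width + 0 > limit + 0) print
        }
    ' "$log"
}
```

If `list_large_overfull paper/main.log` prints anything, those lines are the warnings to fix before recompiling. An empty result means no overfull hbox exceeds the threshold.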

Step 9: Document Results

Create PAPER_IMPROVEMENT_LOG.md in the paper directory:

# Paper Improvement Log

## Score Progression

| Round | Score | Verdict | Key Changes |
|-------|-------|---------|-------------|
| Round 0 (original) | X/10 | No/Almost/Yes | Baseline |
| Round 1 | Y/10 | No/Almost/Yes | [summary of fixes] |
| Round 2 | Z/10 | No/Almost/Yes | [summary of fixes] |

## Round 1 Review & Fixes

<details>
<summary>GPT-5.4 xhigh Review (Round 1)</summary>

[Full raw review text, verbatim]

</details>

### Fixes Implemented
1. [Fix description]
2. [Fix description]
...

## Round 2 Review & Fixes

<details>
<summary>GPT-5.4 xhigh Review (Round 2)</summary>

[Full raw review text, verbatim]

</details>

### Fixes Implemented
1. [Fix description]
2. [Fix description]
...

## PDFs
- `main_round0_original.pdf` — Original generated paper
- `main_round1.pdf` — After Round 1 fixes
- `main_round2.pdf` — Final version after Round 2 fixes

Step 10: Summary

Report to user:

  • Score progression table
  • Number of CRITICAL/MAJOR/MINOR issues fixed per round
  • Final page count
  • Remaining issues (if any)

Feishu Notification (if configured)

After each round's review AND at final completion, check ~/.codex/feishu.json:

  • After each round: Send review_scored — "Round N: X/10 — [key changes]"
  • After final round: Send pipeline_done — score progression table + final page count
  • If config absent or mode "off": skip entirely (no-op)
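The skip rule in the last bullet could be expressed as a guard like the one below. This is only a sketch: the function name `feishu_enabled` is hypothetical, the config is assumed to be flat JSON with a string-valued "mode" key, and treating a missing "mode" key as enabled is an assumption the source does not confirm. How the notification itself is sent is outside this sketch.

```bash
# Sketch: decide whether Feishu notifications should be sent at all.
# feishu_enabled [CONFIG_FILE]
# Returns 0 when the config exists and its "mode" is not "off".
feishu_enabled() {
    local config="${1:-$HOME/.codex/feishu.json}" mode

    # Config absent: skip entirely (no-op).
    [ -f "$config" ] || return 1

    # Read the first "mode" string value; assumes simple, flat JSON.
    mode=$(sed -n 's/.*"mode"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$config" | head -n 1)

    [ "$mode" != "off" ]
}
```

A caller would wrap each notification in `if feishu_enabled; then ...; fi` so that an absent or disabled config costs nothing.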

Output

paper/
├── main_round0_original.pdf    # Original
├── main_round1.pdf             # After Round 1
├── main_round2.pdf             # After Round 2 (final)
├── main.pdf                    # = main_round2.pdf
└── PAPER_IMPROVEMENT_LOG.md    # Full review log with scores

Key Rules

  • Large file handling: If the Write tool fails due to file size, immediately retry using Bash (cat << 'EOF' > file) to write in chunks. Do NOT ask the user for permission — just do it silently.

  • Preserve all PDF versions — user needs to compare progression

  • Save FULL raw review text — do not summarize or truncate GPT-5.4 responses

  • Use send_input for Round 2 to maintain conversation context

  • Always recompile after fixes — verify 0 errors before proceeding

  • Do not fabricate experimental results — synthetic validation must describe methodology, not invent numbers

  • Respect the paper's claims — soften overclaims rather than adding unsupported new claims

  • Global consistency — when renaming notation or softening claims, check ALL files (abstract, intro, method, experiments, theory sections, conclusion, tables, figure captions)

Typical Score Progression

Based on end-to-end testing on a 9-page ICLR 2026 theory paper:

| Round | Score | Key Improvements |
|-------|-------|------------------|
| Round 0 | 4/10 (content) | Baseline: assumption-model mismatch, overclaims, notation issues |
| Round 1 | 6/10 (content) | Fixed assumptions, softened claims, added interpretation, renamed notation |
| Round 2 | 7/10 (content) | Added synthetic validation, formal truncation proposition, stronger limitations |
| Round 3 | 5→8.5/10 (format) | Removed hero fig, appendix, compressed conclusion, fixed overfull hbox |

+4.5 points across 3 rounds (2 content + 1 format) is typical for a well-structured but rough first draft. Final: 8 pages main body, 0 overfull hbox, ICLR-compliant.

Repository
wanshuiyin/Auto-claude-code-research-in-sleep