Use when experiments complete, to judge which claims the results support, which they don't, and what evidence is still missing. A secondary Codex agent evaluates results against the intended claims and routes to the next action (pivot, supplement, or confirm). Run it after experiments finish — before writing the paper or running ablations.
Experiments produce numbers; this gate decides what those numbers mean. Collect results from available sources, get a secondary Codex judgment, then auto-route based on the verdict.
Gather experiment data from whatever sources are available in the project:
- `wandb.Api().run("<entity>/<project>/<run_id>").history()` — metrics, training curves, comparisons
- `ssh server "tail -100 /path/to/training.log"` if no other source is available
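For the wandb path, a minimal sketch — the entity, project, run ID, and metric key below are placeholders, not values from this project:

```python
import wandb

# All identifiers here are placeholders; substitute your real entity/project/run.
api = wandb.Api()
run = api.run("my-entity/my-project/abc123")

history = run.history()                 # pandas DataFrame of logged metrics over steps
print(history.tail())                   # eyeball the tail of the training curves
print(run.summary.get("val/accuracy"))  # summary value; the metric key is a guess
```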
Assemble the key information (intended claim, experiments run, results, baselines, known caveats), then send the collected results to a secondary Codex agent for objective evaluation:
```yaml
spawn_agent:
  reasoning_effort: xhigh
  message: |
    RESULT-TO-CLAIM EVALUATION

    I need you to judge whether experimental results support the intended claim.

    Intended claim: [the claim these experiments test]

    Experiments run:
    [list experiments with method, dataset, metrics]

    Results:
    [paste key numbers, comparison deltas, significance]

    Baselines:
    [baseline numbers and sources — reproduced or from paper]

    Known caveats:
    [any confounding factors, limited datasets, missing comparisons]

    Please evaluate:
    1. claim_supported: yes | partial | no
    2. what_results_support: what the data actually shows
    3. what_results_dont_support: where the data falls short of the claim
    4. missing_evidence: specific evidence gaps
    5. suggested_claim_revision: if the claim should be strengthened, weakened, or reframed
    6. next_experiments_needed: specific experiments to fill gaps (if any)
    7. confidence: high | medium | low

    Be honest. Do not inflate claims beyond what the data supports.
    A single positive result on one dataset does not support a general claim.
```

Extract the structured fields from the secondary Codex response:
- claim_supported: yes | partial | no
- what_results_support: "..."
- what_results_dont_support: "..."
- missing_evidence: "..."
- suggested_claim_revision: "..."
- next_experiments_needed: "..."
- confidence: high | medium | low
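The reviewer's reply is free text; assuming it echoes the fields as `key: value` lines (the field names come from the prompt above, but the parser itself is an illustrative sketch, not part of the skill):

```python
import re

FIELDS = (
    "claim_supported", "what_results_support", "what_results_dont_support",
    "missing_evidence", "suggested_claim_revision",
    "next_experiments_needed", "confidence",
)

def parse_verdict(response: str) -> dict:
    """Extract 'key: value' lines for the expected fields; missing keys stay None."""
    verdict = {f: None for f in FIELDS}
    for line in response.splitlines():
        # Tolerate an optional leading "1. " style number from the prompt template.
        m = re.match(r"\s*(?:\d+\.\s*)?(\w+):\s*(.+)", line)
        if m and m.group(1) in verdict:
            verdict[m.group(1)] = m.group(2).strip()
    return verdict
```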
Skip this step if EXPERIMENT_AUDIT.json does not exist.

```
if EXPERIMENT_AUDIT.json exists:
    read integrity_status from file
    attach to verdict output:
        integrity_status: pass | warn | fail
    if integrity_status == "fail":
        append to verdict: "[INTEGRITY CONCERN] — audit found issues, see EXPERIMENT_AUDIT.md"
        downgrade confidence to "low" regardless of Codex judgment
    if integrity_status == "warn":
        append to verdict: "[INTEGRITY: WARN] — audit flagged potential issues"
else:
    integrity_status = "unavailable"
    verdict is labeled "provisional — no integrity audit run"
    (this does NOT block anything — pipeline continues normally)
```

See shared-references/experiment-integrity.md for the full integrity protocol.
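The same logic as a Python sketch, assuming EXPERIMENT_AUDIT.json is a flat JSON object with an `integrity_status` key (an assumption; the audit file's real schema lives in the integrity protocol):

```python
import json
from pathlib import Path

def attach_integrity(verdict: dict, audit_path: str = "EXPERIMENT_AUDIT.json") -> dict:
    """Fold the integrity audit, if present, into the parsed verdict."""
    path = Path(audit_path)
    if not path.exists():
        verdict["integrity_status"] = "unavailable"
        verdict["label"] = "provisional -- no integrity audit run"
        return verdict  # does not block anything; pipeline continues normally

    status = json.loads(path.read_text())["integrity_status"]
    verdict["integrity_status"] = status
    if status == "fail":
        verdict["label"] = "[INTEGRITY CONCERN] -- audit found issues, see EXPERIMENT_AUDIT.md"
        verdict["confidence"] = "low"  # downgrade regardless of Codex judgment
    elif status == "warn":
        verdict["label"] = "[INTEGRITY: WARN] -- audit flagged potential issues"
    return verdict
```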
Route on the verdict:

| Verdict | Meaning | Next action |
| --- | --- | --- |
| yes | Claim supported | /ablation-planner |
| partial | Claim partially supported | partial on the same claim → record analysis in findings.md, consider whether to narrow the claim scope or switch ideas |
| no | Claim not supported | Record the outcome in AGENTS.md or project notes |
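A minimal dispatch sketch of that table; the action strings are illustrative, not a fixed API:

```python
def route(verdict: dict) -> str:
    """Map claim_supported to the next pipeline action."""
    actions = {
        "yes": "run /ablation-planner",
        "partial": "record analysis in findings.md; narrow the claim or switch ideas",
        "no": "record the outcome in AGENTS.md or project notes",
    }
    return actions[verdict["claim_supported"]]
```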
Skip this step entirely if research-wiki/ does not exist.

```
if research-wiki/ exists:
    # 1. Create experiment page
    Create research-wiki/experiments/<exp_id>.md with:
    - node_id: exp:<id>
    - idea_id: idea:<active_idea>
    - date, hardware, duration, metrics
    - verdict, confidence, reasoning summary

    # 2. Update claim status
    for each claim resolved by this verdict:
        if verdict == "yes":
            Update claim page: status → supported
            run the installed ARIS research_wiki.py helper to add a supports edge from "exp:<id>" to "claim:<cid>"
        elif verdict == "partial":
            Update claim page: status → partial
            run the installed ARIS research_wiki.py helper to add a partial-supports edge from "exp:<id>" to "claim:<cid>"
        else:
            Update claim page: status → invalidated
            run the installed ARIS research_wiki.py helper to add an invalidates edge from "exp:<id>" to "claim:<cid>"

    # 3. Update idea outcome
    Update research-wiki/ideas/<idea_id>.md:
    - outcome: positive | mixed | negative
    - If negative: fill "Failure / Risk Notes" and "Lessons Learned"
    - If positive: fill "Actual Outcome" and "Reusable Components"

    # 4. Rebuild + log
    rebuild the query pack with the installed ARIS research_wiki.py helper
    log "result-to-claim: exp:<id> verdict=<verdict> for idea:<idea_id>" with the installed ARIS research_wiki.py helper

    # 5. Re-ideation suggestion
    Count failed/partial ideas since last /idea-creator run.
    If >= 3: print "💡 3+ ideas tested since last ideation. Consider re-running /idea-creator — the wiki now knows what doesn't work."
```

If confidence is low, treat the judgment as inconclusive and add experiments rather than committing to a claim.

A result marked [pending external review] does not block the pipeline.

After the secondary Codex judgment, save a trace following ../shared-references/review-tracing.md. Write files directly to .aris/traces/result-to-claim/<date>_run<NN>/ and include the prompt, raw reviewer response, parsed verdict, routing action, and whether the result is [pending external review]. Respect the --- trace: parameter when present (default: full).
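A sketch of the trace write, assuming the directory layout above; the file names inside the run directory are illustrative:

```python
import json
from datetime import date
from pathlib import Path

def save_trace(prompt: str, raw_response: str, verdict: dict,
               action: str, pending_review: bool, run_no: int = 1) -> Path:
    """Persist one result-to-claim review under .aris/traces/."""
    trace_dir = Path(f".aris/traces/result-to-claim/{date.today()}_run{run_no:02d}")
    trace_dir.mkdir(parents=True, exist_ok=True)
    (trace_dir / "prompt.md").write_text(prompt)          # prompt sent to the reviewer
    (trace_dir / "response.md").write_text(raw_response)  # raw reviewer response
    (trace_dir / "verdict.json").write_text(json.dumps(
        {"verdict": verdict, "routing_action": action,
         "pending_external_review": pending_review}, indent=2))
    return trace_dir
```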