
result-to-claim

Use when experiments complete to judge what claims the results support, what they do not, and what evidence is still missing. A secondary Codex agent evaluates results against intended claims and routes to the next action (pivot, supplement, or confirm). Use after experiments finish, before writing the paper or running ablations.

Overall score: 80

Quality: 76% (Does it follow best practices?)

Impact: Pending (No eval scenarios have been run)

Security (by Snyk): Advisory. Review suggested before use.

Optimize this skill with Tessl

npx tessl skill review --optimize ./skills/skills-codex/result-to-claim/SKILL.md

Quality

Discovery: 75%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description effectively communicates when to use the skill with clear temporal positioning in the research workflow (after experiments, before paper writing). It has a distinct niche and good completeness. However, the specific capabilities could be more concrete (e.g., what metrics or methods are used to evaluate claims), and trigger term coverage could be broader to capture more natural user phrasings.

Suggestions

Add more concrete actions such as 'compares effect sizes against hypothesized thresholds', 'checks statistical significance', or 'identifies unsupported claims' to improve specificity.

Expand trigger terms to include natural variations like 'interpret results', 'evaluate findings', 'do my results support', 'experiment analysis', or 'what do my results mean'.
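Folding both suggestions into the skill's frontmatter might look like the following sketch. This is illustrative only, not the skill's actual metadata: the field names follow common SKILL.md conventions, and the wording simply combines the suggested actions and trigger phrases above.

```yaml
---
name: result-to-claim
description: >
  Use when experiments complete to interpret results and evaluate findings.
  Compares effect sizes against hypothesized thresholds, checks statistical
  significance, and identifies unsupported claims, then routes to the next
  action (pivot, supplement, or confirm). Triggers include "do my results
  support", "experiment analysis", and "what do my results mean". Use after
  experiments finish, before writing the paper or running ablations.
---
```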

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | The description names the domain (experiment evaluation) and some actions ('judge what claims the results support', 'routes to the next action: pivot, supplement, or confirm'), but the concrete actions are somewhat abstract rather than listing specific technical operations. | 2 / 3 |
| Completeness | Clearly answers both 'what' (a secondary Codex agent evaluates results against intended claims and routes to the next action) and 'when' (after experiments finish, before writing the paper or running ablations), with explicit temporal triggers and a 'Use when' clause. | 3 / 3 |
| Trigger Term Quality | Includes some relevant terms like 'experiments complete', 'results', 'claims', 'ablations', and 'writing the paper', but misses common variations users might say, such as 'evaluate results', 'interpret findings', 'statistical significance', or 'experiment analysis'. The terms are moderately natural but not comprehensive. | 2 / 3 |
| Distinctiveness / Conflict Risk | The description carves out a very specific niche (post-experiment claim evaluation with routing decisions: pivot, supplement, confirm) that is unlikely to conflict with other skills such as data analysis, paper writing, or experiment design. | 3 / 3 |
| **Total** | | **10 / 12** (Passed) |

Implementation: 77%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a well-structured, actionable skill that clearly defines a multi-step evaluation workflow with proper routing logic and feedback loops. Its main strengths are the concrete prompt template for the secondary agent and the explicit branching paths for each verdict. Minor weaknesses include some unnecessary contextual explanation and the lack of clear navigation to referenced external files and skills.

Suggestions

Trim the introductory text and 'When to Use' section—Claude can infer when to apply this skill from the workflow itself.

Add brief inline descriptions or links for referenced files (e.g., `IDEA_CANDIDATES.md`, `/ablation-planner`) so the user understands the broader pipeline without hunting.
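The skill's Parse step (extracting structured output fields from the secondary agent's reply) could be sketched as follows. This is a minimal sketch under stated assumptions: the JSON field names `verdict` and `missing_evidence`, and the verdict vocabulary `confirm` / `partial` / `pivot`, are inferred from the review and are not confirmed by the skill itself.

```python
import json

def parse_judgment(raw: str) -> dict:
    """Parse the secondary agent's reply into a structured judgment.

    Assumes the agent was prompted to answer in JSON with a 'verdict'
    field ('confirm' | 'partial' | 'pivot'). Falls back to a 'partial'
    verdict when the reply is not valid JSON or the verdict is unknown,
    so the pipeline errs toward gathering more evidence.
    """
    try:
        judgment = json.loads(raw)
    except json.JSONDecodeError:
        return {"verdict": "partial", "missing_evidence": ["unparseable reply"]}
    if judgment.get("verdict") not in {"confirm", "partial", "pivot"}:
        judgment["verdict"] = "partial"
    return judgment
```

Defaulting to `partial` on malformed output is one way to make the feedback loop safe: a garbled judge reply triggers supplementary evidence gathering rather than a premature confirm or pivot.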

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The skill is mostly efficient but includes some unnecessary elaboration. Phrases like 'Experiments produce numbers; this gate decides what those numbers mean' and the 'When to Use' section explain context Claude can infer. The workflow steps themselves are reasonably tight, but the overall document could be trimmed by ~20%. | 2 / 3 |
| Actionability | The skill provides concrete, actionable guidance throughout: specific API calls for W&B data collection, a complete prompt template for the secondary agent, structured output fields to extract, and explicit routing actions for each verdict outcome. The instructions are specific enough to execute directly. | 3 / 3 |
| Workflow Clarity | The four-step workflow is clearly sequenced (Collect → Judge → Parse → Route) with explicit branching logic for each verdict. The partial verdict path includes a feedback loop (re-run after supplementary experiments), and there is a clear escalation path for repeated partial verdicts. Validation is embedded via the external reviewer judgment step. | 3 / 3 |
| Progressive Disclosure | The skill references external files like `findings.md`, `IDEA_CANDIDATES.md`, `EXPERIMENT_LOG.md`, and `/ablation-planner` but doesn't provide clear navigation links or explain what those contain. The content is well-structured with headers but is somewhat long for a single file; the detailed prompt template and routing logic could potentially be split out. | 2 / 3 |
| **Total** | | **10 / 12** (Passed) |
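The Route step of the Collect → Judge → Parse → Route workflow, including the partial-verdict feedback loop and its escalation path, might be sketched like this. The action names and the threshold of two repeated partials are assumptions for illustration; the skill defines its own routing targets.

```python
def route(verdict: str, partial_count: int = 0) -> str:
    """Map a judged verdict to the next pipeline action.

    'confirm' -> proceed to writing up the finding
    'pivot'   -> return to idea selection
    'partial' -> run supplementary experiments and re-judge,
                 escalating to a human after repeated partials
    """
    if verdict == "confirm":
        return "write-paper"
    if verdict == "pivot":
        return "select-new-idea"
    if verdict == "partial":
        if partial_count >= 2:  # assumed escalation threshold
            return "escalate-to-human"
        return "run-supplementary-experiments"
    raise ValueError(f"unknown verdict: {verdict!r}")
```

Keeping the routing as a pure function of the verdict and a partial counter makes the branching logic the review praises easy to test in isolation.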

Validation: 90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 10 / 11 passed

Validation for skill structure

| Criteria | Description | Result |
| --- | --- | --- |
| allowed_tools_field | 'allowed-tools' contains unusual tool name(s) | Warning |
| **Total** | | **10 / 11** (Passed) |
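One way to clear the `allowed_tools_field` warning is to restrict the frontmatter to conventional tool names. A sketch, assuming standard Claude Code tool names; the right list depends on what the skill actually invokes, and this is not the skill's real frontmatter:

```yaml
---
name: result-to-claim
allowed-tools: Read, Write, Bash
---
```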

Repository: wanshuiyin/Auto-claude-code-research-in-sleep (Reviewed)
