
result-to-claim

Use when experiments complete to judge which claims the results support, which they don't, and what evidence is still missing. A secondary Codex agent evaluates results against intended claims and routes to the next action (pivot, supplement, or confirm). Use after experiments finish, before writing the paper or running ablations.

Overall score: 79

Quality: 76% (Does it follow best practices?)

Impact: Pending (no eval scenarios have been run)

Security (by Snyk): Passed (no known issues)

Optimize this skill with Tessl:

npx tessl skill review --optimize ./skills/skills-codex/result-to-claim/SKILL.md

Quality

Discovery: 75%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description is strong on completeness and distinctiveness, clearly articulating both when to use the skill and its unique niche in the experimental workflow. However, it could be more specific about the concrete actions performed and include more natural trigger terms that users might employ when seeking this functionality. The language is appropriately third-person and avoids fluff.

Suggestions

Add more concrete actions the skill performs, e.g., 'Generates a structured evidence report listing supported claims, unsupported claims, and gaps in evidence'

Include additional natural trigger terms users might say, such as 'analyze results', 'interpret findings', 'evaluate experiment outcomes', or 'check if results support hypothesis'
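Taken together, these suggestions could yield a frontmatter description along the following lines. This is a sketch only: the field layout follows the common name/description frontmatter convention, and the wording is illustrative rather than taken from the skill's actual SKILL.md.

```yaml
# Hypothetical revised SKILL.md frontmatter (illustrative wording).
name: result-to-claim
description: >
  Use when experiments complete to analyze results, interpret findings, and
  check whether results support the intended claims. A secondary Codex agent
  evaluates experiment outcomes against intended claims, generates a
  structured evidence report listing supported claims, unsupported claims,
  and gaps in evidence, then routes to the next action (pivot, supplement,
  or confirm). Use after experiments finish, before writing the paper or
  running ablations.
```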

Dimension scores:

Specificity (2 / 3): The description names the domain (experiment evaluation) and some actions (judge claims, evaluate results against intended claims, route to next action), but the actions are somewhat abstract — 'pivot, supplement, or confirm' are named but not fully concrete in terms of what the skill actually does mechanically.

Completeness (3 / 3): The description clearly answers both 'what' (a secondary Codex agent evaluates results against intended claims and routes to next action) and 'when' (after experiments finish, before writing the paper or running ablations), with explicit trigger guidance including 'Use when experiments complete'.

Trigger Term Quality (2 / 3): Includes some relevant terms like 'experiments complete', 'results', 'claims', 'ablations', and 'paper', but misses common natural variations a user might say, such as 'analyze results', 'interpret findings', 'evaluate outcomes', or 'results analysis'. The terms are somewhat niche and academic.

Distinctiveness / Conflict Risk (3 / 3): This skill occupies a very specific niche — post-experiment claim evaluation with routing decisions — that is unlikely to conflict with other skills. The workflow positioning (after experiments, before paper writing) and the specific mechanism (secondary Codex agent) make it highly distinctive.

Total: 10 / 12 (Passed)

Implementation: 77%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a well-structured, highly actionable skill with a clear multi-step workflow and explicit routing logic. Its main weakness is length — particularly the research wiki update section (Step 5), which could be extracted to a separate file. The skill does a good job of providing concrete guidance and validation checkpoints, though some introductory and explanatory text could be trimmed.

Suggestions

Extract Step 5 (Research Wiki updates) into a separate reference file (e.g., `result-to-claim-wiki-updates.md`) and link to it from the main skill to reduce inline verbosity.

Remove the 'When to Use' section — this information is better suited for the YAML frontmatter description and is redundant context for Claude.
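The extraction suggested above might leave only a short pointer in the main skill file. A minimal sketch, assuming the reference file uses the suggested name and sits alongside SKILL.md:

```markdown
## Step 5: Research Wiki updates

Follow the procedure in
[result-to-claim-wiki-updates.md](result-to-claim-wiki-updates.md),
which holds the full wiki-update instructions previously inlined here.
```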

Dimension scores:

Conciseness (2 / 3): The skill is fairly long but most content is functional (workflow steps, routing logic, wiki updates). However, some sections like 'When to Use' and the introductory sentence ('Experiments produce numbers; this gate decides what those numbers *mean*') are unnecessary for Claude. The research wiki update section (Step 5) is quite verbose and could be more compact.

Actionability (3 / 3): The skill provides concrete, executable guidance throughout: specific W&B API calls, exact prompt templates for the secondary Codex agent, structured output fields to extract, conditional logic for routing, and specific file paths and commands. The spawn_agent prompt is copy-paste ready and the routing decisions are explicit.

Workflow Clarity (3 / 3): The workflow is clearly sequenced (Steps 1-5) with explicit validation via the secondary Codex judgment and integrity audit check. The routing logic in Step 4 provides clear branching paths for each verdict (yes/partial/no) with specific actions. The integrity check in Step 3.5 includes a feedback loop that downgrades confidence on failure. Multiple rounds of 'partial' are addressed.

Progressive Disclosure (2 / 3): The skill references external files appropriately (shared-references/experiment-integrity.md, shared-references/review-tracing.md, findings.md, EXPERIMENT_AUDIT.json), but the main file itself is quite long. Step 5 (Research Wiki updates) is extensive inline content that could be split into a separate reference file. The conditional 'skip if not exists' pattern is good, but the wiki update details bloat the main skill.

Total: 10 / 12 (Passed)

Validation: 81%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation for skill structure: 9 / 11 checks passed

Criteria with warnings:

allowed_tools_field (Warning): 'allowed-tools' contains unusual tool name(s)

frontmatter_unknown_keys (Warning): Unknown frontmatter key(s) found; consider removing or moving to metadata

Total: 9 / 11 (Passed)
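Both warnings can typically be resolved in the frontmatter itself. A minimal sketch, assuming the unknown keys were custom fields and that the spec accepts a metadata block (as the warning message implies); the tool names and metadata fields below are placeholders, not the skill's actual values:

```yaml
# Before: custom keys at the top level trigger frontmatter_unknown_keys.
# After:  custom keys live under metadata, and allowed-tools lists only
#         standard tool names (placeholders shown here).
name: result-to-claim
description: ...
allowed-tools: Read, Write, Bash
metadata:
  stage: post-experiment
  routes: [pivot, supplement, confirm]
```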

Repository: wanshuiyin/Auto-claude-code-research-in-sleep (Reviewed)

