
analyze-results

Analyze ML experiment results, compute statistics, generate comparison tables and insights. Use when user says "analyze results", "compare", or needs to interpret experimental data.

68

Quality: 59%

Does it follow best practices?

Impact: Pending

No eval scenarios have been run.

Security (by Snyk): Passed

No known issues.

Optimize this skill with Tessl

npx tessl skill review --optimize ./skills/skills-codex/analyze-results/SKILL.md

Quality

Discovery: 77%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is a solid description that clearly states what the skill does and when to use it, with a proper 'Use when' clause. Its main weaknesses are somewhat generic trigger terms that could overlap with other analytical skills, and missing natural keyword variations that users commonly use when discussing ML experiments (e.g., 'metrics', 'model performance', 'benchmarks', 'evaluation').

Suggestions

Expand trigger terms to include ML-specific vocabulary users naturally say: 'metrics', 'accuracy', 'loss curves', 'model performance', 'benchmarks', 'evaluation results', 'training logs'

Add file format triggers if applicable (e.g., 'CSV results files', 'TensorBoard logs', 'wandb exports') to improve distinctiveness from generic data analysis skills

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Lists multiple specific, concrete actions: 'Analyze ML experiment results', 'compute statistics', 'generate comparison tables and insights'. These are clear, actionable capabilities. | 3 / 3 |
| Completeness | Clearly answers both 'what' (analyze ML experiment results, compute statistics, generate comparison tables and insights) and 'when' (explicit 'Use when' clause with trigger terms like 'analyze results', 'compare', or interpreting experimental data). | 3 / 3 |
| Trigger Term Quality | Includes some natural keywords like 'analyze results', 'compare', and 'experimental data', but misses common variations users might say, such as 'metrics', 'accuracy', 'loss', 'training results', 'model performance', 'benchmark', or 'evaluation'. | 2 / 3 |
| Distinctiveness / Conflict Risk | The ML experiment focus provides some specificity, but terms like 'analyze results' and 'compare' are quite generic and could overlap with general data analysis, statistical analysis, or other comparison-oriented skills. The 'ML experiment' qualifier helps, but the trigger terms themselves are broad. | 2 / 3 |

Total: 10 / 12 (Passed)

Implementation: 42%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill provides a well-structured conceptual framework for analyzing ML experiment results, but critically lacks actionable, executable guidance. The workflow steps read more like a checklist of what to think about rather than concrete instructions Claude can follow—there are no code snippets for parsing results, computing statistics, or generating comparison tables. The organization and conciseness are reasonable, but the lack of concrete implementation significantly limits the skill's utility.

Suggestions

Add executable code examples for key operations: parsing JSON/CSV result files, computing mean/std statistics, and generating markdown comparison tables.

Include a concrete example showing sample input data and the expected output format (e.g., a markdown table with deltas and a numbered findings list).

Add a validation step after parsing results to verify data integrity (e.g., check for NaN values, confirm expected number of results, validate metric ranges).
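As a sketch of what those suggestions could look like inside the skill, here is a minimal Python example. It assumes a hypothetical layout of one JSON file per run (`results/<run>.json`, each a flat metric-name-to-value dict); the paths, metric names, and helper names are illustrative, not taken from the skill itself:

```python
import json
import math
from pathlib import Path
from statistics import mean, stdev

def load_results(results_dir="results"):
    """Parse one JSON result file per run (hypothetical layout:
    results/<run>.json holding a flat {"metric": value} dict)."""
    return {p.stem: json.loads(p.read_text())
            for p in sorted(Path(results_dir).glob("*.json"))}

def validate(runs, expected_runs=None):
    """Integrity checks before any analysis: run count and NaN screening."""
    if expected_runs is not None and len(runs) != expected_runs:
        raise ValueError(f"expected {expected_runs} runs, found {len(runs)}")
    for run, metrics in runs.items():
        for name, value in metrics.items():
            if isinstance(value, float) and math.isnan(value):
                raise ValueError(f"NaN metric {name!r} in run {run!r}")

def summarize(values):
    """Mean and sample std of one metric over repeated runs."""
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread

def comparison_table(runs, baseline):
    """Render a markdown table with per-metric deltas against a baseline run."""
    metrics = sorted(runs[baseline])
    rows = ["| Run | " + " | ".join(metrics) + " |",
            "|---" * (len(metrics) + 1) + "|"]
    for run, vals in runs.items():
        cells = [f"{vals[m]:.4f} ({vals[m] - runs[baseline][m]:+.4f})"
                 for m in metrics]
        rows.append(f"| {run} | " + " | ".join(cells) + " |")
    return "\n".join(rows)
```

Embedding something this size directly in SKILL.md would address all three suggestions at once: it shows the parsing step, the statistics, the markdown output format, and a validation gate before any insight generation.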

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The content is reasonably efficient and doesn't over-explain basic concepts, but some sections, like 'Generate Insights' with its four-part structure template and 'Update Documentation', add bulk that could be tightened. No unnecessary explanations of what ML experiments are. | 2 / 3 |
| Actionability | The skill is entirely abstract guidance with no concrete code, commands, or executable examples. It describes what to do ('Parse JSON results', 'Build Comparison Table') but never shows how: no code for parsing files, computing statistics, generating tables, or identifying trends. | 1 / 3 |
| Workflow Clarity | Steps are clearly sequenced and logically ordered, but there are no validation checkpoints or feedback loops. For an analysis workflow that could involve parsing potentially malformed data or producing incorrect statistics, there is no verification step to confirm results are correct before generating insights. | 2 / 3 |
| Progressive Disclosure | For a skill of this size (~40 lines) with a single analytical task, the content is well organized, with clear section headers and a logical flow from data location through analysis to output. No external references are needed and the structure is easy to navigate. | 3 / 3 |

Total: 8 / 12 (Passed)
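The verification checkpoint flagged under Workflow Clarity could be as small as a range guard run after parsing and before insight generation. The metric names and plausible ranges below are assumptions for illustration, not part of the reviewed skill:

```python
# Plausible value ranges per metric; purely illustrative defaults.
PLAUSIBLE_RANGES = {
    "accuracy": (0.0, 1.0),
    "f1": (0.0, 1.0),
    "loss": (0.0, float("inf")),
}

def check_metric_ranges(metrics, ranges=None):
    """Fail fast if any known metric falls outside its plausible range."""
    ranges = ranges if ranges is not None else PLAUSIBLE_RANGES
    for name, (lo, hi) in ranges.items():
        if name in metrics and not lo <= metrics[name] <= hi:
            raise ValueError(f"{name}={metrics[name]} outside [{lo}, {hi}]")
```

A checkpoint like this turns a silent parsing error (for example, an accuracy of 1.5 from a malformed log) into an immediate, diagnosable failure instead of a misleading comparison table.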

Validation: 100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 11 / 11 passed

Validation for skill structure: no warnings or errors.

Repository: wanshuiyin/Auto-claude-code-research-in-sleep (Reviewed)
