analyze-results

Analyze ML experiment results, compute statistics, generate comparison tables and insights. Use when user says "analyze results", "compare", or needs to interpret experimental data.

54

Quality

59%

Does it follow best practices?

Impact

No eval scenarios have been run

Security by Snyk

Passed

No known issues

Optimize this skill with Tessl

npx tessl skill review --optimize ./skills/skills-codex/analyze-results/SKILL.md
Quality

Discovery

77%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is a solid description that clearly states what the skill does and when to use it, with an explicit 'Use when' clause. Its main weaknesses are somewhat generic trigger terms ('analyze results', 'compare') that could conflict with other analytical skills, and missing common ML-specific vocabulary that users would naturally use (e.g., 'metrics', 'model performance', 'benchmarks').

Suggestions

Add more ML-specific trigger terms like 'metrics', 'model performance', 'benchmarks', 'training results', 'accuracy', 'loss curves' to improve trigger term coverage.

Strengthen distinctiveness by specifying the types of ML experiments or data formats supported (e.g., 'hyperparameter sweeps', 'ablation studies', 'CSV logs') to reduce overlap with generic data analysis skills.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Lists multiple specific concrete actions: 'Analyze ML experiment results', 'compute statistics', 'generate comparison tables and insights'. These are clear, actionable capabilities. | 3 / 3 |
| Completeness | Clearly answers both 'what' (analyze ML experiment results, compute statistics, generate comparison tables and insights) and 'when' (explicit 'Use when' clause with trigger terms like 'analyze results', 'compare', or interpreting experimental data). | 3 / 3 |
| Trigger Term Quality | Includes some natural keywords like 'analyze results', 'compare', and 'experimental data', but misses common variations users might say such as 'metrics', 'accuracy', 'loss', 'training results', 'model performance', 'benchmark', or 'evaluation'. | 2 / 3 |
| Distinctiveness Conflict Risk | The ML experiment focus provides some specificity, but terms like 'analyze results' and 'compare' are quite broad and could overlap with general data analysis or statistics skills. The 'experimental data' qualifier helps but isn't strongly distinctive. | 2 / 3 |
| Total | | 10 / 12 |

Passed

Implementation

42%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The skill provides a well-structured workflow for ML experiment analysis with clear steps and a useful output format specification. However, it is critically lacking in actionability—there are no concrete code examples, no specific library usage (pandas, scipy, etc.), and no example input/output to guide Claude. The workflow reads more like a high-level checklist than an executable skill.

Suggestions

Add concrete Python code examples for key steps, e.g., parsing JSON results with pandas, computing mean/std across seeds, and generating a markdown comparison table.

Include a concrete example showing sample input data (e.g., a small JSON result snippet) and the expected output (formatted table + findings), so Claude knows exactly what to produce.

Add a validation checkpoint after Step 1 (e.g., 'Confirm all expected result files were found and parsed without errors; list any missing or malformed files before proceeding').
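
The three suggestions above could be met with something like the following. This is a minimal sketch, not the skill's actual code: the per-file schema (`method`, `seed`, `accuracy`), the file names, and the helper names `load_results`, `summarize`, and `to_markdown` are all hypothetical, and it uses only the Python standard library in place of pandas to stay dependency-free. It parses per-seed JSON results, records malformed files instead of silently skipping them, reports mean and sample standard deviation per method, and renders a markdown comparison table.

```python
import json
import statistics
from collections import defaultdict


def load_results(raw_files):
    """Parse {filename: JSON text}; return (records, problems)."""
    records, problems = [], []
    for name, text in raw_files.items():
        try:
            rec = json.loads(text)
            # Hypothetical schema: one object per file with these three keys.
            records.append({
                "method": rec["method"],
                "seed": rec["seed"],
                "accuracy": float(rec["accuracy"]),
            })
        except (json.JSONDecodeError, KeyError, TypeError, ValueError) as exc:
            # Validation checkpoint: remember which file failed and why.
            problems.append(f"{name}: {type(exc).__name__}")
    return records, problems


def summarize(records):
    """Group by method; return {method: (mean, sample std, number of seeds)}."""
    by_method = defaultdict(list)
    for rec in records:
        by_method[rec["method"]].append(rec["accuracy"])
    summary = {}
    for method, values in by_method.items():
        # Sample std is undefined for a single seed; report 0.0 in that case.
        std = statistics.stdev(values) if len(values) > 1 else 0.0
        summary[method] = (statistics.mean(values), std, len(values))
    return summary


def to_markdown(summary):
    """Render the summary as a markdown table, best mean accuracy first."""
    lines = [
        "| Method | Accuracy (mean ± std) | Seeds |",
        "| --- | --- | --- |",
    ]
    for method in sorted(summary, key=lambda m: summary[m][0], reverse=True):
        mean, std, n = summary[method]
        lines.append(f"| {method} | {mean:.3f} ± {std:.3f} | {n} |")
    return "\n".join(lines)


# Illustrative input only; the last file is deliberately truncated.
raw_files = {
    "baseline_seed0.json": '{"method": "baseline", "seed": 0, "accuracy": 0.80}',
    "baseline_seed1.json": '{"method": "baseline", "seed": 1, "accuracy": 0.82}',
    "ours_seed0.json": '{"method": "ours", "seed": 0, "accuracy": 0.85}',
    "ours_seed1.json": '{"method": "ours", "seed": 1, "accuracy": 0.87}',
    "ours_seed2.json": '{"method": "ours", "seed": 2, "accuracy": ',
}

records, problems = load_results(raw_files)
if problems:
    print("Malformed or incomplete files:", problems)

summary = summarize(records)
table = to_markdown(summary)
print(table)
```

Listing the failed files before the table is printed gives the agent a natural point to stop and ask whether a missing seed should be rerun or excluded, which is the kind of checkpoint the Workflow Clarity score asks for.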

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The content is reasonably efficient but includes some unnecessary framing (e.g., 'For each finding, structure as' could be tighter). It doesn't over-explain concepts Claude knows, but some sections like Step 5 feel padded for a skill that's primarily about analysis. | 2 / 3 |
| Actionability | The skill provides only abstract guidance with no concrete code, commands, or executable examples. Instructions like 'Parse JSON results into structured data' and 'report mean +/- std' are vague descriptions rather than actionable steps: no Python snippets, no specific library calls, no example input/output. | 1 / 3 |
| Workflow Clarity | Steps are clearly sequenced and logically ordered, but there are no validation checkpoints or feedback loops. For an analysis workflow that could involve parsing errors or suspicious data, there's no explicit 'verify parsed data is complete' or 'confirm table matches source files' step. | 2 / 3 |
| Progressive Disclosure | For a skill of this size (~40 lines) with no bundle files, the content is well-organized into clear sections with logical headers. No external references are needed, and the structure supports easy scanning. | 3 / 3 |
| Total | | 8 / 12 |

Passed

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 11 / 11 passed

Validation for skill structure

No warnings or errors.

Repository: wanshuiyin/Auto-claude-code-research-in-sleep
Reviewed
