
analyze-results

Analyze ML experiment results, compute statistics, generate comparison tables and insights. Use when user says "analyze results", "compare", or needs to interpret experimental data.

Overall score: 68

Quality: 59% (Does it follow best practices?)

Impact: Pending (No eval scenarios have been run)

Security (by Snyk): Passed (No known issues)

Optimize this skill with Tessl

npx tessl skill review --optimize ./skills/skills-codex/analyze-results/SKILL.md

Quality

Discovery: 77%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is a solid description that clearly states what the skill does and when to use it, with an explicit 'Use when' clause. Its main weaknesses are somewhat generic trigger terms that could overlap with other analytical skills, and missing common ML-specific vocabulary that users would naturally use (e.g., 'metrics', 'model performance', 'benchmarks').

Suggestions

- Add more ML-specific trigger terms users would naturally say, such as 'metrics', 'model performance', 'benchmarks', 'accuracy', 'loss curves', 'training results', or 'evaluation'.
- Strengthen distinctiveness by specifying the types of ML experiments or data formats supported (e.g., 'hyperparameter sweeps', 'ablation studies', 'CSV logs') to reduce overlap with generic data analysis skills.
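Putting both suggestions together, a sharpened description might read: 'Analyze ML experiment results from JSON or CSV logs (hyperparameter sweeps, ablation studies): compute summary statistics, compare model performance across runs, and generate comparison tables and insights. Use when the user says "analyze results", "compare runs", "metrics", "accuracy", or "training results".' This wording is illustrative only, not the skill's actual description.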

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Lists multiple specific concrete actions: 'Analyze ML experiment results', 'compute statistics', 'generate comparison tables and insights'. These are clear, actionable capabilities. | 3 / 3 |
| Completeness | Clearly answers both 'what' (analyze ML experiment results, compute statistics, generate comparison tables and insights) and 'when' (explicit 'Use when' clause with trigger terms like 'analyze results', 'compare', or interpreting experimental data). | 3 / 3 |
| Trigger Term Quality | Includes some natural keywords like 'analyze results', 'compare', and 'experimental data', but misses common variations users might say, such as 'metrics', 'accuracy', 'loss', 'training results', 'model performance', 'benchmark', or 'evaluation'. | 2 / 3 |
| Distinctiveness / Conflict Risk | The ML experiment focus provides some specificity, but terms like 'analyze results' and 'compare' are quite broad and could overlap with general data analysis or statistics skills. The 'experimental data' qualifier helps but isn't strongly distinctive. | 2 / 3 |
| Total | | 10 / 12 (Passed) |

Implementation: 42%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The skill provides a well-structured workflow outline for analyzing ML experiment results, with clear sequencing and good organization. However, it critically lacks actionability—there are no concrete code examples, specific commands, or executable snippets for any of the analysis steps. The skill reads more like a checklist of what to do rather than instructions on how to do it.

Suggestions

- Add executable Python code examples for key steps, e.g., loading JSON results with `json.load()`, computing statistics with `numpy`, and generating comparison tables with `pandas.DataFrame` (see the sketch after this list).
- Include a concrete example showing sample input data and expected output (e.g., a sample comparison table and finding statement).
- Add a validation checkpoint after Step 1 to verify that result files were found and parsed correctly before proceeding to analysis.
- Replace vague instructions like 'Parse JSON results into structured data' with specific code patterns showing the expected data structure.
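To make the first two suggestions concrete, here is a minimal sketch of what such an example could look like. It is not the skill's actual implementation: the `results/` directory, the per-run JSON layout, and the `"metrics"` key are all assumptions made for illustration.

```python
import json
from pathlib import Path

import numpy as np
import pandas as pd

# Assumed layout (hypothetical): results/<run_name>.json, each file holding
# {"config": {...}, "metrics": {"accuracy": [...], "loss": [...]}}.
runs = {}
for path in sorted(Path("results").glob("*.json")):
    with path.open() as f:
        runs[path.stem] = json.load(f)

# Validation checkpoint (suggestion 3): stop early if nothing was found.
if not runs:
    raise FileNotFoundError("No result files found under results/")

# Compute mean +/- std per run and per metric.
rows = []
for name, run in runs.items():
    for metric, values in run["metrics"].items():
        values = np.asarray(values, dtype=float)
        mean = values.mean()
        std = values.std(ddof=1) if len(values) > 1 else 0.0
        rows.append({"run": name, "metric": metric, "summary": f"{mean:.4f} ± {std:.4f}"})

# Pivot into a runs-by-metrics comparison table (suggestion 1).
table = pd.DataFrame(rows).pivot(index="run", columns="metric", values="summary")
print(table.to_string())
```

Run over two hypothetical result files, this prints a small runs-by-metrics grid of 'mean ± std' cells, which is the kind of sample input/output pairing the second suggestion asks for.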

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The content is reasonably efficient and doesn't over-explain basic concepts, but some sections, like 'Generate Insights' with its 4-part structure template and 'Update Documentation', add bulk without providing concrete, executable guidance. Could be tightened. | 2 / 3 |
| Actionability | The skill is entirely abstract and descriptive: no concrete code, commands, or executable examples. Instructions like 'Parse JSON results into structured data' and 'report mean +/- std' are vague directions without any actual implementation (e.g., no Python snippets for loading JSON, computing statistics, or generating tables). | 1 / 3 |
| Workflow Clarity | Steps are clearly sequenced and logically ordered, but there are no validation checkpoints or feedback loops. For an analysis workflow that could involve parsing errors or suspicious data, there's no explicit 'verify parsed data is complete' or error-recovery step beyond 'flag outliers' (a checkpoint along these lines is sketched below the table). | 2 / 3 |
| Progressive Disclosure | For a skill of this size (~40 lines, single-purpose), the content is well organized, with clear section headers and a logical flow from locating results to generating output. No need for external file references at this length. | 3 / 3 |
| Total | | 8 / 12 (Passed) |
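On the Workflow Clarity point, the missing 'verify parsed data is complete' step could look something like the following sketch. The required metric names and the outlier threshold are invented for illustration; the real skill would substitute its own schema.

```python
import numpy as np

REQUIRED_METRICS = {"accuracy", "loss"}  # hypothetical; adjust to the skill's schema

def validate_runs(runs: dict) -> list[str]:
    """Collect human-readable warnings rather than failing on the first error."""
    warnings = []
    for name, run in runs.items():
        metrics = run.get("metrics", {})
        missing = REQUIRED_METRICS - metrics.keys()
        if missing:
            warnings.append(f"{name}: missing metrics {sorted(missing)}")
        for metric, values in metrics.items():
            values = np.asarray(values, dtype=float)
            if np.isnan(values).any():
                warnings.append(f"{name}: NaN values in {metric}")
            # Crude outlier flag: points more than 3 standard deviations from
            # the mean; only meaningful once there are more than a few samples.
            if len(values) > 3 and values.std() > 0:
                z = np.abs(values - values.mean()) / values.std()
                if (z > 3).any():
                    warnings.append(f"{name}: possible outliers in {metric}")
    return warnings
```

Surfacing warnings instead of raising keeps the analysis moving while still giving the agent an explicit checkpoint to report on, addressing the 'no error recovery beyond flag outliers' gap.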

Validation: 100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Checks passed: 11 / 11 (validation for skill structure). No warnings or errors.

Repository: wanshuiyin/Auto-claude-code-research-in-sleep (Reviewed)
