Skill description under review:

> Analyze ML experiment results, compute statistics, generate comparison tables and insights. Use when user says "analyze results", "compare", or needs to interpret experimental data.
## Quality: 59%

Does the skill follow best practices?

- Impact: Pending (no eval scenarios have been run).
- Validation: Passed (no known issues).

To optimize this skill with Tessl, run `npx tessl skill review --optimize ./skills/skills-codex/analyze-results/SKILL.md`.
## Discovery: 77%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is a solid description that clearly states what the skill does and when to use it, with a proper 'Use when' clause. Its main weaknesses are somewhat generic trigger terms that could overlap with other analytical skills, and missing natural keyword variations that users commonly use when discussing ML experiments (e.g., 'metrics', 'model performance', 'benchmarks', 'evaluation').
### Suggestions

- Expand trigger terms to include ML-specific vocabulary users naturally say: 'metrics', 'accuracy', 'loss curves', 'model performance', 'benchmarks', 'evaluation results', 'training logs'.
- Add file format triggers if applicable (e.g., 'CSV results files', 'TensorBoard logs', 'wandb exports') to improve distinctiveness from generic data analysis skills.
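As an illustration of both suggestions, a description expanded with those trigger terms might read as follows. This is a hypothetical rewrite, not the skill's actual frontmatter, and the field names assume the standard SKILL.md format:

```yaml
---
name: analyze-results
description: >
  Analyze ML experiment results, compute statistics (mean, std, deltas),
  and generate comparison tables and insights. Use when the user says
  "analyze results", "compare runs", "summarize metrics", or asks about
  model performance, accuracy, loss curves, benchmarks, evaluation
  results, training logs, or CSV/JSON results files.
---
```

Keeping the 'Use when' clause while widening the trigger vocabulary addresses the overlap risk without making the description generic.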
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific, concrete actions: 'Analyze ML experiment results', 'compute statistics', 'generate comparison tables and insights'. These are clear, actionable capabilities. | 3 / 3 |
| Completeness | Clearly answers both 'what' (analyze ML experiment results, compute statistics, generate comparison tables and insights) and 'when' (explicit 'Use when' clause with trigger terms like 'analyze results', 'compare', or interpreting experimental data). | 3 / 3 |
| Trigger Term Quality | Includes some natural keywords like 'analyze results', 'compare', and 'experimental data', but misses common variations users might say such as 'metrics', 'accuracy', 'loss', 'training results', 'model performance', 'benchmark', or 'evaluation'. | 2 / 3 |
| Distinctiveness / Conflict Risk | The ML experiment focus provides some specificity, but terms like 'analyze results' and 'compare' are quite generic and could overlap with general data analysis, statistical analysis, or other comparison-oriented skills. The 'ML experiment' qualifier helps but the trigger terms themselves are broad. | 2 / 3 |
| **Total** | | **10 / 12 (Passed)** |
## Implementation: 42%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill provides a well-structured conceptual framework for analyzing ML experiment results, but critically lacks actionable, executable guidance. The workflow steps read more like a checklist of what to think about rather than concrete instructions Claude can follow—there are no code snippets for parsing results, computing statistics, or generating comparison tables. The organization and conciseness are reasonable, but the lack of concrete implementation significantly limits the skill's utility.
### Suggestions

- Add executable code examples for key operations: parsing JSON/CSV result files, computing mean/std statistics, and generating markdown comparison tables.
- Include a concrete example showing sample input data and the expected output format (e.g., a markdown table with deltas and a numbered findings list).
- Add a validation step after parsing results to verify data integrity (e.g., check for NaN values, confirm expected number of results, validate metric ranges).
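To make those suggestions concrete, a minimal sketch of what executable guidance could look like is shown below. It assumes a hypothetical layout where each run is a JSON file with a top-level `metrics` object; the layout and the helper names (`load_results`, `summarize`, `comparison_table`) are illustrative, not part of the skill:

```python
import json
import math
from pathlib import Path
from statistics import mean, stdev

def load_results(results_dir: str) -> list[dict]:
    """Parse every *.json result file and validate basic integrity."""
    runs = []
    for path in sorted(Path(results_dir).glob("*.json")):
        run = json.loads(path.read_text())
        for name, value in run["metrics"].items():
            # Validation step: reject non-numeric or NaN metric values early.
            if not isinstance(value, (int, float)) or math.isnan(value):
                raise ValueError(f"{path.name}: metric {name!r} is not a finite number")
        runs.append(run)
    return runs

def summarize(runs: list[dict], metric: str) -> tuple[float, float]:
    """Return (mean, std) of one metric across runs (std is 0 for a single run)."""
    values = [r["metrics"][metric] for r in runs]
    return mean(values), (stdev(values) if len(values) > 1 else 0.0)

def comparison_table(baseline: dict, candidate: dict) -> str:
    """Render a markdown table of shared metrics with deltas vs. baseline."""
    lines = ["| Metric | Baseline | Candidate | Delta |", "|---|---|---|---|"]
    for name in sorted(baseline["metrics"].keys() & candidate["metrics"].keys()):
        b, c = baseline["metrics"][name], candidate["metrics"][name]
        lines.append(f"| {name} | {b:.4f} | {c:.4f} | {c - b:+.4f} |")
    return "\n".join(lines)
```

Embedding snippets like these in the skill would turn the abstract steps ('Parse JSON results', 'Build Comparison Table') into instructions an agent can execute and verify.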
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The content is reasonably efficient and doesn't over-explain basic concepts, but some sections like 'Generate Insights' with its four-part structure template and 'Update Documentation' add bulk that could be tightened. No unnecessary explanations of what ML experiments are. | 2 / 3 |
| Actionability | The skill is entirely abstract guidance with no concrete code, commands, or executable examples. It describes what to do ('Parse JSON results', 'Build Comparison Table') but never shows how: no code for parsing files, computing statistics, generating tables, or identifying trends. | 1 / 3 |
| Workflow Clarity | Steps are clearly sequenced and logically ordered, but there are no validation checkpoints or feedback loops. For an analysis workflow that could involve parsing potentially malformed data or producing incorrect statistics, there's no verification step to confirm results are correct before generating insights. | 2 / 3 |
| Progressive Disclosure | For a skill of this size (~40 lines) with a single analytical task, the content is well-organized with clear section headers and a logical flow from data location through analysis to output. No external references are needed and the structure is easy to navigate. | 3 / 3 |
| **Total** | | **8 / 12 (Passed)** |
## Validation: 100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

11 / 11 checks passed. Skill structure validation reported no warnings or errors.
Skill revision: `dc00dfb`