Optimize Apache Spark jobs with partitioning, caching, shuffle optimization, and memory tuning. Use when improving Spark performance, debugging slow jobs, or scaling data processing pipelines.
Overall score: 77
Does it follow best practices? 71%
Impact: 77% (1.28x average score across 3 eval scenarios)
Passed; no known issues

Optimize this skill with Tessl:
npx tessl skill review --optimize ./tests/ext_conformance/artifacts/agents-wshobson/data-engineering/skills/spark-optimization/SKILL.md

Quality
Discovery
100%. Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is a well-crafted skill description that concisely covers specific Spark optimization techniques, includes natural trigger terms users would employ, and clearly delineates both what the skill does and when to use it. It uses proper third-person voice and is distinct enough to avoid conflicts with other data-related skills.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific concrete actions: partitioning, caching, shuffle optimization, and memory tuning. These are distinct, well-defined Spark optimization techniques. | 3 / 3 |
| Completeness | Clearly answers both 'what' (optimize Spark jobs with partitioning, caching, shuffle optimization, memory tuning) and 'when' (improving Spark performance, debugging slow jobs, scaling data processing pipelines) with an explicit 'Use when' clause. | 3 / 3 |
| Trigger Term Quality | Includes strong natural keywords users would say: 'Spark', 'partitioning', 'caching', 'shuffle', 'memory tuning', 'performance', 'slow jobs', 'data processing pipelines'. These cover common terms a user would use when seeking Spark optimization help. | 3 / 3 |
| Distinctiveness / Conflict Risk | Clearly scoped to Apache Spark optimization specifically, with distinct triggers like 'Spark', 'shuffle optimization', and 'partitioning' that are unlikely to conflict with general data engineering or other processing skills. | 3 / 3 |
| Total | | 12 / 12 Passed |
Implementation
42%. Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
The skill excels at actionability with comprehensive, executable code examples covering all major Spark optimization areas. However, it is far too verbose for a SKILL.md—it explains concepts Claude already knows, includes everything inline rather than using progressive disclosure, and lacks an integrated diagnostic workflow with validation checkpoints. This would benefit greatly from being restructured as a concise overview pointing to detailed pattern files.
Suggestions
Remove the 'Core Concepts' section (execution model diagram, key performance factors table)—Claude already knows these. Cut the 'When to Use This Skill' list as it restates the description.
Extract Patterns 1-7 into separate referenced files (e.g., JOINS.md, CACHING.md, MEMORY.md) and keep only the Quick Start and Configuration Cheat Sheet inline with brief pattern summaries linking out.
Add a diagnostic workflow at the top: 'Check Spark UI → identify bottleneck type → apply relevant pattern → verify improvement via metrics', with explicit validation steps between optimization attempts.
Remove inline comments that restate obvious things (e.g., '# Cache in memory (MEMORY_AND_DISK is default)', '# Storage levels explained:' list) to reduce token count significantly.
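To ground the skew-mitigation suggestion above: the review mentions a salt_join helper in the skill. The core idea can be illustrated without a Spark cluster; this is a minimal pure-Python sketch with hypothetical names (salt_key, explode_keys, num_salts are illustrative, not the skill's actual API), showing how one hot key is fanned out across several sub-keys while the small side is replicated to match:

```python
import random

def salt_key(key, num_salts):
    # Append a random salt so a single hot key fans out
    # across num_salts distinct join keys (and partitions).
    return f"{key}_{random.randrange(num_salts)}"

def explode_keys(key, num_salts):
    # The small (dimension) side is replicated once per salt
    # value so every salted fact-side key still finds a match.
    return [f"{key}_{i}" for i in range(num_salts)]

num_salts = 4
fact_keys = [salt_key("hot_customer", num_salts) for _ in range(1000)]
dim_keys = set(explode_keys("hot_customer", num_salts))

# Every salted fact key joins against the replicated dimension keys.
assert all(k in dim_keys for k in fact_keys)
```

In real PySpark the same pattern is applied column-wise (add a salt column to both sides, join on the composite key, then drop it); the sketch only demonstrates why the replication step is required for correctness.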
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is extremely verbose at ~350+ lines. It explains basic Spark concepts Claude already knows (execution model, what shuffles are, storage level definitions), includes redundant tables of concepts, and has extensive inline comments that restate obvious things. The 'When to Use This Skill' section and 'Core Concepts' section add little value. Much of this could be cut by 50%+ while preserving all actionable content. | 1 / 3 |
| Actionability | The skill provides fully executable Python code throughout: complete SparkSession configurations, join patterns, caching examples, partitioning strategies, and monitoring functions. Code is copy-paste ready with concrete examples including specific config values, function implementations like salt_join and check_partition_skew, and real spark-submit parameters. | 3 / 3 |
| Workflow Clarity | Individual patterns are clear and well-sequenced (e.g., cache → materialize → use → unpersist), but there's no overarching workflow for diagnosing and fixing Spark performance issues. The monitoring/debugging pattern (Pattern 7) comes last rather than being integrated as validation checkpoints. There are no explicit feedback loops like 'check Spark UI → identify bottleneck → apply pattern → verify improvement'. | 2 / 3 |
| Progressive Disclosure | This is a monolithic wall of text with all content inline: 7 detailed patterns, a configuration cheat sheet, best practices, and resources all in one file. The patterns for join optimization, caching, memory tuning, shuffle optimization, and data formats could each be separate referenced files. The external links at the bottom are to general Spark docs rather than skill-specific detailed materials. | 1 / 3 |
| Total | | 7 / 12 Passed |
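The review notes that the skill ships real spark-submit parameters and a configuration cheat sheet. For context, such settings typically look like the following; this is a hedged illustration with placeholder values (the resource numbers are arbitrary, not recommendations, and the right values depend on cluster size and workload), using standard Spark configuration keys:

```shell
# Illustrative spark-submit invocation; resource values are placeholders.
spark-submit \
  --executor-memory 8g \
  --executor-cores 4 \
  --num-executors 10 \
  --conf spark.sql.shuffle.partitions=200 \
  --conf spark.sql.adaptive.enabled=true \
  --conf spark.memory.fraction=0.6 \
  job.py
```

Keeping a compact cheat sheet of flags like these inline, while moving the per-pattern explanations into referenced files, is consistent with the progressive-disclosure suggestion above.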
Validation
100%. Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 11 / 11 Passed
Validation for skill structure
No warnings or errors.
If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.