Optimize Apache Spark jobs with partitioning, caching, shuffle optimization, and memory tuning. Use when improving Spark performance, debugging slow jobs, or scaling data processing pipelines.
Overall score: 75

Quality: 71% (Does it follow best practices?)
Impact: 73%, 1.40x (average score across 6 eval scenarios)
Status: Passed, no known issues

Optimize this skill with Tessl:

```shell
npx tessl skill review --optimize ./plugins/data-engineering/skills/spark-optimization/SKILL.md
```

Quality
Discovery: 100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is a well-crafted skill description that concisely covers specific Spark optimization techniques, includes natural trigger terms users would employ, and clearly delineates both what the skill does and when to use it. The description is distinctive enough to avoid conflicts with adjacent skills like general data engineering or Python optimization.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific, concrete actions: partitioning, caching, shuffle optimization, and memory tuning. These are distinct, well-defined Spark optimization techniques. | 3 / 3 |
| Completeness | Clearly answers both what ('Optimize Apache Spark jobs with partitioning, caching, shuffle optimization, and memory tuning') and when ('Use when improving Spark performance, debugging slow jobs, or scaling data processing pipelines') with explicit trigger guidance. | 3 / 3 |
| Trigger Term Quality | Includes strong natural keywords users would say: 'Spark', 'partitioning', 'caching', 'shuffle', 'memory tuning', 'performance', 'slow jobs', 'data processing pipelines'. These cover common terms a user would use when seeking Spark optimization help. | 3 / 3 |
| Distinctiveness / Conflict Risk | Clearly scoped to Apache Spark optimization specifically, with domain-specific triggers like 'Spark', 'shuffle optimization', and 'partitioning' that are unlikely to conflict with general data engineering or other processing skills. | 3 / 3 |
| Total | | 12 / 12 (Passed) |
Implementation: 42%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The skill provides excellent, actionable PySpark code examples covering a comprehensive range of Spark optimization topics. However, it is far too verbose for a SKILL.md: it explains concepts Claude already understands (the execution model, what shuffles are, storage level definitions) and packs ~300 lines of content into a single monolithic document that should be split across multiple files. The lack of a diagnostic workflow with validation steps means there is no clear process for iteratively improving Spark job performance.
Suggestions
- Reduce the SKILL.md to a concise overview (~50-80 lines) with a quick-start config and the most critical patterns, moving detailed patterns (join optimization, caching, memory tuning, monitoring) to separate referenced files like JOINS.md, MEMORY.md, and MONITORING.md.
- Remove the 'Core Concepts' section entirely: Claude already knows the Spark execution model, what shuffles are, and the relationship between stages and tasks.
- Add a diagnostic workflow with validation steps, e.g.: 1) check the Spark UI for bottlenecks, 2) identify the issue type (skew/shuffle/memory), 3) apply the relevant pattern, 4) re-run and compare metrics, 5) iterate if needed.
- Remove inline explanatory comments that state the obvious (e.g., '# Fast compression', '# No shuffle', '# Spills to disk if needed') and the storage-levels explanation list.
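The first suggestion mentions a quick-start config. A minimal sketch of what such a block might contain follows; the keys are standard Spark settings, but the values are illustrative assumptions, not tuned recommendations:

```properties
# spark-defaults.conf sketch: enable adaptive query execution and sane shuffle defaults
spark.sql.adaptive.enabled                     true
spark.sql.adaptive.coalescePartitions.enabled  true
spark.sql.adaptive.skewJoin.enabled            true
spark.sql.shuffle.partitions                   200
spark.serializer                               org.apache.spark.serializer.KryoSerializer
```

With AQE enabled, Spark coalesces small shuffle partitions and splits skewed ones at runtime, which covers many of the manual tuning cases a longer guide would otherwise need to walk through.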
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is extremely verbose at ~300+ lines, explaining concepts Claude already knows (the Spark execution model, what shuffles are, storage level definitions, memory breakdown math). The 'Core Concepts' section, with its execution model diagram and key performance factors table, is unnecessary context, and comments like '# Fast compression' and '# No shuffle' explain things Claude knows. | 1 / 3 |
| Actionability | The code examples are fully executable, copy-paste-ready PySpark code with concrete configurations, specific function implementations (salt_join, calculate_partitions, check_partition_skew), and real spark-submit parameters. The configuration cheat sheet is immediately usable. | 3 / 3 |
| Workflow Clarity | The patterns are individually clear, but there is no sequenced workflow for diagnosing and fixing performance issues and no validation checkpoints: for example, after applying optimizations there is no step to verify improvement. The monitoring/debugging pattern exists but is not integrated into an optimization workflow with feedback loops. | 2 / 3 |
| Progressive Disclosure | This is a monolithic wall of text with all content inline. At 300+ lines, the detailed join patterns, caching strategies, memory tuning, and monitoring code should be split into separate reference files. There are no references to external files for advanced topics, and the skill tries to cover everything in one document. | 1 / 3 |
| Total | | 7 / 12 (Passed) |
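The actionability row credits helpers such as `salt_join`. The skill's own code is not reproduced in this review, but the salting technique it refers to can be sketched in plain Python (hypothetical names, no PySpark dependency): replicate the small side of the join once per salt value, then spread the skewed large side randomly across those salts so no single partition receives all copies of a hot key.

```python
import random

def salted_join(large, small, num_salts=4):
    """Sketch of a salted equi-join for skewed keys.

    large: iterable of (key, value) pairs, possibly dominated by hot keys.
    small: dict mapping key -> value (the small, broadcast-sized side).
    Replicating `small` across every salt keeps the join correct, while
    the random salt spreads each hot key over `num_salts` buckets.
    """
    # Small side: one copy per salt, keyed by (key, salt).
    small_salted = {(k, s): v
                    for k, v in small.items()
                    for s in range(num_salts)}
    joined = []
    for k, v in large:
        salt = random.randrange(num_salts)  # in Spark, this would shard the shuffle
        match = small_salted.get((k, salt))
        if match is not None:
            joined.append((k, v, match))
    return joined
```

In real PySpark the same idea is expressed with a `rand()`-derived salt column on the large side and an exploded salt range on the small side; the sketch above only shows why the replication preserves join correctness regardless of which salt each row draws.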
Validation: 100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation for skill structure: 11 / 11 checks passed. No warnings or errors.