Optimize Apache Spark jobs with partitioning, caching, shuffle optimization, and memory tuning. Use when improving Spark performance, debugging slow jobs, or scaling data processing pipelines.
Overall score: 75
Quality: 71% (Does it follow best practices?)
Impact: 73% (1.40x average score across 6 eval scenarios)
Passed — no known issues
Optimize this skill with Tessl:

`npx tessl skill review --optimize ./plugins/data-engineering/skills/spark-optimization/SKILL.md`

Quality

Discovery — 100%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is a well-crafted skill description that concisely covers specific Spark optimization techniques, includes natural trigger terms users would employ, and clearly delineates both what the skill does and when to use it. It uses proper third-person voice and is distinctive enough to avoid conflicts with other data-related skills.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific concrete actions: partitioning, caching, shuffle optimization, and memory tuning. These are distinct, well-defined Spark optimization techniques. | 3 / 3 |
| Completeness | Clearly answers both 'what' (optimize Spark jobs with partitioning, caching, shuffle optimization, memory tuning) and 'when' (improving Spark performance, debugging slow jobs, scaling data processing pipelines) with an explicit 'Use when' clause. | 3 / 3 |
| Trigger Term Quality | Includes strong natural keywords users would say: 'Spark', 'partitioning', 'caching', 'shuffle', 'memory tuning', 'performance', 'slow jobs', 'data processing pipelines'. These cover common terms a user would use when seeking Spark optimization help. | 3 / 3 |
| Distinctiveness / Conflict Risk | Clearly scoped to Apache Spark optimization specifically, with distinct triggers like 'Spark', 'shuffle optimization', and 'partitioning' that are unlikely to conflict with general data engineering or other processing skills. | 3 / 3 |
| Total | | 12 / 12 — Passed |
Implementation — 42%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
The skill provides excellent, executable code examples covering a comprehensive range of Spark optimization techniques, making it highly actionable. However, it is far too verbose — explaining basic Spark concepts Claude already knows, including a full execution model diagram and storage level definitions. The lack of any progressive disclosure structure means all content is crammed into one large file, and the absence of a diagnostic workflow with validation checkpoints limits its practical utility for systematic performance tuning.
Suggestions

- Remove the 'Core Concepts' section (execution model, key performance factors table) — Claude already knows these fundamentals. This alone would save ~20 lines.
- Split into SKILL.md (quick start + config cheat sheet + best practices) with references to separate files like JOINS.md, MEMORY.md, CACHING.md for detailed patterns.
- Add a diagnostic workflow: 1. Check Spark UI for bottlenecks → 2. Identify issue type (skew/shuffle/memory) → 3. Apply relevant pattern → 4. Validate improvement with metrics → 5. Iterate.
- Remove inline comments that explain obvious things (e.g., '# Cache in memory', '# Fast compression') and the storage levels explanation list.
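The validation checkpoint the diagnostic workflow calls for could be as simple as a skew-ratio check after repartitioning. A minimal, pure-Python sketch of that logic (the partition counts would come from something like `df.rdd.glom().map(len).collect()` in PySpark; here they are simulated, and the thresholds are illustrative assumptions, not values from the reviewed skill):

```python
# Hypothetical skew checkpoint: ratio of the largest partition to the
# mean partition size (1.0 = perfectly even distribution).
def skew_ratio(partition_counts):
    """Return max/mean ratio of per-partition record counts."""
    mean = sum(partition_counts) / len(partition_counts)
    return max(partition_counts) / mean

# A balanced layout stays near 1.0; one hot partition pushes the ratio up.
balanced = [1000, 990, 1010, 1005]
skewed = [1000, 950, 20000, 980]

assert skew_ratio(balanced) < 1.5   # checkpoint passes: proceed
assert skew_ratio(skewed) > 3.0     # checkpoint fails: repartition or salt first
```

Running this check between steps 3 and 4 of the workflow gives the agent a concrete pass/fail signal instead of assuming the repartition worked.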
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is extremely verbose at ~300+ lines. It explains basic Spark concepts Claude already knows (execution model, what shuffles are, storage level definitions), includes redundant comments, and the 'When to Use This Skill' section restates the description. The key performance factors table and core concepts section are textbook material that wastes tokens. | 1 / 3 |
| Actionability | The skill provides fully executable Python code throughout — complete SparkSession configurations, working join examples, caching patterns, and utility functions like salt_join and check_partition_skew. Code is copy-paste ready with realistic S3 paths and proper imports. | 3 / 3 |
| Workflow Clarity | Patterns are individually clear but there's no sequenced workflow for diagnosing and fixing performance issues. There are no validation checkpoints — e.g., after repartitioning, no step to verify partition distribution; after caching, no step to confirm materialization succeeded. The monitoring pattern exists but isn't integrated into an optimization workflow. | 2 / 3 |
| Progressive Disclosure | Everything is in a single monolithic file with no references to supporting documents. The content covers 7 patterns plus configuration and best practices — topics like join optimization, memory tuning, and monitoring could each be separate referenced files. No bundle files exist to offload this content. | 1 / 3 |
| Total | | 7 / 12 — Passed |
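The `salt_join` utility the review credits is not reproduced here, but the key idea behind key salting is small enough to sketch. A hedged, pure-Python illustration (function names, the bucket count, and the key format are assumptions for illustration, not the reviewed skill's actual implementation):

```python
import random

# Hypothetical sketch of key salting for skewed joins: append a random
# salt to the hot side's key, and replicate the small side's keys across
# every salt value so every salted key still finds a match.
SALT_BUCKETS = 4

def salt_key(key, n=SALT_BUCKETS):
    """Skewed side: scatter one hot key across n salted buckets."""
    return f"{key}_{random.randrange(n)}"

def explode_keys(key, n=SALT_BUCKETS):
    """Small side: emit one salted copy of the key per bucket."""
    return [f"{key}_{i}" for i in range(n)]

# Every salted key produced for 'user_42' matches one of the exploded keys,
# so the join result is unchanged while the hot key's rows spread over
# SALT_BUCKETS partitions instead of one.
assert salt_key("user_42") in explode_keys("user_42")
```

In PySpark the same idea is typically expressed with `rand()` and `explode()` over the join columns; the sketch above only shows why the replicated keys keep the join correct.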
Validation — 100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation for skill structure: 11 / 11 passed. No warnings or errors.