Optimize Apache Spark jobs with partitioning, caching, shuffle optimization, and memory tuning. Use when improving Spark performance, debugging slow jobs, or scaling data processing pipelines.
Overall score: 75

Quality: 71% (Does it follow best practices?)
Impact: 73%, 1.40x (average score across 6 eval scenarios)
Status: Passed, no known issues

Optimize this skill with Tessl:

```shell
npx tessl skill review --optimize ./plugins/data-engineering/skills/spark-optimization/SKILL.md
```

Quality
Discovery: 100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is a well-crafted skill description that concisely covers specific Spark optimization techniques, includes natural trigger terms users would employ, and clearly delineates both what the skill does and when to use it. The description is distinctive enough to avoid conflicts with adjacent skills like general data engineering or Python optimization.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific, concrete actions: partitioning, caching, shuffle optimization, and memory tuning. These are distinct, well-defined Spark optimization techniques. | 3 / 3 |
| Completeness | Clearly answers both what ('Optimize Apache Spark jobs with partitioning, caching, shuffle optimization, and memory tuning') and when ('Use when improving Spark performance, debugging slow jobs, or scaling data processing pipelines') with explicit trigger guidance. | 3 / 3 |
| Trigger Term Quality | Includes strong natural keywords users would say: 'Spark', 'partitioning', 'caching', 'shuffle', 'memory tuning', 'performance', 'slow jobs', 'data processing pipelines'. These cover common terms a user would use when seeking Spark optimization help. | 3 / 3 |
| Distinctiveness / Conflict Risk | Clearly scoped to Apache Spark optimization specifically, with domain-specific triggers like 'Spark', 'shuffle optimization', and 'partitioning' that are unlikely to conflict with general data engineering or other processing skills. | 3 / 3 |
| Total | | 12 / 12 (Passed) |
Implementation: 42%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The skill provides excellent, actionable PySpark code examples covering a comprehensive range of Spark optimization topics. However, it is far too verbose for a SKILL.md: it explains concepts Claude already understands (the execution model, what shuffles are, storage level definitions) and packs ~300 lines of content into a single monolithic document that should be split across multiple files. The lack of a diagnostic workflow with validation steps means there is no clear process for iteratively improving Spark job performance.
Suggestions
- Reduce the SKILL.md to a concise overview (~50-80 lines) with a quick-start config and the most critical patterns, moving detailed patterns (join optimization, caching, memory tuning, monitoring) to separate referenced files like JOINS.md, MEMORY.md, and MONITORING.md.
- Remove the 'Core Concepts' section entirely: Claude already knows the Spark execution model, what shuffles are, and the relationship between stages and tasks.
- Add a diagnostic workflow with validation steps, e.g.: 1) check the Spark UI for bottlenecks, 2) identify the issue type (skew/shuffle/memory), 3) apply the relevant pattern, 4) re-run and compare metrics, 5) iterate if needed.
- Remove inline explanatory comments that state the obvious (e.g., '# Fast compression', '# No shuffle', '# Spills to disk if needed') and the storage-levels explanation list.
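The first suggestion mentions a quick-start config. A minimal sketch of what such a block might contain follows; the keys are standard Spark settings, but the values are illustrative assumptions, not tuned recommendations:

```properties
# spark-defaults.conf sketch: enable adaptive query execution and sane shuffle defaults
spark.sql.adaptive.enabled                     true
spark.sql.adaptive.coalescePartitions.enabled  true
spark.sql.adaptive.skewJoin.enabled            true
spark.sql.shuffle.partitions                   200
spark.serializer                               org.apache.spark.serializer.KryoSerializer
```

With AQE enabled, Spark coalesces small shuffle partitions and splits skewed ones at runtime, which covers many of the manual tuning cases a longer guide would otherwise need to walk through.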
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is extremely verbose at ~300+ lines, explaining concepts Claude already knows (the Spark execution model, what shuffles are, storage level definitions, memory breakdown math). The 'Core Concepts' section, with its execution model diagram and key performance factors table, is unnecessary context, and comments like '# Fast compression' and '# No shuffle' explain things Claude knows. | 1 / 3 |
| Actionability | The code examples are fully executable, copy-paste-ready PySpark code with concrete configurations, specific function implementations (salt_join, calculate_partitions, check_partition_skew), and real spark-submit parameters. The configuration cheat sheet is immediately usable. | 3 / 3 |
| Workflow Clarity | The patterns are individually clear, but there is no sequenced workflow for diagnosing and fixing performance issues and no validation checkpoints: for example, after applying optimizations there is no step to verify improvement. The monitoring/debugging pattern exists but is not integrated into an optimization workflow with feedback loops. | 2 / 3 |
| Progressive Disclosure | This is a monolithic wall of text with all content inline. At 300+ lines, the detailed join patterns, caching strategies, memory tuning, and monitoring code should be split into separate reference files. There are no references to external files for advanced topics, and the skill tries to cover everything in one document. | 1 / 3 |
| Total | | 7 / 12 (Passed) |
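The actionability row credits helpers such as `salt_join`. The skill's own code is not reproduced in this review, but the salting technique it refers to can be sketched in plain Python (hypothetical names, no PySpark dependency): replicate the small side of the join once per salt value, then spread the skewed large side randomly across those salts so no single partition receives all copies of a hot key.

```python
import random

def salted_join(large, small, num_salts=4):
    """Sketch of a salted equi-join for skewed keys.

    large: iterable of (key, value) pairs, possibly dominated by hot keys.
    small: dict mapping key -> value (the small, broadcast-sized side).
    Replicating `small` across every salt keeps the join correct, while
    the random salt spreads each hot key over `num_salts` buckets.
    """
    # Small side: one copy per salt, keyed by (key, salt).
    small_salted = {(k, s): v
                    for k, v in small.items()
                    for s in range(num_salts)}
    joined = []
    for k, v in large:
        salt = random.randrange(num_salts)  # in Spark, this would shard the shuffle
        match = small_salted.get((k, salt))
        if match is not None:
            joined.append((k, v, match))
    return joined
```

In real PySpark the same idea is expressed with a `rand()`-derived salt column on the large side and an exploded salt range on the small side; the sketch above only shows why the replication preserves join correctness regardless of which salt each row draws.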
Validation: 100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation for skill structure: 11 / 11 checks passed. No warnings or errors.