Optimize Apache Spark jobs with partitioning, caching, shuffle optimization, and memory tuning. Use when improving Spark performance, debugging slow jobs, or scaling data processing pipelines.
Overall score: 77
Does it follow best practices? 71%
Impact: 77% (1.28x average score across 3 eval scenarios)
Passed; no known issues

Optimize this skill with Tessl:
npx tessl skill review --optimize ./tests/ext_conformance/artifacts/agents-wshobson/data-engineering/skills/spark-optimization/SKILL.md

Quality
Discovery
100%. Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is a well-crafted skill description that concisely covers specific Spark optimization techniques, includes natural trigger terms users would employ, and clearly delineates both what the skill does and when to use it. It uses proper third-person voice and is distinct enough to avoid conflicts with other data-related skills.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific concrete actions: partitioning, caching, shuffle optimization, and memory tuning. These are distinct, well-defined Spark optimization techniques. | 3 / 3 |
| Completeness | Clearly answers both 'what' (optimize Spark jobs with partitioning, caching, shuffle optimization, memory tuning) and 'when' (improving Spark performance, debugging slow jobs, scaling data processing pipelines) with an explicit 'Use when' clause. | 3 / 3 |
| Trigger Term Quality | Includes strong natural keywords users would say: 'Spark', 'partitioning', 'caching', 'shuffle', 'memory tuning', 'performance', 'slow jobs', 'data processing pipelines'. These cover common terms a user would use when seeking Spark optimization help. | 3 / 3 |
| Distinctiveness / Conflict Risk | Clearly scoped to Apache Spark optimization specifically, with distinct triggers like 'Spark', 'shuffle optimization', and 'partitioning' that are unlikely to conflict with general data engineering or other processing skills. | 3 / 3 |
| Total | | 12 / 12 Passed |
Implementation
42%. Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
The skill excels at actionability with comprehensive, executable code examples covering all major Spark optimization areas. However, it is far too verbose for a SKILL.md—it explains concepts Claude already knows, includes everything inline rather than using progressive disclosure, and lacks an integrated diagnostic workflow with validation checkpoints. This would benefit greatly from being restructured as a concise overview pointing to detailed pattern files.
Suggestions
Remove the 'Core Concepts' section (execution model diagram, key performance factors table)—Claude already knows these. Cut the 'When to Use This Skill' list as it restates the description.
Extract Patterns 1-7 into separate referenced files (e.g., JOINS.md, CACHING.md, MEMORY.md) and keep only the Quick Start and Configuration Cheat Sheet inline with brief pattern summaries linking out.
Add a diagnostic workflow at the top: 'Check Spark UI → identify bottleneck type → apply relevant pattern → verify improvement via metrics', with explicit validation steps between optimization attempts.
Remove inline comments that restate obvious things (e.g., '# Cache in memory (MEMORY_AND_DISK is default)', '# Storage levels explained:' list) to reduce token count significantly.
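To ground the skew-mitigation suggestion above: the review mentions a salt_join helper in the skill. The core idea can be illustrated without a Spark cluster; this is a minimal pure-Python sketch with hypothetical names (salt_key, explode_keys, num_salts are illustrative, not the skill's actual API), showing how one hot key is fanned out across several sub-keys while the small side is replicated to match:

```python
import random

def salt_key(key, num_salts):
    # Append a random salt so a single hot key fans out
    # across num_salts distinct join keys (and partitions).
    return f"{key}_{random.randrange(num_salts)}"

def explode_keys(key, num_salts):
    # The small (dimension) side is replicated once per salt
    # value so every salted fact-side key still finds a match.
    return [f"{key}_{i}" for i in range(num_salts)]

num_salts = 4
fact_keys = [salt_key("hot_customer", num_salts) for _ in range(1000)]
dim_keys = set(explode_keys("hot_customer", num_salts))

# Every salted fact key joins against the replicated dimension keys.
assert all(k in dim_keys for k in fact_keys)
```

In real PySpark the same pattern is applied column-wise (add a salt column to both sides, join on the composite key, then drop it); the sketch only demonstrates why the replication step is required for correctness.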
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is extremely verbose at ~350+ lines. It explains basic Spark concepts Claude already knows (execution model, what shuffles are, storage level definitions), includes redundant tables of concepts, and has extensive inline comments that restate obvious things. The 'When to Use This Skill' section and 'Core Concepts' section add little value. Much of this could be cut by 50%+ while preserving all actionable content. | 1 / 3 |
| Actionability | The skill provides fully executable Python code throughout: complete SparkSession configurations, join patterns, caching examples, partitioning strategies, and monitoring functions. Code is copy-paste ready with concrete examples including specific config values, function implementations like salt_join and check_partition_skew, and real spark-submit parameters. | 3 / 3 |
| Workflow Clarity | Individual patterns are clear and well-sequenced (e.g., cache → materialize → use → unpersist), but there's no overarching workflow for diagnosing and fixing Spark performance issues. The monitoring/debugging pattern (Pattern 7) comes last rather than being integrated as validation checkpoints. There are no explicit feedback loops like 'check Spark UI → identify bottleneck → apply pattern → verify improvement'. | 2 / 3 |
| Progressive Disclosure | This is a monolithic wall of text with all content inline: 7 detailed patterns, a configuration cheat sheet, best practices, and resources all in one file. The patterns for join optimization, caching, memory tuning, shuffle optimization, and data formats could each be separate referenced files. The external links at the bottom are to general Spark docs rather than skill-specific detailed materials. | 1 / 3 |
| Total | | 7 / 12 Passed |
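The review notes that the skill ships real spark-submit parameters and a configuration cheat sheet. For context, such settings typically look like the following; this is a hedged illustration with placeholder values (the resource numbers are arbitrary, not recommendations, and the right values depend on cluster size and workload), using standard Spark configuration keys:

```shell
# Illustrative spark-submit invocation; resource values are placeholders.
spark-submit \
  --executor-memory 8g \
  --executor-cores 4 \
  --num-executors 10 \
  --conf spark.sql.shuffle.partitions=200 \
  --conf spark.sql.adaptive.enabled=true \
  --conf spark.memory.fraction=0.6 \
  job.py
```

Keeping a compact cheat sheet of flags like these inline, while moving the per-pattern explanations into referenced files, is consistent with the progressive-disclosure suggestion above.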
Validation
100%. Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 11 / 11 Passed
Validation for skill structure
No warnings or errors.
If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.