Use when writing Spark jobs, debugging performance issues, or configuring cluster settings for Apache Spark applications, distributed data processing pipelines, or big data workloads. Invoke to write DataFrame transformations, optimize Spark SQL queries, implement RDD pipelines, tune shuffle operations, configure executor memory, process .parquet files, handle data partitioning, or build structured streaming analytics.
Evaluation summary

- Overall score: 94 (1.05x average score across 6 eval scenarios)
- Does it follow best practices? 100%
- Impact: 89%
- Status: Passed, no known issues
Schema definition and built-in function usage

| Criterion | Baseline | With skill |
| --- | --- | --- |
| Explicit schema defined | 0% | 100% |
| Schema applied on read | 0% | 100% |
| No Python UDFs | 100% | 100% |
| Built-in functions used | 100% | 100% |
| No collect() on full data | 100% | 100% |
| Partitioned write by region | 100% | 100% |
| Coalesce or repartition after filter | 0% | 30% |
| Filter applied before aggregation | 100% | 100% |
| Column selection before aggregation | 0% | 0% |
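The schema and built-in-function criteria above can be sketched in PySpark as follows; the paths and column names are illustrative assumptions, not part of the eval:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("schema-example").getOrCreate()

# Explicit schema: avoids a costly inferSchema pass and catches drift early.
schema = StructType([
    StructField("region", StringType(), nullable=False),
    StructField("amount", DoubleType(), nullable=True),
    StructField("ts", TimestampType(), nullable=True),
])

# Schema applied on read (hypothetical path).
df = spark.read.schema(schema).csv("s3://bucket/sales/")

# Built-in functions instead of a Python UDF; filter and select columns
# before the aggregation so less data is shuffled.
result = (
    df.filter(F.col("amount") > 0)
      .select("region", "amount")
      .groupBy("region")
      .agg(F.sum("amount").alias("total_amount"))
)

# Partitioned write; coalesce keeps the post-filter file count reasonable.
(result.coalesce(8)
       .write.partitionBy("region")
       .mode("overwrite")
       .parquet("s3://bucket/out/sales_by_region/"))
```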
Broadcast joins and data skew handling

| Criterion | Baseline | With skill |
| --- | --- | --- |
| Broadcast join used | 100% | 100% |
| Broadcast on small table | 100% | 100% |
| Skew addressed | 100% | 100% |
| AQE enabled | 100% | 100% |
| Shuffle partitions tuned | 100% | 100% |
| Skew documented | 100% | 100% |
| Join strategy documented | 100% | 100% |
| Filter/select before join | 50% | 50% |
| No collect on large data | 100% | 100% |
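A minimal sketch of the broadcast-join and skew criteria, assuming Spark 3.x for adaptive query execution (AQE); the table names and partition count are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("skew-example").getOrCreate()

# AQE splits skewed shuffle partitions automatically on Spark 3.x.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.shuffle.partitions", "200")  # tune to shuffle volume

txns = spark.read.parquet("s3://bucket/transactions/")    # large fact table
merchants = spark.read.parquet("s3://bucket/merchants/")  # small dimension table

# Filter/select before the join, then broadcast the small side so the
# large table is never shuffled (a skewed key cannot stall a reducer).
result = (
    txns.filter(F.col("status") == "settled")
        .select("merchant_id", "amount")
        .join(F.broadcast(merchants), "merchant_id")
)
```

Broadcasting only works when the small side fits in executor memory; for two large skewed tables, key salting or AQE's skew-join handling is the usual fallback.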
Structured streaming with watermarks and checkpointing

| Criterion | Baseline | With skill |
| --- | --- | --- |
| Watermark defined | 100% | 100% |
| Watermark duration matches late data | 100% | 100% |
| Checkpoint location set | 100% | 100% |
| foreachBatch used | 0% | 0% |
| Correct output mode | 100% | 100% |
| Explicit schema on read | 100% | 100% |
| Event time window used | 100% | 100% |
| Late data handling documented | 100% | 100% |
| Fault tolerance documented | 100% | 100% |
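The streaming criteria above can be sketched as follows; the source path, checkpoint location, and window/watermark durations are assumptions for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("streaming-example").getOrCreate()

# Explicit schema on read: streaming sources cannot safely infer schemas.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

events = spark.readStream.schema(schema).json("s3://bucket/events/")

# The watermark bounds state for late data (here: accept up to 10 minutes
# of lateness); the window aggregates on event time, not processing time.
windowed = (
    events.withWatermark("event_time", "10 minutes")
          .groupBy(F.window("event_time", "5 minutes"))
          .agg(F.sum("amount").alias("total"))
)

# Append mode emits each window once its watermark passes; the checkpoint
# location gives exactly-once recovery after a restart.
query = (
    windowed.writeStream
            .outputMode("append")
            .option("checkpointLocation", "s3://bucket/checkpoints/agg/")
            .format("parquet")
            .option("path", "s3://bucket/out/windowed_totals/")
            .start()
)
```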
Caching strategy and multi-pass analytics

| Criterion | Baseline | With skill |
| --- | --- | --- |
| Cache called on intermediate df | 100% | 100% |
| Materialized with count() | 100% | 100% |
| Unpersist called after use | 100% | 100% |
| Cached df reused for all reports | 100% | 100% |
| Output DataFrames not cached | 100% | 100% |
| Partition count printed | 100% | 100% |
| Explicit schema defined | 0% | 100% |
| No collect() on data | 100% | 100% |
| Caching rationale in notes | 100% | 100% |
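A sketch of the caching pattern these criteria describe, assuming a parquet input with `region`, `amount`, and `ts` columns (illustrative names):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cache-example").getOrCreate()

base = spark.read.parquet("s3://bucket/sales/")  # hypothetical path

# Cache the intermediate frame every report reuses, then materialize it
# eagerly with count() so the expensive work happens exactly once.
enriched = base.filter(F.col("amount") > 0).withColumn("year", F.year("ts"))
enriched.cache()
print(f"rows={enriched.count()}, partitions={enriched.rdd.getNumPartitions()}")

# Each pass reuses the cached data; the output frames themselves are
# written once and never re-read, so caching them would waste memory.
by_region = enriched.groupBy("region").agg(F.sum("amount").alias("total"))
by_year = enriched.groupBy("year").agg(F.avg("amount").alias("avg_amount"))
by_region.write.mode("overwrite").parquet("s3://bucket/reports/by_region/")
by_year.write.mode("overwrite").parquet("s3://bucket/reports/by_year/")

# Release executor memory once the multi-pass section is done.
enriched.unpersist()
```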
Production configuration and output documentation

| Criterion | Baseline | With skill |
| --- | --- | --- |
| Configuration recommendations present | 100% | 100% |
| 5 cores per executor | 100% | 100% |
| memoryOverhead configured | 100% | 100% |
| Partitioning strategy explained | 100% | 100% |
| Performance analysis present | 100% | 100% |
| Monitoring recommendations present | 100% | 100% |
| Broadcast join for merchant table | 100% | 100% |
| AQE enabled in script | 100% | 100% |
| Shuffle partitions tuned | 100% | 100% |
| No collect() on large data | 100% | 100% |
| Explicit schema or parquet read | 100% | 100% |
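The configuration criteria above might look like the following session setup; the memory figures are a common starting point under assumed cluster sizing, not a universal rule:

```python
from pyspark.sql import SparkSession

# ~5 cores per executor is a widely used balance between parallelism and
# HDFS/S3 I/O throughput; memoryOverhead covers off-heap allocations
# (shuffle buffers, Python workers) beyond the JVM heap.
spark = (
    SparkSession.builder
        .appName("prod-job")
        .config("spark.executor.cores", "5")
        .config("spark.executor.memory", "16g")
        .config("spark.executor.memoryOverhead", "2g")
        .config("spark.sql.adaptive.enabled", "true")
        .config("spark.sql.shuffle.partitions", "400")  # tune to shuffle volume
        .getOrCreate()
)
```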
RDD operations and small file compaction

| Criterion | Baseline | With skill |
| --- | --- | --- |
| RDD API used for log parsing | 66% | 0% |
| reduceByKey used (not groupByKey) | 0% | 0% |
| No groupByKey calls | 100% | 100% |
| Coalesce on output | 100% | 100% |
| No collect() on full data | 100% | 100% |
| Small file rationale in notes | 100% | 100% |
| Aggregation approach explained | 100% | 100% |
| Correct top-N extraction | 100% | 100% |
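A sketch of the RDD-based pattern these criteria check, assuming space-delimited logs keyed on their first field (an illustrative layout):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-example").getOrCreate()
sc = spark.sparkContext

# reduceByKey combines values map-side before the shuffle; groupByKey would
# ship every record for a key to one reducer and can blow up on hot keys.
counts = (
    sc.textFile("s3://bucket/logs/")              # hypothetical path
      .map(lambda line: (line.split(" ")[0], 1))  # key on the first field
      .reduceByKey(lambda a, b: a + b)
)

# Top-N extraction without collecting the full dataset to the driver.
top10 = counts.takeOrdered(10, key=lambda kv: -kv[1])

# Compact many small output files into a few before writing.
counts.coalesce(4).saveAsTextFile("s3://bucket/out/key_counts/")
```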