
# spark-engineer

Use when writing Spark jobs, debugging performance issues, or configuring cluster settings for Apache Spark applications, distributed data processing pipelines, or big data workloads. Invoke to write DataFrame transformations, optimize Spark SQL queries, implement RDD pipelines, tune shuffle operations, configure executor memory, process .parquet files, handle data partitioning, or build structured streaming analytics.

**Score: 94**

- Quality: 100% (does it follow best practices?)
- Impact: 89% (1.05x), average score across 6 eval scenarios
- Security (by Snyk): Passed, no known issues

## Evaluation results

### Customer Transaction ETL Pipeline (score 88%, +28% with context)

Schema definition and built-in function usage

| Criteria | Without context | With context |
| --- | --- | --- |
| Explicit schema defined | 0% | 100% |
| Schema applied on read | 0% | 100% |
| No Python UDFs | 100% | 100% |
| Built-in functions used | 100% | 100% |
| No collect() on full data | 100% | 100% |
| Partitioned write by region | 100% | 100% |
| Coalesce or repartition after filter | 0% | 30% |
| Filter applied before aggregation | 100% | 100% |
| Column selection before aggregation | 0% | 0% |

### Product Category Revenue Analysis (95%)

Broadcast joins and data skew handling

| Criteria | Without context | With context |
| --- | --- | --- |
| Broadcast join used | 100% | 100% |
| Broadcast on small table | 100% | 100% |
| Skew addressed | 100% | 100% |
| AQE enabled | 100% | 100% |
| Shuffle partitions tuned | 100% | 100% |
| Skew documented | 100% | 100% |
| Join strategy documented | 100% | 100% |
| Filter/select before join | 50% | 50% |
| No collect on large data | 100% | 100% |

### Real-Time User Activity Aggregation Pipeline (85%)

Structured streaming with watermarks and checkpointing

| Criteria | Without context | With context |
| --- | --- | --- |
| Watermark defined | 100% | 100% |
| Watermark duration matches late data | 100% | 100% |
| Checkpoint location set | 100% | 100% |
| foreachBatch used | 0% | 0% |
| Correct output mode | 100% | 100% |
| Explicit schema on read | 100% | 100% |
| Event time window used | 100% | 100% |
| Late data handling documented | 100% | 100% |
| Fault tolerance documented | 100% | 100% |

### Customer Retention Analytics Pipeline (score 100%, +10% with context)

Caching strategy and multi-pass analytics

| Criteria | Without context | With context |
| --- | --- | --- |
| Cache called on intermediate df | 100% | 100% |
| Materialized with count() | 100% | 100% |
| Unpersist called after use | 100% | 100% |
| Cached df reused for all reports | 100% | 100% |
| Output DataFrames not cached | 100% | 100% |
| Partition count printed | 100% | 100% |
| Explicit schema defined | 0% | 100% |
| No collect() on data | 100% | 100% |
| Caching rationale in notes | 100% | 100% |

### Production ETL Deployment Package (100%)

Production configuration and output documentation

| Criteria | Without context | With context |
| --- | --- | --- |
| Configuration recommendations present | 100% | 100% |
| 5 cores per executor | 100% | 100% |
| memoryOverhead configured | 100% | 100% |
| Partitioning strategy explained | 100% | 100% |
| Performance analysis present | 100% | 100% |
| Monitoring recommendations present | 100% | 100% |
| Broadcast join for merchant table | 100% | 100% |
| AQE enabled in script | 100% | 100% |
| Shuffle partitions tuned | 100% | 100% |
| No collect() on large data | 100% | 100% |
| Explicit schema or parquet read | 100% | 100% |
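A configuration sketch of the submit-time settings these criteria mention; the executor count, memory sizes, partition count, and script name are hypothetical and would be sized to the actual cluster and data volume.

```shell
# Sizing rationale: ~5 cores per executor balances I/O throughput against GC
# pressure; memoryOverhead reserves off-heap headroom for shuffle buffers and
# Python workers on top of the JVM heap.
spark-submit \
  --deploy-mode cluster \
  --num-executors 20 \
  --executor-cores 5 \
  --executor-memory 16g \
  --conf spark.executor.memoryOverhead=4g \
  --conf spark.sql.adaptive.enabled=true \
  --conf spark.sql.shuffle.partitions=400 \
  etl_job.py
```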

### Web Server Log Frequency Analysis (score 68%, -8% with context)

RDD operations and small file compaction

| Criteria | Without context | With context |
| --- | --- | --- |
| RDD API used for log parsing | 66% | 0% |
| reduceByKey used (not groupByKey) | 0% | 0% |
| No groupByKey calls | 100% | 100% |
| Coalesce on output | 100% | 100% |
| No collect() on full data | 100% | 100% |
| Small file rationale in notes | 100% | 100% |
| Aggregation approach explained | 100% | 100% |
| Correct top-N extraction | 100% | 100% |

Evaluated on:

- Repository: jeffallan/claude-skills
- Agent: Claude Code
- Model: Claude Sonnet 4.6

