Apache Spark Core provides distributed computing capabilities including RDD abstractions, task scheduling, memory management, and fault recovery.
npx @tessl/cli install tessl/maven-org-apache-spark--spark-core_2-12@3.5.0

Apache Spark Core is the foundational module of the Apache Spark unified analytics engine for large-scale data processing. It provides fundamental distributed computing capabilities including RDD (Resilient Distributed Dataset) abstractions, task scheduling, memory management, fault recovery, and storage system interactions. Spark Core is the base layer for all other Spark components, offering high-performance distributed computing primitives with automatic fault tolerance.
Add the dependency to your pom.xml or build.sbt.

Maven:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.12</artifactId>
  <version>3.5.6</version>
</dependency>

SBT:

libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.6"

Scala:
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.util.{AccumulatorV2, LongAccumulator, DoubleAccumulator, CollectionAccumulator}
import org.apache.spark.{Dependency, Partition, Partitioner}
import org.apache.spark.TaskContext
import org.apache.spark.input.PortableDataStream
import scala.reflect.ClassTag

Java:
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.SparkConf;

Scala example:

import org.apache.spark.{SparkContext, SparkConf}
// Create Spark configuration and context
val conf = new SparkConf()
.setAppName("My Spark Application")
.setMaster("local[*]")
val sc = new SparkContext(conf)
// Create RDD from local collection
val data = sc.parallelize(1 to 10000)
// Transform and compute
val squares = data.map(x => x * x)
val sum = squares.reduce(_ + _)
println(s"Sum of squares: $sum")
// Clean up
sc.stop()

Java example:
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.SparkConf;

import java.util.Arrays;
import java.util.List;
SparkConf conf = new SparkConf()
.setAppName("My Java Spark Application")
.setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);
// Create RDD from list
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> rdd = sc.parallelize(data);
// Transform and collect
JavaRDD<Integer> squares = rdd.map(x -> x * x);
List<Integer> result = squares.collect();
sc.stop();

Apache Spark Core is built around several fundamental components:
Core entry points for creating and configuring Spark applications. SparkContext serves as the main API for creating RDDs and managing application lifecycle.
class SparkContext(config: SparkConf)
class SparkConf(loadDefaults: Boolean = true)
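Beyond the quick start above, configuration values can be set, defaulted, and read back through SparkConf. A minimal sketch (the specific keys and values are illustrative, not required settings):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative settings; any standard Spark configuration key can be set this way.
val conf = new SparkConf()
  .setAppName("config-example")
  .setMaster("local[2]")
  .set("spark.executor.memory", "2g")
  .setIfMissing("spark.ui.enabled", "false") // applied only if not already set

val sc = new SparkContext(conf)
println(sc.getConf.get("spark.executor.memory")) // read back the effective configuration
sc.stop()
```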
Fundamental distributed dataset operations including transformations (map, filter, join) and actions (collect, reduce, save). RDDs provide the core abstraction for distributed data processing with automatic fault tolerance.

abstract class RDD[T: ClassTag]
def map[U: ClassTag](f: T => U): RDD[U]
def filter(f: T => Boolean): RDD[T]
def collect(): Array[T]
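A minimal sketch of the lazy transformation/action model, assuming an existing SparkContext `sc` such as the one created in the quick start (the data is illustrative):

```scala
// Assumes an existing SparkContext `sc` (see the quick start above).
val nums = sc.parallelize(1 to 100, numSlices = 4)

// Transformations are lazy: they only record a lineage of RDDs.
val evens   = nums.filter(_ % 2 == 0)
val doubled = evens.map(_ * 2)

// Actions trigger execution of the whole lineage.
val total  = doubled.reduce(_ + _)
val sample = doubled.take(5)

// Pair operations such as join become available on RDDs of (key, value) tuples.
val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
val scores = sc.parallelize(Seq((1, 90), (2, 75)))
val joined = users.join(scores).collect() // Array((1,(alice,90)), (2,(bob,75)))
```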
Complete Java API wrappers providing Java-friendly interfaces for all Spark Core functionality. Enables seamless usage from Java applications while maintaining full feature compatibility.

public class JavaSparkContext
public class JavaRDD<T>
public class JavaPairRDD<K, V>

Persistence mechanisms for RDDs including memory, disk, and off-heap storage options. Provides fine-grained control over data persistence with various storage levels and replication strategies.
class StorageLevel(useDisk: Boolean, useMemory: Boolean, useOffHeap: Boolean, deserialized: Boolean, replication: Int)
def persist(newLevel: StorageLevel): RDD[T]
def cache(): RDD[T]
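A short sketch of persistence control, again assuming an existing SparkContext `sc` and illustrative data:

```scala
import org.apache.spark.storage.StorageLevel

// Assumes an existing SparkContext `sc`.
val parsed = sc.parallelize(1 to 1000000).map(x => (x % 10, x))

// Keep partitions in memory and spill to disk when they do not fit.
parsed.persist(StorageLevel.MEMORY_AND_DISK)

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
val cached = parsed.values.cache()

parsed.count()     // the first action materializes and persists the partitions
parsed.count()     // later actions reuse the persisted data instead of recomputing it

parsed.unpersist() // release the storage when the RDD is no longer needed
```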
Broadcast variables and accumulators for efficient data sharing across distributed computations. Broadcast variables enable read-only sharing of large datasets while accumulators provide write-only variables for aggregations.

abstract class Broadcast[T]
abstract class AccumulatorV2[IN, OUT]
class LongAccumulator extends AccumulatorV2[java.lang.Long, java.lang.Long]
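A sketch showing both mechanisms together, assuming an existing SparkContext `sc`; the lookup table and codes are illustrative:

```scala
// Assumes an existing SparkContext `sc`.
// Broadcast a read-only lookup table once per executor instead of once per task.
val countryNames = sc.broadcast(Map("DE" -> "Germany", "FR" -> "France"))

// Accumulators aggregate values from tasks back to the driver.
val unknownCodes = sc.longAccumulator("unknown country codes")

val codes = sc.parallelize(Seq("DE", "FR", "XX"))
val resolved = codes.map { code =>
  countryNames.value.getOrElse(code, { unknownCodes.add(1); "unknown" })
}

resolved.collect()
println(unknownCodes.value) // accumulator values are only reliable after an action has run
```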
Runtime context and utilities available to running tasks including partition information, memory management, and lifecycle hooks.

abstract class TaskContext
def partitionId(): Int
def stageId(): Int
def taskAttemptId(): Long
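A minimal sketch of reading task metadata inside a closure, assuming an existing SparkContext `sc`:

```scala
import org.apache.spark.TaskContext

// Assumes an existing SparkContext `sc`.
val data = sc.parallelize(1 to 8, numSlices = 4)

val tagged = data.mapPartitions { iter =>
  // TaskContext.get() returns the context of the task running this closure.
  val ctx = TaskContext.get()
  // Completion listeners are a lifecycle hook, e.g. for closing per-task resources.
  ctx.addTaskCompletionListener[Unit](_ => ())
  iter.map(x => (ctx.stageId(), ctx.partitionId(), x))
}

tagged.collect().foreach(println)
```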
Modern resource allocation and management including resource profiles, executor resource requests, and task resource requirements for heterogeneous workloads.

class ResourceProfile(executorResources: Map[String, ExecutorResourceRequest], taskResources: Map[String, TaskResourceRequest])
class ResourceProfileBuilder()
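A sketch of stage-level resource requests, assuming an existing SparkContext `sc` and a cluster manager that supports resource profiles (e.g. YARN or Kubernetes with dynamic allocation); the amounts and the "gpu" resource name are illustrative:

```scala
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Assumes an existing SparkContext `sc`; resource amounts below are illustrative.
val executorReqs = new ExecutorResourceRequests()
  .cores(4)
  .memory("8g")
  .resource("gpu", 1)

val taskReqs = new TaskResourceRequests()
  .cpus(1)
  .resource("gpu", 1)

val profile = new ResourceProfileBuilder()
  .require(executorReqs)
  .require(taskReqs)
  .build()

// Stages computed from this RDD are scheduled with the requested resources.
val scored = sc.parallelize(1 to 100).map(_ * 2).withResources(profile)
```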
Pluggable serialization system supporting Java serialization and Kryo for optimized network communication and storage. Provides extensible serialization with performance tuning options.

abstract class SerializerInstance
class JavaSerializer(conf: SparkConf) extends Serializer
class KryoSerializer(conf: SparkConf) extends Serializer
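A sketch of enabling Kryo through configuration; MyRecord is defined here purely for illustration:

```scala
import org.apache.spark.SparkConf

// Application class defined here only to have something to register.
case class MyRecord(id: Long, name: String)

// Kryo is usually faster and more compact than the default Java serializer;
// registering classes avoids embedding full class names in the serialized data.
val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord]))
```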