Apache Spark Core provides the fundamental distributed computing infrastructure for large-scale data processing with RDDs, SparkContext, and cluster management
```shell
npx @tessl/cli install tessl/maven-org-apache-spark--spark-core_2-13@3.5.0
```

Apache Spark Core provides the fundamental distributed computing infrastructure for large-scale data processing. It implements the core Spark execution model with Resilient Distributed Datasets (RDDs) as the primary abstraction for fault-tolerant distributed collections, SparkContext as the main entry point, and a sophisticated task scheduler for efficient cluster computation.
Maven:

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.13</artifactId>
  <version>3.5.6</version>
</dependency>
```

SBT:

```scala
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.6"
```

Core imports:

```scala
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.rdd.RDD
```

For broadcast variables and accumulators:

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.util.AccumulatorV2
```

Basic usage:

```scala
import org.apache.spark.{SparkContext, SparkConf}

// Create Spark configuration
val conf = new SparkConf()
  .setAppName("MySparkApp")
  .setMaster("local[*]")
// Initialize SparkContext
val sc = new SparkContext(conf)
// Create RDD from collection
val data = sc.parallelize(Seq(1, 2, 3, 4, 5))
// Transform and collect results
val doubled = data.map(_ * 2).filter(_ > 4)
val result = doubled.collect()
// Clean up
sc.stop()
```

Spark Core implements a distributed computing framework built around several key components. This architecture enables fault-tolerant distributed computing with automatic recovery, lineage tracking, and optimized data locality.
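
For example, an RDD's recorded lineage can be inspected with `toDebugString`; the plan it prints is what Spark replays to recompute lost partitions. A minimal sketch, assuming an active local `SparkContext` named `sc` as in the example above:

```scala
// Each transformation records its parent RDD, forming a lineage graph
val words  = sc.parallelize(Seq("spark", "core", "rdd", "spark"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

// Print the lineage Spark tracks for fault recovery
println(counts.toDebugString)
```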
The main entry point for Spark applications, providing methods to create RDDs, manage configuration, and control cluster resources.

```scala
class SparkContext(config: SparkConf)

class SparkConf(loadDefaults: Boolean = true)

// Key SparkContext methods
def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T]
def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
def broadcast[T: ClassTag](value: T): Broadcast[T]
def longAccumulator(name: String): LongAccumulator
def stop(): Unit
```

SparkContext and Configuration
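
A minimal sketch of configuring and using a context; the master URL, serializer setting, and `data.txt` path are illustrative placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ContextExample")
  .setMaster("local[2]") // placeholder master URL
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val sc = new SparkContext(conf)

// Read a text file into an RDD of lines (hypothetical path)
val lines = sc.textFile("data.txt")
println(s"partitions = ${lines.getNumPartitions}, default parallelism = ${sc.defaultParallelism}")

sc.stop()
```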
Core distributed dataset operations including transformations (lazy) and actions (eager execution).

```scala
abstract class RDD[T: ClassTag]

// Core transformations
def map[U: ClassTag](f: T => U): RDD[U]
def filter(f: T => Boolean): RDD[T]
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]
def union(other: RDD[T]): RDD[T]
def distinct(): RDD[T]
// Core actions
def collect(): Array[T]
def count(): Long
def reduce(f: (T, T) => T): T
def foreach(f: T => Unit): Unit
```
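
A short sketch contrasting lazy transformations with eager actions, assuming an active `SparkContext` named `sc`:

```scala
val sentences = sc.parallelize(Seq("spark core rdd", "spark scheduler"))

// Transformations only build a lazy plan; nothing executes yet
val words  = sentences.flatMap(_.split(" "))
val unique = words.distinct()

// Actions trigger distributed execution and return results to the driver
println(unique.count())                    // 4 distinct words
println(words.map(_.length).reduce(_ + _)) // total characters across all words
```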

Specialized operations available on RDDs of key-value pairs, including joins, grouping, and aggregations.

```scala
// Available on RDD[(K, V)] via implicit conversion
def groupByKey(): RDD[(K, Iterable[V])]
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]
def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)]
```
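
A sketch of per-key aggregation followed by a join, assuming an active `SparkContext` named `sc`:

```scala
val sales  = sc.parallelize(Seq(("apples", 3), ("pears", 2), ("apples", 5)))
val prices = sc.parallelize(Seq(("apples", 1.2), ("pears", 0.8)))

// Aggregate quantities per key, then join with the price table
val totals = sales.reduceByKey(_ + _) // ("apples", 8), ("pears", 2)
val joined = totals.join(prices)      // ("apples", (8, 1.2)), ("pears", (2, 0.8))

joined.collect().foreach { case (fruit, (qty, price)) =>
  println(s"$fruit: ${qty * price}")
}
```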

Shared variables for efficient data distribution and accumulation across cluster nodes.

```scala
abstract class Broadcast[T: ClassTag]
  def value: T
  def unpersist(): Unit

abstract class AccumulatorV2[IN, OUT]
  def add(v: IN): Unit
  def value: OUT
  def reset(): Unit
```

Broadcast Variables and Accumulators
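
A sketch combining a read-only broadcast lookup table with a `LongAccumulator`, assuming an active `SparkContext` named `sc`; the country-code data is illustrative:

```scala
// Broadcast: ship a read-only lookup table to each executor once
val countryNames = sc.broadcast(Map("DE" -> "Germany", "FR" -> "France"))

// Accumulator: count unknown codes across all tasks
val unknown = sc.longAccumulator("unknownCodes")

val codes = sc.parallelize(Seq("DE", "FR", "XX", "DE"))
val named = codes.map { code =>
  countryNames.value.get(code) match {
    case Some(name) => name
    case None       => unknown.add(1); "unknown"
  }
}

named.collect() // the action runs the tasks and updates the accumulator
println(s"unknown codes: ${unknown.value}") // 1
```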
Input/output operations for various data sources and RDD caching strategies.

```scala
// Input operations
def textFile(path: String): RDD[String]
def sequenceFile[K, V](path: String): RDD[(K, V)]
def hadoopRDD[K, V](conf: JobConf, inputFormat: Class[_ <: InputFormat[K, V]]): RDD[(K, V)]

// Persistence
def persist(newLevel: StorageLevel): RDD[T]
def cache(): RDD[T]
def unpersist(): RDD[T]
```
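
A persistence sketch; the glob path is a placeholder and an active `SparkContext` named `sc` is assumed:

```scala
import org.apache.spark.storage.StorageLevel

// Read once, then reuse across several actions by persisting the filtered RDD
val logs   = sc.textFile("logs/*.txt") // hypothetical path
val errors = logs.filter(_.contains("ERROR")).persist(StorageLevel.MEMORY_AND_DISK)

println(errors.count()) // first action materializes and caches the partitions
println(errors.first()) // served from the persisted copy

errors.unpersist() // release storage when no longer needed
```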

```scala
// Core configuration
class SparkConf(loadDefaults: Boolean = true) {
  def set(key: String, value: String): SparkConf
  def setAppName(name: String): SparkConf
  def setMaster(master: String): SparkConf
  def get(key: String): String
}

// Storage levels
object StorageLevel {
  val NONE: StorageLevel
  val MEMORY_ONLY: StorageLevel
  val MEMORY_AND_DISK: StorageLevel
  val MEMORY_ONLY_SER: StorageLevel
  val DISK_ONLY: StorageLevel
}

// Partitioning
abstract class Partitioner {
  def numPartitions: Int
  def getPartition(key: Any): Int
}

class HashPartitioner(partitions: Int) extends Partitioner

// Binary data representation
class PortableDataStream(
  isDirectory: Boolean,
  path: String,
  length: Long,
  modificationTime: Long
) {
  def open(): DataInputStream
  def toArray(): Array[Byte]
}
```
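
A sketch of explicit hash partitioning so that co-partitioned pair RDDs can be joined without re-shuffling both sides; the partition count is arbitrary and an active `SparkContext` named `sc` is assumed:

```scala
import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(4)

// Pre-partition both sides by key so the join can reuse the layout
val left  = sc.parallelize(Seq(("a", 1), ("b", 2))).partitionBy(partitioner)
val right = sc.parallelize(Seq(("a", "x"), ("b", "y"))).partitionBy(partitioner)

println(left.partitioner) // Some(org.apache.spark.HashPartitioner@...)
left.join(right).collect().foreach(println)
```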