Apache Spark Core provides distributed computing capabilities with RDDs, task scheduling, and cluster management for big data processing
npx @tessl/cli install tessl/maven-org-apache-spark--spark-core@2.2.0

Apache Spark Core is the foundational engine for large-scale distributed data processing. It provides resilient distributed datasets (RDDs), in-memory computing capabilities, and a unified execution engine for batch and interactive data processing across clusters.
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.2.3</version>
</dependency>

Scala:
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.util.{LongAccumulator, DoubleAccumulator, AccumulatorV2}

Java:
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.SparkConf;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.util.LongAccumulator;

import org.apache.spark.{SparkContext, SparkConf}
// Create Spark configuration
val conf = new SparkConf()
  .setAppName("My Spark Application")
  .setMaster("local[*]")

// Create Spark context
val sc = new SparkContext(conf)

// Create RDD from collection
val data = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Transform and compute
val result = data
  .filter(_ % 2 == 0)
  .map(_ * 2)
  .collect()

// Clean up
sc.stop()

Apache Spark Core is built around several key abstractions:
The SparkContext serves as the primary entry point for all Spark functionality, providing methods for creating RDDs, managing cluster resources, and configuring applications.
class SparkContext(config: SparkConf) {
  def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T]
  def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
  def stop(): Unit
}

class SparkConf(loadDefaults: Boolean = true) {
  def setAppName(name: String): SparkConf
  def setMaster(master: String): SparkConf
  def set(key: String, value: String): SparkConf
}

Core SparkContext and Configuration
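As a short illustration of these entry points, the sketch below builds a configuration with one extra setting and counts the lines of a text file. The input path and the spark.executor.memory value are placeholders, not recommendations.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val conf = new SparkConf()
  .setAppName("Line Counter")
  .setMaster("local[2]")                 // two local threads; use a cluster URL in production
  .set("spark.executor.memory", "1g")    // arbitrary example setting
val sc = new SparkContext(conf)

// textFile returns an RDD[String] with one element per line of the file
val lines: RDD[String] = sc.textFile("data/input.txt", minPartitions = 4)
println(s"Line count: ${lines.count()}")

sc.stop()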
RDDs provide the core abstraction for distributed data processing with lazy transformations and eager actions that enable fault-tolerant computation across cluster nodes.
abstract class RDD[T: ClassTag] {
  def map[U: ClassTag](f: T => U): RDD[U]
  def filter(f: T => Boolean): RDD[T]
  def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]
  def groupBy[K](f: T => K): RDD[(K, Iterable[T])]
  def collect(): Array[T]
  def count(): Long
  def reduce(f: (T, T) => T): T
}

RDD Operations and Transformations
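The following sketch chains several of the transformations listed above before triggering them with actions; it assumes an existing SparkContext named sc, such as the one from the quick start.

val words = sc.parallelize(Seq("spark", "core", "rdd", "lazy", "action"))

// Transformations are lazy; nothing runs until an action is invoked
val byLength = words
  .filter(_.nonEmpty)
  .map(w => (w, w.length))
  .groupBy { case (_, length) => length }

byLength.collect().foreach(println)                  // action: materializes results on the driver
val totalChars = words.map(_.length).reduce(_ + _)   // action: aggregates across partitions
println(s"Total characters: $totalChars")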
Java-friendly wrappers provide seamless integration with Java applications while maintaining full access to Spark's distributed computing capabilities.
public class JavaSparkContext {
  public JavaSparkContext(SparkConf conf)
  public <T> JavaRDD<T> parallelize(List<T> list)
  public JavaRDD<String> textFile(String path)
  public void close()
}

public class JavaRDD<T> {
  public <R> JavaRDD<R> map(Function<T, R> f)
  public JavaRDD<T> filter(Function<T, Boolean> f)
  public List<T> collect()
}

Storage and persistence mechanisms allow RDDs to be cached in memory or persisted to disk with configurable storage levels for performance optimization.
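As a sketch of that API (reusing the sc from the quick start), an RDD that feeds several actions can be persisted before the first action runs:

import org.apache.spark.storage.StorageLevel

val numbers = sc.parallelize(1 to 1000000)

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
val squares = numbers.map(n => n.toLong * n).persist(StorageLevel.MEMORY_AND_DISK)

println(squares.count())        // first action computes and stores the partitions
println(squares.reduce(_ + _))  // second action reuses the cached partitions

squares.unpersist()             // release the cached partitions when no longer needed

MEMORY_AND_DISK spills partitions that do not fit in memory to disk instead of recomputing them.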
abstract class RDD[T] {
  def persist(newLevel: StorageLevel): this.type
  def cache(): this.type
  def unpersist(blocking: Boolean = true): this.type
}

object StorageLevel {
  val MEMORY_ONLY: StorageLevel
  val MEMORY_AND_DISK: StorageLevel
  val DISK_ONLY: StorageLevel
}

Shared variables enable efficient distribution of read-only data and aggregation of values across distributed computations without expensive network operations.
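The sketch below shows both kinds of shared variable, again assuming an existing SparkContext named sc: a broadcast variable ships a small lookup table to every executor once, while a LongAccumulator counts lookup misses for the driver.

// Broadcast: read-only lookup data shipped to each executor once
val countryNames = sc.broadcast(Map("US" -> "United States", "DE" -> "Germany"))

// Accumulator: written by tasks on executors, read on the driver
val unknownCodes = sc.longAccumulator("unknown country codes")

val codes = sc.parallelize(Seq("US", "DE", "FR", "US"))
val resolved = codes.map { code =>
  countryNames.value.getOrElse(code, {
    unknownCodes.add(1)
    "unknown"
  })
}

resolved.collect()   // run an action so the accumulator is actually updated
println(s"Unknown codes seen: ${unknownCodes.value}")

countryNames.destroy()   // drop the broadcast data on the driver and executors

Note that accumulator updates made inside transformations can be applied more than once if a task is re-executed; only updates made inside actions are guaranteed to be counted exactly once.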
abstract class Broadcast[T] {
  def value: T
  def unpersist(): Unit
  def destroy(): Unit
}

abstract class AccumulatorV2[IN, OUT] {
  def add(v: IN): Unit
  def value: OUT
  def isZero: Boolean
}

// Core cluster management
abstract class TaskContext {
  def stageId(): Int
  def partitionId(): Int
  def taskAttemptId(): Long
}

// Partitioning
abstract class Partitioner {
  def numPartitions: Int
  def getPartition(key: Any): Int
}

// Asynchronous operations
trait FutureAction[T] {
  def cancel(): Unit
  def isCompleted: Boolean
  def result(atMost: Duration): T
}