tessl/maven-org-apache-spark--spark-core_2-12

Apache Spark Core provides distributed computing capabilities including RDD abstractions, task scheduling, memory management, and fault recovery.

Workspace: tessl
Visibility: Public
Describes: maven pkg:maven/org.apache.spark/spark-core_2.12@3.5.x

To install, run

npx @tessl/cli install tessl/maven-org-apache-spark--spark-core_2-12@3.5.0


Apache Spark Core

Apache Spark Core is the foundational module of the Apache Spark unified analytics engine for large-scale data processing. It provides the RDD (Resilient Distributed Dataset) abstraction, task scheduling, memory management, fault recovery, and storage system interactions, and it serves as the base layer on which all other Spark components are built.

Package Information

  • Package Name: org.apache.spark:spark-core_2.12
  • Package Type: maven
  • Language: Scala (with Java compatibility)
  • Version: 3.5.6
  • Installation: Add to pom.xml or build.sbt

Maven:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.5.6</version>
</dependency>

SBT:

libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.6"

Core Imports

Scala:

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.util.{AccumulatorV2, LongAccumulator, DoubleAccumulator, CollectionAccumulator}
import org.apache.spark.{Dependency, Partition, Partitioner}
import org.apache.spark.TaskContext
import org.apache.spark.input.PortableDataStream
import scala.reflect.ClassTag

Java:

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.SparkConf;

Basic Usage

import org.apache.spark.{SparkContext, SparkConf}

// Create Spark configuration and context
val conf = new SparkConf()
  .setAppName("My Spark Application")
  .setMaster("local[*]")

val sc = new SparkContext(conf)

// Create RDD from local collection
val data = sc.parallelize(1 to 10000)

// Transform and compute
val squares = data.map(x => x * x)
val sum = squares.reduce(_ + _)

println(s"Sum of squares: $sum")

// Clean up
sc.stop()

Java example:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.SparkConf;

import java.util.Arrays;
import java.util.List;

SparkConf conf = new SparkConf()
    .setAppName("My Java Spark Application")
    .setMaster("local[*]");

JavaSparkContext sc = new JavaSparkContext(conf);

// Create RDD from list
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> rdd = sc.parallelize(data);

// Transform and collect
JavaRDD<Integer> squares = rdd.map(x -> x * x);
List<Integer> result = squares.collect();

sc.stop();

Architecture

Apache Spark Core is built around several fundamental components:

  • SparkContext: The main entry point and orchestrator for Spark functionality
  • RDD (Resilient Distributed Dataset): The core abstraction for distributed collections with fault tolerance
  • Task Scheduler: Manages execution of operations across cluster nodes
  • Storage System: Handles caching, persistence, and data locality optimization
  • Shared Variables: Broadcast variables and accumulators for efficient data sharing
  • Serialization Framework: Pluggable serialization for network communication and storage

Capabilities

Spark Context and Configuration

Core entry points for creating and configuring Spark applications. SparkContext serves as the main API for creating RDDs and managing the application lifecycle.

class SparkContext(config: SparkConf)
class SparkConf(loadDefaults: Boolean = true)

Context and Configuration
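
A minimal sketch of configuring a SparkContext and reading settings back at runtime; the master URL and property values used here are illustrative only.

import org.apache.spark.{SparkConf, SparkContext}

// Build a configuration with a few illustrative settings
val conf = new SparkConf()
  .setAppName("config-example")
  .setMaster("local[2]")
  .set("spark.executor.memory", "2g")

val sc = new SparkContext(conf)

// Inspect the running context and its configuration
println(sc.getConf.get("spark.executor.memory"))  // 2g
println(sc.appName)                               // config-example
println(sc.defaultParallelism)                    // 2 for local[2]

sc.stop()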

RDD Operations

Fundamental distributed dataset operations including transformations (map, filter, join) and actions (collect, reduce, save). RDDs provide the core abstraction for distributed data processing with automatic fault tolerance.

abstract class RDD[T: ClassTag]
def map[U: ClassTag](f: T => U): RDD[U]
def filter(f: T => Boolean): RDD[T]
def collect(): Array[T]

RDD Operations
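
A short sketch of chaining lazy transformations and triggering them with actions, assuming an existing SparkContext sc as in the Basic Usage example above.

import org.apache.spark.rdd.RDD

// Transformations are lazy; nothing executes until an action is called
val words: RDD[String] = sc.parallelize(Seq("spark", "core", "rdd", "spark"))

val counts = words
  .filter(_.nonEmpty)       // transformation
  .map(w => (w, 1))         // transformation: produces a pair RDD
  .reduceByKey(_ + _)       // transformation: requires a shuffle

// Actions trigger execution and return results to the driver
val result: Array[(String, Int)] = counts.collect()
val total: Long = words.count()

result.foreach { case (w, n) => println(s"$w -> $n") }
println(s"total words: $total")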

Java API Compatibility

Complete Java API wrappers providing Java-friendly interfaces for all Spark Core functionality. Enables seamless usage from Java applications while maintaining full feature compatibility.

public class JavaSparkContext
public class JavaRDD<T>
public class JavaPairRDD<K, V>

Java API

Storage and Caching

Persistence mechanisms for RDDs including memory, disk, and off-heap storage options. Provides fine-grained control over data persistence with various storage levels and replication strategies.

class StorageLevel(useDisk: Boolean, useMemory: Boolean, useOffHeap: Boolean, deserialized: Boolean, replication: Int)
def persist(newLevel: StorageLevel): RDD[T]
def cache(): RDD[T]

Storage and Caching
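
A brief sketch of persisting RDDs at different storage levels, assuming an existing SparkContext sc; the input path is hypothetical.

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs:///path/to/logs")   // hypothetical input path

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
val errors = logs.filter(_.contains("ERROR")).cache()

// Explicit storage level: serialized in memory, spilling to disk when needed
val warnings = logs.filter(_.contains("WARN"))
  .persist(StorageLevel.MEMORY_AND_DISK_SER)

println(errors.count())   // first action materializes and caches the data
println(errors.count())   // subsequent actions read from the cache

// Release cached blocks when no longer needed
errors.unpersist()
warnings.unpersist()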

Shared Variables

Broadcast variables and accumulators for efficient data sharing across distributed computations. Broadcast variables distribute a read-only value to all executors once, while accumulators are variables that tasks can only add to, with the aggregated result readable on the driver.

abstract class Broadcast[T]
abstract class AccumulatorV2[IN, OUT]
class LongAccumulator extends AccumulatorV2[java.lang.Long, java.lang.Long]

Shared Variables
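
A small sketch combining a broadcast variable with a built-in long accumulator, assuming an existing SparkContext sc; the lookup data is illustrative.

// Ship a read-only lookup table to every executor once
val countryNames = sc.broadcast(Map("DE" -> "Germany", "FR" -> "France"))

// Built-in accumulator registered with the context
val missing = sc.longAccumulator("missing country codes")

val codes = sc.parallelize(Seq("DE", "FR", "XX"))

val resolved = codes.map { code =>
  countryNames.value.getOrElse(code, {
    missing.add(1)          // tasks can only add to an accumulator
    "unknown"
  })
}

resolved.collect().foreach(println)
println(s"unresolved codes: ${missing.value}")   // value is read on the driver

countryNames.destroy()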

Task Execution Context

Runtime context and utilities available to running tasks including partition information, memory management, and lifecycle hooks.

abstract class TaskContext
def partitionId(): Int
def stageId(): Int
def taskAttemptId(): Long

Task Context
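
A sketch that reads TaskContext fields and registers a completion hook inside a running task, assuming an existing SparkContext sc. The println output appears on the executors (the driver console in local mode).

import org.apache.spark.TaskContext

val data = sc.parallelize(1 to 100, 4)

data.foreachPartition { iter =>
  val ctx = TaskContext.get()                 // context of the running task
  println(s"stage=${ctx.stageId()} partition=${ctx.partitionId()} " +
    s"attempt=${ctx.taskAttemptId()}")

  // Lifecycle hook: runs when this task finishes
  ctx.addTaskCompletionListener[Unit] { _ =>
    println(s"partition ${ctx.partitionId()} done")
  }

  iter.foreach(_ => ())                       // consume the partition
}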

Resource Management

Modern resource allocation and management including resource profiles, executor resource requests, and task resource requirements for heterogeneous workloads.

class ResourceProfile(executorResources: Map[String, ExecutorResourceRequest], taskResources: Map[String, TaskResourceRequest])
class ResourceProfileBuilder()

Resource Management
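
A hedged sketch of building and applying a resource profile (available since Spark 3.1 and only honored by cluster managers that support stage-level scheduling). The GPU resource name, amounts, and discovery script path are illustrative assumptions.

import org.apache.spark.resource.{ExecutorResourceRequests, TaskResourceRequests, ResourceProfileBuilder}

// Executor-side requirements for this profile
val execReqs = new ExecutorResourceRequests()
  .cores(4)
  .memory("8g")
  .resource("gpu", 1, "/opt/spark/scripts/getGpus.sh")   // hypothetical discovery script

// Task-side requirements
val taskReqs = new TaskResourceRequests()
  .cpus(1)
  .resource("gpu", 1)

val profile = new ResourceProfileBuilder()
  .require(execReqs)
  .require(taskReqs)
  .build()

// Stages computed from this RDD are scheduled with the profile's resources
val rdd = sc.parallelize(1 to 1000).withResources(profile)
rdd.map(_ * 2).collect()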

Serialization Framework

Pluggable serialization system supporting Java serialization and Kryo for optimized network communication and storage. Provides extensible serialization with performance tuning options.

abstract class SerializerInstance
class JavaSerializer(conf: SparkConf) extends Serializer
class KryoSerializer(conf: SparkConf) extends Serializer

Serialization
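
A sketch of enabling Kryo serialization through configuration; EventRecord is a hypothetical application class registered with Kryo for more compact output.

import org.apache.spark.{SparkConf, SparkContext}

case class EventRecord(id: Long, payload: String)   // hypothetical application class

val conf = new SparkConf()
  .setAppName("kryo-example")
  .setMaster("local[*]")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[EventRecord])) // avoids writing full class names

val sc = new SparkContext(conf)

val events = sc.parallelize(Seq(EventRecord(1L, "a"), EventRecord(2L, "b")))
println(events.count())

sc.stop()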