tessl/maven-org-apache-spark--spark-core_2-13

Apache Spark Core provides the fundamental distributed computing infrastructure for large-scale data processing with RDDs, SparkContext, and cluster management

Describes: pkg:maven/org.apache.spark/spark-core_2.13@3.5.x

To install, run

npx @tessl/cli install tessl/maven-org-apache-spark--spark-core_2-13@3.5.0


Apache Spark Core

Apache Spark Core provides the fundamental distributed computing infrastructure for large-scale data processing. It implements the core Spark execution model with Resilient Distributed Datasets (RDDs) as the primary abstraction for fault-tolerant distributed collections, SparkContext as the main entry point, and a sophisticated task scheduler for efficient cluster computation.

Package Information

  • Package Name: org.apache.spark:spark-core_2.13
  • Package Type: Maven
  • Language: Scala (with Java/Python/R APIs)
  • Version: 3.5.6
  • Installation: Add to Maven dependencies or SBT build

Maven:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.13</artifactId>
  <version>3.5.6</version>
</dependency>

SBT:

libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.6"

Core Imports

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.rdd.RDD

For broadcast variables and accumulators:

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.util.AccumulatorV2

Basic Usage

import org.apache.spark.{SparkContext, SparkConf}

// Create Spark configuration
val conf = new SparkConf()
  .setAppName("MySparkApp")
  .setMaster("local[*]")

// Initialize SparkContext
val sc = new SparkContext(conf)

// Create RDD from collection
val data = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Transform lazily, then collect results to the driver
val doubled = data.map(_ * 2).filter(_ > 4)
val result = doubled.collect() // Array(6, 8, 10)

// Clean up
sc.stop()

Architecture

Spark Core implements a distributed computing framework with several key components:

  • SparkContext: The main entry point that coordinates cluster resources and manages the application lifecycle
  • RDD (Resilient Distributed Dataset): Immutable, fault-tolerant distributed collections that form the core abstraction
  • DAG Scheduler: Optimizes computation graphs and creates stages for efficient execution
  • Task Scheduler: Distributes tasks across cluster nodes and handles task failures
  • Cluster Manager: Interfaces with YARN, Mesos, Kubernetes, or standalone cluster managers
  • Storage System: Manages memory and disk storage for cached RDDs and shuffle data

This architecture enables fault-tolerant distributed computing with automatic recovery, lineage tracking, and optimized data locality.
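
Lineage tracking can be observed directly: every RDD records the chain of transformations that produced it, and toDebugString prints that recovery plan. A minimal sketch, assuming the sc context from Basic Usage:

// Build a small lineage: parallelize -> map -> filter
val lineageDemo = sc.parallelize(1 to 100)
  .map(_ * 2)
  .filter(_ % 3 == 0)

// Prints the RDD's lineage, which Spark replays to recompute lost partitions
println(lineageDemo.toDebugString)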

Capabilities

SparkContext and Configuration

The main entry point for Spark applications, providing methods to create RDDs, manage configuration, and control cluster resources.

class SparkContext(config: SparkConf)
class SparkConf(loadDefaults: Boolean = true)

// Key SparkContext methods
def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T]
def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
def broadcast[T: ClassTag](value: T): Broadcast[T]
def longAccumulator(name: String): LongAccumulator
def stop(): Unit
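
A brief sketch of configuring and obtaining a context; the property value shown is illustrative, and getOrCreate reuses an already-running SparkContext if one exists:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ConfigDemo")
  .setMaster("local[2]")
  .set("spark.executor.memory", "1g") // illustrative value

// Reuse the active SparkContext if present, otherwise create one
val sc = SparkContext.getOrCreate(conf)

// defaultParallelism drives the default partition count for parallelize
val numbers = sc.parallelize(1 to 10, sc.defaultParallelism)
println(numbers.getNumPartitions)

sc.stop()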


RDD Operations

Core distributed dataset operations including transformations (lazy) and actions (eager execution).

abstract class RDD[T: ClassTag]

// Core transformations
def map[U: ClassTag](f: T => U): RDD[U]
def filter(f: T => Boolean): RDD[T]
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]
def union(other: RDD[T]): RDD[T]
def distinct(): RDD[T]

// Core actions  
def collect(): Array[T]
def count(): Long
def reduce(f: (T, T) => T): T
def foreach(f: T => Unit): Unit
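
A minimal sketch combining these transformations and actions, assuming the sc context from Basic Usage:

val words = sc.parallelize(Seq("spark core", "rdd api", "spark rdd"))

// Transformations are lazy: nothing executes until an action is called
val tokens = words
  .flatMap(_.split(" "))
  .filter(_.nonEmpty)
  .distinct()

// Actions trigger execution on the cluster
val total = tokens.count()              // 4 distinct tokens
val joined = tokens.reduce(_ + "," + _) // element order is not deterministic
tokens.collect().foreach(println)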


Key-Value Operations

Specialized operations available on RDDs of key-value pairs, including joins, grouping, and aggregations.

// Available on RDD[(K, V)] via implicit conversion
def groupByKey(): RDD[(K, Iterable[V])]
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]
def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)]
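
A short sketch of pair-RDD usage; the implicit conversion to pair operations is picked up automatically with the standard imports:

val sales  = sc.parallelize(Seq(("apples", 3), ("pears", 2), ("apples", 5)))
val prices = sc.parallelize(Seq(("apples", 1.5), ("pears", 2.0)))

// reduceByKey combines values per key map-side before shuffling
val totals = sales.reduceByKey(_ + _)   // ("apples", 8), ("pears", 2)

// join pairs each key's total quantity with its price
val joined = totals.join(prices)        // ("apples", (8, 1.5)), ("pears", (2, 2.0))
val revenue = joined.mapValues { case (qty, price) => qty * price }

revenue.collect().foreach(println)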


Broadcast Variables and Accumulators

Shared variables for efficient data distribution and accumulation across cluster nodes.

abstract class Broadcast[T: ClassTag]
def value: T
def unpersist(): Unit

abstract class AccumulatorV2[IN, OUT]
def add(v: IN): Unit
def value: OUT
def reset(): Unit
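
A small sketch showing both shared-variable types, assuming the sc context from Basic Usage:

// Broadcast a read-only lookup table to every executor once
val countryNames = sc.broadcast(Map("DE" -> "Germany", "FR" -> "France"))

// Accumulator for counting records that miss the lookup
val missing = sc.longAccumulator("missing country codes")

val codes = sc.parallelize(Seq("DE", "FR", "XX"))
val resolved = codes.map { code =>
  countryNames.value.getOrElse(code, { missing.add(1); "unknown" })
}

// Accumulator updates happen during execution; read the value on the driver
// after an action (updates made inside transformations may be re-applied on task retries)
resolved.collect()
println(missing.value)   // 1
countryNames.unpersist()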


Data I/O and Persistence

Input/output operations for various data sources and RDD caching strategies.

// Input operations
def textFile(path: String): RDD[String]
def sequenceFile[K, V](path: String): RDD[(K, V)]
def hadoopRDD[K, V](conf: JobConf, inputFormat: Class[_ <: InputFormat[K, V]]): RDD[(K, V)]

// Persistence
def persist(newLevel: StorageLevel): RDD[T]
def cache(): RDD[T]
def unpersist(): RDD[T]
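
A sketch of reading, caching, and writing text data; the paths are placeholders:

import org.apache.spark.storage.StorageLevel

// Read a text file from a local path or any Hadoop-supported URI
val logs = sc.textFile("data/input.txt")

// Cache the filtered RDD so repeated actions reuse it instead of re-reading the file
val errors = logs.filter(_.contains("ERROR")).persist(StorageLevel.MEMORY_AND_DISK)

println(errors.count())  // first action materializes and caches the data
println(errors.first())  // served from the cache

// Write results out; the output directory must not already exist
errors.saveAsTextFile("data/errors-out")

errors.unpersist()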


Types

// Core configuration
class SparkConf(loadDefaults: Boolean = true) {
  def set(key: String, value: String): SparkConf
  def setAppName(name: String): SparkConf  
  def setMaster(master: String): SparkConf
  def get(key: String): String
}

// Storage levels
object StorageLevel {
  val NONE: StorageLevel
  val MEMORY_ONLY: StorageLevel
  val MEMORY_AND_DISK: StorageLevel
  val MEMORY_ONLY_SER: StorageLevel
  val DISK_ONLY: StorageLevel
}

// Partitioning
abstract class Partitioner {
  def numPartitions: Int
  def getPartition(key: Any): Int
}

class HashPartitioner(partitions: Int) extends Partitioner

// Binary data representation
// Binary data representation (returned by binaryFiles); instances are constructed by Spark
class PortableDataStream {
  def open(): DataInputStream
  def toArray(): Array[Byte]
  def getPath(): String
}
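
As a usage note for the partitioning types above, a short sketch of applying a HashPartitioner to a pair RDD (the partition count of 4 is arbitrary), assuming the sc context from Basic Usage:

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// partitionBy shuffles the data once; later key-based operations on the result
// (reduceByKey, join) can reuse the partitioning and avoid another shuffle
val partitioned = pairs.partitionBy(new HashPartitioner(4))

println(partitioned.getNumPartitions)      // 4
println(partitioned.partitioner.isDefined) // true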