Apache Spark Core provides the fundamental distributed computing infrastructure for large-scale data processing with RDDs, SparkContext, and cluster management
npx @tessl/cli install tessl/maven-org-apache-spark--spark-core_2-13@3.5.6
# Apache Spark Core

Apache Spark Core provides the fundamental distributed computing infrastructure for large-scale data processing. It implements the core Spark execution model with Resilient Distributed Datasets (RDDs) as the primary abstraction for fault-tolerant distributed collections, SparkContext as the main entry point, and a sophisticated task scheduler for efficient cluster computation.

## Package Information

- **Package Name**: org.apache.spark:spark-core_2.13
- **Package Type**: Maven
- **Language**: Scala (with Java/Python/R APIs)
- **Version**: 3.5.6
- **Installation**: Add to Maven dependencies or SBT build

Maven:
```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.13</artifactId>
  <version>3.5.6</version>
</dependency>
```

SBT:
```scala
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.6"
```

## Core Imports

```scala
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.rdd.RDD
```

For broadcast variables and accumulators:
```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.util.AccumulatorV2
```

## Basic Usage

```scala
import org.apache.spark.{SparkContext, SparkConf}

// Create Spark configuration
val conf = new SparkConf()
  .setAppName("MySparkApp")
  .setMaster("local[*]")

// Initialize SparkContext
val sc = new SparkContext(conf)

// Create RDD from collection
val data = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Transform and collect results
val doubled = data.map(_ * 2).filter(_ > 4)
val result = doubled.collect()

// Clean up
sc.stop()
```

## Architecture

Spark Core implements a distributed computing framework with several key components:

- **SparkContext**: The main entry point that coordinates cluster resources and manages the application lifecycle
- **RDD (Resilient Distributed Dataset)**: Immutable, fault-tolerant distributed collections that form the core abstraction
- **DAG Scheduler**: Optimizes computation graphs and creates stages for efficient execution
- **Task Scheduler**: Distributes tasks across cluster nodes and handles task failures
- **Cluster Manager**: Interfaces with YARN, Mesos, Kubernetes, or standalone cluster managers
- **Storage System**: Manages memory and disk storage for cached RDDs and shuffle data

This architecture enables fault-tolerant distributed computing with automatic recovery, lineage tracking, and optimized data locality.
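
To make lineage tracking concrete, here is a minimal sketch (local master, made-up data, an illustrative app name) that builds a short RDD chain and prints the lineage Spark keeps for recomputing lost partitions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative local application; app name and data are made up for this sketch
val conf = new SparkConf().setAppName("lineage-demo").setMaster("local[*]")
val sc = new SparkContext(conf)

val words = sc.parallelize(Seq("spark", "core", "rdd", "spark"))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

// Each RDD records how it was derived from its parents; a lost partition
// is recomputed from this lineage instead of being replicated up front
println(counts.toDebugString)

sc.stop()
```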
## Capabilities

### SparkContext and Configuration

The main entry point for Spark applications, providing methods to create RDDs, manage configuration, and control cluster resources.

```scala { .api }
class SparkContext(config: SparkConf)
class SparkConf(loadDefaults: Boolean = true)

// Key SparkContext methods
def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T]
def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
def broadcast[T: ClassTag](value: T): Broadcast[T]
def longAccumulator(name: String): LongAccumulator
def stop(): Unit
```
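
As a quick, non-authoritative sketch of how these entry points fit together (the app name, data, and `local[2]` master are illustrative), a driver builds a `SparkConf`, creates the `SparkContext`, and uses it to create RDDs, broadcasts, and accumulators:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("context-demo").setMaster("local[2]")
val sc = new SparkContext(conf)

// Distribute a local collection across four partitions
val numbers = sc.parallelize(1 to 100, numSlices = 4)

// Ship a small read-only lookup table to every executor once
val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two"))

// Count matching elements on the executors, read the total on the driver
val evens = sc.longAccumulator("evens")
numbers.foreach(n => if (n % 2 == 0) evens.add(1))

println(s"${lookup.value(1)}, even count: ${evens.value}")
sc.stop()
```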

[SparkContext and Configuration](./sparkcontext.md)

### RDD Operations

Core distributed dataset operations including transformations (lazy) and actions (eager execution).

```scala { .api }
abstract class RDD[T: ClassTag]

// Core transformations
def map[U: ClassTag](f: T => U): RDD[U]
def filter(f: T => Boolean): RDD[T]
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]
def union(other: RDD[T]): RDD[T]
def distinct(): RDD[T]

// Core actions
def collect(): Array[T]
def count(): Long
def reduce(f: (T, T) => T): T
def foreach(f: T => Unit): Unit
```
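
A short sketch of how these compose (reusing the `sc` from Basic Usage, with made-up strings): transformations such as `filter` and `flatMap` only describe the computation; nothing executes until an action such as `count` or `collect` runs.

```scala
val lines = sc.parallelize(Seq("spark core", "rdd basics", "spark shuffle"))

// Transformations are lazy: these lines only build a lineage graph
val sparkLines = lines.filter(_.contains("spark"))
val words = sparkLines.flatMap(_.split(" ")).distinct()

// Actions trigger execution
println(words.count())          // 3 distinct words: spark, core, shuffle
println(words.collect().toList) // materializes the results on the driver
```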

[RDD Operations](./rdd-operations.md)

### Key-Value Operations

Specialized operations available on RDDs of key-value pairs, including joins, grouping, and aggregations.

```scala { .api }
// Available on RDD[(K, V)] via implicit conversion
def groupByKey(): RDD[(K, Iterable[V])]
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]
def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)]
```
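
A brief sketch (again reusing `sc`, with invented data): a pair RDD is just an `RDD[(K, V)]`, and the key-value methods become available through the implicit conversion to `PairRDDFunctions`.

```scala
val sales   = sc.parallelize(Seq(("apples", 3), ("pears", 2), ("apples", 5)))
val regions = sc.parallelize(Seq(("apples", "north"), ("pears", "south")))

// Combine values per key on each partition before shuffling
val totals = sales.reduceByKey(_ + _) // ("apples", 8), ("pears", 2)

// Inner join on the key
val joined = totals.join(regions)     // ("apples", (8, "north")), ("pears", (2, "south"))

joined.collect().foreach(println)
```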

[Key-Value Operations](./key-value-operations.md)

### Broadcast Variables and Accumulators

Shared variables for efficient data distribution and accumulation across cluster nodes.

```scala { .api }
abstract class Broadcast[T: ClassTag]
def value: T
def unpersist(): Unit

abstract class AccumulatorV2[IN, OUT]
def add(v: IN): Unit
def value: OUT
def reset(): Unit
```
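
A small sketch (illustrative lookup table and codes, reusing `sc`) contrasting the two: the broadcast value is read on executors, while the accumulator is written on executors and read back on the driver.

```scala
// Broadcast: ship a read-only lookup table to every executor once
val countryNames = sc.broadcast(Map("DE" -> "Germany", "FR" -> "France"))

val codes    = sc.parallelize(Seq("DE", "FR", "XX"))
val resolved = codes.map(code => countryNames.value.getOrElse(code, "unknown"))
println(resolved.collect().toList) // List(Germany, France, unknown)

// Accumulator: executors add, the driver reads the merged result.
// Updating inside an action (foreach) keeps the count reliable across task retries.
val unknown = sc.longAccumulator("unknown country codes")
codes.foreach(code => if (!countryNames.value.contains(code)) unknown.add(1))
println(unknown.value) // 1
```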

[Broadcast Variables and Accumulators](./broadcast-accumulators.md)

### Data I/O and Persistence

Input/output operations for various data sources and RDD caching strategies.

```scala { .api }
// Input operations
def textFile(path: String): RDD[String]
def sequenceFile[K, V](path: String): RDD[(K, V)]
def hadoopRDD[K, V](conf: JobConf, inputFormat: Class[_ <: InputFormat[K, V]]): RDD[(K, V)]

// Persistence
def persist(newLevel: StorageLevel): RDD[T]
def cache(): RDD[T]
def unpersist(): RDD[T]
```
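
A sketch of a typical read-filter-cache-write flow; the HDFS paths here are hypothetical placeholders, and `sc` is the context from Basic Usage.

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical input path; point this at a real file or directory
val logs = sc.textFile("hdfs:///data/app/logs/*.log")

// Cache the filtered RDD because two actions below reuse it
val errors = logs.filter(_.contains("ERROR")).persist(StorageLevel.MEMORY_AND_DISK)

println(errors.count())                          // first action computes and caches
errors.saveAsTextFile("hdfs:///data/app/errors") // second action reuses the cache

errors.unpersist()
```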

[Data I/O and Persistence](./data-io-persistence.md)

## Types

```scala { .api }
// Core configuration
class SparkConf(loadDefaults: Boolean = true) {
  def set(key: String, value: String): SparkConf
  def setAppName(name: String): SparkConf
  def setMaster(master: String): SparkConf
  def get(key: String): String
}

// Storage levels
object StorageLevel {
  val NONE: StorageLevel
  val MEMORY_ONLY: StorageLevel
  val MEMORY_AND_DISK: StorageLevel
  val MEMORY_ONLY_SER: StorageLevel
  val DISK_ONLY: StorageLevel
}

// Partitioning
abstract class Partitioner {
  def numPartitions: Int
  def getPartition(key: Any): Int
}

case class HashPartitioner(partitions: Int) extends Partitioner

// Binary data representation
class PortableDataStream(
  isDirectory: Boolean,
  path: String,
  length: Long,
  modificationTime: Long
) {
  def open(): DataInputStream
  def toArray(): Array[Byte]
}
```
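
A closing sketch (invented data, reusing `sc`) of how these types are commonly combined: an explicit `HashPartitioner` for a pair RDD plus a `StorageLevel` for caching.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Hash-partition by key and cache, so later key-based operations
// on this RDD can avoid another shuffle
val partitioned = pairs
  .partitionBy(new HashPartitioner(8))
  .persist(StorageLevel.MEMORY_ONLY)

println(partitioned.partitioner)                         // Some(org.apache.spark.HashPartitioner@...)
println(partitioned.reduceByKey(_ + _).collect().toList) // List((a,4), (b,2))
```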