tessl/maven-org-apache-spark--spark-core

Apache Spark Core provides distributed computing capabilities with RDDs, task scheduling, and cluster management for big data processing

Workspace: tessl
Visibility: Public
Describes: pkg:maven/org.apache.spark/spark-core_2.11@2.2.x (Maven)

To install, run

npx @tessl/cli install tessl/maven-org-apache-spark--spark-core@2.2.0


Apache Spark Core

Apache Spark Core is the foundational engine for large-scale distributed data processing. It provides resilient distributed datasets (RDDs), in-memory computing capabilities, and a unified execution engine for batch and interactive data processing across clusters.

Package Information

  • Package Name: org.apache.spark:spark-core_2.11
  • Package Type: Maven
  • Language: Scala (with Java API)
  • Installation:
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.2.3</version>
    </dependency>

Core Imports

Scala:

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.util.{LongAccumulator, DoubleAccumulator, AccumulatorV2}

Java:

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.SparkConf;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.util.LongAccumulator;

Basic Usage

import org.apache.spark.{SparkContext, SparkConf}

// Create Spark configuration
val conf = new SparkConf()
  .setAppName("My Spark Application")
  .setMaster("local[*]")

// Create Spark context
val sc = new SparkContext(conf)

// Create RDD from collection
val data = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Transform and compute
val result = data
  .filter(_ % 2 == 0)
  .map(_ * 2)
  .collect()

// Clean up
sc.stop()

Architecture

Apache Spark Core is built around several key abstractions:

  • SparkContext: The main entry point that coordinates distributed computing
  • RDD (Resilient Distributed Dataset): Immutable, fault-tolerant collections distributed across cluster nodes
  • Transformations: Lazy operations that create new RDDs (map, filter, groupBy); see the sketch after this list
  • Actions: Operations that trigger computation and return results (collect, count, save)
  • Broadcast Variables: Read-only variables cached on all nodes
  • Accumulators: Variables for aggregating information across tasks
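
The lazy/eager split can be seen in a small sketch like the one below, which reuses the SparkContext sc created in Basic Usage: the map call only records lineage, while count launches the actual job.

// Transformation: nothing is computed yet, Spark only records the lineage
val doubled = sc.parallelize(1 to 1000).map(_ * 2)

// Action: the lineage is evaluated and a result is returned to the driver
val n = doubled.count()   // 1000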

Capabilities

Core SparkContext and Configuration

The SparkContext serves as the primary entry point for all Spark functionality, providing methods for creating RDDs, managing cluster resources, and configuring applications.

class SparkContext(config: SparkConf) {
  def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T]
  def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
  def stop(): Unit
}

class SparkConf(loadDefaults: Boolean = true) {
  def setAppName(name: String): SparkConf
  def setMaster(master: String): SparkConf
  def set(key: String, value: String): SparkConf
}
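
A minimal usage sketch of these entry points, assuming a local master and a hypothetical input path (data/input.txt):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("Context Example")
  .setMaster("local[2]")
  .set("spark.ui.enabled", "false")   // any string key/value pair can be set this way

val sc = new SparkContext(conf)

// textFile produces an RDD[String] with one element per line of the (hypothetical) file
val lines = sc.textFile("data/input.txt", minPartitions = 4)
println(lines.count())

sc.stop()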

See docs/spark-context.md.

RDD Operations and Transformations

RDDs provide the core abstraction for distributed data processing with lazy transformations and eager actions that enable fault-tolerant computation across cluster nodes.

abstract class RDD[T: ClassTag] {
  def map[U: ClassTag](f: T => U): RDD[U]
  def filter(f: T => Boolean): RDD[T]
  def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]
  def groupBy[K](f: T => K): RDD[(K, Iterable[T])]
  def collect(): Array[T]
  def count(): Long
  def reduce(f: (T, T) => T): T
}
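
For instance, a word count over an in-memory collection combines several of these operations; the sketch assumes a SparkContext sc is already available:

val lines = sc.parallelize(Seq("spark core rdd", "spark rdd"))

val counts = lines
  .flatMap(_.split(" "))                                  // one element per word
  .groupBy(identity)                                      // (word, all occurrences)
  .map { case (word, occurrences) => (word, occurrences.size) }

counts.collect().foreach(println)                         // (spark,2), (rdd,2), (core,1)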

See docs/rdd-operations.md.

Java API

Java-friendly wrappers provide seamless integration with Java applications while maintaining full access to Spark's distributed computing capabilities.

public class JavaSparkContext {
  public JavaSparkContext(SparkConf conf)
  public <T> JavaRDD<T> parallelize(List<T> list)
  public JavaRDD<String> textFile(String path)
  public void close()
}

public class JavaRDD<T> {
  public <R> JavaRDD<R> map(Function<T, R> f)
  public JavaRDD<T> filter(Function<T, Boolean> f)
  public List<T> collect()
}
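
A short sketch of the Java API in use, mirroring the Scala Basic Usage example (the class name JavaApiExample is arbitrary):

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class JavaApiExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("Java API Example").setMaster("local[*]");
    JavaSparkContext jsc = new JavaSparkContext(conf);

    // Java 8 lambdas satisfy org.apache.spark.api.java.function.Function
    JavaRDD<Integer> numbers = jsc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
    List<Integer> evensDoubled = numbers
        .filter(n -> n % 2 == 0)
        .map(n -> n * 2)
        .collect();

    System.out.println(evensDoubled);  // [4, 8]
    jsc.close();
  }
}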

See docs/java-api.md.

Storage and Persistence

Storage and persistence mechanisms allow RDDs to be cached in memory or persisted to disk with configurable storage levels for performance optimization.

abstract class RDD[T] {
  def persist(newLevel: StorageLevel): this.type
  def cache(): this.type
  def unpersist(blocking: Boolean = true): this.type
}

object StorageLevel {
  val MEMORY_ONLY: StorageLevel
  val MEMORY_AND_DISK: StorageLevel
  val DISK_ONLY: StorageLevel
}
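
A minimal caching sketch, assuming an existing SparkContext sc and a hypothetical log file path:

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("data/events.log")

// Keep partitions in memory, spilling to disk when they do not fit
logs.persist(StorageLevel.MEMORY_AND_DISK)

// Both actions reuse the cached partitions instead of re-reading the file
val total  = logs.count()
val errors = logs.filter(_.contains("ERROR")).count()

// Release the cached blocks once the data is no longer needed
logs.unpersist()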

See docs/storage-persistence.md.

Broadcasting and Accumulators

Shared variables enable efficient distribution of read-only data and aggregation of values across distributed computations without expensive network operations.

abstract class Broadcast[T] {
  def value: T
  def unpersist(): Unit
  def destroy(): Unit
}

abstract class AccumulatorV2[IN, OUT] {
  def add(v: IN): Unit
  def value: OUT
  def isZero: Boolean
}
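
Both kinds of shared variable are created through the SparkContext. The sketch below assumes an existing sc and uses a broadcast lookup table together with a LongAccumulator:

// Broadcast a small lookup table once instead of shipping it with every task
val countryNames = sc.broadcast(Map("DE" -> "Germany", "FR" -> "France"))

// Accumulator counting codes that are missing from the table
val missing = sc.longAccumulator("missing country codes")

val codes = sc.parallelize(Seq("DE", "FR", "XX"))
val resolved = codes.map { code =>
  countryNames.value.get(code) match {
    case Some(name) => name
    case None =>
      missing.add(1)
      code
  }
}

resolved.collect()                 // Array(Germany, France, XX)
println(missing.value)             // 1 (accumulator values are reliable after an action)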

See docs/broadcasting-accumulators.md.

Core Types

// Per-task runtime information
abstract class TaskContext {
  def stageId(): Int
  def partitionId(): Int
  def taskAttemptId(): Long
}

// Partitioning
abstract class Partitioner {
  def numPartitions: Int
  def getPartition(key: Any): Int
}

// Asynchronous operations
trait FutureAction[T] {
  def cancel(): Unit
  def isCompleted: Boolean
  def result(atMost: Duration): T
}