or run

npx @tessl/cli init

Version

Tile

Overview

Evals

Files

docs

broadcast-accumulators.md context-management.md index.md java-api.md pair-rdd-operations.md rdd-operations.md storage-persistence.md

tile.json

tessl/maven-org-apache-spark--spark-core_2-11

Apache Spark Core - The foundational distributed computing engine for Apache Spark that provides RDD abstractions, task scheduling, memory management, and cluster execution capabilities.

Workspace: tessl
Visibility: Public
Created: 3 months ago
Last updated: 3 months ago
Describes: pkg:maven/org.apache.spark/spark-core_2.11@1.6.x

To install, run

npx @tessl/cli install tessl/maven-org-apache-spark--spark-core_2-11@1.6.0

Apache Spark Core

Apache Spark Core provides the foundational distributed computing engine for Apache Spark. It implements the Resilient Distributed Dataset (RDD) programming model, sophisticated task scheduling, advanced memory management, and comprehensive support for multiple cluster managers. The core engine enables fault-tolerant parallel operations on large datasets across distributed clusters.

Package Information

Package Name: spark-core_2.11
Package Type: maven
Language: Scala (with Java API)
Version: 1.6.3
Installation: Add to Maven/SBT dependencies: org.apache.spark:spark-core_2.11:1.6.3

Core Imports

Scala:

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

Java:

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.SparkConf;

Basic Usage

Scala:

import org.apache.spark.{SparkContext, SparkConf}

// Create Spark configuration
val conf = new SparkConf()
  .setAppName("MySparkApp")
  .setMaster("local[*]")

// Create Spark context
val sc = new SparkContext(conf)

// Create RDD from collection
val data = sc.parallelize(1 to 10)

// Transform and collect results
val result = data
  .map(_ * 2)
  .filter(_ > 10)
  .collect()

// Stop the context
sc.stop()

Java:

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.SparkConf;

// Create configuration
SparkConf conf = new SparkConf()
    .setAppName("MyJavaSparkApp")
    .setMaster("local[*]");

// Create Java Spark context
JavaSparkContext sc = new JavaSparkContext(conf);

// Create RDD and perform operations
JavaRDD<Integer> data = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
JavaRDD<Integer> result = data
    .map(x -> x * 2)
    .filter(x -> x > 5);

List<Integer> collected = result.collect();
sc.stop();

Architecture

Apache Spark Core is built around several key components:

SparkContext: The main entry point that coordinates all Spark operations and manages the connection to the cluster
RDD (Resilient Distributed Dataset): The fundamental data abstraction representing an immutable, partitioned collection that can be operated on in parallel
Task Scheduler: Sophisticated scheduling system that optimizes job execution across cluster resources with data locality awareness
Memory Management: Advanced caching and storage system with configurable storage levels and automatic spill-to-disk capabilities
Cluster Managers: Support for multiple cluster managers including Standalone, YARN, and Mesos
Fault Tolerance: Automatic recovery from node failures through RDD lineage and checkpointing

Capabilities

Spark Context Management

Core functionality for creating and managing Spark applications, including cluster connections, resource allocation, and application lifecycle management.

class SparkContext(config: SparkConf) extends Logging {
  def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T]
  def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
  def wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)]
  def stop(): Unit
  def broadcast[T: ClassTag](value: T): Broadcast[T]
  def accumulator[T](initialValue: T)(implicit param: AccumulatorParam[T]): Accumulator[T]
}

class SparkConf(loadDefaults: Boolean = true) extends Cloneable {
  def set(key: String, value: String): SparkConf
  def setMaster(master: String): SparkConf
  def setAppName(name: String): SparkConf
  def get(key: String): String
  def get(key: String, defaultValue: String): String
}

Context Management

RDD Operations

The core RDD API providing transformations and actions for distributed data processing, including map, filter, reduce operations and advanced transformations like joins and aggregations.

abstract class RDD[T: ClassTag] extends Serializable {
  // Transformations
  def map[U: ClassTag](f: T => U): RDD[U]
  def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]
  def filter(f: T => Boolean): RDD[T]
  def distinct(numPartitions: Int = partitions.length): RDD[T]
  def union(other: RDD[T]): RDD[T]
  
  // Actions
  def collect(): Array[T]
  def count(): Long
  def first(): T
  def take(num: Int): Array[T]
  def reduce(f: (T, T) => T): T
  def foreach(f: T => Unit): Unit
  
  // Persistence
  def cache(): RDD.this.type
  def persist(newLevel: StorageLevel): RDD.this.type
}

RDD Operations

Pair RDD Operations

Advanced operations for key-value pair RDDs including grouping, joining, and aggregation operations essential for data processing workflows.

class PairRDDFunctions[K, V](self: RDD[(K, V)]) {
  def groupByKey(): RDD[(K, Iterable[V])]
  def reduceByKey(func: (V, V) => V): RDD[(K, V)]
  def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)]
  def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
  def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]
  def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]
  def sortByKey(ascending: Boolean = true): RDD[(K, V)]
}

Pair RDD Operations

Java API

Java-friendly wrappers providing the complete Spark functionality through Java-compatible interfaces, lambda support, and familiar Java collection types.

public class JavaSparkContext implements Closeable {
    public JavaSparkContext(SparkConf conf)
    public <T> JavaRDD<T> parallelize(List<T> list)
    public <T> JavaRDD<T> parallelize(List<T> list, int numSlices)
    public JavaRDD<String> textFile(String path)
    public <T> Broadcast<T> broadcast(T value)
    public void stop()
}

public class JavaRDD<T> extends AbstractJavaRDDLike<T, JavaRDD<T>> {
    public <R> JavaRDD<R> map(Function<T, R> f)
    public <R> JavaRDD<R> flatMap(FlatMapFunction<T, R> f)
    public JavaRDD<T> filter(Function<T, Boolean> f)
    public List<T> collect()
    public long count()
    public T first()
}

Java API

Storage and Persistence

Memory management and persistence strategies for optimizing RDD storage across cluster nodes, including various storage levels and caching mechanisms.

object StorageLevel {
  val NONE: StorageLevel
  val DISK_ONLY: StorageLevel
  val DISK_ONLY_2: StorageLevel
  val MEMORY_ONLY: StorageLevel
  val MEMORY_ONLY_2: StorageLevel
  val MEMORY_ONLY_SER: StorageLevel
  val MEMORY_AND_DISK: StorageLevel
  val MEMORY_AND_DISK_2: StorageLevel
  val MEMORY_AND_DISK_SER: StorageLevel
}

Storage and Persistence

Broadcast Variables and Accumulators

Distributed variable support for efficiently sharing read-only data across tasks (broadcast variables) and collecting information from executors (accumulators).

abstract class Broadcast[T: ClassTag] extends Serializable {
  def value: T
  def unpersist(blocking: Boolean = true): Unit
  def destroy(): Unit
  def id: Long
}

class Accumulator[T] private[spark] (
    @transient private[spark] val initialValue: T,
    param: AccumulatorParam[T],
    name: Option[String] = None) extends Serializable {
  def value: T
  def add(term: T): Unit
  def += (term: T): Unit
  def localValue: T
}

Broadcast and Accumulators

Types

// Core configuration class
class SparkConf(loadDefaults: Boolean = true) extends Cloneable with Logging

// Main entry point for Spark functionality  
class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationClient

// Basic distributed dataset abstraction
abstract class RDD[T: ClassTag] extends Serializable with Logging

// Storage levels for RDD persistence
class StorageLevel private(
    private var _useDisk: Boolean,
    private var _useMemory: Boolean, 
    private var _useOffHeap: Boolean,
    private var _deserialized: Boolean,
    private var _replication: Int = 1) extends Externalizable

// Partitioning strategies
abstract class Partitioner extends Serializable {
  def numPartitions: Int
  def getPartition(key: Any): Int
}

// Task execution context
abstract class TaskContext extends Serializable {
  def partitionId(): Int
  def stageId(): Int
  def taskAttemptId(): Long
  def attemptNumber(): Int
}

Version

Tile

Files

tessl/maven-org-apache-spark--spark-core_2-11

To install, run

index.md.css-3qkkll{font-size:var(--chakra-font-sizes-sm);font-weight:var(--chakra-font-weights-normal);color:var(--chakra-colors-gray-300);}docs/

Apache Spark Core

Package Information

Core Imports

Basic Usage

Architecture

Capabilities

Spark Context Management

RDD Operations

Pair RDD Operations

Java API

Storage and Persistence

Broadcast Variables and Accumulators

Types

index.mddocs/