Apache Spark Core provides distributed computing capabilities including RDD abstractions, task scheduling, memory management, and fault recovery.
npx @tessl/cli install tessl/maven-org-apache-spark--spark-core_2-12@3.5.6
# Apache Spark Core
Apache Spark Core is the foundational module of the Apache Spark unified analytics engine for large-scale data processing. It provides fundamental distributed computing capabilities including RDD (Resilient Distributed Dataset) abstractions, task scheduling, memory management, fault recovery, and storage system interactions. Spark Core serves as the base layer for all other Spark components and provides high-performance distributed computing primitives with automatic fault tolerance.

## Package Information

- **Package Name**: org.apache.spark:spark-core_2.12
- **Package Type**: maven
- **Language**: Scala (with Java compatibility)
- **Version**: 3.5.6
- **Installation**: Add the dependency to `pom.xml` or `build.sbt`

Maven:

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.12</artifactId>
  <version>3.5.6</version>
</dependency>
```

SBT:

```scala
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.6"
```

## Core Imports

Scala:

```scala
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.util.{AccumulatorV2, LongAccumulator, DoubleAccumulator, CollectionAccumulator}
import org.apache.spark.{Dependency, Partition, Partitioner}
import org.apache.spark.TaskContext
import org.apache.spark.input.PortableDataStream
import scala.reflect.ClassTag
```

Java:

```java
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.SparkConf;
```

## Basic Usage

```scala
import org.apache.spark.{SparkContext, SparkConf}

// Create Spark configuration and context
val conf = new SparkConf()
  .setAppName("My Spark Application")
  .setMaster("local[*]")

val sc = new SparkContext(conf)

// Create an RDD from a local collection
val data = sc.parallelize(1 to 10000)

// Transform and compute
val squares = data.map(x => x * x)
val sum = squares.reduce(_ + _)

println(s"Sum of squares: $sum")

// Clean up
sc.stop()
```

Java example:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
    .setAppName("My Java Spark Application")
    .setMaster("local[*]");

JavaSparkContext sc = new JavaSparkContext(conf);

// Create an RDD from a list
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> rdd = sc.parallelize(data);

// Transform and collect
JavaRDD<Integer> squares = rdd.map(x -> x * x);
List<Integer> result = squares.collect();

sc.stop();
```

## Architecture

Apache Spark Core is built around several fundamental components:

- **SparkContext**: The main entry point and orchestrator for Spark functionality
- **RDD (Resilient Distributed Dataset)**: The core abstraction for distributed collections with fault tolerance
- **Task Scheduler**: Manages execution of operations across cluster nodes
- **Storage System**: Handles caching, persistence, and data locality optimization
- **Shared Variables**: Broadcast variables and accumulators for efficient data sharing
- **Serialization Framework**: Pluggable serialization for network communication and storage

## Capabilities

### Spark Context and Configuration

Core entry points for creating and configuring Spark applications. SparkContext serves as the main API for creating RDDs and managing the application lifecycle.

```scala { .api }
class SparkContext(config: SparkConf)
class SparkConf(loadDefaults: Boolean = true)
```

[Context and Configuration](./context-config.md)
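
As a brief sketch of how configuration flows into the context (the option keys are standard Spark settings; the values and app name are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Configuration is key-value; keys not set here fall back to
// built-in defaults or spark-defaults.conf. Values are illustrative.
val conf = new SparkConf()
  .setAppName("config-demo")
  .setMaster("local[2]")                        // two local worker threads
  .set("spark.executor.memory", "1g")
  .setIfMissing("spark.ui.enabled", "false")    // only if not already set

val sc = new SparkContext(conf)
try {
  // Settings are readable back from the live context.
  println(sc.getConf.get("spark.executor.memory"))   // 1g
  println(sc.defaultParallelism)                      // 2 with local[2]
} finally {
  sc.stop()   // always release the context
}
```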

### RDD Operations

Fundamental distributed dataset operations including transformations (map, filter, join) and actions (collect, reduce, save). RDDs provide the core abstraction for distributed data processing with automatic fault tolerance.

```scala { .api }
abstract class RDD[T: ClassTag]
def map[U: ClassTag](f: T => U): RDD[U]
def filter(f: T => Boolean): RDD[T]
def collect(): Array[T]
```

[RDD Operations](./rdd-operations.md)
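
The transformation/action split can be sketched against a local context (the dataset is illustrative; lazy evaluation is standard RDD semantics):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("rdd-demo").setMaster("local[2]"))

val nums = sc.parallelize(1 to 100)

// Transformations are lazy: nothing executes yet, Spark only
// records the lineage.
val evens   = nums.filter(_ % 2 == 0)
val squares = evens.map(x => x * x)

// Actions trigger execution of the whole lineage.
val total = squares.reduce(_ + _)   // sum of squares of evens in 1..100
val first = squares.take(3)         // Array(4, 16, 36)

println(s"total=$total")            // total=171700
sc.stop()
```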

### Java API Compatibility

Java-friendly wrappers covering all Spark Core functionality, allowing Java applications to use Spark without loss of features.

```java { .api }
public class JavaSparkContext
public class JavaRDD<T>
public class JavaPairRDD<K, V>
```

[Java API](./java-api.md)

### Storage and Caching

Persistence mechanisms for RDDs including memory, disk, and off-heap storage options. Provides fine-grained control over data persistence with various storage levels and replication strategies.

```scala { .api }
class StorageLevel(useDisk: Boolean, useMemory: Boolean, useOffHeap: Boolean, deserialized: Boolean, replication: Int)
def persist(newLevel: StorageLevel): RDD[T]
def cache(): RDD[T]
```

[Storage and Caching](./storage-caching.md)
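
A minimal sketch of the persistence API in local mode, assuming an illustrative dataset and storage-level choice:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(
  new SparkConf().setAppName("cache-demo").setMaster("local[2]"))

val expensive = sc.parallelize(1 to 1000).map(x => x * x)

// Spill partitions to disk when they do not fit in memory;
// cache() would be shorthand for persist(StorageLevel.MEMORY_ONLY).
expensive.persist(StorageLevel.MEMORY_AND_DISK)

// Both actions below reuse the materialized partitions instead
// of recomputing the map.
val count = expensive.count()   // 1000
val max   = expensive.max()     // 1000000

expensive.unpersist()           // release the cached blocks
sc.stop()
```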

### Shared Variables

Broadcast variables and accumulators for efficient data sharing across distributed computations. Broadcast variables enable read-only sharing of large values across tasks, while accumulators are add-only from tasks, with their aggregated value read back on the driver.

```scala { .api }
abstract class Broadcast[T]
abstract class AccumulatorV2[IN, OUT]
class LongAccumulator extends AccumulatorV2[java.lang.Long, java.lang.Long]
```

[Shared Variables](./shared-variables.md)
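
Both mechanisms together, as a sketch with an illustrative lookup table and counter:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("shared-vars-demo").setMaster("local[2]"))

// Broadcast: ship a read-only lookup table to every executor once.
val countryNames = sc.broadcast(Map("de" -> "Germany", "fr" -> "France"))

// Accumulator: tasks can only add; the driver reads the total.
// (Updates inside transformations may be re-applied on task retries.)
val unknown = sc.longAccumulator("unknown-codes")

val codes = sc.parallelize(Seq("de", "fr", "xx", "de"))
val names = codes.map { code =>
  countryNames.value.getOrElse(code, { unknown.add(1); "unknown" })
}.collect()

println(names.mkString(","))   // Germany,France,unknown,Germany
println(unknown.value)         // 1
sc.stop()
```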

### Task Execution Context

Runtime context and utilities available to running tasks, including partition information, memory management, and lifecycle hooks.

```scala { .api }
abstract class TaskContext
def partitionId(): Int
def stageId(): Int
def taskAttemptId(): Long
```

[Task Context](./task-context.md)
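
A sketch of reading the context from inside a task via `TaskContext.get()` (the partition count and data are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext, TaskContext}

val sc = new SparkContext(
  new SparkConf().setAppName("task-ctx-demo").setMaster("local[2]"))

val tagged = sc.parallelize(1 to 8, numSlices = 4).mapPartitions { iter =>
  // TaskContext.get() is only valid on the executor side, inside a task.
  val ctx = TaskContext.get()
  // Lifecycle hook: runs when this task finishes (e.g. close resources).
  ctx.addTaskCompletionListener[Unit] { _ => () }
  iter.map(x => (ctx.partitionId(), x))
}

// Each element is tagged with the id of the partition that produced it.
val byPartition = tagged.collect().groupBy(_._1)
println(byPartition.keys.toSeq.sorted.mkString(","))   // 0,1,2,3
sc.stop()
```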

### Resource Management

Modern resource allocation and management including resource profiles, executor resource requests, and task resource requirements for heterogeneous workloads.

```scala { .api }
class ResourceProfile(executorResources: Map[String, ExecutorResourceRequest], taskResources: Map[String, TaskResourceRequest])
class ResourceProfileBuilder()
```

[Resource Management](./resource-management.md)
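
A sketch of building a profile for a GPU-heavy stage (the amounts and the `gpu` resource name are illustrative; applying a custom profile requires a cluster manager with dynamic allocation, so this only constructs it):

```scala
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Executors for this profile: 4 cores, 8g heap, 1 GPU each
// (amounts are illustrative).
val executorReqs = new ExecutorResourceRequests()
  .cores(4)
  .memory("8g")
  .resource("gpu", 1)

// Each task in this profile needs 1 CPU and a quarter of a GPU,
// i.e. up to four such tasks can share one GPU.
val taskReqs = new TaskResourceRequests()
  .cpus(1)
  .resource("gpu", 0.25)

val profile = new ResourceProfileBuilder()
  .require(executorReqs)
  .require(taskReqs)
  .build()

// Applied per-stage with: rdd.withResources(profile)
```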

### Serialization Framework

Pluggable serialization system supporting Java serialization and Kryo for optimized network communication and storage. Provides extensible serialization with performance tuning options.

```scala { .api }
abstract class SerializerInstance
class JavaSerializer(conf: SparkConf) extends Serializer
class KryoSerializer(conf: SparkConf) extends Serializer
```

[Serialization](./serialization.md)
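
Serializer selection is driven through `SparkConf`; a sketch, where the registered class is illustrative:

```scala
import org.apache.spark.SparkConf

case class Point(x: Double, y: Double)   // illustrative user class

val conf = new SparkConf()
  .setAppName("kryo-demo")
  // Switch from the default JavaSerializer to Kryo.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes lets Kryo write compact ids instead of
  // full class names in the serialized stream.
  .registerKryoClasses(Array(classOf[Point]))
  // Optionally fail fast when an unregistered class is serialized.
  .set("spark.kryo.registrationRequired", "true")
```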