Apache Spark Core provides the foundational execution engine and API for distributed data processing with RDDs, task scheduling, and cluster management.
npx @tessl/cli install tessl/maven-org-apache-spark--spark-core_2_13@4.0.0
# Apache Spark Core

Apache Spark Core provides the foundational execution engine and API for distributed data processing across clusters. It implements the core distributed computing primitives, including RDDs (Resilient Distributed Datasets), task scheduling, memory management, fault tolerance, and the base APIs that power all other Spark components.

## Package Information

- **Package Name**: spark-core_2.13
- **Package Type**: maven
- **Language**: Scala
- **Installation**: `<dependency><groupId>org.apache.spark</groupId><artifactId>spark-core_2.13</artifactId><version>4.0.0</version></dependency>`
## Core Imports

```scala
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.rdd.RDD
```

For Java:

```java
import org.apache.spark.SparkContext;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
```
## Basic Usage

```scala
import org.apache.spark.{SparkContext, SparkConf}

// Create configuration
val conf = new SparkConf()
  .setAppName("MySparkApp")
  .setMaster("local[*]")

// Create Spark context
val sc = new SparkContext(conf)

// Create RDD from collection
val data = sc.parallelize(Array(1, 2, 3, 4, 5))

// Transformation followed by an action
val squared = data.map(x => x * x)
val result = squared.collect()

// Cleanup
sc.stop()
```
## Architecture

Spark Core is built around several key components:

- **SparkContext**: Main entry point and driver program coordinator
- **RDD Abstraction**: Immutable distributed datasets with lineage tracking
- **Task Scheduler**: Distributes work across cluster nodes and manages execution
- **Block Manager**: Handles data storage, caching, and replication across the cluster
- **Serialization**: Efficient data serialization for network transfer and storage
- **Resource Management**: Integration with cluster managers (YARN, Mesos, Kubernetes), as shown in the sketch below
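These components surface most directly in how an application is configured. As a minimal illustration of the resource-management point, the same application code can target different cluster managers purely through the master URL (host names and ports below are hypothetical):

```scala
import org.apache.spark.SparkConf

// Local development: one JVM, as many worker threads as cores
val localConf = new SparkConf().setAppName("MyApp").setMaster("local[*]")

// Standalone cluster manager (hypothetical host and port)
val clusterConf = new SparkConf().setAppName("MyApp").setMaster("spark://master-host:7077")

// For YARN or Kubernetes the master is usually passed to spark-submit instead,
// e.g. --master yarn or --master k8s://https://api-server:6443
```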
## Capabilities

### Application Context

Core application setup and cluster connection management. SparkContext serves as the primary interface for creating RDDs and configuring distributed execution.

```scala { .api }
class SparkContext(config: SparkConf) {
  def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T]
  def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
  def stop(): Unit
  def broadcast[T: ClassTag](value: T): Broadcast[T]
  def version: String
  def defaultParallelism: Int
}

class SparkConf(loadDefaults: Boolean = true) {
  def set(key: String, value: String): SparkConf
  def setAppName(name: String): SparkConf
  def setMaster(master: String): SparkConf
}
```
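A minimal usage sketch of the API above; the input path `data.txt` is a hypothetical placeholder:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("context-demo").setMaster("local[2]"))

println(s"Spark ${sc.version}, default parallelism ${sc.defaultParallelism}")

val lines = sc.textFile("data.txt")                  // one String per line
val nums  = sc.parallelize(1 to 100, numSlices = 4)  // explicit partition count

sc.stop()
```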
[Application Context](./application-context.md)
### RDD Operations

Resilient Distributed Dataset API providing transformations and actions for distributed data processing. RDDs support fault-tolerant parallel operations on large datasets.

```scala { .api }
abstract class RDD[T: ClassTag] {
  def map[U: ClassTag](f: T => U): RDD[U]
  def filter(f: T => Boolean): RDD[T]
  def flatMap[U: ClassTag](f: T => IterableOnce[U]): RDD[U]
  def collect(): Array[T]
  def count(): Long
  def reduce(f: (T, T) => T): T
  def persist(newLevel: StorageLevel): RDD[T]
  def persist(): RDD[T]
  def cache(): RDD[T]
}
```
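A short sketch chaining these operations, assuming the `sc` from Basic Usage; `cache()` pays off because the RDD feeds multiple actions:

```scala
val words = sc.parallelize(Seq("spark core", "rdd api"))
  .flatMap(_.split(" "))  // transformation: one element per word
  .cache()                // reused by all three actions below

val total   = words.count()                                               // 4
val longest = words.reduce((a, b) => if (a.length >= b.length) a else b)  // "spark"
val short   = words.filter(_.length <= 4).collect()                       // Array(core, rdd, api)
```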
[RDD Operations](./rdd-operations.md)
### Java API

Java-friendly wrappers for Spark functionality providing type-safe distributed processing for Java applications.

```java { .api }
public class JavaSparkContext {
    public <T> JavaRDD<T> parallelize(java.util.List<T> list)
    public <T> JavaRDD<T> parallelize(java.util.List<T> list, int numSlices)
    public JavaRDD<String> textFile(String path)
    public void stop()
}

public class JavaRDD<T> {
    public <R> JavaRDD<R> map(org.apache.spark.api.java.function.Function<T, R> f)
    public JavaRDD<T> filter(org.apache.spark.api.java.function.Function<T, Boolean> f)
    public java.util.List<T> collect()
}
```
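Since this capability targets Java callers, a sketch in Java; lambdas satisfy the `Function` interface, and `JavaSparkContext` is `Closeable`:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.List;

public class JavaApiDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("java-demo").setMaster("local[*]");
        try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
            JavaRDD<Integer> nums = jsc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
            List<Integer> evenSquares = nums
                    .map(x -> x * x)          // Function<Integer, Integer>
                    .filter(x -> x % 2 == 0)  // Function<Integer, Boolean>
                    .collect();               // [4, 16]
            System.out.println(evenSquares);
        }
    }
}
```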
[Java API](./java-api.md)
### Storage and Persistence

Data storage levels and caching mechanisms for optimizing repeated access to RDDs with configurable memory and disk usage.

```scala { .api }
class StorageLevel(
  useDisk: Boolean,
  useMemory: Boolean,
  useOffHeap: Boolean,
  deserialized: Boolean,
  replication: Int
)

object StorageLevel {
  val MEMORY_ONLY: StorageLevel
  val MEMORY_AND_DISK: StorageLevel
  val DISK_ONLY: StorageLevel
}
```
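A small persistence sketch, reusing the `sc` from Basic Usage; `cache()` is shorthand for `persist(StorageLevel.MEMORY_ONLY)` (the log path is hypothetical):

```scala
import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("events.log")  // hypothetical input

// Spill partitions that don't fit in memory to local disk rather than recomputing them
logs.persist(StorageLevel.MEMORY_AND_DISK)

val errors = logs.filter(_.contains("ERROR")).count()  // first action materializes the cache
val total  = logs.count()                              // second action reuses cached blocks

logs.unpersist()  // release the storage when done
```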
[Storage and Persistence](./storage-persistence.md)
### Broadcast Variables

Efficient read-only variable distribution to all cluster nodes for sharing large datasets or lookup tables across tasks.

```scala { .api }
abstract class Broadcast[T: ClassTag] {
  def value: T
  def unpersist(blocking: Boolean = false): Unit
  def destroy(): Unit
}
```
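A sketch of the common lookup-table pattern; the broadcast value ships to each executor once instead of with every task (the table contents are made up):

```scala
val countries = sc.broadcast(Map("DE" -> "Germany", "FR" -> "France"))

val codes = sc.parallelize(Seq("DE", "FR", "DE"))
val named = codes.map(c => countries.value.getOrElse(c, "unknown")).collect()
// Array(Germany, France, Germany)

countries.unpersist()  // drop executor-side copies; destroy() removes the value for good
```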
[Broadcast Variables](./broadcast-variables.md)
### Accumulators

Shared variables for collecting information from distributed tasks, supporting aggregation patterns like counters and sums.

```scala { .api }
abstract class AccumulatorV2[IN, OUT] {
  def add(v: IN): Unit
  def value: OUT
  def reset(): Unit
  def copy(): AccumulatorV2[IN, OUT]
}

class LongAccumulator extends AccumulatorV2[java.lang.Long, java.lang.Long]
class DoubleAccumulator extends AccumulatorV2[java.lang.Double, java.lang.Double]
```
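A counting sketch; in practice accumulators are obtained through SparkContext helpers such as `longAccumulator`, and their values should be read on the driver only after an action has run:

```scala
val badRecords = sc.longAccumulator("badRecords")

val parsed = sc.parallelize(Seq("1", "2", "oops", "4")).flatMap { s =>
  s.toIntOption match {
    case Some(n) => Some(n)
    case None    => badRecords.add(1L); None  // count records that fail to parse
  }
}

parsed.count()             // an action must run before the accumulator is populated
println(badRecords.value)  // 1
```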
[Accumulators](./accumulators.md)
### Partitioning

Data partitioning strategies for controlling how RDD elements are distributed across cluster nodes to optimize performance.

```scala { .api }
abstract class Partitioner extends Serializable {
  def numPartitions: Int
  def getPartition(key: Any): Int
}

class HashPartitioner(partitions: Int) extends Partitioner
class RangePartitioner[K: Ordering: ClassTag, V](
  partitions: Int,
  rdd: RDD[_ <: Product2[K, V]]
) extends Partitioner
```
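A hash-partitioning sketch; `partitionBy` comes from the implicit pair-RDD operations and determines which partition each key lands on:

```scala
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Co-locate all values for the same key on one partition
val partitioned = pairs.partitionBy(new HashPartitioner(4))

println(partitioned.partitioner)       // Some(org.apache.spark.HashPartitioner@...)
println(partitioned.getNumPartitions)  // 4
```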
[Partitioning](./partitioning.md)
### Serialization

Serialization frameworks for efficient data transfer and storage with support for Java serialization and Kryo.

```scala { .api }
abstract class Serializer {
  def newInstance(): SerializerInstance
}

abstract class SerializerInstance {
  def serialize[T: ClassTag](t: T): ByteBuffer
  def deserialize[T: ClassTag](bytes: ByteBuffer): T
}

class JavaSerializer(conf: SparkConf) extends Serializer
class KryoSerializer(conf: SparkConf) extends Serializer
```
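Kryo is normally enabled through configuration rather than instantiated directly; a sketch (`MyRecord` is a placeholder for an application type):

```scala
import org.apache.spark.SparkConf

case class MyRecord(id: Long, name: String)  // placeholder application type

val conf = new SparkConf()
  .setAppName("kryo-demo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

// Registering classes up front lets Kryo emit compact numeric class identifiers
conf.registerKryoClasses(Array(classOf[MyRecord]))
```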
[Serialization](./serialization.md)
## Types

```scala { .api }
trait ClassTag[T]
trait Ordering[T]

abstract class TaskContext {
  def stageId(): Int
  def stageAttemptNumber(): Int
  def partitionId(): Int
  def taskAttemptId(): Long
}

sealed trait TaskEndReason
case object Success extends TaskEndReason
case class ExceptionFailure(
  className: String,
  description: String,
  stackTrace: Array[StackTraceElement]
) extends TaskEndReason
```
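A sketch of reading task metadata from inside a running task; `TaskContext.get()` returns the context of the task executing the current code (and is only meaningful on executors):

```scala
import org.apache.spark.TaskContext

val tagged = sc.parallelize(1 to 10, numSlices = 3).mapPartitions { iter =>
  val ctx = TaskContext.get()  // context of the currently running task
  iter.map(x => (ctx.partitionId(), x))
}

tagged.collect().foreach(println)  // pairs of (partition id, value)
```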