Apache Spark Core provides the fundamental distributed computing infrastructure for large-scale data processing with RDDs, SparkContext, and cluster management
npx @tessl/cli install tessl/maven-org-apache-spark--spark-core_2-13@3.5.6
# Apache Spark Core

Apache Spark Core provides the fundamental distributed computing infrastructure for large-scale data processing. It implements the core Spark execution model with Resilient Distributed Datasets (RDDs) as the primary abstraction for fault-tolerant distributed collections, SparkContext as the main entry point, and a sophisticated task scheduler for efficient cluster computation.

## Package Information

- **Package Name**: org.apache.spark:spark-core_2.13
- **Package Type**: Maven
- **Language**: Scala (with Java/Python/R APIs)
- **Version**: 3.5.6
- **Installation**: Add to Maven dependencies or SBT build

Maven:
```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.13</artifactId>
  <version>3.5.6</version>
</dependency>
```

SBT:
```scala
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.6"
```

## Core Imports

```scala
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.rdd.RDD
```

For broadcast variables and accumulators:
```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.util.AccumulatorV2
```

## Basic Usage

```scala
import org.apache.spark.{SparkContext, SparkConf}

// Create Spark configuration
val conf = new SparkConf()
  .setAppName("MySparkApp")
  .setMaster("local[*]")

// Initialize SparkContext
val sc = new SparkContext(conf)

// Create RDD from collection
val data = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Transform and collect results
val doubled = data.map(_ * 2).filter(_ > 4)
val result = doubled.collect()

// Clean up
sc.stop()
```

## Architecture

Spark Core implements a distributed computing framework with several key components:

- **SparkContext**: The main entry point that coordinates cluster resources and manages the application lifecycle
- **RDD (Resilient Distributed Dataset)**: Immutable, fault-tolerant distributed collections that form the core abstraction
- **DAG Scheduler**: Optimizes computation graphs and creates stages for efficient execution
- **Task Scheduler**: Distributes tasks across cluster nodes and handles task failures
- **Cluster Manager**: Interfaces with YARN, Mesos, Kubernetes, or standalone cluster managers
- **Storage System**: Manages memory and disk storage for cached RDDs and shuffle data

This architecture enables fault-tolerant distributed computing with automatic recovery, lineage tracking, and optimized data locality.
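
To make lineage tracking concrete, here is a minimal sketch (local master, made-up data, an illustrative app name) that builds a short RDD chain and prints the lineage Spark keeps for recomputing lost partitions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative local application; app name and data are made up for this sketch
val conf = new SparkConf().setAppName("lineage-demo").setMaster("local[*]")
val sc = new SparkContext(conf)

val words = sc.parallelize(Seq("spark", "core", "rdd", "spark"))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

// Each RDD records how it was derived from its parents; a lost partition
// is recomputed from this lineage instead of being replicated up front
println(counts.toDebugString)

sc.stop()
```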
## Capabilities

### SparkContext and Configuration

The main entry point for Spark applications, providing methods to create RDDs, manage configuration, and control cluster resources.

```scala { .api }
class SparkContext(config: SparkConf)
class SparkConf(loadDefaults: Boolean = true)

// Key SparkContext methods
def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T]
def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
def broadcast[T: ClassTag](value: T): Broadcast[T]
def longAccumulator(name: String): LongAccumulator
def stop(): Unit
```
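
As a quick, non-authoritative sketch of how these entry points fit together (the app name, data, and `local[2]` master are illustrative), a driver builds a `SparkConf`, creates the `SparkContext`, and uses it to create RDDs, broadcasts, and accumulators:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("context-demo").setMaster("local[2]")
val sc = new SparkContext(conf)

// Distribute a local collection across four partitions
val numbers = sc.parallelize(1 to 100, numSlices = 4)

// Ship a small read-only lookup table to every executor once
val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two"))

// Count matching elements on the executors, read the total on the driver
val evens = sc.longAccumulator("evens")
numbers.foreach(n => if (n % 2 == 0) evens.add(1))

println(s"${lookup.value(1)}, even count: ${evens.value}")
sc.stop()
```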

[SparkContext and Configuration](./sparkcontext.md)

### RDD Operations

Core distributed dataset operations including transformations (lazy) and actions (eager execution).

```scala { .api }
abstract class RDD[T: ClassTag]

// Core transformations
def map[U: ClassTag](f: T => U): RDD[U]
def filter(f: T => Boolean): RDD[T]
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]
def union(other: RDD[T]): RDD[T]
def distinct(): RDD[T]

// Core actions
def collect(): Array[T]
def count(): Long
def reduce(f: (T, T) => T): T
def foreach(f: T => Unit): Unit
```
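
A short sketch of how these compose (reusing the `sc` from Basic Usage, with made-up strings): transformations such as `filter` and `flatMap` only describe the computation; nothing executes until an action such as `count` or `collect` runs.

```scala
val lines = sc.parallelize(Seq("spark core", "rdd basics", "spark shuffle"))

// Transformations are lazy: these lines only build a lineage graph
val sparkLines = lines.filter(_.contains("spark"))
val words = sparkLines.flatMap(_.split(" ")).distinct()

// Actions trigger execution
println(words.count())          // 3 distinct words: spark, core, shuffle
println(words.collect().toList) // materializes the results on the driver
```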

[RDD Operations](./rdd-operations.md)

### Key-Value Operations

Specialized operations available on RDDs of key-value pairs, including joins, grouping, and aggregations.

```scala { .api }
// Available on RDD[(K, V)] via implicit conversion
def groupByKey(): RDD[(K, Iterable[V])]
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]
def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)]
```
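
A brief sketch (again reusing `sc`, with invented data): a pair RDD is just an `RDD[(K, V)]`, and the key-value methods become available through the implicit conversion to `PairRDDFunctions`.

```scala
val sales   = sc.parallelize(Seq(("apples", 3), ("pears", 2), ("apples", 5)))
val regions = sc.parallelize(Seq(("apples", "north"), ("pears", "south")))

// Combine values per key on each partition before shuffling
val totals = sales.reduceByKey(_ + _) // ("apples", 8), ("pears", 2)

// Inner join on the key
val joined = totals.join(regions)     // ("apples", (8, "north")), ("pears", (2, "south"))

joined.collect().foreach(println)
```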

[Key-Value Operations](./key-value-operations.md)

### Broadcast Variables and Accumulators

Shared variables for efficient data distribution and accumulation across cluster nodes.

```scala { .api }
abstract class Broadcast[T: ClassTag]
def value: T
def unpersist(): Unit

abstract class AccumulatorV2[IN, OUT]
def add(v: IN): Unit
def value: OUT
def reset(): Unit
```
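
A small sketch (illustrative lookup table and codes, reusing `sc`) contrasting the two: the broadcast value is read on executors, while the accumulator is written on executors and read back on the driver.

```scala
// Broadcast: ship a read-only lookup table to every executor once
val countryNames = sc.broadcast(Map("DE" -> "Germany", "FR" -> "France"))

val codes    = sc.parallelize(Seq("DE", "FR", "XX"))
val resolved = codes.map(code => countryNames.value.getOrElse(code, "unknown"))
println(resolved.collect().toList) // List(Germany, France, unknown)

// Accumulator: executors add, the driver reads the merged result.
// Updating inside an action (foreach) keeps the count reliable across task retries.
val unknown = sc.longAccumulator("unknown country codes")
codes.foreach(code => if (!countryNames.value.contains(code)) unknown.add(1))
println(unknown.value) // 1
```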

[Broadcast Variables and Accumulators](./broadcast-accumulators.md)

### Data I/O and Persistence

Input/output operations for various data sources and RDD caching strategies.

```scala { .api }
// Input operations
def textFile(path: String): RDD[String]
def sequenceFile[K, V](path: String): RDD[(K, V)]
def hadoopRDD[K, V](conf: JobConf, inputFormat: Class[_ <: InputFormat[K, V]]): RDD[(K, V)]

// Persistence
def persist(newLevel: StorageLevel): RDD[T]
def cache(): RDD[T]
def unpersist(): RDD[T]
```
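
A sketch of a typical read-filter-cache-write flow; the HDFS paths here are hypothetical placeholders, and `sc` is the context from Basic Usage.

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical input path; point this at a real file or directory
val logs = sc.textFile("hdfs:///data/app/logs/*.log")

// Cache the filtered RDD because two actions below reuse it
val errors = logs.filter(_.contains("ERROR")).persist(StorageLevel.MEMORY_AND_DISK)

println(errors.count())                          // first action computes and caches
errors.saveAsTextFile("hdfs:///data/app/errors") // second action reuses the cache

errors.unpersist()
```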

[Data I/O and Persistence](./data-io-persistence.md)

## Types

```scala { .api }
// Core configuration
class SparkConf(loadDefaults: Boolean = true) {
  def set(key: String, value: String): SparkConf
  def setAppName(name: String): SparkConf
  def setMaster(master: String): SparkConf
  def get(key: String): String
}

// Storage levels
object StorageLevel {
  val NONE: StorageLevel
  val MEMORY_ONLY: StorageLevel
  val MEMORY_AND_DISK: StorageLevel
  val MEMORY_ONLY_SER: StorageLevel
  val DISK_ONLY: StorageLevel
}

// Partitioning
abstract class Partitioner {
  def numPartitions: Int
  def getPartition(key: Any): Int
}

case class HashPartitioner(partitions: Int) extends Partitioner

// Binary data representation
class PortableDataStream(
  isDirectory: Boolean,
  path: String,
  length: Long,
  modificationTime: Long
) {
  def open(): DataInputStream
  def toArray(): Array[Byte]
}
```
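
A closing sketch (invented data, reusing `sc`) of how these types are commonly combined: an explicit `HashPartitioner` for a pair RDD plus a `StorageLevel` for caching.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Hash-partition by key and cache, so later key-based operations
// on this RDD can avoid another shuffle
val partitioned = pairs
  .partitionBy(new HashPartitioner(8))
  .persist(StorageLevel.MEMORY_ONLY)

println(partitioned.partitioner)                         // Some(org.apache.spark.HashPartitioner@...)
println(partitioned.reduceByKey(_ + _).collect().toList) // List((a,4), (b,2))
```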