Apache Spark Core provides distributed computing capabilities with RDDs, task scheduling, and cluster management for big data processing
npx @tessl/cli install tessl/maven-org-apache-spark--spark-core@2.2.0
# Apache Spark Core

Apache Spark Core is the foundational engine for large-scale distributed data processing. It provides resilient distributed datasets (RDDs), in-memory computing capabilities, and a unified execution engine for batch and interactive data processing across clusters.

## Package Information

- **Package Name**: org.apache.spark:spark-core_2.11
- **Package Type**: Maven
- **Language**: Scala (with Java API)
- **Installation**:
```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.2.3</version>
</dependency>
```

## Core Imports

Scala:
```scala { .api }
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.util.{LongAccumulator, DoubleAccumulator, AccumulatorV2}
```

Java:
```java { .api }
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.SparkConf;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.util.LongAccumulator;
```

## Basic Usage
```scala
import org.apache.spark.{SparkContext, SparkConf}

// Create Spark configuration
val conf = new SparkConf()
  .setAppName("My Spark Application")
  .setMaster("local[*]")

// Create Spark context
val sc = new SparkContext(conf)

// Create RDD from collection
val data = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Transform and compute
val result = data
  .filter(_ % 2 == 0)
  .map(_ * 2)
  .collect()

// Clean up
sc.stop()
```

## Architecture
Apache Spark Core is built around several key abstractions:

- **SparkContext**: The main entry point that coordinates distributed computing
- **RDD (Resilient Distributed Dataset)**: Immutable, fault-tolerant collections distributed across cluster nodes
- **Transformations**: Lazy operations that create new RDDs (map, filter, groupBy)
- **Actions**: Operations that trigger computation and return results (collect, count, save); see the sketch below
- **Broadcast Variables**: Read-only variables cached on all nodes
- **Accumulators**: Variables for aggregating information across tasks
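
To make the transformation/action split concrete, here is a minimal sketch; the application name, the local[*] master, and the sample numbers are illustrative and not part of any API described here.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("laziness-demo").setMaster("local[*]"))

// Transformations: nothing is computed yet, only the lineage is recorded
val evens   = sc.parallelize(1 to 10).filter(_ % 2 == 0)
val doubled = evens.map(_ * 2)

// Actions: each call triggers a distributed job
println(doubled.count())                   // 5
println(doubled.collect().mkString(", "))  // 4, 8, 12, 16, 20

sc.stop()
```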
## Capabilities

### Core SparkContext and Configuration

The SparkContext serves as the primary entry point for all Spark functionality, providing methods for creating RDDs, managing cluster resources, and configuring applications.

```scala { .api }
class SparkContext(config: SparkConf) {
  def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T]
  def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
  def stop(): Unit
}

class SparkConf(loadDefaults: Boolean = true) {
  def setAppName(name: String): SparkConf
  def setMaster(master: String): SparkConf
  def set(key: String, value: String): SparkConf
}
```
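A minimal sketch of wiring these two classes together; the application name, master URL, configuration key, and file path are placeholders rather than part of the signatures above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("line-count")            // placeholder application name
  .setMaster("local[2]")               // placeholder master URL; a cluster URL works the same way
  .set("spark.ui.enabled", "false")    // example of an arbitrary key/value setting

val sc = new SparkContext(conf)

val lines = sc.textFile("data/input.txt")   // placeholder path (local file, HDFS, S3, ...)
println(s"lines: ${lines.count()}")

sc.stop()
```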
[Core SparkContext and Configuration](./spark-context.md)

### RDD Operations and Transformations

RDDs provide the core abstraction for distributed data processing with lazy transformations and eager actions that enable fault-tolerant computation across cluster nodes.

```scala { .api }
abstract class RDD[T: ClassTag] {
  def map[U: ClassTag](f: T => U): RDD[U]
  def filter(f: T => Boolean): RDD[T]
  def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]
  def groupBy[K](f: T => K): RDD[(K, Iterable[T])]
  def collect(): Array[T]
  def count(): Long
  def reduce(f: (T, T) => T): T
}
```
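As an example, the operations above compose into a small word-count style pipeline. This is a hedged sketch; the sample sentences and the local master are illustrative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-ops-sketch").setMaster("local[*]"))

val lines = sc.parallelize(Seq("spark makes rdds", "rdds are resilient"))

// flatMap, groupBy and map build a lazy lineage; collect() actually runs the job
val counts = lines
  .flatMap(_.split("\\s+"))
  .groupBy(identity)
  .map { case (word, occurrences) => (word, occurrences.size) }
  .collect()

counts.foreach { case (word, n) => println(s"$word -> $n") }

sc.stop()
```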
[RDD Operations and Transformations](./rdd-operations.md)

### Java API

Java-friendly wrappers (JavaSparkContext, JavaRDD, JavaPairRDD) expose the same distributed computing functionality to Java applications, without requiring Scala.

```java { .api }
public class JavaSparkContext {
    public JavaSparkContext(SparkConf conf)
    public <T> JavaRDD<T> parallelize(List<T> list)
    public JavaRDD<String> textFile(String path)
    public void close()
}

public class JavaRDD<T> {
    public <R> JavaRDD<R> map(Function<T, R> f)
    public JavaRDD<T> filter(Function<T, Boolean> f)
    public List<T> collect()
}
```
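A hedged Java sketch of the same pattern, using only the wrapper API shown above; the class name, application name, and sample numbers are illustrative.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.List;

public class JavaApiSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("java-api-sketch").setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        JavaRDD<Integer> numbers = jsc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // Java 8 lambdas satisfy Spark's Function interfaces
        List<Integer> evensDoubled = numbers
                .filter(n -> n % 2 == 0)
                .map(n -> n * 2)
                .collect();

        System.out.println(evensDoubled);
        jsc.close();
    }
}
```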
[Java API](./java-api.md)

### Storage and Persistence

Storage and persistence mechanisms allow RDDs to be cached in memory or persisted to disk at configurable storage levels, so repeated actions reuse computed partitions instead of recomputing the lineage.

```scala { .api }
abstract class RDD[T] {
  def persist(newLevel: StorageLevel): this.type
  def cache(): this.type
  def unpersist(blocking: Boolean = true): this.type
}

object StorageLevel {
  val MEMORY_ONLY: StorageLevel
  val MEMORY_AND_DISK: StorageLevel
  val DISK_ONLY: StorageLevel
}
```
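A sketch of how these calls combine; the generated data and the master URL are placeholders. An RDD that is reused by several actions is persisted once and released when no longer needed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("persistence-sketch").setMaster("local[*]"))

val parsed = sc.parallelize(1 to 1000000).map(i => i * 2L)

// Keep computed partitions around; spill to disk if they do not fit in memory
parsed.persist(StorageLevel.MEMORY_AND_DISK)

println(parsed.count())          // first action materializes and caches the data
println(parsed.reduce(_ + _))    // second action reuses the cached partitions

parsed.unpersist()
sc.stop()
```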
[Storage and Persistence](./storage-persistence.md)

### Broadcasting and Accumulators

Shared variables cover two needs: broadcast variables distribute read-only data to every node once, and accumulators aggregate values from tasks back to the driver, both avoiding repeated per-task network transfers.

```scala { .api }
abstract class Broadcast[T] {
  def value: T
  def unpersist(): Unit
  def destroy(): Unit
}

abstract class AccumulatorV2[IN, OUT] {
  def add(v: IN): Unit
  def value: OUT
  def isZero: Boolean
}
```
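A short sketch combining both kinds of shared variable; the lookup table and the unknown-code counter are invented for illustration. Executors read the broadcast value through value, while the accumulator obtained from SparkContext.longAccumulator is updated inside tasks and read back on the driver.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("shared-vars-sketch").setMaster("local[*]"))

// Broadcast a small lookup table once instead of shipping it with every task
val countryNames = sc.broadcast(Map("DE" -> "Germany", "FR" -> "France"))

// Accumulator for counting records with an unknown country code
val unknown = sc.longAccumulator("unknown country codes")

val codes = sc.parallelize(Seq("DE", "FR", "XX", "DE"))
val resolved = codes.map { code =>
  countryNames.value.getOrElse(code, {
    unknown.add(1)
    "unknown"
  })
}

println(resolved.collect().mkString(", "))   // the action runs the job and updates the accumulator
println(s"unknown codes: ${unknown.value}")

sc.stop()
```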
[Broadcasting and Accumulators](./broadcasting-accumulators.md)

## Core Types

```scala { .api }
// Core cluster management
abstract class TaskContext {
  def stageId(): Int
  def partitionId(): Int
  def taskAttemptId(): Long
}

// Partitioning
abstract class Partitioner {
  def numPartitions: Int
  def getPartition(key: Any): Int
}

// Asynchronous operations
trait FutureAction[T] {
  def cancel(): Unit
  def isCompleted: Boolean
  def result(atMost: Duration): T
}
```