Core functionality for Apache Spark, providing RDDs, SparkContext, and the fundamental distributed computing engine for big data processing.
npx @tessl/cli install tessl/maven-org-apache-spark--spark-core_2-10@1.6.00
# Apache Spark Core

Apache Spark Core provides the foundational distributed computing engine for Apache Spark, implementing core abstractions like Resilient Distributed Datasets (RDDs) that enable fault-tolerant distributed data processing across clusters. It includes the SparkContext for managing distributed applications, schedulers for task execution, serializers for data exchange, broadcast variables for efficient data sharing, accumulators for distributed counters, and comprehensive APIs for data transformations and actions.

## Package Information

- **Package Name**: org.apache.spark:spark-core_2.10
- **Package Type**: Maven
- **Language**: Scala
- **Installation**: Add to your Maven POM or SBT build file
- **Documentation**: http://spark.apache.org/docs/1.6.3/
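
For an SBT build, the coordinates above can be expressed as a single dependency line. This is a minimal sketch: the Spark version is an assumption taken from the documentation link (1.6.3), and the Scala patch version is likewise an assumption. Substitute the releases you actually target.

```scala
// build.sbt: a minimal sketch; both version numbers below are assumptions
scalaVersion := "2.10.6"

// The artifact name already carries the Scala binary version (spark-core_2.10),
// so a single % is used here instead of %%.
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.6.3"
```
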

## Core Imports

```scala
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.rdd.RDD
```

For Java applications:

```java
import org.apache.spark.SparkContext;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.Accumulator;
```

## Basic Usage

```scala
import org.apache.spark.{SparkContext, SparkConf}

// Configure Spark application
val conf = new SparkConf()
  .setAppName("MySparkApp")
  .setMaster("local[*]")

// Create SparkContext
val sc = new SparkContext(conf)

// Create RDD from collection
val data = Array(1, 2, 3, 4, 5)
val rdd = sc.parallelize(data)

// Transform and act on RDD
val result = rdd
  .map(_ * 2)
  .filter(_ > 5)
  .collect() // Array(6, 8, 10)

// Broadcast variable
val broadcastVar = sc.broadcast(Array(1, 2, 3))

// Accumulator
val accum = sc.accumulator(0)

// Clean up
sc.stop()
```

## Architecture

Apache Spark Core is built around several key components:

- **SparkContext**: Main entry point and driver that coordinates distributed execution
- **RDD Abstraction**: Resilient Distributed Datasets providing fault-tolerant distributed collections
- **Lazy Evaluation**: Transformations are only recorded when declared; nothing executes until an action is called
- **DAG Scheduler**: Converts logical execution plans into physical stages
- **Task Scheduler**: Manages task execution across cluster nodes
- **Storage System**: Manages caching and persistence of RDD partitions
- **Shuffle System**: Handles data redistribution across cluster nodes
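
The lazy-evaluation behaviour listed above can be observed directly. The following is a minimal sketch, assuming a local master with two threads; the application name and variable names are illustrative only. An accumulator counts how often the `map` function runs, which stays at zero until an action triggers a job.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("LazyEvaluationSketch").setMaster("local[2]")
val sc = new SparkContext(conf)

// Counts how many times the map function actually runs.
val mapCalls = sc.accumulator(0)

// Transformation only: this records lineage but schedules no job.
val doubled = sc.parallelize(Seq(1, 2, 3, 4)).map { x =>
  mapCalls += 1
  x * 2
}
val callsBeforeAction = mapCalls.value // nothing has executed yet

// Action: the schedulers now turn the lineage into stages and tasks.
val total = doubled.reduce(_ + _)
val callsAfterAction = mapCalls.value // one call per element in this failure-free local run

// Print the lineage that the scheduler works from.
println(doubled.toDebugString)

sc.stop()
```

Accumulator updates made inside transformations can be applied more than once if a task is re-executed, so this counting trick is only reliable for a small local run like this one.
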

## Capabilities

### SparkContext and Configuration

Core entry point for Spark applications with configuration management and resource coordination. Essential for creating RDDs, managing cluster connections, and coordinating distributed execution.

```scala { .api }
class SparkContext(config: SparkConf)
class SparkConf()

// Core RDD creation methods
def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T]
def range(start: Long, end: Long, step: Long = 1, numSlices: Int = defaultParallelism): RDD[Long]
def makeRDD[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T]
def emptyRDD[T: ClassTag]: RDD[T]

// File I/O methods
def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
def wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)]
def binaryFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, PortableDataStream)]
def objectFile[T: ClassTag](path: String, minPartitions: Int = defaultMinPartitions): RDD[T]

// Shared variables
def broadcast[T: ClassTag](value: T): Broadcast[T]
def accumulator[T](initialValue: T)(implicit param: AccumulatorParam[T]): Accumulator[T]

// Context management
def stop(): Unit
def setCheckpointDir(directory: String): Unit
```

[SparkContext and Configuration](./spark-context.md)
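
As an illustration of the creation methods listed above, here is a minimal sketch that builds a context and a few small RDDs. The application name, the local master setting, and the choice to disable the web UI are assumptions made for a short-lived example, not recommendations.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ContextSketch")
  .setMaster("local[2]")            // two local worker threads
  .set("spark.ui.enabled", "false") // skip the web UI for a short-lived job
val sc = new SparkContext(conf)

// parallelize with an explicit number of partitions
val fourSlices = sc.parallelize(1 to 10, numSlices = 4)
val sliceCount = fourSlices.partitions.length

// range is end-exclusive and produces Longs
val evens = sc.range(0L, 10L, step = 2L).collect()

// an RDD with no partitions and no elements
val emptyCount = sc.emptyRDD[Int].count()

sc.stop()
```
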

### RDD Operations and Transformations

Resilient Distributed Datasets providing the core abstraction for distributed data processing with transformations, actions, and persistence capabilities.

```scala { .api }
abstract class RDD[T: ClassTag]

// Core transformations (lazy evaluation)
def map[U: ClassTag](f: T => U): RDD[U]
def filter(f: T => Boolean): RDD[T]
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]
def union(other: RDD[T]): RDD[T]
def distinct(): RDD[T]
def sample(withReplacement: Boolean, fraction: Double, seed: Long): RDD[T]

// Advanced transformations
def sortBy[K](f: T => K, ascending: Boolean = true)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]
def keyBy[K](f: T => K): RDD[(K, T)]
def zipWithIndex(): RDD[(T, Long)]
def zipWithUniqueId(): RDD[(T, Long)]

// Core actions
def collect(): Array[T]
def count(): Long
def first(): T
def take(num: Int): Array[T]
def reduce(f: (T, T) => T): T
def fold(zeroValue: T)(op: (T, T) => T): T
def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U

// Persistence methods
def persist(newLevel: StorageLevel): RDD[T]
def cache(): RDD[T]
def unpersist(blocking: Boolean = true): RDD[T]
```

[RDD Operations](./rdd-operations.md)
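
To show how the transformations and actions above combine, the following minimal sketch runs a small word pipeline. The input sentences, names, and local master are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("RddOpsSketch").setMaster("local[2]"))

val lines = sc.parallelize(Seq("to be or", "not to be"))

// Transformations: describe the pipeline; cache() marks the words for reuse.
val words = lines.flatMap(_.split(" ")).cache()
val vocabulary = words.distinct().sortBy(w => w)

// Actions: each of these triggers a job.
val wordCount = words.count()
val sortedVocabulary = vocabulary.collect()

// aggregate folds each partition into a (sum, count) pair, then merges the pairs.
val (lengthSum, n) = words.map(_.length).aggregate((0, 0))(
  (acc, len) => (acc._1 + len, acc._2 + 1),
  (a, b) => (a._1 + b._1, a._2 + b._2)
)
val meanLength = lengthSum.toDouble / n

words.unpersist()
sc.stop()
```

Caching `words` is worthwhile here only because several actions reuse it; for a single action it would add overhead without benefit.
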

### Key-Value Pair Operations

Specialized operations for RDDs containing key-value pairs, including joins, grouping, and aggregation operations essential for data processing workflows.

```scala { .api }
class PairRDDFunctions[K, V](self: RDD[(K, V)])

def groupByKey(): RDD[(K, Iterable[V])]
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]
```

[Key-Value Operations](./key-value-operations.md)
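
A minimal sketch of the pair operations above, using made-up sales and price data. The data set and all names are hypothetical; the pair methods become available on any `RDD[(K, V)]` through an implicit conversion.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("PairOpsSketch").setMaster("local[2]"))

val sales = sc.parallelize(Seq(("apple", 2), ("pear", 1), ("apple", 3)))
val prices = sc.parallelize(Seq(("apple", 0.5), ("pear", 0.25)))

// reduceByKey combines values per key, merging within each partition before the shuffle.
val quantityByItem = sales.reduceByKey(_ + _)
val totals = quantityByItem.collect().toMap

// join keeps only keys present on both sides.
val revenue = quantityByItem
  .join(prices)
  .mapValues { case (qty, price) => qty * price }
  .collect()
  .toMap

// groupByKey keeps every value; the order inside a group is not guaranteed, hence the sort.
val grouped = sales.groupByKey().mapValues(_.toList.sorted).collect().toMap

sc.stop()
```

When only an aggregate per key is needed, `reduceByKey` is usually preferable to `groupByKey` because it shuffles partial results rather than every value.
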

### Shared Variables

Broadcast variables and accumulators for efficient data sharing and distributed counting across cluster nodes.

```scala { .api }
abstract class Broadcast[T]
class Accumulator[T]

// SparkContext methods for shared variables
def broadcast[T: ClassTag](value: T): Broadcast[T]
def accumulator[T](initialValue: T)(implicit param: AccumulatorParam[T]): Accumulator[T]
```

[Shared Variables](./shared-variables.md)
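
The sketch below illustrates both kinds of shared variable under stated assumptions: a small lookup table is broadcast to the tasks, and an accumulator counts codes missing from it. The country table and all names are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("SharedVariablesSketch").setMaster("local[2]"))

// Read-only lookup table shipped once per executor instead of once per task.
val countryNames = sc.broadcast(Map("de" -> "Germany", "fr" -> "France"))

// Counter that tasks can only add to; the driver reads the total.
val unknownCodes = sc.accumulator(0)

val codes = sc.parallelize(Seq("de", "fr", "xx", "de"))

// Tasks read the broadcast value through .value
val names = codes.map(code => countryNames.value.getOrElse(code, "unknown")).collect()

// Updating the accumulator inside an action (foreach) rather than a transformation
codes.foreach { code =>
  if (!countryNames.value.contains(code)) unknownCodes += 1
}
val unknownTotal = unknownCodes.value

sc.stop()
```
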

### Storage and Persistence

Storage levels and caching mechanisms for optimizing RDD persistence across memory and disk with various replication strategies.

```scala { .api }
object StorageLevel

// Storage level constants
val MEMORY_ONLY: StorageLevel
val MEMORY_AND_DISK: StorageLevel
val DISK_ONLY: StorageLevel

// RDD persistence methods
def persist(newLevel: StorageLevel): RDD[T]
def cache(): RDD[T]
def unpersist(blocking: Boolean = true): RDD[T]
```

[Storage and Persistence](./storage-persistence.md)
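
A minimal sketch of the persistence lifecycle described above: pick a storage level, reuse the persisted RDD across two actions, then release it. The data and names are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(
  new SparkConf().setAppName("PersistenceSketch").setMaster("local[2]"))

// persist only marks the RDD; partitions are stored when first computed.
val squares = sc.parallelize(1 to 1000)
  .map(x => x.toLong * x)
  .persist(StorageLevel.MEMORY_AND_DISK)

val sumOfSquares = squares.reduce(_ + _) // first action computes and stores the partitions
val largest = squares.max()              // second action can reuse the stored partitions

val levelWhilePersisted = squares.getStorageLevel

// Release the stored partitions once they are no longer needed.
squares.unpersist()
val levelAfterUnpersist = squares.getStorageLevel

sc.stop()
```
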

### Input/Output Operations

File I/O operations for reading from and writing to various data sources including text files, sequence files, and Hadoop-compatible formats.

```scala { .api }
// SparkContext I/O methods
def textFile(path: String): RDD[String]
def wholeTextFiles(path: String): RDD[(String, String)]
def sequenceFile[K, V](path: String): RDD[(K, V)]

// RDD output methods
def saveAsTextFile(path: String): Unit
def saveAsSequenceFile(path: String): Unit
```

[Input/Output Operations](./io-operations.md)
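
As a minimal round-trip sketch of the text I/O methods above, the following writes an RDD to a temporary local directory and reads it back. The temporary path and names are assumptions for illustration; on a cluster the path would typically point at a shared file system.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("TextIoSketch").setMaster("local[2]"))

// saveAsTextFile requires that the output directory does not exist yet,
// so a fresh child path under a temporary directory is used.
val outDir = Files.createTempDirectory("spark-io-sketch").resolve("lines").toString

// Each partition is written as its own part file inside outDir.
sc.parallelize(Seq("alpha", "beta", "gamma"), 2).saveAsTextFile(outDir)

// textFile accepts a directory and reads every part file in it.
val readBack = sc.textFile(outDir).collect().sorted

sc.stop()
```
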

### Partitioning and Shuffling

Partitioning strategies and shuffle operations for controlling data distribution and optimizing performance across cluster nodes.

```scala { .api }
abstract class Partitioner
class HashPartitioner(partitions: Int) extends Partitioner
class RangePartitioner[K, V](partitions: Int, rdd: RDD[(K, V)]) extends Partitioner

// Partitioning methods
def partitionBy(partitioner: Partitioner): RDD[(K, V)]
def repartition(numPartitions: Int): RDD[T]
def coalesce(numPartitions: Int): RDD[T]
```

[Partitioning and Shuffling](./partitioning-shuffling.md)
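
The effect of a partitioner can be inspected with `glom()`, which exposes each partition as an array. This is a minimal sketch with hypothetical data; with integer keys, a `HashPartitioner` places a key in the partition given by the key modulo the partition count.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("PartitioningSketch").setMaster("local[2]"))

val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c"), (4, "d")), 4)

// Shuffle the pairs into two partitions by key hash.
val byKey = pairs.partitionBy(new HashPartitioner(2))

// Keys found in each partition, sorted because order within a partition is not guaranteed.
val keysPerPartition = byKey.glom().map(_.map(_._1).sorted.toList).collect().toList

// repartition always shuffles; coalesce to fewer partitions avoids a shuffle by default.
val widened = byKey.repartition(8).partitions.length
val narrowed = byKey.coalesce(1).partitions.length

sc.stop()
```
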

## Types

```scala { .api }
// Core type aliases and abstractions
type Partition = org.apache.spark.Partition
type TaskContext = org.apache.spark.TaskContext
type SparkFiles = org.apache.spark.SparkFiles.type

// Function shapes for Java interop, corresponding to the interfaces
// in org.apache.spark.api.java.function
type Function[T, R] = T => R
type Function2[T1, T2, R] = (T1, T2) => R
type VoidFunction[T] = T => Unit
```
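
Of the types above, `TaskContext` is the one most often touched in user code. The following minimal sketch, with illustrative data and names, tags each element with the ID of the partition whose task processed it.

```scala
import org.apache.spark.{SparkConf, SparkContext, TaskContext}

val sc = new SparkContext(
  new SparkConf().setAppName("TaskContextSketch").setMaster("local[2]"))

val tagged = sc.parallelize(Seq("a", "b", "c", "d"), 2)
  .mapPartitions { iter =>
    // TaskContext.get() returns the context of the task running this partition.
    val partitionId = TaskContext.get().partitionId()
    iter.map(elem => (partitionId, elem))
  }
  .collect()

sc.stop()
```
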