Apache Spark is a unified analytics engine for large-scale data processing with high-level APIs in Scala, Java, Python, and R.
npx @tessl/cli install tessl/maven-org-apache-spark--spark-parent-2-13@4.0.0
# Apache Spark

Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R (deprecated), and an optimized engine that supports general computation graphs for data analysis. Spark includes specialized tools for SQL and DataFrames, machine learning (MLlib), graph processing (GraphX), and stream processing.

## Overview

Spark's key features include:

- **Unified Processing**: Single engine for batch, interactive, real-time, and machine learning workloads
- **High Performance**: In-memory computing with an advanced DAG execution engine
- **Ease of Use**: Simple APIs in multiple languages with 80+ high-level operators
- **Scalability**: Runs on anything from a laptop to a cluster with thousands of nodes
- **Advanced Analytics**: Built-in modules for SQL, streaming, machine learning, and graph processing

## Package Information

- **Package Name**: org.apache.spark:spark-parent_2.13
- **Package Type**: Maven
- **Language**: Scala/Java
- **Installation**: Add to Maven/SBT dependencies or download distribution
- **Maven Coordinates**: `org.apache.spark:spark-core_2.13:4.0.0`

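As an illustration of the SBT route, a `build.sbt` fragment might look like the sketch below. The Scala patch version and the inclusion of the `spark-sql` module are assumptions made for this example; only the `spark-core` coordinates come from the list above.

```scala
// build.sbt (sketch). The exact 2.13.x patch version is an assumption.
ThisBuild / scalaVersion := "2.13.16"

libraryDependencies ++= Seq(
  // %% appends the Scala binary version, giving spark-core_2.13
  "org.apache.spark" %% "spark-core" % "4.0.0",
  // Assumed extra module, needed only if you use SparkSession/DataFrames
  "org.apache.spark" %% "spark-sql" % "4.0.0"
)
```
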
## Core Imports

For Scala applications:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD
```

For Java applications:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
```

## Basic Usage

### Core RDD API (Scala)

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Create Spark configuration and context
val conf = new SparkConf().setAppName("MyApp").setMaster("local[*]")
val sc = new SparkContext(conf)

// Create RDD from collection
val data = Array(1, 2, 3, 4, 5)
val rdd = sc.parallelize(data)

// Transform and collect results
val result = rdd.map(_ * 2).filter(_ > 5).collect()

sc.stop()
```

### SQL and DataFrames (Scala)

```scala
import org.apache.spark.sql.SparkSession

// Create Spark session
val spark = SparkSession.builder()
  .appName("MyApp")
  .master("local[*]")
  .getOrCreate()

// Read data into DataFrame
val df = spark.read.json("path/to/data.json")

// DataFrame operations
df.select("name", "age")
  .filter(df("age") > 21)
  .show()

// SQL queries against a temporary view
df.createOrReplaceTempView("people")
val adults = spark.sql("SELECT name FROM people WHERE age >= 18")
adults.show()

spark.stop()
```

## Architecture

Apache Spark follows a driver-executor architecture:

- **Driver Program**: Contains the main function and defines RDDs/DataFrames
- **Cluster Manager**: Allocates resources across applications (Standalone, YARN, Kubernetes)
- **Executors**: Worker processes that run tasks and store data
- **Tasks**: Units of work sent to executors

Key architectural components:

- **Catalyst Optimizer**: Rule-based and cost-based query optimization
- **Tungsten**: Explicit (optionally off-heap) memory management and code generation
- **Resilient Distributed Datasets (RDDs)**: Fault-tolerant distributed collections
- **DataFrames/Datasets**: Structured data APIs built on RDDs

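The division of labor described above can be observed in a small local session. The following is a minimal sketch, assuming a `local[2]` master and the `spark-sql` module on the classpath; the application name and column names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("ArchitectureSketch") // hypothetical name
  .master("local[2]")            // driver and executor threads share one JVM here
  .getOrCreate()

// The driver only records a logical plan at this point; no tasks run yet.
val scaled = spark.range(0, 1000).toDF("id")
  .filter(col("id") % 2 === 0)
  .select((col("id") * 10).as("scaled"))

// Prints the parsed, analyzed, optimized (Catalyst), and physical plans.
scaled.explain(true)

// An action triggers execution: the plan is split into tasks, one per partition.
val partitions = scaled.rdd.getNumPartitions
val rows = scaled.count()

spark.stop()
```

Comparing the optimized plan with the parsed plan in the `explain` output is a quick way to see what Catalyst rewrote before any task was scheduled.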
## Capabilities

### Core Engine

The core engine provides the fundamental distributed computing capabilities: RDDs, transformations, actions, and shared variables (broadcast variables and accumulators).

```scala { .api }
class SparkContext(config: SparkConf) {
  def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T]
  def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
  def broadcast[T: ClassTag](value: T): Broadcast[T]
  def longAccumulator: LongAccumulator
  def stop(): Unit
}

abstract class RDD[T: ClassTag] {
  def map[U: ClassTag](f: T => U): RDD[U]
  def filter(f: T => Boolean): RDD[T]
  def collect(): Array[T]
  def count(): Long
  def reduce(f: (T, T) => T): T
  def cache(): this.type
}
```

[Core Engine](./core-engine.md)

### SQL and DataFrames

Structured data processing with SQL queries, DataFrames, and type-safe Datasets. Includes data source connectors and streaming capabilities.

```scala { .api }
object SparkSession {
  def builder(): Builder
}

class SparkSession {
  def read: DataFrameReader
  def sql(sqlText: String): DataFrame
  def createDataFrame[A <: Product : TypeTag](data: Seq[A]): DataFrame
  def stop(): Unit
}

abstract class Dataset[T] {
  def select(cols: Column*): DataFrame
  def filter(condition: Column): Dataset[T]
  def groupBy(cols: Column*): RelationalGroupedDataset
  def join(right: Dataset[_]): DataFrame
  def collect(): Array[T]
  def show(numRows: Int = 20): Unit
}
```

[SQL and DataFrames](./sql-dataframes.md)

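A short sketch of these methods working together is given below. It builds a DataFrame from a local collection, then filters and aggregates it. The `Person` case class and its sample rows are hypothetical; the case class is assumed to be defined at the top level (not inside a method) so that a `TypeTag` is available.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col}

// Hypothetical record type used only for this sketch.
case class Person(name: String, age: Int, city: String)

val spark = SparkSession.builder()
  .appName("SqlSketch")
  .master("local[2]")
  .getOrCreate()

// createDataFrame derives the schema (name, age, city) from the case class.
val people = spark.createDataFrame(Seq(
  Person("Ana", 34, "Lisbon"),
  Person("Ben", 17, "Oslo"),
  Person("Cleo", 45, "Lisbon")
))

// Keep adults, then compute the average age per city.
val adultsByCity = people
  .filter(col("age") >= 18)
  .groupBy(col("city"))
  .agg(avg(col("age")).as("avg_age"))

adultsByCity.show()

val lisbonAvg = adultsByCity
  .filter(col("city") === "Lisbon")
  .collect()
  .head
  .getDouble(1) // column 1 is avg_age

spark.stop()
```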
### Machine Learning

Scalable machine learning algorithms and utilities. The DataFrame-based API (`spark.ml`) is the primary API; the RDD-based API (`spark.mllib`) is in maintenance mode.

```scala { .api }
// DataFrame-based ML Pipeline API
abstract class Estimator[M <: Model[M]] extends PipelineStage {
  def fit(dataset: Dataset[_]): M
}

abstract class Transformer extends PipelineStage {
  def transform(dataset: Dataset[_]): DataFrame
}

class Pipeline extends Estimator[PipelineModel] {
  def setStages(value: Array[_ <: PipelineStage]): this.type
  def fit(dataset: Dataset[_]): PipelineModel
}
```

[Machine Learning](./machine-learning.md)

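The estimator/transformer split is easiest to see in a small text-classification pipeline. The sketch below assumes the `spark-mllib` module is on the classpath; the training rows and hyperparameters are illustrative values, not recommendations.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MlSketch")
  .master("local[2]")
  .getOrCreate()

// Tiny made-up training set: (id, text, label).
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Two transformers followed by one estimator.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// fit() returns a PipelineModel, which is itself a Transformer.
val model = pipeline.fit(training)
val predictions = model.transform(training).select("id", "text", "prediction")
predictions.show(false)
val predictionCount = predictions.count()

spark.stop()
```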
### Graph Processing

Large-scale graph processing with GraphX, including graph algorithms and graph-parallel computations.

```scala { .api }
abstract class Graph[VD: ClassTag, ED: ClassTag] {
  def vertices: VertexRDD[VD]
  def edges: EdgeRDD[ED]
  def mapVertices[VD2: ClassTag](map: (VertexId, VD) => VD2): Graph[VD2, ED]
  def subgraph(epred: EdgeTriplet[VD, ED] => Boolean = _ => true,
               vpred: (VertexId, VD) => Boolean = (_, _) => true): Graph[VD, ED]
}

object GraphLoader {
  def edgeListFile(sc: SparkContext, path: String): Graph[Int, Int]
}
```

[Graph Processing](./graph-processing.md)

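As a rough illustration of `subgraph` and `mapVertices`, the sketch below builds a three-vertex graph in memory rather than loading an edge list file. It assumes the `spark-graphx` module is available; the vertex names and edge weights are invented.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}

val sc = new SparkContext(
  new SparkConf().setAppName("GraphSketch").setMaster("local[2]"))

// Hypothetical data: vertices carry a user name, edges carry an integer weight.
val users = sc.parallelize(Seq[(VertexId, String)]((1L, "ana"), (2L, "ben"), (3L, "cleo")))
val follows = sc.parallelize(Seq(Edge(1L, 2L, 5), Edge(2L, 3L, 1), Edge(3L, 1L, 4)))

val graph = Graph(users, follows)

// Keep only edges with weight >= 4; all vertices are kept by the default vpred.
val strong = graph.subgraph(epred = triplet => triplet.attr >= 4)

// Rewrite vertex attributes without touching the graph structure.
val upper = strong.mapVertices((_, name) => name.toUpperCase)

// Triplets expose both endpoint attributes alongside each edge.
val strongEdges = upper.triplets
  .map(t => s"${t.srcAttr}->${t.dstAttr}")
  .collect()
  .sorted

sc.stop()
```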
### Stream Processing

Real-time data processing with both legacy DStreams and modern Structured Streaming APIs. In Structured Streaming, a query is started from a `DataStreamWriter`; the returned `StreamingQuery` handle is used to monitor and stop it.

```scala { .api }
class DataStreamReader {
  def format(source: String): DataStreamReader
  def option(key: String, value: String): DataStreamReader
  def load(): DataFrame
}

class DataStreamWriter[T] {
  def format(source: String): DataStreamWriter[T]
  def start(): StreamingQuery
}

trait StreamingQuery {
  def stop(): Unit
  def awaitTermination(): Unit
  def isActive: Boolean
}
```

[Stream Processing](./stream-processing.md)

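The reader, writer, and query handle can be exercised end to end without external infrastructure. The sketch below uses the built-in `rate` test source and the `memory` sink, both intended for experimentation; the query name, row rate, and three-second wait are arbitrary choices for this example.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("StreamingSketch")
  .master("local[2]")
  .getOrCreate()

// The rate source emits rows with columns (timestamp, value).
val stream = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "5")
  .load()

val evens = stream.filter("value % 2 = 0")

// The memory sink exposes results as an in-memory table named after the query.
val query = evens.writeStream
  .format("memory")
  .queryName("even_values")
  .outputMode("append")
  .start()

query.awaitTermination(3000) // wait up to 3 seconds, then continue
val wasActive = query.isActive
query.stop()

val collected = spark.sql("SELECT value FROM even_values").collect().map(_.getLong(0))

spark.stop()
```

For a long-running job you would normally call `awaitTermination()` without a timeout and write to a durable sink instead of `memory`.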
## Types

### Core Types

```scala { .api }
class SparkConf(loadDefaults: Boolean = true) {
  def set(key: String, value: String): SparkConf
  def setAppName(name: String): SparkConf
  def setMaster(master: String): SparkConf
  def get(key: String): String
}

abstract class Broadcast[T] extends Serializable {
  def value: T
  def unpersist(): Unit
  def destroy(): Unit
}

abstract class AccumulatorV2[IN, OUT] extends Serializable {
  def add(v: IN): Unit
  def value: OUT
  def reset(): Unit
}

object StorageLevel {
  val MEMORY_ONLY: StorageLevel
  val MEMORY_AND_DISK: StorageLevel
  val DISK_ONLY: StorageLevel
}
```

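As a small illustration of `SparkConf` and `StorageLevel` in use, the sketch below sets a configuration entry, persists an RDD with an explicit storage level, and reads both back. The application name and the choice of `MEMORY_AND_DISK` are assumptions for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("TypesSketch")
  .setMaster("local[2]")
  .set("spark.ui.enabled", "false") // skip the web UI for this short-lived job

val sc = new SparkContext(conf)

// Persist with an explicit level instead of cache(), which uses MEMORY_ONLY for RDDs.
val squares = sc.parallelize(1 to 100)
  .map(n => n.toLong * n)
  .persist(StorageLevel.MEMORY_AND_DISK)

val sum = squares.reduce(_ + _)
val level = squares.getStorageLevel
val appName = sc.getConf.get("spark.app.name")

squares.unpersist()
sc.stop()
```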
### SQL Types

```scala { .api }
trait Row extends Serializable {
  def get(i: Int): Any
  def getString(i: Int): String
  def getInt(i: Int): Int
  def getDouble(i: Int): Double
  def isNullAt(i: Int): Boolean
  def size: Int
}

class Column(expr: Expression) {
  def ===(other: Any): Column
  def =!=(other: Any): Column
  def >(other: Any): Column
  def <(other: Any): Column
  def isNull: Column
  def cast(to: DataType): Column
  def alias(alias: String): Column
}

class StructType(fields: Array[StructField]) extends DataType {
  def add(field: StructField): StructType
  def add(name: String, dataType: DataType): StructType
  def fieldNames: Array[String]
}
```

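These types typically meet when a DataFrame is built from generic rows with an explicit schema. The following is a minimal sketch under that assumption; the field names and sample rows are hypothetical.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val spark = SparkSession.builder()
  .appName("SqlTypesSketch")
  .master("local[2]")
  .getOrCreate()

// Build the schema incrementally; fields added this way are nullable by default.
val schema = new StructType()
  .add("name", StringType)
  .add("age", IntegerType)

// Generic rows, including one with a missing age.
val rows = java.util.Arrays.asList(Row("Ana", 34), Row("Ben", null), Row("Cleo", 45))
val df = spark.createDataFrame(rows, schema)

// Column expressions: null check, inequality, cast, and alias.
val known = df
  .filter(col("age").isNotNull && col("name") =!= "Cleo")
  .select(col("name"), col("age").cast("long").alias("age_long"))

val first = known.collect().head
val summary = s"${first.getString(0)}:${first.getLong(1)}"

spark.stop()
```

Calling `isNullAt(i)` before a primitive getter such as `getInt(i)` is the usual guard when a column may contain nulls.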
### Graph Types

```scala { .api }
type VertexId = Long

case class Edge[ED](srcId: VertexId, dstId: VertexId, attr: ED)

class EdgeTriplet[VD, ED] extends Edge[ED] {
  def srcAttr: VD
  def dstAttr: VD
}

abstract class VertexRDD[VD] extends RDD[(VertexId, VD)]
abstract class EdgeRDD[ED] extends RDD[Edge[ED]]
```