Apache Spark: unified analytics engine for large-scale data processing, with APIs in Scala, Java, Python, and R.
npx @tessl/cli install tessl/maven-org-apache-spark--spark-parent_2.13@4.0.0
# Apache Spark

Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.

## Package Information

- **Package Name**: org.apache.spark:spark-parent_2.13
- **Package Type**: Maven
- **Language**: Scala (with Java, Python, R APIs)
- **Installation**: See [Building Spark](#building-spark)
- **License**: Apache-2.0

### Maven Dependency

For Spark SQL (most common):

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.13</artifactId>
  <version>4.0.0</version>
</dependency>
```

For Spark Core:

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.13</artifactId>
  <version>4.0.0</version>
</dependency>
```

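If your build uses sbt rather than Maven, the equivalent coordinates (matching the versions above) are:

```scala
// build.sbt -- %% appends the Scala binary suffix (_2.13) to the artifact name
libraryDependencies += "org.apache.spark" %% "spark-sql" % "4.0.0"
```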
## Core Imports

```scala { .api }
// Main entry points
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

// Core data structures
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset}

// Configuration
import org.apache.spark.SparkConf

// Common types
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
```

## Basic Usage

### Creating a Spark Session (Recommended)

```scala { .api }
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MySparkApp")
  .master("local[*]") // Use all available cores locally
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

// Use the spark session for SQL operations
val df = spark.read.json("path/to/file.json")
df.show()

// Access the SparkContext for RDD operations
val sc = spark.sparkContext
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))
println(rdd.map(_ * 2).collect().mkString(", "))

spark.stop()
```

### Working with DataFrames

```scala { .api }
import org.apache.spark.sql.functions._

// Read data
val df = spark.read
  .option("header", "true")
  .csv("path/to/data.csv")

// Transform data
val result = df
  .filter(col("age") > 21)
  .groupBy("department")
  .agg(avg("salary").as("avg_salary"))
  .orderBy(desc("avg_salary"))

result.show()
```

### Working with RDDs (Low-level API)

```scala { .api }
val sc = spark.sparkContext

// Create an RDD from a local collection
val numbers = sc.parallelize(1 to 1000)

// Transformations are lazy; reduce is an action that triggers computation
val result = numbers
  .filter(_ % 2 == 0)
  .map(_ * 2)
  .reduce(_ + _)

println(s"Result: $result")
```

## Architecture

Apache Spark is built around several key components:

### Core Engine
- **SparkContext**: The main entry point for Spark functionality and connection to the cluster
- **RDD (Resilient Distributed Dataset)**: Fundamental data abstraction - immutable distributed collections
- **DAG Scheduler**: Breaks each job's logical graph of transformations into stages of tasks
- **Task Scheduler**: Schedules and executes tasks across the cluster

### High-Level APIs
- **SparkSession**: Unified entry point for DataFrame and Dataset APIs (recommended)
- **DataFrame/Dataset**: Higher-level abstractions built on RDDs with schema information
- **Catalyst Optimizer**: Query optimization engine for SQL and DataFrame operations

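To see the Catalyst optimizer at work, `Dataset.explain(true)` prints the parsed, analyzed, optimized, and physical plans. A minimal sketch, assuming an active `spark` session:

```scala
import org.apache.spark.sql.functions._

val df = spark.range(1000).toDF("id")

// The extended plan output shows Catalyst rewriting this query,
// e.g. collapsing projections before the physical plan is chosen
df.filter(col("id") > 10)
  .select((col("id") * 2).as("doubled"))
  .explain(true)
```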
### Libraries and Extensions
- **Spark SQL**: Module for working with structured data using SQL or the DataFrame API
- **MLlib**: Machine learning library with algorithms and utilities
- **GraphX**: Graph processing framework for graph-based analytics
- **Structured Streaming**: Real-time stream processing with the DataFrame API

### Deployment and Storage
- **Cluster Managers**: Support for YARN, Kubernetes, Mesos, and standalone mode
- **Storage Systems**: Integration with HDFS, S3, databases, and various file formats
- **Serialization**: Efficient data serialization for network and storage operations

## Building Spark

Spark is built using Apache Maven:

```bash
./build/mvn -DskipTests clean package
```

For development with specific profiles:

```bash
./build/mvn -Pyarn -Phadoop-3.3 -Pscala-2.13 -DskipTests clean package
```

## Capabilities

### [Core Engine](./core.md)

Low-level distributed computing with RDDs, SparkContext, and cluster management. Provides the foundation for all other Spark components.

```scala { .api }
class SparkContext(config: SparkConf) extends Logging {
  def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T]
  def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
  def stop(): Unit
}

abstract class RDD[T: ClassTag] extends Serializable with Logging {
  def map[U: ClassTag](f: T => U): RDD[U]
  def filter(f: T => Boolean): RDD[T]
  def collect(): Array[T]
  def count(): Long
}
```

[Core Engine Documentation](./core.md)

### [SQL and DataFrames](./sql.md)

High-level APIs for working with structured data using SQL, DataFrames, and Datasets with the Catalyst optimizer.

```scala { .api }
class SparkSession extends Serializable with Closeable {
  def read: DataFrameReader
  def sql(sqlText: String): DataFrame
  def createDataFrame[A <: Product : TypeTag](data: Seq[A]): DataFrame
  def stop(): Unit
}

abstract class Dataset[T] extends Serializable {
  def select(cols: Column*): DataFrame
  def filter(condition: Column): Dataset[T]
  def groupBy(cols: Column*): RelationalGroupedDataset
  def show(numRows: Int = 20): Unit
}
```

[SQL and DataFrames Documentation](./sql.md)

### [Machine Learning](./mllib.md)

Comprehensive machine learning library with algorithms for classification, regression, clustering, and collaborative filtering.

```scala { .api }
// Example ML pipeline components
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
```

[Machine Learning Documentation](./mllib.md)

### [Structured Streaming](./streaming.md)

Scalable and fault-tolerant stream processing engine built on the Spark SQL engine with micro-batch and continuous processing modes.

```scala { .api }
val spark = SparkSession.builder().appName("StructuredNetworkWordCount").getOrCreate()
import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Split the lines into words and count them; "complete" output mode
// requires an aggregation such as this groupBy/count
val wordCounts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
```

[Structured Streaming Documentation](./streaming.md)

### [Graph Processing](./graphx.md)

Graph-parallel computation framework for ETL, exploratory data analysis, and iterative graph computation.

```scala { .api }
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val vertices: RDD[(VertexId, String)] = sc.parallelize(Seq(
  (1L, "Alice"), (2L, "Bob"), (3L, "Charlie")
))

val edges: RDD[Edge[String]] = sc.parallelize(Seq(
  Edge(1L, 2L, "friend"), Edge(2L, 3L, "follow")
))

val graph = Graph(vertices, edges)
```

[Graph Processing Documentation](./graphx.md)

### [Core Utilities](./exceptions.md)

Essential infrastructure components including exception handling, logging, configuration management, and storage utilities that support all Spark components.

**Key Utility Areas:**

- [Exception Handling](./exceptions.md) - Structured error management with error classes
- [Storage Configuration](./storage.md) - RDD storage level management
- [Logging Framework](./logging.md) - Structured logging with MDC support
- [Thread Utilities](./utils.md) - Thread-safe utilities and lexical scoping

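As one concrete example of the storage utilities, `StorageLevel` controls how a cached RDD is held; a minimal sketch, assuming an active `spark` session:

```scala
import org.apache.spark.storage.StorageLevel

val rdd = spark.sparkContext.parallelize(1 to 100)

// Keep partitions in memory, spilling to disk when they do not fit
rdd.persist(StorageLevel.MEMORY_AND_DISK)

println(rdd.count()) // the first action materializes the cache
rdd.unpersist()
```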
## Interactive Usage

### Scala Shell

```bash
./bin/spark-shell
```

Example:

```scala
scala> spark.range(1000 * 1000 * 1000).count()
res0: Long = 1000000000
```

### Python Shell

```bash
./bin/pyspark
```

Example:

```python
>>> spark.range(1000 * 1000 * 1000).count()
1000000000
```

### SQL Shell

```bash
./bin/spark-sql
```

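An example query (`range` is Spark SQL's built-in table-valued function; console output abbreviated):

```sql
spark-sql> SELECT count(*) FROM range(1000);
1000
```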
## Configuration

Key configuration options:

```scala { .api }
val conf = new SparkConf()
  .setAppName("MyApp")
  .setMaster("local[4]") // Use 4 cores locally
  .set("spark.sql.adaptive.enabled", "true")
  .set("spark.sql.adaptive.coalescePartitions.enabled", "true")
  .set("spark.executor.memory", "2g")
  .set("spark.driver.memory", "1g")
```

## Deployment

Spark supports multiple deployment modes via the master URL:

- **Local**: `local`, `local[4]`, `local[*]`
- **Standalone**: `spark://host:port`
- **YARN**: `yarn`
- **Kubernetes**: `k8s://https://kubernetes-api-url`
- **Mesos**: `mesos://host:port` (deprecated since Spark 3.2)

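Applications are typically launched against one of these masters with `spark-submit`; a minimal sketch (the class name and JAR path are illustrative):

```bash
./bin/spark-submit \
  --class com.example.MyApp \
  --master "local[4]" \
  --conf spark.executor.memory=2g \
  target/my-app.jar
```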
---

**Core Documentation:**

- [Core Engine](./core.md) - SparkContext, RDDs, and distributed computing
- [SQL and DataFrames](./sql.md) - SparkSession, DataFrame, Dataset APIs
- [Machine Learning](./mllib.md) - MLlib algorithms and pipelines
- [Structured Streaming](./streaming.md) - Real-time stream processing
- [Graph Processing](./graphx.md) - GraphX graph-parallel computation

**Utilities Documentation:**

- [Exception Handling](./exceptions.md) - Error management and validation
- [Storage Configuration](./storage.md) - RDD persistence and storage levels
- [Logging Framework](./logging.md) - Structured logging infrastructure
- [Thread Utilities](./utils.md) - Thread-safe utilities and scoping