0
# Apache Spark
1
2
Apache Spark is a unified analytics engine for large-scale data processing that provides advanced programming APIs in Scala, Java, Python, and R. It offers high-level APIs for distributed data processing along with an optimized computation engine supporting general directed acyclic graphs for data analysis.
3
4
## Package Information
5
6
- **Package Name**: org.apache.spark:spark-parent_2.12
7
- **Package Type**: maven
8
- **Languages**: Scala, Java, Python, R
9
- **Installation**:
10
- Maven/SBT: `org.apache.spark:spark-core_2.12:3.5.6`
11
- Python: `pip install pyspark==3.5.6`
12
- R: `install.packages("SparkR")`
13
14
## Core Imports
15
16
```scala
17
import org.apache.spark.SparkContext
18
import org.apache.spark.SparkConf
19
import org.apache.spark.sql.SparkSession
20
import org.apache.spark.rdd.RDD
21
```
22
23
For Java:
24
25
```java
26
import org.apache.spark.SparkContext;
27
import org.apache.spark.SparkConf;
28
import org.apache.spark.sql.SparkSession;
29
import org.apache.spark.api.java.JavaRDD;
30
import org.apache.spark.api.java.JavaSparkContext;
31
```
32
33
For Python:
34
35
```python
36
from pyspark import SparkContext, SparkConf
37
from pyspark.sql import SparkSession, DataFrame, Column
38
from pyspark.rdd import RDD
39
```
40
41
Common PySpark imports:
42
43
```python
44
from pyspark.sql import functions as F
45
from pyspark.ml import Pipeline
46
from pyspark.ml.feature import VectorAssembler
47
from pyspark.ml.classification import LogisticRegression
48
```
49
50
## Basic Usage
51
52
```scala
53
import org.apache.spark.sql.SparkSession
54
import org.apache.spark.SparkContext
55
56
// Create SparkSession (modern approach)
57
val spark = SparkSession.builder()
58
.appName("MyApp")
59
.master("local[*]")
60
.getOrCreate()
61
62
// Working with DataFrames
63
val df = spark.read
64
.option("header", "true")
65
.csv("data.csv")
66
67
df.select("name", "age")
68
.filter($"age" > 21)
69
.show()
70
71
// Working with RDDs (low-level API)
72
val sc = spark.sparkContext
73
val data = sc.parallelize(1 to 1000)
74
val result = data
75
.map(_ * 2)
76
.filter(_ > 100)
77
.collect()
78
79
spark.stop()
80
```
81
82
Python equivalent:
83
84
```python
85
from pyspark.sql import SparkSession
86
from pyspark.sql import functions as F
87
88
# Create SparkSession
89
spark = SparkSession.builder \
90
.appName("MyApp") \
91
.master("local[*]") \
92
.getOrCreate()
93
94
# Working with DataFrames
95
df = spark.read \
96
.option("header", "true") \
97
.csv("data.csv")
98
99
df.select("name", "age") \
100
.filter(F.col("age") > 21) \
101
.show()
102
103
# Working with RDDs
104
sc = spark.sparkContext
105
data = sc.parallelize(range(1, 1001))
106
result = data \
107
.map(lambda x: x * 2) \
108
.filter(lambda x: x > 100) \
109
.collect()
110
111
spark.stop()
112
```
113
114
## Architecture
115
116
Apache Spark is built around several key components:
117
118
- **Core Engine**: Provides distributed task scheduling, memory management, fault recovery, and storage system interactions
119
- **Spark SQL**: Module for working with structured data using DataFrames and Datasets with SQL queries
120
- **MLlib**: Machine learning library providing common algorithms and utilities
121
- **GraphX**: Graph processing framework for analyzing graph structures and running graph algorithms
122
- **Structured Streaming**: Stream processing engine built on Spark SQL for real-time data processing
123
- **Cluster Management**: Support for various cluster managers (YARN, Kubernetes, Mesos, Standalone)
124
125
## Capabilities
126
127
### Core Data Processing
128
129
Distributed data processing using Resilient Distributed Datasets (RDDs) and the fundamental Spark execution engine. Provides fault-tolerant, parallel data structures and transformations.
130
131
```scala { .api }
132
class SparkContext(config: SparkConf) {
133
def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T]
134
def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
135
def broadcast[T: ClassTag](value: T): Broadcast[T]
136
def stop(): Unit
137
}
138
139
abstract class RDD[T: ClassTag] {
140
def map[U: ClassTag](f: T => U): RDD[U]
141
def filter(f: T => Boolean): RDD[T]
142
def collect(): Array[T]
143
def count(): Long
144
def cache(): RDD[T]
145
}
146
```
147
148
Python API:
149
150
```python { .api }
151
class SparkContext:
152
def __init__(self, conf: SparkConf = None)
153
def parallelize(self, c: Iterable, numSlices: int = None) -> RDD
154
def textFile(self, name: str, minPartitions: int = None) -> RDD
155
def broadcast(self, value: Any) -> Broadcast
156
def stop(self) -> None
157
158
class RDD:
159
def map(self, f: Callable) -> RDD
160
def filter(self, f: Callable) -> RDD
161
def collect(self) -> List
162
def count(self) -> int
163
def cache(self) -> RDD
164
```
165
166
[Core Processing](./core.md)
167
168
### Structured Data Processing
169
170
High-level APIs for working with structured data using DataFrames and Datasets. Built on Spark SQL with Catalyst optimizer for query optimization and code generation.
171
172
```scala { .api }
173
class SparkSession {
174
def read: DataFrameReader
175
def sql(sqlText: String): DataFrame
176
def table(tableName: String): DataFrame
177
def createDataFrame[A <: Product : TypeTag](data: Seq[A]): DataFrame
178
def stop(): Unit
179
}
180
181
class Dataset[T] {
182
def select(cols: Column*): DataFrame
183
def filter(condition: Column): Dataset[T]
184
def groupBy(cols: Column*): RelationalGroupedDataset
185
def join(right: Dataset[_]): DataFrame
186
def write: DataFrameWriter[T]
187
def collect(): Array[T]
188
def show(numRows: Int = 20): Unit
189
}
190
191
type DataFrame = Dataset[Row]
192
```
193
194
Python API:
195
196
```python { .api }
197
class SparkSession:
198
@property
199
def read(self) -> DataFrameReader
200
def sql(self, sqlQuery: str) -> DataFrame
201
def table(self, tableName: str) -> DataFrame
202
def createDataFrame(self, data: List, schema: Optional[Union[List, StructType]] = None) -> DataFrame
203
def stop(self) -> None
204
205
class DataFrame:
206
def select(self, *cols: Union[str, Column]) -> DataFrame
207
def filter(self, condition: Union[str, Column]) -> DataFrame
208
def where(self, condition: Union[str, Column]) -> DataFrame
209
def groupBy(self, *cols: Union[str, Column]) -> GroupedData
210
def join(self, other: DataFrame, on: Optional[Union[str, List[str], Column]] = None) -> DataFrame
211
def collect(self) -> List[Row]
212
def show(self, n: int = 20, truncate: bool = True) -> None
213
```
214
215
[SQL and DataFrames](./sql.md)
216
217
### Machine Learning
218
219
Comprehensive machine learning library with algorithms for classification, regression, clustering, and collaborative filtering. Provides both high-level Pipeline API and low-level RDD-based APIs.
220
221
```scala { .api }
222
class Pipeline extends Estimator[PipelineModel] {
223
def setStages(value: Array[PipelineStage]): Pipeline
224
def fit(dataset: Dataset[_]): PipelineModel
225
}
226
227
abstract class Estimator[M <: Model[M]] extends PipelineStage {
228
def fit(dataset: Dataset[_]): M
229
}
230
231
abstract class Transformer extends PipelineStage {
232
def transform(dataset: Dataset[_]): DataFrame
233
}
234
```
235
236
Python API:
237
238
```python { .api }
239
class Pipeline(Estimator):
240
def setStages(self, value: List[PipelineStage]) -> Pipeline
241
def fit(self, dataset: DataFrame) -> PipelineModel
242
243
class Estimator(PipelineStage):
244
def fit(self, dataset: DataFrame) -> Model
245
246
class Transformer(PipelineStage):
247
def transform(self, dataset: DataFrame) -> DataFrame
248
249
class PipelineModel(Model):
250
def transform(self, dataset: DataFrame) -> DataFrame
251
```
252
253
[Machine Learning](./ml.md)
254
255
### Graph Processing
256
257
GraphX provides APIs for graphs and graph-parallel computation with fundamental operators like subgraph, joinVertices, and aggregateMessages, plus optimized variants of graph algorithms. **Note**: GraphX is only available in Scala and Java - there is no Python API.
258
259
```scala { .api }
260
abstract class Graph[VD: ClassTag, ED: ClassTag] {
261
val vertices: VertexRDD[VD]
262
val edges: EdgeRDD[ED]
263
def mapVertices[VD2: ClassTag](map: (VertexId, VD) => VD2): Graph[VD2, ED]
264
def aggregateMessages[A: ClassTag](
265
sendMsg: EdgeContext[VD, ED, A] => Unit,
266
mergeMsg: (A, A) => A
267
): VertexRDD[A]
268
}
269
270
type VertexId = Long
271
```
272
273
**Python Alternative**: Use GraphFrames library (`pip install graphframes`) for graph processing in Python with Spark DataFrames.
274
275
[Graph Processing](./graphx.md)
276
277
### Stream Processing
278
279
Structured Streaming provides real-time stream processing with exactly-once fault-tolerance guarantees. Built on the Spark SQL engine for seamless integration with batch processing.
280
281
```scala { .api }
282
class StreamingContext(conf: SparkConf, batchDuration: Duration) {
283
def socketTextStream(hostname: String, port: Int): ReceiverInputDStream[String]
284
def textFileStream(directory: String): DStream[String]
285
def start(): Unit
286
def stop(): Unit
287
def awaitTermination(): Unit
288
}
289
290
abstract class DStream[T: ClassTag] {
291
def map[U: ClassTag](mapFunc: T => U): DStream[U]
292
def filter(filterFunc: T => Boolean): DStream[T]
293
def window(windowDuration: Duration, slideDuration: Duration): DStream[T]
294
def foreachRDD(foreachFunc: RDD[T] => Unit): Unit
295
}
296
```
297
298
Python API:
299
300
```python { .api }
301
class StreamingContext:
302
def __init__(self, sparkContext: SparkContext, batchDuration: float)
303
def socketTextStream(self, hostname: str, port: int) -> DStream
304
def textFileStream(self, directory: str) -> DStream
305
def start(self) -> None
306
def stop(self, stopSparkContext: bool = True, stopGracefully: bool = False) -> None
307
def awaitTermination(self, timeout: Optional[float] = None) -> None
308
309
class DStream:
310
def map(self, f: Callable) -> DStream
311
def filter(self, f: Callable) -> DStream
312
def window(self, windowDuration: float, slideDuration: Optional[float] = None) -> DStream
313
def foreachRDD(self, func: Callable[[RDD], None]) -> None
314
```
315
316
For modern streaming applications, prefer Structured Streaming via `SparkSession.readStream` which provides better fault tolerance and performance.
317
318
[Stream Processing](./streaming.md)
319
320
### Application Management
321
322
Programmatic interfaces for launching, monitoring, and managing Spark applications across different cluster managers and deployment modes.
323
324
```scala { .api }
325
class SparkConf(loadDefaults: Boolean = true) {
326
def set(key: String, value: String): SparkConf
327
def setMaster(master: String): SparkConf
328
def setAppName(name: String): SparkConf
329
def get(key: String): String
330
}
331
```
332
333
```java { .api }
334
public class SparkLauncher {
335
public SparkLauncher setAppName(String appName);
336
public SparkLauncher setMaster(String master);
337
public SparkLauncher setMainClass(String mainClass);
338
public Process launch() throws IOException;
339
public SparkAppHandle startApplication() throws IOException;
340
}
341
```
342
343
Python API:
344
345
```python { .api }
346
class SparkConf:
347
def __init__(self, loadDefaults: bool = True)
348
def set(self, key: str, value: str) -> SparkConf
349
def setMaster(self, value: str) -> SparkConf
350
def setAppName(self, value: str) -> SparkConf
351
def get(self, key: str, defaultValue: Optional[str] = None) -> str
352
def getAll(self) -> List[Tuple[str, str]]
353
```
354
355
Note: Python does not have a direct equivalent to SparkLauncher. Use `spark-submit` command-line tool or SparkSession.builder for application management.
356
357
### Python-Specific Features
358
359
Python-specific Spark capabilities including pandas API compatibility, type hints, and Python-optimized operations.
360
361
```python { .api }
362
# Pandas API on Spark (pyspark.pandas)
363
import pyspark.pandas as ps
364
365
class DataFrame: # pandas-compatible DataFrame
366
def head(self, n: int = 5) -> DataFrame
367
def describe(self) -> DataFrame
368
def groupby(self, by: Union[str, List[str]]) -> GroupBy
369
def merge(self, right: DataFrame, on: str = None) -> DataFrame
370
371
# SQL Functions
372
from pyspark.sql import functions as F
373
374
# Common functions (473+ available)
375
def col(colName: str) -> Column
376
def lit(col: Any) -> Column
377
def when(condition: Column, value: Any) -> Column
378
def coalesce(*cols: Column) -> Column
379
def concat(*cols: Column) -> Column
380
def regexp_replace(str: Column, pattern: str, replacement: str) -> Column
381
```
382
383
**Key Python Features:**
384
- **Pandas API**: `pyspark.pandas` provides pandas-compatible DataFrame operations
385
- **Type Hints**: Full type annotation support for better IDE integration
386
- **Arrow Integration**: High-performance columnar data transfer between JVM and Python
387
- **UDF Support**: User-defined functions with vectorized operations using pandas UDFs
388
389
[Configuration and Deployment](./deployment.md)