0
# Apache Spark
1
2
Apache Spark is a unified analytics engine for large-scale data processing that provides high-level APIs in Scala, Java, Python, and R, along with an optimized engine supporting general computation graphs. Spark includes multiple specialized components for SQL and DataFrames processing, machine learning, graph processing, and real-time stream processing.
3
4
## Package Information
5
6
- **Package Name**: org.apache.spark:spark-parent_2.13
7
- **Package Type**: maven
8
- **Language**: Scala (with Java, Python, R bindings)
9
- **License**: Apache-2.0
10
- **Installation**: Add Maven dependency or download distribution from https://spark.apache.org/downloads.html
11
12
## Core Imports
13
14
For Scala applications:
15
16
```scala
17
import org.apache.spark.SparkContext
18
import org.apache.spark.SparkConf
19
import org.apache.spark.sql.SparkSession
20
import org.apache.spark.rdd.RDD
21
```
22
23
For Java applications:
24
25
```java
26
import org.apache.spark.SparkConf;
27
import org.apache.spark.api.java.JavaSparkContext;
28
import org.apache.spark.sql.SparkSession;
29
import org.apache.spark.sql.Dataset;
30
import org.apache.spark.sql.Row;
31
```
32
33
Maven dependency:
34
35
```xml
36
<dependency>
37
<groupId>org.apache.spark</groupId>
38
<artifactId>spark-core_2.13</artifactId>
39
<version>3.5.6</version>
40
</dependency>
41
<dependency>
42
<groupId>org.apache.spark</groupId>
43
<artifactId>spark-sql_2.13</artifactId>
44
<version>3.5.6</version>
45
</dependency>
46
```
47
48
## Basic Usage
49
50
```scala
51
import org.apache.spark.sql.SparkSession
52
53
// Create SparkSession (entry point for DataFrame and SQL APIs)
54
val spark = SparkSession.builder()
55
.appName("MySparkApp")
56
.master("local[*]")
57
.getOrCreate()
58
59
// Create DataFrame from data
60
val df = spark.createDataFrame(Seq(
61
("Alice", 25),
62
("Bob", 30),
63
("Charlie", 35)
64
)).toDF("name", "age")
65
66
// Run SQL queries
67
df.createOrReplaceTempView("people")
68
val adults = spark.sql("SELECT * FROM people WHERE age >= 30")
69
adults.show()
70
71
// DataFrame transformations
72
val result = df.filter($"age" > 25)
73
.select($"name", $"age")
74
.orderBy($"age".desc)
75
76
result.collect()
77
78
spark.stop()
79
```
80
81
## Architecture
82
83
Apache Spark consists of several key components:
84
85
- **Spark Core**: The foundation providing basic I/O functionalities, RDDs, and task scheduling
86
- **Spark SQL**: Module for working with structured data using DataFrames and SQL
87
- **MLlib**: Machine learning library with algorithms and utilities
88
- **GraphX**: Graph processing framework for graph-parallel computation
89
- **Structured Streaming**: Scalable and fault-tolerant stream processing
90
- **Spark Streaming**: Legacy stream processing API (DStreams)
91
92
## Capabilities
93
94
### Core Engine (RDDs and SparkContext)
95
96
Fundamental distributed computing capabilities with Resilient Distributed Datasets (RDDs).
97
98
```scala { .api }
99
class SparkContext(config: SparkConf) {
100
def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T]
101
def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
102
def stop(): Unit
103
}
104
105
abstract class RDD[T] {
106
def map[U](f: T => U): RDD[U]
107
def filter(f: T => Boolean): RDD[T]
108
def collect(): Array[T]
109
def count(): Long
110
def cache(): this.type
111
}
112
```
113
114
[Core Engine APIs](./core-engine.md)
115
116
### Structured Data Processing (SQL and DataFrames)
117
118
High-level APIs for working with structured data, including DataFrames, Datasets, and SQL.
119
120
```scala { .api }
121
class SparkSession {
122
def sql(sqlText: String): DataFrame
123
def read: DataFrameReader
124
def createDataFrame[A <: Product](rdd: RDD[A]): DataFrame
125
}
126
127
class Dataset[T] {
128
def select(cols: Column*): DataFrame
129
def filter(condition: Column): Dataset[T]
130
def groupBy(cols: Column*): RelationalGroupedDataset
131
def join(right: Dataset[_], joinExprs: Column): DataFrame
132
def show(): Unit
133
def collect(): Array[T]
134
}
135
```
136
137
[SQL and DataFrames](./sql-dataframes.md)
138
139
### Machine Learning
140
141
Comprehensive machine learning library with algorithms, feature engineering, and model evaluation.
142
143
```scala { .api }
144
class Pipeline extends Estimator[PipelineModel] {
145
def setStages(value: Array[PipelineStage]): Pipeline
146
def fit(dataset: Dataset[_]): PipelineModel
147
}
148
149
abstract class Estimator[M <: Model[M]] {
150
def fit(dataset: Dataset[_]): M
151
}
152
153
abstract class Transformer {
154
def transform(dataset: Dataset[_]): DataFrame
155
}
156
```
157
158
[Machine Learning](./machine-learning.md)
159
160
### Graph Processing
161
162
Graph-parallel computation framework for processing property graphs.
163
164
```scala { .api }
165
abstract class Graph[VD, ED] {
166
def vertices: VertexRDD[VD]
167
def edges: EdgeRDD[ED]
168
def mapVertices[VD2](map: (VertexId, VD) => VD2): Graph[VD2, ED]
169
def aggregateMessages[A](sendMsg: EdgeContext[VD, ED, A] => Unit,
170
mergeMsg: (A, A) => A): VertexRDD[A]
171
def pageRank(tol: Double, resetProb: Double = 0.15): Graph[Double, Double]
172
}
173
```
174
175
[Graph Processing](./graph-processing.md)
176
177
### Stream Processing
178
179
Real-time stream processing capabilities for continuous data processing.
180
181
```scala { .api }
182
class StreamingContext(sparkContext: SparkContext, batchDuration: Duration) {
183
def socketTextStream(hostname: String, port: Int): ReceiverInputDStream[String]
184
def textFileStream(directory: String): DStream[String]
185
def start(): Unit
186
def awaitTermination(): Unit
187
}
188
189
abstract class DStream[T] {
190
def map[U](mapFunc: T => U): DStream[U]
191
def filter(filterFunc: T => Boolean): DStream[T]
192
def window(windowDuration: Duration, slideDuration: Duration): DStream[T]
193
def print(): Unit
194
}
195
```
196
197
[Stream Processing](./stream-processing.md)
198
199
### Application Launcher
200
201
Programmatic launching of Spark applications with monitoring capabilities.
202
203
```java { .api }
204
public class SparkLauncher {
205
public SparkLauncher setAppName(String appName);
206
public SparkLauncher setMaster(String master);
207
public SparkLauncher setMainClass(String mainClass);
208
public SparkAppHandle startApplication();
209
}
210
211
public interface SparkAppHandle {
212
State getState();
213
String getAppId();
214
void kill();
215
}
216
```
217
218
[Application Launcher](./application-launcher.md)
219
220
## Types
221
222
### Core Types
223
224
```scala { .api }
225
class SparkConf {
226
def set(key: String, value: String): SparkConf
227
def setMaster(master: String): SparkConf
228
def setAppName(name: String): SparkConf
229
}
230
231
class Broadcast[T] {
232
def value: T
233
def destroy(): Unit
234
}
235
236
object StorageLevel {
237
val MEMORY_ONLY: StorageLevel
238
val MEMORY_AND_DISK: StorageLevel
239
val MEMORY_ONLY_SER: StorageLevel
240
val DISK_ONLY: StorageLevel
241
}
242
```
243
244
### SQL Types
245
246
```scala { .api }
247
import org.apache.spark.sql.types._
248
249
case class StructType(fields: Array[StructField])
250
case class StructField(name: String, dataType: DataType, nullable: Boolean = true)
251
252
abstract class DataType
253
object DataTypes {
254
val StringType: DataType
255
val IntegerType: DataType
256
val DoubleType: DataType
257
val BooleanType: DataType
258
val TimestampType: DataType
259
}
260
261
trait Row {
262
def getString(i: Int): String
263
def getInt(i: Int): Int
264
def getDouble(i: Int): Double
265
def getBoolean(i: Int): Boolean
266
}
267
268
class Column {
269
def ===(other: Any): Column
270
def &&(other: Column): Column
271
def ||(other: Column): Column
272
def isNull: Column
273
def isNotNull: Column
274
}
275
```
276
277
### GraphX Types
278
279
```scala { .api }
280
type VertexId = Long
281
282
case class Edge[ED](srcId: VertexId, dstId: VertexId, attr: ED)
283
284
class EdgeTriplet[VD, ED] extends Edge[ED] {
285
def srcAttr: VD
286
def dstAttr: VD
287
}
288
289
abstract class VertexRDD[VD] extends RDD[(VertexId, VD)]
290
abstract class EdgeRDD[ED] extends RDD[Edge[ED]]
291
```