Apache Spark: unified analytics engine for large-scale data processing, with APIs in Scala, Java, Python, and R.
npx @tessl/cli install tessl/maven-org-apache-spark--spark-parent_2.13@4.0.0
# Apache Spark

Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.

## Package Information

- **Package Name**: org.apache.spark:spark-parent_2.13
- **Package Type**: Maven
- **Language**: Scala (with Java, Python, R APIs)
- **Installation**: See [Building Spark](#building-spark)
- **License**: Apache-2.0

### Maven Dependency

For Spark SQL (most common):

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.13</artifactId>
  <version>4.0.0</version>
</dependency>
```

For Spark Core:

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.13</artifactId>
  <version>4.0.0</version>
</dependency>
```

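If your build uses sbt rather than Maven, the equivalent coordinates (matching the versions above) are:

```scala
// build.sbt -- %% appends the Scala binary suffix (_2.13) to the artifact name
libraryDependencies += "org.apache.spark" %% "spark-sql" % "4.0.0"
```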
## Core Imports

```scala { .api }
// Main entry points
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

// Core data structures
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset}

// Configuration
import org.apache.spark.SparkConf

// Common types
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
```

## Basic Usage

### Creating a Spark Session (Recommended)

```scala { .api }
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MySparkApp")
  .master("local[*]") // Use all available cores locally
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

// Use the spark session for SQL operations
val df = spark.read.json("path/to/file.json")
df.show()

// Access the SparkContext for RDD operations
val sc = spark.sparkContext
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))
println(rdd.map(_ * 2).collect().mkString(", "))

spark.stop()
```

### Working with DataFrames

```scala { .api }
import org.apache.spark.sql.functions._

// Read data
val df = spark.read
  .option("header", "true")
  .csv("path/to/data.csv")

// Transform data
val result = df
  .filter(col("age") > 21)
  .groupBy("department")
  .agg(avg("salary").as("avg_salary"))
  .orderBy(desc("avg_salary"))

result.show()
```

### Working with RDDs (Low-level API)

```scala { .api }
val sc = spark.sparkContext

// Create an RDD from a local collection
val numbers = sc.parallelize(1 to 1000)

// Transformations are lazy; reduce is an action that triggers computation
val result = numbers
  .filter(_ % 2 == 0)
  .map(_ * 2)
  .reduce(_ + _)

println(s"Result: $result")
```

## Architecture

Apache Spark is built around several key components:

### Core Engine
- **SparkContext**: The main entry point for Spark functionality and connection to the cluster
- **RDD (Resilient Distributed Dataset)**: Fundamental data abstraction - immutable distributed collections
- **DAG Scheduler**: Breaks each job's logical graph of transformations into stages of tasks
- **Task Scheduler**: Schedules and executes tasks across the cluster

### High-Level APIs
- **SparkSession**: Unified entry point for DataFrame and Dataset APIs (recommended)
- **DataFrame/Dataset**: Higher-level abstractions built on RDDs with schema information
- **Catalyst Optimizer**: Query optimization engine for SQL and DataFrame operations

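To see the Catalyst optimizer at work, `Dataset.explain(true)` prints the parsed, analyzed, optimized, and physical plans. A minimal sketch, assuming an active `spark` session:

```scala
import org.apache.spark.sql.functions._

val df = spark.range(1000).toDF("id")

// The extended plan output shows Catalyst rewriting this query,
// e.g. collapsing projections before the physical plan is chosen
df.filter(col("id") > 10)
  .select((col("id") * 2).as("doubled"))
  .explain(true)
```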
### Libraries and Extensions
- **Spark SQL**: Module for working with structured data using SQL or the DataFrame API
- **MLlib**: Machine learning library with algorithms and utilities
- **GraphX**: Graph processing framework for graph-based analytics
- **Structured Streaming**: Real-time stream processing with the DataFrame API

### Deployment and Storage
- **Cluster Managers**: Support for YARN, Kubernetes, Mesos, and standalone mode
- **Storage Systems**: Integration with HDFS, S3, databases, and various file formats
- **Serialization**: Efficient data serialization for network and storage operations

## Building Spark

Spark is built using Apache Maven:

```bash
./build/mvn -DskipTests clean package
```

For development with specific profiles:

```bash
./build/mvn -Pyarn -Phadoop-3.3 -Pscala-2.13 -DskipTests clean package
```

## Capabilities

### [Core Engine](./core.md)

Low-level distributed computing with RDDs, SparkContext, and cluster management. Provides the foundation for all other Spark components.

```scala { .api }
class SparkContext(config: SparkConf) extends Logging {
  def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T]
  def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
  def stop(): Unit
}

abstract class RDD[T: ClassTag] extends Serializable with Logging {
  def map[U: ClassTag](f: T => U): RDD[U]
  def filter(f: T => Boolean): RDD[T]
  def collect(): Array[T]
  def count(): Long
}
```

[Core Engine Documentation](./core.md)

### [SQL and DataFrames](./sql.md)

High-level APIs for working with structured data using SQL, DataFrames, and Datasets with the Catalyst optimizer.

```scala { .api }
class SparkSession extends Serializable with Closeable {
  def read: DataFrameReader
  def sql(sqlText: String): DataFrame
  def createDataFrame[A <: Product : TypeTag](data: Seq[A]): DataFrame
  def stop(): Unit
}

abstract class Dataset[T] extends Serializable {
  def select(cols: Column*): DataFrame
  def filter(condition: Column): Dataset[T]
  def groupBy(cols: Column*): RelationalGroupedDataset
  def show(numRows: Int = 20): Unit
}
```

[SQL and DataFrames Documentation](./sql.md)

### [Machine Learning](./mllib.md)

Comprehensive machine learning library with algorithms for classification, regression, clustering, and collaborative filtering.

```scala { .api }
// Example ML pipeline components
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
```

[Machine Learning Documentation](./mllib.md)

### [Structured Streaming](./streaming.md)

Scalable and fault-tolerant stream processing engine built on the Spark SQL engine with micro-batch and continuous processing modes.

```scala { .api }
val spark = SparkSession.builder().appName("StructuredNetworkWordCount").getOrCreate()
import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Split the lines into words and count them; "complete" output mode
// requires an aggregation such as this groupBy/count
val wordCounts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
```

[Structured Streaming Documentation](./streaming.md)

### [Graph Processing](./graphx.md)

Graph-parallel computation framework for ETL, exploratory data analysis, and iterative graph computation.

```scala { .api }
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val vertices: RDD[(VertexId, String)] = sc.parallelize(Seq(
  (1L, "Alice"), (2L, "Bob"), (3L, "Charlie")
))

val edges: RDD[Edge[String]] = sc.parallelize(Seq(
  Edge(1L, 2L, "friend"), Edge(2L, 3L, "follow")
))

val graph = Graph(vertices, edges)
```

[Graph Processing Documentation](./graphx.md)

### [Core Utilities](./exceptions.md)

Essential infrastructure components including exception handling, logging, configuration management, and storage utilities that support all Spark components.

**Key Utility Areas:**

- [Exception Handling](./exceptions.md) - Structured error management with error classes
- [Storage Configuration](./storage.md) - RDD storage level management
- [Logging Framework](./logging.md) - Structured logging with MDC support
- [Thread Utilities](./utils.md) - Thread-safe utilities and lexical scoping

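As one concrete example of the storage utilities, `StorageLevel` controls how a cached RDD is held; a minimal sketch, assuming an active `spark` session:

```scala
import org.apache.spark.storage.StorageLevel

val rdd = spark.sparkContext.parallelize(1 to 100)

// Keep partitions in memory, spilling to disk when they do not fit
rdd.persist(StorageLevel.MEMORY_AND_DISK)

println(rdd.count()) // the first action materializes the cache
rdd.unpersist()
```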
## Interactive Usage

### Scala Shell

```bash
./bin/spark-shell
```

Example:

```scala
scala> spark.range(1000 * 1000 * 1000).count()
res0: Long = 1000000000
```

### Python Shell

```bash
./bin/pyspark
```

Example:

```python
>>> spark.range(1000 * 1000 * 1000).count()
1000000000
```

### SQL Shell

```bash
./bin/spark-sql
```

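An example query (`range` is Spark SQL's built-in table-valued function; console output abbreviated):

```sql
spark-sql> SELECT count(*) FROM range(1000);
1000
```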
## Configuration

Key configuration options:

```scala { .api }
val conf = new SparkConf()
  .setAppName("MyApp")
  .setMaster("local[4]") // Use 4 cores locally
  .set("spark.sql.adaptive.enabled", "true")
  .set("spark.sql.adaptive.coalescePartitions.enabled", "true")
  .set("spark.executor.memory", "2g")
  .set("spark.driver.memory", "1g")
```

## Deployment

Spark supports multiple deployment modes via the master URL:

- **Local**: `local`, `local[4]`, `local[*]`
- **Standalone**: `spark://host:port`
- **YARN**: `yarn`
- **Kubernetes**: `k8s://https://kubernetes-api-url`
- **Mesos**: `mesos://host:port` (deprecated since Spark 3.2)

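Applications are typically launched against one of these masters with `spark-submit`; a minimal sketch (the class name and JAR path are illustrative):

```bash
./bin/spark-submit \
  --class com.example.MyApp \
  --master "local[4]" \
  --conf spark.executor.memory=2g \
  target/my-app.jar
```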
---

**Core Documentation:**

- [Core Engine](./core.md) - SparkContext, RDDs, and distributed computing
- [SQL and DataFrames](./sql.md) - SparkSession, DataFrame, Dataset APIs
- [Machine Learning](./mllib.md) - MLlib algorithms and pipelines
- [Structured Streaming](./streaming.md) - Real-time stream processing
- [Graph Processing](./graphx.md) - GraphX graph-parallel computation

**Utilities Documentation:**

- [Exception Handling](./exceptions.md) - Error management and validation
- [Storage Configuration](./storage.md) - RDD persistence and storage levels
- [Logging Framework](./logging.md) - Structured logging infrastructure
- [Thread Utilities](./utils.md) - Thread-safe utilities and scoping