Tessl Tile for maven/org.apache.spark/spark-parent_2.12@3.5.0

or run

npx @tessl/cli init

Version

Tile

Overview

Evals

Files

docs

application-launcher.md core-engine.md graph-processing.md index.md machine-learning.md sql-dataframes.md stream-processing.md

index.mddocs/

0
# Apache Spark
1

2
Apache Spark is a unified analytics engine for large-scale data processing that provides high-level APIs in Scala, Java, Python, and R, along with an optimized engine supporting general computation graphs. Spark includes multiple specialized components for SQL and DataFrames processing, machine learning, graph processing, and real-time stream processing.
3

4
## Package Information
5

6
- **Package Name**: org.apache.spark:spark-parent_2.13
7
- **Package Type**: maven
8
- **Language**: Scala (with Java, Python, R bindings)
9
- **License**: Apache-2.0
10
- **Installation**: Add Maven dependency or download distribution from https://spark.apache.org/downloads.html
11

12
## Core Imports
13

14
For Scala applications:
15

16
```scala
17
import org.apache.spark.SparkContext
18
import org.apache.spark.SparkConf
19
import org.apache.spark.sql.SparkSession
20
import org.apache.spark.rdd.RDD
21
```
22

23
For Java applications:
24

25
```java
26
import org.apache.spark.SparkConf;
27
import org.apache.spark.api.java.JavaSparkContext;
28
import org.apache.spark.sql.SparkSession;
29
import org.apache.spark.sql.Dataset;
30
import org.apache.spark.sql.Row;
31
```
32

33
Maven dependency:
34

35
```xml
36
<dependency>
37
    <groupId>org.apache.spark</groupId>
38
    <artifactId>spark-core_2.13</artifactId>
39
    <version>3.5.6</version>
40
</dependency>
41
<dependency>
42
    <groupId>org.apache.spark</groupId>
43
    <artifactId>spark-sql_2.13</artifactId>
44
    <version>3.5.6</version>
45
</dependency>
46
```
47

48
## Basic Usage
49

50
```scala
51
import org.apache.spark.sql.SparkSession
52

53
// Create SparkSession (entry point for DataFrame and SQL APIs)
54
val spark = SparkSession.builder()
55
  .appName("MySparkApp")
56
  .master("local[*]")
57
  .getOrCreate()
58

59
// Create DataFrame from data
60
val df = spark.createDataFrame(Seq(
61
  ("Alice", 25),
62
  ("Bob", 30),
63
  ("Charlie", 35)
64
)).toDF("name", "age")
65

66
// Run SQL queries
67
df.createOrReplaceTempView("people")
68
val adults = spark.sql("SELECT * FROM people WHERE age >= 30")
69
adults.show()
70

71
// DataFrame transformations
72
val result = df.filter($"age" > 25)
73
  .select($"name", $"age")
74
  .orderBy($"age".desc)
75

76
result.collect()
77

78
spark.stop()
79
```
80

81
## Architecture
82

83
Apache Spark consists of several key components:
84

85
- **Spark Core**: The foundation providing basic I/O functionalities, RDDs, and task scheduling
86
- **Spark SQL**: Module for working with structured data using DataFrames and SQL
87
- **MLlib**: Machine learning library with algorithms and utilities
88
- **GraphX**: Graph processing framework for graph-parallel computation
89
- **Structured Streaming**: Scalable and fault-tolerant stream processing
90
- **Spark Streaming**: Legacy stream processing API (DStreams)
91

92
## Capabilities
93

94
### Core Engine (RDDs and SparkContext)
95

96
Fundamental distributed computing capabilities with Resilient Distributed Datasets (RDDs).
97

98
```scala { .api }
99
class SparkContext(config: SparkConf) {
100
  def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T]
101
  def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
102
  def stop(): Unit
103
}
104

105
abstract class RDD[T] {
106
  def map[U](f: T => U): RDD[U]
107
  def filter(f: T => Boolean): RDD[T]
108
  def collect(): Array[T]
109
  def count(): Long
110
  def cache(): this.type
111
}
112
```
113

114
[Core Engine APIs](./core-engine.md)
115

116
### Structured Data Processing (SQL and DataFrames)
117

118
High-level APIs for working with structured data, including DataFrames, Datasets, and SQL.
119

120
```scala { .api }
121
class SparkSession {
122
  def sql(sqlText: String): DataFrame
123
  def read: DataFrameReader
124
  def createDataFrame[A <: Product](rdd: RDD[A]): DataFrame
125
}
126

127
class Dataset[T] {
128
  def select(cols: Column*): DataFrame
129
  def filter(condition: Column): Dataset[T]
130
  def groupBy(cols: Column*): RelationalGroupedDataset
131
  def join(right: Dataset[_], joinExprs: Column): DataFrame
132
  def show(): Unit
133
  def collect(): Array[T]
134
}
135
```
136

137
[SQL and DataFrames](./sql-dataframes.md)
138

139
### Machine Learning
140

141
Comprehensive machine learning library with algorithms, feature engineering, and model evaluation.
142

143
```scala { .api }
144
class Pipeline extends Estimator[PipelineModel] {
145
  def setStages(value: Array[PipelineStage]): Pipeline
146
  def fit(dataset: Dataset[_]): PipelineModel
147
}
148

149
abstract class Estimator[M <: Model[M]] {
150
  def fit(dataset: Dataset[_]): M
151
}
152

153
abstract class Transformer {
154
  def transform(dataset: Dataset[_]): DataFrame
155
}
156
```
157

158
[Machine Learning](./machine-learning.md)
159

160
### Graph Processing
161

162
Graph-parallel computation framework for processing property graphs.
163

164
```scala { .api }
165
abstract class Graph[VD, ED] {
166
  def vertices: VertexRDD[VD]
167
  def edges: EdgeRDD[ED]
168
  def mapVertices[VD2](map: (VertexId, VD) => VD2): Graph[VD2, ED]
169
  def aggregateMessages[A](sendMsg: EdgeContext[VD, ED, A] => Unit, 
170
                          mergeMsg: (A, A) => A): VertexRDD[A]
171
  def pageRank(tol: Double, resetProb: Double = 0.15): Graph[Double, Double]
172
}
173
```
174

175
[Graph Processing](./graph-processing.md)
176

177
### Stream Processing
178

179
Real-time stream processing capabilities for continuous data processing.
180

181
```scala { .api }
182
class StreamingContext(sparkContext: SparkContext, batchDuration: Duration) {
183
  def socketTextStream(hostname: String, port: Int): ReceiverInputDStream[String]
184
  def textFileStream(directory: String): DStream[String]
185
  def start(): Unit
186
  def awaitTermination(): Unit
187
}
188

189
abstract class DStream[T] {
190
  def map[U](mapFunc: T => U): DStream[U]
191
  def filter(filterFunc: T => Boolean): DStream[T]
192
  def window(windowDuration: Duration, slideDuration: Duration): DStream[T]
193
  def print(): Unit
194
}
195
```
196

197
[Stream Processing](./stream-processing.md)
198

199
### Application Launcher
200

201
Programmatic launching of Spark applications with monitoring capabilities.
202

203
```java { .api }
204
public class SparkLauncher {
205
  public SparkLauncher setAppName(String appName);
206
  public SparkLauncher setMaster(String master);
207
  public SparkLauncher setMainClass(String mainClass);
208
  public SparkAppHandle startApplication();
209
}
210

211
public interface SparkAppHandle {
212
  State getState();
213
  String getAppId();
214
  void kill();
215
}
216
```
217

218
[Application Launcher](./application-launcher.md)
219

220
## Types
221

222
### Core Types
223

224
```scala { .api }
225
class SparkConf {
226
  def set(key: String, value: String): SparkConf
227
  def setMaster(master: String): SparkConf
228
  def setAppName(name: String): SparkConf
229
}
230

231
class Broadcast[T] {
232
  def value: T
233
  def destroy(): Unit
234
}
235

236
object StorageLevel {
237
  val MEMORY_ONLY: StorageLevel
238
  val MEMORY_AND_DISK: StorageLevel
239
  val MEMORY_ONLY_SER: StorageLevel
240
  val DISK_ONLY: StorageLevel
241
}
242
```
243

244
### SQL Types
245

246
```scala { .api }
247
import org.apache.spark.sql.types._
248

249
case class StructType(fields: Array[StructField])
250
case class StructField(name: String, dataType: DataType, nullable: Boolean = true)
251

252
abstract class DataType
253
object DataTypes {
254
  val StringType: DataType
255
  val IntegerType: DataType
256
  val DoubleType: DataType
257
  val BooleanType: DataType
258
  val TimestampType: DataType
259
}
260

261
trait Row {
262
  def getString(i: Int): String
263
  def getInt(i: Int): Int
264
  def getDouble(i: Int): Double
265
  def getBoolean(i: Int): Boolean
266
}
267

268
class Column {
269
  def ===(other: Any): Column
270
  def &&(other: Column): Column
271
  def ||(other: Column): Column
272
  def isNull: Column
273
  def isNotNull: Column
274
}
275
```
276

277
### GraphX Types
278

279
```scala { .api }
280
type VertexId = Long
281

282
case class Edge[ED](srcId: VertexId, dstId: VertexId, attr: ED)
283

284
class EdgeTriplet[VD, ED] extends Edge[ED] {
285
  def srcAttr: VD
286
  def dstAttr: VD
287
}
288

289
abstract class VertexRDD[VD] extends RDD[(VertexId, VD)]
290
abstract class EdgeRDD[ED] extends RDD[Edge[ED]]
291
```

Version

Tile

Files

index.md.css-3qkkll{font-size:var(--chakra-font-sizes-sm);font-weight:var(--chakra-font-weights-normal);color:var(--chakra-colors-gray-300);}docs/

index.mddocs/