Tessl Tile for maven/org.apache.spark/spark-parent_2.13@4.0.0

or run

npx @tessl/cli init

Version

Tile

Overview

Evals

Files

docs

core-engine.md graph-processing.md index.md machine-learning.md sql-dataframes.md stream-processing.md

index.mddocs/

0
# Apache Spark
1

2
Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R (deprecated), and an optimized engine that supports general computation graphs for data analysis. Spark includes specialized tools for SQL and DataFrames, machine learning (MLlib), graph processing (GraphX), and stream processing.
3

4
## Overview
5

6
Apache Spark is a unified analytics engine for large-scale data processing with high-level APIs in Scala, Java, Python, and R (deprecated). Key features include:
7

8
- **Unified Processing**: Single engine for batch, interactive, real-time, and machine learning workloads
9
- **High Performance**: In-memory computing with advanced DAG execution engine
10
- **Ease of Use**: Simple APIs in multiple languages with 80+ high-level operators
11
- **Scalability**: Runs everywhere from laptops to large clusters with thousands of nodes
12
- **Advanced Analytics**: Built-in modules for SQL, streaming, machine learning, and graph processing
13

14
## Package Information
15

16
- **Package Name**: org.apache.spark:spark-parent_2.13
17
- **Package Type**: Maven
18
- **Language**: Scala/Java
19
- **Installation**: Add to Maven/SBT dependencies or download distribution
20
- **Maven Coordinates**: `org.apache.spark:spark-core_2.13:4.0.0`
21

22
## Core Imports
23

24
For Scala applications:
25

26
```scala
27
import org.apache.spark.{SparkConf, SparkContext}
28
import org.apache.spark.sql.SparkSession
29
import org.apache.spark.rdd.RDD
30
```
31

32
For Java applications:
33

34
```java
35
import org.apache.spark.SparkConf;
36
import org.apache.spark.api.java.JavaSparkContext;
37
import org.apache.spark.sql.SparkSession;
38
import org.apache.spark.sql.Dataset;
39
import org.apache.spark.sql.Row;
40
```
41

42
## Basic Usage
43

44
### Core RDD API (Scala)
45

46
```scala
47
import org.apache.spark.{SparkConf, SparkContext}
48

49
// Create Spark configuration and context
50
val conf = new SparkConf().setAppName("MyApp").setMaster("local[*]")
51
val sc = new SparkContext(conf)
52

53
// Create RDD from collection
54
val data = Array(1, 2, 3, 4, 5)
55
val rdd = sc.parallelize(data)
56

57
// Transform and collect results
58
val result = rdd.map(_ * 2).filter(_ > 5).collect()
59

60
sc.stop()
61
```
62

63
### SQL and DataFrames (Scala)
64

65
```scala
66
import org.apache.spark.sql.SparkSession
67

68
// Create Spark session
69
val spark = SparkSession.builder()
70
  .appName("MyApp")
71
  .master("local[*]")
72
  .getOrCreate()
73

74
// Read data into DataFrame
75
val df = spark.read.json("path/to/data.json")
76

77
// SQL operations
78
df.select("name", "age")
79
  .filter(df("age") > 21)
80
  .show()
81

82
// SQL queries
83
df.createOrReplaceTempView("people")
84
val adults = spark.sql("SELECT name FROM people WHERE age >= 18")
85

86
spark.stop()
87
```
88

89
## Architecture
90

91
Apache Spark follows a driver-executor architecture:
92

93
- **Driver Program**: Contains the main function and defines RDDs/DataFrames
94
- **Cluster Manager**: Allocates resources across applications (YARN, Mesos, Standalone)
95
- **Executors**: Worker processes that run tasks and store data
96
- **Tasks**: Units of work sent to executors
97

98
Key architectural components:
99

100
- **Catalyst Optimizer**: Rule-based and cost-based query optimization
101
- **Tungsten**: Off-heap memory management and code generation
102
- **Resilient Distributed Datasets (RDDs)**: Fault-tolerant distributed collections
103
- **DataFrames/Datasets**: Structured data APIs built on RDDs
104

105
## Capabilities
106

107
### Core Engine
108

109
Provides the fundamental distributed computing capabilities with RDDs, transformations, actions, and distributed variables.
110

111
```scala { .api }
112
class SparkContext(config: SparkConf) {
113
  def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T]
114
  def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
115
  def broadcast[T](value: T): Broadcast[T]
116
  def longAccumulator(): LongAccumulator
117
  def stop(): Unit
118
}
119

120
abstract class RDD[T: ClassTag] {
121
  def map[U: ClassTag](f: T => U): RDD[U]
122
  def filter(f: T => Boolean): RDD[T]
123
  def collect(): Array[T]
124
  def count(): Long
125
  def reduce(f: (T, T) => T): T
126
  def cache(): RDD[T]
127
}
128
```
129

130
[Core Engine](./core-engine.md)
131

132
### SQL and DataFrames
133

134
Structured data processing with SQL queries, DataFrames, and type-safe Datasets. Includes data source connectors and streaming capabilities.
135

136
```scala { .api }
137
object SparkSession {
138
  def builder(): Builder
139
}
140

141
class SparkSession {
142
  def read: DataFrameReader
143
  def sql(sqlText: String): DataFrame
144
  def createDataFrame[A <: Product : TypeTag](data: Seq[A]): DataFrame
145
  def stop(): Unit
146
}
147

148
abstract class Dataset[T] {
149
  def select(cols: Column*): DataFrame
150
  def filter(condition: Column): Dataset[T]
151
  def groupBy(cols: Column*): RelationalGroupedDataset
152
  def join(right: Dataset[_]): DataFrame
153
  def collect(): Array[T]
154
  def show(numRows: Int = 20): Unit
155
}
156
```
157

158
[SQL and DataFrames](./sql-dataframes.md)
159

160
### Machine Learning
161

162
Scalable machine learning algorithms and utilities, including both RDD-based (MLlib) and DataFrame-based (ML) APIs.
163

164
```scala { .api }
165
// DataFrame-based ML Pipeline API
166
abstract class Estimator[M <: Model[M]] extends PipelineStage {
167
  def fit(dataset: Dataset[_]): M
168
}
169

170
abstract class Transformer extends PipelineStage {
171
  def transform(dataset: Dataset[_]): DataFrame
172
}
173

174
class Pipeline(stages: Array[PipelineStage]) extends Estimator[PipelineModel] {
175
  def fit(dataset: Dataset[_]): PipelineModel  
176
}
177
```
178

179
[Machine Learning](./machine-learning.md)
180

181
### Graph Processing
182

183
Large-scale graph processing with GraphX, including graph algorithms and graph-parallel computations.
184

185
```scala { .api }
186
abstract class Graph[VD: ClassTag, ED: ClassTag] {
187
  def vertices: VertexRDD[VD]
188
  def edges: EdgeRDD[ED]
189
  def mapVertices[VD2: ClassTag](map: (VertexId, VD) => VD2): Graph[VD2, ED]
190
  def subgraph(epred: EdgeTriplet[VD, ED] => Boolean = _ => true,
191
               vpred: (VertexId, VD) => Boolean = (_, _) => true): Graph[VD, ED]
192
}
193

194
object GraphLoader {
195
  def edgeListFile(sc: SparkContext, path: String): Graph[Int, Int]
196
}
197
```
198

199
[Graph Processing](./graph-processing.md)
200

201
### Stream Processing
202

203
Real-time data processing with both legacy DStreams and modern Structured Streaming APIs.
204

205
```scala { .api }
206
class DataStreamReader {
207
  def format(source: String): DataStreamReader
208
  def option(key: String, value: String): DataStreamReader
209
  def load(): DataFrame
210
}
211

212
abstract class StreamingQuery {
213
  def start(): StreamingQuery
214
  def stop(): Unit
215
  def awaitTermination(): Unit
216
  def isActive: Boolean
217
}
218
```
219

220
[Stream Processing](./stream-processing.md)
221

222
## Types
223

224
### Core Types
225

226
```scala { .api }
227
class SparkConf(loadDefaults: Boolean = true) {
228
  def set(key: String, value: String): SparkConf
229
  def setAppName(name: String): SparkConf
230
  def setMaster(master: String): SparkConf
231
  def get(key: String): String
232
}
233

234
abstract class Broadcast[T] extends Serializable {
235
  def value: T
236
  def unpersist(): Unit
237
  def destroy(): Unit
238
}
239

240
abstract class AccumulatorV2[IN, OUT] extends Serializable {
241
  def add(v: IN): Unit
242
  def value: OUT
243
  def reset(): Unit
244
}
245

246
object StorageLevel {
247
  val MEMORY_ONLY: StorageLevel
248
  val MEMORY_AND_DISK: StorageLevel
249
  val DISK_ONLY: StorageLevel
250
}
251
```
252

253
### SQL Types
254

255
```scala { .api }
256
trait Row extends Serializable {
257
  def get(i: Int): Any
258
  def getString(i: Int): String
259
  def getInt(i: Int): Int
260
  def getDouble(i: Int): Double
261
  def isNullAt(i: Int): Boolean
262
  def size: Int
263
}
264

265
class Column(expr: Expression) {
266
  def ===(other: Any): Column
267
  def !==(other: Any): Column  
268
  def >(other: Any): Column
269
  def <(other: Any): Column
270
  def isNull: Column
271
  def cast(to: DataType): Column
272
  def alias(alias: String): Column
273
}
274

275
class StructType(fields: Array[StructField]) extends DataType {
276
  def add(field: StructField): StructType
277
  def add(name: String, dataType: DataType): StructType
278
  def fieldNames: Array[String]
279
}
280
```
281

282
### Graph Types
283

284
```scala { .api }
285
type VertexId = Long
286

287
case class Edge[ED](srcId: VertexId, dstId: VertexId, attr: ED)
288

289
class EdgeTriplet[VD, ED] extends Edge[ED] {
290
  def srcAttr: VD
291
  def dstAttr: VD
292
}
293

294
abstract class VertexRDD[VD] extends RDD[(VertexId, VD)]
295
abstract class EdgeRDD[ED] extends RDD[Edge[ED]]
296
```

Version

Tile

Files

index.md.css-3qkkll{font-size:var(--chakra-font-sizes-sm);font-weight:var(--chakra-font-weights-normal);color:var(--chakra-colors-gray-300);}docs/

index.mddocs/