Tessl Tile for maven/org.apache.spark/spark-parent_2.12@3.5.0

or run

npx @tessl/cli init

Version

Tile

Overview

Evals

Files

docs

core.md deployment.md graphx.md index.md ml.md sql.md streaming.md

index.mddocs/

0
# Apache Spark
1

2
Apache Spark is a unified analytics engine for large-scale data processing that provides advanced programming APIs in Scala, Java, Python, and R. It offers high-level APIs for distributed data processing along with an optimized computation engine supporting general directed acyclic graphs for data analysis.
3

4
## Package Information
5

6
- **Package Name**: org.apache.spark:spark-parent_2.12
7
- **Package Type**: maven
8
- **Languages**: Scala, Java, Python, R
9
- **Installation**: 
10
  - Maven/SBT: `org.apache.spark:spark-core_2.12:3.5.6`
11
  - Python: `pip install pyspark==3.5.6`
12
  - R: `install.packages("SparkR")`
13

14
## Core Imports
15

16
```scala
17
import org.apache.spark.SparkContext
18
import org.apache.spark.SparkConf
19
import org.apache.spark.sql.SparkSession
20
import org.apache.spark.rdd.RDD
21
```
22

23
For Java:
24

25
```java
26
import org.apache.spark.SparkContext;
27
import org.apache.spark.SparkConf;
28
import org.apache.spark.sql.SparkSession;
29
import org.apache.spark.api.java.JavaRDD;
30
import org.apache.spark.api.java.JavaSparkContext;
31
```
32

33
For Python:
34

35
```python
36
from pyspark import SparkContext, SparkConf
37
from pyspark.sql import SparkSession, DataFrame, Column
38
from pyspark.rdd import RDD
39
```
40

41
Common PySpark imports:
42

43
```python
44
from pyspark.sql import functions as F
45
from pyspark.ml import Pipeline
46
from pyspark.ml.feature import VectorAssembler
47
from pyspark.ml.classification import LogisticRegression
48
```
49

50
## Basic Usage
51

52
```scala
53
import org.apache.spark.sql.SparkSession
54
import org.apache.spark.SparkContext
55

56
// Create SparkSession (modern approach)
57
val spark = SparkSession.builder()
58
  .appName("MyApp")
59
  .master("local[*]")
60
  .getOrCreate()
61

62
// Working with DataFrames
63
val df = spark.read
64
  .option("header", "true")
65
  .csv("data.csv")
66

67
df.select("name", "age")
68
  .filter($"age" > 21)
69
  .show()
70

71
// Working with RDDs (low-level API)
72
val sc = spark.sparkContext
73
val data = sc.parallelize(1 to 1000)
74
val result = data
75
  .map(_ * 2)
76
  .filter(_ > 100)
77
  .collect()
78

79
spark.stop()
80
```
81

82
Python equivalent:
83

84
```python
85
from pyspark.sql import SparkSession
86
from pyspark.sql import functions as F
87

88
# Create SparkSession
89
spark = SparkSession.builder \
90
    .appName("MyApp") \
91
    .master("local[*]") \
92
    .getOrCreate()
93

94
# Working with DataFrames
95
df = spark.read \
96
    .option("header", "true") \
97
    .csv("data.csv")
98

99
df.select("name", "age") \
100
  .filter(F.col("age") > 21) \
101
  .show()
102

103
# Working with RDDs
104
sc = spark.sparkContext
105
data = sc.parallelize(range(1, 1001))
106
result = data \
107
    .map(lambda x: x * 2) \
108
    .filter(lambda x: x > 100) \
109
    .collect()
110

111
spark.stop()
112
```
113

114
## Architecture
115

116
Apache Spark is built around several key components:
117

118
- **Core Engine**: Provides distributed task scheduling, memory management, fault recovery, and storage system interactions
119
- **Spark SQL**: Module for working with structured data using DataFrames and Datasets with SQL queries
120
- **MLlib**: Machine learning library providing common algorithms and utilities
121
- **GraphX**: Graph processing framework for analyzing graph structures and running graph algorithms
122
- **Structured Streaming**: Stream processing engine built on Spark SQL for real-time data processing
123
- **Cluster Management**: Support for various cluster managers (YARN, Kubernetes, Mesos, Standalone)
124

125
## Capabilities
126

127
### Core Data Processing
128

129
Distributed data processing using Resilient Distributed Datasets (RDDs) and the fundamental Spark execution engine. Provides fault-tolerant, parallel data structures and transformations.
130

131
```scala { .api }
132
class SparkContext(config: SparkConf) {
133
  def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T]
134
  def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
135
  def broadcast[T: ClassTag](value: T): Broadcast[T]
136
  def stop(): Unit
137
}
138

139
abstract class RDD[T: ClassTag] {
140
  def map[U: ClassTag](f: T => U): RDD[U]
141
  def filter(f: T => Boolean): RDD[T]
142
  def collect(): Array[T]
143
  def count(): Long
144
  def cache(): RDD[T]
145
}
146
```
147

148
Python API:
149

150
```python { .api }
151
class SparkContext:
152
    def __init__(self, conf: SparkConf = None)
153
    def parallelize(self, c: Iterable, numSlices: int = None) -> RDD
154
    def textFile(self, name: str, minPartitions: int = None) -> RDD
155
    def broadcast(self, value: Any) -> Broadcast
156
    def stop(self) -> None
157

158
class RDD:
159
    def map(self, f: Callable) -> RDD
160
    def filter(self, f: Callable) -> RDD
161
    def collect(self) -> List
162
    def count(self) -> int
163
    def cache(self) -> RDD
164
```
165

166
[Core Processing](./core.md)
167

168
### Structured Data Processing
169

170
High-level APIs for working with structured data using DataFrames and Datasets. Built on Spark SQL with Catalyst optimizer for query optimization and code generation.
171

172
```scala { .api }
173
class SparkSession {
174
  def read: DataFrameReader
175
  def sql(sqlText: String): DataFrame
176
  def table(tableName: String): DataFrame
177
  def createDataFrame[A <: Product : TypeTag](data: Seq[A]): DataFrame
178
  def stop(): Unit
179
}
180

181
class Dataset[T] {
182
  def select(cols: Column*): DataFrame
183
  def filter(condition: Column): Dataset[T]
184
  def groupBy(cols: Column*): RelationalGroupedDataset
185
  def join(right: Dataset[_]): DataFrame
186
  def write: DataFrameWriter[T]
187
  def collect(): Array[T]
188
  def show(numRows: Int = 20): Unit
189
}
190

191
type DataFrame = Dataset[Row]
192
```
193

194
Python API:
195

196
```python { .api }
197
class SparkSession:
198
    @property
199
    def read(self) -> DataFrameReader
200
    def sql(self, sqlQuery: str) -> DataFrame
201
    def table(self, tableName: str) -> DataFrame
202
    def createDataFrame(self, data: List, schema: Optional[Union[List, StructType]] = None) -> DataFrame
203
    def stop(self) -> None
204

205
class DataFrame:
206
    def select(self, *cols: Union[str, Column]) -> DataFrame
207
    def filter(self, condition: Union[str, Column]) -> DataFrame
208
    def where(self, condition: Union[str, Column]) -> DataFrame
209
    def groupBy(self, *cols: Union[str, Column]) -> GroupedData
210
    def join(self, other: DataFrame, on: Optional[Union[str, List[str], Column]] = None) -> DataFrame
211
    def collect(self) -> List[Row]
212
    def show(self, n: int = 20, truncate: bool = True) -> None
213
```
214

215
[SQL and DataFrames](./sql.md)
216

217
### Machine Learning
218

219
Comprehensive machine learning library with algorithms for classification, regression, clustering, and collaborative filtering. Provides both high-level Pipeline API and low-level RDD-based APIs.
220

221
```scala { .api }
222
class Pipeline extends Estimator[PipelineModel] {
223
  def setStages(value: Array[PipelineStage]): Pipeline
224
  def fit(dataset: Dataset[_]): PipelineModel
225
}
226

227
abstract class Estimator[M <: Model[M]] extends PipelineStage {
228
  def fit(dataset: Dataset[_]): M
229
}
230

231
abstract class Transformer extends PipelineStage {
232
  def transform(dataset: Dataset[_]): DataFrame
233
}
234
```
235

236
Python API:
237

238
```python { .api }
239
class Pipeline(Estimator):
240
    def setStages(self, value: List[PipelineStage]) -> Pipeline
241
    def fit(self, dataset: DataFrame) -> PipelineModel
242

243
class Estimator(PipelineStage):
244
    def fit(self, dataset: DataFrame) -> Model
245

246
class Transformer(PipelineStage):
247
    def transform(self, dataset: DataFrame) -> DataFrame
248

249
class PipelineModel(Model):
250
    def transform(self, dataset: DataFrame) -> DataFrame
251
```
252

253
[Machine Learning](./ml.md)
254

255
### Graph Processing
256

257
GraphX provides APIs for graphs and graph-parallel computation with fundamental operators like subgraph, joinVertices, and aggregateMessages, plus optimized variants of graph algorithms. **Note**: GraphX is only available in Scala and Java - there is no Python API.
258

259
```scala { .api }
260
abstract class Graph[VD: ClassTag, ED: ClassTag] {
261
  val vertices: VertexRDD[VD]
262
  val edges: EdgeRDD[ED]
263
  def mapVertices[VD2: ClassTag](map: (VertexId, VD) => VD2): Graph[VD2, ED]
264
  def aggregateMessages[A: ClassTag](
265
    sendMsg: EdgeContext[VD, ED, A] => Unit,
266
    mergeMsg: (A, A) => A
267
  ): VertexRDD[A]
268
}
269

270
type VertexId = Long
271
```
272

273
**Python Alternative**: Use GraphFrames library (`pip install graphframes`) for graph processing in Python with Spark DataFrames.
274

275
[Graph Processing](./graphx.md)
276

277
### Stream Processing
278

279
Structured Streaming provides real-time stream processing with exactly-once fault-tolerance guarantees. Built on the Spark SQL engine for seamless integration with batch processing.
280

281
```scala { .api }
282
class StreamingContext(conf: SparkConf, batchDuration: Duration) {
283
  def socketTextStream(hostname: String, port: Int): ReceiverInputDStream[String]
284
  def textFileStream(directory: String): DStream[String]
285
  def start(): Unit
286
  def stop(): Unit
287
  def awaitTermination(): Unit
288
}
289

290
abstract class DStream[T: ClassTag] {
291
  def map[U: ClassTag](mapFunc: T => U): DStream[U]
292
  def filter(filterFunc: T => Boolean): DStream[T]
293
  def window(windowDuration: Duration, slideDuration: Duration): DStream[T]
294
  def foreachRDD(foreachFunc: RDD[T] => Unit): Unit
295
}
296
```
297

298
Python API:
299

300
```python { .api }
301
class StreamingContext:
302
    def __init__(self, sparkContext: SparkContext, batchDuration: float)
303
    def socketTextStream(self, hostname: str, port: int) -> DStream
304
    def textFileStream(self, directory: str) -> DStream
305
    def start(self) -> None
306
    def stop(self, stopSparkContext: bool = True, stopGracefully: bool = False) -> None
307
    def awaitTermination(self, timeout: Optional[float] = None) -> None
308

309
class DStream:
310
    def map(self, f: Callable) -> DStream
311
    def filter(self, f: Callable) -> DStream
312
    def window(self, windowDuration: float, slideDuration: Optional[float] = None) -> DStream
313
    def foreachRDD(self, func: Callable[[RDD], None]) -> None
314
```
315

316
For modern streaming applications, prefer Structured Streaming via `SparkSession.readStream` which provides better fault tolerance and performance.
317

318
[Stream Processing](./streaming.md)
319

320
### Application Management
321

322
Programmatic interfaces for launching, monitoring, and managing Spark applications across different cluster managers and deployment modes.
323

324
```scala { .api }
325
class SparkConf(loadDefaults: Boolean = true) {
326
  def set(key: String, value: String): SparkConf
327
  def setMaster(master: String): SparkConf
328
  def setAppName(name: String): SparkConf
329
  def get(key: String): String
330
}
331
```
332

333
```java { .api }
334
public class SparkLauncher {
335
  public SparkLauncher setAppName(String appName);
336
  public SparkLauncher setMaster(String master);
337
  public SparkLauncher setMainClass(String mainClass);
338
  public Process launch() throws IOException;
339
  public SparkAppHandle startApplication() throws IOException;
340
}
341
```
342

343
Python API:
344

345
```python { .api }
346
class SparkConf:
347
    def __init__(self, loadDefaults: bool = True)
348
    def set(self, key: str, value: str) -> SparkConf
349
    def setMaster(self, value: str) -> SparkConf
350
    def setAppName(self, value: str) -> SparkConf
351
    def get(self, key: str, defaultValue: Optional[str] = None) -> str
352
    def getAll(self) -> List[Tuple[str, str]]
353
```
354

355
Note: Python does not have a direct equivalent to SparkLauncher. Use `spark-submit` command-line tool or SparkSession.builder for application management.
356

357
### Python-Specific Features
358

359
Python-specific Spark capabilities including pandas API compatibility, type hints, and Python-optimized operations.
360

361
```python { .api }
362
# Pandas API on Spark (pyspark.pandas)
363
import pyspark.pandas as ps
364

365
class DataFrame:  # pandas-compatible DataFrame
366
    def head(self, n: int = 5) -> DataFrame
367
    def describe(self) -> DataFrame
368
    def groupby(self, by: Union[str, List[str]]) -> GroupBy
369
    def merge(self, right: DataFrame, on: str = None) -> DataFrame
370

371
# SQL Functions
372
from pyspark.sql import functions as F
373

374
# Common functions (473+ available)
375
def col(colName: str) -> Column
376
def lit(col: Any) -> Column
377
def when(condition: Column, value: Any) -> Column
378
def coalesce(*cols: Column) -> Column
379
def concat(*cols: Column) -> Column
380
def regexp_replace(str: Column, pattern: str, replacement: str) -> Column
381
```
382

383
**Key Python Features:**
384
- **Pandas API**: `pyspark.pandas` provides pandas-compatible DataFrame operations
385
- **Type Hints**: Full type annotation support for better IDE integration
386
- **Arrow Integration**: High-performance columnar data transfer between JVM and Python
387
- **UDF Support**: User-defined functions with vectorized operations using pandas UDFs
388

389
[Configuration and Deployment](./deployment.md)

Version

Tile

Files

index.md.css-3qkkll{font-size:var(--chakra-font-sizes-sm);font-weight:var(--chakra-font-weights-normal);color:var(--chakra-colors-gray-300);}docs/

index.mddocs/