# Apache Spark

Apache Spark is a unified analytics engine for large-scale data processing, with high-level APIs in Scala, Java, Python, and R. Its optimized execution engine supports general directed acyclic graphs (DAGs) of computation for data analysis.

## Package Information

- **Package Name**: org.apache.spark:spark-parent_2.12
- **Package Type**: maven
- **Languages**: Scala, Java, Python, R
- **Installation**:
  - Maven/SBT: `org.apache.spark:spark-core_2.12:3.5.6`
  - Python: `pip install pyspark==3.5.6`
  - R: `install.packages("SparkR")`

## Core Imports

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD
```

For Java:

```java
import org.apache.spark.SparkContext;
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
```

For Python:

```python
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, DataFrame, Column
from pyspark.rdd import RDD
```

Common PySpark imports:

```python
from pyspark.sql import functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
```

## Basic Usage

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkContext

// Create SparkSession (modern approach)
val spark = SparkSession.builder()
  .appName("MyApp")
  .master("local[*]")
  .getOrCreate()

// Enables the $"column" syntax used below
import spark.implicits._

// Working with DataFrames
val df = spark.read
  .option("header", "true")
  .csv("data.csv")

df.select("name", "age")
  .filter($"age" > 21)
  .show()

// Working with RDDs (low-level API)
val sc = spark.sparkContext
val data = sc.parallelize(1 to 1000)
val result = data
  .map(_ * 2)
  .filter(_ > 100)
  .collect()

spark.stop()
```

Python equivalent:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create SparkSession
spark = SparkSession.builder \
    .appName("MyApp") \
    .master("local[*]") \
    .getOrCreate()

# Working with DataFrames
df = spark.read \
    .option("header", "true") \
    .csv("data.csv")

df.select("name", "age") \
    .filter(F.col("age") > 21) \
    .show()

# Working with RDDs
sc = spark.sparkContext
data = sc.parallelize(range(1, 1001))
result = data \
    .map(lambda x: x * 2) \
    .filter(lambda x: x > 100) \
    .collect()

spark.stop()
```

## Architecture

Apache Spark is built around several key components:

- **Core Engine**: Provides distributed task scheduling, memory management, fault recovery, and storage system interactions
- **Spark SQL**: Module for structured data, queried through DataFrames, Datasets, or SQL
- **MLlib**: Machine learning library providing common algorithms and utilities
- **GraphX**: Graph processing framework for analyzing graph structures and running graph algorithms
- **Structured Streaming**: Stream processing engine built on Spark SQL for real-time data processing
- **Cluster Management**: Support for several cluster managers (Standalone, YARN, Kubernetes, and Mesos, which is deprecated as of Spark 3.2)
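
As a small illustration of the cluster-manager component, the sketch below builds the master URL string that each manager expects in `.master(...)` or `spark-submit --master`. The helper `master_url` and the host names are hypothetical; the URL shapes (`local[*]`, `spark://`, `yarn`, `k8s://`, `mesos://`) follow Spark's documented master URL formats.

```python
def master_url(manager, host=None, port=None):
    """Return a Spark master URL string for the given cluster manager.

    `master_url` is a hypothetical helper for illustration; Spark itself
    only consumes the resulting string.
    """
    if manager == "local":
        return "local[*]"                      # all cores on this machine
    if manager == "yarn":
        return "yarn"                          # cluster is located via Hadoop config
    if host is None or port is None:
        raise ValueError(f"{manager} needs a host and port")
    if manager == "standalone":
        return f"spark://{host}:{port}"
    if manager == "kubernetes":
        return f"k8s://https://{host}:{port}"  # address of the Kubernetes API server
    if manager == "mesos":
        return f"mesos://{host}:{port}"
    raise ValueError(f"unknown cluster manager: {manager}")


# Example hosts are placeholders, not real endpoints.
urls = [
    master_url("local"),
    master_url("yarn"),
    master_url("standalone", "master.example.com", 7077),
    master_url("kubernetes", "k8s.example.com", 6443),
]
```

The same application code can then run against any of these managers by changing only the master string.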


## Capabilities

### Core Data Processing

Distributed data processing using Resilient Distributed Datasets (RDDs) and the fundamental Spark execution engine. Provides fault-tolerant, parallel data structures and transformations.

```scala { .api }
class SparkContext(config: SparkConf) {
  def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T]
  def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
  def broadcast[T: ClassTag](value: T): Broadcast[T]
  def stop(): Unit
}

abstract class RDD[T: ClassTag] {
  def map[U: ClassTag](f: T => U): RDD[U]
  def filter(f: T => Boolean): RDD[T]
  def collect(): Array[T]
  def count(): Long
  def cache(): RDD[T]
}
```

Python API:

```python { .api }
class SparkContext:
    def __init__(self, conf: SparkConf = None)
    def parallelize(self, c: Iterable, numSlices: int = None) -> RDD
    def textFile(self, name: str, minPartitions: int = None) -> RDD
    def broadcast(self, value: Any) -> Broadcast
    def stop(self) -> None

class RDD:
    def map(self, f: Callable) -> RDD
    def filter(self, f: Callable) -> RDD
    def collect(self) -> List
    def count(self) -> int
    def cache(self) -> RDD
```

[Core Processing](./core.md)


### Structured Data Processing

High-level APIs for working with structured data using DataFrames and Datasets. Built on Spark SQL, which uses the Catalyst optimizer for query optimization and code generation.

```scala { .api }
class SparkSession {
  def read: DataFrameReader
  def sql(sqlText: String): DataFrame
  def table(tableName: String): DataFrame
  def createDataFrame[A <: Product : TypeTag](data: Seq[A]): DataFrame
  def stop(): Unit
}

class Dataset[T] {
  def select(cols: Column*): DataFrame
  def filter(condition: Column): Dataset[T]
  def groupBy(cols: Column*): RelationalGroupedDataset
  def join(right: Dataset[_]): DataFrame
  def write: DataFrameWriter[T]
  def collect(): Array[T]
  def show(numRows: Int = 20): Unit
}

type DataFrame = Dataset[Row]
```

Python API:

```python { .api }
class SparkSession:
    @property
    def read(self) -> DataFrameReader
    def sql(self, sqlQuery: str) -> DataFrame
    def table(self, tableName: str) -> DataFrame
    def createDataFrame(self, data: List, schema: Optional[Union[List, StructType]] = None) -> DataFrame
    def stop(self) -> None

class DataFrame:
    def select(self, *cols: Union[str, Column]) -> DataFrame
    def filter(self, condition: Union[str, Column]) -> DataFrame
    def where(self, condition: Union[str, Column]) -> DataFrame
    def groupBy(self, *cols: Union[str, Column]) -> GroupedData
    def join(self, other: DataFrame, on: Optional[Union[str, List[str], Column]] = None) -> DataFrame
    def collect(self) -> List[Row]
    def show(self, n: int = 20, truncate: bool = True) -> None
```

[SQL and DataFrames](./sql.md)


### Machine Learning

Comprehensive machine learning library with algorithms for classification, regression, clustering, and collaborative filtering. Provides both the DataFrame-based Pipeline API (`spark.ml`) and the older RDD-based API (`spark.mllib`), which has been in maintenance mode since Spark 2.0.

```scala { .api }
class Pipeline extends Estimator[PipelineModel] {
  def setStages(value: Array[PipelineStage]): Pipeline
  def fit(dataset: Dataset[_]): PipelineModel
}

abstract class Estimator[M <: Model[M]] extends PipelineStage {
  def fit(dataset: Dataset[_]): M
}

abstract class Transformer extends PipelineStage {
  def transform(dataset: Dataset[_]): DataFrame
}
```

Python API:

```python { .api }
class Pipeline(Estimator):
    def setStages(self, value: List[PipelineStage]) -> Pipeline
    def fit(self, dataset: DataFrame) -> PipelineModel

class Estimator(PipelineStage):
    def fit(self, dataset: DataFrame) -> Model

class Transformer(PipelineStage):
    def transform(self, dataset: DataFrame) -> DataFrame

class PipelineModel(Model):
    def transform(self, dataset: DataFrame) -> DataFrame
```

[Machine Learning](./ml.md)


### Graph Processing

GraphX provides APIs for graphs and graph-parallel computation with fundamental operators like subgraph, joinVertices, and aggregateMessages, plus optimized variants of graph algorithms. **Note**: GraphX is only available in Scala and Java; there is no Python API.

```scala { .api }
abstract class Graph[VD: ClassTag, ED: ClassTag] {
  val vertices: VertexRDD[VD]
  val edges: EdgeRDD[ED]
  def mapVertices[VD2: ClassTag](map: (VertexId, VD) => VD2): Graph[VD2, ED]
  def aggregateMessages[A: ClassTag](
    sendMsg: EdgeContext[VD, ED, A] => Unit,
    mergeMsg: (A, A) => A
  ): VertexRDD[A]
}

type VertexId = Long
```

**Python Alternative**: Use the GraphFrames library (`pip install graphframes`) for graph processing in Python with Spark DataFrames. GraphFrames also needs its JVM package on the Spark classpath (for example via `--packages`).

[Graph Processing](./graphx.md)
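
To make the `aggregateMessages` contract concrete, the following plain-Python sketch models it on a single machine: a send function emits messages along each edge, and a merge function combines messages that arrive at the same vertex. This is not the GraphX API (which is distributed and JVM-only); `aggregate_messages` and the tuple-based edge format are assumptions made for illustration.

```python
def aggregate_messages(edges, send_msg, merge_msg):
    """Single-machine model of the aggregateMessages idea.

    edges:     iterable of (src_id, dst_id, attr) tuples
    send_msg:  edge -> iterable of (vertex_id, message) pairs
    merge_msg: (message, message) -> message, combining per vertex
    """
    inbox = {}
    for edge in edges:
        for vertex_id, msg in send_msg(edge):
            if vertex_id in inbox:
                inbox[vertex_id] = merge_msg(inbox[vertex_id], msg)
            else:
                inbox[vertex_id] = msg
    # Only vertices that received at least one message appear in the result.
    return inbox


edges = [(1, 2, "follows"), (3, 2, "follows"), (2, 3, "follows")]

# In-degree: every edge sends the value 1 to its destination; messages are summed.
in_degree = aggregate_messages(
    edges,
    send_msg=lambda e: [(e[1], 1)],
    merge_msg=lambda a, b: a + b,
)
```

Here `in_degree` maps vertex 2 to 2 and vertex 3 to 1; vertex 1 has no incoming edges and is absent from the result.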


### Stream Processing

Structured Streaming provides real-time stream processing with end-to-end exactly-once fault-tolerance guarantees, given replayable sources and idempotent sinks. It is built on the Spark SQL engine, so the same DataFrame code serves batch and streaming. The signatures below document the older DStream API (`StreamingContext`), which predates Structured Streaming.

```scala { .api }
class StreamingContext(conf: SparkConf, batchDuration: Duration) {
  def socketTextStream(hostname: String, port: Int): ReceiverInputDStream[String]
  def textFileStream(directory: String): DStream[String]
  def start(): Unit
  def stop(): Unit
  def awaitTermination(): Unit
}

abstract class DStream[T: ClassTag] {
  def map[U: ClassTag](mapFunc: T => U): DStream[U]
  def filter(filterFunc: T => Boolean): DStream[T]
  def window(windowDuration: Duration, slideDuration: Duration): DStream[T]
  def foreachRDD(foreachFunc: RDD[T] => Unit): Unit
}
```

Python API:

```python { .api }
class StreamingContext:
    def __init__(self, sparkContext: SparkContext, batchDuration: float)
    def socketTextStream(self, hostname: str, port: int) -> DStream
    def textFileStream(self, directory: str) -> DStream
    def start(self) -> None
    def stop(self, stopSparkContext: bool = True, stopGracefully: bool = False) -> None
    def awaitTermination(self, timeout: Optional[float] = None) -> None

class DStream:
    def map(self, f: Callable) -> DStream
    def filter(self, f: Callable) -> DStream
    def window(self, windowDuration: float, slideDuration: Optional[float] = None) -> DStream
    def foreachRDD(self, func: Callable[[RDD], None]) -> None
```

For new streaming applications, prefer Structured Streaming via `SparkSession.readStream`; the DStream API is a legacy project that no longer receives updates.

[Stream Processing](./streaming.md)


### Application Management

Programmatic interfaces for launching, monitoring, and managing Spark applications across different cluster managers and deployment modes.

```scala { .api }
class SparkConf(loadDefaults: Boolean = true) {
  def set(key: String, value: String): SparkConf
  def setMaster(master: String): SparkConf
  def setAppName(name: String): SparkConf
  def get(key: String): String
}
```

```java { .api }
public class SparkLauncher {
    public SparkLauncher setAppName(String appName);
    public SparkLauncher setMaster(String master);
    public SparkLauncher setMainClass(String mainClass);
    public Process launch() throws IOException;
    public SparkAppHandle startApplication() throws IOException;
}
```

Python API:

```python { .api }
class SparkConf:
    def __init__(self, loadDefaults: bool = True)
    def set(self, key: str, value: str) -> SparkConf
    def setMaster(self, value: str) -> SparkConf
    def setAppName(self, value: str) -> SparkConf
    def get(self, key: str, defaultValue: Optional[str] = None) -> str
    def getAll(self) -> List[Tuple[str, str]]
```

Note: Python has no direct equivalent of `SparkLauncher`. Submit applications with the `spark-submit` command-line tool, or configure them in code through `SparkSession.builder`.

### Python-Specific Features

Python-specific Spark capabilities including pandas API compatibility, type hints, and Python-optimized operations.

```python { .api }
# Pandas API on Spark (pyspark.pandas)
import pyspark.pandas as ps

class DataFrame:  # pandas-compatible DataFrame
    def head(self, n: int = 5) -> DataFrame
    def describe(self) -> DataFrame
    def groupby(self, by: Union[str, List[str]]) -> GroupBy
    def merge(self, right: DataFrame, on: str = None) -> DataFrame

# SQL Functions
from pyspark.sql import functions as F

# Common functions (473+ available)
def col(colName: str) -> Column
def lit(col: Any) -> Column
def when(condition: Column, value: Any) -> Column
def coalesce(*cols: Column) -> Column
def concat(*cols: Column) -> Column
def regexp_replace(str: Column, pattern: str, replacement: str) -> Column
```

**Key Python Features:**
- **Pandas API**: `pyspark.pandas` provides pandas-compatible DataFrame operations
- **Type Hints**: Full type annotation support for better IDE integration
- **Arrow Integration**: High-performance columnar data transfer between JVM and Python
- **UDF Support**: User-defined functions with vectorized operations using pandas UDFs

[Configuration and Deployment](./deployment.md)