
To install, run:

```bash
npx @tessl/cli install tessl/maven-org-apache-spark--spark-parent_2.13@4.0.0
```

# Apache Spark
Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.

## Package Information

- **Package Name**: org.apache.spark:spark-parent_2.13
- **Package Type**: Maven
- **Language**: Scala (with Java, Python, and R APIs)
- **Installation**: See [Building Spark](#building-spark)
- **License**: Apache-2.0

### Maven Dependency

For Spark SQL (most common):

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.13</artifactId>
  <version>4.0.0</version>
</dependency>
```

For Spark Core:

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.13</artifactId>
  <version>4.0.0</version>
</dependency>
```

## Core Imports

```scala { .api }
// Main entry points
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

// Core data structures
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset}

// Configuration
import org.apache.spark.SparkConf

// Common types and column functions
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
```

## Basic Usage

### Creating a Spark Session (Recommended)

```scala { .api }
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MySparkApp")
  .master("local[*]") // Use all available cores locally
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

// Use the Spark session for SQL operations
val df = spark.read.json("path/to/file.json")
df.show()

// Access the SparkContext for RDD operations
val sc = spark.sparkContext
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))
println(rdd.map(_ * 2).collect().mkString(", "))

spark.stop()
```

### Working with DataFrames

```scala { .api }
import org.apache.spark.sql.functions._

// Read data
val df = spark.read
  .option("header", "true")
  .csv("path/to/data.csv")

// Transform data
val result = df
  .filter(col("age") > 21)
  .groupBy("department")
  .agg(avg("salary").as("avg_salary"))
  .orderBy(desc("avg_salary"))

result.show()
```

### Working with RDDs (Low-level API)

```scala { .api }
val sc = spark.sparkContext

// Create an RDD from a local collection
val numbers = sc.parallelize(1 to 1000)

// Chain transformations, then trigger computation with an action
val result = numbers
  .filter(_ % 2 == 0)
  .map(_ * 2)
  .reduce(_ + _)

println(s"Result: $result")
```

## Architecture

Apache Spark is built around several key components:

### Core Engine

- **SparkContext**: The main entry point for Spark functionality and the connection to the cluster
- **RDD (Resilient Distributed Dataset)**: The fundamental data abstraction; immutable, partitioned collections distributed across the cluster
- **DAG Scheduler**: Converts logical execution plans into physical execution plans
- **Task Scheduler**: Schedules and executes tasks across the cluster

### High-Level APIs

- **SparkSession**: Unified entry point for the DataFrame and Dataset APIs (recommended)
- **DataFrame/Dataset**: Higher-level abstractions built on RDDs that carry schema information
- **Catalyst Optimizer**: Query optimization engine for SQL and DataFrame operations

### Libraries and Extensions

- **Spark SQL**: Module for working with structured data using SQL or the DataFrame API
- **MLlib**: Machine learning library with algorithms and utilities
- **GraphX**: Graph processing framework for graph-based analytics
- **Structured Streaming**: Real-time stream processing with the DataFrame API

### Deployment and Storage

- **Cluster Managers**: Support for YARN, Kubernetes, Mesos, and standalone mode
- **Storage Systems**: Integration with HDFS, S3, databases, and a variety of file formats
- **Serialization**: Efficient data serialization for network transfer and storage
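
One way to see the Catalyst optimizer at work is `Dataset.explain(true)`, which prints the parsed, analyzed, and optimized logical plans plus the physical plan for a query. A minimal sketch, assuming an active `SparkSession` named `spark`:

```scala
import org.apache.spark.sql.functions.col

// Build a small query; nothing executes until an action is called
val df = spark.range(1000).toDF("id")
val query = df
  .filter(col("id") % 2 === 0)
  .select((col("id") * 10).as("x"))

// Print each stage of the Catalyst planning pipeline
query.explain(true)
```

An equivalent query written with `spark.sql` goes through the same Catalyst pipeline, so its optimized plan comes out the same.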

## Building Spark

Spark is built using Apache Maven:

```bash
./build/mvn -DskipTests clean package
```

For development with specific profiles:

```bash
./build/mvn -Pyarn -Phadoop-3.3 -Pscala-2.13 -DskipTests clean package
```

## Capabilities

### [Core Engine](./core.md)

Low-level distributed computing with RDDs, SparkContext, and cluster management. Provides the foundation for all other Spark components.

```scala { .api }
class SparkContext(config: SparkConf) extends Logging {
  def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T]
  def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
  def stop(): Unit
}

abstract class RDD[T: ClassTag] extends Serializable with Logging {
  def map[U: ClassTag](f: T => U): RDD[U]
  def filter(f: T => Boolean): RDD[T]
  def collect(): Array[T]
  def count(): Long
}
```

[Core Engine Documentation](./core.md)

### [SQL and DataFrames](./sql.md)

High-level APIs for working with structured data using SQL, DataFrames, and Datasets, backed by the Catalyst optimizer.

```scala { .api }
class SparkSession extends Serializable with Closeable {
  def read: DataFrameReader
  def sql(sqlText: String): DataFrame
  def createDataFrame[A <: Product : TypeTag](data: Seq[A]): DataFrame
  def stop(): Unit
}

abstract class Dataset[T] extends Serializable {
  def select(cols: Column*): DataFrame
  def filter(condition: Column): Dataset[T]
  def groupBy(cols: Column*): RelationalGroupedDataset
  def show(numRows: Int = 20): Unit
}
```

[SQL and DataFrames Documentation](./sql.md)
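
A short usage sketch tying these entry points together (the column names and temp view are illustrative; assumes a local run):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SqlDemo").master("local[*]").getOrCreate()
import spark.implicits._

// Build a small DataFrame from a local collection
val people = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")
people.createOrReplaceTempView("people")

// The same data queried through SQL and through the Dataset API
spark.sql("SELECT name FROM people WHERE age > 40").show()
people.filter($"age" > 40).select($"name").show()

spark.stop()
```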

### [Machine Learning](./mllib.md)

Comprehensive machine learning library with algorithms for classification, regression, clustering, and collaborative filtering.

```scala { .api }
// Example ML pipeline components
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
```

[Machine Learning Documentation](./mllib.md)

### [Structured Streaming](./streaming.md)

Scalable and fault-tolerant stream processing engine built on the Spark SQL engine with micro-batch and continuous processing modes.

```scala { .api }
val spark = SparkSession.builder().appName("StructuredNetworkWordCount").getOrCreate()
import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// The "complete" output mode requires an aggregation, so count words
val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
```

[Structured Streaming Documentation](./streaming.md)

### [Graph Processing](./graphx.md)

Graph-parallel computation framework for ETL, exploratory data analysis, and iterative graph computation.

```scala { .api }
import org.apache.spark.graphx._

val vertices: RDD[(VertexId, String)] = sc.parallelize(Array(
  (1L, "Alice"), (2L, "Bob"), (3L, "Charlie")
))

val edges: RDD[Edge[String]] = sc.parallelize(Array(
  Edge(1L, 2L, "friend"), Edge(2L, 3L, "follow")
))

val graph = Graph(vertices, edges)
```

[Graph Processing Documentation](./graphx.md)

### [Core Utilities](./exceptions.md)

Essential infrastructure components including exception handling, logging, configuration management, and storage utilities that support all Spark components.

**Key Utility Areas:**

- [Exception Handling](./exceptions.md) - Structured error management with error classes
- [Storage Configuration](./storage.md) - RDD storage level management
- [Logging Framework](./logging.md) - Structured logging with MDC support
- [Thread Utilities](./utils.md) - Thread-safe utilities and lexical scoping
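
As one example, RDD persistence from the storage utilities is driven by `StorageLevel`. A sketch, assuming an active `SparkSession` named `spark`:

```scala
import org.apache.spark.storage.StorageLevel

val sc = spark.sparkContext
val data = sc.parallelize(1 to 1000000).map(_ * 2)

// Keep partitions in memory, spilling to disk under memory pressure
data.persist(StorageLevel.MEMORY_AND_DISK)

val total = data.count()  // first action computes and caches the RDD
val again = data.count()  // later actions read from the cache
data.unpersist()
```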

## Interactive Usage

### Scala Shell

```bash
./bin/spark-shell
```

Example:

```scala
scala> spark.range(1000 * 1000 * 1000).count()
res0: Long = 1000000000
```

### Python Shell

```bash
./bin/pyspark
```

Example:

```python
>>> spark.range(1000 * 1000 * 1000).count()
1000000000
```

### SQL Shell

```bash
./bin/spark-sql
```

## Configuration

Key configuration options:

```scala { .api }
val conf = new SparkConf()
  .setAppName("MyApp")
  .setMaster("local[4]") // Use 4 cores locally
  .set("spark.sql.adaptive.enabled", "true")
  .set("spark.sql.adaptive.coalescePartitions.enabled", "true")
  .set("spark.executor.memory", "2g")
  .set("spark.driver.memory", "1g")
```

## Deployment

Spark supports multiple deployment modes:

- **Local**: `local`, `local[4]`, `local[*]`
- **Standalone**: `spark://host:port`
- **YARN**: `yarn`
- **Kubernetes**: `k8s://https://kubernetes-api-url`
- **Mesos**: `mesos://host:port`
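
The application code is identical across these modes; only the master URL changes, and in practice it is usually supplied externally via `spark-submit --master` rather than hard-coded. A sketch of that pattern (`SPARK_MASTER` is a hypothetical environment variable used here for illustration):

```scala
import org.apache.spark.sql.SparkSession

// Fall back to local mode when no cluster master is configured
val master = sys.env.getOrElse("SPARK_MASTER", "local[*]")

val spark = SparkSession.builder()
  .appName("DeploymentDemo")
  .master(master) // e.g. "yarn", "spark://host:7077", "k8s://https://..."
  .getOrCreate()
```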

---

**Core Documentation:**

- [Core Engine](./core.md) - SparkContext, RDDs, and distributed computing
- [SQL and DataFrames](./sql.md) - SparkSession, DataFrame, and Dataset APIs
- [Machine Learning](./mllib.md) - MLlib algorithms and pipelines
- [Structured Streaming](./streaming.md) - Real-time stream processing
- [Graph Processing](./graphx.md) - GraphX graph-parallel computation

**Utilities Documentation:**

- [Exception Handling](./exceptions.md) - Error management and validation
- [Storage Configuration](./storage.md) - RDD persistence and storage levels
- [Logging Framework](./logging.md) - Structured logging infrastructure
- [Thread Utilities](./utils.md) - Thread-safe utilities and scoping