or run

npx @tessl/cli init
Log in

Version

Tile

Overview

Evals

Files

tessl/maven-org-apache-spark--spark-parent-212

Apache Spark is a unified analytics engine for large-scale data processing

Workspace
tessl
Visibility
Public
Created
Last updated
Describes
mavenpkg:maven/org.apache.spark/spark-parent_2.12@3.5.x

To install, run

npx @tessl/cli install tessl/maven-org-apache-spark--spark-parent-212@3.5.0

0

# Apache Spark

1

2

Apache Spark is a unified analytics engine for large-scale data processing that provides high-level APIs in Scala, Java, Python, and R, along with an optimized engine supporting general computation graphs. Spark includes multiple specialized components for SQL and DataFrames processing, machine learning, graph processing, and real-time stream processing.

3

4

## Package Information

5

6

- **Package Name**: org.apache.spark:spark-parent_2.13

7

- **Package Type**: maven

8

- **Language**: Scala (with Java, Python, R bindings)

9

- **License**: Apache-2.0

10

- **Installation**: Add Maven dependency or download distribution from https://spark.apache.org/downloads.html

11

12

## Core Imports

13

14

For Scala applications:

15

16

```scala

17

import org.apache.spark.SparkContext

18

import org.apache.spark.SparkConf

19

import org.apache.spark.sql.SparkSession

20

import org.apache.spark.rdd.RDD

21

```

22

23

For Java applications:

24

25

```java

26

import org.apache.spark.SparkConf;

27

import org.apache.spark.api.java.JavaSparkContext;

28

import org.apache.spark.sql.SparkSession;

29

import org.apache.spark.sql.Dataset;

30

import org.apache.spark.sql.Row;

31

```

32

33

Maven dependency:

34

35

```xml

36

<dependency>

37

<groupId>org.apache.spark</groupId>

38

<artifactId>spark-core_2.13</artifactId>

39

<version>3.5.6</version>

40

</dependency>

41

<dependency>

42

<groupId>org.apache.spark</groupId>

43

<artifactId>spark-sql_2.13</artifactId>

44

<version>3.5.6</version>

45

</dependency>

46

```

47

48

## Basic Usage

49

50

```scala

51

import org.apache.spark.sql.SparkSession

52

53

// Create SparkSession (entry point for DataFrame and SQL APIs)

54

val spark = SparkSession.builder()

55

.appName("MySparkApp")

56

.master("local[*]")

57

.getOrCreate()

58

59

// Create DataFrame from data

60

val df = spark.createDataFrame(Seq(

61

("Alice", 25),

62

("Bob", 30),

63

("Charlie", 35)

64

)).toDF("name", "age")

65

66

// Run SQL queries

67

df.createOrReplaceTempView("people")

68

val adults = spark.sql("SELECT * FROM people WHERE age >= 30")

69

adults.show()

70

71

// DataFrame transformations

72

val result = df.filter($"age" > 25)

73

.select($"name", $"age")

74

.orderBy($"age".desc)

75

76

result.collect()

77

78

spark.stop()

79

```

80

81

## Architecture

82

83

Apache Spark consists of several key components:

84

85

- **Spark Core**: The foundation providing basic I/O functionalities, RDDs, and task scheduling

86

- **Spark SQL**: Module for working with structured data using DataFrames and SQL

87

- **MLlib**: Machine learning library with algorithms and utilities

88

- **GraphX**: Graph processing framework for graph-parallel computation

89

- **Structured Streaming**: Scalable and fault-tolerant stream processing

90

- **Spark Streaming**: Legacy stream processing API (DStreams)

91

92

## Capabilities

93

94

### Core Engine (RDDs and SparkContext)

95

96

Fundamental distributed computing capabilities with Resilient Distributed Datasets (RDDs).

97

98

```scala { .api }

99

class SparkContext(config: SparkConf) {

100

def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T]

101

def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]

102

def stop(): Unit

103

}

104

105

abstract class RDD[T] {

106

def map[U](f: T => U): RDD[U]

107

def filter(f: T => Boolean): RDD[T]

108

def collect(): Array[T]

109

def count(): Long

110

def cache(): this.type

111

}

112

```

113

114

[Core Engine APIs](./core-engine.md)

115

116

### Structured Data Processing (SQL and DataFrames)

117

118

High-level APIs for working with structured data, including DataFrames, Datasets, and SQL.

119

120

```scala { .api }

121

class SparkSession {

122

def sql(sqlText: String): DataFrame

123

def read: DataFrameReader

124

def createDataFrame[A <: Product](rdd: RDD[A]): DataFrame

125

}

126

127

class Dataset[T] {

128

def select(cols: Column*): DataFrame

129

def filter(condition: Column): Dataset[T]

130

def groupBy(cols: Column*): RelationalGroupedDataset

131

def join(right: Dataset[_], joinExprs: Column): DataFrame

132

def show(): Unit

133

def collect(): Array[T]

134

}

135

```

136

137

[SQL and DataFrames](./sql-dataframes.md)

138

139

### Machine Learning

140

141

Comprehensive machine learning library with algorithms, feature engineering, and model evaluation.

142

143

```scala { .api }

144

class Pipeline extends Estimator[PipelineModel] {

145

def setStages(value: Array[PipelineStage]): Pipeline

146

def fit(dataset: Dataset[_]): PipelineModel

147

}

148

149

abstract class Estimator[M <: Model[M]] {

150

def fit(dataset: Dataset[_]): M

151

}

152

153

abstract class Transformer {

154

def transform(dataset: Dataset[_]): DataFrame

155

}

156

```

157

158

[Machine Learning](./machine-learning.md)

159

160

### Graph Processing

161

162

Graph-parallel computation framework for processing property graphs.

163

164

```scala { .api }

165

abstract class Graph[VD, ED] {

166

def vertices: VertexRDD[VD]

167

def edges: EdgeRDD[ED]

168

def mapVertices[VD2](map: (VertexId, VD) => VD2): Graph[VD2, ED]

169

def aggregateMessages[A](sendMsg: EdgeContext[VD, ED, A] => Unit,

170

mergeMsg: (A, A) => A): VertexRDD[A]

171

def pageRank(tol: Double, resetProb: Double = 0.15): Graph[Double, Double]

172

}

173

```

174

175

[Graph Processing](./graph-processing.md)

176

177

### Stream Processing

178

179

Real-time stream processing capabilities for continuous data processing.

180

181

```scala { .api }

182

class StreamingContext(sparkContext: SparkContext, batchDuration: Duration) {

183

def socketTextStream(hostname: String, port: Int): ReceiverInputDStream[String]

184

def textFileStream(directory: String): DStream[String]

185

def start(): Unit

186

def awaitTermination(): Unit

187

}

188

189

abstract class DStream[T] {

190

def map[U](mapFunc: T => U): DStream[U]

191

def filter(filterFunc: T => Boolean): DStream[T]

192

def window(windowDuration: Duration, slideDuration: Duration): DStream[T]

193

def print(): Unit

194

}

195

```

196

197

[Stream Processing](./stream-processing.md)

198

199

### Application Launcher

200

201

Programmatic launching of Spark applications with monitoring capabilities.

202

203

```java { .api }

204

public class SparkLauncher {

205

public SparkLauncher setAppName(String appName);

206

public SparkLauncher setMaster(String master);

207

public SparkLauncher setMainClass(String mainClass);

208

public SparkAppHandle startApplication();

209

}

210

211

public interface SparkAppHandle {

212

State getState();

213

String getAppId();

214

void kill();

215

}

216

```

217

218

[Application Launcher](./application-launcher.md)

219

220

## Types

221

222

### Core Types

223

224

```scala { .api }

225

class SparkConf {

226

def set(key: String, value: String): SparkConf

227

def setMaster(master: String): SparkConf

228

def setAppName(name: String): SparkConf

229

}

230

231

class Broadcast[T] {

232

def value: T

233

def destroy(): Unit

234

}

235

236

object StorageLevel {

237

val MEMORY_ONLY: StorageLevel

238

val MEMORY_AND_DISK: StorageLevel

239

val MEMORY_ONLY_SER: StorageLevel

240

val DISK_ONLY: StorageLevel

241

}

242

```

243

244

### SQL Types

245

246

```scala { .api }

247

import org.apache.spark.sql.types._

248

249

case class StructType(fields: Array[StructField])

250

case class StructField(name: String, dataType: DataType, nullable: Boolean = true)

251

252

abstract class DataType

253

object DataTypes {

254

val StringType: DataType

255

val IntegerType: DataType

256

val DoubleType: DataType

257

val BooleanType: DataType

258

val TimestampType: DataType

259

}

260

261

trait Row {

262

def getString(i: Int): String

263

def getInt(i: Int): Int

264

def getDouble(i: Int): Double

265

def getBoolean(i: Int): Boolean

266

}

267

268

class Column {

269

def ===(other: Any): Column

270

def &&(other: Column): Column

271

def ||(other: Column): Column

272

def isNull: Column

273

def isNotNull: Column

274

}

275

```

276

277

### GraphX Types

278

279

```scala { .api }

280

type VertexId = Long

281

282

case class Edge[ED](srcId: VertexId, dstId: VertexId, attr: ED)

283

284

class EdgeTriplet[VD, ED] extends Edge[ED] {

285

def srcAttr: VD

286

def dstAttr: VD

287

}

288

289

abstract class VertexRDD[VD] extends RDD[(VertexId, VD)]

290

abstract class EdgeRDD[ED] extends RDD[Edge[ED]]

291

```