# Apache Spark

Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R (deprecated), and an optimized engine that supports general computation graphs for data analysis. Spark includes specialized tools for SQL and DataFrames, machine learning (MLlib), graph processing (GraphX), and stream processing.

## Overview

Key features include:

- **Unified Processing**: Single engine for batch, interactive, real-time, and machine learning workloads
- **High Performance**: In-memory computing with an advanced DAG execution engine
- **Ease of Use**: Simple APIs in multiple languages with 80+ high-level operators
- **Scalability**: Runs everywhere from laptops to large clusters with thousands of nodes
- **Advanced Analytics**: Built-in modules for SQL, streaming, machine learning, and graph processing

## Package Information

- **Package Name**: `org.apache.spark:spark-parent_2.13`
- **Package Type**: Maven
- **Language**: Scala/Java
- **Installation**: Add to Maven/SBT dependencies or download distribution
- **Maven Coordinates**: `org.apache.spark:spark-core_2.13:4.0.0`
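
For an sbt build, the coordinates above could be declared roughly as follows. This is a minimal sketch: the project name and the Scala patch version are assumptions, and `spark-sql` is listed only because the DataFrame examples need it.

```scala
// build.sbt (sketch; name and scalaVersion are assumptions, not taken from this page)
name := "spark-example"
scalaVersion := "2.13.16"

libraryDependencies ++= Seq(
  // %% appends the Scala binary version, resolving to spark-core_2.13
  "org.apache.spark" %% "spark-core" % "4.0.0",
  // Only needed for SparkSession, DataFrames, and SQL
  "org.apache.spark" %% "spark-sql" % "4.0.0"
)
```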

## Core Imports

For Scala applications:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD
```

For Java applications:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
```

## Basic Usage

### Core RDD API (Scala)

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Create Spark configuration and context (local[*] uses all local cores)
val conf = new SparkConf().setAppName("MyApp").setMaster("local[*]")
val sc = new SparkContext(conf)

// Create RDD from collection
val data = Array(1, 2, 3, 4, 5)
val rdd = sc.parallelize(data)

// Transform lazily, then collect the results to the driver
val result = rdd.map(_ * 2).filter(_ > 5).collect()
println(result.mkString(", ")) // 6, 8, 10

sc.stop()
```

### SQL and DataFrames (Scala)

```scala
import org.apache.spark.sql.SparkSession

// Create Spark session
val spark = SparkSession.builder()
  .appName("MyApp")
  .master("local[*]")
  .getOrCreate()

// Read data into DataFrame
val df = spark.read.json("path/to/data.json")

// DataFrame operations: select columns, filter rows, print the result
df.select("name", "age")
  .filter(df("age") > 21)
  .show()

// SQL queries against a temporary view of the same data
df.createOrReplaceTempView("people")
val adults = spark.sql("SELECT name FROM people WHERE age >= 18")
adults.show()

spark.stop()
```

## Architecture

Apache Spark follows a driver-executor architecture:

- **Driver Program**: Contains the main function and defines RDDs/DataFrames
- **Cluster Manager**: Allocates resources across applications (Standalone, YARN, Kubernetes)
- **Executors**: Worker processes that run tasks and store data
- **Tasks**: Units of work sent to executors

Key architectural components:

- **Catalyst Optimizer**: Rule-based and cost-based query optimization
- **Tungsten**: Off-heap memory management and code generation
- **Resilient Distributed Datasets (RDDs)**: Fault-tolerant distributed collections
- **DataFrames/Datasets**: Structured data APIs built on RDDs
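
The split between driver-side planning and executor-side execution can be observed from a small program. The sketch below is illustrative only: the application name and the `local[2]` master are assumptions chosen so it runs on a single machine, where the driver and executor share one JVM.

```scala
import org.apache.spark.sql.SparkSession

// Driver side: create the session and define the computation
val spark = SparkSession.builder()
  .appName("ArchitectureSketch") // hypothetical application name
  .master("local[2]")            // local mode with two worker threads
  .getOrCreate()

// Transformations are lazy: this only builds a logical plan on the driver
val evens = spark.range(0, 1000).filter("id % 2 = 0")

// Print the parsed, analyzed, optimized (Catalyst), and physical plans
evens.explain(true)

// An action triggers a job, which is split into tasks that executors run
val n = evens.count()
println(s"even ids: $n")

spark.stop()
```

On a real cluster only the master URL changes; the plan output and the job structure stay the same.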

## Capabilities

### Core Engine

Provides the fundamental distributed computing capabilities with RDDs, transformations, actions, and distributed variables.

```scala { .api }
class SparkContext(config: SparkConf) {
  def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T]
  def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
  def broadcast[T](value: T): Broadcast[T]
  def longAccumulator(): LongAccumulator
  def stop(): Unit
}

abstract class RDD[T: ClassTag] {
  def map[U: ClassTag](f: T => U): RDD[U]
  def filter(f: T => Boolean): RDD[T]
  def collect(): Array[T]
  def count(): Long
  def reduce(f: (T, T) => T): T
  def cache(): RDD[T]
}
```
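
As a rough illustration of how these pieces combine, the following sketch uses a broadcast variable and a named accumulator alongside `parallelize`, `cache`, `reduce`, `map`, and `collect`. The application name, the lookup table, and the input range are made up for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("CoreEngineSketch").setMaster("local[2]")
val sc = new SparkContext(conf)

// Read-only lookup table shipped once to each executor
val labels = sc.broadcast(Map(0 -> "even", 1 -> "odd"))

// Counter that tasks add to and the driver reads
val seen = sc.longAccumulator("seen")

// Cache the RDD because two actions below reuse it
val numbers = sc.parallelize(1 to 10, numSlices = 2).cache()

// Action: sum all elements
val total = numbers.reduce(_ + _)

// Transformation that reads the broadcast value, followed by an action.
// Accumulator updates made inside transformations can be applied more than
// once if a task is re-executed, so treat the count as a diagnostic only.
val labelled = numbers.map { n =>
  seen.add(1)
  (n, labels.value(n % 2))
}.collect()

println(s"total=$total, seen=${seen.value}, first=${labelled.head}")

sc.stop()
```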

[Core Engine](./core-engine.md)

### SQL and DataFrames

Structured data processing with SQL queries, DataFrames, and type-safe Datasets. Includes data source connectors and streaming capabilities.

```scala { .api }
object SparkSession {
  def builder(): Builder
}

class SparkSession {
  def read: DataFrameReader
  def sql(sqlText: String): DataFrame
  def createDataFrame[A <: Product : TypeTag](data: Seq[A]): DataFrame
  def stop(): Unit
}

abstract class Dataset[T] {
  def select(cols: Column*): DataFrame
  def filter(condition: Column): Dataset[T]
  def groupBy(cols: Column*): RelationalGroupedDataset
  def join(right: Dataset[_]): DataFrame
  def collect(): Array[T]
  def show(numRows: Int = 20): Unit
}
```
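
A short sketch of these methods working together is shown below. The `Person` and `Dept` case classes and their data are hypothetical and exist only for this example; they are declared at the top level so that Spark can derive their schemas.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col}

// Hypothetical record types used only for this sketch
case class Person(name: String, age: Int, deptId: Int)
case class Dept(deptId: Int, deptName: String)

val spark = SparkSession.builder()
  .appName("DataFrameSketch")
  .master("local[2]")
  .getOrCreate()

val people = spark.createDataFrame(Seq(
  Person("Ann", 34, 1), Person("Bob", 19, 1), Person("Cid", 45, 2)
))
val depts = spark.createDataFrame(Seq(Dept(1, "Eng"), Dept(2, "Ops")))

// Filter, join on the shared column, then aggregate per department
val avgAge = people
  .filter(col("age") > 21)
  .join(depts, "deptId")
  .groupBy(col("deptName"))
  .agg(avg(col("age")).alias("avg_age"))

avgAge.show()

// Bring the small result back to the driver as Row objects
val rows = avgAge.collect()

spark.stop()
```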

[SQL and DataFrames](./sql-dataframes.md)

### Machine Learning

Scalable machine learning algorithms and utilities, including both RDD-based (MLlib) and DataFrame-based (ML) APIs.

```scala { .api }
// DataFrame-based ML Pipeline API
abstract class Estimator[M <: Model[M]] extends PipelineStage {
  def fit(dataset: Dataset[_]): M
}

abstract class Transformer extends PipelineStage {
  def transform(dataset: Dataset[_]): DataFrame
}

// Stages are supplied through setStages rather than the constructor
class Pipeline extends Estimator[PipelineModel] {
  def setStages(value: Array[_ <: PipelineStage]): this.type
  def fit(dataset: Dataset[_]): PipelineModel
}
```
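
The Estimator/Transformer contract is easiest to see in a small pipeline. The sketch below assumes the `spark-mllib` artifact is on the classpath and uses a tiny made-up training set; the stage choices (`Tokenizer`, `HashingTF`, `LogisticRegression`) are illustrative, not a recommendation.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("PipelineSketch")
  .master("local[2]")
  .getOrCreate()

// Tiny made-up training set: (id, text, label)
val training = spark.createDataFrame(Seq(
  (0L, "spark is fast", 1.0),
  (1L, "slow batch job", 0.0),
  (2L, "spark streaming", 1.0),
  (3L, "legacy system", 0.0)
)).toDF("id", "text", "label")

// Two Transformers followed by an Estimator
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// fit() runs the stages in order and returns a PipelineModel, itself a Transformer
val model = pipeline.fit(training)

// transform() appends prediction columns to the input rows
val predicted = model.transform(training).select("text", "prediction").collect()
predicted.foreach(println)

spark.stop()
```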

[Machine Learning](./machine-learning.md)

### Graph Processing

Large-scale graph processing with GraphX, including graph algorithms and graph-parallel computations.

```scala { .api }
abstract class Graph[VD: ClassTag, ED: ClassTag] {
  def vertices: VertexRDD[VD]
  def edges: EdgeRDD[ED]
  def mapVertices[VD2: ClassTag](map: (VertexId, VD) => VD2): Graph[VD2, ED]
  def subgraph(epred: EdgeTriplet[VD, ED] => Boolean = _ => true,
               vpred: (VertexId, VD) => Boolean = (_, _) => true): Graph[VD, ED]
}

object GraphLoader {
  def edgeListFile(sc: SparkContext, path: String): Graph[Int, Int]
}
```
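
The following sketch builds a three-vertex graph in memory instead of loading an edge list file, then applies `mapVertices` and `subgraph`. It assumes the `spark-graphx` artifact is available; the vertex names and edge weights are made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

val sc = new SparkContext(
  new SparkConf().setAppName("GraphSketch").setMaster("local[2]"))

// Made-up vertices (id, name) and edges (src, dst, weight)
val vertices: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
val edges: RDD[Edge[Int]] =
  sc.parallelize(Seq(Edge(1L, 2L, 5), Edge(2L, 3L, 1), Edge(3L, 1L, 7)))

val graph = Graph(vertices, edges)

// Rewrite vertex attributes without touching the graph structure
val upper = graph.mapVertices((_, name) => name.toUpperCase)

// Keep only edges whose weight exceeds 2; all vertices are retained
val heavy = upper.subgraph(epred = t => t.attr > 2)

val kept = heavy.edges.collect().map(e => (e.srcId, e.dstId)).sorted
val vertexCount = heavy.vertices.count()
println(kept.mkString(", "))

sc.stop()
```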

[Graph Processing](./graph-processing.md)

### Stream Processing

Real-time data processing with both legacy DStreams and modern Structured Streaming APIs.

```scala { .api }
class DataStreamReader {
  def format(source: String): DataStreamReader
  def option(key: String, value: String): DataStreamReader
  def load(): DataFrame
}

// A StreamingQuery is the handle returned by DataStreamWriter.start();
// it has no start() method of its own
trait StreamingQuery {
  def stop(): Unit
  def awaitTermination(): Unit
  def isActive: Boolean
}
```
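
A Structured Streaming query can be tried without any external system by using the built-in `rate` source and the `memory` sink. The sketch below is a debugging-style example under those assumptions; the query name `rate_demo` and the three-second run time are arbitrary choices.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("StreamingSketch")
  .master("local[2]")
  .getOrCreate()

// The "rate" source emits (timestamp, value) rows and is meant for testing
val stream = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "5")
  .load()

// start() is called on the DataStreamWriter and returns the StreamingQuery
val query = stream.writeStream
  .format("memory")       // in-memory table sink, intended for debugging
  .queryName("rate_demo") // hypothetical name of the in-memory table
  .outputMode("append")
  .start()

// Let the query run for about three seconds, then inspect and stop it
query.awaitTermination(3000)
val wasActive = query.isActive
val received = spark.sql("SELECT value FROM rate_demo").count()
query.stop()

println(s"active before stop: $wasActive, rows received: $received")

spark.stop()
```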

[Stream Processing](./stream-processing.md)

## Types

### Core Types

```scala { .api }
class SparkConf(loadDefaults: Boolean = true) {
  def set(key: String, value: String): SparkConf
  def setAppName(name: String): SparkConf
  def setMaster(master: String): SparkConf
  def get(key: String): String
}

abstract class Broadcast[T] extends Serializable {
  def value: T
  def unpersist(): Unit
  def destroy(): Unit
}

abstract class AccumulatorV2[IN, OUT] extends Serializable {
  def add(v: IN): Unit
  def value: OUT
  def reset(): Unit
}

object StorageLevel {
  val MEMORY_ONLY: StorageLevel
  val MEMORY_AND_DISK: StorageLevel
  val DISK_ONLY: StorageLevel
}
```
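
To show `SparkConf` and `StorageLevel` in use, here is a minimal sketch. The application name and the choice to disable the web UI are assumptions made only to keep the example small.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("CoreTypesSketch")
  .setMaster("local[2]")
  .set("spark.ui.enabled", "false") // arbitrary key/value settings go through set()

// setAppName stores the name under spark.app.name;
// get() throws NoSuchElementException when a key is missing
val appName = conf.get("spark.app.name")

val sc = new SparkContext(conf)

// persist() takes an explicit level; cache() on an RDD is shorthand for MEMORY_ONLY
val squares = sc.parallelize(1 to 100)
  .map(n => n * n)
  .persist(StorageLevel.MEMORY_AND_DISK)

val count = squares.count() // first action computes and stores the partitions
val maxSq = squares.max()   // second action reuses the persisted data

// Release the stored partitions once they are no longer needed
squares.unpersist()
sc.stop()
```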

### SQL Types

```scala { .api }
trait Row extends Serializable {
  def get(i: Int): Any
  def getString(i: Int): String
  def getInt(i: Int): Int
  def getDouble(i: Int): Double
  def isNullAt(i: Int): Boolean
  def size: Int
}

class Column(expr: Expression) {
  def ===(other: Any): Column
  def =!=(other: Any): Column // replaces the deprecated !== operator
  def >(other: Any): Column
  def <(other: Any): Column
  def isNull: Column
  def cast(to: DataType): Column
  def alias(alias: String): Column
}

class StructType(fields: Array[StructField]) extends DataType {
  def add(field: StructField): StructType
  def add(name: String, dataType: DataType): StructType
  def fieldNames: Array[String]
}
```
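
The sketch below ties `StructType`, `Row`, and `Column` together by building a DataFrame from explicit rows and a schema. The field names and data are made up; `isNotNull` and the string overload of `cast` are part of the `Column` API even though they are not listed above.

```scala
import java.util.Arrays
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val spark = SparkSession.builder()
  .appName("SqlTypesSketch")
  .master("local[2]")
  .getOrCreate()

// Build a schema field by field; each add() returns a new StructType
val schema = new StructType()
  .add("name", StringType)
  .add("age", IntegerType)

// Made-up rows matching the schema; the second one has a null age
val data = Arrays.asList(Row("Ann", 34), Row("Bob", null), Row("Cid", 45))
val df = spark.createDataFrame(data, schema)

// Column expressions: null check, comparison, cast, and alias
val older = df
  .filter(col("age").isNotNull && col("age") > 40)
  .select(col("name"), col("age").cast("double").alias("age_d"))

// Typed getters read a Row by position
val first: Row = older.collect().head
println(s"${first.getString(0)} -> ${first.getDouble(1)}, size=${first.size}")

spark.stop()
```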

### Graph Types

```scala { .api }
type VertexId = Long

case class Edge[ED](srcId: VertexId, dstId: VertexId, attr: ED)

class EdgeTriplet[VD, ED] extends Edge[ED] {
  def srcAttr: VD
  def dstAttr: VD
}

abstract class VertexRDD[VD] extends RDD[(VertexId, VD)]
abstract class EdgeRDD[ED] extends RDD[Edge[ED]]
```
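
An `EdgeTriplet` is what `Graph.triplets` yields: an edge together with the attributes of both endpoints. The sketch below assumes the `spark-graphx` artifact and a made-up two-user graph.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

val sc = new SparkContext(
  new SparkConf().setAppName("TripletSketch").setMaster("local[2]"))

// Made-up graph: vertex attribute is a user name, edge attribute a label
val users = sc.parallelize(Seq((1L, "ann"), (2L, "bob")))
val follows = sc.parallelize(Seq(Edge(1L, 2L, "follows")))
val graph = Graph(users, follows)

// Each triplet exposes srcAttr and dstAttr next to the edge's own attr
val sentences = graph.triplets
  .map(t => s"${t.srcAttr} ${t.attr} ${t.dstAttr}")
  .collect()

sentences.foreach(println) // prints "ann follows bob"

sc.stop()
```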