or run

npx @tessl/cli init
Log in

Version

Tile

Overview

Evals

Files

tessl/maven-org-apache-spark--spark-core_2_13

Apache Spark Core provides the foundational execution engine and API for distributed data processing with RDDs, task scheduling, and cluster management.

Workspace
tessl
Visibility
Public
Created
Last updated
Describes
mavenpkg:maven/org.apache.spark/spark-core_2.13@4.0.x

To install, run

npx @tessl/cli install tessl/maven-org-apache-spark--spark-core_2_13@4.0.0

0

# Apache Spark Core

1

2

Apache Spark Core provides the foundational execution engine and API for distributed data processing across clusters. It implements the core distributed computing primitives including RDDs (Resilient Distributed Datasets), task scheduling, memory management, fault tolerance, and the base APIs that power all other Spark components.

3

4

## Package Information

5

6

- **Package Name**: spark-core_2.13

7

- **Package Type**: maven

8

- **Language**: Scala

9

- **Installation**: `<dependency><groupId>org.apache.spark</groupId><artifactId>spark-core_2.13</artifactId><version>4.0.0</version></dependency>`

10

11

## Core Imports

12

13

```scala

14

import org.apache.spark.{SparkContext, SparkConf}

15

import org.apache.spark.rdd.RDD

16

```

17

18

For Java:

19

20

```java

21

import org.apache.spark.SparkContext;

22

import org.apache.spark.SparkConf;

23

import org.apache.spark.api.java.JavaSparkContext;

24

import org.apache.spark.api.java.JavaRDD;

25

```

26

27

## Basic Usage

28

29

```scala

30

import org.apache.spark.{SparkContext, SparkConf}

31

32

// Create configuration

33

val conf = new SparkConf()

34

.setAppName("MySparkApp")

35

.setMaster("local[*]")

36

37

// Create Spark context

38

val sc = new SparkContext(conf)

39

40

// Create RDD from collection

41

val data = sc.parallelize(Array(1, 2, 3, 4, 5))

42

43

// Transform and action

44

val squared = data.map(x => x * x)

45

val result = squared.collect()

46

47

// Cleanup

48

sc.stop()

49

```

50

51

## Architecture

52

53

Spark Core is built around several key components:

54

55

- **SparkContext**: Main entry point and driver program coordinator

56

- **RDD Abstraction**: Immutable distributed datasets with lineage tracking

57

- **Task Scheduler**: Distributes work across cluster nodes and manages execution

58

- **Block Manager**: Handles data storage, caching, and replication across cluster

59

- **Serialization**: Efficient data serialization for network transfer and storage

60

- **Resource Management**: Integration with cluster managers (YARN, Mesos, Kubernetes)

61

62

## Capabilities

63

64

### Application Context

65

66

Core application setup and cluster connection management. SparkContext serves as the primary interface for creating RDDs and configuring distributed execution.

67

68

```scala { .api }

69

class SparkContext(config: SparkConf) {

70

def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T]

71

def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]

72

def stop(): Unit

73

def broadcast[T: ClassTag](value: T): Broadcast[T]

74

def version: String

75

def defaultParallelism: Int

76

}

77

78

class SparkConf(loadDefaults: Boolean = true) {

79

def set(key: String, value: String): SparkConf

80

def setAppName(name: String): SparkConf

81

def setMaster(master: String): SparkConf

82

}

83

```

84

85

[Application Context](./application-context.md)

86

87

### RDD Operations

88

89

Resilient Distributed Dataset API providing transformations and actions for distributed data processing. RDDs support fault-tolerant parallel operations on large datasets.

90

91

```scala { .api }

92

abstract class RDD[T: ClassTag] {

93

def map[U: ClassTag](f: T => U): RDD[U]

94

def filter(f: T => Boolean): RDD[T]

95

def flatMap[U: ClassTag](f: T => IterableOnce[U]): RDD[U]

96

def collect(): Array[T]

97

def count(): Long

98

def reduce(f: (T, T) => T): T

99

def persist(newLevel: StorageLevel): RDD[T]

100

def persist(): RDD[T]

101

def cache(): RDD[T]

102

}

103

```

104

105

[RDD Operations](./rdd-operations.md)

106

107

### Java API

108

109

Java-friendly wrappers for Spark functionality providing type-safe distributed processing for Java applications.

110

111

```java { .api }

112

public class JavaSparkContext {

113

public <T> JavaRDD<T> parallelize(java.util.List<T> list)

114

public <T> JavaRDD<T> parallelize(java.util.List<T> list, int numSlices)

115

public JavaRDD<String> textFile(String path)

116

public void stop()

117

}

118

119

public class JavaRDD<T> {

120

public <R> JavaRDD<R> map(org.apache.spark.api.java.function.Function<T, R> f)

121

public JavaRDD<T> filter(org.apache.spark.api.java.function.Function<T, Boolean> f)

122

public java.util.List<T> collect()

123

}

124

```

125

126

[Java API](./java-api.md)

127

128

### Storage and Persistence

129

130

Data storage levels and caching mechanisms for optimizing repeated access to RDDs with configurable memory and disk usage.

131

132

```scala { .api }

133

class StorageLevel(

134

useDisk: Boolean,

135

useMemory: Boolean,

136

useOffHeap: Boolean,

137

deserialized: Boolean,

138

replication: Int

139

)

140

141

object StorageLevel {

142

val MEMORY_ONLY: StorageLevel

143

val MEMORY_AND_DISK: StorageLevel

144

val DISK_ONLY: StorageLevel

145

}

146

```

147

148

[Storage and Persistence](./storage-persistence.md)

149

150

### Broadcast Variables

151

152

Efficient read-only variable distribution to all cluster nodes for sharing large datasets or lookup tables across tasks.

153

154

```scala { .api }

155

abstract class Broadcast[T: ClassTag] {

156

def value: T

157

def unpersist(blocking: Boolean = false): Unit

158

def destroy(): Unit

159

}

160

```

161

162

[Broadcast Variables](./broadcast-variables.md)

163

164

### Accumulators

165

166

Shared variables for collecting information from distributed tasks, supporting aggregation patterns like counters and sums.

167

168

```scala { .api }

169

abstract class AccumulatorV2[IN, OUT] {

170

def add(v: IN): Unit

171

def value: OUT

172

def reset(): Unit

173

def copy(): AccumulatorV2[IN, OUT]

174

}

175

176

class LongAccumulator extends AccumulatorV2[java.lang.Long, java.lang.Long]

177

class DoubleAccumulator extends AccumulatorV2[java.lang.Double, java.lang.Double]

178

```

179

180

[Accumulators](./accumulators.md)

181

182

### Partitioning

183

184

Data partitioning strategies for controlling how RDD elements are distributed across cluster nodes to optimize performance.

185

186

```scala { .api }

187

abstract class Partitioner extends Serializable {

188

def numPartitions: Int

189

def getPartition(key: Any): Int

190

}

191

192

class HashPartitioner(partitions: Int) extends Partitioner

193

class RangePartitioner[K: Ordering: ClassTag, V](

194

partitions: Int,

195

rdd: RDD[_ <: Product2[K, V]]

196

) extends Partitioner

197

```

198

199

[Partitioning](./partitioning.md)

200

201

### Serialization

202

203

Serialization frameworks for efficient data transfer and storage with support for Java serialization and Kryo.

204

205

```scala { .api }

206

abstract class Serializer {

207

def newInstance(): SerializerInstance

208

}

209

210

abstract class SerializerInstance {

211

def serialize[T: ClassTag](t: T): ByteBuffer

212

def deserialize[T: ClassTag](bytes: ByteBuffer): T

213

}

214

215

class JavaSerializer(conf: SparkConf) extends Serializer

216

class KryoSerializer(conf: SparkConf) extends Serializer

217

```

218

219

[Serialization](./serialization.md)

220

221

## Types

222

223

```scala { .api }

224

trait ClassTag[T]

225

trait Ordering[T]

226

227

case class TaskContext(

228

stageId: Int,

229

stageAttemptNumber: Int,

230

partitionId: Int,

231

taskAttemptId: Long

232

)

233

234

sealed trait TaskEndReason

235

case object Success extends TaskEndReason

236

case class ExceptionFailure(

237

className: String,

238

description: String,

239

stackTrace: Array[StackTraceElement]

240

) extends TaskEndReason

241

```