tessl/maven-org-apache-spark--spark-core_2-13

Apache Spark Core provides the fundamental distributed computing infrastructure for large-scale data processing with RDDs, SparkContext, and cluster management

- **Workspace**: tessl
- **Visibility**: Public
- **Describes**: mavenpkg:maven/org.apache.spark/spark-core_2.13@3.5.x

To install, run

npx @tessl/cli install tessl/maven-org-apache-spark--spark-core_2-13@3.5.0

# Apache Spark Core

Apache Spark Core provides the fundamental distributed computing infrastructure for large-scale data processing. It implements the core Spark execution model, with Resilient Distributed Datasets (RDDs) as the primary abstraction for fault-tolerant distributed collections, SparkContext as the main entry point, and DAG and task schedulers that run computation across the cluster.

## Package Information

- **Package Name**: org.apache.spark:spark-core_2.13
- **Package Type**: Maven
- **Language**: Scala (with Java/Python/R APIs)
- **Version**: 3.5.6
- **Installation**: Add to Maven dependencies or SBT build

Maven:
```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.13</artifactId>
  <version>3.5.6</version>
</dependency>
```

SBT:
```scala
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.6"
```

## Core Imports

```scala
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.rdd.RDD
```

For broadcast variables and accumulators:
```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.util.AccumulatorV2
```

## Basic Usage

```scala
import org.apache.spark.{SparkContext, SparkConf}

// Create Spark configuration
val conf = new SparkConf()
  .setAppName("MySparkApp")
  .setMaster("local[*]")

// Initialize SparkContext
val sc = new SparkContext(conf)

// Create RDD from collection
val data = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Transform and collect results: square each element, keep values above 4
val squares = data.map(x => x * x).filter(_ > 4)
val result = squares.collect() // Array(9, 16, 25)

// Clean up
sc.stop()
```

## Architecture

Spark Core implements a distributed computing framework with several key components:

- **SparkContext**: The main entry point that coordinates cluster resources and manages the application lifecycle
- **RDD (Resilient Distributed Dataset)**: Immutable, fault-tolerant distributed collections that form the core abstraction
- **DAG Scheduler**: Splits each job's graph of RDD operations into stages at shuffle boundaries
- **Task Scheduler**: Distributes the tasks of each stage across cluster nodes and retries failed tasks
- **Cluster Manager**: Interfaces with YARN, Mesos, Kubernetes, or standalone cluster managers
- **Storage System**: Manages memory and disk storage for cached RDDs and shuffle data

Because each RDD records the lineage of transformations that produced it, Spark can recompute lost partitions after a failure, and the schedulers place tasks close to their data where possible.
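
Lineage and stage boundaries can be inspected from user code. The sketch below is illustrative only: the object name `LineageSketch`, the sample data, and the `local[2]` master are assumptions, not part of the package. It relies on `RDD.toDebugString`, which prints the chain of parent RDDs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, used only for this illustration.
object LineageSketch {
  // Returns the lineage description and the number of distinct keys.
  def run(): (String, Long) = {
    val conf = new SparkConf().setAppName("LineageSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val words = sc.parallelize(Seq("a", "b", "a", "c"), numSlices = 2)
      // map is a narrow transformation; reduceByKey requires a shuffle,
      // so the DAG scheduler starts a new stage at that point.
      val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
      // Nothing has executed yet: toDebugString only describes the lineage.
      val lineage = counts.toDebugString
      // count() is an action and triggers the actual job.
      (lineage, counts.count())
    } finally {
      sc.stop()
    }
  }
}
```

Printing the returned lineage string shows a `ShuffledRDD` sitting on top of the mapped and parallelized parents, which is the information Spark uses to recompute lost partitions.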

## Capabilities

### SparkContext and Configuration

The main entry point for Spark applications, providing methods to create RDDs, manage configuration, and control cluster resources.

```scala { .api }
class SparkContext(config: SparkConf)
class SparkConf(loadDefaults: Boolean = true)

// Key SparkContext methods
def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T]
def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
def broadcast[T: ClassTag](value: T): Broadcast[T]
def longAccumulator(name: String): LongAccumulator
def stop(): Unit
```
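
As a minimal sketch of how these pieces fit together, the following builds a `SparkConf`, starts a local `SparkContext`, and checks how `numSlices` controls partitioning. The object name `ContextSketch`, the application name, and the choice to disable the web UI are assumptions made for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, used only for this illustration.
object ContextSketch {
  // Returns (partition count, element count, configured app name).
  def run(): (Int, Long, String) = {
    val conf = new SparkConf()
      .setAppName("ContextSketch")
      .setMaster("local[2]")             // two local worker threads
      .set("spark.ui.enabled", "false")  // skip the web UI for a short-lived job
    val sc = new SparkContext(conf)
    try {
      // numSlices sets the number of partitions explicitly.
      val numbers = sc.parallelize(1 to 100, numSlices = 4)
      (numbers.getNumPartitions, numbers.count(), sc.getConf.get("spark.app.name"))
    } finally {
      sc.stop() // only one active SparkContext is allowed per JVM
    }
  }
}
```

Wrapping the work in `try`/`finally` is a design choice here: it ensures `stop()` runs even if a job fails, so a later context can be created in the same JVM.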

[SparkContext and Configuration](./sparkcontext.md)

### RDD Operations

Core distributed dataset operations including transformations (lazy) and actions (eager execution).

```scala { .api }
abstract class RDD[T: ClassTag]

// Core transformations
def map[U: ClassTag](f: T => U): RDD[U]
def filter(f: T => Boolean): RDD[T]
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]
def union(other: RDD[T]): RDD[T]
def distinct(): RDD[T]

// Core actions
def collect(): Array[T]
def count(): Long
def reduce(f: (T, T) => T): T
def foreach(f: T => Unit): Unit
```
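
The difference between lazy transformations and eager actions can be sketched as follows. The object name `RddOpsSketch` and the sample sentences are hypothetical; only the RDD methods listed above are used.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, used only for this illustration.
object RddOpsSketch {
  // Returns (sorted distinct words, total word count, total character count).
  def run(): (Seq[String], Long, Int) = {
    val conf = new SparkConf().setAppName("RddOpsSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val lines = sc.parallelize(Seq("spark core", "core rdd", "rdd api"))
      // Transformations: these only build up the lineage, no job runs yet.
      val words = lines.flatMap(line => line.split(" ").toSeq)
      val uniqueWords = words.distinct()
      // Actions: each of these triggers a job and returns a value to the driver.
      val sortedUnique = uniqueWords.collect().sorted.toSeq
      val wordCount = words.count()
      val totalChars = words.map(_.length).reduce(_ + _)
      (sortedUnique, wordCount, totalChars)
    } finally {
      sc.stop()
    }
  }
}
```

Note that `words` is recomputed for each of the three actions; caching it would avoid that, at the cost of memory.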

[RDD Operations](./rdd-operations.md)

### Key-Value Operations

Specialized operations available on RDDs of key-value pairs, including joins, grouping, and aggregations.

```scala { .api }
// Available on RDD[(K, V)] via implicit conversion
def groupByKey(): RDD[(K, Iterable[V])]
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]
def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)]
```
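
A short sketch of these pair operations on made-up data follows. The object name `PairOpsSketch` and the `sales` and `regions` datasets are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, used only for this illustration.
object PairOpsSketch {
  type Result = (Map[String, Int], Map[String, (Int, Option[String])], Map[String, (Int, Int)])

  def run(): Result = {
    val conf = new SparkConf().setAppName("PairOpsSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val sales = sc.parallelize(Seq(("apple", 3), ("pear", 2), ("apple", 4)))
      val regions = sc.parallelize(Seq(("apple", "north")))

      // Sum the values per key.
      val totals = sales.reduceByKey(_ + _)
      // Keep every key from totals; keys missing on the right yield None.
      val withRegion = totals.leftOuterJoin(regions)
      // Build a (sum, count) pair per key, e.g. as input for an average.
      val sumAndCount = sales.aggregateByKey((0, 0))(
        (acc, v) => (acc._1 + v, acc._2 + 1),   // fold a value into a partition-local accumulator
        (a, b) => (a._1 + b._1, a._2 + b._2)    // merge accumulators across partitions
      )
      (totals.collect().toMap, withRegion.collect().toMap, sumAndCount.collect().toMap)
    } finally {
      sc.stop()
    }
  }
}
```

`reduceByKey` and `aggregateByKey` combine values on each partition before shuffling, which is usually cheaper than `groupByKey` followed by a local reduction.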

[Key-Value Operations](./key-value-operations.md)

### Broadcast Variables and Accumulators

Shared variables for efficient data distribution and accumulation across cluster nodes.

```scala { .api }
abstract class Broadcast[T: ClassTag]
def value: T
def unpersist(): Unit

abstract class AccumulatorV2[IN, OUT]
def add(v: IN): Unit
def value: OUT
def reset(): Unit
```
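
The sketch below shows one plausible way to combine the two: a broadcast lookup table read inside tasks, and a `LongAccumulator` that counts keys missing from it. The object name `SharedVarsSketch`, the country-code data, and the accumulator name are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, used only for this illustration.
object SharedVarsSketch {
  // Returns (resolved names, number of codes missing from the lookup table).
  def run(): (Seq[String], Long) = {
    val conf = new SparkConf().setAppName("SharedVarsSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Shipped to each executor once instead of once per task.
      val lookup = sc.broadcast(Map("US" -> "United States", "DE" -> "Germany"))
      val unknownCodes = sc.longAccumulator("unknownCodes")
      val codes = sc.parallelize(Seq("US", "DE", "XX", "US"))

      // Update the accumulator inside an action, so each element is counted once.
      codes.foreach(code => if (!lookup.value.contains(code)) unknownCodes.add(1))
      // Read the broadcast value inside a transformation.
      val names = codes.map(code => lookup.value.getOrElse(code, "unknown")).collect().toSeq

      lookup.unpersist() // release executor copies; the driver keeps the original
      (names, unknownCodes.sum)
    } finally {
      sc.stop()
    }
  }
}
```

The accumulator is updated in `foreach` rather than in `map` on purpose: updates made inside transformations may be applied more than once if a task is re-executed.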

[Broadcast Variables and Accumulators](./broadcast-accumulators.md)

### Data I/O and Persistence

Input/output operations for various data sources and RDD caching strategies.

```scala { .api }
// Input operations (on SparkContext)
def textFile(path: String): RDD[String]
def sequenceFile[K, V](path: String): RDD[(K, V)]
def hadoopRDD[K, V](conf: JobConf, inputFormatClass: Class[_ <: InputFormat[K, V]], keyClass: Class[K], valueClass: Class[V], minPartitions: Int = defaultMinPartitions): RDD[(K, V)]

// Persistence (on RDD[T])
def persist(newLevel: StorageLevel): RDD[T]
def cache(): RDD[T]
def unpersist(): RDD[T]
```
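
To illustrate reading a text file and reusing a persisted RDD across several actions, here is a minimal sketch. The object name `PersistenceSketch` and the temporary file contents are assumptions; the file is created locally only so the example is self-contained.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical helper object, used only for this illustration.
object PersistenceSketch {
  // Returns (sum of line lengths, longest line length, whether the level was applied).
  def run(): (Int, Int, Boolean) = {
    val conf = new SparkConf().setAppName("PersistenceSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val tmp = Files.createTempFile("spark-sketch", ".txt")
    try {
      Files.write(tmp, "alpha\nbeta\ngamma\n".getBytes(StandardCharsets.UTF_8))

      val lines = sc.textFile(tmp.toUri.toString)
      // Mark the derived RDD for caching; nothing is stored until an action runs.
      val lengths = lines.map(_.length).persist(StorageLevel.MEMORY_AND_DISK)

      val total = lengths.reduce(_ + _) // first action reads the file and fills the cache
      val longest = lengths.max()       // second action can reuse the cached partitions
      val levelApplied = lengths.getStorageLevel == StorageLevel.MEMORY_AND_DISK

      lengths.unpersist()
      (total, longest, levelApplied)
    } finally {
      sc.stop()
      Files.deleteIfExists(tmp)
    }
  }
}
```

`cache()` is shorthand for `persist(StorageLevel.MEMORY_ONLY)`; an explicit level such as `MEMORY_AND_DISK` is worth considering when recomputation is expensive and the data may not fit in memory.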

[Data I/O and Persistence](./data-io-persistence.md)

## Types

```scala { .api }
// Core configuration
class SparkConf(loadDefaults: Boolean = true) {
  def set(key: String, value: String): SparkConf
  def setAppName(name: String): SparkConf
  def setMaster(master: String): SparkConf
  def get(key: String): String
}

// Storage levels
object StorageLevel {
  val NONE: StorageLevel
  val MEMORY_ONLY: StorageLevel
  val MEMORY_AND_DISK: StorageLevel
  val MEMORY_ONLY_SER: StorageLevel
  val DISK_ONLY: StorageLevel
}

// Partitioning
abstract class Partitioner {
  def numPartitions: Int
  def getPartition(key: Any): Int
}

class HashPartitioner(partitions: Int) extends Partitioner

// Binary data representation
class PortableDataStream(
  isplit: CombineFileSplit,
  context: TaskAttemptContext,
  index: Integer
) {
  def open(): DataInputStream
  def toArray(): Array[Byte]
}
```
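
The `Partitioner` contract can be sketched with a small custom implementation. `ParityPartitioner` and `PartitionerSketch` are hypothetical names invented for this example; the built-in `HashPartitioner` is included for comparison.

```scala
import org.apache.spark.{HashPartitioner, Partitioner, SparkConf, SparkContext}

// Hypothetical partitioner: even Int keys go to partition 0, odd keys to partition 1.
class ParityPartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int = key match {
    case i: Int => java.lang.Math.floorMod(i, 2) // floorMod keeps negative keys in range
    case _      => 0                             // any non-Int key falls back to partition 0
  }
  // Equality lets Spark recognise two RDDs as co-partitioned and skip a shuffle.
  override def equals(other: Any): Boolean = other.isInstanceOf[ParityPartitioner]
  override def hashCode: Int = numPartitions
}

// Hypothetical helper object, used only for this illustration.
object PartitionerSketch {
  // Returns (keys grouped by partition, HashPartitioner's partition for key 6).
  def run(): (Seq[Seq[Int]], Int) = {
    val conf = new SparkConf().setAppName("PartitionerSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d"))
        .partitionBy(new ParityPartitioner)
      // glom() turns each partition into an array so the layout can be inspected.
      val layout = pairs.keys.glom().collect().map(_.sorted.toSeq).toSeq
      // HashPartitioner uses the key's hashCode modulo the partition count.
      val hashed = new HashPartitioner(4).getPartition(6)
      (layout, hashed)
    } finally {
      sc.stop()
    }
  }
}
```

With this data, partition 0 holds keys 2 and 4 and partition 1 holds keys 1 and 3, while `HashPartitioner(4)` places the key 6 in partition 2.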