# Apache Spark Core

Apache Spark Core is the foundational module of the Apache Spark unified analytics engine for large-scale data processing. It provides fundamental distributed computing capabilities including RDD (Resilient Distributed Dataset) abstractions, task scheduling, memory management, fault recovery, and storage system interaction. Spark Core is the base layer for all other Spark components, exposing high-performance distributed computing primitives with automatic fault tolerance.

## Package Information

- **Package Name**: org.apache.spark:spark-core_2.12
- **Package Type**: maven
- **Language**: Scala (with Java compatibility)
- **Version**: 3.5.6
- **Installation**: Add to `pom.xml` or `build.sbt`

Maven:

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.12</artifactId>
  <version>3.5.6</version>
</dependency>
```

SBT:

```scala
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.6"
```

## Core Imports

Scala:

```scala
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.util.{AccumulatorV2, LongAccumulator, DoubleAccumulator, CollectionAccumulator}
import org.apache.spark.{Dependency, Partition, Partitioner}
import org.apache.spark.TaskContext
import org.apache.spark.input.PortableDataStream
import scala.reflect.ClassTag
```

Java:

```java
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.SparkConf;
```

## Basic Usage

```scala
import org.apache.spark.{SparkContext, SparkConf}

// Create Spark configuration and context
val conf = new SparkConf()
  .setAppName("My Spark Application")
  .setMaster("local[*]")

val sc = new SparkContext(conf)

// Create RDD from local collection
val data = sc.parallelize(1 to 10000)

// Transform and compute
val squares = data.map(x => x * x)
val sum = squares.reduce(_ + _)

println(s"Sum of squares: $sum")

// Clean up
sc.stop()
```

Java example:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
    .setAppName("My Java Spark Application")
    .setMaster("local[*]");

JavaSparkContext sc = new JavaSparkContext(conf);

// Create RDD from list
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> rdd = sc.parallelize(data);

// Transform and collect
JavaRDD<Integer> squares = rdd.map(x -> x * x);
List<Integer> result = squares.collect();

sc.stop();
```

## Architecture

Apache Spark Core is built around several fundamental components (a short sketch of how they surface in user code follows the list):

- **SparkContext**: The main entry point and orchestrator for Spark functionality
- **RDD (Resilient Distributed Dataset)**: The core abstraction for distributed collections with fault tolerance
- **Task Scheduler**: Manages execution of operations across cluster nodes
- **Storage System**: Handles caching, persistence, and data locality optimization
- **Shared Variables**: Broadcast variables and accumulators for efficient data sharing
- **Serialization Framework**: Pluggable serialization for network communication and storage
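
Every RDD carries its lineage, the dependency graph the scheduler uses to plan stages and to recompute lost partitions, and that lineage can be inspected directly. A minimal sketch in local mode:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("lineage-demo").setMaster("local[*]"))

val words  = sc.parallelize(Seq("a", "b", "a", "c"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

// Each RDD records its parent dependencies; the task scheduler uses this
// lineage graph for staging and for fault recovery
println(counts.toDebugString)   // prints the chain of RDDs behind `counts`
println(counts.dependencies)    // the shuffle dependency introduced by reduceByKey

sc.stop()
```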

## Capabilities

### Spark Context and Configuration

Core entry points for creating and configuring Spark applications. SparkContext serves as the main API for creating RDDs and managing the application lifecycle.

```scala { .api }
class SparkContext(config: SparkConf)
class SparkConf(loadDefaults: Boolean = true)
```

[Context and Configuration](./context-config.md)
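
A brief usage sketch (local mode assumed): SparkConf holds plain string key/value settings with chainable setters, and the live configuration is readable back through the context:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("config-demo")
  .setMaster("local[*]")               // local mode, for illustration
  .set("spark.executor.memory", "2g")  // any setting is a string key/value pair

val sc = new SparkContext(conf)

println(sc.appName)                               // "config-demo"
println(sc.defaultParallelism)                    // parallelism hint used by parallelize()
println(sc.getConf.get("spark.executor.memory"))  // "2g"

sc.stop()
```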

### RDD Operations

Fundamental distributed dataset operations including transformations (map, filter, join) and actions (collect, reduce, save). RDDs provide the core abstraction for distributed data processing with automatic fault tolerance.

```scala { .api }
abstract class RDD[T: ClassTag]
def map[U: ClassTag](f: T => U): RDD[U]
def filter(f: T => Boolean): RDD[T]
def collect(): Array[T]
```

[RDD Operations](./rdd-operations.md)
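
A short sketch of the lazy-evaluation contract, reusing `sc` from the Basic Usage example: transformations only describe the computation, and nothing executes until an action runs:

```scala
val nums = sc.parallelize(1 to 100)

val evens   = nums.filter(_ % 2 == 0)   // transformation: lazily recorded
val doubled = evens.map(_ * 2)          // transformation: still no job submitted

println(doubled.count())                 // action: triggers a distributed job
println(doubled.take(5).mkString(", "))  // action: fetches the first 5 elements
```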

### Java API Compatibility

Complete Java API wrappers providing Java-friendly interfaces for all Spark Core functionality. Enables seamless usage from Java applications while maintaining full feature compatibility.

```java { .api }
public class JavaSparkContext
public class JavaRDD<T>
public class JavaPairRDD<K, V>
```

[Java API](./java-api.md)
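
The wrappers are thin views over the Scala classes, and the two sides convert freely. A small sketch from the Scala side, again assuming the `sc` from Basic Usage:

```scala
import org.apache.spark.api.java.{JavaRDD, JavaSparkContext}
import org.apache.spark.rdd.RDD

// A JavaSparkContext wraps (not replaces) the underlying SparkContext
val jsc: JavaSparkContext = JavaSparkContext.fromSparkContext(sc)

val scalaRdd: RDD[Int]    = sc.parallelize(1 to 10)
val javaRdd: JavaRDD[Int] = JavaRDD.fromRDD(scalaRdd)  // Scala -> Java view
val unwrapped: RDD[Int]   = javaRdd.rdd                // Java -> Scala again
```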

### Storage and Caching

Persistence mechanisms for RDDs including memory, disk, and off-heap storage options. Provides fine-grained control over data persistence with various storage levels and replication strategies.

```scala { .api }
class StorageLevel(useDisk: Boolean, useMemory: Boolean, useOffHeap: Boolean, deserialized: Boolean, replication: Int)
def persist(newLevel: StorageLevel): RDD[T]
def cache(): RDD[T]
```

[Storage and Caching](./storage-caching.md)
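
For example (with `sc` as before): a storage level is assigned once per RDD and takes effect when the first action materializes it:

```scala
import org.apache.spark.storage.StorageLevel

val nums = sc.parallelize(1 to 1000000)

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
val cached = nums.map(_ * 2).cache()

// Explicit level: memory first, spill to disk, replicated on 2 nodes
val persisted = nums.map(_ + 1).persist(StorageLevel.MEMORY_AND_DISK_2)

println(cached.count())   // first action populates the cache
cached.unpersist()        // release the cached blocks
```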

### Shared Variables

Broadcast variables and accumulators for efficient data sharing across distributed computations. Broadcast variables enable read-only sharing of large datasets, while accumulators provide write-only variables for aggregations.

```scala { .api }
abstract class Broadcast[T]
abstract class AccumulatorV2[IN, OUT]
class LongAccumulator extends AccumulatorV2[java.lang.Long, java.lang.Long]
```

[Shared Variables](./shared-variables.md)
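
A compact sketch of both mechanisms (with `sc` as before):

```scala
// Broadcast: ship a read-only lookup table to every executor once
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

// Accumulator: tasks add to it; only the driver reads the merged value
val misses = sc.longAccumulator("lookup misses")

val ids = sc.parallelize(Seq("a", "b", "x"))
val resolved = ids.map { key =>
  val hit = lookup.value.get(key)
  if (hit.isEmpty) misses.add(1)
  hit.getOrElse(-1)
}.collect()

println(resolved.mkString(", "))  // 1, 2, -1
println(misses.value)             // 1 ("x" was not in the table)
```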

### Task Execution Context

Runtime context and utilities available to running tasks including partition information, memory management, and lifecycle hooks.

```scala { .api }
abstract class TaskContext
def partitionId(): Int
def stageId(): Int
def taskAttemptId(): Long
```

[Task Context](./task-context.md)
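
A minimal sketch: `TaskContext.get()` is only meaningful inside code executing on an executor, for example within `mapPartitions`:

```scala
import org.apache.spark.TaskContext

val data = sc.parallelize(1 to 100, numSlices = 4)

val tagged = data.mapPartitions { iter =>
  val ctx = TaskContext.get()
  // Lifecycle hook: runs when this task completes (success or failure)
  ctx.addTaskCompletionListener[Unit] { _ => () /* release per-task resources */ }
  iter.map(x => (ctx.stageId(), ctx.partitionId(), x))
}

tagged.take(3).foreach(println)
```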

### Resource Management

Modern resource allocation and management including resource profiles, executor resource requests, and task resource requirements for heterogeneous workloads.

```scala { .api }
class ResourceProfile(executorResources: Map[String, ExecutorResourceRequest], taskResources: Map[String, TaskResourceRequest])
class ResourceProfileBuilder()
```

[Resource Management](./resource-management.md)
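
A sketch of stage-level scheduling under assumed prerequisites: a Spark 3.1+ cluster with dynamic allocation and GPU discovery configured (the discovery-script path below is hypothetical):

```scala
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Executors for this profile: 4 cores, 8g heap, 1 GPU each
val execReqs = new ExecutorResourceRequests()
  .cores(4)
  .memory("8g")
  .resource("gpu", 1, discoveryScript = "/opt/spark/getGpus.sh")  // hypothetical path

// Each task claims 1 CPU and a whole GPU
val taskReqs = new TaskResourceRequests().cpus(1).resource("gpu", 1.0)

val profile = new ResourceProfileBuilder()
  .require(execReqs)
  .require(taskReqs)
  .build()

// Per-stage application: this RDD's tasks run on executors matching the profile
val result = sc.parallelize(1 to 10)
  .withResources(profile)
  .map(x => x * 2)   // placeholder for GPU-bound work
  .collect()
```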

### Serialization Framework

Pluggable serialization system supporting Java serialization and Kryo for optimized network communication and storage. Provides extensible serialization with performance tuning options.

```scala { .api }
abstract class SerializerInstance
class JavaSerializer(conf: SparkConf) extends Serializer
class KryoSerializer(conf: SparkConf) extends Serializer
```

[Serialization](./serialization.md)
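
For instance, switching an application to Kryo is a configuration change; registering classes up front lets Kryo write compact class identifiers instead of full class names. A minimal sketch with a hypothetical `Event` class:

```scala
import org.apache.spark.SparkConf

case class Event(id: Long, payload: String)

val conf = new SparkConf()
  .setAppName("kryo-demo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Event]))
```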