or run

npx @tessl/cli init
Log in

Version

Tile

Overview

Evals

Files

tessl/maven-org-apache-spark--spark-core

Apache Spark Core provides distributed computing capabilities with RDDs, task scheduling, and cluster management for big data processing

Workspace
tessl
Visibility
Public
Created
Last updated
Describes
mavenpkg:maven/org.apache.spark/spark-core_2.11@2.2.x

To install, run

npx @tessl/cli install tessl/maven-org-apache-spark--spark-core@2.2.0

0

# Apache Spark Core

1

2

Apache Spark Core is the foundational engine for large-scale distributed data processing. It provides resilient distributed datasets (RDDs), in-memory computing capabilities, and a unified execution engine for batch and interactive data processing across clusters.

3

4

## Package Information

5

6

- **Package Name**: org.apache.spark:spark-core_2.11

7

- **Package Type**: Maven

8

- **Language**: Scala (with Java API)

9

- **Installation**:

10

```xml

11

<dependency>

12

<groupId>org.apache.spark</groupId>

13

<artifactId>spark-core_2.11</artifactId>

14

<version>2.2.3</version>

15

</dependency>

16

```

17

18

## Core Imports

19

20

Scala:

21

```scala { .api }

22

import org.apache.spark.{SparkContext, SparkConf}

23

import org.apache.spark.rdd.RDD

24

import org.apache.spark.storage.StorageLevel

25

import org.apache.spark.broadcast.Broadcast

26

import org.apache.spark.util.{LongAccumulator, DoubleAccumulator, AccumulatorV2}

27

```

28

29

Java:

30

```java { .api }

31

import org.apache.spark.api.java.JavaSparkContext;

32

import org.apache.spark.api.java.JavaRDD;

33

import org.apache.spark.api.java.JavaPairRDD;

34

import org.apache.spark.SparkConf;

35

import org.apache.spark.broadcast.Broadcast;

36

import org.apache.spark.util.LongAccumulator;

37

```

38

39

## Basic Usage

40

41

```scala

42

import org.apache.spark.{SparkContext, SparkConf}

43

44

// Create Spark configuration

45

val conf = new SparkConf()

46

.setAppName("My Spark Application")

47

.setMaster("local[*]")

48

49

// Create Spark context

50

val sc = new SparkContext(conf)

51

52

// Create RDD from collection

53

val data = sc.parallelize(Seq(1, 2, 3, 4, 5))

54

55

// Transform and compute

56

val result = data

57

.filter(_ % 2 == 0)

58

.map(_ * 2)

59

.collect()

60

61

// Clean up

62

sc.stop()

63

```

64

65

## Architecture

66

67

Apache Spark Core is built around several key abstractions:

68

69

- **SparkContext**: The main entry point that coordinates distributed computing

70

- **RDD (Resilient Distributed Dataset)**: Immutable, fault-tolerant collections distributed across cluster nodes

71

- **Transformations**: Lazy operations that create new RDDs (map, filter, groupBy)

72

- **Actions**: Operations that trigger computation and return results (collect, count, save)

73

- **Broadcast Variables**: Read-only variables cached on all nodes

74

- **Accumulators**: Variables for aggregating information across tasks

75

76

## Capabilities

77

78

### Core SparkContext and Configuration

79

80

The SparkContext serves as the primary entry point for all Spark functionality, providing methods for creating RDDs, managing cluster resources, and configuring applications.

81

82

```scala { .api }

83

class SparkContext(config: SparkConf) {

84

def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T]

85

def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]

86

def stop(): Unit

87

}

88

89

class SparkConf(loadDefaults: Boolean = true) {

90

def setAppName(name: String): SparkConf

91

def setMaster(master: String): SparkConf

92

def set(key: String, value: String): SparkConf

93

}

94

```

95

96

[Core SparkContext and Configuration](./spark-context.md)

97

98

### RDD Operations and Transformations

99

100

RDDs provide the core abstraction for distributed data processing with lazy transformations and eager actions that enable fault-tolerant computation across cluster nodes.

101

102

```scala { .api }

103

abstract class RDD[T: ClassTag] {

104

def map[U: ClassTag](f: T => U): RDD[U]

105

def filter(f: T => Boolean): RDD[T]

106

def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]

107

def groupBy[K](f: T => K): RDD[(K, Iterable[T])]

108

def collect(): Array[T]

109

def count(): Long

110

def reduce(f: (T, T) => T): T

111

}

112

```

113

114

[RDD Operations and Transformations](./rdd-operations.md)

115

116

### Java API

117

118

Java-friendly wrappers provide seamless integration with Java applications while maintaining full access to Spark's distributed computing capabilities.

119

120

```java { .api }

121

public class JavaSparkContext {

122

public JavaSparkContext(SparkConf conf)

123

public <T> JavaRDD<T> parallelize(List<T> list)

124

public JavaRDD<String> textFile(String path)

125

public void close()

126

}

127

128

public class JavaRDD<T> {

129

public <R> JavaRDD<R> map(Function<T, R> f)

130

public JavaRDD<T> filter(Function<T, Boolean> f)

131

public List<T> collect()

132

}

133

```

134

135

[Java API](./java-api.md)

136

137

### Storage and Persistence

138

139

Storage and persistence mechanisms allow RDDs to be cached in memory or persisted to disk with configurable storage levels for performance optimization.

140

141

```scala { .api }

142

abstract class RDD[T] {

143

def persist(newLevel: StorageLevel): this.type

144

def cache(): this.type

145

def unpersist(blocking: Boolean = true): this.type

146

}

147

148

object StorageLevel {

149

val MEMORY_ONLY: StorageLevel

150

val MEMORY_AND_DISK: StorageLevel

151

val DISK_ONLY: StorageLevel

152

}

153

```

154

155

[Storage and Persistence](./storage-persistence.md)

156

157

### Broadcasting and Accumulators

158

159

Shared variables enable efficient distribution of read-only data and aggregation of values across distributed computations without expensive network operations.

160

161

```scala { .api }

162

abstract class Broadcast[T] {

163

def value: T

164

def unpersist(): Unit

165

def destroy(): Unit

166

}

167

168

trait AccumulatorV2[IN, OUT] {

169

def add(v: IN): Unit

170

def value: OUT

171

def isZero: Boolean

172

}

173

```

174

175

[Broadcasting and Accumulators](./broadcasting-accumulators.md)

176

177

## Core Types

178

179

```scala { .api }

180

// Core cluster management

181

trait TaskContext {

182

def stageId(): Int

183

def partitionId(): Int

184

def taskAttemptId(): Long

185

}

186

187

// Partitioning

188

abstract class Partitioner {

189

def numPartitions: Int

190

def getPartition(key: Any): Int

191

}

192

193

// Asynchronous operations

194

trait FutureAction[T] {

195

def cancel(): Unit

196

def isCompleted: Boolean

197

def result(atMost: Duration): T

198

}

199

```