or run

npx @tessl/cli init
Log in

Version

Tile

Overview

Evals

Files

tessl/maven-org-apache-spark--spark-core_2-10

Core functionality for Apache Spark, providing RDDs, SparkContext, and the fundamental distributed computing engine for big data processing.

Workspace
tessl
Visibility
Public
Created
Last updated
Describes
mavenpkg:maven/org.apache.spark/spark-core_2.10@1.6.x

To install, run

npx @tessl/cli install tessl/maven-org-apache-spark--spark-core_2-10@1.6.0

0

# Apache Spark Core

1

2

Apache Spark Core provides the foundational distributed computing engine for Apache Spark, implementing core abstractions like Resilient Distributed Datasets (RDDs) that enable fault-tolerant distributed data processing across clusters. It includes the SparkContext for managing distributed applications, schedulers for task execution, serializers for data exchange, broadcast variables for efficient data sharing, accumulators for distributed counters, and comprehensive APIs for data transformations and actions.

3

4

## Package Information

5

6

- **Package Name**: org.apache.spark:spark-core_2.10

7

- **Package Type**: Maven

8

- **Language**: Scala

9

- **Installation**: Add to your Maven POM or SBT build file

10

- **Documentation**: http://spark.apache.org/docs/1.6.3/

11

12

## Core Imports

13

14

```scala

15

import org.apache.spark.{SparkContext, SparkConf}

16

import org.apache.spark.rdd.RDD

17

```

18

19

For Java applications:

20

21

```java

22

import org.apache.spark.SparkContext;

23

import org.apache.spark.SparkConf;

24

import org.apache.spark.api.java.JavaSparkContext;

25

import org.apache.spark.api.java.JavaRDD;

26

import org.apache.spark.api.java.JavaPairRDD;

27

import org.apache.spark.broadcast.Broadcast;

28

import org.apache.spark.Accumulator;

29

```

30

31

## Basic Usage

32

33

```scala

34

import org.apache.spark.{SparkContext, SparkConf}

35

36

// Configure Spark application

37

val conf = new SparkConf()

38

.setAppName("MySparkApp")

39

.setMaster("local[*]")

40

41

// Create SparkContext

42

val sc = new SparkContext(conf)

43

44

// Create RDD from collection

45

val data = Array(1, 2, 3, 4, 5)

46

val rdd = sc.parallelize(data)

47

48

// Transform and act on RDD

49

val result = rdd

50

.map(_ * 2)

51

.filter(_ > 5)

52

.collect()

53

54

// Broadcast variable

55

val broadcastVar = sc.broadcast(Array(1, 2, 3))

56

57

// Accumulator

58

val accum = sc.accumulator(0)

59

60

// Clean up

61

sc.stop()

62

```

63

64

## Architecture

65

66

Apache Spark Core is built around several key components:

67

68

- **SparkContext**: Main entry point and driver that coordinates distributed execution

69

- **RDD Abstraction**: Resilient Distributed Datasets providing fault-tolerant distributed collections

70

- **Lazy Evaluation**: Operations are lazily evaluated until an action is called

71

- **DAG Scheduler**: Converts logical execution plans into physical stages

72

- **Task Scheduler**: Manages task execution across cluster nodes

73

- **Storage System**: Manages caching and persistence of RDD partitions

74

- **Shuffle System**: Handles data redistribution across cluster nodes

75

76

## Capabilities

77

78

### SparkContext and Configuration

79

80

Core entry point for Spark applications with configuration management and resource coordination. Essential for creating RDDs, managing cluster connections, and coordinating distributed execution.

81

82

```scala { .api }

83

class SparkContext(config: SparkConf)

84

class SparkConf()

85

86

// Core RDD creation methods

87

def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T]

88

def range(start: Long, end: Long, step: Long = 1, numSlices: Int = defaultParallelism): RDD[Long]

89

def makeRDD[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T]

90

def emptyRDD[T: ClassTag]: RDD[T]

91

92

// File I/O methods

93

def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]

94

def wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)]

95

def binaryFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, PortableDataStream)]

96

def objectFile[T: ClassTag](path: String, minPartitions: Int = defaultMinPartitions): RDD[T]

97

98

// Shared variables

99

def broadcast[T: ClassTag](value: T): Broadcast[T]

100

def accumulator[T](initialValue: T)(implicit param: AccumulatorParam[T]): Accumulator[T]

101

102

// Context management

103

def stop(): Unit

104

def setCheckpointDir(directory: String): Unit

105

```

106

107

[SparkContext and Configuration](./spark-context.md)

108

109

### RDD Operations and Transformations

110

111

Resilient Distributed Datasets providing the core abstraction for distributed data processing with transformations, actions, and persistence capabilities.

112

113

```scala { .api }

114

abstract class RDD[T: ClassTag]

115

116

// Core transformations (lazy evaluation)

117

def map[U: ClassTag](f: T => U): RDD[U]

118

def filter(f: T => Boolean): RDD[T]

119

def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]

120

def union(other: RDD[T]): RDD[T]

121

def distinct(): RDD[T]

122

def sample(withReplacement: Boolean, fraction: Double, seed: Long): RDD[T]

123

124

// Advanced transformations

125

def sortBy[K](f: T => K, ascending: Boolean = true)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]

126

def keyBy[K](f: T => K): RDD[(K, T)]

127

def zipWithIndex(): RDD[(T, Long)]

128

def zipWithUniqueId(): RDD[(T, Long)]

129

130

// Core actions

131

def collect(): Array[T]

132

def count(): Long

133

def first(): T

134

def take(num: Int): Array[T]

135

def reduce(f: (T, T) => T): T

136

def fold(zeroValue: T)(op: (T, T) => T): T

137

def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U

138

139

// Persistence methods

140

def persist(newLevel: StorageLevel): RDD[T]

141

def cache(): RDD[T]

142

def unpersist(blocking: Boolean = true): RDD[T]

143

```

144

145

[RDD Operations](./rdd-operations.md)

146

147

### Key-Value Pair Operations

148

149

Specialized operations for RDDs containing key-value pairs, including joins, grouping, and aggregation operations essential for data processing workflows.

150

151

```scala { .api }

152

class PairRDDFunctions[K, V](self: RDD[(K, V)])

153

154

def groupByKey(): RDD[(K, Iterable[V])]

155

def reduceByKey(func: (V, V) => V): RDD[(K, V)]

156

def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

157

def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]

158

```

159

160

[Key-Value Operations](./key-value-operations.md)

161

162

### Shared Variables

163

164

Broadcast variables and accumulators for efficient data sharing and distributed counting across cluster nodes.

165

166

```scala { .api }

167

abstract class Broadcast[T]

168

class Accumulator[T]

169

170

// SparkContext methods for shared variables

171

def broadcast[T](value: T): Broadcast[T]

172

def accumulator[T](initialValue: T): Accumulator[T]

173

```

174

175

[Shared Variables](./shared-variables.md)

176

177

### Storage and Persistence

178

179

Storage levels and caching mechanisms for optimizing RDD persistence across memory and disk with various replication strategies.

180

181

```scala { .api }

182

object StorageLevel

183

184

// Storage level constants

185

val MEMORY_ONLY: StorageLevel

186

val MEMORY_AND_DISK: StorageLevel

187

val DISK_ONLY: StorageLevel

188

189

// RDD persistence methods

190

def persist(storageLevel: StorageLevel): RDD[T]

191

def cache(): RDD[T]

192

def unpersist(): RDD[T]

193

```

194

195

[Storage and Persistence](./storage-persistence.md)

196

197

### Input/Output Operations

198

199

File I/O operations for reading from and writing to various data sources including text files, sequence files, and Hadoop-compatible formats.

200

201

```scala { .api }

202

// SparkContext I/O methods

203

def textFile(path: String): RDD[String]

204

def wholeTextFiles(path: String): RDD[(String, String)]

205

def sequenceFile[K, V](path: String): RDD[(K, V)]

206

207

// RDD output methods

208

def saveAsTextFile(path: String): Unit

209

def saveAsSequenceFile(path: String): Unit

210

```

211

212

[Input/Output Operations](./io-operations.md)

213

214

### Partitioning and Shuffling

215

216

Partitioning strategies and shuffle operations for controlling data distribution and optimizing performance across cluster nodes.

217

218

```scala { .api }

219

abstract class Partitioner

220

class HashPartitioner(partitions: Int) extends Partitioner

221

class RangePartitioner[K, V](partitions: Int, rdd: RDD[(K, V)]) extends Partitioner

222

223

// Partitioning methods

224

def partitionBy(partitioner: Partitioner): RDD[(K, V)]

225

def repartition(numPartitions: Int): RDD[T]

226

def coalesce(numPartitions: Int): RDD[T]

227

```

228

229

[Partitioning and Shuffling](./partitioning-shuffling.md)

230

231

## Types

232

233

```scala { .api }

234

// Core type aliases and abstractions

235

type Partition = org.apache.spark.Partition

236

type TaskContext = org.apache.spark.TaskContext

237

type SparkFiles = org.apache.spark.SparkFiles.type

238

239

// Function type aliases for Java interop

240

type Function[T, R] = T => R

241

type Function2[T1, T2, R] = (T1, T2) => R

242

type VoidFunction[T] = T => Unit

243

```