
# Context and Configuration


## SparkContext


The main entry point for Spark functionality. SparkContext represents the connection to a Spark cluster and is used to create RDDs, accumulators, and broadcast variables.


```scala { .api }
class SparkContext(config: SparkConf) {
  // RDD Creation
  def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T]
  def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
  def wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)]

  // Hadoop Integration
  def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]](
      path: String,
      fClass: Class[F],
      kClass: Class[K],
      vClass: Class[V],
      conf: Configuration = hadoopConfiguration
  ): RDD[(K, V)]

  def hadoopFile[K, V](
      path: String,
      inputFormatClass: Class[_ <: InputFormat[K, V]],
      keyClass: Class[K],
      valueClass: Class[V],
      minPartitions: Int = defaultMinPartitions
  ): RDD[(K, V)]

  // Shared Variables
  def broadcast[T: ClassTag](value: T): Broadcast[T]
  def longAccumulator(): LongAccumulator
  def longAccumulator(name: String): LongAccumulator
  def doubleAccumulator(): DoubleAccumulator
  def doubleAccumulator(name: String): DoubleAccumulator
  def collectionAccumulator[T](): CollectionAccumulator[T]
  def collectionAccumulator[T](name: String): CollectionAccumulator[T]

  // Configuration and Properties
  def getConf: SparkConf
  def master: String
  def appName: String
  def applicationId: String
  def defaultParallelism: Int
  val startTime: Long
  def version: String

  // File and Jar Management
  def addFile(path: String): Unit
  def addFile(path: String, recursive: Boolean): Unit
  def addJar(path: String): Unit

  // Job Management
  def setJobGroup(groupId: String, description: String, interruptOnCancel: Boolean = false): Unit
  def clearJobGroup(): Unit
  def cancelJobGroup(groupId: String): Unit

  // Additional RDD Creation
  def binaryFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, PortableDataStream)]
  def union[T: ClassTag](rdds: Seq[RDD[T]]): RDD[T]
  def union[T: ClassTag](first: RDD[T], rest: RDD[T]*): RDD[T]

  // Lifecycle Management
  def stop(): Unit
  def statusTracker: SparkStatusTracker
}
```

### Usage Examples

Creating a SparkContext:

72

```scala
import org.apache.spark.{SparkContext, SparkConf}

val conf = new SparkConf()
  .setAppName("My Application")
  .setMaster("local[4]") // Use 4 local threads

val sc = new SparkContext(conf)

// Use SparkContext...

sc.stop() // Always stop when done
```

Reading data:

87

```scala
// From local collection
val numbers = sc.parallelize(1 to 1000, 8) // 8 partitions

// From text file
val lines = sc.textFile("hdfs://path/to/file.txt")

// Multiple text files as (filename, content) pairs
val files = sc.wholeTextFiles("hdfs://path/to/directory/")
```
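The shared-variable methods declared above work together naturally: a broadcast variable ships a read-only value to executors once, while an accumulator aggregates counts back to the driver. A minimal sketch, assuming a local master (the names `lookup` and `misses` are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context just for this sketch
val sc = new SparkContext(
  new SparkConf().setAppName("shared-vars-demo").setMaster("local[2]"))

// Broadcast a lookup table once per executor instead of shipping it with every task
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

// Count records missing from the lookup table; readable on the driver
val misses = sc.longAccumulator("misses")

val resolved = sc.parallelize(Seq("a", "b", "x", "a"))
  .map(key => lookup.value.getOrElse(key, { misses.add(1); 0 }))
  .collect()

// resolved is Array(1, 2, 0, 1); misses.value is 1 once the action has run

sc.stop()
```

Note that accumulator values are only reliable after an action such as `collect()` forces the computation.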

## SparkConf

Configuration object for Spark applications. Supports method chaining for easy configuration setup.

```scala { .api }
class SparkConf(loadDefaults: Boolean = true) {
  // Core Configuration
  def set(key: String, value: String): SparkConf
  def setMaster(master: String): SparkConf
  def setAppName(name: String): SparkConf
  def setJars(jars: Seq[String]): SparkConf
  def setExecutorEnv(variables: Seq[(String, String)]): SparkConf
  def setExecutorEnv(variable: String, value: String): SparkConf

  // Retrieval
  def get(key: String): String
  def get(key: String, defaultValue: String): String
  def getAll: Array[(String, String)]
  def contains(key: String): Boolean
  def getOption(key: String): Option[String]

  // Spark-specific Settings
  def setSparkHome(home: String): SparkConf
  def setExecutorMemory(mem: String): SparkConf
  def setAll(settings: Traversable[(String, String)]): SparkConf

  // Management
  def clone(): SparkConf
  def remove(key: String): SparkConf
}
```

### Configuration Examples

Basic configuration:

```scala
val conf = new SparkConf()
  .setAppName("Data Processing Job")
  .setMaster("spark://master:7077")
  .set("spark.executor.memory", "4g")
  .set("spark.executor.cores", "4")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
```

Loading from properties file:

```scala
// Loads from spark-defaults.conf and system properties
val conf = new SparkConf() // loadDefaults = true by default
  .setAppName("My App")
```
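The retrieval methods can be exercised directly. A small sketch; passing `loadDefaults = false` keeps system properties out, so the lookups are deterministic:

```scala
import org.apache.spark.SparkConf

// loadDefaults = false: do not pick up spark.* system properties
val conf = new SparkConf(false)
  .setAppName("Lookup Demo")
  .set("spark.executor.memory", "2g")

val mem      = conf.get("spark.executor.memory")       // "2g"
val cores    = conf.get("spark.executor.cores", "1")   // "1" (default used)
val coresOpt = conf.getOption("spark.executor.cores")  // None
val hasName  = conf.contains("spark.app.name")         // true: setAppName stored it
```

`get(key)` without a default throws `NoSuchElementException` for a missing key, so prefer `getOption` or the two-argument `get` when a key may be absent.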

Resource configuration:

```scala
val conf = new SparkConf()
  .setAppName("Resource Heavy Job")
  .set("spark.executor.memory", "8g")
  .set("spark.executor.cores", "8")
  .set("spark.executor.instances", "10")
  .set("spark.sql.adaptive.enabled", "true")
  .set("spark.sql.shuffle.partitions", "400")
```
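The management methods `clone()` and `remove(...)` make it easy to derive a variant configuration without mutating the original, e.g. a lighter profile for local testing (a sketch; the names are illustrative):

```scala
import org.apache.spark.SparkConf

val prod = new SparkConf(false)
  .setAppName("Pipeline")
  .set("spark.executor.memory", "8g")

// clone() copies the conf; remove() drops a key from the copy only
val local = prod.clone()
  .setAppName("Pipeline-local")
  .remove("spark.executor.memory")

// prod keeps its 8g setting; local no longer has the key
```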

## Common Configuration Properties

### Execution Properties

- `spark.master` - Master URL (`local[*]`, `spark://host:port`, `yarn`, etc.)
- `spark.app.name` - Application name
- `spark.executor.memory` - Memory per executor (e.g., "4g", "2048m")
- `spark.executor.cores` - CPU cores per executor
- `spark.executor.instances` - Number of executors (for YARN/Kubernetes)
- `spark.default.parallelism` - Default number of partitions

### Serialization Properties

- `spark.serializer` - Serializer class (Java or Kryo)
- `spark.kryo.registrator` - Kryo registrator class
- `spark.kryo.classesToRegister` - Comma-separated classes to register with Kryo
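Kryo can be enabled by setting these properties directly, or more conveniently via `SparkConf.registerKryoClasses`, which fills in `spark.kryo.classesToRegister` for you. A sketch (the `Point` class is illustrative):

```scala
import org.apache.spark.SparkConf

// Illustrative class to register with Kryo
case class Point(x: Double, y: Double)

val conf = new SparkConf(false)
  .setAppName("Kryo Demo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Point]))

// registerKryoClasses records the class names under spark.kryo.classesToRegister
```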

### Storage Properties

- `spark.storage.level` - Default storage level for RDDs
- `spark.storage.memoryFraction` - Fraction of JVM heap for storage (legacy; superseded by unified memory management in Spark 1.6+)
- `spark.storage.safetyFraction` - Safety fraction for storage memory (legacy)

### Shuffle Properties

- `spark.shuffle.service.enabled` - Enable the external shuffle service
- `spark.shuffle.compress` - Compress shuffle output
- `spark.shuffle.spill.compress` - Compress data spilled during shuffles
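A conf fragment toggling the shuffle properties above (a sketch; note that the external shuffle service also requires the service process to be running on each node, and both compression flags default to true in recent Spark releases):

```scala
import org.apache.spark.SparkConf

// Set explicitly here for clarity, even where these match the defaults
val conf = new SparkConf(false)
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.shuffle.compress", "true")
  .set("spark.shuffle.spill.compress", "true")
```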