# Context and Configuration

## SparkContext

The main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster and is used to create RDDs, accumulators, and broadcast variables. Only one SparkContext should be active per JVM; call `stop()` on the active context before creating a new one.
```scala { .api }
class SparkContext(config: SparkConf) {
  // RDD Creation
  def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T]
  def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
  def wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)]

  // Hadoop Integration
  def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]](
      path: String,
      fClass: Class[F],
      kClass: Class[K],
      vClass: Class[V],
      conf: Configuration = hadoopConfiguration
  ): RDD[(K, V)]

  def hadoopFile[K, V](
      path: String,
      inputFormatClass: Class[_ <: InputFormat[K, V]],
      keyClass: Class[K],
      valueClass: Class[V],
      minPartitions: Int = defaultMinPartitions
  ): RDD[(K, V)]

  // Shared Variables
  def broadcast[T: ClassTag](value: T): Broadcast[T]
  def longAccumulator(): LongAccumulator
  def longAccumulator(name: String): LongAccumulator
  def doubleAccumulator(): DoubleAccumulator
  def doubleAccumulator(name: String): DoubleAccumulator
  def collectionAccumulator[T](): CollectionAccumulator[T]
  def collectionAccumulator[T](name: String): CollectionAccumulator[T]

  // Configuration and Properties
  def getConf: SparkConf
  def master: String
  def appName: String
  def applicationId: String
  def defaultParallelism: Int
  val startTime: Long
  def version: String

  // File and Jar Management
  def addFile(path: String): Unit
  def addFile(path: String, recursive: Boolean): Unit
  def addJar(path: String): Unit

  // Job Management
  def setJobGroup(groupId: String, description: String, interruptOnCancel: Boolean = false): Unit
  def clearJobGroup(): Unit
  def cancelJobGroup(groupId: String): Unit

  // Additional RDD Creation
  def binaryFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, PortableDataStream)]
  def union[T: ClassTag](rdds: Seq[RDD[T]]): RDD[T]
  def union[T: ClassTag](first: RDD[T], rest: RDD[T]*): RDD[T]

  // Lifecycle Management
  def stop(): Unit
  def statusTracker: SparkStatusTracker
}
```

### Usage Examples

Creating a SparkContext:
```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("My Application")
  .setMaster("local[4]") // Use 4 local threads

val sc = new SparkContext(conf)

// Use SparkContext...

sc.stop() // Always stop when done
```

Reading data:
```scala
// From a local collection
val numbers = sc.parallelize(1 to 1000, 8) // 8 partitions

// From a text file
val lines = sc.textFile("hdfs://path/to/file.txt")

// Multiple text files as (filename, content) pairs
val files = sc.wholeTextFiles("hdfs://path/to/directory/")
```
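
The shared-variable APIs (broadcast variables and accumulators) can be sketched as follows; the lookup data and accumulator name are illustrative:

```scala
// Ship a read-only lookup table to every executor once
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

// Named accumulator, visible in the web UI
val badRecords = sc.longAccumulator("badRecords")

val mapped = sc.parallelize(Seq("a", "b", "x")).map { key =>
  if (!lookup.value.contains(key)) badRecords.add(1)
  lookup.value.getOrElse(key, 0)
}

mapped.collect()          // Array(1, 2, 0)
println(badRecords.value) // 1
```

Note that accumulator updates made inside tasks only become visible on the driver after an action has executed, since transformations like `map` are lazy.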
## SparkConf

Configuration object for Spark applications. Settings are key-value pairs, and the setters return the `SparkConf` itself, so calls can be chained.

```scala { .api }
class SparkConf(loadDefaults: Boolean = true) {
  // Core Configuration
  def set(key: String, value: String): SparkConf
  def setMaster(master: String): SparkConf
  def setAppName(name: String): SparkConf
  def setJars(jars: Seq[String]): SparkConf
  def setExecutorEnv(variables: Seq[(String, String)]): SparkConf
  def setExecutorEnv(variable: String, value: String): SparkConf

  // Retrieval
  def get(key: String): String
  def get(key: String, defaultValue: String): String
  def getAll: Array[(String, String)]
  def contains(key: String): Boolean
  def getOption(key: String): Option[String]

  // Spark-specific Settings
  def setSparkHome(home: String): SparkConf
  def setExecutorMemory(mem: String): SparkConf
  def setAll(settings: Traversable[(String, String)]): SparkConf

  // Management
  def clone(): SparkConf
  def remove(key: String): SparkConf
}
```
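
A short sketch of the retrieval methods (the unset keys are illustrative):

```scala
val conf = new SparkConf()
  .setAppName("Lookup Demo")
  .set("spark.executor.memory", "4g")

conf.get("spark.executor.memory")        // "4g"
conf.get("spark.executor.cores", "2")    // "2" -- the supplied default, if the key is unset
conf.getOption("spark.eventLog.enabled") // None, unless set elsewhere (e.g. system properties)
conf.contains("spark.app.name")          // true -- setAppName sets this key
```

`get(key)` with no default throws `NoSuchElementException` for a missing key, so prefer `getOption` when a key may be absent.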

### Configuration Examples

Basic configuration:
```scala
val conf = new SparkConf()
  .setAppName("Data Processing Job")
  .setMaster("spark://master:7077")
  .set("spark.executor.memory", "4g")
  .set("spark.executor.cores", "4")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
```

Loading from properties file:
```scala
// With loadDefaults = true (the default), any spark.* Java system
// properties are picked up; spark-submit populates these from
// spark-defaults.conf
val conf = new SparkConf()
  .setAppName("My App")
```

Resource configuration:
```scala
val conf = new SparkConf()
  .setAppName("Resource Heavy Job")
  .set("spark.executor.memory", "8g")
  .set("spark.executor.cores", "8")
  .set("spark.executor.instances", "10")
  .set("spark.sql.adaptive.enabled", "true")
  .set("spark.sql.shuffle.partitions", "400")
```

## Common Configuration Properties

### Execution Properties
- `spark.master` - Master URL (`local[*]`, `spark://host:port`, `yarn`, etc.)
- `spark.app.name` - Application name
- `spark.executor.memory` - Memory per executor (e.g., "4g", "2048m")
- `spark.executor.cores` - CPU cores per executor
- `spark.executor.instances` - Number of executors (for YARN/Kubernetes)
- `spark.default.parallelism` - Default number of partitions for RDD operations

### Serialization Properties
- `spark.serializer` - Serializer class (Java or Kryo)
- `spark.kryo.registrator` - Kryo registrator class
- `spark.kryo.classesToRegister` - Comma-separated list of classes to register with Kryo
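
These properties can also be applied programmatically: `SparkConf.registerKryoClasses` is equivalent to setting `spark.kryo.classesToRegister` (the `MyRecord` class below is illustrative):

```scala
case class MyRecord(id: Long, payload: String) // illustrative domain class

val conf = new SparkConf()
  .setAppName("Kryo Demo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord]))
```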

### Storage Properties
- `spark.storage.level` - Default storage level for RDDs
- `spark.storage.memoryFraction` - Fraction of JVM heap for storage (legacy; superseded by unified memory management and `spark.memory.fraction` in Spark 1.6+)
- `spark.storage.safetyFraction` - Safety fraction for storage memory (legacy)

### Shuffle Properties
- `spark.shuffle.service.enabled` - Enable the external shuffle service
- `spark.shuffle.compress` - Compress map output files
- `spark.shuffle.spill.compress` - Compress data spilled during shuffles
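
The shuffle properties are ordinary `set` calls on a SparkConf; a minimal sketch (note the external shuffle service must also be running on each worker node for the first setting to take effect):

```scala
val conf = new SparkConf()
  .setAppName("Shuffle Tuning")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.shuffle.compress", "true")       // compress map output files
  .set("spark.shuffle.spill.compress", "true") // compress data spilled during shuffles
```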