Apache Spark Kubernetes resource manager that enables running Spark applications on Kubernetes clusters
npx @tessl/cli install tessl/maven-org-apache-spark--spark-kubernetes_2.12@3.0.1
# Apache Spark Kubernetes Resource Manager
A comprehensive Kubernetes resource manager for Apache Spark that enables running Spark applications natively on Kubernetes clusters with full integration of Kubernetes features and APIs.
## Package Information

**Group ID**: `org.apache.spark`

**Artifact ID**: `spark-kubernetes_2.12`

**Version**: `3.0.1`

**Maven Dependency**:
```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-kubernetes_2.12</artifactId>
  <version>3.0.1</version>
</dependency>
```

**SBT Dependency**:
```scala
libraryDependencies += "org.apache.spark" %% "spark-kubernetes" % "3.0.1"
```

## Core Imports

```scala
import org.apache.spark.deploy.k8s._
import org.apache.spark.deploy.k8s.submit._
import org.apache.spark.deploy.k8s.features._
import org.apache.spark.scheduler.cluster.k8s._
```

## Overview
The Apache Spark Kubernetes resource manager provides native integration between Apache Spark and Kubernetes, enabling seamless deployment and execution of Spark applications in containerized environments. It implements a complete cluster manager that schedules Spark driver and executor pods, manages dynamic resource allocation, and provides fault tolerance through Kubernetes-native capabilities.
### Key Features

- **Native Kubernetes Integration**: Full support for Kubernetes APIs, ConfigMaps, Secrets, and persistent volumes
- **Dynamic Resource Management**: Automatic scaling of executor pods based on workload demands (see the configuration sketch after this list)
- **Pod Lifecycle Management**: Complete monitoring and management of driver and executor pod lifecycles
- **Configuration Flexibility**: Extensive configuration options through Spark and Kubernetes parameters
- **Feature-Based Architecture**: Modular design using feature steps for pod customization
- **Multi-Language Support**: Support for Java, Scala, Python, and R applications

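Dynamic executor scaling uses standard Spark properties; on Kubernetes, shuffle tracking stands in for the external shuffle service that dynamic allocation otherwise requires. A minimal sketch, assuming a placeholder API server address and image name:

```scala
import org.apache.spark.SparkConf

// Sketch: dynamic executor scaling on Kubernetes (placeholder master URL and image).
// Shuffle tracking lets dynamic allocation run without an external shuffle service.
val dynamicConf = new SparkConf()
  .setMaster("k8s://https://kubernetes.example.com:443")
  .set("spark.kubernetes.container.image", "my-spark:latest")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.shuffleTracking.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "10")
```
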
## Architecture Overview
The Kubernetes resource manager follows a layered architecture:

1. **Cluster Management Layer**: Implements Spark's external cluster manager interface
2. **Application Submission Layer**: Handles spark-submit integration and client operations
3. **Configuration Layer**: Manages Kubernetes-specific configuration and constants
4. **Pod Management Layer**: Handles executor pod states, snapshots, and lifecycle operations
5. **Feature System Layer**: Provides extensible pod configuration through feature steps
6. **Utilities Layer**: Common utilities for Kubernetes operations and client management

## Core Entry Points
### Cluster Management { .api }
Primary entry point for Kubernetes cluster operations:

```scala
class KubernetesClusterManager extends ExternalClusterManager {
  def canCreate(masterURL: String): Boolean
  def createTaskScheduler(sc: SparkContext, masterURL: String): TaskScheduler
  def createSchedulerBackend(
      sc: SparkContext,
      masterURL: String,
      scheduler: TaskScheduler): SchedulerBackend
  def initialize(scheduler: TaskScheduler, backend: SchedulerBackend): Unit
}
```

**Usage**: Automatically registered with Spark when using `k8s://` master URLs. Handles cluster manager lifecycle and component creation.
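
Selection works by prefix: Spark asks each registered `ExternalClusterManager` whether it can handle the master URL, and this one claims URLs that start with `k8s`. An illustrative check only, not the actual implementation:

```scala
// Illustration: any master URL of the form k8s://<api-server> is claimed
// by the Kubernetes cluster manager via its canCreate check.
val masterURL = "k8s://https://kubernetes.example.com:443"
val handledByKubernetes = masterURL.startsWith("k8s") // true
```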
### Application Submission { .api }
Main entry point for application submission:

```scala
class KubernetesClientApplication extends SparkApplication {
  def start(args: Array[String], conf: SparkConf): Unit
}
```

**Usage**: Invoked by `spark-submit` when using Kubernetes cluster mode. Manages the complete application submission workflow.
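
Submission is normally driven by `spark-submit`, but the same path can be triggered from code with Spark's public `SparkLauncher` API. A sketch with placeholder master URL, image, main class, and jar path:

```scala
import org.apache.spark.launcher.SparkLauncher

// Sketch: programmatic submission through the same code path spark-submit uses.
val handle = new SparkLauncher()
  .setMaster("k8s://https://kubernetes.example.com:443")
  .setDeployMode("cluster")
  .setMainClass("org.apache.spark.examples.SparkPi")
  .setAppResource("local:///opt/spark/examples/jars/spark-examples.jar")
  .setConf("spark.kubernetes.container.image", "spark:latest")
  .startApplication()

println(s"Submission state: ${handle.getState}")
```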
## Core Configuration Types
### KubernetesConf Hierarchy { .api }
Base configuration class for Kubernetes operations:

```scala
abstract class KubernetesConf(
    val sparkConf: SparkConf,
    val appId: String,
    val resourceNamePrefix: String,
    val appName: String,
    val namespace: String,
    val labels: Map[String, String],
    val environment: Map[String, String],
    val annotations: Map[String, String],
    val secretEnvNamesToKeyRefs: Map[String, String],
    val secretNamesToMountPaths: Map[String, String],
    val volumes: Seq[KubernetesVolumeSpec],
    val imagePullPolicy: String,
    val nodeSelector: Map[String, String])

class KubernetesDriverConf extends KubernetesConf {
  def serviceAnnotations: Map[String, String]
}

class KubernetesExecutorConf extends KubernetesConf
```

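These fields are not set directly; they are collected from prefixed Spark properties. A sketch of the driver-side mapping using the standard `spark.kubernetes.*` prefixes (all values are placeholders):

```scala
import org.apache.spark.SparkConf

// Sketch: prefixed properties that populate the KubernetesConf fields above.
val driverConf = new SparkConf()
  .set("spark.kubernetes.namespace", "spark-apps")                  // -> namespace
  .set("spark.kubernetes.driver.label.team", "data-platform")       // -> labels
  .set("spark.kubernetes.driver.annotation.owner", "etl")           // -> annotations
  .set("spark.kubernetes.driverEnv.LOG_LEVEL", "INFO")              // -> environment
  .set("spark.kubernetes.driver.secrets.db-creds", "/etc/secrets")  // -> secretNamesToMountPaths
  .set("spark.kubernetes.container.image.pullPolicy", "Always")     // -> imagePullPolicy
  .set("spark.kubernetes.node.selector.disktype", "ssd")            // -> nodeSelector
```
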
## Basic Usage Examples
### Submitting a Spark Application

```bash
# Using spark-submit with Kubernetes cluster mode
spark-submit \
  --master k8s://https://kubernetes.example.com:443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.container.image=spark:latest \
  --conf spark.kubernetes.namespace=spark \
  local:///opt/spark/examples/jars/spark-examples.jar
```

### Programmatic Configuration

```scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

// The constants in org.apache.spark.deploy.k8s.Config are private[spark],
// so application code sets the corresponding string properties directly.
val conf = new SparkConf()
  .setAppName("MySparkApp")
  .setMaster("k8s://https://kubernetes.example.com:443")
  .set("spark.kubernetes.container.image", "my-spark:latest")
  .set("spark.kubernetes.namespace", "spark-apps")
  .set("spark.kubernetes.driver.limit.cores", "2")
  .set("spark.executor.instances", "4")

val sc = new SparkContext(conf)
```

## Capability Areas
This library provides comprehensive Kubernetes integration through several key capability areas. Each area has its own detailed documentation:
### [Cluster Management](cluster-management.md)
- **Core Components**: KubernetesClusterManager, KubernetesClusterSchedulerBackend
- **Capabilities**: Cluster lifecycle management, task scheduling, resource allocation
- **Integration**: Seamless integration with Spark's cluster manager interface

### [Application Submission](application-submission.md)
- **Core Components**: KubernetesClientApplication, Client, ClientArguments
- **Capabilities**: Application submission workflow, argument parsing, status monitoring
- **Integration**: Full spark-submit compatibility with Kubernetes-specific features

### [Configuration Management](configuration.md)
- **Core Components**: Config object, Constants, KubernetesConf hierarchy
- **Capabilities**: Centralized configuration, validation, type-safe properties
- **Integration**: Extends Spark's configuration system with Kubernetes-specific options

### [Pod Management](pod-management.md)
- **Core Components**: ExecutorPodsSnapshot, ExecutorPodState hierarchy, lifecycle managers
- **Capabilities**: Pod state tracking, lifecycle management, snapshot-based monitoring
- **Integration**: Real-time monitoring and management of executor pods

### [Feature Steps System](feature-steps.md)
- **Core Components**: KubernetesFeatureConfigStep implementations
- **Capabilities**: Modular pod configuration, extensible architecture, reusable components
- **Integration**: Pluggable system for customizing driver and executor pods

### [Utilities and Helpers](utilities.md)
- **Core Components**: KubernetesUtils, SparkKubernetesClientFactory, volume utilities
- **Capabilities**: Common operations, client management, volume handling
- **Integration**: Supporting utilities used throughout the Kubernetes integration

## Integration Patterns
### Cluster Manager Registration
The Kubernetes cluster manager automatically registers with Spark's cluster manager registry:

```scala
import org.apache.spark.sql.SparkSession

// Automatic registration for k8s:// URLs
val spark = SparkSession.builder()
  .appName("MyApp")
  .master("k8s://https://my-cluster:443")
  .getOrCreate()
```

### Feature-Based Pod Configuration
The feature step system allows modular configuration of pods:

```scala
// Feature steps are automatically applied based on configuration
val steps: Seq[KubernetesFeatureConfigStep] = Seq(
  new BasicDriverFeatureStep(conf),
  new DriverServiceFeatureStep(conf),
  new MountVolumesFeatureStep(conf)
)
```

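Each step takes a `SparkPod` (pod plus container) and returns a modified copy. A sketch of a custom step against the `KubernetesFeatureConfigStep` contract; `AddTeamLabelFeatureStep` and its label are hypothetical, and note that these types are internal (`private[spark]`) in the actual distribution:

```scala
import io.fabric8.kubernetes.api.model.PodBuilder
import org.apache.spark.deploy.k8s.SparkPod
import org.apache.spark.deploy.k8s.features.KubernetesFeatureConfigStep

// Hypothetical step: stamps a fixed label onto every pod it configures.
class AddTeamLabelFeatureStep extends KubernetesFeatureConfigStep {
  override def configurePod(pod: SparkPod): SparkPod = {
    val labeledPod = new PodBuilder(pod.pod)
      .editOrNewMetadata()
        .addToLabels("team", "data-platform")
      .endMetadata()
      .build()
    SparkPod(labeledPod, pod.container)
  }
}
```
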
### Snapshot-Based Monitoring
Executor pods are monitored through a snapshot-based system:

```scala
// Automatic snapshot updates via Kubernetes API
val snapshot: ExecutorPodsSnapshot = snapshotStore.currentSnapshot
val runningExecutors = snapshot.executorPods.values.collect {
  case PodRunning(pod) => pod
}
```

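Consumers usually subscribe to the snapshot store rather than poll it. A sketch following the shape of the internal `ExecutorPodsSnapshotsStore.addSubscriber`, which delivers batches of snapshots on a fixed interval; the interval and failure-count logic are illustrative:

```scala
// Sketch: react to batched snapshot updates (internal API shape).
snapshotStore.addSubscriber(processBatchIntervalMillis = 1000L) { snapshots =>
  snapshots.lastOption.foreach { latest =>
    val failedCount = latest.executorPods.values.count {
      case PodFailed(_) => true
      case _            => false
    }
    if (failedCount > 0) println(s"$failedCount executor pod(s) reported failed")
  }
}
```
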
## Thread Safety and Concurrency
The library is designed for concurrent use in Spark's multi-threaded environment:

- **Immutable Data Structures**: Core data types like `ExecutorPodsSnapshot` and `SparkPod` are immutable
- **Thread-Safe Operations**: Client factories and utilities are thread-safe
- **Concurrent Monitoring**: Snapshot sources handle concurrent pod state updates
- **Atomic Updates**: Configuration and state changes use atomic operations

## Error Handling and Fault Tolerance

The library provides several error handling and fault tolerance mechanisms:

- **Pod Failure Recovery**: Automatic restart of failed executor pods
- **Network Resilience**: Robust handling of Kubernetes API connectivity issues
- **Configuration Validation**: Extensive validation of Kubernetes-specific configuration
- **Graceful Degradation**: Fallback mechanisms for non-critical features

## Performance Considerations

- **Batch Operations**: Pod operations are batched for efficiency
- **Watch vs Polling**: Snapshot sources can be tuned to trade API-server load against update latency (see the sketch after this list)
- **Resource Limits**: Proper CPU and memory limit configuration
- **Image Pull Optimization**: Configurable image pull policies

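The watch-versus-polling trade-off is governed by two intervals: how often the polling source re-lists pods from the API server, and how often buffered watch events are folded into snapshots. A sketch with illustrative values:

```scala
import org.apache.spark.SparkConf

// Sketch: snapshot source tuning (values are illustrative).
val tunedConf = new SparkConf()
  .set("spark.kubernetes.executor.apiPollingInterval", "30s")      // polling re-list cadence
  .set("spark.kubernetes.executor.eventProcessingInterval", "1s")  // watch event batching
```
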
## Getting Started

1. **Setup Kubernetes**: Ensure you have a running Kubernetes cluster with appropriate RBAC permissions
2. **Configure Spark**: Set Kubernetes-specific configuration properties
3. **Build Container Image**: Create a container image with your Spark application
4. **Submit Application**: Use spark-submit with Kubernetes master URL

For detailed implementation guidance, see the specific capability documentation linked above.