Apache Spark Kubernetes resource manager that enables running Spark applications on Kubernetes clusters
npx @tessl/cli install tessl/maven-org-apache-spark--spark-kubernetes_2.12@3.0.1
# Apache Spark Kubernetes Resource Manager
A comprehensive Kubernetes resource manager for Apache Spark that enables running Spark applications natively on Kubernetes clusters with full integration of Kubernetes features and APIs.
## Package Information

**Group ID**: `org.apache.spark`

**Artifact ID**: `spark-kubernetes_2.12`

**Version**: `3.0.1`

**Maven Dependency**:
```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-kubernetes_2.12</artifactId>
  <version>3.0.1</version>
</dependency>
```

**SBT Dependency**:
```scala
libraryDependencies += "org.apache.spark" %% "spark-kubernetes" % "3.0.1"
```

## Core Imports

```scala
import org.apache.spark.deploy.k8s._
import org.apache.spark.deploy.k8s.submit._
import org.apache.spark.deploy.k8s.features._
import org.apache.spark.scheduler.cluster.k8s._
```

## Overview
The Apache Spark Kubernetes resource manager provides native integration between Apache Spark and Kubernetes, enabling seamless deployment and execution of Spark applications in containerized environments. It implements a complete cluster manager that schedules Spark driver and executor pods, manages dynamic resource allocation, and provides fault tolerance through Kubernetes-native capabilities.
### Key Features

- **Native Kubernetes Integration**: Full support for Kubernetes APIs, ConfigMaps, Secrets, and persistent volumes
- **Dynamic Resource Management**: Automatic scaling of executor pods based on workload demands (see the configuration sketch after this list)
- **Pod Lifecycle Management**: Complete monitoring and management of driver and executor pod lifecycles
- **Configuration Flexibility**: Extensive configuration options through Spark and Kubernetes parameters
- **Feature-Based Architecture**: Modular design using feature steps for pod customization
- **Multi-Language Support**: Support for Java, Scala, Python, and R applications

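Dynamic executor scaling uses standard Spark properties; on Kubernetes, shuffle tracking stands in for the external shuffle service that dynamic allocation otherwise requires. A minimal sketch, assuming a placeholder API server address and image name:

```scala
import org.apache.spark.SparkConf

// Sketch: dynamic executor scaling on Kubernetes (placeholder master URL and image).
// Shuffle tracking lets dynamic allocation run without an external shuffle service.
val dynamicConf = new SparkConf()
  .setMaster("k8s://https://kubernetes.example.com:443")
  .set("spark.kubernetes.container.image", "my-spark:latest")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.shuffleTracking.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "10")
```
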
## Architecture Overview
The Kubernetes resource manager follows a layered architecture:

1. **Cluster Management Layer**: Implements Spark's external cluster manager interface
2. **Application Submission Layer**: Handles spark-submit integration and client operations
3. **Configuration Layer**: Manages Kubernetes-specific configuration and constants
4. **Pod Management Layer**: Handles executor pod states, snapshots, and lifecycle operations
5. **Feature System Layer**: Provides extensible pod configuration through feature steps
6. **Utilities Layer**: Common utilities for Kubernetes operations and client management

## Core Entry Points
### Cluster Management { .api }
Primary entry point for Kubernetes cluster operations:

```scala
class KubernetesClusterManager extends ExternalClusterManager {
  def canCreate(masterURL: String): Boolean
  def createTaskScheduler(sc: SparkContext, masterURL: String): TaskScheduler
  def createSchedulerBackend(
      sc: SparkContext,
      masterURL: String,
      scheduler: TaskScheduler): SchedulerBackend
  def initialize(scheduler: TaskScheduler, backend: SchedulerBackend): Unit
}
```

**Usage**: Automatically registered with Spark when using `k8s://` master URLs. Handles cluster manager lifecycle and component creation.
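
Selection works by prefix: Spark asks each registered `ExternalClusterManager` whether it can handle the master URL, and this one claims URLs that start with `k8s`. An illustrative check only, not the actual implementation:

```scala
// Illustration: any master URL of the form k8s://<api-server> is claimed
// by the Kubernetes cluster manager via its canCreate check.
val masterURL = "k8s://https://kubernetes.example.com:443"
val handledByKubernetes = masterURL.startsWith("k8s") // true
```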
### Application Submission { .api }
Main entry point for application submission:

```scala
class KubernetesClientApplication extends SparkApplication {
  def start(args: Array[String], conf: SparkConf): Unit
}
```

**Usage**: Invoked by `spark-submit` when using Kubernetes cluster mode. Manages the complete application submission workflow.
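
Submission is normally driven by `spark-submit`, but the same path can be triggered from code with Spark's public `SparkLauncher` API. A sketch with placeholder master URL, image, main class, and jar path:

```scala
import org.apache.spark.launcher.SparkLauncher

// Sketch: programmatic submission through the same code path spark-submit uses.
val handle = new SparkLauncher()
  .setMaster("k8s://https://kubernetes.example.com:443")
  .setDeployMode("cluster")
  .setMainClass("org.apache.spark.examples.SparkPi")
  .setAppResource("local:///opt/spark/examples/jars/spark-examples.jar")
  .setConf("spark.kubernetes.container.image", "spark:latest")
  .startApplication()

println(s"Submission state: ${handle.getState}")
```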
## Core Configuration Types
### KubernetesConf Hierarchy { .api }
Base configuration class for Kubernetes operations:

```scala
abstract class KubernetesConf(
    val sparkConf: SparkConf,
    val appId: String,
    val resourceNamePrefix: String,
    val appName: String,
    val namespace: String,
    val labels: Map[String, String],
    val environment: Map[String, String],
    val annotations: Map[String, String],
    val secretEnvNamesToKeyRefs: Map[String, String],
    val secretNamesToMountPaths: Map[String, String],
    val volumes: Seq[KubernetesVolumeSpec],
    val imagePullPolicy: String,
    val nodeSelector: Map[String, String])

class KubernetesDriverConf extends KubernetesConf {
  def serviceAnnotations: Map[String, String]
}

class KubernetesExecutorConf extends KubernetesConf
```

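These fields are not set directly; they are collected from prefixed Spark properties. A sketch of the driver-side mapping using the standard `spark.kubernetes.*` prefixes (all values are placeholders):

```scala
import org.apache.spark.SparkConf

// Sketch: prefixed properties that populate the KubernetesConf fields above.
val driverConf = new SparkConf()
  .set("spark.kubernetes.namespace", "spark-apps")                  // -> namespace
  .set("spark.kubernetes.driver.label.team", "data-platform")       // -> labels
  .set("spark.kubernetes.driver.annotation.owner", "etl")           // -> annotations
  .set("spark.kubernetes.driverEnv.LOG_LEVEL", "INFO")              // -> environment
  .set("spark.kubernetes.driver.secrets.db-creds", "/etc/secrets")  // -> secretNamesToMountPaths
  .set("spark.kubernetes.container.image.pullPolicy", "Always")     // -> imagePullPolicy
  .set("spark.kubernetes.node.selector.disktype", "ssd")            // -> nodeSelector
```
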
## Basic Usage Examples
### Submitting a Spark Application

```bash
# Using spark-submit with Kubernetes cluster mode
spark-submit \
  --master k8s://https://kubernetes.example.com:443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.container.image=spark:latest \
  --conf spark.kubernetes.namespace=spark \
  local:///opt/spark/examples/jars/spark-examples.jar
```

### Programmatic Configuration

```scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

// The constants in org.apache.spark.deploy.k8s.Config are private[spark],
// so application code sets the corresponding string properties directly.
val conf = new SparkConf()
  .setAppName("MySparkApp")
  .setMaster("k8s://https://kubernetes.example.com:443")
  .set("spark.kubernetes.container.image", "my-spark:latest")
  .set("spark.kubernetes.namespace", "spark-apps")
  .set("spark.kubernetes.driver.limit.cores", "2")
  .set("spark.executor.instances", "4")

val sc = new SparkContext(conf)
```

## Capability Areas
This library provides comprehensive Kubernetes integration through several key capability areas. Each area has its own detailed documentation:
### [Cluster Management](cluster-management.md)
- **Core Components**: KubernetesClusterManager, KubernetesClusterSchedulerBackend
- **Capabilities**: Cluster lifecycle management, task scheduling, resource allocation
- **Integration**: Seamless integration with Spark's cluster manager interface

### [Application Submission](application-submission.md)
- **Core Components**: KubernetesClientApplication, Client, ClientArguments
- **Capabilities**: Application submission workflow, argument parsing, status monitoring
- **Integration**: Full spark-submit compatibility with Kubernetes-specific features

### [Configuration Management](configuration.md)
- **Core Components**: Config object, Constants, KubernetesConf hierarchy
- **Capabilities**: Centralized configuration, validation, type-safe properties
- **Integration**: Extends Spark's configuration system with Kubernetes-specific options

### [Pod Management](pod-management.md)
- **Core Components**: ExecutorPodsSnapshot, ExecutorPodState hierarchy, lifecycle managers
- **Capabilities**: Pod state tracking, lifecycle management, snapshot-based monitoring
- **Integration**: Real-time monitoring and management of executor pods

### [Feature Steps System](feature-steps.md)
- **Core Components**: KubernetesFeatureConfigStep implementations
- **Capabilities**: Modular pod configuration, extensible architecture, reusable components
- **Integration**: Pluggable system for customizing driver and executor pods

### [Utilities and Helpers](utilities.md)
- **Core Components**: KubernetesUtils, SparkKubernetesClientFactory, volume utilities
- **Capabilities**: Common operations, client management, volume handling
- **Integration**: Supporting utilities used throughout the Kubernetes integration

## Integration Patterns
### Cluster Manager Registration
The Kubernetes cluster manager automatically registers with Spark's cluster manager registry:

```scala
import org.apache.spark.sql.SparkSession

// Automatic registration for k8s:// URLs
val spark = SparkSession.builder()
  .appName("MyApp")
  .master("k8s://https://my-cluster:443")
  .getOrCreate()
```

### Feature-Based Pod Configuration
The feature step system allows modular configuration of pods:

```scala
// Feature steps are automatically applied based on configuration
val steps: Seq[KubernetesFeatureConfigStep] = Seq(
  new BasicDriverFeatureStep(conf),
  new DriverServiceFeatureStep(conf),
  new MountVolumesFeatureStep(conf)
)
```

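Each step takes a `SparkPod` (pod plus container) and returns a modified copy. A sketch of a custom step against the `KubernetesFeatureConfigStep` contract; `AddTeamLabelFeatureStep` and its label are hypothetical, and note that these types are internal (`private[spark]`) in the actual distribution:

```scala
import io.fabric8.kubernetes.api.model.PodBuilder
import org.apache.spark.deploy.k8s.SparkPod
import org.apache.spark.deploy.k8s.features.KubernetesFeatureConfigStep

// Hypothetical step: stamps a fixed label onto every pod it configures.
class AddTeamLabelFeatureStep extends KubernetesFeatureConfigStep {
  override def configurePod(pod: SparkPod): SparkPod = {
    val labeledPod = new PodBuilder(pod.pod)
      .editOrNewMetadata()
        .addToLabels("team", "data-platform")
      .endMetadata()
      .build()
    SparkPod(labeledPod, pod.container)
  }
}
```
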
### Snapshot-Based Monitoring
Executor pods are monitored through a snapshot-based system:

```scala
// Automatic snapshot updates via Kubernetes API
val snapshot: ExecutorPodsSnapshot = snapshotStore.currentSnapshot
val runningExecutors = snapshot.executorPods.values.collect {
  case PodRunning(pod) => pod
}
```

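Consumers usually subscribe to the snapshot store rather than poll it. A sketch following the shape of the internal `ExecutorPodsSnapshotsStore.addSubscriber`, which delivers batches of snapshots on a fixed interval; the interval and failure-count logic are illustrative:

```scala
// Sketch: react to batched snapshot updates (internal API shape).
snapshotStore.addSubscriber(processBatchIntervalMillis = 1000L) { snapshots =>
  snapshots.lastOption.foreach { latest =>
    val failedCount = latest.executorPods.values.count {
      case PodFailed(_) => true
      case _            => false
    }
    if (failedCount > 0) println(s"$failedCount executor pod(s) reported failed")
  }
}
```
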
## Thread Safety and Concurrency
The library is designed for concurrent use in Spark's multi-threaded environment:

- **Immutable Data Structures**: Core data types like `ExecutorPodsSnapshot` and `SparkPod` are immutable
- **Thread-Safe Operations**: Client factories and utilities are thread-safe
- **Concurrent Monitoring**: Snapshot sources handle concurrent pod state updates
- **Atomic Updates**: Configuration and state changes use atomic operations

## Error Handling and Fault Tolerance

The library provides several error handling and fault tolerance mechanisms:

- **Pod Failure Recovery**: Automatic restart of failed executor pods
- **Network Resilience**: Robust handling of Kubernetes API connectivity issues
- **Configuration Validation**: Extensive validation of Kubernetes-specific configuration
- **Graceful Degradation**: Fallback mechanisms for non-critical features

## Performance Considerations

- **Batch Operations**: Pod operations are batched for efficiency
- **Watch vs Polling**: Snapshot sources can be tuned to trade API-server load against update latency (see the sketch after this list)
- **Resource Limits**: Proper CPU and memory limit configuration
- **Image Pull Optimization**: Configurable image pull policies

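The watch-versus-polling trade-off is governed by two intervals: how often the polling source re-lists pods from the API server, and how often buffered watch events are folded into snapshots. A sketch with illustrative values:

```scala
import org.apache.spark.SparkConf

// Sketch: snapshot source tuning (values are illustrative).
val tunedConf = new SparkConf()
  .set("spark.kubernetes.executor.apiPollingInterval", "30s")      // polling re-list cadence
  .set("spark.kubernetes.executor.eventProcessingInterval", "1s")  // watch event batching
```
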
## Getting Started

1. **Setup Kubernetes**: Ensure you have a running Kubernetes cluster with appropriate RBAC permissions
2. **Configure Spark**: Set Kubernetes-specific configuration properties
3. **Build Container Image**: Create a container image with your Spark application
4. **Submit Application**: Use spark-submit with Kubernetes master URL

For detailed implementation guidance, see the specific capability documentation linked above.